Intel® C++ Intrinsic Reference. Document Number: 312482-003US

Author: Felix Hart
Intel® C++ Intrinsic Reference Document Number: 312482-003US

Disclaimer and Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site. Intel processor numbers are not a measure of performance. 
Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino logo, Core Inside, FlashFile, i960, InstantIP, Intel, Intel logo, Intel386, Intel486, Intel740, IntelDX2, IntelDX4, IntelSX2, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, IPLink, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others. Copyright (C) 1996–2007, Intel Corporation. All rights reserved. Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P.

Table Of Contents

Overview: Intrinsics Reference
  Intrinsics for Intel® C++ Compilers
  Availability of Intrinsics on Intel Processors
Details about Intrinsics
  Registers
  Data Types
    New Data Types Available
    __m64 Data Type
    __m128 Data Types
    Data Types Usage Guidelines
    Accessing __m128i Data
Naming and Usage Syntax
References
Intrinsics for Use across All IA
  Overview: Intrinsics for All IA
  Integer Arithmetic Intrinsics
  Floating-point Intrinsics
  String and Block Copy Intrinsics
  Miscellaneous Intrinsics
MMX(TM) Technology Intrinsics
  Overview: MMX(TM) Technology Intrinsics
  The EMMS Instruction: Why You Need It
    Why You Need EMMS to Reset After an MMX(TM) Instruction
  EMMS Usage Guidelines
  MMX(TM) Technology General Support Intrinsics
  MMX(TM) Technology Packed Arithmetic Intrinsics
  MMX(TM) Technology Shift Intrinsics
  MMX(TM) Technology Logical Intrinsics
  MMX(TM) Technology Compare Intrinsics
  MMX(TM) Technology Set Intrinsics
  MMX(TM) Technology Intrinsics on IA-64 Architecture
  Data Types
Streaming SIMD Extensions
  Overview: Streaming SIMD Extensions
  Floating-point Intrinsics for Streaming SIMD Extensions
  Arithmetic Operations for Streaming SIMD Extensions
  Logical Operations for Streaming SIMD Extensions
  Comparisons for Streaming SIMD Extensions
  Conversion Operations for Streaming SIMD Extensions
  Load Operations for Streaming SIMD Extensions
  Set Operations for Streaming SIMD Extensions
  Store Operations for Streaming SIMD Extensions
  Cacheability Support Using Streaming SIMD Extensions
  Integer Intrinsics Using Streaming SIMD Extensions
  Intrinsics to Read and Write Registers for Streaming SIMD Extensions
  Miscellaneous Intrinsics Using Streaming SIMD Extensions
  Using Streaming SIMD Extensions on IA-64 Architecture
    Data Types
    Compatibility versus Performance
Macro Functions
  Macro Function for Shuffle Using Streaming SIMD Extensions
    Shuffle Function Macro
    View of Original and Result Words with Shuffle Function Macro
  Macro Functions to Read and Write the Control Registers
    Exception State Macros with _MM_EXCEPT_DIV_ZERO
  Macro Function for Matrix Transposition
    Matrix Transposition Using _MM_TRANSPOSE4_PS Macro
Streaming SIMD Extensions 2
  Overview: Streaming SIMD Extensions 2
  Floating-point Intrinsics
    Floating-point Arithmetic Operations for Streaming SIMD Extensions 2
    Floating-point Logical Operations for Streaming SIMD Extensions 2
    Floating-point Comparison Operations for Streaming SIMD Extensions 2
    Floating-point Conversion Operations for Streaming SIMD Extensions 2
    Floating-point Load Operations for Streaming SIMD Extensions 2
    Floating-point Set Operations for Streaming SIMD Extensions 2
    Floating-point Store Operations for Streaming SIMD Extensions 2
  Integer Intrinsics
    Integer Arithmetic Operations for Streaming SIMD Extensions 2
    Integer Logical Operations for Streaming SIMD Extensions 2
    Integer Shift Operations for Streaming SIMD Extensions 2
    Integer Comparison Operations for Streaming SIMD Extensions 2
    Integer Conversion Operations for Streaming SIMD Extensions 2
    Integer Move Operations for Streaming SIMD Extensions 2
    Integer Load Operations for Streaming SIMD Extensions 2
    Integer Set Operations for SSE2
    Integer Store Operations for Streaming SIMD Extensions 2
Miscellaneous Functions and Intrinsics
  Cacheability Support Operations for Streaming SIMD Extensions 2
    Miscellaneous Operations for Streaming SIMD Extensions 2
    Intrinsics for Casting Support
    Pause Intrinsic for Streaming SIMD Extensions 2
    Macro Function for Shuffle
    Shuffle Function Macro
    View of Original and Result Words with Shuffle Function Macro
Streaming SIMD Extensions 3
  Overview: Streaming SIMD Extensions 3
  Integer Vector Intrinsics for Streaming SIMD Extensions 3
  Single-precision Floating-point Vector Intrinsics for Streaming SIMD Extensions 3
  Double-precision Floating-point Vector Intrinsics for Streaming SIMD Extensions 3
  Macro Functions for Streaming SIMD Extensions 3
  Miscellaneous Intrinsics for Streaming SIMD Extensions 3
Supplemental Streaming SIMD Extensions 3
  Overview: Supplemental Streaming SIMD Extensions 3
  Addition Intrinsics
  Subtraction Intrinsics
  Multiplication Intrinsics
  Absolute Value Intrinsics
  Shuffle Intrinsics for Streaming SIMD Extensions 3
  Concatenate Intrinsics
  Negation Intrinsics
Streaming SIMD Extensions 4
  Overview: Streaming SIMD Extensions 4
  Streaming SIMD Extensions 4 Vectorizing Compiler and Media Accelerators
    Overview: Streaming SIMD Extensions 4 Vectorizing Compiler and Media Accelerators
    Packed Blending Intrinsics for Streaming SIMD Extensions 4
    Floating Point Dot Product Intrinsics for Streaming SIMD Extensions 4
    Packed Format Conversion Intrinsics for Streaming SIMD Extensions 4
    Packed Integer Min/Max Intrinsics for Streaming SIMD Extensions 4
    Floating Point Rounding Intrinsics for Streaming SIMD Extensions 4
    DWORD Multiply Intrinsics for Streaming SIMD Extensions 4
    Register Insertion/Extraction Intrinsics for Streaming SIMD Extensions 4
    Test Intrinsics for Streaming SIMD Extensions 4
    Packed DWORD to Unsigned WORD Intrinsic for Streaming SIMD Extensions 4
    Packed Compare for Equal for Streaming SIMD Extensions 4
    Cacheability Support Intrinsic for Streaming SIMD Extensions 4
  Streaming SIMD Extensions 4 Efficient Accelerated String and Text Processing
    Overview: Streaming SIMD Extensions 4 Efficient Accelerated String and Text Processing
    Packed Comparison Intrinsics for Streaming SIMD Extensions 4
    Application Targeted Accelerators Intrinsics
Intrinsics for IA-64 Instructions
  Overview: Intrinsics for IA-64 Instructions
  Native Intrinsics for IA-64 Instructions
    Integer Operations
    FSR Operations
  Lock and Atomic Operation Related Intrinsics
  Lock and Atomic Operation Related Intrinsics
  Load and Store
  Operating System Related Intrinsics
  Conversion Intrinsics
  Register Names for getReg() and setReg()
    General Integer Registers
    Application Registers
    Control Registers
    Indirect Registers for getIndReg() and setIndReg()
  Multimedia Additions
    Table 1. Values of n for m64_mux1 Operation
  Synchronization Primitives
    Atomic Fetch-and-op Operations
    Atomic Op-and-fetch Operations
    Atomic Compare-and-swap Operations
    Atomic Synchronize Operation
    Atomic Lock-test-and-set Operation
    Atomic Lock-release Operation
  Miscellaneous Intrinsics
  Intrinsics for Dual-Core Intel® Itanium® 2 processor 9000 series
  Examples
  Microsoft-compatible Intrinsics for Dual-Core Intel® Itanium® 2 processor 9000 series
Data Alignment, Memory Allocation Intrinsics, and Inline Assembly
  Overview: Data Alignment, Memory Allocation Intrinsics, and Inline Assembly
  Alignment Support
  Allocating and Freeing Aligned Memory Blocks
  Inline Assembly
    Microsoft Style Inline Assembly
    GNU*-like Style Inline Assembly (IA-32 architecture and Intel® 64 architecture only)
  Example
  Example
Intrinsics Cross-processor Implementation
  Overview: Intrinsics Cross-processor Implementation
  Intrinsics For Implementation Across All IA
  MMX(TM) Technology Intrinsics Implementation
    Key to the table entries
  Streaming SIMD Extensions Intrinsics Implementation
    Key to the table entries
  Streaming SIMD Extensions 2 Intrinsics Implementation
Index

Intel(R) C++ Intrinsic Reference

Overview: Intrinsics Reference

Intrinsics are assembly-coded functions that let you use C++ function calls and variables in place of assembly instructions. Intrinsics are expanded inline, which eliminates function call overhead. While providing the same benefit as inline assembly, intrinsics improve code readability, assist instruction scheduling, and help reduce debugging effort. Intrinsics also provide access to instructions that cannot be generated using the standard constructs of the C and C++ languages.

Intrinsics for Intel® C++ Compilers

The Intel® C++ Compiler enables easy implementation of assembly instructions through the use of intrinsics. Intrinsics are provided for Intel® Streaming SIMD Extensions 4 (SSE4), Supplemental Streaming SIMD Extensions 3 (SSSE3), Streaming SIMD Extensions 3 (SSE3), Streaming SIMD Extensions 2 (SSE2), and Streaming SIMD Extensions (SSE) instructions. The Intel C++ Compiler for IA-64 architecture also provides architecture-specific intrinsics.

The Intel C++ Compiler provides intrinsics that work on specific architectures and intrinsics that work across IA-32, Intel® 64, and IA-64 architectures. Most intrinsics map directly to a corresponding assembly instruction; some map to several assembly instructions.

The Intel C++ Compiler also supports Microsoft* Visual Studio 2005 intrinsics (for x86 and x64 architectures) to generate instructions on Intel processors based on IA-32 and Intel® 64 architectures. For more information on these Microsoft* intrinsics, visit http://msdn2.microsoft.com/en-us/library/26td21ds.aspx.

Availability of Intrinsics on Intel Processors

Not all Intel processors support all intrinsics. For information on which intrinsics are supported on Intel processors, visit http://processorfinder.intel.com. The Processor Spec Finder tool links directly to all processor documentation, and the data sheets list the features, including intrinsics, supported by each processor.


Details about Intrinsics

The MMX(TM) technology and Streaming SIMD Extension (SSE) instructions use the following features:

• Registers--Enable packed data of up to 128 bits in length for optimal SIMD processing
• Data Types--Enable packing of up to 16 elements of data in one register

Registers

Intel processors provide special register sets. The MMX instructions use eight 64-bit registers (mm0 to mm7), which are aliased on the floating-point stack registers. The Streaming SIMD Extensions use eight 128-bit registers (xmm0 to xmm7). Because each of these registers can hold more than one data element, the processor can process more than one data element simultaneously. This processing capability is also known as single-instruction, multiple-data (SIMD) processing.

For each computational and data manipulation instruction in the new extension sets, there is a corresponding C intrinsic that implements that instruction directly. This frees you from managing registers and from assembly programming. Further, the compiler optimizes the instruction scheduling so that your executable runs faster.

Note: The MM and XMM registers are the SIMD registers used by the IA-32 platforms to implement MMX technology and SSE or SSE2 intrinsics. On the IA-64 architecture, the MMX and SSE intrinsics use the 64-bit general registers and the 64-bit significand of the 80-bit floating-point register.
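As a rough illustration of processing several data elements with one instruction, the following sketch adds four pairs of single-precision values with a single packed addition. It assumes a compiler that provides the SSE header xmmintrin.h; the helper name add4 is hypothetical and is not part of this reference.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Adds four pairs of floats with one packed SSE addition.
   add4 is a hypothetical helper name used only for illustration. */
static void add4(const float a[4], const float b[4], float out[4])
{
    __m128 va  = _mm_loadu_ps(a);     /* unaligned load of four floats */
    __m128 vb  = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);  /* one intrinsic, four additions */
    _mm_storeu_ps(out, sum);          /* unaligned store of the result */
}
```

The unaligned load and store forms are used here so that the sketch does not depend on how the caller's arrays are aligned.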

Data Types

Intrinsic functions use four new C data types as operands, representing the new registers that are used as the operands to these intrinsic functions.

New Data Types Available

The following table shows the instruction sets for which each of the new data types is available.

New Data Type   MMX(TM) Technology   Streaming SIMD Extensions   Streaming SIMD Extensions 2   Streaming SIMD Extensions 3
__m64           Available            Available                   Available                     Available
__m128          Not available        Available                   Available                     Available
__m128d         Not available        Not available               Available                     Available
__m128i         Not available        Not available               Available                     Available

__m64 Data Type

The __m64 data type is used to represent the contents of an MMX register, which is the register that is used by the MMX technology intrinsics. The __m64 data type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.

__m128 Data Types

The __m128 data type is used to represent the contents of a Streaming SIMD Extension register used by the Streaming SIMD Extension intrinsics. The __m128 data type can hold four 32-bit floating-point values. The __m128d data type can hold two 64-bit floating-point values. The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values.

The compiler aligns __m128d and __m128i local and global data to 16-byte boundaries on the stack. To align integer, float, or double arrays, you can use the __declspec(align) statement.

Data Types Usage Guidelines

These data types are not basic ANSI C data types. You must observe the following usage restrictions:

• Use the data types only on either side of an assignment, as a return value, or as a parameter. You cannot use them in arithmetic expressions (+, -, and so on).
• Use the data types as objects in aggregates, such as unions, to access the byte elements and structures.
• Use the data types only with the respective intrinsics described in this documentation.
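The guideline about aggregates can be sketched as follows: a union overlays an array of floats on a __m128 value so that individual elements can be read with ordinary C code. The type name vec4f and the function vec4f_get are hypothetical names chosen for this illustration, and the sketch assumes the SSE header xmmintrin.h is available.

```c
#include <xmmintrin.h>  /* SSE: __m128 */

/* Hypothetical union that overlays four floats on one __m128 value. */
typedef union {
    __m128 v;
    float  f[4];
} vec4f;

/* Returns element i (0 = lowest) of a packed single-precision value.
   The caller is assumed to pass i in the range 0..3. */
static float vec4f_get(__m128 x, int i)
{
    vec4f u;
    u.v = x;        /* plain assignment of a __m128 is permitted */
    return u.f[i];  /* element access goes through the float view */
}
```

Because the union contains a __m128 member, the compiler gives the whole object the alignment that the vector type requires.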

Accessing __m128i Data

To access 8-bit data:

    #define _mm_extract_epi8(x, imm) \
        ((((imm) & 0x1) == 0) ? \
        _mm_extract_epi16((x), (imm) >> 1) & 0xff : \
        _mm_extract_epi16(_mm_srli_epi16((x), 8), (imm) >> 1))

For 16-bit data, use the following intrinsic:

    int _mm_extract_epi16(__m128i a, int imm)

To access 32-bit data:

    #define _mm_extract_epi32(x, imm) \
        _mm_cvtsi128_si32(_mm_srli_si128((x), 4 * (imm)))

To access 64-bit data (Intel® 64 architecture only):

    #define _mm_extract_epi64(x, imm) \
        _mm_cvtsi128_si64(_mm_srli_si128((x), 8 * (imm)))
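The 32-bit access pattern above can be tried out with a small sketch. The macro is given the hypothetical name EXTRACT_EPI32 here so that it cannot clash with an _mm_extract_epi32 that a compiler header may already define; the function third_element is also hypothetical. The sketch assumes the SSE2 header emmintrin.h.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_srli_si128, _mm_cvtsi128_si32 */

/* Same shift-then-convert pattern as the 32-bit macro in the text,
   under a hypothetical name to avoid clashing with compiler headers. */
#define EXTRACT_EPI32(x, imm) \
    _mm_cvtsi128_si32(_mm_srli_si128((x), 4 * (imm)))

/* Returns 32-bit element 2 (counting from the lowest element, 0). */
static int third_element(__m128i v)
{
    /* The byte shift count must be a compile-time constant. */
    return EXTRACT_EPI32(v, 2);
}
```

The byte shift moves the requested element into the lowest 32 bits, which is the only position that _mm_cvtsi128_si32 reads.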


Naming and Usage Syntax

Most intrinsic names use the following notational convention:

    _mm_<intrin_op>_<suffix>

The following table explains each item in the syntax.

<intrin_op>   Indicates the basic operation of the intrinsic; for example, add for addition and sub for subtraction.

<suffix>      Denotes the type of data the instruction operates on. The first one or two letters of each suffix denote whether the data is packed (p), extended packed (ep), or scalar (s). The remaining letters and numbers denote the type, with notation as follows:

              • s single-precision floating point
              • d double-precision floating point
              • i128 signed 128-bit integer
              • i64 signed 64-bit integer
              • u64 unsigned 64-bit integer
              • i32 signed 32-bit integer
              • u32 unsigned 32-bit integer
              • i16 signed 16-bit integer
              • u16 unsigned 16-bit integer
              • i8 signed 8-bit integer
              • u8 unsigned 8-bit integer

A number appended to a variable name indicates the element of a packed object. For example, r0 is the lowest word of r. Some intrinsics are "composites" because they require more than one instruction to implement them.

The packed values are represented in right-to-left order, with the lowest value being used for scalar operations. Consider the following example operation:

    double a[2] = {1.0, 2.0};
    __m128d t = _mm_load_pd(a);

The result is the same as either of the following:

    __m128d t = _mm_set_pd(2.0, 1.0);
    __m128d t = _mm_setr_pd(1.0, 2.0);

In other words, the xmm register that holds the value t appears as follows:

    bits 127-64    bits 63-0
        2.0           1.0

The "scalar" element is 1.0. Due to the nature of the instruction, some intrinsics require their arguments to be immediates (constant integer literals).
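The element ordering described above can be checked with a short sketch. It assumes the SSE2 header emmintrin.h; the function names same_layout and scalar_element are hypothetical. The sketch uses the unaligned forms _mm_loadu_pd and _mm_storeu_pd rather than _mm_load_pd, because a plain double array is not guaranteed to sit on a 16-byte boundary.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_set_pd, _mm_setr_pd, ... */

/* Returns 1 when loading {1.0, 2.0}, _mm_set_pd(2.0, 1.0), and
   _mm_setr_pd(1.0, 2.0) all produce the same register contents. */
static int same_layout(void)
{
    double a[2] = {1.0, 2.0};
    double x[2], y[2], z[2];

    _mm_storeu_pd(x, _mm_loadu_pd(a));        /* memory order          */
    _mm_storeu_pd(y, _mm_set_pd(2.0, 1.0));   /* high element first    */
    _mm_storeu_pd(z, _mm_setr_pd(1.0, 2.0));  /* reversed: low first   */

    return x[0] == y[0] && x[1] == y[1] &&
           x[0] == z[0] && x[1] == z[1];
}

/* Returns the lowest ("scalar") element of _mm_set_pd(2.0, 1.0). */
static double scalar_element(void)
{
    return _mm_cvtsd_f64(_mm_set_pd(2.0, 1.0));
}
```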


References

See the following publications and internet locations for more information about intrinsics and the Intel architectures that support them. You can find all publications on the Intel website.

Internet Location or Publication
    Description

developer.intel.com
    Technical resource center for hardware designers and developers; contains links to product pages and documentation.

Intel® Itanium® Architecture Software Developer's Manuals, Volume 3: Instruction Set Reference
    Contains information and details about Itanium instructions.

IA-32 Intel® Architecture Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M
    Describes the format of the instruction set of IA-32 Intel Architecture and covers the reference pages of instructions from A to M.

IA-32 Intel® Architecture Software Developer's Manual, Volume 2B: Instruction Set Reference, N-Z
    Describes the format of the instruction set of IA-32 Intel Architecture and covers the reference pages of instructions from N to Z.

Intel® Itanium® 2 processor website
    Intel website for the Itanium 2 processor; select the "Documentation" tab for documentation.


Intrinsics for Use across All IA

Overview: Intrinsics for All IA

The intrinsics in this section function across all IA-32 and IA-64-based platforms. They are offered as a convenience to the programmer. They are grouped as follows:

• Integer Arithmetic Intrinsics
• Floating-Point Intrinsics
• String and Block Copy Intrinsics
• Miscellaneous Intrinsics

Integer Arithmetic Intrinsics

The following table lists and describes integer arithmetic intrinsics that you can use across all Intel architectures.

Intrinsic                                                 Description
int abs(int)                                              Returns the absolute value of an integer.
long labs(long)                                           Returns the absolute value of a long integer.
unsigned long _lrotl(unsigned long value, int shift)      Implements 64-bit left rotate of value by shift positions.
unsigned long _lrotr(unsigned long value, int shift)      Implements 64-bit right rotate of value by shift positions.
unsigned int _rotl(unsigned int value, int shift)         Implements 32-bit left rotate of value by shift positions.
unsigned int _rotr(unsigned int value, int shift)         Implements 32-bit right rotate of value by shift positions.
unsigned short _rotwl(unsigned short val, int shift)      Implements 16-bit left rotate of val by shift positions. These intrinsics are not supported on IA-64 platforms.
unsigned short _rotwr(unsigned short val, int shift)      Implements 16-bit right rotate of val by shift positions. These intrinsics are not supported on IA-64 platforms.

Note: Passing a constant shift value in the rotate intrinsics results in higher performance.
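To make the rotate semantics concrete, the following sketch shows a portable C equivalent of a 32-bit left and right rotate. It is not the compiler's implementation of _rotl or _rotr; the names rotl32 and rotr32 are hypothetical, and reducing the shift count modulo 32 is an assumption of this sketch, since the table above does not say how out-of-range counts are handled.

```c
#include <stdint.h>

/* Hypothetical portable 32-bit left rotate. Bits shifted out of the
   top re-enter at the bottom. The shift count is reduced modulo 32
   (an assumption of this sketch). */
static uint32_t rotl32(uint32_t value, int shift)
{
    unsigned s = (unsigned)shift & 31u;
    /* The second mask keeps the right shift in range when s == 0. */
    return (value << s) | (value >> ((32u - s) & 31u));
}

/* Hypothetical portable 32-bit right rotate, the mirror image. */
static uint32_t rotr32(uint32_t value, int shift)
{
    unsigned s = (unsigned)shift & 31u;
    return (value >> s) | (value << ((32u - s) & 31u));
}
```

For example, rotating 0x80000001 left by one position moves the top bit around to bit 0 and gives 0x00000003.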

Floating-point Intrinsics The following table lists and describes floating point intrinsics that you can use across all Intel architectures. Intrinsic

Description

double fabs(double)

Returns the absolute value of a floating-point value.

double log(double)

Returns the natural logarithm ln(x), x>0, with double precision.

float logf(float)

Returns the natural logarithm ln(x), x>0, with single precision.

double log10(double)

Returns the base 10 logarithm log10(x), x>0, with double precision.

float log10f(float)

Returns the base 10 logarithm log10(x), x>0, with single precision.

double exp(double)

Returns the exponential function with double precision.

float expf(float)

Returns the exponential function with single precision.

double pow(double, double)

Returns the value of x to the power y with double precision.

float powf(float, float)

Returns the value of x to the power y with single precision.

double sin(double)

Returns the sine of x with double precision.

float sinf(float)

Returns the sine of x with single precision.

double cos(double)

Returns the cosine of x with double precision.

float cosf(float)

Returns the cosine of x with single precision.

double tan(double)

Returns the tangent of x with double precision.

float tanf(float)

Returns the tangent of x with single precision.

double acos(double)

Returns the inverse cosine of x with double precision

float acosf(float)

Returns the inverse cosine of x with single precision

9

Intel(R) C++ Intrinsics Reference

double acosh(double)

Computes the inverse hyperbolic cosine of the argument with double precision.

float acoshf(float)

Computes the inverse hyperbolic cosine of the argument with single precision.

double asin(double)

Computes the inverse sine of the argument with double precision.

float asinf(float)

Computes the inverse sine of the argument with single precision.

double asinh(double)

Computes the inverse hyperbolic sine of the argument with double precision.

float asinhf(float)

Computes the inverse hyperbolic sine of the argument with single precision.

double atan(double)

Computes the inverse tangent of the argument with double precision.

float atanf(float)

Computes the inverse tangent of the argument with single precision.

double atanh(double)

Computes the inverse hyperbolic tangent of the argument with double precision.

float atanhf(float)

Computes the inverse hyperbolic tangent of the argument with single precision.

double cabs(double complex z)

Computes the absolute value of the complex number z. The argument z is made up of two double-precision elements, one real and one imaginary, passed together as a single argument.

float cabsf(float complex z)

Computes the absolute value of the complex number z. The argument z is made up of two single-precision elements, one real and one imaginary, passed together as a single argument.

double ceil(double)

Computes the smallest integral value not less than the double-precision argument.

float ceilf(float)

Computes the smallest integral value not less than the single-precision argument.

double cosh(double)

Computes the hyperbolic cosine of the double-precision argument.

float coshf(float)

Computes the hyperbolic cosine of the single-precision argument.

float fabsf(float)

Computes the absolute value of the single-precision argument.

double floor(double)

Computes the largest integral value of the double precision argument not greater than the argument.

float floorf(float)

Computes the largest integral value of the single precision argument not greater than the argument.

double fmod(double, double)

Computes the floating-point remainder of the division of the first argument by the second argument with double precision.

float fmodf(float, float)

Computes the floating-point remainder of the division of the first argument by the second argument with single precision.

double hypot(double, double)

Computes the length of the hypotenuse of a right-angled triangle with double precision.

float hypotf(float, float)

Computes the length of the hypotenuse of a right-angled triangle with single precision.

double rint(double)

Rounds the argument to an integral value, represented as a double, using the current IEEE rounding mode.

float rintf(float)

Rounds the argument to an integral value, represented with single precision, using the current IEEE rounding mode.

double sinh(double)

Computes the hyperbolic sine of the double precision argument.

float sinhf(float)

Computes the hyperbolic sine of the single precision argument.

float sqrtf(float)

Computes the square root of the single precision argument.

double tanh(double)

Computes the hyperbolic tangent of the double precision argument.

float tanhf(float)

Computes the hyperbolic tangent of the single precision argument.

String and Block Copy Intrinsics

The following table lists and describes string and block copy intrinsics that you can use across all Intel architectures. The string and block copy intrinsics are not implemented as intrinsics on IA-64 architecture.

Intrinsic

Description

char *_strset(char *, _int32)

Sets all characters in a string to a fixed value.

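int memcmp(const void *cs, const void *ct, size_t n)

Compares two regions of memory. Returns an integer less than, equal to, or greater than zero, according to whether the region at cs compares less than, equal to, or greater than the region at ct.

void *memcpy(void *s, const void *ct, size_t n)

Copies from memory. Returns s.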

void *memset(void * s, int c, size_t n)

Sets memory to a fixed value. Returns s.

char *strcat(char * s, const char * ct)

Appends to a string. Returns s.

int strcmp(const char *, const char *)

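Compares two strings. Returns an integer less than, equal to, or greater than zero, according to whether the first string compares less than, equal to, or greater than the second.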

char *strcpy(char * s, const char * ct)

Copies a string. Returns s.

size_t strlen(const char * cs)

Returns the length of string cs.

int strncmp(char *, char *, int)

Compares two strings, but only up to the specified number of characters.

char *strncpy(char *, char *, int)

Copies a string, but only up to the specified number of characters.
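The comparison and bounded-copy entries in this table can be exercised with the standard C library declarations from string.h. The following is a minimal sketch. The helper names compare_sign and copy_bounded are hypothetical, and forcing a terminating null character is a design choice of this sketch, not something the table states.

```c
#include <string.h>

/* Hypothetical helper: reduces the memcmp result to -1, 0, or 1.
   Only the sign of the memcmp return value is meaningful. */
static int compare_sign(const void *cs, const void *ct, size_t n)
{
    int r = memcmp(cs, ct, n);
    return (r > 0) - (r < 0);
}

/* Hypothetical helper: copies at most dst_size - 1 characters with strncpy
   and always terminates the destination, because strncpy alone does not
   add a terminator when the source is too long. */
static char *copy_bounded(char *dst, const char *src, size_t dst_size)
{
    if (dst_size == 0)
        return dst;
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
    return dst;
}
```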

Miscellaneous Intrinsics

The following table lists and describes intrinsics that you can use across all Intel architectures, except where noted.

Intrinsic

Description

_abnormal_termination(void)

Can be invoked only by termination handlers. Returns TRUE if the termination handler is invoked as a result of a premature exit of the corresponding try-finally region.

__cpuid

Queries the processor for information about processor type and supported features. The Intel® C++ Compiler supports the Microsoft* implementation of this intrinsic. See the Microsoft documentation for details.

void *_alloca(int)

Allocates memory in the local stack frame. The memory is automatically freed upon return from the function.

int _bit_scan_forward(int x)

Returns the bit index of the least significant set bit of x. If x is 0, the result is undefined.

int _bit_scan_reverse(int x)

Returns the bit index of the most significant set bit of x. If x is 0, the result is undefined.

int _bswap(int x)

Reverses the byte order of x. Bits 0-7 are swapped with bits 24-31, and bits 8-15 are swapped with bits 16-23.

_exception_code(void)

Returns the exception code.

_exception_info(void)

Returns the exception information.

void _enable(void)

Enables the interrupt.

void _disable(void)

Disables the interrupt.

int _in_byte(int)

Intrinsic that maps to the IA-32 instruction IN. Transfers a data byte from the port specified by the argument.

int _in_dword(int)

Intrinsic that maps to the IA-32 instruction IN. Transfers a double word from the port specified by the argument.

int _in_word(int)

Intrinsic that maps to the IA-32 instruction IN. Transfers a word from the port specified by the argument.

int _inp(int)

Same as _in_byte.

int _inpd(int)

Same as _in_dword.

int _inpw(int)

Same as _in_word.

int _out_byte(int, int)

Intrinsic that maps to the IA-32 instruction OUT. Transfers the data byte in the second argument to the port specified by the first argument.

int _out_dword(int, int)

Intrinsic that maps to the IA-32 instruction OUT. Transfers the double word in the second argument to the port specified by the first argument.

int _out_word(int, int)

Intrinsic that maps to the IA-32 instruction OUT. Transfers the word in the second argument to the port specified by the first argument.

int _outp(int, int)

Same as _out_byte.

int _outpd(int, int)

Same as _out_dword.

int _outpw(int, int)

Same as _out_word.

int _popcnt32(int x)

Returns the number of set bits in x.

__int64 _rdtsc(void)

Returns the current value of the processor's 64-bit time stamp counter. This intrinsic is not implemented on systems based on IA-64 architecture. See Time Stamp for an example of using this intrinsic.

__int64 _rdpmc(int p)

Returns the current value of the 40-bit performance monitoring counter specified by p.

int _setjmp(jmp_buf)

A fast version of setjmp(), which bypasses the termination handling. Saves the callee-save registers, stack pointer and return address. This intrinsic is not implemented on systems based on IA-64 architecture.
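The bit-manipulation entries in this table (_bit_scan_forward, _bit_scan_reverse, _bswap, and _popcnt32) can be modeled in portable C to make their results concrete. This is a sketch of the documented results, not the intrinsics themselves. All function names ending in _model are hypothetical, and returning -1 for a zero input is a choice made only for this model, since the text leaves that case undefined.

```c
#include <stdint.h>

/* Index of the least significant set bit; -1 for x == 0 (model-only choice). */
static int bit_scan_forward_model(uint32_t x)
{
    for (int i = 0; i < 32; ++i)
        if ((x >> i) & 1u)
            return i;
    return -1;
}

/* Index of the most significant set bit; -1 for x == 0 (model-only choice). */
static int bit_scan_reverse_model(uint32_t x)
{
    for (int i = 31; i >= 0; --i)
        if ((x >> i) & 1u)
            return i;
    return -1;
}

/* Byte reversal: bits 0-7 swap with bits 24-31, bits 8-15 with bits 16-23. */
static uint32_t bswap_model(uint32_t x)
{
    return (x << 24) | ((x & 0x0000FF00u) << 8) |
           ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

/* Number of set bits: each iteration clears the lowest set bit. */
static int popcnt32_model(uint32_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1u;
        ++n;
    }
    return n;
}
```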


MMX(TM) Technology Intrinsics

Overview: MMX(TM) Technology Intrinsics

MMX™ technology is an extension to the Intel architecture (IA) instruction set. The MMX instruction set adds 57 opcodes, a 64-bit quadword data type, and eight 64-bit registers. Each of the eight registers can be directly addressed using the register names mm0 to mm7.

The prototypes for MMX technology intrinsics are in the mmintrin.h header file.

The EMMS Instruction: Why You Need It

Using EMMS is like emptying a container to accommodate new content. The EMMS instruction clears the MMX™ registers and sets the value of the floating-point tag word to empty. Because floating-point convention specifies that the floating-point stack be cleared after use, you should clear the MMX registers before issuing a floating-point instruction. You should insert the EMMS instruction at the end of all MMX code segments to avoid a floating-point overflow exception.

[Figure: Why You Need EMMS to Reset After an MMX(TM) Instruction]

Caution Failure to empty the multimedia state after using an MMX instruction and before using a floating-point instruction can result in unexpected execution or poor performance.


EMMS Usage Guidelines

Here are guidelines for when to use the EMMS instruction:

• Use _mm_empty() after an MMX™ instruction if the next instruction is a floating-point (FP) instruction. For example, you should use the EMMS instruction before performing calculations on float, double or long double. You must be aware of all situations in which your code generates an MMX instruction:
  • when using an MMX technology intrinsic
  • when using Streaming SIMD Extension integer intrinsics that use the __m64 data type
  • when referencing an __m64 data type variable
  • when using an MMX instruction through inline assembly
• Use different functions for operations that use floating-point instructions and those that use MMX instructions. This action eliminates the need to empty the multimedia state within the body of a critical loop.
• Use _mm_empty() during runtime initialization of __m64 and FP data types. This ensures resetting the register between data type transitions.
• Do not use _mm_empty() before an MMX instruction, since doing so incurs an operation with no benefit (no-op).
• Do not use on systems based on IA-64 architecture. There are no special registers (or overlay) for the MMX(TM) instructions or Streaming SIMD Extensions on systems based on IA-64 architecture even though the intrinsics are supported.

See the Correct Usage and Incorrect Usage coding examples in the following table.

Incorrect Usage

__m64 x = _m_paddd(y, z);
float f = init();

Correct Usage

__m64 x = _m_paddd(y, z);
float f = (_mm_empty(), init());
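The guideline of keeping MMX work and floating-point work in separate phases, with _mm_empty() between them, might look like the following minimal sketch. The function name scaled_lane_sum is hypothetical. The check of the __MMX__ macro and the scalar fallback are assumptions added so that the sketch also builds with compilers or targets that do not provide mmintrin.h.

```c
#include <stdint.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical helper: adds two pairs of 32-bit integers lane by lane,
   then performs floating-point arithmetic on the two sums. */
static float scaled_lane_sum(int32_t a0, int32_t a1,
                             int32_t b0, int32_t b1, float scale)
{
    int32_t lo, hi;
#if defined(__MMX__)
    /* MMX phase: all __m64 work happens here. */
    __m64 x = _mm_add_pi32(_mm_set_pi32(a1, a0), _mm_set_pi32(b1, b0));
    lo = _mm_cvtsi64_si32(x);                       /* low 32 bits  */
    hi = _mm_cvtsi64_si32(_mm_unpackhi_pi32(x, x)); /* high 32 bits */
    _mm_empty(); /* empty the multimedia state before any FP instruction */
#else
    /* Scalar fallback (assumption of this sketch) with wrapping addition. */
    lo = (int32_t)((uint32_t)a0 + (uint32_t)b0);
    hi = (int32_t)((uint32_t)a1 + (uint32_t)b1);
#endif
    /* Floating-point phase: no __m64 values are touched from here on. */
    return scale * ((float)lo + (float)hi);
}
```

Calling _mm_empty() once at the end of the MMX phase, rather than inside a loop, follows the guideline about keeping the state change out of critical loops.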

MMX(TM) Technology General Support Intrinsics

The prototypes for MMX™ technology intrinsics are in the mmintrin.h header file. Details about each intrinsic follow the table below.

Intrinsic Name       Operation              Corresponding MMX Instruction
_mm_empty            Empty MM state         EMMS
_mm_cvtsi32_si64     Convert from int       MOVD
_mm_cvtsi64_si32     Convert to int         MOVD
_mm_cvtsi64_m64      Convert from __int64   MOVQ
_mm_cvtm64_si64      Convert to __int64     MOVQ
_mm_packs_pi16       Pack                   PACKSSWB
_mm_packs_pi32       Pack                   PACKSSDW
_mm_packs_pu16       Pack                   PACKUSWB
_mm_unpackhi_pi8     Interleave             PUNPCKHBW
_mm_unpackhi_pi16    Interleave             PUNPCKHWD
_mm_unpackhi_pi32    Interleave             PUNPCKHDQ
_mm_unpacklo_pi8     Interleave             PUNPCKLBW
_mm_unpacklo_pi16    Interleave             PUNPCKLWD
_mm_unpacklo_pi32    Interleave             PUNPCKLDQ

void _mm_empty(void)

Empty the multimedia state.

__m64 _mm_cvtsi32_si64(int i)

Convert the integer object i to a 64-bit __m64 object. The integer value is zero-extended to 64 bits.

int _mm_cvtsi64_si32(__m64 m)

Convert the lower 32 bits of the __m64 object m to an integer.

__m64 _mm_cvtsi64_m64(__int64 i)

Move the 64-bit integer object i to an __m64 object.

__int64 _mm_cvtm64_si64(__m64 m)

Move the __m64 object m to a 64-bit integer.

__m64 _mm_packs_pi16(__m64 m1, __m64 m2)

Pack the four 16-bit values from m1 into the lower four 8-bit values of the result with signed saturation, and pack the four 16-bit values from m2 into the upper four 8-bit values of the result with signed saturation.

__m64 _mm_packs_pi32(__m64 m1, __m64 m2)

Pack the two 32-bit values from m1 into the lower two 16-bit values of the result with signed saturation, and pack the two 32-bit values from m2 into the upper two 16-bit values of the result with signed saturation.

__m64 _mm_packs_pu16(__m64 m1, __m64 m2)

Pack the four 16-bit values from m1 into the lower four 8-bit values of the result with unsigned saturation, and pack the four 16-bit values from m2 into the upper four 8-bit values of the result with unsigned saturation.

__m64 _mm_unpackhi_pi8(__m64 m1, __m64 m2)

Interleave the four 8-bit values from the high half of m1 with the four values from the high half of m2. The interleaving begins with the data from m1.

__m64 _mm_unpackhi_pi16(__m64 m1, __m64 m2)

Interleave the two 16-bit values from the high half of m1 with the two values from the high half of m2. The interleaving begins with the data from m1.

__m64 _mm_unpackhi_pi32(__m64 m1, __m64 m2)

Interleave the 32-bit value from the high half of m1 with the 32-bit value from the high half of m2. The interleaving begins with the data from m1.

__m64 _mm_unpacklo_pi8(__m64 m1, __m64 m2)

Interleave the four 8-bit values from the low half of m1 with the four values from the low half of m2. The interleaving begins with the data from m1.

__m64 _mm_unpacklo_pi16(__m64 m1, __m64 m2)

Interleave the two 16-bit values from the low half of m1 with the two values from the low half of m2. The interleaving begins with the data from m1.

__m64 _mm_unpacklo_pi32(__m64 m1, __m64 m2)

Interleave the 32-bit value from the low half of m1 with the 32-bit value from the low half of m2. The interleaving begins with the data from m1.
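The pack and interleave descriptions are easier to follow with a scalar model. The sketch below mirrors the text for _mm_packs_pi16 and _mm_unpacklo_pi8 using plain arrays. The names packs_pi16_model and unpacklo_pi8_model are hypothetical, and the convention that array element 0 holds the least significant lane is an assumption of this sketch.

```c
#include <stdint.h>

/* Signed saturation: clamp a 16-bit value into the signed 8-bit range. */
static int8_t saturate_s8(int16_t v)
{
    if (v > 127)
        return 127;
    if (v < -128)
        return -128;
    return (int8_t)v;
}

/* Model of the packing described for _mm_packs_pi16:
   m1 fills the lower four result lanes, m2 the upper four. */
static void packs_pi16_model(const int16_t m1[4], const int16_t m2[4],
                             int8_t r[8])
{
    for (int i = 0; i < 4; ++i) {
        r[i] = saturate_s8(m1[i]);
        r[i + 4] = saturate_s8(m2[i]);
    }
}

/* Model of the interleave described for _mm_unpacklo_pi8:
   low four bytes of each operand, alternating, starting with m1. */
static void unpacklo_pi8_model(const uint8_t m1[8], const uint8_t m2[8],
                               uint8_t r[8])
{
    for (int i = 0; i < 4; ++i) {
        r[2 * i] = m1[i];
        r[2 * i + 1] = m2[i];
    }
}
```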

MMX(TM) Technology Packed Arithmetic Intrinsics

The prototypes for MMX™ technology intrinsics are in the mmintrin.h header file. Details about each intrinsic follow the table below.

Intrinsic Name     Operation          Corresponding MMX Instruction
_mm_add_pi8        Addition           PADDB
_mm_add_pi16       Addition           PADDW
_mm_add_pi32       Addition           PADDD
_mm_adds_pi8       Addition           PADDSB
_mm_adds_pi16      Addition           PADDSW
_mm_adds_pu8       Addition           PADDUSB
_mm_adds_pu16      Addition           PADDUSW
_mm_sub_pi8        Subtraction        PSUBB
_mm_sub_pi16       Subtraction        PSUBW
_mm_sub_pi32       Subtraction        PSUBD
_mm_subs_pi8       Subtraction        PSUBSB
_mm_subs_pi16      Subtraction        PSUBSW
_mm_subs_pu8       Subtraction        PSUBUSB
_mm_subs_pu16      Subtraction        PSUBUSW
_mm_madd_pi16      Multiply and add   PMADDWD
_mm_mulhi_pi16     Multiplication     PMULHW
_mm_mullo_pi16     Multiplication     PMULLW

__m64 _mm_add_pi8(__m64 m1, __m64 m2)

Add the eight 8-bit values in m1 to the eight 8-bit values in m2.

__m64 _mm_add_pi16(__m64 m1, __m64 m2)

Add the four 16-bit values in m1 to the four 16-bit values in m2.

__m64 _mm_add_pi32(__m64 m1, __m64 m2)

Add the two 32-bit values in m1 to the two 32-bit values in m2.

__m64 _mm_adds_pi8(__m64 m1, __m64 m2)

Add the eight signed 8-bit values in m1 to the eight signed 8-bit values in m2 using saturating arithmetic.

__m64 _mm_adds_pi16(__m64 m1, __m64 m2)

Add the four signed 16-bit values in m1 to the four signed 16-bit values in m2 using saturating arithmetic.

__m64 _mm_adds_pu8(__m64 m1, __m64 m2)

Add the eight unsigned 8-bit values in m1 to the eight unsigned 8-bit values in m2 using saturating arithmetic.

__m64 _mm_adds_pu16(__m64 m1, __m64 m2)

Add the four unsigned 16-bit values in m1 to the four unsigned 16-bit values in m2 using saturating arithmetic.

__m64 _mm_sub_pi8(__m64 m1, __m64 m2)

Subtract the eight 8-bit values in m2 from the eight 8-bit values in m1.

__m64 _mm_sub_pi16(__m64 m1, __m64 m2)

Subtract the four 16-bit values in m2 from the four 16-bit values in m1.

__m64 _mm_sub_pi32(__m64 m1, __m64 m2)

Subtract the two 32-bit values in m2 from the two 32-bit values in m1.

__m64 _mm_subs_pi8(__m64 m1, __m64 m2)

Subtract the eight signed 8-bit values in m2 from the eight signed 8-bit values in m1 using saturating arithmetic.

__m64 _mm_subs_pi16(__m64 m1, __m64 m2)

Subtract the four signed 16-bit values in m2 from the four signed 16-bit values in m1 using saturating arithmetic.

__m64 _mm_subs_pu8(__m64 m1, __m64 m2)

Subtract the eight unsigned 8-bit values in m2 from the eight unsigned 8-bit values in m1 using saturating arithmetic.

__m64 _mm_subs_pu16(__m64 m1, __m64 m2)

Subtract the four unsigned 16-bit values in m2 from the four unsigned 16-bit values in m1 using saturating arithmetic.

__m64 _mm_madd_pi16(__m64 m1, __m64 m2)

Multiply four 16-bit values in m1 by four 16-bit values in m2, producing four 32-bit intermediate results, which are then summed by pairs to produce two 32-bit results.

__m64 _mm_mulhi_pi16(__m64 m1, __m64 m2)

Multiply four signed 16-bit values in m1 by four signed 16-bit values in m2 and produce the high 16 bits of the four results.

__m64 _mm_mullo_pi16(__m64 m1, __m64 m2)

Multiply four 16-bit values in m1 by four 16-bit values in m2 and produce the low 16 bits of the four results.
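The difference between wrapping and saturating arithmetic, and the pairwise sum performed by the multiply-and-add operation, can be seen in a per-lane scalar model. This is a sketch under stated assumptions, not the intrinsics: all function names are hypothetical, element 0 is taken to be the least significant lane, and the narrowing casts rely on the two's complement conversion behavior of common compilers.

```c
#include <stdint.h>

/* Wrapping add of one 8-bit lane, as described for _mm_add_pi8. */
static uint8_t add_pi8_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Unsigned saturating add of one 8-bit lane (_mm_adds_pu8). */
static uint8_t adds_pu8_lane(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255u ? 255u : s);
}

/* Signed saturating add of one 8-bit lane (_mm_adds_pi8). */
static int8_t adds_pi8_lane(int8_t a, int8_t b)
{
    int s = a + b;
    if (s > 127)
        s = 127;
    if (s < -128)
        s = -128;
    return (int8_t)s;
}

/* Model of _mm_madd_pi16: four 32-bit products summed by pairs.
   The sum is formed in 64 bits and only the low 32 bits are kept. */
static void madd_pi16_model(const int16_t m1[4], const int16_t m2[4],
                            int32_t r[2])
{
    for (int i = 0; i < 2; ++i) {
        int64_t sum = (int64_t)m1[2 * i] * m2[2 * i] +
                      (int64_t)m1[2 * i + 1] * m2[2 * i + 1];
        r[i] = (int32_t)(uint32_t)sum;
    }
}

/* High and low 16 bits of one signed 16 x 16 product
   (_mm_mulhi_pi16 and _mm_mullo_pi16). */
static int16_t mulhi_pi16_lane(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (int16_t)(uint16_t)((uint32_t)p >> 16);
}

static int16_t mullo_pi16_lane(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (int16_t)(uint16_t)((uint32_t)p & 0xFFFFu);
}
```

Combining the mulhi and mullo results lane by lane recovers the full 32-bit product.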

MMX(TM) Technology Shift Intrinsics

The prototypes for MMX™ technology intrinsics are in the mmintrin.h header file. Details about each intrinsic follow the table below.

Intrinsic Name    Operation                Corresponding MMX Instruction
_mm_sll_pi16      Logical shift left       PSLLW
_mm_slli_pi16     Logical shift left       PSLLWI
_mm_sll_pi32      Logical shift left       PSLLD
_mm_slli_pi32     Logical shift left       PSLLDI
_mm_sll_pi64      Logical shift left       PSLLQ
_mm_slli_pi64     Logical shift left       PSLLQI
_mm_sra_pi16      Arithmetic shift right   PSRAW
_mm_srai_pi16     Arithmetic shift right   PSRAWI
_mm_sra_pi32      Arithmetic shift right   PSRAD
_mm_srai_pi32     Arithmetic shift right   PSRADI
_mm_srl_pi16      Logical shift right      PSRLW
_mm_srli_pi16     Logical shift right      PSRLWI
_mm_srl_pi32      Logical shift right      PSRLD
_mm_srli_pi32     Logical shift right      PSRLDI
_mm_srl_pi64      Logical shift right      PSRLQ
_mm_srli_pi64     Logical shift right      PSRLQI

__m64 _mm_sll_pi16(__m64 m, __m64 count)

Shift four 16-bit values in m left the amount specified by count while shifting in zeros.

__m64 _mm_slli_pi16(__m64 m, int count)

Shift four 16-bit values in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _mm_sll_pi32(__m64 m, __m64 count)

Shift two 32-bit values in m left the amount specified by count while shifting in zeros.

__m64 _mm_slli_pi32(__m64 m, int count)

Shift two 32-bit values in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _mm_sll_pi64(__m64 m, __m64 count)

Shift the 64-bit value in m left the amount specified by count while shifting in zeros.

__m64 _mm_slli_pi64(__m64 m, int count)

Shift the 64-bit value in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _mm_sra_pi16(__m64 m, __m64 count)

Shift four 16-bit values in m right the amount specified by count while shifting in the sign bit.

__m64 _mm_srai_pi16(__m64 m, int count)

Shift four 16-bit values in m right the amount specified by count while shifting in the sign bit. For the best performance, count should be a constant.

__m64 _mm_sra_pi32(__m64 m, __m64 count)

Shift two 32-bit values in m right the amount specified by count while shifting in the sign bit.

__m64 _mm_srai_pi32(__m64 m, int count)

Shift two 32-bit values in m right the amount specified by count while shifting in the sign bit. For the best performance, count should be a constant.

__m64 _mm_srl_pi16(__m64 m, __m64 count)

Shift four 16-bit values in m right the amount specified by count while shifting in zeros.

__m64 _mm_srli_pi16(__m64 m, int count)

Shift four 16-bit values in m right the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _mm_srl_pi32(__m64 m, __m64 count)

Shift two 32-bit values in m right the amount specified by count while shifting in zeros.

__m64 _mm_srli_pi32(__m64 m, int count)

Shift two 32-bit values in m right the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _mm_srl_pi64(__m64 m, __m64 count)

Shift the 64-bit value in m right the amount specified by count while shifting in zeros.

__m64 _mm_srli_pi64(__m64 m, int count)

Shift the 64-bit value in m right the amount specified by count while shifting in zeros. For the best performance, count should be a constant.
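The distinction between shifting in zeros and shifting in the sign bit can be shown with a per-lane scalar model of the 16-bit shifts. The function names are hypothetical. The handling of counts above 15 (zero for logical shifts, sign fill for arithmetic shifts) is not stated in the text above; it is included here as an assumption based on how the underlying MMX shift instructions behave.

```c
#include <stdint.h>

/* Logical shift left of one 16-bit lane: zeros are shifted in. */
static uint16_t sll16_lane(uint16_t v, unsigned count)
{
    return count > 15 ? 0 : (uint16_t)((uint32_t)v << count);
}

/* Logical shift right of one 16-bit lane: zeros are shifted in. */
static uint16_t srl16_lane(uint16_t v, unsigned count)
{
    return count > 15 ? 0 : (uint16_t)(v >> count);
}

/* Arithmetic shift right of one 16-bit lane: the sign bit is shifted in.
   Written without right-shifting a negative value, which C leaves
   implementation-defined. */
static int16_t sra16_lane(int16_t v, unsigned count)
{
    if (count > 15)
        count = 15; /* every result bit becomes a copy of the sign bit */
    if (v < 0)
        return (int16_t)~((~(int32_t)v) >> count);
    return (int16_t)(v >> count);
}
```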

MMX(TM) Technology Logical Intrinsics

The prototypes for MMX™ technology intrinsics are in the mmintrin.h header file. Details about each intrinsic follow the table below.

Intrinsic Name     Operation              Corresponding MMX Instruction
_mm_and_si64       Bitwise AND            PAND
_mm_andnot_si64    Bitwise ANDNOT         PANDN
_mm_or_si64        Bitwise OR             POR
_mm_xor_si64       Bitwise Exclusive OR   PXOR

__m64 _mm_and_si64(__m64 m1, __m64 m2)

Perform a bitwise AND of the 64-bit value in m1 with the 64-bit value in m2.

__m64 _mm_andnot_si64(__m64 m1, __m64 m2)

Perform a bitwise NOT on the 64-bit value in m1 and use the result in a bitwise AND with the 64-bit value in m2.

__m64 _mm_or_si64(__m64 m1, __m64 m2)

Perform a bitwise OR of the 64-bit value in m1 with the 64-bit value in m2.

__m64 _mm_xor_si64(__m64 m1, __m64 m2)

Perform a bitwise XOR of the 64-bit value in m1 with the 64-bit value in m2.
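The ANDNOT operation negates its first operand, not its second, which is easy to get backwards. The following scalar sketch models the description on a plain 64-bit integer and shows one common use, selecting bits from two sources under a mask. The names andnot_si64_model and select_bits are hypothetical, and the bit-select pattern is an illustration rather than something the text prescribes.

```c
#include <stdint.h>

/* Model of the ANDNOT description: NOT is applied to m1, then AND with m2. */
static uint64_t andnot_si64_model(uint64_t m1, uint64_t m2)
{
    return ~m1 & m2;
}

/* Hypothetical use: take bits from a where mask is 1, from b where it is 0. */
static uint64_t select_bits(uint64_t mask, uint64_t a, uint64_t b)
{
    return (mask & a) | andnot_si64_model(mask, b);
}
```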

MMX(TM) Technology Compare Intrinsics

The prototypes for MMX™ technology intrinsics are in the mmintrin.h header file. The intrinsics in the following table perform compare operations. Details about each intrinsic follow the table below.

Intrinsic Name    Operation      Corresponding MMX Instruction
_mm_cmpeq_pi8     Equal          PCMPEQB
_mm_cmpeq_pi16    Equal          PCMPEQW
_mm_cmpeq_pi32    Equal          PCMPEQD
_mm_cmpgt_pi8     Greater Than   PCMPGTB
_mm_cmpgt_pi16    Greater Than   PCMPGTW
_mm_cmpgt_pi32    Greater Than   PCMPGTD

__m64 _mm_cmpeq_pi8(__m64 m1, __m64 m2)

If the respective 8-bit values in m1 are equal to the respective 8-bit values in m2, set the respective 8-bit resulting values to all ones; otherwise set them to all zeros.

__m64 _mm_cmpeq_pi16(__m64 m1, __m64 m2)

If the respective 16-bit values in m1 are equal to the respective 16-bit values in m2, set the respective 16-bit resulting values to all ones; otherwise set them to all zeros.

__m64 _mm_cmpeq_pi32(__m64 m1, __m64 m2)

If the respective 32-bit values in m1 are equal to the respective 32-bit values in m2, set the respective 32-bit resulting values to all ones; otherwise set them to all zeros.

__m64 _mm_cmpgt_pi8(__m64 m1, __m64 m2)

If the respective 8-bit signed values in m1 are greater than the respective 8-bit signed values in m2, set the respective 8-bit resulting values to all ones; otherwise set them to all zeros.

__m64 _mm_cmpgt_pi16(__m64 m1, __m64 m2)

If the respective 16-bit signed values in m1 are greater than the respective 16-bit signed values in m2, set the respective 16-bit resulting values to all ones; otherwise set them to all zeros.

__m64 _mm_cmpgt_pi32(__m64 m1, __m64 m2)

If the respective 32-bit signed values in m1 are greater than the respective 32-bit signed values in m2, set the respective 32-bit resulting values to all ones; otherwise set them to all zeros.
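Because each compare produces a lane of all ones or all zeros, the result is typically used as a mask for the logical operations. The sketch below models the 16-bit greater-than compare on arrays and uses the mask to compute a per-lane maximum without branching on the data. The function names are hypothetical, the maximum is only one possible use, and element 0 is assumed to be the least significant lane.

```c
#include <stdint.h>

/* Model of _mm_cmpgt_pi16: 0xFFFF where m1 > m2 (signed), otherwise 0. */
static void cmpgt_pi16_model(const int16_t m1[4], const int16_t m2[4],
                             uint16_t r[4])
{
    for (int i = 0; i < 4; ++i)
        r[i] = (m1[i] > m2[i]) ? 0xFFFFu : 0x0000u;
}

/* Hypothetical use: per-lane maximum built from the mask with
   AND, ANDNOT, and OR. */
static void max_pi16_via_mask(const int16_t a[4], const int16_t b[4],
                              int16_t out[4])
{
    uint16_t mask[4];
    cmpgt_pi16_model(a, b, mask);
    for (int i = 0; i < 4; ++i) {
        uint16_t keep_a = (uint16_t)(mask[i] & (uint16_t)a[i]);
        uint16_t keep_b = (uint16_t)(~mask[i] & (uint16_t)b[i]);
        out[i] = (int16_t)(uint16_t)(keep_a | keep_b);
    }
}
```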

MMX(TM) Technology Set Intrinsics

The prototypes for MMX™ technology intrinsics are in the mmintrin.h header file. Details about each intrinsic follow the table below.

Note In the descriptions regarding the bits of the MMX register, bit 0 is the least significant and bit 63 is the most significant.

Intrinsic Name      Operation            Corresponding MMX Instruction
_mm_setzero_si64    set to zero          PXOR
_mm_set_pi32        set integer values   Composite
_mm_set_pi16        set integer values   Composite
_mm_set_pi8         set integer values   Composite
_mm_set1_pi32       set integer values   Composite
_mm_set1_pi16       set integer values   Composite
_mm_set1_pi8        set integer values   Composite
_mm_setr_pi32       set integer values   Composite
_mm_setr_pi16       set integer values   Composite
_mm_setr_pi8        set integer values   Composite

__m64 _mm_setzero_si64()

Sets the 64-bit value to zero.

R
0x0

__m64 _mm_set_pi32(int i1, int i0)

Sets the 2 signed 32-bit integer values.

R0   R1
i0   i1

__m64 _mm_set_pi16(short s3, short s2, short s1, short s0)

Sets the 4 signed 16-bit integer values.

R0   R1   R2   R3
s0   s1   s2   s3

__m64 _mm_set_pi8(char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0)

Sets the 8 signed 8-bit integer values.

R0   R1   ...   R7
b0   b1   ...   b7

__m64 _mm_set1_pi32(int i)

Sets the 2 signed 32-bit integer values to i.

R0   R1
i    i

__m64 _mm_set1_pi16(short s)

Sets the 4 signed 16-bit integer values to s.

R0   R1   R2   R3
s    s    s    s

__m64 _mm_set1_pi8(char b)

Sets the 8 signed 8-bit integer values to b.

R0   R1   ...   R7
b    b    ...   b

__m64 _mm_setr_pi32(int i1, int i0)

Sets the 2 signed 32-bit integer values in reverse order.

R0   R1
i1   i0

__m64 _mm_setr_pi16(short s3, short s2, short s1, short s0)

Sets the 4 signed 16-bit integer values in reverse order.

R0   R1   R2   R3
s3   s2   s1   s0

__m64 _mm_setr_pi8(char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0)

Sets the 8 signed 8-bit integer values in reverse order.

R0   R1   ...   R7
b7   b6   ...   b0
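The only difference between the set and setr forms is which argument lands in the least significant element. A scalar model of the 16-bit variants, using arrays in which element 0 stands for R0, makes the ordering explicit. The function names are hypothetical, and the parameter names follow the prototypes as printed above.

```c
#include <stdint.h>

/* Model of _mm_set_pi16: the last argument (s0) becomes R0. */
static void set_pi16_model(int16_t s3, int16_t s2, int16_t s1, int16_t s0,
                           int16_t r[4])
{
    r[0] = s0;
    r[1] = s1;
    r[2] = s2;
    r[3] = s3;
}

/* Model of _mm_setr_pi16: the first argument becomes R0 (reverse order). */
static void setr_pi16_model(int16_t s3, int16_t s2, int16_t s1, int16_t s0,
                            int16_t r[4])
{
    r[0] = s3;
    r[1] = s2;
    r[2] = s1;
    r[3] = s0;
}
```

With the set form, arguments read left to right from most significant to least significant. With the setr form, they read in ascending element order, which matches the layout of an array in memory.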

MMX(TM) Technology Intrinsics on IA-64 Architecture

MMX™ technology intrinsics provide access to the MMX technology instruction set on systems based on IA-64 architecture. To provide source compatibility with the IA-32 architecture, these intrinsics are equivalent both in name and functionality to the set of IA-32-based MMX intrinsics.

The prototypes for MMX technology intrinsics are in the mmintrin.h header file.

Data Types

The C data type __m64 is used when using MMX technology intrinsics. It can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.

The __m64 data type is not a basic ANSI C data type. Therefore, observe the following usage restrictions:

• Use the new data type only on the left-hand side of an assignment, as a return value, or as a parameter. You cannot use it with other arithmetic expressions (" + ", " - ", and so on).
• Use the new data type as objects in aggregates, such as unions, to access the byte elements and structures; the address of an __m64 object may be taken.
• Use new data types only with the respective intrinsics described in this documentation.

For complete details of the hardware instructions, see the Intel® Architecture MMX™ Technology Programmer's Reference Manual. For descriptions of data types, see the Intel® Architecture Software Developer's Manual, Volume 2.


Streaming SIMD Extensions

Overview: Streaming SIMD Extensions

This section describes the C++ language-level features supporting the Streaming SIMD Extensions (SSE) in the Intel® C++ Compiler. These topics explain the following features of the intrinsics:

• Floating Point Intrinsics
• Arithmetic Operation Intrinsics
• Logical Operation Intrinsics
• Comparison Intrinsics
• Conversion Intrinsics
• Load Operations
• Set Operations
• Store Operations
• Cacheability Support
• Integer Intrinsics
• Intrinsics to Read and Write Registers
• Miscellaneous Intrinsics
• Using Streaming SIMD Extensions on Itanium® Architecture

The prototypes for SSE intrinsics are in the xmmintrin.h header file.

Note You can also use the single ia32intrin.h header file for any IA-32 intrinsics.

Floating-point Intrinsics for Streaming SIMD Extensions

You should be familiar with the hardware features provided by the Streaming SIMD Extensions (SSE) when writing programs with the intrinsics. The following are four important issues to keep in mind:

• Certain intrinsics, such as _mm_loadr_ps and _mm_cmpgt_ss, are not directly supported by the instruction set. While these intrinsics are convenient programming aids, be mindful that they may consist of more than one machine-language instruction.
• Floating-point data loaded or stored as __m128 objects must generally be 16-byte-aligned.
• Some intrinsics require that their arguments be immediates, that is, constant integers (literals), due to the nature of the instruction.
• The result of arithmetic operations acting on two NaN (Not a Number) arguments is undefined. Therefore, FP operations using NaN arguments will not match the expected behavior of the corresponding assembly instructions.
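The alignment requirement in the list above can be met with the C11 alignas specifier. The following minimal sketch adds two groups of four floats through aligned loads and stores. The names is_aligned16 and add4_aligned are hypothetical. The check of the __SSE__ macro and the scalar fallback are assumptions added so that the sketch also builds where xmmintrin.h is unavailable.

```c
#include <stdalign.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: true when p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: out[i] = a[i] + b[i] for four floats.
   All three pointers are expected to be 16-byte-aligned. */
static void add4_aligned(const float *a, const float *b, float *out)
{
#if defined(__SSE__)
    /* _mm_load_ps and _mm_store_ps are the aligned load and store forms. */
    _mm_store_ps(out, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
#else
    /* Scalar fallback (assumption of this sketch). */
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

A caller would declare its buffers as, for example, alignas(16) float v[4]; so that the aligned forms can be used safely.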

Arithmetic Operations for Streaming SIMD Extensions

The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

The results of each intrinsic operation are placed in a register. This register is illustrated for each intrinsic with R0-R3. R0, R1, R2 and R3 each represent one of the 4 32-bit pieces of the result register.

Details about each intrinsic follow the table below.

Intrinsic       Operation                Corresponding SSE Instruction
_mm_add_ss      Addition                 ADDSS
_mm_add_ps      Addition                 ADDPS
_mm_sub_ss      Subtraction              SUBSS
_mm_sub_ps      Subtraction              SUBPS
_mm_mul_ss      Multiplication           MULSS
_mm_mul_ps      Multiplication           MULPS
_mm_div_ss      Division                 DIVSS
_mm_div_ps      Division                 DIVPS
_mm_sqrt_ss     Square Root              SQRTSS
_mm_sqrt_ps     Square Root              SQRTPS
_mm_rcp_ss      Reciprocal               RCPSS
_mm_rcp_ps      Reciprocal               RCPPS
_mm_rsqrt_ss    Reciprocal Square Root   RSQRTSS
_mm_rsqrt_ps    Reciprocal Square Root   RSQRTPS
_mm_min_ss      Computes Minimum         MINSS
_mm_min_ps      Computes Minimum         MINPS
_mm_max_ss      Computes Maximum         MAXSS
_mm_max_ps      Computes Maximum         MAXPS

__m128 _mm_add_ss(__m128 a, __m128 b) Adds the lower single-precision, floating-point (SP FP) values of a and b; the upper 3 SP FP values are passed through from a. R0

R1 R2 R3

a0 + b0 a1 a2 a3 __m128 _mm_add_ps(__m128 a, __m128 b) Adds the four SP FP values of a and b. 29

Intel(R) C++ Intrinsics Reference

R0

R1

R2

R3

a0 +b0 a1 + b1 a2 + b2 a3 + b3 __m128 _mm_sub_ss(__m128 a, __m128 b) Subtracts the lower SP FP values of a and b. The upper 3 SP FP values are passed through from a. R0

R1 R2 R3

a0 - b0 a1 a2 a3

__m128 _mm_sub_ps(__m128 a, __m128 b)
Subtracts the four SP FP values of a and b.

R0        R1        R2        R3
a0 - b0   a1 - b1   a2 - b2   a3 - b3

__m128 _mm_mul_ss(__m128 a, __m128 b)
Multiplies the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.

R0        R1   R2   R3
a0 * b0   a1   a2   a3

__m128 _mm_mul_ps(__m128 a, __m128 b)
Multiplies the four SP FP values of a and b.

R0        R1        R2        R3
a0 * b0   a1 * b1   a2 * b2   a3 * b3

__m128 _mm_div_ss(__m128 a, __m128 b)
Divides the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.

R0        R1   R2   R3
a0 / b0   a1   a2   a3

__m128 _mm_div_ps(__m128 a, __m128 b)
Divides the four SP FP values of a and b.

R0        R1        R2        R3
a0 / b0   a1 / b1   a2 / b2   a3 / b3

__m128 _mm_sqrt_ss(__m128 a)
Computes the square root of the lower SP FP value of a; the upper 3 SP FP values are passed through.

R0         R1   R2   R3
sqrt(a0)   a1   a2   a3

__m128 _mm_sqrt_ps(__m128 a)
Computes the square roots of the four SP FP values of a.

R0         R1         R2         R3
sqrt(a0)   sqrt(a1)   sqrt(a2)   sqrt(a3)

__m128 _mm_rcp_ss(__m128 a)
Computes the approximation of the reciprocal of the lower SP FP value of a; the upper 3 SP FP values are passed through.

R0          R1   R2   R3
recip(a0)   a1   a2   a3

__m128 _mm_rcp_ps(__m128 a)
Computes the approximations of reciprocals of the four SP FP values of a.

R0          R1          R2          R3
recip(a0)   recip(a1)   recip(a2)   recip(a3)

__m128 _mm_rsqrt_ss(__m128 a)
Computes the approximation of the reciprocal of the square root of the lower SP FP value of a; the upper 3 SP FP values are passed through.

R0                R1   R2   R3
recip(sqrt(a0))   a1   a2   a3

__m128 _mm_rsqrt_ps(__m128 a)
Computes the approximations of the reciprocals of the square roots of the four SP FP values of a.

R0                R1                R2                R3
recip(sqrt(a0))   recip(sqrt(a1))   recip(sqrt(a2))   recip(sqrt(a3))

__m128 _mm_min_ss(__m128 a, __m128 b)
Computes the minimum of the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.

R0            R1   R2   R3
min(a0, b0)   a1   a2   a3

__m128 _mm_min_ps(__m128 a, __m128 b)
Computes the minimum of the four SP FP values of a and b.

R0            R1            R2            R3
min(a0, b0)   min(a1, b1)   min(a2, b2)   min(a3, b3)

__m128 _mm_max_ss(__m128 a, __m128 b)
Computes the maximum of the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.

R0            R1   R2   R3
max(a0, b0)   a1   a2   a3

__m128 _mm_max_ps(__m128 a, __m128 b)
Computes the maximum of the four SP FP values of a and b.

R0            R1            R2            R3
max(a0, b0)   max(a1, b1)   max(a2, b2)   max(a3, b3)
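
As an illustration of how the packed and scalar arithmetic forms differ, the following sketch clamps four values to a range with _mm_max_ps and _mm_min_ps, and multiplies only the lower element with _mm_mul_ss. The helper names clamp4 and mul_low4 and their array-based interfaces are assumptions made for this example; they are not part of the intrinsics.

```c
#include <xmmintrin.h>

/* Hypothetical helper: clamp each of four floats in src to the range [lo, hi]. */
static void clamp4(const float *src, float lo, float hi, float *dst)
{
    __m128 v = _mm_loadu_ps(src);          /* load four SP FP values, no alignment needed */
    v = _mm_max_ps(v, _mm_set1_ps(lo));    /* R[i] = max(v[i], lo) */
    v = _mm_min_ps(v, _mm_set1_ps(hi));    /* R[i] = min(R[i], hi) */
    _mm_storeu_ps(dst, v);
}

/* Hypothetical helper: scalar form; only element 0 is multiplied, a1..a3 pass through. */
static void mul_low4(const float *a, const float *b, float *dst)
{
    _mm_storeu_ps(dst, _mm_mul_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

Calling clamp4 with the range [0, 1] leaves in-range elements unchanged, while mul_low4 returns a0 * b0 followed by a1, a2 and a3 unchanged.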

Logical Operations for Streaming SIMD Extensions

The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

The results of each intrinsic operation are placed in a register. This register is illustrated for each intrinsic with R0-R3. R0, R1, R2 and R3 each represent one of the 4 32-bit pieces of the result register.

Details about each intrinsic follow the table below.

Intrinsic Name   Operation              Corresponding SSE Instruction
_mm_and_ps       Bitwise AND            ANDPS
_mm_andnot_ps    Bitwise ANDNOT         ANDNPS
_mm_or_ps        Bitwise OR             ORPS
_mm_xor_ps       Bitwise Exclusive OR   XORPS

__m128 _mm_and_ps(__m128 a, __m128 b)
Computes the bitwise AND of the four SP FP values of a and b.

R0        R1        R2        R3
a0 & b0   a1 & b1   a2 & b2   a3 & b3

__m128 _mm_andnot_ps(__m128 a, __m128 b)
Computes the bitwise AND-NOT of the four SP FP values of a and b.

R0         R1         R2         R3
~a0 & b0   ~a1 & b1   ~a2 & b2   ~a3 & b3

__m128 _mm_or_ps(__m128 a, __m128 b)
Computes the bitwise OR of the four SP FP values of a and b.

R0        R1        R2        R3
a0 | b0   a1 | b1   a2 | b2   a3 | b3

__m128 _mm_xor_ps(__m128 a, __m128 b)
Computes bitwise XOR (exclusive-or) of the four SP FP values of a and b.

R0        R1        R2        R3
a0 ^ b0   a1 ^ b1   a2 ^ b2   a3 ^ b3
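
Because these intrinsics operate on the raw bits of each SP FP value, a common use is sign manipulation. The sketch below is one possible illustration: it clears the sign bits with _mm_andnot_ps and flips them with _mm_xor_ps. The helper names abs4 and negate4 are hypothetical, and the example assumes the compiler preserves the sign of the constant -0.0f.

```c
#include <xmmintrin.h>

/* Hypothetical helper: absolute value of four floats, computed by clearing each sign bit. */
static void abs4(const float *src, float *dst)
{
    __m128 sign = _mm_set1_ps(-0.0f);   /* -0.0f has only the sign bit (bit 31) set */
    /* _mm_andnot_ps(a, b) computes ~a & b, so every bit except the sign bit is kept. */
    _mm_storeu_ps(dst, _mm_andnot_ps(sign, _mm_loadu_ps(src)));
}

/* Hypothetical helper: negate four floats by flipping each sign bit with XOR. */
static void negate4(const float *src, float *dst)
{
    _mm_storeu_ps(dst, _mm_xor_ps(_mm_set1_ps(-0.0f), _mm_loadu_ps(src)));
}
```

Note the operand order of _mm_andnot_ps: the first argument is the one that is complemented.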

Comparisons for Streaming SIMD Extensions

Each comparison intrinsic performs a comparison of a and b. For the packed form, the four SP FP values of a and b are compared, and a 128-bit mask is returned. For the scalar form, the lower SP FP values of a and b are compared, and a 32-bit mask is returned; the upper three SP FP values are passed through from a. The mask is set to 0xffffffff for each element where the comparison is true and 0x0 where the comparison is false.

Details about each intrinsic follow the table below.

The results of each intrinsic operation are placed in a register. This register is illustrated for each intrinsic with R or R0-R3. R0, R1, R2 and R3 each represent one of the 4 32-bit pieces of the result register.

The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

Intrinsic Name    Operation                   Corresponding SSE Instruction
_mm_cmpeq_ss      Equal                       CMPEQSS
_mm_cmpeq_ps      Equal                       CMPEQPS
_mm_cmplt_ss      Less Than                   CMPLTSS
_mm_cmplt_ps      Less Than                   CMPLTPS
_mm_cmple_ss      Less Than or Equal          CMPLESS
_mm_cmple_ps      Less Than or Equal          CMPLEPS
_mm_cmpgt_ss      Greater Than                CMPLTSS
_mm_cmpgt_ps      Greater Than                CMPLTPS
_mm_cmpge_ss      Greater Than or Equal       CMPLESS
_mm_cmpge_ps      Greater Than or Equal       CMPLEPS
_mm_cmpneq_ss     Not Equal                   CMPNEQSS
_mm_cmpneq_ps     Not Equal                   CMPNEQPS
_mm_cmpnlt_ss     Not Less Than               CMPNLTSS
_mm_cmpnlt_ps     Not Less Than               CMPNLTPS
_mm_cmpnle_ss     Not Less Than or Equal      CMPNLESS
_mm_cmpnle_ps     Not Less Than or Equal      CMPNLEPS
_mm_cmpngt_ss     Not Greater Than            CMPNLTSS
_mm_cmpngt_ps     Not Greater Than            CMPNLTPS
_mm_cmpnge_ss     Not Greater Than or Equal   CMPNLESS
_mm_cmpnge_ps     Not Greater Than or Equal   CMPNLEPS
_mm_cmpord_ss     Ordered                     CMPORDSS
_mm_cmpord_ps     Ordered                     CMPORDPS
_mm_cmpunord_ss   Unordered                   CMPUNORDSS
_mm_cmpunord_ps   Unordered                   CMPUNORDPS
_mm_comieq_ss     Equal                       COMISS
_mm_comilt_ss     Less Than                   COMISS
_mm_comile_ss     Less Than or Equal          COMISS
_mm_comigt_ss     Greater Than                COMISS
_mm_comige_ss     Greater Than or Equal       COMISS
_mm_comineq_ss    Not Equal                   COMISS
_mm_ucomieq_ss    Equal                       UCOMISS
_mm_ucomilt_ss    Less Than                   UCOMISS
_mm_ucomile_ss    Less Than or Equal          UCOMISS
_mm_ucomigt_ss    Greater Than                UCOMISS
_mm_ucomige_ss    Greater Than or Equal       UCOMISS
_mm_ucomineq_ss   Not Equal                   UCOMISS

__m128 _mm_cmpeq_ss(__m128 a, __m128 b)
Compare for equality.

R0   (a0 == b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmpeq_ps(__m128 a, __m128 b)
Compare for equality.

R0   (a0 == b0) ? 0xffffffff : 0x0
R1   (a1 == b1) ? 0xffffffff : 0x0
R2   (a2 == b2) ? 0xffffffff : 0x0
R3   (a3 == b3) ? 0xffffffff : 0x0

__m128 _mm_cmplt_ss(__m128 a, __m128 b)
Compare for less-than.

R0   (a0 < b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmplt_ps(__m128 a, __m128 b)
Compare for less-than.

R0   (a0 < b0) ? 0xffffffff : 0x0
R1   (a1 < b1) ? 0xffffffff : 0x0
R2   (a2 < b2) ? 0xffffffff : 0x0
R3   (a3 < b3) ? 0xffffffff : 0x0

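__m128 _mm_cmple_ss(__m128 a, __m128 b)
Compare for less-than-or-equal.

R0   (a0 <= b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmple_ps(__m128 a, __m128 b)
Compare for less-than-or-equal.

R0   (a0 <= b0) ? 0xffffffff : 0x0
R1   (a1 <= b1) ? 0xffffffff : 0x0
R2   (a2 <= b2) ? 0xffffffff : 0x0
R3   (a3 <= b3) ? 0xffffffff : 0x0

__m128 _mm_cmpgt_ss(__m128 a, __m128 b)
Compare for greater-than.

R0   (a0 > b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmpgt_ps(__m128 a, __m128 b)
Compare for greater-than.

R0   (a0 > b0) ? 0xffffffff : 0x0
R1   (a1 > b1) ? 0xffffffff : 0x0
R2   (a2 > b2) ? 0xffffffff : 0x0
R3   (a3 > b3) ? 0xffffffff : 0x0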

__m128 _mm_cmpge_ss(__m128 a, __m128 b)
Compare for greater-than-or-equal.

R0   (a0 >= b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmpge_ps(__m128 a, __m128 b)
Compare for greater-than-or-equal.

R0   (a0 >= b0) ? 0xffffffff : 0x0
R1   (a1 >= b1) ? 0xffffffff : 0x0
R2   (a2 >= b2) ? 0xffffffff : 0x0
R3   (a3 >= b3) ? 0xffffffff : 0x0

__m128 _mm_cmpneq_ss(__m128 a, __m128 b)
Compare for inequality.

R0   (a0 != b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmpneq_ps(__m128 a, __m128 b)
Compare for inequality.

R0   (a0 != b0) ? 0xffffffff : 0x0
R1   (a1 != b1) ? 0xffffffff : 0x0
R2   (a2 != b2) ? 0xffffffff : 0x0
R3   (a3 != b3) ? 0xffffffff : 0x0

__m128 _mm_cmpnlt_ss(__m128 a, __m128 b)
Compare for not-less-than.

R0   !(a0 < b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmpnlt_ps(__m128 a, __m128 b)
Compare for not-less-than.

R0   !(a0 < b0) ? 0xffffffff : 0x0
R1   !(a1 < b1) ? 0xffffffff : 0x0
R2   !(a2 < b2) ? 0xffffffff : 0x0
R3   !(a3 < b3) ? 0xffffffff : 0x0

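__m128 _mm_cmpnle_ss(__m128 a, __m128 b)
Compare for not-less-than-or-equal.

R0   !(a0 <= b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmpnle_ps(__m128 a, __m128 b)
Compare for not-less-than-or-equal.

R0   !(a0 <= b0) ? 0xffffffff : 0x0
R1   !(a1 <= b1) ? 0xffffffff : 0x0
R2   !(a2 <= b2) ? 0xffffffff : 0x0
R3   !(a3 <= b3) ? 0xffffffff : 0x0

__m128 _mm_cmpngt_ss(__m128 a, __m128 b)
Compare for not-greater-than.

R0   !(a0 > b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmpngt_ps(__m128 a, __m128 b)
Compare for not-greater-than.

R0   !(a0 > b0) ? 0xffffffff : 0x0
R1   !(a1 > b1) ? 0xffffffff : 0x0
R2   !(a2 > b2) ? 0xffffffff : 0x0
R3   !(a3 > b3) ? 0xffffffff : 0x0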

__m128 _mm_cmpnge_ss(__m128 a, __m128 b)
Compare for not-greater-than-or-equal.

R0   !(a0 >= b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmpnge_ps(__m128 a, __m128 b)
Compare for not-greater-than-or-equal.

R0   !(a0 >= b0) ? 0xffffffff : 0x0
R1   !(a1 >= b1) ? 0xffffffff : 0x0
R2   !(a2 >= b2) ? 0xffffffff : 0x0
R3   !(a3 >= b3) ? 0xffffffff : 0x0

__m128 _mm_cmpord_ss(__m128 a, __m128 b)
Compare for ordered.

R0   (a0 ord? b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmpord_ps(__m128 a, __m128 b)
Compare for ordered.

R0   (a0 ord? b0) ? 0xffffffff : 0x0
R1   (a1 ord? b1) ? 0xffffffff : 0x0
R2   (a2 ord? b2) ? 0xffffffff : 0x0
R3   (a3 ord? b3) ? 0xffffffff : 0x0

__m128 _mm_cmpunord_ss(__m128 a, __m128 b)
Compare for unordered.

R0   (a0 unord? b0) ? 0xffffffff : 0x0
R1   a1
R2   a2
R3   a3

__m128 _mm_cmpunord_ps(__m128 a, __m128 b)
Compare for unordered.

R0   (a0 unord? b0) ? 0xffffffff : 0x0
R1   (a1 unord? b1) ? 0xffffffff : 0x0
R2   (a2 unord? b2) ? 0xffffffff : 0x0
R3   (a3 unord? b3) ? 0xffffffff : 0x0
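
The all-ones/all-zeros masks returned by the packed comparisons are typically combined with the logical intrinsics to select values without branching. The following is a minimal sketch of that idiom; the helper names select_lt4 and unordered_bits4 are assumptions made for this example, and _mm_movemask_ps is used only to turn the mask into an integer for inspection.

```c
#include <xmmintrin.h>

/* Hypothetical helper: dst[i] = (a[i] < b[i]) ? x[i] : y[i], computed without branches. */
static void select_lt4(const float *a, const float *b,
                       const float *x, const float *y, float *dst)
{
    __m128 mask   = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 from_x = _mm_and_ps(mask, _mm_loadu_ps(x));      /* x where the mask is all ones */
    __m128 from_y = _mm_andnot_ps(mask, _mm_loadu_ps(y));   /* y where the mask is all zeros */
    _mm_storeu_ps(dst, _mm_or_ps(from_x, from_y));
}

/* Hypothetical helper: bit i of the result is set when a[i] or b[i] is a NaN. */
static int unordered_bits4(const float *a, const float *b)
{
    return _mm_movemask_ps(_mm_cmpunord_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

The same pattern works with any of the packed comparison intrinsics listed above in place of _mm_cmplt_ps.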

int _mm_comieq_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.

R
(a0 == b0) ? 0x1 : 0x0

int _mm_comilt_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.

R
(a0 < b0) ? 0x1 : 0x0

int _mm_comile_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.

R
(a0 <= b0) ? 0x1 : 0x0

int _mm_comigt_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.

R
(a0 > b0) ? 0x1 : 0x0

int _mm_comige_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned.

R
(a0 >= b0) ? 0x1 : 0x0

int _mm_comineq_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned.

R
(a0 != b0) ? 0x1 : 0x0

int _mm_ucomieq_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.

R
(a0 == b0) ? 0x1 : 0x0

int _mm_ucomilt_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.

R
(a0 < b0) ? 0x1 : 0x0

int _mm_ucomile_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.

R
(a0 <= b0) ? 0x1 : 0x0

int _mm_ucomigt_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.

R
(a0 > b0) ? 0x1 : 0x0

int _mm_ucomige_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned.

R
(a0 >= b0) ? 0x1 : 0x0

int _mm_ucomineq_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned.

R
(a0 != b0) ? 0x1 : 0x0
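
The scalar comparisons come in two flavors: the _mm_cmp*_ss intrinsics return a mask in a vector, while the _mm_comi*_ss and _mm_ucomi*_ss intrinsics return an int that can be used directly in a condition. The sketch below shows both routes for a less-than test; the helper names are hypothetical, and only ordinary (non-NaN) inputs are considered.

```c
#include <xmmintrin.h>

/* Hypothetical helper: less-than via the mask form; bit 0 of the sign mask is the answer. */
static int less_via_mask(float a, float b)
{
    __m128 m = _mm_cmplt_ss(_mm_set_ss(a), _mm_set_ss(b));
    return _mm_movemask_ps(m) & 1;
}

/* Hypothetical helper: less-than via the COMISS form, which returns 0 or 1 directly. */
static int less_via_comi(float a, float b)
{
    return _mm_comilt_ss(_mm_set_ss(a), _mm_set_ss(b));
}
```

For ordinary inputs the two helpers agree; the int-returning form is usually the more convenient one when the result feeds an if statement.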

Conversion Operations for Streaming SIMD Extensions

Details about each intrinsic follow the table below.

The results of each intrinsic operation are placed in a register. This register is illustrated for each intrinsic with R or R0-R3. R0, R1, R2 and R3 each represent one of the 4 32-bit pieces of the result register.

The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

Intrinsic Name     Operation                                        Corresponding SSE Instruction
_mm_cvtss_si32     Convert to 32-bit integer                        CVTSS2SI
_mm_cvtss_si64     Convert to 64-bit integer                        CVTSS2SI
_mm_cvtps_pi32     Convert to two 32-bit integers                   CVTPS2PI
_mm_cvttss_si32    Convert to 32-bit integer with truncation        CVTTSS2SI
_mm_cvttss_si64    Convert to 64-bit integer with truncation        CVTTSS2SI
_mm_cvttps_pi32    Convert to two 32-bit integers with truncation   CVTTPS2PI
_mm_cvtsi32_ss     Convert from 32-bit integer                      CVTSI2SS
_mm_cvtsi64_ss     Convert from 64-bit integer                      CVTSI2SS
_mm_cvtpi32_ps     Convert from two 32-bit integers                 CVTPI2PS
_mm_cvtpi16_ps     Convert from four 16-bit integers                composite
_mm_cvtpu16_ps     Convert from four 16-bit integers                composite
_mm_cvtpi8_ps      Convert from four 8-bit integers                 composite
_mm_cvtpu8_ps      Convert from four 8-bit integers                 composite
_mm_cvtpi32x2_ps   Convert from four 32-bit integers                composite
_mm_cvtps_pi16     Convert to four 16-bit integers                  composite
_mm_cvtps_pi8      Convert to four 8-bit integers                   composite
_mm_cvtss_f32      Extract                                          composite

int _mm_cvtss_si32(__m128 a)
Convert the lower SP FP value of a to a 32-bit integer according to the current rounding mode.

R
(int)a0

__int64 _mm_cvtss_si64(__m128 a)
Convert the lower SP FP value of a to a 64-bit signed integer according to the current rounding mode.

R
(__int64)a0

__m64 _mm_cvtps_pi32(__m128 a)
Convert the two lower SP FP values of a to two 32-bit integers according to the current rounding mode, returning the integers in packed form.

R0        R1
(int)a0   (int)a1

int _mm_cvttss_si32(__m128 a)
Convert the lower SP FP value of a to a 32-bit integer with truncation.

R
(int)a0

__int64 _mm_cvttss_si64(__m128 a)
Convert the lower SP FP value of a to a 64-bit signed integer with truncation.

R
(__int64)a0

__m64 _mm_cvttps_pi32(__m128 a)
Convert the two lower SP FP values of a to two 32-bit integers with truncation, returning the integers in packed form.

R0        R1
(int)a0   (int)a1

__m128 _mm_cvtsi32_ss(__m128 a, int b)
Convert the 32-bit integer value b to an SP FP value; the upper three SP FP values are passed through from a.

R0         R1   R2   R3
(float)b   a1   a2   a3

__m128 _mm_cvtsi64_ss(__m128 a, __int64 b)
Convert the signed 64-bit integer value b to an SP FP value; the upper three SP FP values are passed through from a.

R0         R1   R2   R3
(float)b   a1   a2   a3

__m128 _mm_cvtpi32_ps(__m128 a, __m64 b)
Convert the two 32-bit integer values in packed form in b to two SP FP values; the upper two SP FP values are passed through from a.

R0          R1          R2   R3
(float)b0   (float)b1   a2   a3

__m128 _mm_cvtpi16_ps(__m64 a)
Convert the four 16-bit signed integer values in a to four single precision FP values.

R0          R1          R2          R3
(float)a0   (float)a1   (float)a2   (float)a3

__m128 _mm_cvtpu16_ps(__m64 a)
Convert the four 16-bit unsigned integer values in a to four single precision FP values.

R0          R1          R2          R3
(float)a0   (float)a1   (float)a2   (float)a3

__m128 _mm_cvtpi8_ps(__m64 a)
Convert the lower four 8-bit signed integer values in a to four single precision FP values.

R0          R1          R2          R3
(float)a0   (float)a1   (float)a2   (float)a3

__m128 _mm_cvtpu8_ps(__m64 a)
Convert the lower four 8-bit unsigned integer values in a to four single precision FP values.

R0          R1          R2          R3
(float)a0   (float)a1   (float)a2   (float)a3

__m128 _mm_cvtpi32x2_ps(__m64 a, __m64 b)
Convert the two 32-bit signed integer values in a and the two 32-bit signed integer values in b to four single precision FP values.

R0          R1          R2          R3
(float)a0   (float)a1   (float)b0   (float)b1

__m64 _mm_cvtps_pi16(__m128 a)
Convert the four single precision FP values in a to four signed 16-bit integer values.

R0          R1          R2          R3
(short)a0   (short)a1   (short)a2   (short)a3

__m64 _mm_cvtps_pi8(__m128 a)
Convert the four single precision FP values in a to the lower four signed 8-bit integer values of the result.

R0         R1         R2         R3
(char)a0   (char)a1   (char)a2   (char)a3

float _mm_cvtss_f32(__m128 a)
This intrinsic extracts a single precision floating point value from the first vector element of an __m128. It does so in the most efficient manner possible in the context used.
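
The difference between the rounding and truncating conversions can be seen with a small sketch such as the one below. The helper names are hypothetical, and the expected values in the note assume the processor is in its default round-to-nearest mode.

```c
#include <xmmintrin.h>

/* Hypothetical helper: convert using the current rounding mode (CVTSS2SI). */
static int to_int_rounded(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* Hypothetical helper: convert with truncation toward zero (CVTTSS2SI). */
static int to_int_truncated(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}

/* Hypothetical helper: read back the lowest element of four floats as a plain float. */
static float low_element(const float *p)
{
    return _mm_cvtss_f32(_mm_loadu_ps(p));
}
```

Under round-to-nearest, to_int_rounded(2.7f) gives 3 while to_int_truncated(2.7f) gives 2; for -2.7f the results are -3 and -2 respectively.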

Load Operations for Streaming SIMD Extensions

The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

Details about each intrinsic follow the table below.

The results of each intrinsic operation are placed in a register. This register is illustrated for each intrinsic with R0-R3. R0, R1, R2 and R3 each represent one of the 4 32-bit pieces of the result register.

Intrinsic Name   Operation                                            Corresponding SSE Instruction
_mm_loadh_pi     Load high                                            MOVHPS reg, mem
_mm_loadl_pi     Load low                                             MOVLPS reg, mem
_mm_load_ss      Load the low value and clear the three high values   MOVSS
_mm_load1_ps     Load one value into all four words                   MOVSS + Shuffling
_mm_load_ps      Load four values, address aligned                    MOVAPS
_mm_loadu_ps     Load four values, address unaligned                  MOVUPS
_mm_loadr_ps     Load four values in reverse                          MOVAPS + Shuffling

__m128 _mm_loadh_pi(__m128 a, __m64 const *p)
Sets the upper two SP FP values with 64 bits of data loaded from the address p; the lower two values are passed through from a.

R0   R1   R2    R3
a0   a1   *p0   *p1

__m128 _mm_loadl_pi(__m128 a, __m64 const *p)
Sets the lower two SP FP values with 64 bits of data loaded from the address p; the upper two values are passed through from a.

R0    R1    R2   R3
*p0   *p1   a2   a3

__m128 _mm_load_ss(float *p)
Loads an SP FP value into the low word and clears the upper three words.

R0   R1    R2    R3
*p   0.0   0.0   0.0

__m128 _mm_load1_ps(float *p)
Loads a single SP FP value, copying it into all four words.

R0   R1   R2   R3
*p   *p   *p   *p

__m128 _mm_load_ps(float *p)
Loads four SP FP values. The address must be 16-byte-aligned.

R0     R1     R2     R3
p[0]   p[1]   p[2]   p[3]

__m128 _mm_loadu_ps(float *p)
Loads four SP FP values. The address need not be 16-byte-aligned.

R0     R1     R2     R3
p[0]   p[1]   p[2]   p[3]

__m128 _mm_loadr_ps(float *p)
Loads four SP FP values in reverse order. The address must be 16-byte-aligned.

R0     R1     R2     R3
p[3]   p[2]   p[1]   p[0]
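
To make the element ordering of the load variants concrete, the following sketch loads the same data in several ways and writes each result back to memory. The helper names load_variants and load_halves are assumptions for this example; the caller is assumed to pass a 16-byte-aligned src to load_variants.

```c
#include <xmmintrin.h>

/* Hypothetical helper: load the same four floats in several ways.
   src must be 16-byte-aligned because _mm_load_ps and _mm_loadr_ps require it. */
static void load_variants(const float *src, float *fwd, float *rev,
                          float *splat, float *low)
{
    _mm_storeu_ps(fwd,   _mm_load_ps(src));    /* p[0] p[1] p[2] p[3] */
    _mm_storeu_ps(rev,   _mm_loadr_ps(src));   /* p[3] p[2] p[1] p[0] */
    _mm_storeu_ps(splat, _mm_load1_ps(src));   /* p[0] in all four elements */
    _mm_storeu_ps(low,   _mm_load_ss(src));    /* p[0] 0.0 0.0 0.0 */
}

/* Hypothetical helper: assemble one vector from two separate pairs of floats. */
static void load_halves(const float *lo_pair, const float *hi_pair, float *dst)
{
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, (const __m64 *)lo_pair);   /* fills R0 and R1 */
    v = _mm_loadh_pi(v, (const __m64 *)hi_pair);   /* fills R2 and R3 */
    _mm_storeu_ps(dst, v);
}
```

One simple way to obtain a 16-byte-aligned buffer for such calls is a union of an __m128 and a float[4].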

Set Operations for Streaming SIMD Extensions

The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

Details about each intrinsic follow the table below.

The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R0, R1, R2 and R3 represent the registers in which results are placed.

Intrinsic Name   Operation                                           Corresponding SSE Instruction
_mm_set_ss       Set the low value and clear the three high values   Composite
_mm_set1_ps      Set all four words with the same value              Composite
_mm_set_ps       Set four values                                     Composite
_mm_setr_ps      Set four values, in reverse order                   Composite
_mm_setzero_ps   Clear all four values                               Composite

__m128 _mm_set_ss(float w)
Sets the low word of an SP FP value to w and clears the upper three words.

R0   R1    R2    R3
w    0.0   0.0   0.0

__m128 _mm_set1_ps(float w)
Sets the four SP FP values to w.

R0   R1   R2   R3
w    w    w    w

__m128 _mm_set_ps(float z, float y, float x, float w)
Sets the four SP FP values to the four inputs.

R0   R1   R2   R3
w    x    y    z

__m128 _mm_setr_ps(float z, float y, float x, float w)
Sets the four SP FP values to the four inputs in reverse order.

R0   R1   R2   R3
z    y    x    w

__m128 _mm_setzero_ps(void)
Clears the four SP FP values.

R0    R1    R2    R3
0.0   0.0   0.0   0.0
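
The argument order of _mm_set_ps and _mm_setr_ps is a frequent source of confusion, so a short sketch may help. The helper name set_order_demo is hypothetical; it stores both results to memory so that element R0 lands at index 0.

```c
#include <xmmintrin.h>

/* Hypothetical helper: build the same vector with _mm_set_ps and _mm_setr_ps. */
static void set_order_demo(float *from_set, float *from_setr)
{
    /* _mm_set_ps takes its arguments from the highest element down to R0. */
    _mm_storeu_ps(from_set,  _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f));
    /* _mm_setr_ps takes its arguments starting with R0. */
    _mm_storeu_ps(from_setr, _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f));
}
```

Both arrays end up holding 1.0, 2.0, 3.0, 4.0 in that order, which shows that the two calls produce the same register contents.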

Store Operations for Streaming SIMD Extensions

Details about each intrinsic follow the table below.

The detailed description of each intrinsic contains a table detailing the returns. In these tables, p[n] is an access to the nth element of the result.

The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

Intrinsic Name   Operation                                                    Corresponding SSE Instruction
_mm_storeh_pi    Store high                                                   MOVHPS mem, reg
_mm_storel_pi    Store low                                                    MOVLPS mem, reg
_mm_store_ss     Store the low value                                          MOVSS
_mm_store1_ps    Store the low value across all four words, address aligned   Shuffling + MOVSS
_mm_store_ps     Store four values, address aligned                           MOVAPS
_mm_storeu_ps    Store four values, address unaligned                         MOVUPS
_mm_storer_ps    Store four values, in reverse order                          MOVAPS + Shuffling

void _mm_storeh_pi(__m64 *p, __m128 a)
Stores the upper two SP FP values of a to the address p.

*p0   *p1
a2    a3

void _mm_storel_pi(__m64 *p, __m128 a)
Stores the lower two SP FP values of a to the address p.

*p0   *p1
a0    a1

void _mm_store_ss(float *p, __m128 a)
Stores the lower SP FP value.

*p
a0

void _mm_store1_ps(float *p, __m128 a)
Stores the lower SP FP value across four words.

p[0]   p[1]   p[2]   p[3]
a0     a0     a0     a0

void _mm_store_ps(float *p, __m128 a)
Stores four SP FP values. The address must be 16-byte-aligned.

p[0]   p[1]   p[2]   p[3]
a0     a1     a2     a3

void _mm_storeu_ps(float *p, __m128 a)
Stores four SP FP values. The address need not be 16-byte-aligned.

p[0]   p[1]   p[2]   p[3]
a0     a1     a2     a3

void _mm_storer_ps(float *p, __m128 a)
Stores four SP FP values in reverse order. The address must be 16-byte-aligned.

p[0]   p[1]   p[2]   p[3]
a3     a2     a1     a0
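
The store variants mirror the loads. As a rough sketch, the hypothetical helper below writes one vector out in reversed, replicated, and upper-half form; the caller is assumed to supply 16-byte-aligned rev and splat buffers, since _mm_storer_ps and _mm_store1_ps require alignment.

```c
#include <xmmintrin.h>

/* Hypothetical helper: store one vector in several ways.
   rev and splat must be 16-byte-aligned; hi_pair must have room for two floats. */
static void store_variants(const float *src, float *rev, float *splat, float *hi_pair)
{
    __m128 a = _mm_loadu_ps(src);
    _mm_storer_ps(rev, a);                 /* a3 a2 a1 a0 */
    _mm_store1_ps(splat, a);               /* a0 a0 a0 a0 */
    _mm_storeh_pi((__m64 *)hi_pair, a);    /* a2 a3 */
}
```

When the alignment of the destination is not known, _mm_storeu_ps is the safe choice, at some possible cost in speed.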

Cacheability Support Using Streaming SIMD Extensions

Details about each intrinsic follow the table below.

The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

Intrinsic Name   Operation     Corresponding SSE Instruction
_mm_prefetch     Load          PREFETCH
_mm_stream_pi    Store         MOVNTQ
_mm_stream_ps    Store         MOVNTPS
_mm_sfence       Store fence   SFENCE

void _mm_prefetch(char const *a, int sel)
Loads one cache line of data from address a to a location "closer" to the processor. The value sel specifies the type of prefetch operation: the constants _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA should be used for IA-32, corresponding to the type of prefetch instruction. The constants _MM_HINT_T1, _MM_HINT_NT1, _MM_HINT_NT2, and _MM_HINT_NTA should be used for systems based on IA-64 architecture.

void _mm_stream_pi(__m64 *p, __m64 a)
Stores the data in a to the address p without polluting the caches. This intrinsic requires you to empty the multimedia state for the MMX(TM) technology register. See The EMMS Instruction: Why You Need It.

void _mm_stream_ps(float *p, __m128 a)
Stores the data in a to the address p without polluting the caches. The address must be 16-byte-aligned.

void _mm_sfence(void)
Guarantees that every preceding store is globally visible before any subsequent store.
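
A typical way these intrinsics are combined is a copy loop that prefetches upcoming source data, writes the destination with non-temporal stores, and finishes with a store fence. The sketch below illustrates that pattern only; the helper name stream_copy, the prefetch distance of 16 floats, and the choice of _MM_HINT_T0 are assumptions made for this example, not tuned recommendations.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: copy n floats with non-temporal stores.
   Assumes dst is 16-byte-aligned and n is a multiple of 4. */
static void stream_copy(float *dst, const float *src, size_t n)
{
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        if (i + 16 < n)   /* hint that data 16 floats ahead will be read soon */
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_T0);
        _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));   /* store without polluting the caches */
    }
    _mm_sfence();   /* make the streamed stores globally visible before later stores */
}
```

Whether streaming stores and explicit prefetching help depends on the data size and the processor, so a loop like this is best measured against a plain copy.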

Integer Intrinsics Using Streaming SIMD Extensions

The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R, R0, R1...R7 represent the registers in which results are placed.

Details about each intrinsic follow the table below.

The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

Before using these intrinsics, you must empty the multimedia state for the MMX(TM) technology register. See The EMMS Instruction: Why You Need It for more details.

Intrinsic Name      Operation                             Corresponding SSE Instruction
_mm_extract_pi16    Extract one of four words             PEXTRW
_mm_insert_pi16     Insert word                           PINSRW
_mm_max_pi16        Compute maximum                       PMAXSW
_mm_max_pu8         Compute maximum, unsigned             PMAXUB
_mm_min_pi16        Compute minimum                       PMINSW
_mm_min_pu8         Compute minimum, unsigned             PMINUB
_mm_movemask_pi8    Create eight-bit mask                 PMOVMSKB
_mm_mulhi_pu16      Multiply, return high bits            PMULHUW
_mm_shuffle_pi16    Return a combination of four words    PSHUFW
_mm_maskmove_si64   Conditional Store                     MASKMOVQ
_mm_avg_pu8         Compute rounded average               PAVGB
_mm_avg_pu16        Compute rounded average               PAVGW
_mm_sad_pu8         Compute sum of absolute differences   PSADBW

int _mm_extract_pi16(__m64 a, int n)
Extracts one of the four words of a. The selector n must be an immediate.

R
(n==0) ? a0 : ( (n==1) ? a1 : ( (n==2) ? a2 : a3 ) )

__m64 _mm_insert_pi16(__m64 a, int d, int n)
Inserts word d into one of four words of a. The selector n must be an immediate.

R0                R1                R2                R3
(n==0) ? d : a0   (n==1) ? d : a1   (n==2) ? d : a2   (n==3) ? d : a3

__m64 _mm_max_pi16(__m64 a, __m64 b)
Computes the element-wise maximum of the words in a and b.

R0            R1            R2            R3
max(a0, b0)   max(a1, b1)   max(a2, b2)   max(a3, b3)

__m64 _mm_max_pu8(__m64 a, __m64 b)
Computes the element-wise maximum of the unsigned bytes in a and b.

R0            R1            ...   R7
max(a0, b0)   max(a1, b1)   ...   max(a7, b7)

__m64 _mm_min_pi16(__m64 a, __m64 b)
Computes the element-wise minimum of the words in a and b.

R0            R1            R2            R3
min(a0, b0)   min(a1, b1)   min(a2, b2)   min(a3, b3)

__m64 _mm_min_pu8(__m64 a, __m64 b)
Computes the element-wise minimum of the unsigned bytes in a and b.

R0            R1            ...   R7
min(a0, b0)   min(a1, b1)   ...   min(a7, b7)

int _mm_movemask_pi8(__m64 a)
Creates an 8-bit mask from the most significant bits of the bytes in a.

R
sign(a7)<<7 | sign(a6)<<6 | ... | sign(a0)

__m64 _mm_mulhi_pu16(__m64 a, __m64 b)
Multiplies the unsigned words in a and b, returning the upper 16 bits of the 32-bit intermediate results.

R0                R1                R2                R3
hiword(a0 * b0)   hiword(a1 * b1)   hiword(a2 * b2)   hiword(a3 * b3)

__m64 _mm_shuffle_pi16(__m64 a, int n)
Returns a combination of the four words of a. The selector n must be an immediate.

R0   word (n&0x3) of a
R1   word ((n>>2)&0x3) of a
R2   word ((n>>4)&0x3) of a
R3   word ((n>>6)&0x3) of a

void _mm_maskmove_si64(__m64 d, __m64 n, char *p)
Conditionally store byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored.

if (sign(n0)) p[0] := d0
if (sign(n1)) p[1] := d1
...
if (sign(n7)) p[7] := d7

__m64 _mm_avg_pu8(__m64 a, __m64 b)
Computes the (rounded) averages of the unsigned bytes in a and b.

R0    (t + 1) >> 1, where t = (unsigned char)a0 + (unsigned char)b0
R1    (t + 1) >> 1, where t = (unsigned char)a1 + (unsigned char)b1
...
R7    (t + 1) >> 1, where t = (unsigned char)a7 + (unsigned char)b7

__m64 _mm_avg_pu16(__m64 a, __m64 b)
Computes the (rounded) averages of the unsigned words in a and b.

R0    (t + 1) >> 1, where t = (unsigned int)a0 + (unsigned int)b0
R1    (t + 1) >> 1, where t = (unsigned int)a1 + (unsigned int)b1
R2    (t + 1) >> 1, where t = (unsigned int)a2 + (unsigned int)b2
R3    (t + 1) >> 1, where t = (unsigned int)a3 + (unsigned int)b3

__m64 _mm_sad_pu8(__m64 a, __m64 b)
Computes the sum of the absolute differences of the unsigned bytes in a and b, returning the value in the lower word. The upper three words are cleared.

R0                              R1   R2   R3
abs(a0-b0) +... + abs(a7-b7)    0    0    0
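
As a small illustration of the __m64 integer intrinsics, the sketch below computes rounded byte averages and a sum of absolute differences. The helper names avg8 and sad8 are hypothetical, the data is moved in and out of __m64 values with memcpy purely for portability of the example, and the code assumes a compiler and target that still provide the __m64 intrinsics (some 64-bit toolchains do not).

```c
#include <string.h>
#include <xmmintrin.h>

/* Hypothetical helper: rounded average of eight unsigned bytes. */
static void avg8(const unsigned char *a, const unsigned char *b, unsigned char *dst)
{
    __m64 va, vb, r;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    r = _mm_avg_pu8(va, vb);
    memcpy(dst, &r, sizeof r);
    _mm_empty();   /* empty the MMX state before any later x87 floating-point code */
}

/* Hypothetical helper: sum of absolute differences of eight unsigned bytes. */
static int sad8(const unsigned char *a, const unsigned char *b)
{
    __m64 va, vb;
    int sum;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    sum = _mm_extract_pi16(_mm_sad_pu8(va, vb), 0);   /* the sum is in the lower word */
    _mm_empty();
    return sum;
}
```

For example, averaging the bytes 1 and 2 yields 2, because the result is rounded up.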

Intrinsics to Read and Write Registers for Streaming SIMD Extensions

Details about each intrinsic follow the table below.

The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

Intrinsic Name   Operation                 Corresponding SSE Instruction
_mm_getcsr       Return control register   STMXCSR
_mm_setcsr       Set control register      LDMXCSR

unsigned int _mm_getcsr(void)
Returns the contents of the control register.

void _mm_setcsr(unsigned int i)
Sets the control register to the value specified.
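
One common use of these two intrinsics is to change the rounding mode temporarily and then restore the previous register contents. The sketch below assumes the _MM_ROUND_* constants and _MM_ROUND_MASK defined in xmmintrin.h; the helper name convert_with_mode is hypothetical, and the volatile variables are a precaution chosen for this example so that the compiler does not move the conversion across the register writes.

```c
#include <xmmintrin.h>

/* Hypothetical helper: convert x to an integer under the given rounding mode
   (one of the _MM_ROUND_* constants), then restore the control register. */
static int convert_with_mode(float x, unsigned int mode)
{
    unsigned int saved = _mm_getcsr();
    /* volatile keeps the conversion between the two register writes */
    volatile float input = x;
    volatile int result;

    _mm_setcsr((saved & ~(unsigned int)_MM_ROUND_MASK) | mode);
    result = _mm_cvtss_si32(_mm_set_ss(input));   /* uses the current rounding mode */
    _mm_setcsr(saved);
    return result;
}
```

With this helper, converting 2.7f gives 3 under _MM_ROUND_NEAREST and 2 under _MM_ROUND_TOWARD_ZERO or _MM_ROUND_DOWN.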

Miscellaneous Intrinsics Using Streaming SIMD Extensions

The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

The results of each intrinsic operation are placed in registers. The information about what is placed in each register appears in the tables below, in the detailed explanation of each intrinsic. R, R0, R1, R2 and R3 represent the registers in which results are placed.

Details about each intrinsic follow the table below.

Intrinsic Name    Operation                                 Corresponding SSE Instruction
_mm_shuffle_ps    Shuffle                                   SHUFPS
_mm_unpackhi_ps   Unpack High                               UNPCKHPS
_mm_unpacklo_ps   Unpack Low                                UNPCKLPS
_mm_move_ss       Set low word, pass in three high values   MOVSS
_mm_movehl_ps     Move High to Low                          MOVHLPS
_mm_movelh_ps     Move Low to High                          MOVLHPS
_mm_movemask_ps   Create four-bit mask                      MOVMSKPS

__m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int imm8)
Selects four specific SP FP values from a and b, based on the mask imm8. The mask must be an immediate. See Macro Function for Shuffle Using Streaming SIMD Extensions for a description of the shuffle semantics.

__m128 _mm_unpackhi_ps(__m128 a, __m128 b)
Selects and interleaves the upper two SP FP values from a and b.

R0   R1   R2   R3
a2   b2   a3   b3

__m128 _mm_unpacklo_ps(__m128 a, __m128 b)
Selects and interleaves the lower two SP FP values from a and b.

R0   R1   R2   R3
a0   b0   a1   b1

__m128 _mm_move_ss(__m128 a, __m128 b)
Sets the low word to the SP FP value of b. The upper 3 SP FP values are passed through from a.

R0   R1   R2   R3
b0   a1   a2   a3

__m128 _mm_movehl_ps(__m128 a, __m128 b)
Moves the upper 2 SP FP values of b to the lower 2 SP FP values of the result. The upper 2 SP FP values of a are passed through to the result.

R0   R1   R2   R3
b2   b3   a2   a3

__m128 _mm_movelh_ps(__m128 a, __m128 b)
Moves the lower 2 SP FP values of b to the upper 2 SP FP values of the result. The lower 2 SP FP values of a are passed through to the result.

R0   R1   R2   R3
a0   a1   b0   b1

int _mm_movemask_ps(__m128 a)
Creates a 4-bit mask from the most significant bits of the four SP FP values.

R sign(a3)