Intel® Itanium™ Floating-Point Architecture
Marius Cornea, John Harrison, and Ping Tak Peter Tang
Intel Corporation
WCAE 2003
06/15/03
Agenda
- Intel® Itanium® Architecture
- Intel® Itanium® Processor Floating-Point Architecture
- Status Fields and Exceptions
- The Floating-Point Multiply-Add
- Exact Arithmetic
- Accurate Remainders
- Accurate Range Reduction
- Comparison and Classification
- Division and Square Root
- Additional Features
- Conclusion
Intel® Itanium® architecture

One of the major processor architectures present in the market today
- 2001 – Intel Itanium processor
- 2003 – Intel Itanium 2 processor – highest SPEC CFP2000 score currently
- Better price/performance ratio and power consumption with every new implementation
- Large register sets: 128 floating-point registers
- Predication
- Speculation
Intel® Itanium® architecture

- Support for explicit parallelism: static and rotating registers
- Floating-point features aimed at speed and accuracy
- EPIC (Explicitly Parallel Instruction Computing) design philosophy
- Targets the most demanding enterprise and high-performance computing applications
Itanium Processor Floating-Point Architecture
- Floating-point multiply-add (fused multiply-add) allows higher accuracy and performance
- Software and hardware interaction
- Division: throughput can be as high as one result every 3.5 clock cycles
- Floating-point formats: 24-, 53-, and 64-bit significands; 8-, 11-, 15-, and 17-bit exponents; 1-bit sign
- Register and memory encodings: 0, normalized values, denormalized/unnormalized values, infinity, NaN, NaTVal ('not a value', for speculative operations); redundant representations
Itanium Processor Floating-Point Architecture
Examples:
- Using double-extended intermediate precision to compute a double precision function: the double precision input arguments can be freely combined with double-extended intermediate results
- Computing functions involving constants with few significant digits: whatever the precision of the computation, the short constants can be stored in single precision
Status Fields and Exceptions
64-bit Floating-Point Status Register (FPSR)
- six trap disable bits control the five IEEE Standard exceptions and the denormal exception
- four 13-bit status fields: s0, s1, s2, and s3
- six flags per status field that record the occurrence of each of the six exceptions
- seven control bits per status field: rounding (2 bits), precision (2 bits), traps disable, flush-to-zero (ftz), and widest-range exponent (wre, for 17-bit exponents)
Status Fields and Exceptions
Status field usage is determined by software conventions:
- s0 is the main user status field
- s1, with wre enabled and all exceptions disabled, is used in many standard numerical software kernels such as those for division, square root, and transcendental functions
- status fields s2 and s3 are commonly used for speculation
The Floating-Point Multiply-Add

Basic assembly syntax:
  (qp) fma.pc.sf f1 = f3, f4, f2
which calculates f1 = f3 ⋅ f4 + f2 with one rounding error.

Addition and multiplication are implemented as special cases of the fma: x + y = x ⋅ 1 + y and x ⋅ y = x ⋅ y + 0.

Two variants of the fma exist: the fms (floating-point multiply-subtract) and fnma (floating-point negative multiply-add):
  (qp) fms.pc.sf f1 = f3, f4, f2
  (qp) fnma.pc.sf f1 = f3, f4, f2
which compute f1 = f3 ⋅ f4 – f2 and f1 = –f3 ⋅ f4 + f2, respectively.
The Floating-Point Multiply-Add
Example: the vector dot product x ⋅ y of two n-dimensional vectors, p = Σ xi ⋅ yi, can be evaluated by a succession of fma operations of the form p = p + xi ⋅ yi, requiring only n floating-point operations, whereas with a separate multiplication and addition it would require 2n operations, with a longer overall latency.
Exact Arithmetic

Addition – if |x| ≥ |y|, the exact sum x + y can be obtained as a two-piece expansion Hi + Lo:
  Hi = x + y
  tmp = x – Hi
  Lo = tmp + y
(Hi + Lo = x + y exactly, with Lo the rounding error in Hi ≈ x + y)

Multiplication – the exact product x ⋅ y can be obtained as a two-piece expansion Hi + Lo:
  Hi = x ⋅ y
  Lo = x ⋅ y – Hi
(Hi + Lo = x ⋅ y exactly, with Lo the rounding error in Hi ≈ x ⋅ y)
Accurate Remainders
If a floating-point number q is approximately equal to the quotient a / b of two floating-point numbers, the remainder r = a – b ⋅ q can be calculated exactly with one fnma operation, if q is within 1 ulp (unit-in-the-last-place) of a / b
Useful in software implementations of floating-point division, square root, and remainder; also for integer division and remainder computations, which are implemented on top of floating-point operations
Accurate Range Reduction
Many algorithms for mathematical functions (e.g. sin) begin with an initial range reduction phase, subtracting an integer multiple of a constant such as π / 2
With the fma this can be done in a single instruction x – N ⋅ P
Typically:
  y = Q ⋅ x
  N = rint(y)
  r = x – N ⋅ P
where rint(y) denotes the rounding of y to an integer, and Q ≈ 1 / P
Comparison and Classification
Syntax:
  (qp) fcmp.frel.fctype p1, p2 = f2, f3
where the frel completer determines the relation that is tested for.
Mnemonics for frel: eq for f2 = f3, lt for f2 < f3, le for f2 ≤ f3, gt for f2 > f3, ge for f2 ≥ f3, and unord for f2 ? f3.

There is no signed/unsigned distinction, but there is a new possibility (f2 ? f3): two values may be unordered, since a NaN (Not a Number) compares false with any floating-point value, even with itself.

fctype is the comparison type: normal or unconditional.
Division and Square Root
Implemented in software, based on the reciprocal approximation and reciprocal square root approximation instructions
Given two floating-point numbers a and b, the floating-point reciprocal approximation instruction, frcpa, normally returns an approximation of 1/b good to about 8 bits:
  (qp) frcpa.sf f1, p2 = f2, f3

Given a floating-point number a, the floating-point reciprocal square root approximation instruction, frsqrta, normally returns an approximation of 1/√a good to about 8 bits:
  (qp) frsqrta.sf f1, p2 = f3
Additional Features
- Transferring values between floating-point and integer registers by means of the getf and setf instructions
- Floating-point merging with fmerge, useful for combining fields of multiple floating-point numbers
- Floating-point to integer and integer to floating-point conversion using the fcvt instructions
- Integer multiplication and division, implemented using the floating-point unit
- Floating-point maximum and minimum, using the fmax, famax, fmin, and famin instructions
Conclusion
The Itanium floating-point architecture was designed for high performance, accuracy, and flexibility, characteristics that make it well suited to technical computing.

All floating-point data types are mapped internally to an 82-bit format, with a 64-bit significand and a 17-bit exponent; calculations are more accurate and underflow or overflow less often than on other processors.

Highest current SPEC CFP2000 score for a single-processor system: 1431, for an Itanium 2 system at 1 GHz (the Hewlett-Packard HP Server RX2600).