Intel Itanium Floating-Point Architecture

Marius Cornea, John Harrison, and Ping Tak Peter Tang
Intel Corporation


WCAE 2003

06/15/03


Agenda
- Intel® Itanium® Architecture
- Intel® Itanium® Processor Floating-Point Architecture
- Status Fields and Exceptions
- The Floating-Point Multiply-Add
- Exact Arithmetic
- Accurate Remainders
- Accurate Range Reduction
- Comparison and Classification
- Division and Square Root
- Additional Features
- Conclusion


Intel® Itanium® Architecture
- One of the major processor architectures present in the market today
  - 2001 – Intel Itanium processor
  - 2003 – Intel Itanium 2 processor – highest SPEC CFP2000 score currently
- Price/performance ratio and power consumption better with every new implementation
- Large register sets: 128 floating-point registers
- Predication
- Speculation


Intel® Itanium® Architecture (continued)
- Support for explicit parallelism: static and rotating registers
- Floating-point features aimed at speed and accuracy
- EPIC (Explicitly Parallel Instruction Computing) design philosophy
- Targets the most demanding enterprise and high-performance computing applications


Itanium Processor Floating-Point Architecture
- Floating-point multiply-add (fused multiply-add) allows higher accuracy and performance
- Software and hardware interaction
- Division: the throughput can be as high as one result for every 3.5 clock cycles
- Floating-point formats: 24-, 53-, and 64-bit significands; 8-, 11-, 15-, and 17-bit exponents; 1-bit sign
- Register and memory encodings: 0, normalized values, denormalized/unnormalized values, infinity, NaN, NaTVal ('not a value', for speculative operations); redundant representations


Itanium Processor Floating-Point Architecture (continued)
- Examples:
  - Using double-extended intermediate precision calculations to compute a double-precision function: the double-precision input arguments can be freely combined with double-extended intermediate results
  - Computing functions involving constants with few significant digits: whatever the precision of the computation, the short constants can be stored in single precision
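As a minimal C sketch of the first example (assuming a compiler where long double maps to the double-extended format, which is true of Itanium compilers but not of every platform):

    #include <math.h>

    /* Evaluate a*x + b in double precision while carrying the
       intermediate product in double-extended precision (C long double).
       The double inputs combine freely with the wider intermediate. */
    double axpb(double a, double x, double b)
    {
        long double t = (long double)a * x;  /* rounded once, to 64 bits */
        return (double)(t + b);              /* final rounding to double */
    }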


Status Fields and Exceptions
- 64-bit Floating-Point Status Register (FPSR)
  - Six trap disable bits control the five IEEE Standard exceptions and the denormal exception
  - Four 13-bit status fields: s0, s1, s2, and s3
  - Six flags per status field, recording the occurrence of each of the six exceptions
  - Seven control bits per status field: rounding (2 bits), precision (2 bits), traps disable, flush-to-zero (ftz), and widest-range exponent (wre bit, for 17-bit exponents)
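As a rough illustration, one 13-bit status field can be described with C bit masks. This is a sketch only: the bit positions below are my reading of the field layout and should be verified against the Intel Itanium Architecture Software Developer's Manual.

    #include <stdint.h>

    /* One 13-bit FPSR status field (assumed layout, low bit first). */
    #define SF_FTZ  (1u << 0)    /* flush-to-zero mode               */
    #define SF_WRE  (1u << 1)    /* widest-range (17-bit) exponent   */
    #define SF_PC   (3u << 2)    /* precision control, 2 bits        */
    #define SF_RC   (3u << 4)    /* rounding control, 2 bits         */
    #define SF_TD   (1u << 6)    /* traps disable                    */
    #define SF_V    (1u << 7)    /* invalid-operation flag           */
    #define SF_D    (1u << 8)    /* denormal flag                    */
    #define SF_Z    (1u << 9)    /* zero-divide flag                 */
    #define SF_O    (1u << 10)   /* overflow flag                    */
    #define SF_U    (1u << 11)   /* underflow flag                   */
    #define SF_I    (1u << 12)   /* inexact flag                     */

    /* s1 as used by numerical kernels: wre on, traps disabled. */
    static const uint32_t S1_SETTING = SF_WRE | SF_TD;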


Status Fields and Exceptions (continued)
- Status field usage is determined by software conventions:
  - s0 is the main user status field
  - s1, with wre enabled and all exceptions disabled, is used in many standard numerical software kernels, such as those for division, square root, and transcendental functions
  - s2 and s3 are commonly used for speculation


The Floating-Point Multiply-Add
- Basic assembly syntax:
    (qp) fma.pc.sf f1 = f3, f4, f2
  which calculates f1 = f3 ⋅ f4 + f2 with one rounding error
- Addition and multiplication are implemented as special cases of the fma: x + y = x ⋅ 1 + y and x ⋅ y = x ⋅ y + 0
- Two variants of the fma exist: the fms (floating-point multiply-subtract) and the fnma (floating-point negative multiply-add):
    (qp) fms.pc.sf f1 = f3, f4, f2
    (qp) fnma.pc.sf f1 = f3, f4, f2
  which compute f1 = f3 ⋅ f4 – f2 and f1 = –f3 ⋅ f4 + f2, respectively
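The same three operations can be written in C with the C99 fma() function from <math.h>, which also rounds only once; a minimal sketch:

    #include <math.h>

    /* C99 fma(a, b, c) computes a*b + c with a single rounding, matching
       the Itanium fma; fms and fnma follow by negating an input. */
    double fma_c (double a, double b, double c) { return fma( a, b,  c); }
    double fms_c (double a, double b, double c) { return fma( a, b, -c); }
    double fnma_c(double a, double b, double c) { return fma(-a, b,  c); }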


The Floating-Point Multiply-Add (continued)
- Example: the dot product x ⋅ y of two n-dimensional vectors, p = Σ xi ⋅ yi, can be evaluated by a succession of fma operations of the form p = p + xi ⋅ yi, requiring only n floating-point operations; with a separate multiplication and addition it would require 2n operations, with a longer overall latency (see the C sketch below)
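A minimal C version of the fma-based dot product:

    #include <math.h>
    #include <stddef.h>

    /* Dot product as a chain of fused multiply-adds: n fma operations
       instead of n multiplications plus n additions, with one rounding
       per step. */
    double dot(const double *x, const double *y, size_t n)
    {
        double p = 0.0;
        for (size_t i = 0; i < n; i++)
            p = fma(x[i], y[i], p);   /* p = p + x[i]*y[i] */
        return p;
    }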


Exact Arithmetic
- Addition: if |x| ≥ |y|, the exact sum x + y can be obtained as a two-piece expansion Hi + Lo:
    Hi = x + y
    tmp = x – Hi
    Lo = tmp + y
  (Hi + Lo = x + y exactly, with Lo a rounding error in Hi ≈ x + y)
- Multiplication: the exact product x ⋅ y can be obtained as a two-piece expansion Hi + Lo:
    Hi = x ⋅ y
    Lo = x ⋅ y – Hi
  (Hi + Lo = x ⋅ y exactly, with Lo a rounding error in Hi ≈ x ⋅ y)
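Both expansions translate directly into C. The error term of the product is exact only because fma rounds just once; the sum is the classical Møller/Dekker two-piece technique, sketched here under the slide's assumption |x| ≥ |y|:

    #include <math.h>

    /* Fast two-sum: requires |x| >= |y|.  Afterwards hi + lo == x + y
       exactly, with lo the rounding error of hi. */
    void two_sum(double x, double y, double *hi, double *lo)
    {
        double h   = x + y;
        double tmp = x - h;     /* exact when |x| >= |y| */
        *hi = h;
        *lo = tmp + y;
    }

    /* Exact product via fma: hi + lo == x * y exactly. */
    void two_prod(double x, double y, double *hi, double *lo)
    {
        double h = x * y;
        *hi = h;
        *lo = fma(x, y, -h);    /* recovers the rounding error of h */
    }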


Accurate Remainders
- If a floating-point number q is approximately equal to the quotient a / b of two floating-point numbers, the remainder r = a – b ⋅ q can be calculated exactly with one fnma operation, provided q is within 1 ulp (unit in the last place) of a / b
- Useful in software implementations of floating-point division, square root, and remainder; also for integer division and remainder computations implemented on top of floating-point operations (a one-line C sketch follows)
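In C the remainder is again a single fused operation; a one-line sketch:

    #include <math.h>

    /* r = a - b*q in one fused operation (the fnma pattern).
       Exact provided q is within 1 ulp of a / b. */
    double exact_rem(double a, double b, double q)
    {
        return fma(-b, q, a);
    }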


Accurate Range Reduction
- Many algorithms for mathematical functions (e.g. sin) begin with an initial range-reduction phase, subtracting an integer multiple of a constant such as π / 2
- With the fma this can be done in a single instruction: x – N ⋅ P
- Typically:
    y = Q ⋅ x
    N = rint (y)
    r = x – N ⋅ P
  where rint (y) denotes the rounding of y to an integer, and Q ≈ 1 / P
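The same sequence in C uses rint() and one fma. The constants below are for P = π/2 and are illustrative only; production code represents P in several pieces to handle large arguments accurately:

    #include <math.h>

    /* Naive range reduction of x by integer multiples of P = pi/2.
       Sketch only: a single-constant P loses accuracy for large x. */
    double reduce(double x, double *n_out)
    {
        const double P = 1.57079632679489661923;   /* ~ pi/2       */
        const double Q = 0.63661977236758134308;   /* ~ 2/pi = 1/P */
        double N = rint(Q * x);                    /* nearest integer */
        *n_out = N;
        return fma(-N, P, x);                      /* r = x - N*P  */
    }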


Comparison and Classification
- Syntax:
    (qp) fcmp.frel.fctype p1, p2 = f2, f3
  where the frel completer determines the relation that is tested for
- Mnemonics for frel: eq for f2 = f3, lt for f2 < f3, le for f2 ≤ f3, gt for f2 > f3, ge for f2 ≥ f3, and unord for f2 ? f3
- There is no signed/unsigned distinction, but there is a new possibility, (f2 ? f3): two values may be unordered, since a NaN (Not a Number) compares false with any floating-point value, even with itself (see the C sketch below)
- fctype is the comparison type: normal or unconditional
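The unordered relation exists in C as well; a small sketch of how a NaN behaves under comparison:

    #include <math.h>
    #include <stdio.h>

    /* A NaN compares false under every ordered relation, even x == x;
       isunordered() is the C analogue of the unord completer. */
    int main(void)
    {
        double x = nan("");
        printf("%d %d %d\n", x == x, x < 1.0, x > 1.0);  /* 0 0 0 */
        printf("%d\n", isunordered(x, 1.0));             /* 1     */
        return 0;
    }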


Division and Square Root
- Implemented in software, based on the reciprocal approximation and reciprocal square root approximation instructions
- Given two floating-point numbers a and b, the floating-point reciprocal approximation instruction, frcpa, normally returns an approximation of 1/b good to about 8 bits:
    (qp) frcpa.sf f1, p2 = f2, f3
- Given a floating-point number a, the floating-point reciprocal square root approximation instruction, frsqrta, normally returns an approximation of 1/√a good to about 8 bits:
    (qp) frsqrta.sf f1, p2 = f3
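As a hedged sketch of how such an 8-bit seed grows into a full-precision quotient: each Newton-Raphson step costs two fma operations and roughly doubles the number of correct bits. This is the generic iteration only, not the exact published Itanium sequence, and the final correction shown does not by itself prove correct rounding in all cases:

    #include <math.h>

    /* Newton-Raphson division from a seed y0 ~ 1/b accurate to ~8 bits.
       Each step y' = y + y*(1 - b*y) roughly doubles the accurate bits:
       8 -> 16 -> 32 -> 64.  Generic sketch; the production Itanium
       sequences are formally proven to round correctly. */
    double nr_div(double a, double b, double y0)
    {
        double y = y0;
        for (int i = 0; i < 3; i++) {
            double e = fma(-b, y, 1.0);   /* e = 1 - b*y  */
            y = fma(y, e, y);             /* y = y + y*e  */
        }
        double q = a * y;                 /* initial quotient        */
        double r = fma(-b, q, a);         /* exact remainder a - b*q */
        return fma(r, y, q);              /* final correction        */
    }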


Additional Features
- Transferring values between floating-point and integer registers by means of the getf and setf instructions (see the C sketch below)
- Floating-point merging with fmerge, useful in combining fields of multiple floating-point numbers
- Floating-point to integer and integer to floating-point conversion using the fcvt instructions
- Integer multiplication and division, implemented using the floating-point unit
- Floating-point maximum and minimum, using the fmax, famax, fmin, and famin instructions
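As a rough C analogue of the first item, moving a value between floating-point and integer registers amounts to transferring raw bits; memcpy expresses this portably (the function names are mine, chosen to echo getf.d and setf.d):

    #include <stdint.h>
    #include <string.h>

    /* Analogue of getf.d: read the raw 64 bits of a double. */
    uint64_t getf_d(double f)
    {
        uint64_t bits;
        memcpy(&bits, &f, sizeof bits);
        return bits;
    }

    /* Analogue of setf.d: build a double from raw 64 bits. */
    double setf_d(uint64_t bits)
    {
        double f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }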


Conclusion
- The Itanium floating-point architecture was designed with high performance, accuracy, and flexibility, characteristics that make it ideal for technical computing
- All floating-point data types are mapped internally to an 82-bit format with a 64-bit significand and a 17-bit exponent; calculations are more accurate and do not underflow or overflow as often as on other processors
- Highest current SPEC CFP2000 score for a single-processor system: 1431, for an Itanium 2 system at 1 GHz (the Hewlett-Packard HP Server rx2600)


