Quadruple-precision Floating-point Format - Double-double Arithmetic

A common software technique for implementing nearly quadruple precision using pairs of double-precision values is sometimes called double-double arithmetic. Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic can represent operations with at least a 2 × 53 = 106-bit significand (and possibly 107 bits via clever use of the sign bit), only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision. The range of a double-double remains essentially the same as that of the double-precision format because the exponent still has 11 bits, significantly fewer than the 15-bit exponent of IEEE quadruple precision (a range of about 1.8 × 10^308 for double-double versus about 1.2 × 10^4932 for binary128).
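The bit counts and ranges compared above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name approx_overflow_threshold is hypothetical, and it assumes the usual IEEE 754 convention that a format with k exponent bits has a maximum exponent of 2^(k−1) − 1.

```python
import math
import sys

def approx_overflow_threshold(exponent_bits):
    """Approximate 2**(emax + 1) as (mantissa, decimal_exponent).

    Works in log10 space because the binary128 threshold overflows a float.
    """
    emax = 2 ** (exponent_bits - 1) - 1          # 1023 for 11 bits, 16383 for 15 bits
    log10_value = (emax + 1) * math.log10(2)
    decimal_exponent = math.floor(log10_value)
    mantissa = 10 ** (log10_value - decimal_exponent)
    return mantissa, decimal_exponent

double_bits = sys.float_info.mant_dig            # 53 on IEEE 754 platforms
dd_bits = 2 * double_bits                        # two doubles give at least 106 bits
dd_range = approx_overflow_threshold(11)         # double-double keeps the 11-bit exponent
quad_range = approx_overflow_threshold(15)       # binary128 has a 15-bit exponent

print(f"double-double significand: at least {dd_bits} bits (binary128: 113)")
print(f"double-double range: about {dd_range[0]:.1f}e{dd_range[1]}")
print(f"binary128 range:     about {quad_range[0]:.1f}e{quad_range[1]}")
```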

In particular, a double-double (nearly quadruple-precision) value q is represented implicitly as a sum q = x + y of two double-precision values x and y, each of which supplies half of q's significand. That is, the pair (x, y) is stored in place of q, and operations on q values (+, −, ×, ...) are transformed into equivalent (but more complicated) operations on the x and y values. Thus, arithmetic in this technique reduces to a sequence of double-precision operations; since double-precision arithmetic is commonly implemented in hardware, double-double arithmetic is typically substantially faster than more general arbitrary-precision arithmetic techniques.
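As a rough illustration of how an operation on q turns into operations on x and y, the sketch below adds two double-double values in Python, whose float type is an IEEE double on common platforms. The function names are illustrative rather than taken from any particular library. It uses the well-known error-free two-sum transformation and a simplified ("sloppy") addition, assumes round-to-nearest arithmetic with no overflow, and omits the refinements a production library would add.

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and s + e equal to a + b exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def quick_two_sum(a, b):
    """Same as two_sum, but only valid when |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e

def dd_add(p, q):
    """Add double-double values p = (x, y) and q = (x, y); simplified variant."""
    s, e = two_sum(p[0], q[0])      # add the high parts, keeping the rounding error
    e += p[1] + q[1]                # fold in the low parts
    return quick_two_sum(s, e)      # renormalize so the low part stays small

one = (1.0, 0.0)
tiny = (2.0 ** -80, 0.0)
total = dd_add(one, tiny)

print(1.0 + 2.0 ** -80 == 1.0)                 # a plain double loses the small term
print(total)                                   # the pair keeps it in the low part
print(Fraction(total[0]) + Fraction(total[1])  # exact value represented by the pair
      == 1 + Fraction(1, 2 ** 80))
```

Every step is an ordinary double-precision addition or subtraction, which is why the technique maps well onto floating-point hardware.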

Note that double-double arithmetic has the following special characteristics:

  • As the magnitude of the value decreases, the amount of extra precision also decreases. Therefore, the normalized range, in which full precision is available, is narrower than that of double precision. The smallest number with full precision is 1000...0₂ (106 zeros) × 2^−1074, or 1.000...0₂ (106 zeros) × 2^−968. Numbers whose magnitude is smaller than 2^−1021 have no additional precision compared with double precision.
  • The actual number of bits of precision can vary. In general, the magnitude of the low-order part of the number is no greater than half an ULP of the high-order part. If the low-order part is less than half an ULP of the high-order part, significant bits (either all 0's or all 1's) are implied between the significands of the high-order and low-order numbers. Certain algorithms that rely on having a fixed number of bits in the significand can fail when using 128-bit long double numbers implemented as double-doubles.
  • For the same reason, it is possible to represent values like 1 + 2^−1074, which is the smallest representable number greater than 1.
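The last two points can be made concrete with a small Python check. It is a sketch only: the variable names are illustrative, math.ulp requires Python 3.9 or later, and exact rational arithmetic from the fractions module stands in for the value that the pair represents.

```python
import math
from fractions import Fraction

high = 1.0
low = math.ldexp(1.0, -1074)        # 2**-1074, the smallest positive subnormal double

# A single double cannot hold the sum: it rounds back to 1.0.
print(high + low == 1.0)

# The low part is far below half an ULP of the high part, so the pair is a
# valid double-double; all bits between the two significands are implied zeros.
print(abs(low) <= math.ulp(high) / 2)

# Fractions give the exact rational value that the pair stands for.
exact = Fraction(high) + Fraction(low)
print(exact > 1, exact - 1 == Fraction(1, 2 ** 1074))
```

Here the pair spans more than a thousand binary places, far beyond 106 bits, which is the sense in which the number of bits of precision is not fixed.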

In addition to double-double arithmetic, it is also possible to construct triple-double or quad-double arithmetic if higher precision is required and no higher-precision floating-point library is available. Values are then represented as a sum of three or four double-precision values, respectively. These formats can represent operations with at least 159 (or 161) and 212 (or 215) bits, respectively.
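To see where figures such as 106, 159 and 212 bits come from, the sketch below greedily splits an exact rational number into n doubles. The helper split_into_doubles is hypothetical and is not how a multi-double library performs arithmetic; it only illustrates the representation, and it assumes that no component underflows.

```python
from fractions import Fraction

def split_into_doubles(value, n):
    """Greedy split: return (parts, remainder), where parts holds n doubles
    and the exact sum of parts plus remainder equals value."""
    parts = []
    remainder = Fraction(value)
    for _ in range(n):
        part = float(remainder)      # round the remainder to the nearest double
        parts.append(part)
        remainder -= Fraction(part)  # exact leftover for the next component
    return parts, remainder

third = Fraction(1, 3)
for n, bits in ((2, 106), (3, 159), (4, 212)):
    parts, err = split_into_doubles(third, n)
    # Each rounding leaves a relative error of at most 2**-53, so n components
    # leave at most 2**(-53 * n) of the original value unrepresented.
    print(n, parts, abs(err) <= third / 2 ** bits)
```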

A similar technique can be used to produce double-quad arithmetic, in which a value is represented as a sum of two quadruple-precision values. It can represent operations with at least 226 (or 227) bits.
