5 Practical Floating-Point System

Computers represent real numbers using a finite number of bits. This limitation introduces rounding errors and the need for standardized formats. The most widely adopted convention is the IEEE 754 floating-point standard, developed by the Institute of Electrical and Electronics Engineers.

5.1 Floating-Point Number Formats

A floating-point number consists of three main parts:

  • Sign bit \(s\): records whether the number is positive or negative.

  • Exponent \(e\): stored in biased form (see Section 5.3).

  • Fraction (mantissa) \(m\): holds the significant digits.
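
The split into these three fields can be inspected directly. Below is a minimal Python sketch (the helper name float32_fields is ours, not part of the standard) that unpacks a single-precision bit pattern into sign, exponent, and fraction:

    import struct

    def float32_fields(x):
        """Split a value, stored as IEEE 754 single precision, into its three fields."""
        (bits,) = struct.unpack(">I", struct.pack(">f", x))  # the raw 32-bit pattern
        sign     = bits >> 31                  # 1 bit
        exponent = (bits >> 23) & 0xFF         # 8 bits (stored with a bias, see Section 5.3)
        fraction = bits & 0x7FFFFF             # 23 bits (mantissa without the implied 1)
        return sign, exponent, fraction

    s, e, f = float32_fields(-6.25)
    print(s, format(e, "08b"), format(f, "023b"))
    # 1 10000001 10010000000000000000000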

5.1.1 Precision Levels

The IEEE 754 standard defines several levels of precision; the two used throughout this chapter are:

  • Single precision (32 bits): 1 sign bit, 8 exponent bits, 23 fraction bits.

  • Double precision (64 bits): 1 sign bit, 11 exponent bits, 52 fraction bits.
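
The practical difference between the levels is how much of a value survives storage. A short Python sketch that forces \(0.1\) into each format via the struct module:

    import struct

    # 0.1 has no exact binary representation; the stored error depends on
    # how many significand bits the format keeps.
    as_double = struct.unpack(">d", struct.pack(">d", 0.1))[0]   # 52 fraction bits
    as_single = struct.unpack(">f", struct.pack(">f", 0.1))[0]   # 23 fraction bits

    print(f"{as_double:.20f}")   # 0.10000000000000000555
    print(f"{as_single:.20f}")   # 0.10000000149011611938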

5.2 Normalization and Rounding

In normalized form, the leading (leftmost) bit of the mantissa must be \(1.\) For example: \[1001.01_2 = 1.00101_2 \times 2^3\] Here the leading \(1\) is placed to the left of the binary point, and the exponent is adjusted accordingly.
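
The same normalization can be reproduced in code. A Python sketch using math.frexp, which returns a mantissa in \([0.5, 1)\), so one shift is needed to reach the \(1.m\) form used here:

    import math

    x = 0b100101 / 4                       # 1001.01 in binary = 9.25
    m, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= m < 1
    significand, exponent = 2 * m, e - 1   # rescale to the 1.xxx * 2**e form
    print(significand, exponent)           # 1.15625 3, i.e. 1.00101_2 x 2^3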

Since binary fractions cannot always represent decimal fractions exactly (e.g. \(0.3_{10}\)), we must approximate. The IEEE standard uses rounding: by default a value is rounded to the nearest representable number (with ties broken toward the even significand), and directed modes (toward zero, toward \(+\infty\), toward \(-\infty\)) are also defined.

5.3 Exponent Bias

The exponent must represent both positive and negative values. To achieve this, a bias is added: \[\text{Bias} = 2^{k-1} - 1\] where \(k\) is the number of exponent bits. Examples:

  • Single precision: \(k = 8\), so \(\text{Bias} = 2^{7} - 1 = 127\).

  • Double precision: \(k = 11\), so \(\text{Bias} = 2^{10} - 1 = 1023\).
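
A quick check of the bias formula for the two exponent widths (a trivial Python sketch):

    for k in (8, 11):                # exponent widths for single and double precision
        bias = 2 ** (k - 1) - 1
        print(k, bias)               # prints 8 127 and 11 1023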

Example 5.1

Representing 263 in IEEE 754 (single precision)

  1. Convert \(263\) to binary: \[263_{10} = 100000111_2\]

  2. Normalize: \[100000111_2 = 1.00000111_2 \times 2^8\]

  3. Exponent: \[e = 8 + 127 = 135 = 10000111_2\]

  4. Fraction (mantissa): \(00000111\ldots\)

Final IEEE 754 representation: \[0\ 10000111\ 00000111000000000000000\]
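
The hand computation can be verified by packing \(263.0\) into single precision and printing the stored fields (a Python sketch; the output should match the pattern just given):

    import struct

    (bits,) = struct.unpack(">I", struct.pack(">f", 263.0))
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    print(sign, format(exponent, "08b"), format(fraction, "023b"))
    # 0 10000111 00000111000000000000000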

Example 5.2

Representing \(0.3\) in binary. Multiply the fractional part by 2 repeatedly and record each integer part: \[0.3 \times 2 = 0.6 \quad (0)\] \[0.6 \times 2 = 1.2 \quad (1)\] \[0.2 \times 2 = 0.4 \quad (0)\] \[0.4 \times 2 = 0.8 \quad (0)\] \[0.8 \times 2 = 1.6 \quad (1)\] \[0.6 \times 2 = 1.2 \quad (1)\] and so on. Thus: \[0.3_{10} \approx 0.01001100110011\ldots_2\]
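
The repeated doubling can be automated. The sketch below uses Python's Fraction so that \(0.3\) itself is held exactly and only the conversion is approximated; the helper name binary_fraction_digits is ours:

    from fractions import Fraction

    def binary_fraction_digits(x, n):
        """First n binary digits of a fraction 0 < x < 1, found by repeated doubling."""
        digits = []
        for _ in range(n):
            x *= 2
            bit = int(x)        # the integer part is the next binary digit
            digits.append(bit)
            x -= bit            # keep only the fractional part
        return digits

    print(binary_fraction_digits(Fraction(3, 10), 14))
    # [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]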

Example 5.3

Double precision to decimal
Given: \[x = 0\ 10000000011\ 10111001000100\ldots\]

  • Sign bit \(=0 \Rightarrow\) positive.

  • Exponent \(=10000000011_2 = 1027.\) Subtract bias (1023): \[e = 1027 - 1023 = 4\]

  • Significand (with the implied leading 1) \(=1.10111001000100\ldots_2 \approx 1.7229.\)

  • Value: \[x \approx 1.7229 \times 2^4 = 27.5664\]
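
The same decoding in code (a Python sketch; the fraction field in the example is truncated, so the remaining bits are assumed to be zero):

    sign_bit      = 0
    exponent_bits = "10000000011"
    fraction_bits = "10111001000100"    # truncated in the example; rest assumed 0

    e = int(exponent_bits, 2) - 1023                      # remove the double-precision bias
    significand = 1 + sum(int(b) * 2 ** -(i + 1)          # 1.m with the implied leading 1
                          for i, b in enumerate(fraction_bits))
    x = (-1) ** sign_bit * significand * 2 ** e
    print(e, significand, x)    # 4 1.722900390625 27.56640625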

5.3.1 General Form

In general, a normalized floating-point number is given by: \[x = (-1)^s \times (1 + f) \times 2^e\] where \(f\) is the fractional part and \(e\) is the unbiased exponent.
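
Conversely, any normalized nonzero float can be decomposed into this \((s, f, e)\) triple. A Python sketch using math.frexp (the helper name decompose is ours):

    import math

    def decompose(x):
        """Return (s, f, e) with x = (-1)**s * (1 + f) * 2**e, for normalized nonzero x."""
        s = 0 if math.copysign(1.0, x) > 0 else 1
        m, exp = math.frexp(abs(x))          # abs(x) = m * 2**exp with 0.5 <= m < 1
        return s, 2 * m - 1, exp - 1         # shift into the (1 + f) * 2**e form

    s, f, e = decompose(-27.5664)
    print(s, f, e)                           # s = 1, f close to 0.7229, e = 4
    print((-1) ** s * (1 + f) * 2 ** e)      # -27.5664 (reconstructed exactly)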

5.4 General Floating-Point System (summary)

A floating-point system is characterized by four parameters \((\beta, t, L, U)\):

  • \(\beta\): the base (radix); \(\beta = 2\) for IEEE 754.

  • \(t\): the precision, i.e. the number of significand digits.

  • \(L\): the smallest allowed exponent.

  • \(U\): the largest allowed exponent.
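
For Python's built-in float (IEEE 754 double precision), these parameters can be read from sys.float_info. Note that min_exp and max_exp follow the C convention \(m \times \beta^e\) with \(1/\beta \le m < 1\), so they are shifted by one relative to the \(1.m \times 2^e\) convention used above:

    import sys

    fi = sys.float_info
    print("beta =", fi.radix)      # 2
    print("t    =", fi.mant_dig)   # 53 significand digits (52 stored + the implied 1)
    print("L    =", fi.min_exp)    # -1021
    print("U    =", fi.max_exp)    # 1024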

5.4.1 Normalized IEEE 754 Numbers

A normalized floating-point number has the form \[(-1)^s \cdot (1.m) \cdot 2^{e - \text{bias}},\] where

  • \(s\) is the sign bit,

  • \(m\) is the stored fraction (mantissa) field,

  • \(e\) is the stored (biased) exponent, with bias \(127\) for single precision and \(1023\) for double precision.

The implied leading 1 provides one extra bit of significand precision at no storage cost.
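
One visible consequence: with 23 stored bits plus the implied bit, single precision carries 24 significand bits, so every integer up to \(2^{24}\) is exact but \(2^{24}+1\) is not. A Python sketch (the round-trip helper to_f32 is ours):

    import struct

    def to_f32(x):
        """Round-trip a Python float through IEEE 754 single precision."""
        return struct.unpack(">f", struct.pack(">f", x))[0]

    print(to_f32(2 ** 24))        # 16777216.0
    print(to_f32(2 ** 24 + 1))    # 16777216.0  (the 25th significand bit is lost)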

5.4.2 Normalized Numbers and Special Values

5.4.2.1 Largest Normalized Positive Number

\[(1.111\ldots1)_2 \times 2^{127} \approx 3.403 \times 10^{38}.\] If a result exceeds this, an overflow occurs.

5.4.2.2 Smallest Normalized Positive Number

\[1.0_2 \times 2^{-126} = 2^{-126} \approx 1.1755 \times 10^{-38}.\]

5.4.2.3 Representation of Zero

\[\begin{aligned} +0 &: 0~00000000~000\ldots0 \\ -0 &: 1~00000000~000\ldots0 \end{aligned}\] Although distinct, \(+0\) and \(-0\) are treated as equal.
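
In Python this can be observed directly (a short sketch): the comparison treats the two zeros as equal, while the sign bit is still visible through math.copysign.

    import math

    pos, neg = 0.0, -0.0
    print(pos == neg)                 # True: compared as equal
    print(math.copysign(1.0, pos))    # 1.0
    print(math.copysign(1.0, neg))    # -1.0: the sign bit survives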

5.4.2.4 Representation of Infinity

\[\begin{aligned} +\infty &: 0~11111111~000\ldots0 \\ -\infty &: 1~11111111~000\ldots0 \end{aligned}\]

5.4.2.5 Not a Number (NaN)

IEEE 754 defines two types of NaNs:

  • Quiet NaNs (qNaN), which propagate silently through subsequent operations.

  • Signaling NaNs (sNaN), which raise an invalid-operation exception when used.

Both are encoded with the exponent field all ones (255 in single precision) and a nonzero fraction.
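
Infinities and quiet NaNs also arise naturally in ordinary double-precision arithmetic; a short Python sketch:

    import math

    print(1e308 * 10)                     # inf: overflow produces an infinity
    print(float("inf") - float("inf"))    # nan: an invalid operation yields a quiet NaN
    nan = float("nan")
    print(nan == nan)                     # False: a NaN compares unequal even to itself
    print(math.isnan(nan))                # True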

5.4.3 Denormalized (Subnormal) Numbers

Denormalized numbers have the form \[(-1)^s \cdot (0.m) \cdot 2^{-126}.\]

5.4.3.1 Largest Denormalized Number

\[(0.111\ldots1)_2 \times 2^{-126} \approx 0.99999988 \times 2^{-126}.\]

5.4.3.2 Smallest Denormalized Number

\[2^{-23} \times 2^{-126} = 2^{-149}.\] A nonzero result smaller in magnitude than this underflows to zero.

5.4.4 Ranges of Single Precision

Putting the previous bounds together, positive single-precision values fall into two ranges: subnormal values from \(2^{-149} \approx 1.4 \times 10^{-45}\) up to just below \(2^{-126}\), and normalized values from \(2^{-126} \approx 1.1755 \times 10^{-38}\) up to \((2 - 2^{-23}) \times 2^{127} \approx 3.403 \times 10^{38}\). Results beyond the upper bound overflow; nonzero results below the lower bound underflow.
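
These boundary values can be reproduced with a few lines of Python (a sketch; the round-trip helper to_f32 is ours):

    import struct

    def to_f32(x):
        """Round-trip a Python float through IEEE 754 single precision."""
        return struct.unpack(">f", struct.pack(">f", x))[0]

    largest_normal     = (2 - 2 ** -23) * 2.0 ** 127    # about 3.4028e+38
    smallest_normal    = 2.0 ** -126                    # about 1.1755e-38
    largest_subnormal  = (1 - 2 ** -23) * 2.0 ** -126   # just below the smallest normal
    smallest_subnormal = 2.0 ** -149                    # about 1.4013e-45

    print(largest_normal, smallest_normal, largest_subnormal, smallest_subnormal)
    print(to_f32(smallest_subnormal))   # 1.401298464324817e-45: still representable
    print(to_f32(2.0 ** -150))          # 0.0: underflow, rounds to zero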
