5 Practical Floating Point System

Computers represent real numbers using a finite number of bits. This limitation introduces rounding errors and the need for standardized formats. The most widely adopted convention is the IEEE 754 floating-point standard, developed by the Institute of Electrical and Electronics Engineers.

5.1 Floating-Point Number Formats

A floating-point number consists of three main parts:

Sign bit (\(s\)): indicates the sign of the number.
Exponent (\(e\)): encodes the order of magnitude.
Mantissa, or fraction (\(m\)): encodes the significant digits.

5.1.1 Precision levels

The IEEE 754 standard defines several levels of precision:

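Single precision (32 bits): 1 sign, 8 exponent, 23 fraction bits.
Double precision (64 bits): 1 sign, 11 exponent, 52 fraction bits.
Extended precision (80 bits, often exposed as long double on x86): 1 sign, 15 exponent, 64 significand bits. Unlike the other two formats, the leading bit of the significand is stored explicitly, leaving 63 fraction bits.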
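The field widths above can be checked by splitting a stored value into its three parts. The following is a minimal sketch using Python's standard `struct` module; the helper names `float32_fields` and `float64_fields` are illustrative, not part of any library.

```python
import struct

def float32_fields(x):
    """Return (sign, exponent field, fraction field) of x stored in 32 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, as stored
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction

def float64_fields(x):
    """Return (sign, exponent field, fraction field) of x stored in 64 bits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # 11 bits, as stored
    fraction = bits & ((1 << 52) - 1)   # 52 bits
    return sign, exponent, fraction

print(float32_fields(1.0))   # (0, 127, 0)
print(float64_fields(-2.0))  # (1, 1024, 0)
```

The exponent values printed here are the raw stored fields, not the true powers of two.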

5.2 Normalization and Rounding

In normalized form, the leading (leftmost) bit of the mantissa must be \(1.\) For example: \[1001.01_2 = 1.00101_2 \times 2^3\] Here the leading \(1\) is placed to the left of the binary point, and the exponent is adjusted accordingly.
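As a quick check of the normalization step, the sketch below uses `math.frexp`, which returns a significand in \([0.5, 1)\), and rescales it to the \(1.f\) form used here. The function name `normalize` is an illustrative choice, and the input is assumed to be finite and positive.

```python
import math

def normalize(x):
    """Return (significand, exponent) with x == significand * 2**exponent
    and 1 <= significand < 2. Assumes x is finite and positive."""
    m, e = math.frexp(x)   # x == m * 2**e with 0.5 <= m < 1
    return 2 * m, e - 1    # shift one binary place to get 1 <= significand < 2

# 1001.01 in binary is 9.25 in decimal; 1.00101 in binary is 1.15625.
print(normalize(9.25))  # (1.15625, 3)
```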

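Since binary fractions cannot always represent decimal fractions exactly (e.g. \(0.3_{10}\)), we must approximate. Two common strategies are:

Chopping: discarding the extra bits (rounding toward zero).
Rounding to nearest: choosing the closest representable value. An exact tie is resolved toward the value whose last stored bit is \(0\) (ties to even).

IEEE 754 uses rounding to nearest with ties to even as its default mode and provides chopping as one of its directed rounding modes.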
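The difference between the two strategies can be seen on a small scale. This sketch keeps only \(t\) binary fraction bits of a nonnegative value, using exact rational arithmetic; it assumes ties are resolved to the even neighbor (the IEEE 754 default), and the function names are illustrative.

```python
from fractions import Fraction
import math

def chop(x, t):
    """Keep t binary fraction bits of nonnegative x, discarding the rest."""
    scale = 2 ** t
    return Fraction(math.floor(Fraction(x) * scale), scale)

def round_nearest_even(x, t):
    """Round nonnegative x to t binary fraction bits, ties to even."""
    scale = 2 ** t
    # round() on a Fraction returns the nearest integer, ties to even.
    return Fraction(round(Fraction(x) * scale), scale)

x = Fraction(3, 10)
print(chop(x, 4))                # 1/4   (0.0100 in binary)
print(round_nearest_even(x, 4))  # 5/16  (0.0101 in binary)
```

With four fraction bits, chopping \(0.3\) gives \(0.25\) (error \(0.05\)) while rounding to nearest gives \(0.3125\) (error \(0.0125\)).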

5.3 Exponent Bias

The exponent must represent both positive and negative values. To achieve this, a bias is added: \[\text{Bias} = 2^{k-1} - 1\] where \(k\) is the number of exponent bits. Examples:

Single precision (\(k=8\)): Bias \(=127.\)
Double precision (\(k=11\)): Bias \(=1023.\)
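The bias formula and the conversion between true and stored exponents can be written out directly. This is a small sketch with illustrative function names.

```python
def exponent_bias(k):
    """Bias for an exponent field of k bits: 2**(k-1) - 1."""
    return 2 ** (k - 1) - 1

def encode_exponent(true_exponent, k):
    """Stored (biased) exponent for a given true exponent."""
    return true_exponent + exponent_bias(k)

def decode_exponent(stored, k):
    """True exponent recovered from the stored (biased) field."""
    return stored - exponent_bias(k)

print(exponent_bias(8), exponent_bias(11))  # 127 1023
print(encode_exponent(8, 8))                # 135
print(decode_exponent(1027, 11))            # 4
```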

Example 5.1

Representing 263 in IEEE 754 (single precision)

Convert \(263\) to binary: \[263_{10} = 100000111_2\]
Normalize: \[100000111_2 = 1.00000111_2 \times 2^8\]
Biased exponent: \[e = 8 + 127 = 135 = 10000111_2\]
Fraction (mantissa): \(00000111\ldots\)

Final IEEE 754 representation: \[0\ 10000111\ 00000111000000000000000\]
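The hand computation can be cross-checked by asking the machine for the stored bits. This is a minimal sketch using the standard `struct` module; `float32_bitstring` is an illustrative helper name.

```python
import struct

def float32_bitstring(x):
    """Return the single-precision bits of x as 'sign exponent fraction'."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = format(bits, "032b")
    return f"{s[0]} {s[1:9]} {s[9:]}"

print(float32_bitstring(263.0))  # 0 10000111 00000111000000000000000
```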

Example 5.2

Representing 0.3 in binary
Multiply by 2 repeatedly. The digit in parentheses is the integer part of each product; it becomes the next binary digit and is removed before the next step:
\[0.3 \times 2 = 0.6 \quad (0)\]
\[0.6 \times 2 = 1.2 \quad (1)\]
\[0.2 \times 2 = 0.4 \quad (0)\]
\[0.4 \times 2 = 0.8 \quad (0)\]
\[0.8 \times 2 = 1.6 \quad (1)\]
\[0.6 \times 2 = 1.2 \quad (1)\]
and so on. Thus: \[0.3_{10} \approx 0.01001100110011\ldots_2\]
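The repeated-multiplication procedure translates into a short loop. The sketch below uses exact rational arithmetic so that the digits are not affected by rounding; `fraction_bits` is an illustrative name.

```python
from fractions import Fraction

def fraction_bits(x, n):
    """Return the first n binary digits after the point of x, 0 <= x < 1."""
    x = Fraction(x)
    bits = []
    for _ in range(n):
        x *= 2
        bit = int(x >= 1)   # integer part of the product: 0 or 1
        bits.append(str(bit))
        x -= bit            # keep only the fractional part
    return "".join(bits)

print(fraction_bits(Fraction(3, 10), 14))  # 01001100110011
```

The block `0011` repeats indefinitely, which is why \(0.3\) has no finite binary representation.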

Example 5.3

Double precision to decimal
Given: \[x = 0\ 10000000011\ 10111001000100\ldots\]

Sign bit \(=0 \Rightarrow\) positive.
Exponent field \(=10000000011_2 = 1027.\) Subtract the bias (1023) to get the true exponent: \[e = 1027 - 1023 = 4\]
Significand (implied leading 1 followed by the fraction bits) \(=1.10111001000100\ldots_2 \approx 1.7229.\)
Value: \[x \approx 1.7229 \times 2^4 = 27.5664\]
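The decoding can be reproduced by assembling the 64-bit pattern and reinterpreting it as a double. The fraction in the example is elided after 14 bits, so this sketch assumes, purely for illustration, that all remaining fraction bits are zero. `decode_double` is a hypothetical helper name.

```python
import struct

def decode_double(sign, exponent_bits, fraction_bits):
    """Build a double from bit strings for the sign, exponent and fraction.
    Missing trailing fraction bits are ASSUMED to be zero."""
    fraction_bits = fraction_bits.ljust(52, "0")
    bits = int(sign + exponent_bits + fraction_bits, 2)
    (value,) = struct.unpack(">d", struct.pack(">Q", bits))
    return value

print(decode_double("0", "10000000011", "10111001000100"))  # 27.56640625
```

Under that assumption the exact value is \(27.56640625\), which agrees with the rounded hand result \(27.5664\).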

5.3.1 General Form

In general, a normalized floating-point number is given by: \[x = (-1)^s \times (1 + f) \times 2^e\] where \(f\) is the fractional part and \(e\) is the unbiased exponent.

5.4 General Floating-Point System (summary)

A floating-point system is characterized by four parameters \((\beta, t, L, U)\):

\(\beta\): base of the system (e.g., 2 for binary).
\(t\): length of the fractional part (precision).
\(L\): lower bound of the exponent. No normalized number can have an exponent smaller than \(L.\) Zero and denormalized numbers are stored using the reserved exponent code just below it, \(L-1.\)
\(U\): upper bound of the exponent. No normalized number can have an exponent larger than \(U.\) Special numbers (\(\pm\infty,\) NaN) are stored at exponent \(U+1.\)
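A very small system makes the role of the four parameters concrete. The sketch below lists every positive normalized number of a toy system, assuming the significand has one nonzero leading digit followed by \(t\) fractional digits in base \(\beta\). The function name and the toy parameters \((2, 2, -1, 1)\) are illustrative choices, not taken from the text.

```python
from fractions import Fraction

def positive_normalized(beta, t, L, U):
    """List all positive normalized numbers d0.d1...dt * beta**e
    with d0 != 0 and L <= e <= U."""
    values = []
    for e in range(L, U + 1):
        for d0 in range(1, beta):            # nonzero leading digit
            for f in range(beta ** t):       # all t-digit fractional parts
                significand = d0 + Fraction(f, beta ** t)
                values.append(significand * Fraction(beta) ** e)
    return sorted(values)

toy = positive_normalized(2, 2, -1, 1)
print(len(toy))                 # 12
print([float(v) for v in toy])
# [0.5, 0.625, 0.75, 0.875, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5]
```

The spacing between neighbors doubles each time the exponent increases, which is the characteristic non-uniform distribution of floating-point numbers.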

5.4.1 Normalized IEEE 754 Numbers

A normalized floating-point number has the form \[(-1)^s \cdot (1.m) \cdot 2^{e - \text{bias}},\] where

\(s\) is the sign bit.
\(m\) is the fractional part (mantissa).
\(e\) is the biased exponent.

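Because the leading 1 is implied rather than stored, the significand carries one more bit of precision than the fraction field holds: 24 bits in single precision and 53 bits in double precision.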
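The formula can be evaluated directly from the three integer fields. This is a minimal sketch; `normalized_value` and its default arguments (single-precision widths) are illustrative, and the fields are assumed to describe a normalized number.

```python
def normalized_value(s, e, m, fraction_bits=23, bias=127):
    """Evaluate (-1)**s * (1.m) * 2**(e - bias) from integer fields.
    Assumes a normalized number (e is neither all zeros nor all ones)."""
    f = m / 2 ** fraction_bits          # fraction as a value in [0, 1)
    return (-1) ** s * (1 + f) * 2.0 ** (e - bias)

# Fields s = 0, e = 10000111 (135), m = 00000111 followed by 15 zeros.
print(normalized_value(0, 0b10000111, 0b00000111 << 15))  # 263.0
```

Passing `fraction_bits=52` and `bias=1023` evaluates the same formula for double precision.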

5.4.2 Single-Precision Format and Special Values

Sign: 1 bit
Exponent: 8 bits (bias = 127)
Fraction: 23 bits

5.4.2.0.1 Largest Normalized Positive Number

\[(1.111\ldots1)_2 \times 2^{127} \approx 3.403 \times 10^{38}.\] If a result exceeds this, an overflow occurs.

5.4.2.0.2 Smallest Normalized Positive Number

\[1.0_2 \times 2^{-126} = 2^{-126} \approx 1.1755 \times 10^{-38}.\]
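Both limits can be confirmed from their bit patterns. The sketch below reinterprets a 32-bit integer as a single-precision value; `float32_from_bits` is an illustrative helper name.

```python
import struct

def float32_from_bits(bits):
    """Reinterpret a 32-bit unsigned integer as a single-precision value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

largest = float32_from_bits(0x7F7FFFFF)    # 0 11111110 111...1
smallest = float32_from_bits(0x00800000)   # 0 00000001 000...0

print(largest == (2 - 2**-23) * 2.0**127)  # True
print(smallest == 2.0**-126)               # True
print(f"{largest:.4e}", f"{smallest:.4e}") # 3.4028e+38 1.1755e-38
```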

5.4.2.0.3 Representation of Zero

\[\begin{aligned} +0 &: 0~00000000~000\ldots0 \\ -0 &: 1~00000000~000\ldots0 \end{aligned}\] Although distinct, \(+0\) and \(-0\) are treated as equal.
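A short experiment shows both properties: the two zeros compare equal, yet their stored bits and their signs differ. This sketch uses only the standard library.

```python
import math
import struct

pos_zero, neg_zero = 0.0, -0.0

print(pos_zero == neg_zero)                # True
print(struct.pack(">f", pos_zero).hex())   # 00000000
print(struct.pack(">f", neg_zero).hex())   # 80000000
print(math.copysign(1.0, neg_zero))        # -1.0, the sign bit is preserved
```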

5.4.2.0.4 Representation of Infinity

\[\begin{aligned} +\infty &: 0~11111111~000\ldots0 \\ -\infty &: 1~11111111~000\ldots0 \end{aligned}\]
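The two encodings can be inspected in hexadecimal, where exponent 11111111 with a zero fraction appears as `7f800000` and `ff800000`. The helper name `float32_hex` is illustrative.

```python
import math
import struct

def float32_hex(x):
    """Return the single-precision encoding of x as 8 hex digits."""
    return struct.pack(">f", x).hex()

print(float32_hex(math.inf))    # 7f800000
print(float32_hex(-math.inf))   # ff800000
```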

5.4.2.0.5 Not a Number (NaN)

IEEE 754 defines two types of NaNs:

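Quiet NaN (QNaN): propagates through computations without raising an exception. It is the result of invalid operations such as \(0/0.\)
Signaling NaN (SNaN): raises an invalid-operation exception when used as an operand.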

Both are encoded with exponent = 255 and nonzero fraction.
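The encoding rule gives a simple test for NaN bit patterns. In the sketch below the NaN is produced as \(\infty - \infty\), since Python raises `ZeroDivisionError` for `0.0 / 0.0` instead of returning a NaN. The helper names are illustrative.

```python
import math
import struct

def float32_bits(x):
    """Return the single-precision encoding of x as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def is_nan_pattern(bits):
    """True if the exponent field is 255 and the fraction is nonzero."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return exponent == 255 and fraction != 0

nan = math.inf - math.inf                        # an invalid operation
print(math.isnan(nan))                           # True
print(nan == nan)                                # False: NaN is unequal to itself
print(is_nan_pattern(float32_bits(nan)))         # True
print(is_nan_pattern(float32_bits(math.inf)))    # False: fraction is zero
```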

5.4.3 Denormalized (Subnormal) Numbers

Denormalized numbers have the form \[(-1)^s \cdot (0.m) \cdot 2^{-126}.\]

No implied leading 1.
Exponent bits are all zero.
Allow gradual underflow.

5.4.3.0.1 Largest Denormalized Number

\[(0.111\ldots1)_2 \times 2^{-126} \approx 0.99999988 \times 2^{-126}.\]

5.4.3.0.2 Smallest Denormalized Number

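\[2^{-23} \times 2^{-126} = 2^{-149} \approx 1.4 \times 10^{-45}.\] A nonzero result with magnitude below this cannot be represented: it is rounded either to \(0\) or to \(2^{-149}\) (underflow).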
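The denormalized limits can be checked the same way as the normalized ones, by reinterpreting bit patterns with an all-zero exponent field. This is a minimal sketch with illustrative helper names; the integer `m` stands for the 23-bit fraction field.

```python
import struct

def float32_from_bits(bits):
    """Reinterpret a 32-bit unsigned integer as a single-precision value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

def denormalized_value(s, m):
    """Evaluate (-1)**s * (0.m) * 2**-126 for a 23-bit fraction field m."""
    return (-1) ** s * (m / 2**23) * 2.0**-126

smallest = float32_from_bits(0x00000001)   # 0 00000000 000...01
largest = float32_from_bits(0x007FFFFF)    # 0 00000000 111...11

print(smallest == 2.0**-149)                        # True
print(smallest == denormalized_value(0, 1))         # True
print(largest == (1 - 2**-23) * 2.0**-126)          # True
```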

5.4.4 Ranges of Single Precision

Normalized: \([2^{-126}, (2-2^{-23})\cdot 2^{127}].\)
Denormalized: \([2^{-149}, (1-2^{-23})\cdot 2^{-126}].\)
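Taken together, the exponent and fraction fields determine which class a single-precision bit pattern belongs to. The sketch below summarizes the cases; `classify_float32` and its return labels are illustrative.

```python
def classify_float32(bits):
    """Classify a 32-bit pattern by its exponent and fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "nan"
    return "normalized"

for pattern in (0x00000000, 0x00000001, 0x00800000, 0x7F7FFFFF, 0x7F800000, 0x7FC00000):
    print(f"{pattern:08x}", classify_float32(pattern))
```

The six patterns are, in order: zero, the smallest denormalized number, the smallest and largest normalized numbers, infinity, and a NaN.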