I discussed in an earlier article the challenge of accurately representing real numbers in digital systems. Real numbers include the integers (like -236 or 14), the rational numbers (like 1/3 or -23.45) and the irrational numbers (like √3 or π). There are infinitely many of these, and many of them can't be precisely expressed in any finite numeric representation.
In decimal, we therefore have to represent real numbers only to a certain number of significant figures. For example, we represent π to 6 significant figures as 3.14159, or ε0 (the permittivity of free space) as 8.85419 × 10^-12. You will note that while these two numbers have the same number of significant digits, they are very different in magnitude because ε0 is expressed in exponential or scientific notation.
We can represent any real number this way: with a signed significand, a base and a signed exponent. For example, we could write π as 3.14159 × 10^0, or -10.0 as -1.0 × 10^1. That's all a floating-point representation is.
Digital systems are more at home with binary numbers than decimal, so we need an agreed way to efficiently pack a signed significand and a signed exponent into a series of bytes. There are many ways to do this, but the most common is the IEEE 754 binary representation, which we will look at in some detail.
Since we will always use a base of two for this representation, we don't need to store the base – we just need the significand and the exponent. We can also reduce the number of bits we need to store (or store more precision in a given space) by normalising the significand.
To understand what this means, it helps to go back to our decimal example: 8.85419 × 10^-12 could equally be written 0.885419 × 10^-11 or 88.5419 × 10^-13. In scientific notation we adjust each number by convention to have one digit to the left of the decimal point – which we call the normalised form. This means the first digit of the significand will never be zero (or we would shift it left).
If we normalise a binary significand, the digit to the left of the "binary point" will always be a one, since it can't be zero and we only have those two choices. As this digit is always a one, we can omit it from the representation and save one bit. Great!
Well, great until we try to represent zero! If the implied bit is always one, we can never truly represent zero. The IEEE 754 standard gets around this by using a de-normalised form to represent zero and some other special numbers. Let’s look at this a little more closely using a 32-bit single-precision example.
Figure 1 shows how the 32 bits are organised. The most significant bit is a sign bit (S) indicating the sign of the significand, followed by an 8-bit biased exponent (E) and the 23-bit fractional part (F) of the 24-bit significand – remember the bit to the left of the binary point is implied. The number this represents is given by the following rules:
• If the exponent is between 1 and 254 inclusive, the significand is taken to be normalised, and the number represented is: (-1)^S × 1.F × 2^(E-127). This is how the vast majority of floating-point numbers are stored.
• If the exponent is zero, the significand is taken to be de-normalised, and the number represented is: (-1)^S × 0.F × 2^-126. This form is used for ± zero and numbers extremely close to it.
• If the exponent is 255, the number represents special values such as ± infinity or various types of NaN (Not a Number).
Let’s see how this works using a few practical examples. Figures 2a and 2b show the single-precision representations of ε0 and π. Note that the precise value stored in the float is shown, but it is only accurate to just over 7 significant decimal figures due to the limitations of a 24-bit binary significand.
Figures 2c and 2d show how zero is stored in de-normalised form. It is possible to have both positive and negative zeros in this representation. In this case the significand is de-normalised and the implied bit is zero.
Figures 2e and 2f show examples of positive infinity and a NaN. There is also a negative infinity and several types of NaN.
We saw that single-precision floating-point numbers have about 7.2 significant decimal figures, calculated as log10(2^24). The IEEE standard allows for other floating-point representations with different precisions, some of which are shown in the table below. Single and double precision are by far the most common for embedded systems.
You can imagine that performing mathematical operations on floating-point numbers is far from trivial. Fortunately, your C compiler probably already comes with floating-point libraries for this purpose. If you are really lucky, your MCU may have a floating-point accelerator or coprocessor that will speed up these operations dramatically. Just remember that a floating-point representation is an approximation for most real numbers and the precision is limited, so choose the right representation and be aware of the limitations.
Andrew Levido (firstname.lastname@example.org) earned a bachelor’s degree in Electrical Engineering in Sydney, Australia, in 1986. He worked for several years in R&D for power electronics and telecommunication companies before moving into management roles. Andrew has maintained a hands-on interest in electronics, particularly embedded systems, power electronics, and control theory in his free time. Over the years he has written a number of articles for various electronics publications and occasionally provides consulting services as time allows.