Floating Point Error Handling in Java

Chamal Weerasinghe
May 5, 2021
Photo by Mika Baumeister on Unsplash

“The purpose of computing is insight, not numbers.” — Richard Hamming

The way a computer deals with numbers is quite different from the way we do. It does not perform arithmetic exactly as we write it on paper, and operators do not always behave as we expect. For example, if we want to add two numbers like 10 + 15, the computer first converts them into their binary values and adds those, just like this

00001010
00001111
---------
00011001
---------
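The same addition can be checked from Java. This is a small sketch; the class name BinaryAddition and the helper toByte are made up for illustration and only pad the output to 8 bits so it lines up with the figures above.

```java
class BinaryAddition {
    // Format a small non-negative int as an 8-bit binary string (hypothetical helper).
    static String toByte(int value) {
        String bits = Integer.toBinaryString(value);
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        int a = 10;
        int b = 15;
        System.out.println(toByte(a));     // 00001010
        System.out.println(toByte(b));     // 00001111
        System.out.println("--------");
        System.out.println(toByte(a + b)); // 00011001, which is 25
    }
}
```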

But when it comes to handling floating point values the process becomes a little more complicated. Floating point numbers in computers are handled according to the IEEE-754 standard, which uses single precision (32 bits) to represent “float” values and double precision (64 bits) to represent “double” values.
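As a quick check of those sizes, Java exposes the bit widths as constants. The sketch below (class and method names are illustrative) also shows that the same literal stored as a float and as a double does not end up as the same value, because the float has fewer bits to work with.

```java
class PrecisionSizes {
    // True only if the float, widened to double, is exactly the same value.
    static boolean sameValue(float single, double dbl) {
        return (double) single == dbl;
    }

    public static void main(String[] args) {
        System.out.println(Float.SIZE);   // 32
        System.out.println(Double.SIZE);  // 64
        // 99.8 cannot be stored exactly in either type, and the two
        // approximations differ, so this prints false.
        System.out.println(sameValue(99.8f, 99.8));
    }
}
```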

Image from GeeksForGeeks

As we can now see, the number of bits available to represent a number is limited by the data type we are using, and the value has to fit within the bits allocated to it. Now let’s see an example of how to represent a number in double precision.

99.8
99 -> 1100011
0.8 -> 0.1100110011001100110011001100110011001100110011001100
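The two conversions above can be reproduced with a short sketch. The integer part uses Integer.toBinaryString; the fraction part uses the usual "multiply by 2 and take the integer digit" method. To keep the fraction exact it is passed as numerator and denominator (8/10 for 0.8); the class and method names are made up for this example.

```java
class FractionToBinary {
    // Return the first bitCount binary digits of numerator/denominator (0 <= fraction < 1).
    static String fractionBits(int numerator, int denominator, int bitCount) {
        StringBuilder bits = new StringBuilder();
        int remainder = numerator;
        for (int i = 0; i < bitCount; i++) {
            remainder *= 2;                  // shift the fraction left by one binary place
            if (remainder >= denominator) {  // the digit that crossed the point is 1
                bits.append('1');
                remainder -= denominator;
            } else {
                bits.append('0');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(99));  // 1100011
        System.out.println(fractionBits(8, 10, 12));     // 110011001100
    }
}
```

The 1100 pattern repeats forever, which is why 0.8 can never be written exactly with a finite number of bits.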

Now we have to convert this into scientific (normalized) format in order to represent it in double precision,

1.1000111100110011001100110011001100110011001100110011001100 x 2^6

Now we calculate the stored exponent by adding the bias, which is 1023 for double precision, to the actual exponent

1023 + 6 = 1029
1029 -> 10000000101
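This step can be checked with Math.getExponent, which returns the unbiased exponent of a double. The following is a sketch; the BIAS constant and the class name are introduced here for illustration.

```java
class ExponentBias {
    static final int BIAS = 1023; // exponent bias for IEEE-754 double precision

    // Biased exponent as it is stored in the 11-bit exponent field.
    static int biasedExponent(double value) {
        return Math.getExponent(value) + BIAS;
    }

    public static void main(String[] args) {
        System.out.println(Math.getExponent(99.8));                       // 6
        System.out.println(biasedExponent(99.8));                         // 1029
        System.out.println(Integer.toBinaryString(biasedExponent(99.8))); // 10000000101
    }
}
```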

Putting these together with all the fraction bits we worked out by hand, the representation of 99.8 would look like this

0 10000000101 1000111100110011001100110011001100110011001100110011001100

The separate elements can be described as

Sign bit- 0
Exponent- 10000000101
Mantissa- 1000111100110011001100110011001100110011001100110011001100

But the mantissa field of a double has room for only 52 bits, so if we let a computer do this long calculation for us, it will result in this.

Sign bit- 0
Exponent- 10000000101
Mantissa- 1000111100110011001100110011001100110011001100110011

As we can see, the mantissa is shorter than the one we worked out by hand. This is where the rounding error comes from. The binary expansion of 0.8 never ends, but only 52 bits of it can be stored, so the computer rounds to the nearest value that fits: if the bits being dropped are worth more than half of the last kept place, 1 is added to the 52nd bit; if they are worth less, the kept bits stay as they are. For 99.8 the dropped bits start with 0, so nothing is added, and the stored value ends up very slightly smaller than 99.8. As humans we ignore such a tiny difference, but the computer carries it into every later calculation.

Now we know where the error comes from. Java prints just enough digits to tell the stored double apart from its neighbours, so 99.8 on its own still prints as 99.8. But once values like this go through arithmetic, the small errors add up, and when the result is converted back to a decimal, human-readable format we may not get the value we expected. Instead we can get something like “99.80000000000001".
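To look at the bits a double actually stores, Double.doubleToLongBits returns the raw 64-bit pattern. The sketch below splits it into the three fields; the class and method names are made up for this example.

```java
class DoubleBits {
    // Return "sign exponent mantissa" as binary strings of 1, 11 and 52 bits.
    static String fields(double value) {
        long raw = Double.doubleToLongBits(value);
        // Long.toBinaryString drops leading zeros, so pad back to 64 bits.
        String all = String.format("%64s", Long.toBinaryString(raw)).replace(' ', '0');
        return all.substring(0, 1) + " " + all.substring(1, 12) + " " + all.substring(12);
    }

    public static void main(String[] args) {
        // Prints: 0 10000000101 1000111100110011001100110011001100110011001100110011
        System.out.println(fields(99.8));
    }
}
```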

When it comes to FinTech, banking and scientific applications this is a very critical issue, because the error can keep growing as a value is updated over time. Let’s look at an example.
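The listing below is a minimal sketch of this kind of loop, not the author's original code. It assumes a starting value of 1.0 and a step of 0.1 (both chosen here for illustration), so the exact digits it prints will differ from the output listed in this article. An iteration cap is added so the sketch terminates; without it the loop would run forever.

```java
class DoubleCountdown {
    // Subtract step from start repeatedly and report whether the value ever
    // becomes exactly 0.0 within maxIterations steps.
    static boolean reachesZero(double start, double step, int maxIterations) {
        double value = start;
        for (int i = 0; i < maxIterations; i++) {
            value -= step;
            System.out.println(value);
            if (value == 0.0) {
                return true;
            }
        }
        return false; // stepped past zero without ever landing on it
    }

    public static void main(String[] args) {
        // Mathematically 1.0 - 10 * 0.1 is 0, but with doubles the check never succeeds.
        System.out.println(reachesZero(1.0, 0.1, 20)); // false
    }
}
```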

In theory this code should stop when the value reaches 0, but it does not. It causes an infinite loop because of the double precision error, and part of the output looks like this,

0.10000000000140435
1.4043488594239761E-12
-0.09999999999859566
-0.19999999999859566
-0.29999999999859567
-0.3999999999985957
-0.4999999999985957
-0.5999999999985957
-0.6999999999985956
-0.7999999999985956
-0.8999999999985956
-0.9999999999985956

The value never equals exactly zero, so the loop keeps running. So how do we solve this problem?

BigDecimal in Java

To handle this situation Java comes with the java.math.BigDecimal class, for calculations which require exact precision. BigDecimal is an immutable object and can represent signed values. Although it holds numeric values, the standard operators like “+”, “-” and comparison operators like “>”, “<”, etc. are not supported with BigDecimal.

For arithmetic operations it provides methods like add(), subtract(), divide(), and multiply().

For comparisons it provides the compareTo() method, which returns -1 if the value is less than the given value, 0 if they are equal and 1 if the value is greater than the given value.

For controlling the number of decimal places it has the scale(), setScale() and round() methods.

Here is an example of how to work with BigDecimal
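A minimal sketch of the methods mentioned above, not the author's original listing. The values 99.8 and 0.1 and the class name are chosen only for illustration. The values are created from strings because new BigDecimal(0.1) would copy the double's rounding error into the BigDecimal.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class BigDecimalBasics {
    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("99.8");
        BigDecimal b = new BigDecimal("0.1");

        System.out.println(a.add(b));       // 99.9
        System.out.println(a.subtract(b));  // 99.7
        System.out.println(a.multiply(b));  // 9.98
        // divide needs a scale and rounding mode when the result may not terminate
        System.out.println(a.divide(b, 2, RoundingMode.HALF_UP)); // 998.00

        // compareTo replaces <, == and >
        System.out.println(a.compareTo(b)); // 1, because 99.8 > 0.1
        System.out.println(b.compareTo(a)); // -1

        // scale() reports the number of decimal places, setScale() changes it
        BigDecimal product = a.multiply(b);
        System.out.println(product.scale());                          // 2
        System.out.println(product.setScale(1, RoundingMode.HALF_UP)); // 10.0
    }
}
```

Every operation returns a new BigDecimal, because the objects are immutable; the originals a and b are never changed.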

And here’s how we solve the infinite loop problem we ran into before.
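The following is a sketch of one way to do it, again assuming a starting value of 1.0 and a step of 0.1 rather than the author's original numbers. The loop condition uses compareTo() against BigDecimal.ZERO, and the iteration cap is only a safety net.

```java
import java.math.BigDecimal;

class BigDecimalCountdown {
    // Subtract step from start until the value is exactly zero; return the number of steps taken.
    static int countdown(BigDecimal start, BigDecimal step, int maxIterations) {
        BigDecimal value = start;
        int iterations = 0;
        while (value.compareTo(BigDecimal.ZERO) != 0 && iterations < maxIterations) {
            value = value.subtract(step); // BigDecimal is immutable, so reassign the result
            iterations++;
            System.out.println(value);
        }
        return iterations;
    }

    public static void main(String[] args) {
        int steps = countdown(new BigDecimal("1.0"), new BigDecimal("0.1"), 100);
        System.out.println("Stopped after " + steps + " steps"); // Stopped after 10 steps
    }
}
```

compareTo() is used instead of equals() because equals() also compares the scale, so 0.0 and 0 would not be considered equal.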

For further info on converting decimal into binary and for a better explanation please find the references below
