Floating-point numbers: a love story

So we may as well kick off these blog entries in the spirit of blogging everywhere. That’s right, I am going to moan about something that means very little to the majority of human beings on the planet. I am going to complain about floating-point arithmetic.

For something like the 500th time since I started in this business they call science, I got to the bottom of a seemingly-bizarre coding problem today, only to find that it was being caused by the accuracy or floating point numbers. This may seem like an odd topic to pick on, but this is something that manages to crop up with depressing regularity and manifest itself in a thousand subtle and peculiar ways.

Before I get into the rant proper, though, let me explain what I’m talking about. Floating point numbers are what computers typically use to handle numbers that are not integers. Non-integers sound like the sort of thing that would be child’s play for a computer (or the unfortunate scientist bending it to his or her will) to handle: integers are whole numbers, and sometimes we need to handle halves or quarters, hundreths or pi. Surely this the computer just holds them all the same way?

Well, actually no. Ultimately, computers manipulate sequences of binary numbers. The split them into chunks, change them and move them around, then they send them to devices like screens and speakers that interpret these numbers and perform a function. Integers are just sequences of 1s and 0s that represent numbers. The computer knows how to translate binary into decimal and that’s that.

Any integer can be written in binary in a straightforward way. Think about the numer 149. Our decimal repressentation is really just a quick way of saying “one 100 plus four tens plus nine units” i.e. we’ve written a sum in powers of ten. We can do the same thing in powers of two: 149 in binary is 10010101. So many ones, twos, fours, eights and so on. The thing to notice here is that both representations are equivalent and completely accurate, it’s just that one is convenient for computers and the other for humans.

It all gets a lot more messy when you start thinking about non-integers. Think about a half: that’s 1/2, or 0.5 in familiar notation. It’s also 5×10^-1 in scientific notation. All fine so far, any schoolkid can tell you that 1/2 = 0.5. In fact, we can see the same idea at work here as with the integers: the decimal point tells you where the ones are and where the non-integer part of the number begins. Now we have an extension of the sum idea that we had before, zero ones plus 5 tenths and so on. Our notation for decimals works in the same way as our notation for integers, a sum over powers of ten.

So what if we ask the computer? You might think from what I’ve written so far that computers would use the same trick, only with powers of 2. So you could represent halves as negative powers of two, so that you have so many halves, plus so many quarters, so many eights and so on. It seems natural to think that computers would use binary decimals, something like 1100101.1100 for 104.75? In fact, things work a little differently, and this is the source of much chargrin to anyone who does any numerical work.

The snag with the sum system that I’ve talked about above is that it restricts you a little in terms of the range of numbers you can encapsulate in the computer. The numbers in a computer can only be of a certain, fixed length (this is part of what it means for a computer to be 32-bit or 64-bit — that’s how many digits there are in the number). So if we put a point somewhere, we’re immediately restricting the range and the accuracy of the numbers we can represent. Imagine, for a moment, I’m a programmer on an old-fashioned 16-bit machine. My numbers can only be 16-bits long, so my integers (assuming they aren’t signed) can go from 0 to 65,535. If I want to consider fractions and decimals then I need to put a point in there. Half way along and I have eight bits of range and other eight of accuracy, that’s 0 to 511 and down to 1/512. The largest number I can represent is 511 and the smallest is 0.001953125 which is a little restrictive.

The trick is to use something similar to scientific notation, i.e. 1.2 x 10^2 for 120 or 5 x 10^-1 for a half and so on. Now I split the number up into an exponent (the 2 or the -1) and mantissa (the 1.2 or the 5). So now I could have eight bits of mantissa and eight bits of exponent. Now my range is drastically expanded: If I have a signed exponent I can go up to 127 and down to -128: hundreds of quadrillions of quadrillions and so on down to the same number of zeros after the decimal place. This is how it worked in the 1970s and 80s, and this is how it still works now, albeit with more than 16 bits.

How wonderful! At a stroke I’ve expanded the range of numbers the computer can store vastly, and eliminated the restrictions of fixed point arithmetic! my 8 or 16 bit computer can now do far more, and with the IEEE standard in place for doing this, everyone agreed on a single method from one system to another. Hurrah for progress!

Well, not exactly. It’s true that floating point numbers enable you to increase you dynamic range, but there is a cost. I am only as accurate as the number of bits in my mantissa (1.2 or the 5). I can’t compare a hundred trillion to 1/1000, for example, the smaller number will get lost on the way. What’s more important, though, is that my numerical accuracy is dependent on how big my number is. In double precision I have 14 digits or so, and things can start getting dicey after the 11th or so. There are numbers I can’t represent, and these gaps occur unpredictably.

This is a bigger issue than you might think. Let’s return to our first example, the half. A half is 1/2 or 0.5, as our schoolkid will tell us. So what happens if we ask the computer? This is easy enough to code up, and in Java, the code might look like this:

double x= 1.0/2.0;

double y= 0.5;

if(x==y){

  System.out.println("a half equals 0.5");

else{

  System.out.println("a half does not equal 0.5");

What do you reckon this code would print? Given my panchent for sardonic false conclusions, you might guess that the computer would get it wrong. And you’d be correct, in this code x does not equal y (you can set up similar examples using 1.0-x and so on.

The reason for this is floating-point inaccuracy. The answer that comes out of 1.0/2.0 will actually be 0.4999999999997 or 0.500000000000008 or something similar that is very close to a half, but not actually right. What’s more, this tiny difference is unpredicable. It will always be small, but depending on the calculation and the numbers involved it will always be different, and it’s this that Has come to plague and vex me and my numerical brethren.

My first encounter with this was in calculating integrals. We had some code that evaluated an integral using a trapezium rule, and we made a small modification that involved changing the order of the calculation (that’s all, just the order). The result? The answer was suddenly off by twenty (yes, twenty) orders of magnitude. The integral was really a very long sum, and each term in the sum had a tiny error that was minimised by one order and maximised by the other. As a result, the value became meaningless.

So, a lesson learned about coding. Yes, but this is an issue that refuses to go away. Years later, I was investigating a strange departure from theory by some diffusion MR signals generated from particles in a cylinder. 99% of the curves coming out of the sim were spot on, but this one single curve was way off. This time, it turned out to be the placement of the origin of the coordinate system. When particles were too far from the origin compared the distance they travelled, the displacement was washed out by floating-point error.

Another time, I had an issue with particles reflecting from curved surfaces. Under certain circumstances they would track the curved surface, run out of accuracy and teleport through the surface. That was a toughie to fix. And today, the reason for all this verbiage is that we had a problem with triangles being ever-so-slightly non-congruent and causing simulations to go into infinite loops. All of these (and hundreds more, all different) caused by the inaccuracies in floating point numbers and all invloved some cunning thinking to get around it or (in extreme cases) hack the issue away.

So where am I going with this? Well, as you migh imagine, I do have an agenda here. The problem with floating point numbers is not that they are inaccurate per se, it’s that the errors are unpredictable, so we can’t just allow for them in the code. I can’t write code that says “ignore any disparity less than X” because I don’t know what X should be. This is something that could definitely be remedied, we could ditch floating point numbers and return to a form of fixed-point.

A number system where I could specify that there is a point at bit n and then compare numbers with overlapping ranges in a specified way would eliminate vast swathes of floating-point related problems. This is not the 1980s, the vast majority of computers in the world are either 32 or 64 bit. Processors can also split calulcations over two chunks of number, so I can use up to 128 bits on a 64bit machine, giving me 64bits of range and 64 bits of accuracy (if the point is in the middle), that’s 0 to 18, 446, 744, 073, 709, 551, 616 and an accuracy down to 10^-20 or so. Coupled with that we have known a known accuracy that’s constant right across the range, and that’s a compromise between range and accuracy. There are other alternatives, too. We could specify non-integers by placing them in a range given by two fractions — we can specify any number to any accuracy in this way.

Of course, we don’t do this and probably never will. In order to get this idea workable there would not only have to be appropriate standards for hardware and implementations the standard math library, LAPACK, NAG etc there would also have to be an ecoonomic case for doing it. Much as I hate to end on a pessimistic note, I can’t see it happening any time soon, but I’m fed up with being screwed over by floating point numbers and want to nail my colours to the mast: floating point is an unnecessary anachronism, the world of fixed point is waiting for us…

Tags: Coding, frustration, numbers, simulations

This entry was posted on Monday, February 15th, 2010 at 5:41 pm and is filed under Thoughts. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

MIG Blog

Floating-point numbers: a love story

Leave a Reply