2018-06-24

Java and its silly war against unsigned integers

Short story: Java hasn't (primitive) unsigned integers because Gosling thinks that too many programmers don't get what goes on with unsigned, what unsigned arithmetic is. Good solution: teach and make them aware, fix the problems of the education system which hasn't taught them, and so on. Bad solution: treat everybody as a stupid unable to learn, remove the feature and disappoint programmers who understand what goes on with unsigned, what unsigned arithmetic is, and those who think they can still learn. End of the story.

This is a signed 32 bit integer:

    int s = 0x801EFFFF;

If you want to get the higher 8 bit, you should do a logical right shift, that is, >>>.

The logical left shift doesn't exist — because it is the same as the arithmetic left shift; so we haven't problems, unless sign extension enters the scene. The first clue comes from this:

   byte x = 0xFF;

The compiler says error: possible loss of precision, and one may wonder why: 255 fits into a byte graciously. But the byte is signed, too: the higher positive value it can hold is 127. For this reason, 0xFF must be an integer, which can't be put into a byte. We can fix this by casting:

   byte x = (byte)0xFF;

But indeed this way we are hiding something. If we write

   byte x = -1;

Java doesn't complain, even if we know that -1 is 11111111 represented as a binary, that is, FF in hex. This makes sense1, but it's annoying when you want to manipulate bytes as “sequences of 8 bits”.

It goes worse:

        int s;
        // ...
        byte ba = 0x7F;
        byte bb = 0x7F;
        byte bc = 0x01;
        byte bd = (byte)0xFF;
        s = (ba << 24) | (bb << 16) | (bc << 8) | bd;

Now s is 0xFFFFFFFF instead of 0x7F7F01FF. This is because each byte “becomes” an integer, and being everything signed, it gets sign extended. Thus we have:

7F000000 | 007F0000 | 000001000 | FFFFFFFF

In fact -1 in a byte (FF) is still -1 in an integer (FFFFFFFF)… Also, consider that the shift happens after the sign extension.

To avoid these effects we need to mask the bytes we want. We can avoid this with the higher one, the one which gets shifted by 24 — but I do it anyway, so that everything can be aligned and looks like it looks for reson you can imagine; this is also why I write the wasteful << 0, which hopefully won't generate a shift instruction.

        s = ((ba << 24) & 0xFF000000) |
            ((bb << 16) & 0x00FF0000) |
            ((bc <<  8) & 0x0000FF00) |
            ((bd <<  0) & 0x000000FF);

On the other hand, if we keep our bytes into ints…

        int ba = 0x7F;
        int bb = 0x7F;
        int bc = 0x01;
        int bd = 0xFF;
        s = (ba << 24) |
            (bb << 16) |
            (bc <<  8) |
            (bd <<  0);

The literal constants are all integers, e.g., 0x000000FF. No sign extension. Easy. Here we are assigning constants; when we read or extract them as bytes someway, likely we need & 0xFF to cut everything an unwanted sign extension could have done.

If we need to deal with 32 bit (unsigned) integers, likely it will be better to keep them in long (64 bit signed integers). But this doesn't mean we need to be less careful2.

Now, suppose you have to read a 32 bit unsigned integer from a file, stored in big endian order3, to manipulate it later someway.

import java.io.*;
//...
    DataInputStream in = new DataInputStream(...);
    long x = in.readInt();
    // ...

If this int is, say, FF010203, the variable x will hold FFFFFFFFFF010203, unless we do:

            long x = in.readInt() & 0xFFFFFFFFL;

The L is important, otherwise 0xFFFFFFFF would be an -1 (int) which will be sign extended to a -1 (long), that is, 0xFFFFFFFFFFFFFFFF.

DataInputStream has readUnsignedShort(), which reads a 16 bit unsigned integer and put it into a 32 signed int (avoiding sign extension, of course), and readUnsignedByte(), which reads a 8 bit unsigned integer and put it into a 32 bit signed it. readUnsignedInt() doesn't exist, though. The class should be extended (as Java can do, even if it would be better to do it like Smalltalk can), something like:

class DataInputStreamX extends DataInputStream
{
    // constructors... e.g., with BufferedInputStream, because
    // it is the one I'm using in my tests :)
    public DataInputStreamX(BufferedInputStream x) {
        super(x);
    }

    long readUnsignedInt() throws IOException {
        return readInt() & 0xFFFFFFFFL;
    }
}

Now you need to 1) complete the implementation of DataInputStreamX according to your need, 2) replace DataInputStream with DataInputStreamX everywhere you need it, 3) fix problems where they occur… In Smalltalk4 you would need only the point 1 and 2.

Anyway…

Dealing with data in Java is more painful than it needs to be. I am glad I am not alone blaming the choice of not giving Java the unsigned types.

And of course it's a well known discussion, and in fact you can find articles like this:

This is going to be a huge help when dealing with data over the wire and re-implementing algorithms from C.

The suggestion of naming variables with a final U or _U isn't a big help, indeed, and still there isn't anything like unsigned int or unsigned long,

Well, even in Java 8, long and int are still signed, only some methods treat them as if they were unsigned.

This is what Java has to offer, but not on its primitive types.

Once we are sure we have the right sequence of bits into our int or long, and we want to write it to a file, we don't need to worry about signedness anymore (but we need to choose the endianness we want, in case it isn't big — default for Java).

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// ...

            System.out.write(ByteBuffer.allocate(4).putInt((int)x).array());

Or

            System.out.write(ByteBuffer
                             .allocate(4)
                             .order(ByteOrder.LITTLE_ENDIAN)
                             .putInt((int)x)
                             .array());

In both cases we need (int)x if x is a long.

Using a long for an unsigned int or an int for unsigned byte can be useful in situation where otherwise sign extension would be annoying, forcing the overuse of masking to obtain the right result.

If we're ok with just reading an int, manipulating few bits with ors and/or ands, and write it somewhere, we don't need extra care because of signedness. E.g.,

            int x = in.readInt();
            x &= 0x807F7FFF;
            x |= 0x00000080;
            System.out.write(ByteBuffer
                             .allocate(4)
                             .putInt(x)
                             .array());

Good, old, low level days

In assembly, registers hold data, and signed/unsigned registers don't exist. In fact, signedness is just a “special” interpretation of a certain configuration of bits — according to two's complement. This interpretation isn't carved into the name of a register5: it is an action, and it must be done by instructions.

In order to treat a number into a register (or in memory) as signed, you must pick the right instructions which do so — i.e., instructions that implement their operations to consider the two's complement representation. When you use high level languages, compilers and interpreters do it for you according to the specification of the language (and the target CPU).

When you have, say, 32 bit registers, and you store into one of it a byte, and then you start using the whole register, no sign extension happens. You must explicitly do it.

In assembly Motorola 680x0 (the CPU which ran the Amiga and the Apple's Mac before they switched to PowerPC first, and then to Intel x86),

      clr.l    d0
      move.b   #$80, d0
      move.l   d0,   d1

A little explanation. clr.l is a clear (clr) instruction done on the whole 32 bit register d0 (.l means long, and on that architecture the long is 32 bit long…)6. By clear we mean to put all the bits in the register to 0. MC680x0 had 8 data registers, d0-d7. Imagine this in C:

   uint32_t d[8]; // just for now...
   // ...
   d[0] = 0;

The second instruction “moves” (copies) the value 0x80 in d0. The move instruction doesn't care about signedness. It doesn't know what it is, it doesn't need it. It just copies the value. The .b means this: operate only on the (lowest) byte of the register. That is, no matter what the other bytes hold, modify only the less significant byte.

The last instruction uses the whole data register and copies its value into another data register (d1). Again, it doesn't care about signedness or whatever. Indeed, it can't: it would be mean to remember that there was a move.b, and to pretend to know the intention of the programmer was to have a signed number which now she wants to copy in another register.

The CPU isn't in charge of such a reasoning.

If the programmer knowns that the lowest byte of d0 is a signed value, and so she wants to extend the sign because now she needs to use the whole data register…

      clr.l    d0
      move.b   #$80, d0
      ext.w    d0
      ext.l    d0
      move.l   d0,   d1

We added two instructions: ext.w extends the sign of the lowest byte to 16 bit (.w, word), and ext.l extends the sign of the lowest word to 32 bit (.l, long). This is 68000. In 680207 (and later) processors, there's a single instruction which extends the sign from b to l: extb.l. So

      clr.l    d0
      move.b   #$80, d0
      extb.l   d0
      move.l   d0,   d1

But, as said in a note, nobody used clr.l to clear (set to 0) a data register, then just to move into it a byte and to extend it! All this can be done with a single instruction, moveq (since 68000 already):

     moveq    #$80, d0
     move.l   d0,   d1

Instead, if we don't want the sign extension,

     moveq    #0,   d0  ; = clr.l d0
     move.b   #$80, d0
     move.l   d0,   d1

There are other ways — but this isn't interesting for my point.

What I am trying to say is that at low level, integers are more naturally unsigned than signed. “Special” instructions are known to handle signedness, doing whatever they need to to take two's complement into account. The programmer must explicitly program the processor to do sign extension, and she can always opt for instructions which don't “implement” two's complement rules. So that nobody gets in the way of messing with your bits if you don't want to, and you can always consider as unsigned a datum which you treated as signed before, if it is now what you need.

Higher level languages hide/abstract all this, and this is usually good, except when it isn't — sometimes by design.


  1. It would be nice to inform Java that we aknowledge the signedness of the type, but we are just saying how we want the bits. Like if with 0xFF we are not meaning the number 255 (written in hexadecimal), which, as number, can't be put into a signed byte: we are meaning how the bits of the byte are… By the way, -0x7f is exactly the same as -127. The -127 can be interpreted as a unary operator followed by a number literal, but altogether it is just a number literal, and as such is a constant altogether, not an operator acting on a constant (127). It's the same for -0x7f.

  2. I tell this from my first experience converting into Java C code full of bit manipulations — C has unsigned integers and manipulating bits is easy and “natural”: if you need to consider them just sequences of bit grouped by 8, you can.

  3. To prepare a test file I do, e.g., echo "ff010203" | xxd -r -p - data.bin.

  4. And Objective-C too, if I remember well.

  5. Usually… I don't know if they exist, but they could in fact exist processors which have “signed registers”, meaning that instructions involving these registers are always signed — exactly how they exist floating point registers used only by floating point instructions.

  6. Actually, nobody used clr.l to clear a data register. There was an instruction, move quick (moveq) to “move” (copy) a 8-bit signed immediate value into a data register, 32-bit wide. This is a “signed” instruction: if you don't want sign extension, you don't use it.

  7. In x86 the situation is similar, even if there are different instructions. There are two instructions (movsx and movsxd) which can “move” (copy) extending the sign from a memory location or a register to a 16/32 register (and 64 on 64bit processors). There isn't an ext-like instruction, though.

No comments:

Post a Comment