2024-02-11

Run Length Encoding

Let’s have a little bit of programming fun with data compression.

Run-length encoding

The idea is to remove adjacent repetitions of a byte. If I have a string such as

aaaaaaaaaaaaaaaaaaaa

I could reduce the size of the actual data encoding something like:

the byte “a” repeated 20 times

But we can have also bytes that do not repeat:

aaaaaaaaaaaaaaaaaaaaBCDEFGHaaaaaaaaaa

Thus, we could invent an encoding such that, once interpreted, would be read as:

the byte “a” repeated 20 times, then the sequence of bytes BCDEFGH, then the bytes “a” repeated 10 times.

If we can encode these instructions in less than 37 bytes, we have in fact reduced the amount of bytes taken by our data.

We are working with binary data, because we want to be able to compress everything, not just “text” (that is, bytes which can be interpreted as glyphs according to some character set like ASCII, or encoding like UTF-8).

Let’s try with some possible encodings — I am just sketching ideas, not searching the best possible way (if it exists).

From now on, I will write octects (8-bit bytes) with two hexadecimal digits, e.g. FF, 1A, and so on — lowercase letters will be used for variables. Also, it will be useful to write bytes as bits, e.g. 10000000 — 8 binary digits — and in order to read them more easily, we could write 1000-0000 (two “nibbles”). The most significant bits (MSB) are those on the left.

Let’s imagine a possible encoding.

  • we start in mode V. In mode V, every byte of the input is copied verbatim to the output.
  • FF enters the “repeat bytes” mode, mode R
  • in mode R the incoming data are made of pairs of bytes:
    • nn dd: nn says how many times the byte dd must be repeated, with nn between 01 and FE.
    • when nn is 00, we exit mode R and enter again the mode V
  • the byte FF must be encoded as FF FF.

With this encoding, the byte FF shouldn’t appear too often, since otherwise it can easily defeat the purpose of the encoding — which is to reduce the size of the data stream.

Moreover, when bytes repeat, they should repeat at least three times; otherwise, it is not convenient to enter mode R. In fact, aaa would be encoded as

FF 03 61

We do not spare any byte, and maybe we also need to exit mode R, which takes up 2 bytes: aaab (4 bytes) would be encoded as

FF 03 61 FF 00 62

That is, 6 bytes instead of 4: our compressiong algorithm makes our data bigger! Unless we avoid mode R in these cases. If we had repetition of b, too, we could however enter mode R:

FF 03 61 03 62

These bytes encode aaabbb with 5 bytes instead of 6. Notice another case when it is not convenient to use the mode R:

FF 02 61

We have encoded the 2-bytes long string aa with 3 bytes! Again, our encoding algorithm shouldn’t enter mode R in such situations.

In our encoding we have used a whole byte as a sort of escape character to enter (and then exit) mode R. But we could think in term of bits, instead, and have the encoding stateless.

For instance, the MSB of a byte could be used to say the the other 7 bits says how many time the next byte repeats.

1nnn nnnn dddd dddd

This could mean: repeat the byte dddd dddd nnn nnnn time. But then, a sequence of different bytes would be encoded as:

81 61 81 62

Indeed we need to use this encoding only for bytes between 80 and FF; bytes from 00 to 7F can be copied verbatim. This could work if our data contains mostly bytes in the range 00-7F, and of course many repeated bytes.

We could use another kind of encoding:

  • 0nnn nnnn means: take the next N bytes verbatim
  • 1nnn nnnn dddd dddd: repeat the byte D N-times.

Since take the next 0 bytes has no meaning, as it wouldn’t repeat the next byte 0-times, we can interpret 0 as 128.

A sequence of 128 bytes (〈Z〉) containing no repetions would be encoded as:

00 <Z>

And the byte “s” (61) repeated 128 times would be encoded as:

80 61

The sequence “aa” can be encoded as

82 61

That is, with the same number of bytes of the original sequence.

Overheads

What if our decoder has to decode a stream like the following?

8E 30 02

This is a corrupted stream, because after the 02, our decoder expects two more bytes to be copied verbatim. Because of this, we can say that our stream is corrupted. Another example of corrupted stream:

8D 30 82

In fact, after 82 the decoder expect another byte (the one that should be repeated 2 times).

These corrupted stream can be identified because the corruption happens at the end of the stream; but in general, how can we be sure our stream is not corrupted?

Likely, it is better to have two pieces of information stuffed in our encoded (compressed) stream: the original length of the uncompressed data, and a checksum.

These bytes can be put at the beginning (or maybe at the end!) of the file containing the compressed data. Also, likely we want something that tells our decoder if the data are in fact encoded the way it expects. This means adding a header in the file1 to make it recognizable and give the decoder some hints/metadata, so that it can check if everything is OK.

We can imagine a header like:

FE ED A1 10           signature
nn nn nn nn           original file length
cc cc cc cc           4-bytes checksum

These are 24 bytes. This means that if our encoder cannot compress the original data sparing at least 24 bytes — it is not worth encoding it at all.

Transformations

What kind of data would be compressed well by a RLE compressing algorithm?

Raster graphics where each pixel is a byte representing one of 256 possible colours is an example, as long as there are a lot of same-colour pixels one after another.

One after another? An image is bidimensional… A 8×8 image would be:

00 01 02 03 04 05 06 07
10 11 12 13 14 15 16 17
…
70 71 72 73 74 75 76 77

where each number is a label for a pixel (RC, R = row, C = column).

But the memory isn’t 2D2. Maybe the in-memory layout of the pixels of the image is like this:

row0 row1 row3 …

Then, our encoding is able to compress equal colors on the same row, but not if they appear as “stripes”. E.g., the folowing image

00 00 00 00 00 00 00 00
00 00 FF FF FF FF 00 00
00 00 FF FF FF FF 00 00
…
00 00 FF FF FF FF 00 00
00 00 00 00 00 00 00 00

would be encoded as

8A 00 84 FF 84 00 84 FF … 8A 00

The 8A 00 encodes for the 8 pixels of a row, plus two pixels of the next row. This is possible because in memory two rows appear one after another.

00 … 00 | 00 00 FF … 00 00 | …
  row0    row1

With this way rows appear in memory, what if our image is made of differently coloured vertical lines?

00 FF 00 FF 00 FF 00 FF
00 FF 00 FF 00 FF 00 FF
00 FF 00 FF 00 FF 00 FF
…
00 FF 00 FF 00 FF 00 FF

Our encoder won’t see any repetition… but we can tell that there is! Well — the 2-bytes sequence 00 FF repeats, but our RLE can consider only repetion of single bytes. Nonetheless, we can see that if we consider the image rotated, we it would be like having

00 00 00 00 00 00 00 00
FF FF FF FF FF FF FF FF
…

which can be compressed well enough as:

88 00 88 FF …

The general idea is: if we apply a transformation, maybe we can make our algorithm more efficient. The big problem becomes: how can we design an algorithm which understands the kind of transformation that would make the compression more effective? Then, if we have several possible transformations, we also need to encode which transformation we applied, so that the decoder can apply its inverse to obtain the correct bytes of the original data: it’s more overhead.

Knowing what we are compressing might help — but then, our algorithm would be specific for a certain kind of data; e.g., for an image where the encoder knows width and height, and how the in-memory pixels layout is.

Decoding (decompressing)

The decoder (decompressor) is easier. This is a nice to have property in general, because usually we care (a little bit) less about the time and computing power needed to compress the file, while we want the decompressing stage to be as fast as possible.

The decoder will look very much like an interpreter which executes instructions. Here we have two instructions: copy the next N bytes verbatim, and repeat the next bytes N-times.

Examples

A simple RLE compressor can be written as a filter reading bytes one at a time from stdin and emitting bytes to stdout. In this design, we do not have a checksum, nor the length of the original stream — which we do not know in advance! — and a header either.

This means that it is our duty to check if everything’s OK. We can give our decoder even a file which wasn’t encoded by the encoder — the output will be garbage, or the decoder will fail at the end of the stream (if the stream does not end correctly).

The RLE decompressor can be designed as a filter consuming bytes from the stdin and producing bytes to the stdout. Of course it is not an universal RLE decompressor! It understands correctly only a stream produced by our encoder.

Let’s see few examples.

The following is the unfortunate case when the compressed stream is twice as big as the original stream: it is 100% bigger (1 + 1 = 2…)

$ echo  -n "a" |./rle_encode |hexdump -C
1 -> 2 (-100.00%)
00000000  01 61                                             |..|

The following is when compressed and original streams have the same length.

$ echo  -n "aa" |./rle_encode |hexdump -C
2 -> 2 (0.00%)
00000000  82 61                                             |.a|

The following is a very fortunate case:

$ perl -e 'print "x"x128 . "y"x128' |./rle_encode |hexdump -C
256 -> 4 (98.44%)
00000000  80 78 80 79                                       |.x.y|

When we have long sequences, the overhead of the encoding isn’t terrible, but anywhere it’s there!

$ echo  -n "01234567890abcdef01234567890abcdef01234567890abcdef01234567890abcdef01234567890abcdef01234567890abcdef01234567890abcdef012345678" |./rle_encode |hexdump -C
128 -> 129 (-0.78%)
00000000  00 30 31 32 33 34 35 36  37 38 39 30 61 62 63 64  |.01234567890abcd|
00000010  65 66 30 31 32 33 34 35  36 37 38 39 30 61 62 63  |ef01234567890abc|
00000020  64 65 66 30 31 32 33 34  35 36 37 38 39 30 61 62  |def01234567890ab|
00000030  63 64 65 66 30 31 32 33  34 35 36 37 38 39 30 61  |cdef01234567890a|
00000040  62 63 64 65 66 30 31 32  33 34 35 36 37 38 39 30  |bcdef01234567890|
00000050  61 62 63 64 65 66 30 31  32 33 34 35 36 37 38 39  |abcdef0123456789|
00000060  30 61 62 63 64 65 66 30  31 32 33 34 35 36 37 38  |0abcdef012345678|
00000070  39 30 61 62 63 64 65 66  30 31 32 33 34 35 36 37  |90abcdef01234567|
00000080  38                                                |8|

In the example above, the 00 means: copy the next 128 bytes to the output. Thus, we have 129 bytes instead of 128. A small price to pay, if then the stream contains long sequences of a repeated byte — not the case here.

We can pipe the output of the encoder as input to the decoder, verifying that the resulting output is the same as the input.

$ echo  -n "abaaa" |./rle_encode |./rle_decode |hexdump -C
5 -> 5 (0.00%)
00000000  61 62 61 61 61                                    |abaaa|
$ perl -e 'print "x"x127 . "yyyy"' |./rle_encode |./rle_decode| hexdump -C
131 -> 4 (96.95%)
00000000  78 78 78 78 78 78 78 78  78 78 78 78 78 78 78 78  |xxxxxxxxxxxxxxxx|
*
00000070  78 78 78 78 78 78 78 78  78 78 78 78 78 78 78 79  |xxxxxxxxxxxxxxxy|
00000080  79 79 79                                          |yyy|
$ perl -e 'print "x"x64 . "y"x64' |./rle_encode |./rle_decode| hexdump -C
128 -> 4 (96.88%)
00000000  78 78 78 78 78 78 78 78  78 78 78 78 78 78 78 78  |xxxxxxxxxxxxxxxx|
*
00000040  79 79 79 79 79 79 79 79  79 79 79 79 79 79 79 79  |yyyyyyyyyyyyyyyy|
*
00000080

We should also test it with generic binary code.

$ echo  -ne "\xFF\xFF\x00\x00\x00\x00" |./rle_encode |./rle_decode |hexdump -C
6 -> 4 (33.33%)
00000000  ff ff 00 00 00 00                                 |......|
$ echo  -ne "\x00\x01\x02\x03\x04\x05\x06\x07" |./rle_encode |./rle_decode |hexdump -C
8 -> 9 (-12.50%)
00000000  00 01 02 03 04 05 06 07                           |........|

Let’s try it with the RLE encoder executable itself:

$ cat rle_encode |./rle_encode >ENCODED
15736 -> 5489 (65.12%)
$ ll rle_encode ENCODED 
 5489 ENCODED
15736 rle_encode*

Not bad! The ELF binary format contains a lot of repeated bytes, it seems! Let’s check that the executable still works:

$ cat rle_encode |./rle_encode |./rle_decode >DECODED
 md5sum DECODED rle_encode
c12a7c64e74e2aa50500992b116940af  DECODED
c12a7c64e74e2aa50500992b116940af  rle_encode

Note: the first time I tried this, the checksum was different! In fact, there was a bug — which I have fixed, hopefully not adding other bugs.

Encoder source code

This is a quick implementation — I didn’t bother to check it properly, either. The examples run above are all the tests I did. It does not satisfy my eyes, but I won’t spend more time on it!

#include <stdio.h>
#include <inttypes.h>

/*
  0nnn nnnn ...    take the next N bytes verbatim
                   where N = 128 if n == 0,
                         N = n   otherwise
  1nnn nnnn BB     repeat byte BB N-times
                   where N = 128 if n == 0,
                        N = n    otherwise
 */

#define MAX_SEQ 128
#define MAX_REPEAT 128

enum What { REPEAT, COPY };

uint8_t accumulator[MAX_SEQ];


size_t dump(enum What what, size_t *i) {
    size_t oi = *i;
    
    switch (what) {
    case COPY:
        if (*i == 0)
            return 0;
        putchar(oi == MAX_SEQ ? 0 : oi);
        fwrite(accumulator, 1, oi, stdout);
        *i = 0;
        return oi + 1;

    case REPEAT:
        if (*i == 0)
            return 0;
        if (oi == MAX_REPEAT)
            oi = 0;
        putchar(oi | 0x80);
        putchar(accumulator[0]);
        *i = 0;
        return 2;
    }
}



double reduction(size_t total_size, size_t compressed_size) {
    return 1.0 - (double)compressed_size / total_size;
}

int main(void) {
    size_t accumulated = 0;
    size_t repeated = 0;

    size_t total_size = 0;
    size_t compressed_size = 0;

    enum What state = COPY;
    int c;
    int previous_char = getchar();
    if (previous_char == EOF)
        return 0;

    accumulator[0] = previous_char;
    accumulated = 1;
    
    total_size = 1;
    while ((c = getchar()) != EOF) {
        total_size++;
        switch (state) {
        case COPY:
            if (c != previous_char) {
                previous_char = c;
                accumulator[accumulated++] = c;
                if (accumulated == MAX_SEQ) {
                    compressed_size += dump(state, &accumulated);
                    previous_char = -1;
                }
            } else {
                accumulated--;
                compressed_size += dump(state, &accumulated);
                repeated = 2;
                accumulator[0] = c;
                state = REPEAT;
            }
            break;
        case REPEAT:
            if (c == previous_char) {
                repeated++;
                if (repeated == MAX_REPEAT) {
                    accumulator[0] = previous_char;
                    compressed_size += dump(state, &repeated);
                }
            } else {
                accumulator[0] = previous_char;
                compressed_size += dump(state, &repeated);
                previous_char = c;
                accumulator[0] = c;
                accumulated = 1;
                state = COPY;
            }
        }
    }
    
    compressed_size += dump(state, state == COPY ? &accumulated : &repeated);
    fprintf(stderr, "%zu -> %zu (%.2lf\%)\n", total_size, compressed_size, 100.0*reduction(total_size, compressed_size));
    return 0;
}

Decoder source code

This reads (and writes) a single character per time, even when it could read more than one bytes and write it back altogether. Because of buffering, this shouldn’t be terrible, but the other way would be nonetheless a little bit better because it would make less calls to a library (and the system) function.

#include <stdio.h>
#include <stdlib.h>

void corrupted_stream(void) {
    fprintf(stderr, "corrupted stream");
    exit(1);
}

int main(void) {
    int c;
    while ((c = getchar()) != EOF) {
        if (c & 0x80) {
            c = c == 0x80 ? 128 : (c & 0x7f);
            int b = getchar();
            if (b == EOF)
                corrupted_stream();
          
            while (c-- > 0)
                putchar(b);
        } else {
            c = c == 0 ? 128 : c;
            while (c-- > 0) {
                int d = getchar();
                if (d == EOF)
                    corrupted_stream();
                putchar(d);
            }
        }
    }

    return 0;
}

  1. A lot of windowsians think that a file suffix is absolutely needed to identify a file. Indeed, the majority of file formats are designed so that a program can tell if they are what they should be — despite the file name. So, you can rename a image.jpg file as image.png, and still a program can tell that the image format is JPG, not PNG. This is because those formats contain “magic values” at their beginning — a sort of signature that says “I am a JPEG file” or “I am a PNG” files. On nix-like Operating Systems there is the file utility which can identify a file by its content. It uses these special headers — and some kind of heuristics for those formats which has no headers — for instance, a plain textual file has no headers saying “I am a text file”.↩︎

  2. From our point of view, it is a “sequence” of cells. On a chip, it would look more like a 2D grid.↩︎

No comments:

Post a Comment