Let’s have a little bit of programming fun with data compression.
Run-length encoding
The idea is to remove adjacent repetitions of a byte. If I have a string such as
aaaaaaaaaaaaaaaaaaaa
I could reduce the size of the actual data by encoding something like:
the byte “a” repeated 20 times
But we can also have bytes that do not repeat:
aaaaaaaaaaaaaaaaaaaaBCDEFGHaaaaaaaaaa
Thus, we could invent an encoding that, once interpreted, would be read as:
the byte “a” repeated 20 times, then the sequence of bytes BCDEFGH, then the byte “a” repeated 10 times.
If we can encode these instructions in less than 37 bytes, we have in fact reduced the amount of bytes taken by our data.
We are working with binary data, because we want to be able to compress everything, not just “text” (that is, bytes which can be interpreted as glyphs according to some character set like ASCII, or encoding like UTF-8).
Let’s try with some possible encodings — I am just sketching ideas, not searching for the best possible way (if it exists).
From now on, I will write octets (8-bit bytes) with two hexadecimal digits, e.g. FF, 1A, and so on — lowercase letters will be used for variables. Also, it will be useful to write bytes as bits, e.g. 10000000 — 8 binary digits — and in order to read them more easily, we can write 1000-0000 (two “nibbles”). The most significant bits (MSB) are those on the left.
Let’s imagine a possible encoding.
- We start in mode V. In mode V, every byte of the input is copied verbatim to the output.
- FF enters the “repeat bytes” mode, mode R.
- In mode R the incoming data are made of pairs of bytes: nn dd. nn says how many times the byte dd must be repeated, with nn between 01 and FE.
- When nn is 00, we exit mode R and enter mode V again.
- The byte FF must be encoded as FF FF.
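A small decoder sketch may make these rules concrete. This follows the list above literally (a lone 00 count exits mode R, and the stream may also simply end while in mode R); the name decode_vr is made up for illustration, and the output buffer is assumed large enough:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a decoder for the FF-escape scheme above. Returns the
   number of decoded bytes, or (size_t)-1 on a truncated stream.
   Counts nn are expected to be in 01..FE, as the scheme says. */
size_t decode_vr(const uint8_t *in, size_t len, uint8_t *out) {
    size_t i = 0, o = 0;
    while (i < len) {
        uint8_t c = in[i++];
        if (c != 0xFF) {                /* mode V: copy verbatim */
            out[o++] = c;
            continue;
        }
        if (i < len && in[i] == 0xFF) { /* FF FF is a literal FF */
            out[o++] = 0xFF;
            i++;
            continue;
        }
        /* mode R: nn dd pairs, until a 00 count or end of input */
        while (i < len) {
            uint8_t nn = in[i++];
            if (nn == 0x00)             /* back to mode V */
                break;
            if (i >= len)
                return (size_t)-1;      /* nn without its dd: truncated */
            uint8_t dd = in[i++];
            while (nn-- > 0)
                out[o++] = dd;
        }
    }
    return o;
}
```

For instance, feeding it FF 03 61 03 62 yields the six bytes of aaabbb.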
With this encoding, the byte FF shouldn’t appear too often, since otherwise it can easily defeat the purpose of the encoding — which is to reduce the size of the data stream.
Moreover, when bytes repeat, they should repeat at least three times; otherwise, it is not convenient to enter mode R. In fact, aaa would be encoded as
FF 03 61
We do not spare any byte, and maybe we also need to exit mode R, which takes up one more byte: aaab (4 bytes) would be encoded as
FF 03 61 00 62
That is, 5 bytes instead of 4: our compression algorithm makes our data bigger! Unless we avoid mode R in these cases. If we had repetition of b, too, we could however enter mode R:
FF 03 61 03 62
These bytes encode aaabbb with 5 bytes instead of 6.
Notice another case when it is not convenient to use mode R:
FF 02 61
We have encoded the 2-byte-long string aa with 3 bytes! Again, our encoding algorithm shouldn’t enter mode R in such situations.
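The rule of thumb above can be written down as a tiny predicate. This is just a sketch of the arithmetic (worth_mode_r is a made-up name): a run costs a 2-byte nn dd pair, plus one byte for the FF escape when we still have to enter mode R:

```c
#include <stddef.h>

/* Sketch: does a run of run_len equal bytes pay off in mode R?
   It does only when the run is longer than the bytes spent on it. */
int worth_mode_r(size_t run_len, int already_in_mode_r) {
    size_t cost = already_in_mode_r ? 2 : 3;  /* nn dd, plus FF to enter */
    return run_len > cost;
}
```

So worth_mode_r(3, 0) is false (aaa spends 3 bytes for 3, no gain), while worth_mode_r(3, 1) is true (the 03 62 pair spends 2 bytes for 3).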
In our encoding we have used a whole byte as a sort of escape character to enter (and then exit) mode R. But we could think in terms of bits instead, and make the encoding stateless.
For instance, the MSB of a byte could be used to say that the other 7 bits tell how many times the next byte repeats.
1nnn nnnn dddd dddd
This could mean: repeat the byte dddd dddd nnn nnnn times. But then, a sequence of different bytes, such as ab, would be encoded as:
81 61 81 62
Indeed we need to use this encoding only for bytes between 80 and FF; bytes from 00 to 7F can be copied verbatim. This could work if our data contains mostly bytes in the range 00-7F, and of course many repeated bytes.
We could use another kind of encoding:
- 0nnn nnnn means: take the next N bytes verbatim
- 1nnn nnnn dddd dddd: repeat the byte D N times.
Since taking the next 0 bytes has no meaning, and repeating the next byte 0 times would be just as pointless, we can interpret 0 as 128.
A sequence of 128 bytes (〈Z〉) containing no repetitions would be encoded as:
00 <Z>
And the byte “a” (61) repeated 128 times would be encoded as:
80 61
The sequence “aa” can be encoded as
82 61
That is, with the same number of bytes as the original sequence.
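Decoding a control byte of this stateless scheme reduces to a few bit operations. A minimal sketch, with control_byte as a made-up helper name:

```c
#include <stdint.h>

/* Sketch: interpret a control byte of the stateless scheme above.
   Sets *repeat to 1 when the MSB says "repeat the next byte", 0 when
   it says "copy the next N bytes verbatim", and returns N (1..128),
   with the 7-bit value 0 meaning 128 as explained above. */
int control_byte(uint8_t c, int *repeat) {
    *repeat = (c & 0x80) != 0;
    int n = c & 0x7F;
    return n == 0 ? 128 : n;
}
```

For instance, 80 decodes as “repeat 128 times”, 82 as “repeat 2 times”, and 00 as “copy 128 bytes verbatim”.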
Overheads
What if our decoder has to decode a stream like the following?
8E 30 02
This stream is corrupted: after the 02, our decoder expects two more bytes to be copied verbatim, but the stream ends there. Another example of a corrupted stream:
8D 30 82
In fact, after 82 the decoder expects another byte (the one that should be repeated 2 times).
These corrupted streams can be identified because the corruption happens at the end of the stream; but in general, how can we be sure our stream is not corrupted?
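For truncation at least, we can walk the control bytes and check that the stream ends exactly at an instruction boundary. A sketch (well_formed is a made-up name), which catches both corrupted examples above, but not corruption that still happens to parse:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: check that a stream of the stateless encoding ends exactly
   at an instruction boundary. */
int well_formed(const uint8_t *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t c = s[i++];
        if (c & 0x80) {
            if (i + 1 > len)            /* missing the byte to repeat */
                return 0;
            i += 1;
        } else {
            size_t n = c == 0 ? 128 : c;
            if (i + n > len)            /* truncated verbatim run */
                return 0;
            i += n;
        }
    }
    return 1;
}
```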
Likely, it is better to have two pieces of information stuffed in our encoded (compressed) stream: the original length of the uncompressed data, and a checksum.
These bytes can be put at the beginning (or maybe at the end!) of the file containing the compressed data. Also, likely we want something that tells our decoder if the data are in fact encoded the way it expects. This means adding a header in the file [1] to make it recognizable and give the decoder some hints/metadata, so that it can check if everything is OK.
We can imagine a header like:
FE ED A1 10 signature
nn nn nn nn original file length
cc cc cc cc 4-byte checksum
These are 12 bytes. This means that if our encoder cannot compress the original data sparing at least 12 bytes — it is not worth encoding it at all.
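A sketch of how such a header could be written and checked, assuming the fields shown above with big-endian byte order; the function names are invented, and the checksum here is a plain byte sum used only as a placeholder for a real checksum algorithm:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HDR_SIZE 12
static const uint8_t SIGNATURE[4] = {0xFE, 0xED, 0xA1, 0x10};

/* Placeholder checksum: just the sum of all the bytes. */
uint32_t checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/* Fill the 12-byte header: signature, then length and checksum,
   both stored most significant byte first. */
void write_header(uint8_t hdr[HDR_SIZE], uint32_t orig_len, uint32_t cksum) {
    memcpy(hdr, SIGNATURE, 4);
    for (int i = 0; i < 4; i++) {
        hdr[4 + i] = (orig_len >> (24 - 8 * i)) & 0xFF;
        hdr[8 + i] = (cksum    >> (24 - 8 * i)) & 0xFF;
    }
}

/* Nonzero if the header carries our signature. */
int check_signature(const uint8_t hdr[HDR_SIZE]) {
    return memcmp(hdr, SIGNATURE, 4) == 0;
}
```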
Transformations
What kind of data would be compressed well by a RLE compressing algorithm?
Raster graphics where each pixel is a byte representing one of 256 possible colours is an example, as long as there are a lot of same-colour pixels one after another.
One after another? An image is bidimensional… An 8×8 image would be:
00 01 02 03 04 05 06 07
10 11 12 13 14 15 16 17
…
70 71 72 73 74 75 76 77
where each number is a label for a pixel (RC, R = row, C = column).
But the memory isn’t 2D [2]. Maybe the in-memory layout of the pixels of the image is like this:
row0 row1 row2 …
Then, our encoding is able to compress equal colours on the same row, but not if they appear as “stripes”. E.g., the following image
00 00 00 00 00 00 00 00
00 00 FF FF FF FF 00 00
00 00 FF FF FF FF 00 00
…
00 00 FF FF FF FF 00 00
00 00 00 00 00 00 00 00
would be encoded as
8A 00 84 FF 84 00 84 FF … 8A 00
The 8A 00 encodes the 8 pixels of a row, plus two pixels of the next row. This is possible because in memory two rows appear one after another.
00 … 00 | 00 00 FF … 00 00 | …
row0 row1
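The row-spanning runs can be checked mechanically by splitting the buffer into maximal runs of equal bytes, which is the raw material an RLE encoder works from. A sketch (runs is a made-up helper; the output arrays are assumed large enough):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: split a buffer into maximal runs of equal bytes.
   counts/bytes receive one entry per run; returns the run count. */
size_t runs(const uint8_t *data, size_t len, size_t *counts, uint8_t *bytes) {
    size_t nruns = 0, i = 0;
    while (i < len) {
        uint8_t b = data[i];
        size_t n = 0;
        while (i < len && data[i] == b) {   /* extend the current run */
            i++;
            n++;
        }
        counts[nruns] = n;
        bytes[nruns] = b;
        nruns++;
    }
    return nruns;
}
```

On the image above, the first and last runs are 10 pixels long — exactly what the two 8A 00 instructions encode.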
Given the way rows appear in memory, what if our image is made of differently coloured vertical lines?
00 FF 00 FF 00 FF 00 FF
00 FF 00 FF 00 FF 00 FF
00 FF 00 FF 00 FF 00 FF
…
00 FF 00 FF 00 FF 00 FF
Our encoder won’t see any repetition… but we can tell that there is! Well — the 2-byte sequence 00 FF repeats, but our RLE can consider only repetitions of single bytes. Nonetheless, we can see that if we consider the image rotated, it would be like having
00 00 00 00 00 00 00 00
FF FF FF FF FF FF FF FF
…
which can be compressed well enough as:
88 00 88 FF …
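The “rotation” can be sketched as a plain transposition of the row-major pixel buffer; applying it twice restores the original, so a decoder can undo it. The function name is made up:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: transpose a side×side image stored row-major, so that
   vertical runs become horizontal runs the RLE encoder can see. */
void transpose(const uint8_t *in, uint8_t *out, size_t side) {
    for (size_t r = 0; r < side; r++)
        for (size_t c = 0; c < side; c++)
            out[c * side + r] = in[r * side + c];
}
```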
The general idea is: if we apply a transformation, maybe we can make our algorithm more efficient. The big problem becomes: how can we design an algorithm which understands the kind of transformation that would make the compression more effective? Then, if we have several possible transformations, we also need to encode which transformation we applied, so that the decoder can apply its inverse to obtain the correct bytes of the original data: it’s more overhead.
Knowing what we are compressing might help — but then, our algorithm would be specific for a certain kind of data; e.g., for an image where the encoder knows width and height, and how the in-memory pixels layout is.
Decoding (decompressing)
The decoder (decompressor) is easier. This is a nice-to-have property in general, because usually we care (a little bit) less about the time and computing power needed to compress the file, while we want the decompression stage to be as fast as possible.
The decoder will look very much like an interpreter which executes instructions. Here we have two instructions: copy the next N bytes verbatim, and repeat the next byte N times.
Examples
A simple RLE compressor can be written as a filter reading bytes one at a time from stdin and emitting bytes to stdout. In this design, we do not have a checksum, nor the length of the original stream — which we do not know in advance! — nor a header.
This means that it is our duty to check if everything’s OK. We can even feed our decoder a file which wasn’t produced by the encoder — the output will be garbage, or the decoder will fail at the end of the stream (if the stream does not end correctly).
The RLE decompressor can be designed as a filter consuming bytes from stdin and producing bytes to stdout. Of course it is not a universal RLE decompressor! It correctly understands only a stream produced by our encoder.
Let’s see a few examples.
The following is the unfortunate case when the compressed stream is twice as big as the original stream: it is 100% bigger (1 + 1 = 2…)
$ echo -n "a" |./rle_encode |hexdump -C
1 -> 2 (-100.00%)
00000000 01 61 |..|
The following is when compressed and original streams have the same length.
$ echo -n "aa" |./rle_encode |hexdump -C
2 -> 2 (0.00%)
00000000 82 61 |.a|
The following is a very fortunate case:
$ perl -e 'print "x"x128 . "y"x128' |./rle_encode |hexdump -C
256 -> 4 (98.44%)
00000000 80 78 80 79 |.x.y|
When we have long sequences, the overhead of the encoding isn’t terrible, but anyway it’s there!
$ echo -n "01234567890abcdef01234567890abcdef01234567890abcdef01234567890abcdef01234567890abcdef01234567890abcdef01234567890abcdef012345678" |./rle_encode |hexdump -C
128 -> 129 (-0.78%)
00000000 00 30 31 32 33 34 35 36 37 38 39 30 61 62 63 64 |.01234567890abcd|
00000010 65 66 30 31 32 33 34 35 36 37 38 39 30 61 62 63 |ef01234567890abc|
00000020 64 65 66 30 31 32 33 34 35 36 37 38 39 30 61 62 |def01234567890ab|
00000030 63 64 65 66 30 31 32 33 34 35 36 37 38 39 30 61 |cdef01234567890a|
00000040 62 63 64 65 66 30 31 32 33 34 35 36 37 38 39 30 |bcdef01234567890|
00000050 61 62 63 64 65 66 30 31 32 33 34 35 36 37 38 39 |abcdef0123456789|
00000060 30 61 62 63 64 65 66 30 31 32 33 34 35 36 37 38 |0abcdef012345678|
00000070 39 30 61 62 63 64 65 66 30 31 32 33 34 35 36 37 |90abcdef01234567|
00000080 38 |8|
In the example above, the 00 means: copy the next 128 bytes to the output. Thus, we have 129 bytes instead of 128. A small price to pay, if then the stream contains long sequences of a repeated byte — not the case here.
We can pipe the output of the encoder as input to the decoder, verifying that the resulting output is the same as the input.
$ echo -n "abaaa" |./rle_encode |./rle_decode |hexdump -C
5 -> 5 (0.00%)
00000000 61 62 61 61 61 |abaaa|
$ perl -e 'print "x"x127 . "yyyy"' |./rle_encode |./rle_decode| hexdump -C
131 -> 4 (96.95%)
00000000 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 |xxxxxxxxxxxxxxxx|
*
00000070 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 79 |xxxxxxxxxxxxxxxy|
00000080 79 79 79 |yyy|
$ perl -e 'print "x"x64 . "y"x64' |./rle_encode |./rle_decode| hexdump -C
128 -> 4 (96.88%)
00000000 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 |xxxxxxxxxxxxxxxx|
*
00000040 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 |yyyyyyyyyyyyyyyy|
*
00000080
We should also test it with generic binary data.
$ echo -ne "\xFF\xFF\x00\x00\x00\x00" |./rle_encode |./rle_decode |hexdump -C
6 -> 4 (33.33%)
00000000 ff ff 00 00 00 00 |......|
$ echo -ne "\x00\x01\x02\x03\x04\x05\x06\x07" |./rle_encode |./rle_decode |hexdump -C
8 -> 9 (-12.50%)
00000000 00 01 02 03 04 05 06 07 |........|
Let’s try it with the RLE encoder executable itself:
$ cat rle_encode |./rle_encode >ENCODED
15736 -> 5489 (65.12%)
$ ll rle_encode ENCODED
5489 ENCODED
15736 rle_encode*
Not bad! The ELF binary format contains a lot of repeated bytes, it seems! Let’s check that the executable still works:
$ cat rle_encode |./rle_encode |./rle_decode >DECODED
$ md5sum DECODED rle_encode
c12a7c64e74e2aa50500992b116940af DECODED
c12a7c64e74e2aa50500992b116940af rle_encode
Note: the first time I tried this, the checksum was different! In fact, there was a bug — which I have fixed, hopefully not adding other bugs.
Encoder source code
This is a quick implementation — I didn’t bother to check it properly, either. The examples run above are all the tests I did. It does not satisfy my eyes, but I won’t spend more time on it!
#include <stdio.h>
#include <inttypes.h>

/*
    0nnn nnnn ...  take the next N bytes verbatim
                   where N = 128 if n == 0,
                         N = n otherwise
    1nnn nnnn BB   repeat byte BB N-times
                   where N = 128 if n == 0,
                         N = n otherwise
*/

#define MAX_SEQ    128
#define MAX_REPEAT 128

enum What { REPEAT, COPY };

uint8_t accumulator[MAX_SEQ];

size_t dump(enum What what, size_t *i) {
    size_t oi = *i;
    switch (what) {
    case COPY:
        if (*i == 0)
            return 0;
        putchar(oi == MAX_SEQ ? 0 : oi);
        fwrite(accumulator, 1, oi, stdout);
        *i = 0;
        return oi + 1;
    case REPEAT:
        if (*i == 0)
            return 0;
        if (oi == MAX_REPEAT)
            oi = 0;
        putchar(oi | 0x80);
        putchar(accumulator[0]);
        *i = 0;
        return 2;
    }
    return 0;
}

double reduction(size_t total_size, size_t compressed_size) {
    return 1.0 - (double)compressed_size / total_size;
}

int main(void) {
    size_t accumulated = 0;
    size_t repeated = 0;
    size_t total_size = 0;
    size_t compressed_size = 0;
    enum What state = COPY;
    int c;
    int previous_char = getchar();

    if (previous_char == EOF)
        return 0;
    accumulator[0] = previous_char;
    accumulated = 1;
    total_size = 1;

    while ((c = getchar()) != EOF) {
        total_size++;
        switch (state) {
        case COPY:
            if (c != previous_char) {
                previous_char = c;
                accumulator[accumulated++] = c;
                if (accumulated == MAX_SEQ) {
                    compressed_size += dump(state, &accumulated);
                    previous_char = -1;
                }
            } else {
                accumulated--;
                compressed_size += dump(state, &accumulated);
                repeated = 2;
                accumulator[0] = c;
                state = REPEAT;
            }
            break;
        case REPEAT:
            if (c == previous_char) {
                repeated++;
                if (repeated == MAX_REPEAT) {
                    accumulator[0] = previous_char;
                    compressed_size += dump(state, &repeated);
                }
            } else {
                accumulator[0] = previous_char;
                compressed_size += dump(state, &repeated);
                previous_char = c;
                accumulator[0] = c;
                accumulated = 1;
                state = COPY;
            }
        }
    }
    compressed_size += dump(state, state == COPY ? &accumulated : &repeated);
    fprintf(stderr, "%zu -> %zu (%.2lf%%)\n",
            total_size, compressed_size,
            100.0 * reduction(total_size, compressed_size));
    return 0;
}
Decoder source code
This reads (and writes) a single character at a time, even when it could read more than one byte and write them back all at once. Because of buffering, this shouldn’t be terrible, but the other way would nonetheless be a little bit better, because it would make fewer calls to a library (and system) function.
#include <stdio.h>
#include <stdlib.h>

void corrupted_stream(void) {
    fprintf(stderr, "corrupted stream");
    exit(1);
}

int main(void) {
    int c;
    while ((c = getchar()) != EOF) {
        if (c & 0x80) {
            c = c == 0x80 ? 128 : (c & 0x7f);
            int b = getchar();
            if (b == EOF)
                corrupted_stream();
            while (c-- > 0)
                putchar(b);
        } else {
            c = c == 0 ? 128 : c;
            while (c-- > 0) {
                int d = getchar();
                if (d == EOF)
                    corrupted_stream();
                putchar(d);
            }
        }
    }
    return 0;
}
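The buffered variant hinted at above could look like this sketch, written as a function over stdio streams so it is easy to test; rle_decode_stream is a made-up name, and verbatim runs are moved with a single fread/fwrite pair instead of one call per byte:

```c
#include <stdio.h>

/* Sketch of a buffered decoder for the same encoding.
   Returns 0 on success, -1 on a truncated stream. */
int rle_decode_stream(FILE *in, FILE *out) {
    int c;
    unsigned char buf[128];
    while ((c = fgetc(in)) != EOF) {
        if (c & 0x80) {                     /* repeat instruction */
            size_t n = (c & 0x7F) ? (size_t)(c & 0x7F) : 128;
            int b = fgetc(in);
            if (b == EOF)
                return -1;
            while (n-- > 0)
                fputc(b, out);
        } else {                            /* verbatim instruction */
            size_t n = c ? (size_t)c : 128;
            if (fread(buf, 1, n, in) != n)  /* whole chunk at once */
                return -1;
            fwrite(buf, 1, n, out);
        }
    }
    return 0;
}
```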
[1] A lot of windowsians think that a file suffix is absolutely needed to identify a file. Indeed, the majority of file formats are designed so that a program can tell if they are what they should be, despite the file name. So, you can rename an image.jpg file as image.png, and still a program can tell that the image format is JPEG, not PNG. This is because those formats contain “magic values” at their beginning — a sort of signature that says “I am a JPEG file” or “I am a PNG file”. On *nix-like operating systems there is the file utility which can identify a file by its content. It uses these special headers — and some kind of heuristics for those formats which have no headers; for instance, a plain text file has no header saying “I am a text file”.
[2] From our point of view, memory is a “sequence” of cells. On a chip, it would look more like a 2D grid.