2012-11-25

Compressing

Sometimes I have to compress-archive files which differ not too much. I can see it with a glance, but computers have nothing similar to human vision... The archive formats I consider are zip (only if I have to share with windowsers who could be scared by other formats), tar.gz (tgz), tar.bz2 and if I don't care about file attributes, 7z. Among these, 7z compresses usually the most.
But since the files I am archiving have very repetitive pattern, I am not satisfied with the result of all archivers (except maybe 7z...).
Could we obtain better results in some way?

First, I have generated two files. They are the same file except for a byte in a certain position.

15274 file1.txt
15274 file2.txt


Then I have archived them:

 1847 files.7z
 3363 files.rar
 7142 files.tar.bz2
 7918 files.tar.gz
15342 files.zip


I have added rar too for the sake of completeness. Zip compresses each file at 50% more or less, so the final archive size is more or less half the size of the two files altogether. The others perform better, and 7z is simply the best. (The content of the file is important of course: it contains the number from 0 to 4095 written in ASCII without spaces or newlines: few symbols, repetitive patterns, no random).

Could it be better if I store just the first file and then a diff file? The result is:

1833 files.7z
1803 files.rar
5180 files.tar.bz2
7776 files.tar.gz
7891 files.zip


Zip reaches the performance of tar.gz (we could say it keeps making the first file half, while the diff file is small and so does not contribute too much to the final size). Rar seems slightly better than 7z (but the difference is negligible), and 7z itself does not gain too much from this pre-processing: it almost seems it is the only one that implements an algorithm able to crash the repetitions. Tar.gz's gain is small, tar.bz2 is bigger but not yet enough to keep the pace with 7z.
(Small differences could be due also to the file name of the diff file, which is "file1.txt.file2.patch"; 7z and rar do not store the filenames in clear, zip does, tar surely write filenames in clear, but of course bz2 and gzip compress them too).

Now let's try with two 16k (exactly 16*1024=16384 bytes) files containing the same random bytes (no special care in the random distribution), except for 1 bytes which differs in file2. The results:

16838 files.7z
32971 files.rar
20904 files.tar.bz2
16796 files.tar.gz
33086 files.zip


Zip stores the files. Rar is a little bit better, but not enough. Interestingly, gzip beats everyone, and 7z is a bit behind. Anyway, it seems like 7z and tar.gz algorithm exploits differences between files (since gzip works at a level where it is not aware of distinct files, it means it "explores" the stream and recognizes repetition patterns? better than bzip2? this need further analysis...). If I store only the first file and the diff file, I obtain

16824 files.7z
16606 files.rar
17114 files.tar.bz2
16654 files.tar.gz
16765 files.zip


Now almost all the archivers-compressors obtain the "same" result. Strangely, tar.bz2 gain is the worst. Rar, tar.gz and zip wins over 7z!

No conclusions. The way I have obtained the diff file is:

diff <(xxd -c 1 file1.rand) <(xxd -c 1 file2.rand) >file1.rand.file2.patch

and similar for the no-(pseudo)random case. Btw this could be not the best way, but indeed it's not important for this argument.

No comments:

Post a Comment