Sometimes I have to archive and compress files that differ only slightly from one another. I can see that at a glance, but computers have nothing similar to human vision... The archive formats I consider are zip (only if I have to share with Windows users who could be scared by other formats), tar.gz (tgz), tar.bz2 and, if I don't care about file attributes, 7z. Among these, 7z usually compresses the most.
But since the files I am archiving have a very repetitive pattern, I am not satisfied with the results of any of the archivers (except maybe 7z...).
Could we obtain better results in some way?
First, I generated two files. They are identical except for a single byte at a certain position.
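Something like the following would produce such a pair. It is only a sketch: the content follows the description given further down (the numbers from 0 to 4095 in ASCII, no separators), while the file names, the offset 1000 and the replacement byte X are illustrative assumptions.

```shell
# Numbers 0..4095 in ASCII, with the newlines that seq emits removed.
seq 0 4095 | tr -d '\n' > file1.txt

# file2.txt is a copy with exactly one byte overwritten in place.
# The file holds only digits, so 'X' is guaranteed to differ.
cp file1.txt file2.txt
printf 'X' | dd of=file2.txt bs=1 seek=1000 conv=notrunc 2>/dev/null

# cmp -l prints one line per differing byte (and exits non-zero when
# the files differ, hence the "|| true"); a single line is expected.
(cmp -l file1.txt file2.txt || true) | wc -l
```

Both files come out at 15274 bytes each (10 + 180 + 2700 + 12384 digits).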
Then I archived them:
I added rar too, for the sake of completeness. Zip compresses each file to roughly 50% of its size, so the final archive is about half the size of the two files together. The others perform better, and 7z is simply the best. (The content of the file matters, of course: it contains the numbers from 0 to 4095 written in ASCII without spaces or newlines, so few symbols, repetitive patterns, nothing random.)
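For anyone who wants to repeat the comparison, the archives could be built roughly like this. It is a sketch with default options only; the archive names (pair.*) and the way the test pair is recreated are assumptions, and the tools that are often not installed are skipped when missing.

```shell
# Recreate the test pair (file names and changed offset are assumptions).
seq 0 4095 | tr -d '\n' > file1.txt
cp file1.txt file2.txt
printf 'X' | dd of=file2.txt bs=1 seek=1000 conv=notrunc 2>/dev/null

# tar bundles both files into one stream, which gzip or bzip2 then
# compresses as a whole.
tar -czf pair.tar.gz file1.txt file2.txt
if command -v bzip2 >/dev/null; then tar -cjf pair.tar.bz2 file1.txt file2.txt; fi

# zip, 7z and rar are separate tools; run each only if it is available.
if command -v zip >/dev/null; then zip -q pair.zip file1.txt file2.txt; fi
if command -v 7z  >/dev/null; then 7z a pair.7z file1.txt file2.txt >/dev/null; fi
if command -v rar >/dev/null; then rar a pair.rar file1.txt file2.txt >/dev/null; fi

# Sizes in bytes: the two inputs, then every archive that was produced.
wc -c file1.txt file2.txt pair.*
```

The exact numbers depend on the tool versions and compression levels, so they will not necessarily match mine.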
Would it be better to store just the first file plus a diff file? The result is:
Zip reaches the performance of tar.gz (we could say it still halves the first file, while the diff file is small and so contributes little to the final size). Rar seems slightly better than 7z (but the difference is negligible), and 7z itself does not gain much from this pre-processing: it almost seems to be the only one implementing an algorithm able to crush the repetitions. Tar.gz's gain is small; tar.bz2's gain is bigger, but still not enough to keep pace with 7z.
(Small differences could also be due to the name of the diff file, which is "file1.txt.file2.patch"; 7z and rar do not store the filenames in the clear, zip does, and tar surely writes filenames in the clear, but of course bz2 and gzip then compress them too.)
Now let's try with two 16k (exactly 16*1024=16384 bytes) files containing the same random bytes (no special care taken over the random distribution), except for 1 byte which differs in file2. The results:
Zip stores the files. Rar is a little bit better, but not by enough. Interestingly, gzip beats everyone, and 7z is a bit behind. Anyway, it seems that the 7z and tar.gz algorithms exploit the similarity between the files (since gzip works at a level where it is not aware of distinct files, does this mean it "explores" the stream and recognizes repeated patterns? Better than bzip2? This needs further analysis...). If I store only the first file and the diff file, I obtain:
Now almost all the archivers/compressors obtain the "same" result. Strangely, tar.bz2's gain is the worst. Rar, tar.gz and zip win over 7z!
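On the question of whether gzip "explores" the stream: DEFLATE, the algorithm behind gzip, searches for repeated strings inside a sliding window covering the last 32 KiB of input. Tar just concatenates the members, each preceded by a 512-byte header, so the second 16 KiB file is still within reach of the first. Zip compresses each member on its own, so it cannot do this. A small sketch to check it, using two identical copies to keep things simple (the file names are placeholders):

```shell
# 16 KiB of random bytes, plus an identical copy.
head -c 16384 /dev/urandom > a.rand
cp a.rand b.rand

# One file gzipped alone vs. both files tarred and then gzipped.
one=$(gzip -c a.rand | wc -c)
two=$(tar -cf - a.rand b.rand | gzip -c | wc -c)
echo "one file: $one bytes, two identical files: $two bytes"
```

If the second number is only slightly larger than the first, gzip has encoded the second copy as back-references into the first instead of storing the random bytes twice. This does not explain the bzip2 result, which still needs a separate look.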
No conclusions. The way I obtained the diff file is:
diff <(xxd -c 1 file1.rand) <(xxd -c 1 file2.rand) >file1.rand.file2.patch
and similarly for the non-(pseudo)random case. By the way, this may not be the best way, but that is not important for this argument.
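One possible alternative, sketched under assumptions and not what was used for the results above: `cmp -l` lists every differing byte as a line of offset, old value and new value (the offset is 1-based decimal, the values are octal), which is already a compact patch for two files of equal length. The file names, the changed offset and the rebuild loop are hypothetical; the loop is only there to check that file1 plus the patch is enough to recover file2.

```shell
# A random 16 KiB file and a copy that differs in exactly one byte.
head -c 16384 /dev/urandom > file1.rand
cp file1.rand file2.rand
old=$(dd if=file1.rand bs=1 skip=5000 count=1 2>/dev/null | od -An -tu1 | tr -d ' \n')
new=$(printf '%03o' $(( (old + 1) % 256 )))   # a different value, in octal
printf "\\$new" | dd of=file2.rand bs=1 seek=5000 conv=notrunc 2>/dev/null

# The "patch": one line per differing byte.
# cmp exits non-zero when the files differ, hence the "|| true".
cmp -l file1.rand file2.rand > file1.rand.file2.cmp || true

# Rebuild file2 from file1 plus the patch, then verify the result.
cp file1.rand rebuilt.rand
while read -r offset from to; do
    printf "\\$to" | dd of=rebuilt.rand bs=1 seek=$((offset - 1)) conv=notrunc 2>/dev/null
done < file1.rand.file2.cmp
cmp rebuilt.rand file2.rand && echo "rebuilt file matches file2.rand"
```

This only works when the two files have the same length; for inserted or deleted bytes a real diff tool is still needed.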