2012-11-25

Compressing

Sometimes I have to compress and archive files that do not differ much from one another. I can see it at a glance, but computers have nothing similar to human vision... The archive formats I consider are zip (only if I have to share with windowsers who could be scared by other formats), tar.gz (tgz), tar.bz2 and, if I don't care about file attributes, 7z. Among these, 7z usually compresses the most.
But since the files I am archiving have very repetitive patterns, I am not satisfied with the results of any of these archivers (except maybe 7z...).
Could we obtain better results in some way?

First, I have generated two files. They are the same file except for one byte at a certain position.

15274 file1.txt
15274 file2.txt
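For reference, one way to build such a pair (the content is described below: the numbers from 0 to 4095 in ASCII, no separators) is more or less:

  seq 0 4095 | tr -d '\n' > file1.txt        # 15274 bytes of concatenated decimals
  cp file1.txt file2.txt
  printf 'x' | dd of=file2.txt bs=1 seek=7000 conv=notrunc   # overwrite one byte, offset chosen arbitrarily
  cmp -l file1.txt file2.txt                 # should report exactly one differing byte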


Then I have archived them:

 1847 files.7z
 3363 files.rar
 7142 files.tar.bz2
 7918 files.tar.gz
15342 files.zip
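For the record, the plain default invocations look like this:

  7z a files.7z file1.txt file2.txt
  rar a files.rar file1.txt file2.txt
  tar cjf files.tar.bz2 file1.txt file2.txt
  tar czf files.tar.gz file1.txt file2.txt
  zip files.zip file1.txt file2.txt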


I have added rar too for the sake of completeness. Zip compresses each file to roughly 50%, so the final archive is more or less half the total size of the two files. The others perform better, and 7z is simply the best. (The content of the files matters, of course: they contain the numbers from 0 to 4095 written in ASCII without spaces or newlines: few symbols, repetitive patterns, no randomness.)
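The per-member behaviour is easy to check: zip keeps an independent compressed stream for each entry, and the listing shows each file squeezed to roughly half its size.

  unzip -v files.zip    # the Length / Size / Cmpr columns show each member at about 50%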

Could it be better if I store just the first file plus a diff file? The results are:

1833 files.7z
1803 files.rar
5180 files.tar.bz2
7776 files.tar.gz
7891 files.zip


Zip now reaches the performance of tar.gz (we could say it still compresses the first file to about half, while the diff file is small and so does not contribute much to the final size). Rar seems slightly better than 7z (but the difference is negligible), and 7z itself does not gain much from this pre-processing: it almost seems to be the only one whose algorithm can exploit the repetitions on its own. Tar.gz's gain is small; tar.bz2's is bigger, but still not enough to keep pace with 7z.
(Small differences could also be due to the file name of the diff file, which is "file1.txt.file2.patch"; 7z and rar do not store the filenames in the clear, zip does, and tar surely writes filenames in the clear, though of course bzip2 and gzip compress them along with the rest.)
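Concretely, the pre-processing for this case is the analogue of the command shown further down, followed by archiving the first file together with the patch, for instance:

  diff <(xxd -c 1 file1.txt) <(xxd -c 1 file2.txt) > file1.txt.file2.patch
  7z a files.7z file1.txt file1.txt.file2.patch
  zip files.zip file1.txt file1.txt.file2.patch
  # ...and the same for rar, tar.gz and tar.bz2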

Now let's try with two 16k files (exactly 16*1024 = 16384 bytes) containing the same random bytes (no special care taken over the random distribution), except for one byte which differs in file2. The results:

16838 files.7z
32971 files.rar
20904 files.tar.bz2
16796 files.tar.gz
33086 files.zip
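(A pair like this is easy to produce, for instance with arbitrary choices for the source of randomness and the modified offset:

  head -c 16384 /dev/urandom > file1.rand
  cp file1.rand file2.rand
  printf 'x' | dd of=file2.rand bs=1 seek=8000 conv=notrunc   # arbitrary offset
  cmp -l file1.rand file2.rand   # should report exactly one differing byte; retry if the old byte happened to be 'x'
)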


Zip basically just stores the files. Rar is a little better, but not enough. Interestingly, gzip beats everyone, and 7z is a bit behind. Anyway, it seems that the 7z and tar.gz algorithms exploit the similarity between the two files (since gzip works on the tar stream, at a level where it is not aware of distinct files, it means it "explores" the stream and recognizes repeated patterns? better than bzip2? this needs further analysis... one plausible factor is that DEFLATE's 32 KiB sliding window spans both 16 KiB files, so the second one can be encoded almost entirely as back-references to the first). If I store only the first file and the diff file, I obtain

16824 files.7z
16606 files.rar
17114 files.tar.bz2
16654 files.tar.gz
16765 files.zip


Now almost all the archivers/compressors obtain the "same" result. Strangely, tar.bz2's gain is the worst. Rar, tar.gz and zip win over 7z!

No conclusions. The way I have obtained the diff file is:

diff <(xxd -c 1 file1.rand) <(xxd -c 1 file2.rand) >file1.rand.file2.patch

and similarly for the non-(pseudo)random case. By the way, this may not be the best way to produce a binary diff, but that is not important for this argument.
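As a sanity check, the patch really does carry the whole difference: applied to the hex dump of file1 it gives back the dump of file2 (using patch's -o to write the result elsewhere):

  xxd -c 1 file1.rand > file1.hex
  patch -o file2.hex file1.hex file1.rand.file2.patch      # the patch is a plain, normal-format diff
  cmp file2.hex <(xxd -c 1 file2.rand) && echo "patch reproduces file2's dump"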

2012-11-03

GolfRun is not running (too fast, or at all)

My tiny project GolfRun is stuck. There are several reasons: first, time. Ok, I am lying: I would have the time, if only I were willing to spend all my spare time on it. But of course I don't want to. Second, something that began as an "as fast as possible" C translation of a "high level" Ruby-powered interpreter took its own path towards becoming an almost serious (though still toy) language. I suspect the most horrid part is the set functions' implementation: naive, to say the least. I wanted them done quickly, telling myself the biggest and maybe most common lie (I will program it better later; I will fix it; I will enhance it... and so on). What I really wanted done quickly was indeed the whole thing, so that I could say: ok, there exists a fully working interpreter for the official GolfRun language specification. Wait! I have not finished the specification either! Maybe there's another lesson to learn: C is not the best language for fast "prototyping". C++ would perhaps have been better, but the parent-cousin GolfScript teaches that languages like Ruby are good at that. (Though they may fall short on performance, which is the main reason I began GolfRun!)
Saying it differently: to me GolfRun has already become unmaintainable. So tiny, so dirty (not the worst anyway!), so wrong: it needs to be rewritten from scratch, now with a clearer view of the goals. I already had the intention, and the chosen language was C++11, yes C++11; not for heavy object orientation, though, which I don't appreciate when stretched too far. I like C and I would like to keep a C implementation of GolfRun, but I need a pause.
Furthermore, the idea of writing a serious (as far as possible) and formal-ish specification of the language is cool (to me), but it eats time, turning a small project into a fake big one. So I will drop it and rather stick to simple and maybe non-exhaustive documentation.
It looks silly to apply metrics to a toy project, but I have just downloaded pmccabe, so I tried a run; the raw, uninterpreted results (with the -vT options, over the .c files only) are:
  • Modified McCabe Cyclomatic Complexity: 1171
  • Traditional McCabe Cyclomatic Complexity: 1297
  • # Statements in function: 3485
  • # lines in function: 7872
But indeed more interesting are the single-function results. Reading here suggests that there are 14 functions with moderate risk, 14 functions with high risk, and 2 functions that are "the most complex and highly unstable method function". It fits my own perception :) (But take into account that the functions with a complexity below 11, as counted by pmccabe, number 148; so are there 178 functions in all?...)
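For the record, the -vT run was simply over the .c files; sorting the per-function output is just a handy extra to spot the worst offenders:

  pmccabe -vT *.c                  # the options mentioned above
  pmccabe *.c | sort -nr | head    # per-function rows, ranked by the first column (modified complexity)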

Now let me drift away, using the simple "rewrite from scratch" mantra as a weak hook. If a project is becoming a bog, full of decaying, boiling spaghetti code, growing without guidance or design (enforced by some good software engineering principles, rules, whatever) and turning rather unmaintainable, a rewrite could be a great idea. It does not mean you have to throw away all the old code (unless it sucks totally): you can recycle it as well. But there exists what I would call business-driven development: the code is not important, the code is not the product, so if "it works", let it shine... until the next unexpected bug shows up, and then you have to blame the programmers, the lack of some good development paradigm and of tests (but topsitters ignore, or pretend to ignore, what TDD is, and the clues suggesting it's time to move on, leave well-established bad habits behind and enforce some good SE principles; I like very much "don't test your own software", which pairs well with TDD despite what you may think), the skills of the enchained programmers, and everything else except the lack of a plan to bring the code into the new century and reduce the likelihood of bugs.
"It worked"... but if the less you modify the code, the less the old bugs manifest themselves, then the code has never really worked (and what about a bug that starts dancing around just because some compile option has changed?!). It means we deserved the "works on the test machine" badge. It is still the programmer who did not rebel or leave, who decided to bow to some kind of wrong logic (sometimes with the excuse "it's the customer's will"), who did not say "I won't add features until we fix it all"... it is such a programmer who is to be blamed; this is the sad truth. You may think, my reader, that if we wait for a bug-free product we can wait forever before adding new features; true, if I were talking about bugs... but I am thinking about something more substantial: the whole software! No, I am almost joking now...
Nonetheless, it's also a common human pattern to shift responsibilities away (which is also the reason why rules and methodologies exist), so that in a specific environment we can say, or only think, one or more of the following: I am not paid enough to worry about it (even if someone could point at you screaming "it's your fault!" and you can get fired over that not-so-correct claim); it's not my fault (spaghetti-code entanglements create strong, unexpected coupling, with the help of global variables, uninitialized structs which by chance happened to be correctly zeroed for a while, or worse, until you added your code, and bad "pattern" cloning and copy-pasting done to keep the code looking the way it used to look and to avoid other developers screaming "oh why did you do it the hard way, we've never done it like that before" or similar claims); where are the black-on-white requirements?; I have tried to apply some good software engineering principles, but I am alone and it turned out to be impossible (I won't explain why any further), so I warned you that the probability of bugs occurring, even though programmers and testers did their job well, is far higher than the minimum acceptable threshold; and so on... the list can be long.
I admit I have a bad habit: when I am developing something and I notice something that could be a bug, I fix it. (Could be? Did I write could be? A bug is, or a bug is not... this is part of the problem, indeed, but I won't explain.) People may think that's ok: if it is a real bug, you have to fix it. Wrong. It must not be done (unless it is your bug!). You have to wait for the bug to manifest itself and then, not before, be the hero who fixes it quickly. This is a matter of balance. If you fix, say, 5 bugs you did not contribute to, bugs which never had the chance to put on their show, then nobody will notice. If you add 1 bug that wins its race and exhibits its evil, then you are the one who added that bug (not the one who added that bug but removed 5 of someone else's bugs). Thus a bug that nobody has noticed yet (and that you don't "own") is an asset for the developer, and should not be "spent" without publicly earning something. There are other interesting findings: half the people, with twice the skill, could maintain the same code at the same total cost, producing bugs less frequently by redesigning the software to reduce their probability; which makes the difference between "it works" (good) and "eureka, it works!" (surprise, not good). But it would mean more unemployed people and topsitters with fewer slaves to control; it seems there exist metrics where quantity is more important than quality, so that a topsitter managing 10 people is more important than one managing 3, even if their budgets are (almost) the same. Fuck the topsitters: what about the unemployed people? In a good world they wouldn't stay unemployed for long, they would find a job their skills are suited for (needless to say, maybe not in the same industry). But things are a little more complicated here. There would be a lot to say about topsitters too. But I'd better stop.
Final wandering words to summarize: it's not always only the developer's fault, but it is the developer who gets blamed, thus it's always the developer's fault. Developers are weakly bound to their seats while topsitters are strongly bound to theirs, so developers can be replaced swiftly and easily. Topsitters can hide their own faults, the architectural deficiencies, the lack of good software engineering principles, leaders or guiding rules, behind the developers' faults.
By the way, ant colonies are wonderful.

(Links about SE were chosen at random from a cheap Google search; they are by no means reference links.)