2012-11-25

Compressing

Sometimes I have to compress-archive files which differ not too much. I can see it with a glance, but computers have nothing similar to human vision... The archive formats I consider are zip (only if I have to share with windowsers who could be scared by other formats), tar.gz (tgz), tar.bz2 and if I don't care about file attributes, 7z. Among these, 7z compresses usually the most.
But since the files I am archiving have very repetitive pattern, I am not satisfied with the result of all archivers (except maybe 7z...).
Could we obtain better results in some way?

First, I have generated two files. They are the same file except for a byte in a certain position.

15274 file1.txt
15274 file2.txt


Then I have archived them:

 1847 files.7z
 3363 files.rar
 7142 files.tar.bz2
 7918 files.tar.gz
15342 files.zip


I have added rar too for the sake of completeness. Zip compresses each file at 50% more or less, so the final archive size is more or less half the size of the two files altogether. The others perform better, and 7z is simply the best. (The content of the file is important of course: it contains the number from 0 to 4095 written in ASCII without spaces or newlines: few symbols, repetitive patterns, no random).

Could it be better if I store just the first file and then a diff file? The result is:

1833 files.7z
1803 files.rar
5180 files.tar.bz2
7776 files.tar.gz
7891 files.zip


Zip reaches the performance of tar.gz (we could say it keeps making the first file half, while the diff file is small and so does not contribute too much to the final size). Rar seems slightly better than 7z (but the difference is negligible), and 7z itself does not gain too much from this pre-processing: it almost seems it is the only one that implements an algorithm able to crash the repetitions. Tar.gz's gain is small, tar.bz2 is bigger but not yet enough to keep the pace with 7z.
(Small differences could be due also to the file name of the diff file, which is "file1.txt.file2.patch"; 7z and rar do not store the filenames in clear, zip does, tar surely write filenames in clear, but of course bz2 and gzip compress them too).

Now let's try with two 16k (exactly 16*1024=16384 bytes) files containing the same random bytes (no special care in the random distribution), except for 1 bytes which differs in file2. The results:

16838 files.7z
32971 files.rar
20904 files.tar.bz2
16796 files.tar.gz
33086 files.zip


Zip stores the files. Rar is a little bit better, but not enough. Interestingly, gzip beats everyone, and 7z is a bit behind. Anyway, it seems like 7z and tar.gz algorithm exploits differences between files (since gzip works at a level where it is not aware of distinct files, it means it "explores" the stream and recognizes repetition patterns? better than bzip2? this need further analysis...). If I store only the first file and the diff file, I obtain

16824 files.7z
16606 files.rar
17114 files.tar.bz2
16654 files.tar.gz
16765 files.zip


Now almost all the archivers-compressors obtain the "same" result. Strangely, tar.bz2 gain is the worst. Rar, tar.gz and zip wins over 7z!

No conclusions. The way I have obtained the diff file is:

diff <(xxd -c 1 file1.rand) <(xxd -c 1 file2.rand) >file1.rand.file2.patch

and similar for the no-(pseudo)random case. Btw this could be not the best way, but indeed it's not important for this argument.

2012-11-03

Golfrun is not running (too fast or at all)

My tiny project Golfrun is stuck. There are several reasons: first, time. Ok, I am lying: I would have the time to, if only I would spend all my spare time on it. But of course I don't want to. Second, something that began as a "as fast as possible" C translation of "high level" Ruby powered interpreter, took its own path towards becoming an almost serious (though toy still) language. I suspect that the most horrorful part is the set functions implementation: naive, to say the least: I wanted them quick saying to myself the biggest and maybe most common lie (I will program it better later — I will fix it — I will enhance it... and so on). What I really wanted quickly done was indeed the whole thing, so that I could say ok there exists a full working interpreter for the official Golfrun language specification. Wait! I have not finished the specification either! Maybe there's another fact to learn: C is not the best language for fast "prototyping". C++ would have been better maybe but indeed the parent-cousin GolfScript teaches that languages like Ruby are good at that. (But they may fail in performance, which is the main reason I began GolfRun!)
Saying it differently: to me GolfRun has become already unmaintenable. So tiny, so dirty (not the worst anyway!), so wrong: it needs to be rewritten from scratch having now a clearer look of the goals. I had already the intent and the chosen language was C++11, yes C++11; anyway not for tough object orientation, which I don't appreciate when stretched too much. I like C and I would like to keep a C implementation of GolfRun, but I need to pause.
Furthermore the idea to write a serious (as possible) and formal-like specification of the language is cool (to me), but it erodes time changing a small project into a fake big one. So I will drop it and stick rather to a simple and maybe unexhaustive documentation.
It looks silly to apply metrics to a toy project but I have just downloaded pmccabe so I have tried a run and the uninterpreted results (with -vT options over .c files only) are:
  • Modified McCabe Cyclomatic Complexity: 1171
  • Traditional McCabe Cyclomatic Complexity: 1297
  • # Statements in function: 3485
  • # lines in function: 7872
 but indeed more interesting are the single function results. Reading here suggests that there are 14 functions with moderate risk, 14 functions with high risk, and 2 functions that are "the most complex and highly unstable method function". It fits my own perception :) (But take into account the fact that the number of functions, as counted by pmccabe, with less than 11 are 148 — are there 178 functions?...)

Now let me drift away using as a weak hook the simple "rewrite from scratch" mantra. If a project is becoming a bog, full of decaying boiling spaghetti code, growing without a guide and design (enforced by some good software engineering principles, rules, whatever) and becoming rather unmaintainable it could be a great idea to do it. It does not mean you have to throw away all the old code (unless it sucks totally): you can recycle it as well. But it exists what I would call business driven development: the code is not important, the code is not the product, so if "it works", let it shine — until the next unexpected bug shows up and then you have to blame the programmers, the lack of some good development paradigm and tests (but topsitters ignore or pretend to ignore what TDD is and that there are clues suggesting it's the time to move on and beyond well established past bad habits and enforce some good SE principles — I like very much "don't test your own software", which pairs well with TDD despite what you may think), the skills of the enchained programmers and everything else but the lack of a plan to bring the code in the new century and reduce the bugs likelihood. It worked... but the less you modify the code, the less old bugs manifest themselves means that the code has never worked for real (and what about a bug dancing around because some compiling option has changed?!). It means we deserved the "works on the test machine badge". It is still the programmer who did not rebel or left, who decided to bow to some kind of wrong logic (sometimes with the excuse "it's the customer's will"), who did not say "I won't add features until we fix it all")... it is such a programmer to be blamed; this is the sad truth. Though you can think, my reader, that if we wait to have a bugs-free product, we can wait forever before we can add new features; true if I were talking about bugs... indeed I am thinking about something more substantial: the whole software! No, almost joking now...
Nonetheless it's also a common human pattern to shift responsibilities away (but this is also the reason why rules and methodologies exist) so that in a specific environment  we can say or only think one or more of the following: I am not paid enough to worry about it (even if someone could point at you screaming "it's your fault!" and you can get fired for this not-so-correct-claim); it's not my fault (spaghetti code entanglements give strong unexpected coupling with the help of global variables, unitialized structs which by chance happen to be correctly zeroed for a while or worse before you added your code, bad "pattern" cloning and copy-pasting in order to keep the code looks like it used to look and avoid other developers screaming "oh why did you do it the hard way, we've never done it like that before" or similar claims); where's the black-on-white requirements?; I have tried to apply some good software engineering principle, but I am alone and it turned to be impossible (I won't explain why further), so I have warned you that the probability of bugs occurring despite the fact that programmers and testers did their job well is far higher than the minimum acceptable threshold; and so on... the list can be long.
I admit I have a bad habit: when I am developing something and I notice something that could be a bug, I fix it. (Could be? I wrote could be? A bug is, or a bug is not... this is a part of the problem, indeed, but I won't explain). People may think that it's ok, if it is a real bug, you have to fix it. Wrong. It must not be done (unless it is your bug!). You have to wait the bug to manifest and then, not before, be the hero fixing it quickly. This is a matter of balance. If you fix, say, 5 bugs you have not contributed to, bugs which never had the chance to do their show, then nobody will notice. If you add 1 bug that wins its run exhibiting its evil, then you are the one who added that bug (not the one that added that bug but removed 5 other someone else bugs). Thus, a bug that nobody noticed yet (and that he does not "own") is an asset for the developer, and should not be "spent" without earning publicly something. The are other interesting finding: half the doubled skilled persons could maintain the same code at the same total cost, making less frequent bugs by redesigning the software in order to reduce their probability — which makes the difference between "it works" (good) and "eureka it works!" (surprise, not good). But it means more unemployed persons and topsitters having less slaves to control; it seems like they exist metrics where quantity is more imporant (than quality) so that a topsitters managing 10 persons is more important than one managing 3 persons, if even their budget is (almost) the same. Fuck the topsitters: what about unemployed persons? In a good world they wouldn't be unemployed for so long, they would find a job their skills are suited for (needless to say, maybe not in the same industry). But things are a little bit more complicated here. There would be a lot to say about topsitters too. But it's better I stop.
Final wandering words to summarize it: it's not always only developer's fault but it's the developer to be blamed, thus it's always the developer's fault. Developers are weakly bound to their seats while topsitters are strongly bound to their seats, so developers can be replaced swiftly and more easily. Topsitters can hide their own faults, architectural deficiencies, the lack of good software engineering principles and leaders or guiding rules behind the developers faults.
By the way, ants colonies are wonderful.

(links about SE are chosen at random from a cheap google search, they are by no way reference links)
 

2012-09-02

No magic

This is a nice technically correct suggestion. But if magic numbers would exist, why shouldn't I use them? (just to say I am still here or somewhere, maybe not kicking but alive).


2012-04-09

Modelling social relationships using a knowledge base

Weeks ago I attended Codemotion 2012. Among the talks I was interested in, there were those related to graph databases (personal consideration: currently I think classical relational databases are misused more often than one can imagine at first). At the first appearing of the first example, some old memory came into my mind: it was about Prolog. Each time I have taken a look at Prolog, I've stumbled on examples like these: Prolog describing genealogy (X has A and B as parents, Y has Z as sibling...), tastes (X likes A...), ... and of course graphs.
I had not thought about the analogy until then: a Prolog collection of facts is nothing but an internal representation of a graph database (by the way it is the opposite: the knowledge base made of Prolog facts can be represented as a graph, hence the idea of a graph database).
So I fired my gprolog installation (the most unused bunch of bytes on my hard disk, but I like to have the larger possible number of programming languages/environments installed) to do a little bit of testing. Just for the fun of it. But since nowadays "social stuffs" have a lot of momentum I wanted to create a small knowledge base focused on who knows who, who likes who, who loves who... and the possible conseguences. The imagined application is for «services» like "find friends", or dating sites, or alike.

In the given example there are five persons with fantasy names Mary Brod, Carl Stuckart, Rudolf Fisher, Amanda Least, Minority Report and the following informations: birth date, gender, sex orientation, who knows who, who likes who, and who loves who. Then there are some educational rules (a rule for friendship i.e. reciprocal knowledge, another to ask the system if two persons are "sex compatible") and some helper rules (get the birth year, compute the age). Everything is of course semplified: human relationships happen to be a little bit more complicated.

The following messed up graph (generated by graphviz's dot starting from a source generated by a perl script "parsing" the prolog knowledge base file) depicts the relationships between those persons; black edges are about "who knows who", blue edges are about "who likes who", red edges are about "who loves who"; the background of each nodes represents the gender of the person, while the male-female symbols are about their sex orientation.
The knowledge base file can be now used to do queries (I mean, Prolog queries). They are rather intuitive. For example, if we want to know all the people who know Minority, we write this query:

findall(P, knows(P, 'Minority Report'), L).

which returns the list L = ['Mary Brod', 'Carl Stuckart', 'Rudolf Fisher', 'Amanda Least'], and in fact everyone knows Minority (who doesn't?).

The knowledge base as it is written allow to check if two subjects can have sex according to the following criteria: if A and B love or like each others, and their gender is in the list of preferred genders (which I've called sex-sexual orientation) of the may-be-partner, and if their age is over 18 or is over 13 but the difference between ages is no more than 5, then they might have sex. Let's try between Minority Report and Amanda Least. Unfortunately for Minority, who is very young, they can't (because of the age difference is too great). Let's try between Mary and Amanda... no luck yet, since Amanda is not bisexual, so Mary's unrequited love can't be finalized into sex.

Instead of going on this way, let's ask the system to list, for each person, a possible lover. The query I would do looks like

setof(X, canHaveSex(Person, X), List).

We are asking the set (just to be sure not to have repetition, since some person could match more than one sex compatibility criterion) of persons X the person Person can have sex with (according to the stated rules and the knowledge in the database). The answer is

List = ['Rudolf Fisher']
Person = 'Amanda Least'

which can be read as: the Person Amanda Least is sex compatible with Rudolf Fisher. They are in love, in fact.

And who has  unrequited loves?

findall(P, (loves(P, X), \+ loves(X, P)), L).

gives

L = ['Mary Brod','Carl Stuckart','Rudolf Fisher']

i.e. Mary, Carl and Rudolf love a person who does not love them. People who loves more than one person are potentially polygamous, and this is the last query we do, promise:

setof(X, loves(P1, X), L), length(L, N), N > 1.

That is, find the set L of persons X loved by P1, where length of L is greater than 1, i.e. P1 loves more than one single person. The variable matching the criteria are:

L = ['Amanda Least','Mary Brod']
N = 2
P1 = 'Rudolf Fisher'

thus, Rudolf is potentially polygamous.

The way the queries can be done is rather intuitive (once you get accustomed a bit to Prolog programming), and likely the algorithms behind the scenes (backtracking and so forth) are already exploited in "digital social contexts", I think (and if they are not, why not?)


2012-03-25

C++11

To me the most interesting feature of C++11 I am aware of are lambda functions. Though, I think also the general direction taken by C++ is leading it to a very obscure path, where other languages shine since the beginning, and where the need to use C++ rather than those other languages is not clear to say the least. Differently speaking, the most interesting features could be the one programmer could use less, simply because when you need to use them in all their beauty, you are on the edge of considering sticking to another language to do the same thing.
Moreover, C++ syntax is someway cumbersome and not clean. In some cases (not so much to be worth noting, yet they jumped in my sight) it is misleading; I believe the usage of the same keyword to mean two different things is not good (C too has this flaw, e.g. the keyword static), and if the previous standard had a third meaning for it, the situation could be confusing (I am thinking about auto). Yes, it's not a hokum but it gives the impression of something too layered and hard to handle. Modern languages without a heavy heritage have less or nothing at all of this kind of flaws.

Anyway, let's see an example of folding done with the use of a lambda function.



Not particularly useful, but it shows the equivalent in C++11 of what in other languages we can write. E.g.

Smalltalk



Common Lisp




Haskell



Erlang



Perl5



Python



All these snippets run effortlessly and without the need of adding other lines of code; C++ needs "boilerplate" code, though the working line is as simple as



(note the use of -> here, which is not syntactic sugar for (*ptr_to_obj). as it is in other contexts).

It could be funny to use some of the new C++11 features; sometimes it could be even useful; but most of the time I believe it won't be a need and C++ is becoming one of the languages with a very large set of (maybe interesting) features which are even the most misunderstood, misused, or simply ignored.

2012-02-17

Whitney Houston punished by God because of her conversion to Islam

This is not my opinion of course, but sadly it is what you can read in this article by Carlo Di Pietro, a guy belonging to a sect called "Milizia di San Michele Arcangelo" (militia of St. Michael the Archangel), a scary denomination for a scary group of mad people who hide their madness behind this description: the community of "Milizia di San Michele Arcangelo" is a group of catholic persons engaged in spreading the right Christian cult of devotion to the saint angels and in particular to St. Michael the Archangel.
The article is hosted by a "newspaper" called Pontifex, edited by Bruno Volpe and the same Carlo Di Pietro. Pontifex  is nothing but medieval propaganda against different religions, against homosexuality and in general against every other things that do not fit into their narrow minds.
I think that all Catholic and Christian persons should be deeply ashemed by the violent words these guys write in name of their own god.
But let us take a look at one set of these words about the death of Whitney Houston (I won't translate everything, just the "most important" parts; italics or other emphasis is added by me):

Whitney Houston is dead: another talent who goes away. Admonition of God because of her recent conversion to Islam?
This morning I knew that the famous singer and actress Whitney Houston has died and I do not hide the fact that I felt deeply perturbed because of the great skills of a woman who unfortunately has chosen the large door, and has paid. [...]
Let us pray for this great artist, who's died (as it seems) because of diabolic vices, of a strongly troubled soul and harassed by many temptations. Unfortunately the success is not synonymous of happiness and too often they search for help and consolation in the wrong persons, false beliefs and destructive artificiality.
You can't believe it, since (as reported by Il Messaggero) she had found serenity again thanks to her conversion to Islam. ["It is strange (the fact that is happened), since"... I have no idiomatic english translantion for "E pensare che...", lit. it would be: "And to think that (she had ...)"].
The news, spread in the recent February, had found very much resonance in the american islamic sites that, with great exultation, welcomed a great star.
The very same Houston, according to rumors, stated that she had found in Islam a dimension of serenity, after the hard personal life events the star had been involved in.
These are the results.
Admonition of God? Who can say it, we can't know certain dynamics, even if the event and the sad coincidence make it clear.
The good example must be given first by the known persons who, having "received" so much, so much they must "give" first from a spiritual point of view, then from a moral and material point of view.
God have mercy of her soul.

The main topic of the article is of course her conversion to Islam and the main purpose is to "prove" that it is a bad thing that has caused her death: punishment for her bad choices. Simplistic propaganda techniques at work, naively used.

I hope no one (except maybe the common Pontifex readers) needs to show where this article is illogical, to say the less, even in the constructs of the sentences (I've already done more than the necessary italics-izing or coloring some words). It's a clumsy usage of very basic propoganda techniques. My fear is that it is enough for the average reader of this "newspaper".