2011-06-26

Physicists are not physicians

I've just realized that I have often confused the word "physicist" with the word "physician". Luckily, after a quick check, I found I was never so bold as to label myself a "physicist" (since I've currently discontinued my physics studies, alas) in my CV or site profiles; otherwise I would have written "physician" instead. In forums and blogs here and there, though, it may have happened. So forgive me: it is very likely that where you read the word "physician", I meant "physicist".

2011-06-23

Who gets it right? Enterprise encoding hell

One of my favourite topics about computers is character encodings. Years ago I began to write several lectures about computers and programming, and the first ones were all about this subtle issue.

The fact is that people fail to understand that computers do not understand; they just execute code (that is, data which can be interpreted as code by a special interpreter, the processor). They are just programmed to interpret data in a particular way and maybe to represent them in some way. But there's no meaning in the data themselves. So if you use the wrong interpreter for some data, what you get is garbage, or a polite error message. If you have the right interpreter but you feed it data it can't interpret, you again get garbage as output, or a polite error message.

What we read as letters are, for a computer, just data (bytes), no more, no less. These data can be interpreted as letters, but in order to do so the interpreter must know (be programmed to use) a sort of map between the data and the symbols shown on the screen, on the mobile phone display, or wherever.

Nowadays the ASCII "standard" is so widespread that we can disregard any other standard (like EBCDIC), though it is likely still alive somewhere. The ASCII encoding "states" that the byte XY, when interpreted as a "letter", must be shown as the letter "L"; put numbers (another interpretation of bytes) instead of XY and other letters instead of L, and you have a way of interpreting a sequence of bytes as a sequence of letters. ASCII also has bytes that have no "printable" symbol associated; e.g. there's a byte that means go to the next line (new line), one that means put a space (space), and so on.
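
To make the "map" idea concrete, here is a minimal sketch in C. The names ascii_has_glyph and dump_bytes are mine, invented for illustration; the only facts assumed are the ASCII values themselves (0x4C for "L", 0x0A for new line, 0x20 for space).

```c
#include <stdio.h>
#include <stddef.h>

/* Returns 1 if ASCII associates a visible symbol with this byte, 0 otherwise.
   Space (0x20), control codes and anything above 0x7E have no glyph. */
int ascii_has_glyph(unsigned char byte)
{
    return byte > 0x20 && byte < 0x7F;
}

/* Shows each byte both as a number and, where possible, as a letter:
   the same data, two different interpretations. */
void dump_bytes(const unsigned char *data, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (ascii_has_glyph(data[i]))
            printf("0x%02X -> '%c'\n", (unsigned)data[i], data[i]);
        else
            printf("0x%02X -> (no printable symbol)\n", (unsigned)data[i]);
    }
}
```

Feeding dump_bytes the two bytes 0x4C and 0x0A shows the first as the letter L and the second as a byte with no printable symbol; nothing in the bytes themselves says which reading is the "right" one.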

Anyway, ASCII does not have many special symbols, accented letters and so on (since it was born as an American/English-only thing). So the pain starts when you want that nice symbol to be seen on every computer that interprets your data as a document made of readable symbols.

The encoding Babel begins at that point. It could be interesting to trace the history of character encodings. For any practical purpose, though, it is more interesting to acknowledge that great efforts were made to reduce this Babel and at the same time to increase the number of symbols representable in a "unified" manner.

But there's no way to escape from the choice of the correct data to be fed to an interpreter that expects those data.

So if you, as a programmer, write in your favourite IDE or editor a line like this:

printf("il costo è di 10€"); //avoid this!

or if you feed your DB with a statement that looks to you like

INSERT INTO aTable VALUES('il costo è di 10€', 10); -- avoid this!

and you know nothing about the encoding your IDE/editor uses, the encoding your DB accepts or translates into, and the encodings understood by the systems where you expect those texts to be displayed, then you are getting it badly wrong, to say the least.

As for the DB, when you want to be sure it does not "scramble" your text, you must force it to believe it is data, not text. You lose something (ordering capabilities according to specific collating sequences), but you are sure that the bytes you wrote are "transmitted" as they were to the final end-point. (The question of which encoding those bytes use to represent "è" and "€" remains, i.e. you must input those symbols in a desired, known encoding, or input them as bytes, if possible.)

Now, some people may ask how it is possible that I wrote "è€" and the reader reads "è€". HTML (which is ASCII-based) has special "meta-information" that is used to specify the encoding of the content. If the encodings match, I have succeeded in showing you the symbols I see: we are telling the "interpreter" (the browser, which must choose the right symbol according to the input) which "map" to use.

The default encoding for web pages, if nothing is specified, has traditionally been ISO-8859-1, AKA Latin-1, which is an 8-bit, single-byte superset of the ASCII encoding: the first 128 codes of Latin-1 are the same as those of ASCII (which has "only" 128 codes). (Instead of ISO-8859-1, it could be ISO-8859-15, aka Latin-9; among a handful of replaced characters, this one has the "euro symbol" in place of the generic "currency symbol".)
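
To see how the same symbol becomes different bytes under different "maps", here is a small sketch in C that re-encodes ISO-8859-1 bytes as UTF-8. The function name latin1_to_utf8 is hypothetical; the sketch relies only on the fact that the 256 Latin-1 codes coincide with the first 256 Unicode code points.

```c
#include <stddef.h>

/* Re-encodes len bytes of ISO-8859-1 text as UTF-8.
   out must have room for 2 * len + 1 bytes (worst case plus the final NUL).
   Returns the number of bytes written, not counting the NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i, o = 0;
    for (i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* the ASCII half is identical in both encodings */
            out[o++] = c;
        } else {
            /* codes 0x80..0xFF become a two-byte UTF-8 sequence */
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

The letter "è" is the single byte 0xE8 in Latin-1 and the two bytes 0xC3 0xA8 in UTF-8. The "€" cannot be converted this way at all, since Latin-1 has no code for it: it is 0xA4 in ISO-8859-15 and the three bytes 0xE2 0x82 0xAC in UTF-8.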

Anyway, the most general encoding is UTF-8, which is a way to encode the biggest "map" available, the "map" that tries to hold every symbol humans have created (UCS, i.e. Universal Character Set). Many modern operating systems use it as their default encoding, but of course this is not always true.

To go a little deeper into the matter, Wikipedia has articles on the subject. It is worth reading a lot about the topic, even for employees of big enterprises.

2011-05-17

Languages can't be compared simplistically

Through the CodeProject mailing list I landed on the kind of article that is frustrating to read, since it is totally pointless and in some ways false. So this post will be too.

Reasonable note

Likely the only thing to be remembered is that every single existing programming language has its own advocates. All have their strong reasons and subtle arguments to say their language of choice is better than language X or language Y.

So the most important thing to stress is that they are all wrong, and at the same time they are all right. It just depends on the point of view, programming habits and purposes, knowledge bases, and other elements I forget.

Ok, stop with the democracy.

Everything is Number

One thing that must be remembered is that computers understand only one language. This means that no matter which language you use, it must somehow "map" to machine language (of which assembly is the human-readable form). To be extreme, high-level languages are just syntactic sugar for the Only Language.

Of course, this fact does not mean that for a human it is the same thing to write code in assembly and, say, Ada. This only means that each language potentially can do everything "machine language" can, and no more.

The way humans can express an algorithm in a computer language matters, however, and this is a reason to create high-level languages. For an average human being it can likely be considered impossible to model a complex program in assembly, though, as said before, assembly can do everything.

Moreover, there are other advantages when we stick to high-level computer languages: readability (other programmers can tell what your code does), maintainability (you can add features and drive out bugs faster and more easily), portability (HL languages can provide abstraction from the hardware), ...

But all these features, and more that I forget, come at a price: other programmers did the hard job of writing a lot of code (at the very beginning, in a very low-level way). And this happens again and again, at different "layers", e.g. when we use a new expressive syntax of an evolving language, or "new" classes or methods of the STL or Boost libraries, parts of which are becoming new baggage for standard C++.

One consequence of this reasoning is that, even though each language deserves its niche and outside of it another language can likely be considered more powerful, we can pick our language of choice and extend it as we want to reach the expressiveness of another language, using (ad hoc) pre-processors / pre-compilers and writing proper libraries of functions. (Don't forget that C++ was once a sort of pre-compiler that produced C code, and the GNU Sather compiler works this way too...)

Of course, why would you create handmade tools/libraries when there are libraries that already do what you want, or why should you not switch to another language? Sometimes this can't be done easily (I am thinking about jobs and tons of already written code), but even when it can be done, it does not prove that the language you are leaving is inferior to the brand new one you're going to use.

But this is, briefly, the argument of the article: C++ is superior since something can be done easily (while C can't at all... and we have already said this can't be true).

Expressiveness matters

I agree that "expressiveness" matters, as said above... However, a language proves to be less expressive or more expressive in a particular context, not per se. So again, expressiveness (in a particular context) does not prove that a language is superior or more useful than another.

Taking this argument seriously and picking a specific problem to be solved, we can show that e.g. Matlab is superior to C++... and repeating the procedure we can show that language X is superior to Matlab, and Y to X, and so on.

This escalation theoretically ends with the language Z, the most expressive and powerful language, superior to any other language. Then, why are we losing our time with inferior languages like C, C++, Matlab...?

Focusing on specific features that "show" C's badness or shortcomings, we can find another language that has those features, plus a set of other features that C++ lacks; so we have proved that C++ is vastly inferior.

Here comes the article

Now let me go silly mad on some sentences of the article.

I will not concentrate on the actual implementation of this matrix type functions, as that's rather trivial in both C and C++. What I will concentrate on, however, is the usage of such a type in each language.

It is even less of an effort in those languages that support matrix manipulation directly: you don't need to implement anything, just use it! Anyway, is it really trivial to implement such a thing, in both C++ and C? I suspect it is not as easy as it seems.

And about C: we don't need to reimplement the GSL (GNU Scientific Library) (still waiting for sparse matrices; maybe one day we'll see them; in the meantime, we can stick to Matlab or Octave... of course, someone has written that code, in C++ or even C...). In C we could wrap some operations to provide the wanted laziness feature, and of course we can wrap GSL matrix operations in a C++ class. We obtain syntactic sugar, basically; I would stick to GNU Octave or R, anyway.

The implementation, as you say, is also important... since it dictates how you can use the matrices and the kind of errors you can make if you use the API in the wrong way... Programming errors are possible in any language, not only in C.

The C implementation, however, is not as straightforward because C does not offer a direct paradigm for this. If the C version wishes to match the C++ version in speed and memory usage, it will have to basically simulate the C++ implementation using a struct and functions which handle objects of that struct type (including things like initialization and destruction)

C++ does not offer a direct paradigm for matrix computation either. Again, Matlab/GNU Octave win over C++. Are you talking about operator overloading? Yes, it is an interesting thing that makes the code easier to read... but any clue as to how performance (speed) and memory usage enter the discussion? They are totally unrelated, if we must ignore the implementation; the C++ implementation can't be magically faster and less memory-hungry just because it is C++, or because it allows operator overloading... Indeed it is exactly the opposite. The class/object machinery consumes (a bit of) CPU time and a little more memory; if we wrap a smart C implementation, properly and smartly translated, into a C++ class, the C++ version will anyway eat more CPU time and memory!

If you don't like the operator overloading of the Matrix class for whatever strange reason, you could always write the equivalent named functions instead. It's not really the main point here.

Indeed, I thought it was one of the most important points! So you are saying that instead of
Matrix c = 2*a + b;
we can write
Matrix c = a.mult(2).plus(b);
Another example could be
Matrix c = a.plus(b).mult(2);
How do you translate this? Isn't this a possible source of error?! Do you think it is not so because C++ is superior? And what about C++ programmers? Wasn't it really an important point??

Your C equivalent is not complicated; it is simply "strongly procedural" (so to say); nothing complicated. Anyway I would have done something like
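int foo()
{
    /* matrix, matrix_ctx and the matrix_* functions belong to an imagined library */
    matrix a = NULL, b = NULL, c = NULL;
    matrix_ctx ctx = NULL;

    matrix_start(&ctx);   /* pass the address, so that the context can be created */

    a = matrix_new(ctx, 1000, 1000);
    b = matrix_new(ctx, 1000, 1000);

    c = matrix_assign(ctx, c, a);                   /* c = a */
    c = matrix_mult(ctx, c, matrix_unit(ctx, 2));   /* c = 2*a */

    c = matrix_sum(ctx, c, b);                      /* c = 2*a + b */

    matrix_end(ctx);      /* releases every matrix tracked by the context */

    return some_value;
}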

The context is there with multithreading in mind; if that is not a concern, a hidden static context can be used inside the matrix library, avoiding the extra argument. There's no need for a label here: all functions must be safe when null pointers are given. Of course, if you need to, you can free a single matrix with matrix_free(ctx, m).

coding conventions need to be obeyed to avoid memory leaks, something which was not necessary in the C++ version

Are memory leaks impossible in C++? What happens if your implementation uses new and you never call delete? If C++ has a pool for newly allocated memory that is released at the end of the program, ... then implement a similar pool for the C matrix "object": you never allocate the object yourself, so how it is done is hidden (and tracked) inside the implementation; you only have to remember to call matrix_end(ctx) somewhere, maybe in an atexit function. (There could be a super-context, accessed properly using a mutex, so that a multithreaded application only has to call something like matrix_terminate() to destroy all contexts.)

The C version is also easy to break accidentally. How easy it would be to simply write something like:
Matrix c = a


... Real programmers do not accidentally write such a thing once the API clearly mandates the use of a function like matrix_assign or whatever... In my implementation, with that code c would simply be an "alias" for a that is not tracked (by the reference counter). And anyway, even your innermost "simple" implementation of the Matrix class could be accidentally broken by some easy-to-make mistake...

The assignment works in C++, once you've correctly implemented the needed method to do it. And if you accidentally are able to write Matrix c = a, ...

Error handling

Proper error handling is always a problem in complex code; sure, languages allowing try-catch blocks or similar (and raising/throwing exceptions in general) are a great help... but proper error handling existed before these "tricks" became widespread.

In my "imagined" implementation, foo() does not need any error check except the one needed to know whether the result of the computation is valid. If the result of the computation is, as in foo(), just an integer, and all integer values are possible, other methods must of course be used... We could use the super-context, which can hold (per thread) the result of the last operation; since a failure in one operation can cause the failure of all subsequent operations, we don't need to add more code to handle anything, except at the exit of the function foo(), where we check the "super-context per-thread error code" just to know whether the integer holds a correct value or not.
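
A minimal sketch of this "sticky" error idea, in C. To keep it short it works on plain doubles instead of matrices and on an explicit context instead of a per-thread super-context; calc_ctx, calc_div, calc_add and foo_sketch are names invented here, not part of any real library.

```c
/* The context remembers the first failure; later operations become no-ops. */
typedef struct {
    int error;   /* 0 = everything fine so far, non-zero = something failed */
} calc_ctx;

double calc_div(calc_ctx *ctx, double a, double b)
{
    if (ctx->error)
        return 0.0;              /* a previous step failed: do nothing */
    if (b == 0.0) {
        ctx->error = 1;          /* record the failure and go on */
        return 0.0;
    }
    return a / b;
}

double calc_add(calc_ctx *ctx, double a, double b)
{
    if (ctx->error)
        return 0.0;
    return a + b;
}

/* Computes (x / y + 1) / 2 with no error check between the steps;
   the only check is the one at the exit. Returns 0 when *result is valid. */
int foo_sketch(double x, double y, double *result)
{
    calc_ctx ctx = { 0 };
    double r;

    r = calc_div(&ctx, x, y);
    r = calc_add(&ctx, r, 1.0);
    r = calc_div(&ctx, r, 2.0);

    *result = r;
    return ctx.error;
}
```

The caller looks at the returned code once, after the whole chain, which is where the validity of the computed value actually matters.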

It must also be stressed that try-catch presupposes correct point-by-point error checking inside the class in order to throw the exception (i.e. to catch all errors you still need if-then-else or whatever); that throwing an exception is, after all, jumping; and that making C mimic the C++ way of doing error handling is a bad idea: to do correct and easy error handling in C, you just have to think differently.

Initialization and destruction

Derived types embedding the C matrix "object" need a whole new set of functions to handle them. Then they can be used without worrying about what's inside... C++ classes hide the work in constructors and destructors; a class using Matrix objects allocated with new needs to deallocate them in its destructor. The Matrix class hides the initialization and destruction details, and so do the matrix_* functions; my implementation vastly simplifies the usage, allowing the user not to worry too much about allocations.

Each time I need to make an "instance" of SomeCompoundStruct, I just need to call the proper function. It will do all the work, exactly as the code produced by the C++ compiler frees the memory used by an object when it goes out of scope (or the program ends). This is just because we are avoiding "new".
typedef struct
{
    Matrix* mainMatrix;
    Matrix* secondaryMatrix;
    int someValue;
    int anotherValue;
} SomeCompoundStruct;
Would this work without additional code? And if not, would this prove C is inferior??!

Anyway, C indeed has no "compound types" that can be handled like primitive types, since it lacks the ability to embed "behaviours" into a new type. You have discovered that C is not an OO language (though such features could be provided even by non-OO languages).

The final question for this section is: do we need such a SomeCompoundStruct?

Data containers

void bar(int amount)
{
    std::vector<Matrix> matrices(amount, Matrix(1000, 1000));
    //...
}

The above line creates an array which contains 1000x1000-sized unit matrices. Since, as specified, Matrix uses copy-on-write, each matrix in the array shares the same matrix data, so there's only one such data block allocated after the array has been created. This can be an enormous memory saver if not all of those matrices are modified.

Small note: the fact that Matrix(1000, 1000) gives a unit matrix is just an assumption about an imagined implementation; nowhere is it made clear that Matrix(1000, 1000) means anything other than "give me a void 1000x1000 matrix"... I would rather implement a sort of factory to provide useful initialized matrices, and keep Matrix(1000, 1000) for the void matrix (uninitialized here means made of all 0s or null objects).

The code translated into C is
void bar(int amount)
{
    /* vector, matrix, matrix_ctx and all the functions below are imagined */
    vector matrices;
    matrix a;
    matrix_ctx ctx = NULL;

    matrix_start(&ctx);   /* pass the address, so that the context can be created */
    a = matrix_new(ctx, 1000, 1000);
    a = matrix_assign(ctx, a, matrix_unit(ctx, 1));   /* a = unit matrix */

    /* amount elements, each one initialized from a */
    matrices = vector_new(amount, a, NULL);

    // ...
    vector_free(matrices, NULL);
    matrix_end(ctx);
}
When I need basic containers and useful functions, I use GLib; here I imagined a possible implementation of a "vector object" that mimics (from a prototype point of view) the way an STL vector can be initialized (and how it is used in this example).

I don't need to go into the details of the implementation of those functions, just as you did not need to go into the details of the implementation of the STL vector class.

Basically one line of C++ code requires a dozen of lines of C code.

This is just because you are using libraries that already do what you need, while you are imagining having to implement what you need in C from scratch; of course C code, lacking some syntactic sugar and runtime help, will be "more verbose", but not necessarily that much. Are you saying that it depends on where you "move" the complexity, i.e. "where" you delegate the actual "computation"? Oh, well, this is almost the whole point of top-down/bottom-up approaches, maybe.

The reason C doesn't offer such an utility is because it can't, and that's one of the major problems with the language

This is a totally wrong perception of what a language is. As said far above, C can. How it does it is, at most, the difference; and at what cost, if it were you who had to write the code from scratch. Not so for std::vector: someone else wrote the class for you, and made it a standard class of C++ (the existence of the Boost libraries shows that even C++ with the STL lacks utilities programmers would like to have). Here, however, you are talking about standard libraries, confusing the richness of one and the poverty of the other with the abilities and features of the language.

Implement the code to run my snippets, add more glue code, invent syntactic sugar (handled by a pre-processor/pre-compiler), ... and you'll have another C++... This does not prove C++ is superior; it just proves that C and C++ are equivalent.

Linked list

Linked lists were implemented in C eons before C++, templates, ... were in any human mind... As, on the other hand, they were in assembly... Generalization (i.e. a set of functions to handle lists of any kind of object) is possible.

That's because C cannot offer any rational generic linked list implementation which would work with any user-defined type

The use of "opaque pointers" is what is needed here. Proper casting and well-thought-out functions, which use user-defined functions to handle the creation and destruction of the objects in the list, can be used to create generic list-handling functions able to "link" any kind of "object" (represented as a pointer/reference to the actual object).
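
Just to give an idea of the size of the thing, here is a minimal sketch of such a list in C. All the names (list, list_new, list_push, list_destroy, list_free_fn) are invented for this example; a real library such as GLib offers much more.

```c
#include <stdlib.h>

/* User-supplied function that knows how to destroy one stored object. */
typedef void (*list_free_fn)(void *item);

typedef struct list_node {
    void *item;                 /* opaque pointer to the actual object */
    struct list_node *next;
} list_node;

typedef struct {
    list_node *head;
    size_t length;
    list_free_fn free_item;     /* may be NULL if the list does not own the items */
} list;

list *list_new(list_free_fn free_item)
{
    list *l = malloc(sizeof *l);
    if (l == NULL)
        return NULL;
    l->head = NULL;
    l->length = 0;
    l->free_item = free_item;
    return l;
}

/* Links a new item at the front; returns 0 on success, -1 on failure. */
int list_push(list *l, void *item)
{
    list_node *n;
    if (l == NULL)
        return -1;
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->item = item;
    n->next = l->head;
    l->head = n;
    l->length++;
    return 0;
}

/* Destroys the nodes and, through the user-defined function, the items. */
void list_destroy(list *l)
{
    list_node *n, *next;
    if (l == NULL)
        return;
    for (n = l->head; n != NULL; n = next) {
        next = n->next;
        if (l->free_item != NULL)
            l->free_item(n->item);
        free(n);
    }
    free(l);
}
```

A list of heap-allocated integers can simply pass free as the destructor, while a list of matrices would pass whatever function the matrix library provides for that purpose.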

The amount of code required for that is quite significant, and the code will be complicated, error-prone and hard to follow

It would not be so big; anyway, once written you hide it in a library and just use the API... and it won't be more error-prone than the implementation of an STL class. And you don't need to follow it, just use it... exactly as you do with the STL classes.

Nested containers

Once you have the right code for C vectors (of opaque objects) and C lists (of opaque objects), and given a set of functions to handle all these objects, the rest is straightforward. It is longer than C++ (not counting the code of the STL classes, of course, nor the code to handle vectors and lists in C, which we built into a library), but not as complicated and inefficient as you believe.

Moreover, the above code is completely safe and efficient.

The efficiency of the code depends on the implementation. Even if I can totally trust (and why should I?) the implementation of the STL classes, should I magically agree that the Matrix class is inherently efficient?

Yet the executable binary produced by the compiler will be as efficient as it can get.


With respect to...?

This is the beauty of C++. When modules are properly designed, it makes it extremely easy to take a module and reuse it with something user-defined

This is the beauty of many languages. In particular, as you recalled at the beginning, C++ is not much appreciated by some with respect to its OO features. And I largely agree, but this is another topic.

The point here is that the sentence is good for C too. If modules (libraries) are properly designed, they can be reused with something user-defined easily.

The functionality in the example above can be reproduced in C, but it will be extremely complicated and hard to follow, and very error-prone

Like writing a compiler for a language like C++, or implementing the STL classes... I bet their code would be hard to follow... and very error-prone, but we know how complex code evolves, how functions/modularity hide complexity, and how bugs get fixed, don't we?

Copying containers

This single line of code translates to hundreds of lines of complicated and unsafe C code, which requires minute attention to detail and strict following of coding conventions

Again, you are forgetting that there's actual code behind the scenes, and that it was written by someone. So, I will imagine there's a set of libraries/modules that provides C containers and all the needed stuff, so that I can simply write
array2 = vector_copy(array2, array);
// ...
vector_destroy_all();
matrix_destroy_all();
// ... and so on...
This basically means: the API can be clean and simple. There's a lot of code to be written (if it has not already been written), OK, as there was for the STL... But this does not prove that the language is inferior to this or that (in the worst case it means that it lacks a powerful library for containers and the like). Moreover, as said at the beginning, each language has its usage domain. If you need such a "complex" structure and you can't find a good library out there doing what you need, then you can switch to another language. I claim that you rarely need such a complex "hierarchy", and that 70% of common "problems" can be solved with C, thinking the C way (not trying to force C to imitate OO languages like C++, as you've done).

Finally

if not even thousands of lines of C code, but an experienced C programmer is used to that.

Experience and bloat? Experienced programmers working with bare bricks (no special libraries in sight) know that first they have to write their "tools" in order to approach the problem easily. Once done, these tools are set aside in a library, and we just need to use them, so that the code (the actual code doing the actual thing) is kept short (likely still longer than C++, but not that much longer).

About templates: it is true, C has no templates, and templates are powerful, really. Still, with a smart design and bits of preprocessor magic, the wanted final result can be reached.
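
A small taste of that preprocessor magic: a macro that stamps out a typed, growable vector for any element type. This is only a sketch under my own naming (DEFINE_VECTOR, int_vec, double_vec are invented here), and error handling is reduced to a return code.

```c
#include <stdlib.h>

/* Generates a vector type called NAME holding elements of type T,
   together with NAME_push and NAME_free. */
#define DEFINE_VECTOR(T, NAME)                                   \
    typedef struct { T *data; size_t len, cap; } NAME;           \
    int NAME##_push(NAME *v, T value)                            \
    {                                                            \
        if (v->len == v->cap) {                                  \
            /* grow geometrically, starting from 4 slots */      \
            size_t cap = v->cap ? v->cap * 2 : 4;                \
            T *p = realloc(v->data, cap * sizeof *p);            \
            if (p == NULL)                                       \
                return -1;                                       \
            v->data = p;                                         \
            v->cap = cap;                                        \
        }                                                        \
        v->data[v->len++] = value;                               \
        return 0;                                                \
    }                                                            \
    void NAME##_free(NAME *v)                                    \
    {                                                            \
        free(v->data);                                           \
        v->data = NULL;                                          \
        v->len = v->cap = 0;                                     \
    }

/* One "instantiation" per element type, as a template would do. */
DEFINE_VECTOR(int, int_vec)
DEFINE_VECTOR(double, double_vec)
```

An empty vector is just a zeroed structure, e.g. int_vec v = { NULL, 0, 0 };. It is not as comfortable as a template, and error messages inside the macro are ugly, but the generated code is as typed as hand-written code would be.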

We must not forget the question, though: does it always make sense, or have we just picked a problem (from a subset of the set of all problems) that is particularly suitable for C++, just to show that with C it could be harder? (Again I have to stress that even if it is true that in C it would be that much harder, this does not prove that C++ is superior.)

And let's focus on the chosen problem: it is easy to use the very same problem to show that another language (e.g. let's consider a functional language, just to be "original") is superior to C++.

Many of the arguments you used can also "prove" that Fortran (even without the object-oriented features of the 2003 standard) is currently superior to C... and, because of the chosen example, since you don't need to implement anything to handle matrices in Fortran, we can argue that Fortran is superior to C++; instead of Fortran, we could talk about languages like Matlab, GNU Octave, or maybe more evidently Python or similar... all these are superior to C++ for almost the same reasons why C++ is superior to C.

Even if you try to struggle by making the matrix type and all of its functions as preprocessor macros (something which would make it some of the most horrible pieces of code ever created), it would still fall short because it wouldn't work with the GMP, MPFR and other similar libraries which do not act as primitive types.

I believe there's an elegant and smart way of doing it, so the code would not be one of the most horrible pieces of code ever created. Even C++ won't work by magic with non-primitive types: you have to overload operators and use functions and pointers the C way; if there's a class that wraps everything, it just means that you've delegated the (small) "complexity" to someone else. (BTW this made me think about the C++ ABI and name mangling, which raise several problems...)

And at last, the last sentence...

C++ is inferior to C (or, if you prefer, C is superior to C++) since it is much easier to write bad code in C++ than in C. This is a simple truth.

2011-04-09

What happens if...

Back again. This time I fished out something I had set aside for future fun. It is simply a piece of code that, like a lot of pieces of code I've seen recently, drove me crazy, since it is clearly wrong, but the people using it think it is right just because it happens to work. (Moreover, the names chosen for the functions are misleading, but this is a different matter.)

How about that, you ask: if something happens to work, it means it is right. Doesn't it? Wrong. Of course, it does not work miraculously: it works since the "real world space" of inputs that feeds it is smaller than the "theoretical space" of inputs that the code claims to handle.

Fewer words, let's look at the code (pseudocode).


procedure Copychar (start, end : integer,
                    src, out dest, repl : string);

var tmp : string,
    i, k : integer;

    k := 0;
    for i := 0 to src.length;
        if i >= start-1 and i < repl.length then
            tmp[k] := src[i];
            k := k + 1;
        else if i >= end then
            tmp[k] := src[i-1];
            k := k + 1;
        end if;
    end for;
    dest := tmp;
end procedure.


Hopefully every real programmer (whether using Pascal or whatever) can see how sick this code is (and the procedure name does not explain what the code's intentions were). First, the for runs from the first character (indexed by 0) to one past the last one (which is indexed by the length of the string minus 1). Second, the guard is on the length of the repl string, not on that of src, which is the variable indexed by i. If the repl string is longer than src, bad things may happen.

So the procedure must have at least this constraint: repl.length <= src.length; with this (undeclared) constraint, the index i can index src correctly in the first if. When the index has passed that "threshold" the else if is taken into account for sure, and its body executed when i >= end.

This code is so evil that it is hard to explain why (and it is hard to imagine that it did the job it was written for! But if I showed you the rest of the code, you would see that this is very specialized code, camouflaged as a generic function, used in a very "special" way, in a very "special" context).

The intention of the procedure (judging by the name and the input parameters) seemed to be to replace a substring with another string, where of the old substring we know just where it starts and where it ends. But if we take a look at it, we see that it never reads from repl, and that it does its indexing in the wrong way...

A correct, properly general procedure could be instead (I am using a higher level of abstraction in place of explicit for loops):


procedure ReplaceStrPart (start, end : integer,
                          src, out dest, repl : string);

    dest := "";
    dest := dest + src[0:start-1];
    dest := dest + repl;
    dest := dest + src[end:];

end procedure.


Hopefully everyone who can be considered a programmer can translate this into for loops and indexes, if the language does not support string slicing or a similar syntax (note: in the syntax a[i:f], if i > f, nothing is copied, while a[i:i] is the same as a[i] and picks the i-th character).
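
For instance, a possible translation into C with explicit loops and indexes could look like the sketch below. Note the assumptions, which are mine: start is the 0-based index of the first character to be replaced, and end is the 0-based index of the first character to be kept after the replaced part (so the range is half-open, unlike the inclusive slices of the pseudocode); the name replace_str_part is invented as well.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated string made of src[0..start), then repl,
   then src[end..]. Returns NULL on invalid arguments or if the allocation
   fails. The caller must free() the result. */
char *replace_str_part(const char *src, size_t start, size_t end,
                       const char *repl)
{
    size_t src_len, repl_len, i, k = 0;
    char *dest;

    if (src == NULL || repl == NULL)
        return NULL;
    src_len = strlen(src);
    repl_len = strlen(repl);
    /* every index is checked against the string it actually indexes */
    if (start > end || end > src_len)
        return NULL;

    dest = malloc(start + repl_len + (src_len - end) + 1);
    if (dest == NULL)
        return NULL;

    for (i = 0; i < start; i++)        /* the part before the replaced range */
        dest[k++] = src[i];
    for (i = 0; i < repl_len; i++)     /* the replacement itself */
        dest[k++] = repl[i];
    for (i = end; i < src_len; i++)    /* the part after the replaced range */
        dest[k++] = src[i];
    dest[k] = '\0';

    return dest;
}
```

With src "hello world", start 6, end 11 and repl "there", the three loops copy "hello ", then "there", then nothing, giving "hello there".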

This procedure does what we expect and has a name that is a little bit better.

As for the original evil code, the programmer failed to write a sensible generic function/procedure; rather, he proceeded in a trial-error-fix fashion, in a "limited input space", and he had to "compensate" for what he likely recognized as odd behaviour, until he reached his goal... I would say, by chance. And everything went fine since the "input space" stayed limited, though (I know this) there was no reason why it should: there was no special constraint or agreement about that... it simply was so because people have their habits and generated strings keep their "places"...

2011-02-01

February

Turned my head just to see the path and woops, it's already February. As usual, since January had to end. Oh, happy new year to everybody!

I'm thinking about some serious project to feed my mind. Currently I am trying to let my artistic fantasy fly, and technical or sub-technical stuff is a little bit behind. Indeed it is stacking up in my mind; one day it will pop, hopefully not too late.

I am sorry there's no January post; this is my only current concern! :)