2011-06-23

Who gets it right? Enterprise encoding hell

One of my preferred arguments about computers is character encodings. Years ago I began to write several lectures about computers and programming, and the first were all about this subtle issue.

The fact is that people fail to understand that computers do not understand, they just execute codes (that is, data which can be interpreted as code by a special interpreter, the processor); they are just programmed to interpret data in a particular way and maybe to represent them some way. But there's no meaning in data themselves. So if you get the wrong interpreter for some data, what you get is garbage, or a polite error message. If you have an interpreter but you feed it with data it can't interpret, you get again garbage as output, or a polite error message.

What we read as letters for a computer are data (bytes), no more no less. Data that can be interpreted as letters, but in order to do so, the interpreter must know (be programmed to use) a sort of map between the data and the symbols shown on the screen, on the mobile phone display, or wherever.

Nowadays the ASCII "standard" is so widespread that we can disregard any other standard (like EBCDIC) though likely it's still alive somewhere. The ASCII encoding "states" that the byte XY, when interpreted as a "letter", must be shown as the letter "L"; put numbers (another interpretation of bytes) in stead of XY and other letters in stead of L and you have a way of interpreting a sequence of bytes as a sequence of letters. Indeed ASCII has also bytes that have no "printable" symbols associated; e.g. there's a byte that means go to the next line (new line), or put a space (space) and so on.

Anyway, ASCII has no too much special symbols, accented letters (since it was born as an american/english stuff only) and so on. So the pain starts when you want to see that nice symbol on every computer interpreting your data as a document made of readable symbols.

The encoding babel begins at that point. It could be interesting to trace the history of character encodings. Anyway for any practical purpose it is more interesting to aknowledge the fact that there were great efforts to reduce this babel and in the same time to increase the number of symbols representable in a "unified" manner.

But there's no way to escape from the choice of the correct data to be fed to an interpreter that expects those data.

So, if you as programmer, write in your preferred IDE or editor, a line like this:

printf("il costo è di 10€"); //avoid this!

or if you feed your DB with a statement that looks to you like

INSERT INTO aTable VALUES('il costo è di 10€', 10); -- avoid this!

and you know nothing about the encoding your IDE/editor uses and your DB accepts/translates into, and the encodings that are understood by the systems where you suppose to display those texts, then you are getting it badly wrong, to say the least.

About DB, when you want to be sure it does not "scramble" your text, you must force it to believe it is data, not text. You loose something (ordering capabilities according to specific collating sequences), but you are sure that the bytes you wrote are "transmitted" as they were to the final end-point. (The question about in which encoding those bytes represent "è" and "€" remains, i.e. you must input those symbols in a desidered known encoding, or input them as bytes, if possible).

Now, some people may ask how it is possible that I wront "è€" and the reader reads "è€"? HTML (which is ASCII based) has special "metainformartions" that are used to specify the encoding of the content. If the encodings match, I succeeded showing you the symbols I see: we are saying to the "interpreter" (the browser that shall choose the right symbol according to the input) which "map" to use.

The default encoding for web pages if nothing is specified is ISO-8859-1, AKA Latin1, which is 8bit superset of the ASCII encoding and a single byte encoding — the first 128 codes of Latin1 are the same of the ASCII (which has "only" 128 codes). (Instead of ISO-8859-1, it could be ISO-8859-15 aka Latin15; this has only the "currency symbol" replaced by the "euro symbol").

Anyway, the most general encoding is UTF(8) which is a way to encode the biggest "map" available, the "map" that tries to hold every symbols humans have created (UCS i.e. Universal Character Set). Modern O.S. uses this encoding as default encoding, but of course it is not always true.

To get a little bit deeper in the matter, wikipedia has articles about the subject. It is worth reading a lot about the topic even by employees of big enterprises.

No comments:

Post a comment