Saturday, August 30, 2008

Confused by character code issues?

We have all been at one point or another. Taking a key press, converting it to some representation of a character stored somewhere, and then either displaying it or transmitting it and storing it somewhere else is a complex process, and every step is fraught with issues.
I have spent some time recently on multilingual character-related issues. Along the way I read a seemingly endless amount of explanatory information and often came away none the wiser or, tragically, even more confused.

The light that led me out of the darkness is what must be one of the most useful pieces of information I have encountered on the internet.

It is "A tutorial on character code issues" by Jukka Korpela. When this man says that he is a Master of Science, he is not kidding.

The most elucidating part is where he explains that "The following definitions are not universally accepted and used. In fact, one of the greatest causes of confusion around character set issues is that terminology varies and is sometimes misleading."

Equally useful is the assertion that "It specifically avoids the term character set, which is confusingly used to denote repertoire or code or encoding."
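
As a rough illustration of that distinction, here is a small Python sketch of my own; it is not taken from the tutorial. It takes a single character and shows its code (the number assigned to it), two encodings (the bytes that number becomes), and a repertoire check (whether a given encoding can represent the character at all). The helper name in_repertoire is made up for this example.

```python
# One character, looked at three ways: code, encoding, repertoire.
ch = "\u00e9"  # e with acute accent

# Character code: the number assigned to the character.
code_point = ord(ch)

# Encoding: how that number is turned into bytes. Same character,
# same code, different bytes depending on the encoding chosen.
utf8_bytes = ch.encode("utf-8")
latin1_bytes = ch.encode("latin-1")


def in_repertoire(char, encoding):
    """Hypothetical helper: can this encoding represent the character at all?"""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Reading bytes back with the wrong encoding is the classic source of garbage.
misread = utf8_bytes.decode("latin-1")

print(code_point, utf8_bytes, latin1_bytes)
print(in_repertoire(ch, "ascii"), in_repertoire(ch, "latin-1"))
print(misread)
```

Most documentation would call all three of these things a "character set", which is exactly the confusion the tutorial warns about.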

Armed with this knowledge, one can return to the confusing nonsense that passes as documentation and translate it into the correct nomenclature, working out from context whether each of the endless, inappropriate uses of "character set" really means repertoire, code or encoding. Only then can one actually unravel the issues and solve them.

Hats off to Jukka Korpela. Wouldn't life be that much more difficult without guys like him?

Once you have read that and the bugs have been laid to rest, you can read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky for a more relaxed and amusing take, and feel better about it all.
