ISO 8859: What everyone would like to forget

ISO 8859: What everyone would like to forget
Prev	Appendix C. About Unicode, UCS-2, and UTF-8	Next

ASCII won, it would seem, but the race goes not to the swift. ASCII has many limitations, the most egregious of which is, it's not much good for anything besides English. It encodes all the letters and punctuation (almost) of the English alphabet, but is useless for German, Russian, and Greek, to say nothing of Chinese.

ASCII assigns one byte to every character, but deals with only 7 of the 8 available bits, the range 0-127 (with the “high bit” always zero). Demand for computers that could display and print languages besides English — even English with em dashes and cent (¢) signs — arrived soon enough, with the Marketing Department way out in front of the propeller heads. The predictable result was an array of “8-bit ASCII” encoding standards for a wide variety of alphabets. Eventually, they were standardized (or at least enumerated and documented) by the ISO. These are what our friendly database vendors are referring to when they talk about character sets.

The upshot is, there is no uniform standard, no agreement on the meaning of a byte, particularly if that byte's value is greater than 127. Let's say your client machine sends HELLO and your database stores it as 72 69 76 76 79. When another client retrieves that value, it will convert it into human-readable form by applying an encoding standard. If everything's tightly wrapped, it will use the very same encoding that your database used (and the same one you had in mind when you sent it), and that client will also see HELLO. If things are not so tightly wrapped but that client is fortunate enough to be using a similar standard to what you were using, say, ISO 8859-1, he'll still see HELLO. Most languages based on the Roman alphabet can be represented by ISO 8859-1, and are thus interchangeable. Beyond that, things get quickly messy. Greek clients, for one, are not so lucky: there are three ISO 8859 standards for Greek, all mutually incompatible. For more information, see ISO 8859 Alphabet Soup. Roman Czyborra's site is very informative; take your time there if you don't want your head to spin.

Database servers need to know what encoding standard to employ, too. It's not obvious at first, but notions like “uppercase” and “lowercase”, trailing blanks, and collation rules all depend on what letter is meant by what number. (Collation even depends on what culture is interpreting the letters.)