Unicode's Pluses and Minuses

You will read from time to time that Unicode is not perfect. Surprise, surprise: it's true. From a linguistic point of view, Unicode is incomplete; in particular, UCS-2 is demonstrably too small (!) to hold all the forms of Chinese ideographs used over the centuries. (It is, however, quite useful and widely employed in representing modern Chinese.) Of more common concern to programmers are Unicode's technical problems, or rather, Unix's technical shortcomings vis-a-vis any encoding more complex than ISO 8859-x.

The basic problem, from a programmer's perspective, is the ancient agreement Unix entered into 30 years ago, the ASCII Compact, alluded to earlier. Assumptions about ASCII are littered throughout Unix-like systems, beginning with C's convention of representing strings as arrays of characters ending in a zero. Returning to the earlier HELLO example, C stores HELLO as 72 69 76 76 79 0, in very nice ASCII. Many, many parts of the operating system and its associated tools and applications will recognize that as a 5-letter word because it is terminated by a null (zero). In UCS-2 Unicode, though, that same HELLO uses 2 bytes for every character and, in little-endian byte order, becomes 72 0 69 0 76 0 76 0 79 0 0 0. Practically the whole OS will think that's a 1-letter word, H. Not a good thing.
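
A short C sketch makes the point concrete. The byte values are the ones from the example above; treating the UCS-2 buffer as little-endian is an assumption, and the output comments show what a standard strlen() would report:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "HELLO" in plain ASCII: 72 69 76 76 79, plus the terminating zero. */
        char ascii[] = { 72, 69, 76, 76, 79, 0 };

        /* The same word in UCS-2, little-endian byte order, viewed as raw bytes:
           every other byte is zero, and the string ends in two zero bytes. */
        char ucs2[] = { 72, 0, 69, 0, 76, 0, 76, 0, 79, 0, 0, 0 };

        printf("ASCII strlen: %zu\n", strlen(ascii)); /* prints 5 */
        printf("UCS-2 strlen: %zu\n", strlen(ucs2));  /* prints 1: stops at the first zero byte */
        return 0;
    }

Every routine that walks a char array until it hits a zero byte has the same blind spot, which is why the problem is not confined to any one tool.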

Even if every OS were magically rid of all ASCII assumptions and C strings, there would still be the problem of Endianism. Technical explanations of the subject are not hard to find. The long and short of it is that, given a 16-bit integer (2 bytes), different hardware architectures will store the value differently. Asked to store our friend A (0x41), for instance, a SPARC processor will put the least significant byte at the higher address (00 41), whereas an Intel processor will put it at the lower address (41 00). Put aside the questions of left, right, and wrong; architectures are a fact of life. Endianism shows up wherever integers are stored and retrieved in heterogeneous environments.
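
You can see the difference for yourself with a few lines of C; this is only a sketch, and the output depends entirely on the machine it runs on:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t a = 0x0041;                       /* our friend A as a 16-bit integer */
        unsigned char *bytes = (unsigned char *)&a;

        /* On a big-endian machine (SPARC, for instance) this prints 00 41;
           on a little-endian machine (Intel) it prints 41 00. */
        printf("%02x %02x\n", bytes[0], bytes[1]);

        if (bytes[0] == 0x41)
            printf("little-endian\n");
        else
            printf("big-endian\n");
        return 0;
    }

The trouble starts when the bytes written by one of these machines are read back on the other: the same two bytes decode to a different character.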

The Unicode folks knew about Endianism, of course, and had to address it. A Unicode bytestream is supposed to begin with a byte-order mark, the character U+FEFF: a reader that sees the bytes FE FF knows the stream is big-endian, and one that sees FF FE knows it is little-endian. Needless to say, perhaps, many streams don't carry one.
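
Checking for the mark takes only a peek at the first two bytes. The function below is a hypothetical sketch (the name byte_order is mine, not any library's), and its third outcome, no mark at all, is the case a real program spends most of its time worrying about:

    #include <stdio.h>

    /* Returns 1 for big-endian (FE FF), -1 for little-endian (FF FE),
       and 0 when no byte-order mark is present. */
    int byte_order(const unsigned char *buf, size_t len)
    {
        if (len >= 2) {
            if (buf[0] == 0xFE && buf[1] == 0xFF)
                return 1;
            if (buf[0] == 0xFF && buf[1] == 0xFE)
                return -1;
        }
        return 0;
    }

    int main(void)
    {
        /* BOM followed by "HE" in little-endian UCS-2. */
        unsigned char le_hello[] = { 0xFF, 0xFE, 72, 0, 69, 0 };
        printf("%d\n", byte_order(le_hello, sizeof le_hello)); /* prints -1 */
        return 0;
    }

When the mark is missing, a program is left to guess from context, which is exactly the kind of guesswork Unicode was supposed to spare us.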