Unicode: East meets West

ASCII and its 8-bit cousins are on the way out, and with them the assumption that a character can be represented by a single byte. The new kid on the block is Unicode, similar to but not precisely the same as ISO 10646. Unicode (despite its name) is a set of standards. The most widely implemented is the 16-bit form, called UCS-2. As you might guess, UCS-2 uses two bytes per character, allowing it to encode most characters of most languages. Because most is far from all, there are nascent 32-bit forms, too, but they are neither complete nor in common use.

In the same sense that 7-bit ASCII was extended to 8 bits, Unicode extends the most prevalent 8-bit ASCII extension, ISO 8859-1, to 16 and 32 bits. The first 256 values are the same in Unicode as in ISO 8859-1: 65 is still A, except instead of being 8 bits (0x41), it's 16 bits (0x0041). Unlike the 8-bit extensions, Unicode has a unique one-to-one map of numbers to characters, so no language context or character set name is needed to decode a Unicode string.
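To make that relationship concrete, here is a minimal C sketch (the function and variable names are illustrative, not part of any library) that widens a Latin-1 string to UCS-2 simply by zero-extending each byte:

    #include <stdint.h>
    #include <stdio.h>

    /* Widen an ISO 8859-1 (Latin-1) string to UCS-2 by zero-extending
     * each byte: the code point stays the same, only its width changes. */
    static void latin1_to_ucs2(const unsigned char *in, uint16_t *out, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            out[i] = (uint16_t) in[i];  /* 'A' (0x41) becomes 0x0041 */
    }

    int main(void)
    {
        const unsigned char latin1[] = "A";
        uint16_t ucs2[1];

        latin1_to_ucs2(latin1, ucs2, 1);
        printf("Latin-1 0x%02X -> UCS-2 0x%04X\n", latin1[0], ucs2[0]);
        return 0;
    }

Running it prints "Latin-1 0x41 -> UCS-2 0x0041": the numeric value is unchanged, only the storage width differs.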

UCS-2 was the encoding initially employed by Microsoft's NT-based systems; more recent versions have moved to UTF-16. Microsoft database servers store UCS-2/UTF-16 strings in the nchar and nvarchar datatypes. Microsoft also designed version 7.0 (and up) of the TDS protocol around UCS-2/UTF-16: all metadata (table names and such) are encoded that way on the wire.
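As a rough illustration of what that means on the wire, the sketch below lays out an ASCII identifier as little-endian UCS-2, the byte order TDS 7.0+ uses for Unicode strings; the function name and buffer handling are hypothetical and not part of any TDS library.

    #include <stdio.h>
    #include <string.h>

    /* Lay out an ASCII identifier as little-endian UCS-2, two bytes per
     * character.  Returns the number of bytes written, or 0 if the
     * buffer is too small. */
    static size_t ascii_to_ucs2_le(const char *name, unsigned char *buf, size_t buflen)
    {
        size_t i, len = strlen(name);

        if (buflen < 2 * len)
            return 0;
        for (i = 0; i < len; i++) {
            buf[2 * i]     = (unsigned char) name[i]; /* low byte: the code point */
            buf[2 * i + 1] = 0;                       /* high byte: zero for ASCII */
        }
        return 2 * len;
    }

    int main(void)
    {
        unsigned char wire[32];
        size_t i, n = ascii_to_ucs2_le("name", wire, sizeof(wire));

        for (i = 0; i < n; i++)
            printf("%02X ", wire[i]);   /* prints: 6E 00 61 00 6D 00 65 00 */
        printf("\n");
        return 0;
    }

Real clients of course convert whole strings, and for characters beyond the Latin-1 range the high byte is no longer zero, but the two-bytes-per-character layout is the same.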