Unicode: East meets West

ASCII and its 8-bit cousins are on the way out, and with them the assumption that a character can be represented by a single byte. The new kid on the block is Unicode, similar to but not precisely the same as ISO 10646. Unicode (despite its name) is a set of standards. The most widely implemented is the 16-bit form, called UCS-2. As you might guess, UCS-2 uses two bytes per character, allowing it to encode most characters of most languages. Because most is far from all, there are nascent 32-bit forms, too, but they are neither complete nor in common use.

In the same sense that 7-bit ASCII was extended to 8 bits, Unicode extends the most prevalent 8-bit ASCII extension, ISO 8859-1, to 16 and 32 bits. The first 256 values are the same in Unicode as in ISO 8859-1: 65 is still A, except instead of being 8 bits (0x41), it's 16 bits (0x0041). Unlike the 8-bit extensions, Unicode has a unique one-to-one map of numbers to characters, so no language context or character set name is needed to decode a Unicode string.
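To make that relationship concrete, here is a minimal C sketch (the function and variable names are illustrative, not part of any library) that widens a Latin-1 string to UCS-2 simply by zero-extending each byte:

    #include <stdint.h>
    #include <stdio.h>

    /* Widen an ISO 8859-1 (Latin-1) string to UCS-2 by zero-extending
     * each byte: the code point stays the same, only its width changes. */
    static void latin1_to_ucs2(const unsigned char *in, uint16_t *out, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            out[i] = (uint16_t) in[i];  /* 'A' (0x41) becomes 0x0041 */
    }

    int main(void)
    {
        const unsigned char latin1[] = "A";
        uint16_t ucs2[1];

        latin1_to_ucs2(latin1, ucs2, 1);
        printf("Latin-1 0x%02X -> UCS-2 0x%04X\n", latin1[0], ucs2[0]);
        return 0;
    }

Running it prints "Latin-1 0x41 -> UCS-2 0x0041": the numeric value is unchanged, only the storage width differs.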

UCS-2 was the encoding initially employed by Microsoft's NT-based systems; more recent versions have moved to UTF-16. Microsoft database servers store UCS-2/UTF-16 strings in the nchar and nvarchar datatypes. Microsoft also designed version 7.0 (and up) of the TDS protocol around UCS-2/UTF-16: all metadata (table names and such) are encoded that way on the wire.
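As a rough illustration of what that means on the wire, the sketch below lays out an ASCII identifier as little-endian UCS-2, the byte order TDS 7.0+ uses for Unicode strings; the function name and buffer handling are hypothetical and not part of any TDS library.

    #include <stdio.h>
    #include <string.h>

    /* Lay out an ASCII identifier as little-endian UCS-2, two bytes per
     * character.  Returns the number of bytes written, or 0 if the
     * buffer is too small. */
    static size_t ascii_to_ucs2_le(const char *name, unsigned char *buf, size_t buflen)
    {
        size_t i, len = strlen(name);

        if (buflen < 2 * len)
            return 0;
        for (i = 0; i < len; i++) {
            buf[2 * i]     = (unsigned char) name[i]; /* low byte: the code point */
            buf[2 * i + 1] = 0;                       /* high byte: zero for ASCII */
        }
        return 2 * len;
    }

    int main(void)
    {
        unsigned char wire[32];
        size_t i, n = ascii_to_ucs2_le("name", wire, sizeof(wire));

        for (i = 0; i < n; i++)
            printf("%02X ", wire[i]);   /* prints: 6E 00 61 00 6D 00 65 00 */
        printf("\n");
        return 0;
    }

Real clients of course convert whole strings, and for characters beyond the Latin-1 range the high byte is no longer zero, but the two-bytes-per-character layout is the same.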