Appendix C. About Unicode, UCS-2, and UTF-8

Table of Contents

ASCII: What everyone knows
The ASCII Compact
ISO 8859: What everyone would like to forget
Unicode: East meets West
Unicode's Pluses and Minuses
Unicode Transformation Format: UTF-8
Unicode and FreeTDS
For further information

For better or worse, FreeTDS brings the otherwise innocent programmer into contact with the arcane business of how data are stored and transported. FreeTDS is a data communications library that connects to databases, which are charged with storing information in a way that is neutral to all architectures and languages. On the surface, that might not seem very complex, or even worth discussing. Under the surface, things are not so simple.

ASCII: What everyone knows

The world we are all familiar with, programmingwise, is ASCII. Our email (mostly), our text files, our web pages (mostly), all use ASCII to represent English (or English-like) text. Perhaps because ASCII has been a standard since the 1960s (it was first published in 1963 and later adopted internationally as ISO 646), it seems like the natural way to store information. But let's look under the hood a little bit, and examine our assumptions.

Our so-called text files are nothing special, nothing but a little agreement we enter into with our operating system. The only reason we can read them with cat or vi is that the operating system and its tools are in on the agreement. A file is only a stream of bytes, after all, no more text than an executable. The only thing distinguishing a text file from any other is our understanding to treat it like one. We agree that the number 65 will represent the letter A, 66 the letter B, and so on, 128 values in all (0 through 127). See man ascii for further details.
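To make that agreement concrete, here is a minimal Python sketch. It is only an illustration: the helper names byte_values and as_text are made up for this example and are not part of FreeTDS or any other library. It shows that what is stored is a sequence of numbers, and that the text appears only when we agree to read those numbers as ASCII.

```python
def byte_values(text):
    """Return the numbers that are actually stored for `text` under ASCII."""
    return list(text.encode("ascii"))


def as_text(values):
    """Read a sequence of numbers back as ASCII text."""
    return bytes(values).decode("ascii")


# The letters A, B, C are stored as the numbers 65, 66, 67.
print(byte_values("ABC"))

# The same kind of numbers, interpreted as text by agreement.
print(as_text([72, 105]))
```

Nothing in the stored numbers says "this is text"; the call to decode is where the agreement is applied.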

The important thing to understand is that the designation of 65 for A and so on is a choice. It's an encoding standard, made necessary by the simple fact that computers store numbers, not letters. ASCII is so ubiquitous these days that it's sometimes hard to remember there was a time when it was but one of a set of competing encoding standards. Others you probably have heard of include EBCDIC and the Baudot systems, but they are by no means the only historical alternatives, nor the only modern ones.
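That the number is a choice can be seen by encoding the same letter under two standards. The sketch below is illustrative only; it assumes Python's built-in cp037 codec as a stand-in for EBCDIC (one of several EBCDIC code pages), and the helper name encoded_value is invented for the example.

```python
def encoded_value(letter, codec):
    """Return the number a single letter is stored as under `codec`."""
    return letter.encode(codec)[0]


# The same letter, two different agreements about its number.
for codec in ("ascii", "cp037"):
    print(codec, encoded_value("A", codec))

# A byte written under one agreement may be meaningless under another:
# 193 is the letter A in cp037, but it is not a valid ASCII value at all.
try:
    bytes([193]).decode("ascii")
except UnicodeDecodeError as err:
    print("not ASCII:", err.reason)
```

Under ASCII the letter A is 65; under this EBCDIC code page it is 193. Neither is more correct, and data written under one convention cannot be read under the other without knowing which was used.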

The ASCII Compact

UNIX® and Unix-like systems bought into ASCII big time. Program code, filenames, string constants (and variables), configuration files, everything but everything is encoded in ASCII. Practically every utility, command, and library assumes that text data will be ASCII. At the dawn of the 21st century, there is widespread recognition that ASCII will no longer suffice, but the work of upgrading all the computers and computer programmers is, well, unfinished.