Unicode Transformation Format: UTF-8

Embedded nulls in character data, together with byte-order issues, make straight Unicode (i.e., UCS-2 or UCS-4) hard to work with in a heterogeneous environment. Too many opportunities arise for the data to be truncated or misinterpreted, and too many systems would fail even to transmit such data. In short, when 16- or 32-bit data are thrust into a multi-architecture 8-bit world, it frequently bodes ill for the data.
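
Both problems are easy to demonstrate. The sketch below assumes GNU iconv, which accepts the explicit encoding names UCS-2BE and UCS-2LE. The same two characters produce two different byte streams depending on byte order, and every other byte is a null:

	$ echo HI | iconv -f ascii -t UCS-2BE | hexdump -C
	00000000  00 48 00 49 00 0a                                 |.H.I..|
	00000006
	$ echo HI | iconv -f ascii -t UCS-2LE | hexdump -C
	00000000  48 00 49 00 0a 00                                 |H.I...|
	00000006

A C library routine such as strlen(), which stops at the first zero byte, would report the length of the big-endian string as 0.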

To answer that problem, several transformation formats were adopted to make Unicode transmissible to, and unambiguous on, most machines. Their goals were broadly similar: a widely recognized format that would safely and unambiguously convey Unicode information between machines and across the Internet. To do that, they sought to remove nulls and endianism from the data stream. The most popular of these (practically the only one used) is known as UTF-8.

UTF-8 found wide acceptance for many reasons. It represents any Unicode character as a sequence of one to four bytes; the number of bytes required depends on the character's integer code point value, and only one byte is needed for the old ASCII range (0-127). UTF-8 never uses a zero byte as part of any character's encoding (except for the ASCII NUL itself). In consequence, UTF-8 is efficient with respect to space, has no endianism issues, and embeds no nulls. An ASCII string is already a valid UTF-8 string, byte for byte. These properties make UTF-8 data relatively easy for systems accustomed to processing ASCII data.
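
The multibyte behavior is just as easy to observe. This sketch assumes a UTF-8 locale and a terminal that can enter the euro sign; the single character € (U+20AC) becomes the three bytes e2 82 ac:

	$ echo € | hexdump -C
	00000000  e2 82 ac 0a                                       |....|
	00000004

Because every byte of a multibyte sequence has its high bit set, none of those bytes falls in the ASCII range, and ASCII-oriented code never mistakes part of such a character for an ASCII byte.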

Here's a small example showing the difference between UCS-2 and UTF-8.

Example C.1. HELLO in UCS-2 and UTF-8

	$ echo HELLO | iconv -f ascii -t UCS-2 | hexdump -C
	00000000  00 48 00 45 00 4c 00 4c  00 4f 00 0a              |.H.E.L.L.O..|
	0000000c
	$ echo HELLO | iconv -f ascii -t utf-8 | hexdump -C
	00000000  48 45 4c 4c 4f 0a                                 |HELLO.|
	00000006
	$ echo HELLO | hexdump -C
	00000000  48 45 4c 4c 4f 0a                                 |HELLO.|
	00000006


It is the similarity of the last two outputs that makes UTF-8 so attractive. It behaves like ASCII when ASCII is all that's needed, but without ASCII's limitations.

While UTF-8 solves many technical problems, it doesn't magically transform every ASCII-assuming system into a Unicode system. For example, to display Unicode data correctly, even Unicode data in UTF-8 format, the system still needs a suitable font. And software must learn to distinguish the buffer size (a byte count) from the character count, because a single character may now occupy several bytes.
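
That last distinction is visible with wc, which counts bytes with -c and characters with -m (this again assumes a UTF-8 locale):

	$ printf € | wc -c
	3
	$ printf € | wc -m
	1

One euro sign: three bytes, one character. Code that sizes buffers by character count, or measures string length in bytes, must keep those two numbers straight.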