Localization and TDS 7.0

TDS 7.0 uses 2-byte Unicode (technically, UCS-2) to transfer character data between servers and clients. Included in "character data" are query text (i.e., SQL), metadata (table names and such), and bona fide data of datatypes nchar, nvarchar, and ntext. (Background information on Unicode and how it affects FreeTDS can be found in the appendix.)

Because most Unix tools and environments do not support UCS-2, FreeTDS provides for conversion by the client to other character sets. The mechanism used is determined by the configure script, which looks for a iconv(3) function, an implementation of the iconv standard. If no iconv library is found, or if it is explicitly disabled, FreeTDS will use its built-in iconv substitute, and will be capable of converting among only ISO 8859-1, UTF-8, and UCS-2.

To learn what character set the client wants, FreeTDS prefers the applicable freetds.conf client charset property. If that is not set, it parses the LANG environment variable. In either case, the found string is passed to iconv(3) (or its built-in replacement). [1]. If neither is found, UCS-2 data are converted to ISO 8859-1.

To list all supported iconv character sets try iconv(1). GNU's does:

	$ iconv --list

For other systems, consult your documentation (most likely man iconv will give you some hints).

In this example a server named mssql will return data encoded in the GREEK character set.

Example 5-2. Configuring for GREEK freetds.conf setting

	[mssql]
	host = ntbox.mydomain.com
	port = 1433
	tds version = 7.0
	client charset = GREEK
	

If FreeTDS runs into a character it can not convert, its behavior varies according to the severity of the problem. On retrieving data from the server, FreeTDS substitutes an ASCII '?' in the character's place, and emits a warning message stating that some characters could not be converted. On sending data to the server, FreeTDS aborts the query and emits an error message. It is well to ensure that the data contained in the database is representable in the client's character set.

If you have a mix of character data that can not be contained in a single-byte character set, you may wish to use UTF-8. UTF-8 is a variable length unicode encoding that is compatible with ASCII in the range 0 to 127. With UTF-8, you are guaranteed to never have an unconvertible character.

Important

FreeTDS is not fully compatible with multi-byte character sets such as UCS-2. You must use an ASCII-extension charset (e.g., UTF-8, ISO-8859-*)[2]. Great care should be taken testing applications using these encodings. Specifically, many applications do not expect the number of characters returned to exceed the column size (in bytes).

In the following example, a server named mssql will return data encoded in the UTF-8 character set.

Example 5-3. Configuring for UTF-8 freetds.conf setting

	[mssql]
	host = ntbox.mydomain.com
	port = 1433
	tds version = 7.0
	client charset = UTF-8
	

It is also worth clarifying that TDS 7.0 and above do not accept any specified character set during login, as 4.2 does. A TDS 7.0 login packet uses UCS-2.

Microsoft Server Note

String literals in SQL must be prefixed with 'N' unless the enclosed string can be represented in the server's single-byte character set, irrespective of the column's datatype. For example, in the SQL statement

	INSERT INTO tablename (greeting) VALUES ('Hallå')

the string is subject to somewhat surprising treatment by the server.

When the server parses the SQL, it extracts the data values for insertion (or update, or comparison, etc.) Unprefixed strings are converted to the single-byte character set of the server/database.[3] Inserted data are then of course stored in the column. In the case of UCS-2 columns — nchar, nvarchar, and ntext — the value stored is that which results from a second conversion: from the single-byte form to the UCS-2 form.

The only safe way to enclose strings in SQL text is with an 'N' prefix:

	INSERT INTO tablename (greeting) VALUES (N'Hallå')

Commentary

What's surprising about this? Versions 7.0 and later of the TDS protocol use UCS-2 to send SQL text. No matter how your local client is configured — with UCS-2 or ISO 8859-1 or anything else — it's converted to UCS-2 before it's sent to the server. And obviously arrives at the server as UCS-2. If the column into which it's being inserted is also UCS-2, there's no need of any conversion, much less two, and certainly no need to lose infomation.

Why this happens is anyone's guess. Here's one: it makes the datatype of the column unimportant. Regardless of whether you use char/varchar/text or nchar/nvarchar/ntext or a mixture of the two, the arriving SQL (if naïvely written) will store exactly the same characters.

Notes

[1]

The built-in replacement expects GNU iconv names: ISO-8859-1, US-ASCII, or UTF-8.

[2]

not EBCDIC or other weird charsets

[3]

The precise rules are unknown to the author.