A strange bunch of characters

Most, if not all, translators nowadays use some kind of CAT tool to help them with their work.

SDL Trados has been the standard for the past few years, and it is still the most widely used CAT tool.

The encoding used by this tool for its working documents (TTX, sdlxliff) and exported memories (TMX) is UTF-16. Now, the difference between UTF-8 and UTF-16 is interesting, and I’ll dwell a bit on it.

The ASCII code, as we have seen before, is a seven-bit encoding used for English characters, including most of its punctuation. A byte has eight bits, so there is a spare bit in every byte that can be used to signal that the character is outside this range.
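As a small illustration of that spare bit, the sketch below counts how many bytes of its input have the eighth bit set, that is, fall outside ASCII. The name count_high_bytes is made up for this example, and it assumes a POSIX shell with tr and wc available.

```shell
# Count the bytes on stdin whose eighth (high) bit is set.
# count_high_bytes is a hypothetical helper name, not a standard command.
count_high_bytes() {
    # LC_ALL=C makes tr work on raw bytes; deleting octal 000-177
    # removes every seven-bit ASCII byte and leaves only the others.
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# Plain ASCII text: no byte uses the spare bit.
printf 'hello' | count_high_bytes          # prints 0

# "ñ" written as its two UTF-8 bytes (octal 303 261), between "a" and "o":
# both of those bytes have the high bit set.
printf 'a\303\261o' | count_high_bytes     # prints 2
```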

The 8 in UTF-8 refers to these eight bits. UTF-8 is a variable-width encoding: a given character takes one, two, three or four bytes.
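To see the variable width in practice, here is a minimal sketch that counts the bytes of four characters of increasing "distance" from ASCII. The helper name byte_count is my own, and the characters are spelled out as octal byte values so that the result does not depend on your terminal settings.

```shell
# Print how many bytes stdin contains (hypothetical helper name).
byte_count() {
    wc -c | tr -d ' '
}

# Each character is written as its UTF-8 bytes in octal.
printf 'a'                | byte_count   # U+0061 "a"        -> 1
printf '\303\261'         | byte_count   # U+00F1 "ñ"        -> 2
printf '\342\202\254'     | byte_count   # U+20AC "€"        -> 3
printf '\360\237\230\200' | byte_count   # U+1F600, an emoji -> 4
```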

UTF-16, on the other hand, is also variable-width, but its basic unit is 16 bits: the most common characters take 2 bytes each, and a character that falls outside this range takes 4 bytes.
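The same kind of check works for UTF-16, assuming iconv is installed. In this sketch the helper name utf16_len is hypothetical, and I convert to UTF-16LE rather than plain UTF-16 so that iconv does not add a byte order mark, which would inflate the counts by two bytes.

```shell
# Byte length of stdin after converting it from UTF-8 to UTF-16.
# utf16_len is a hypothetical helper; UTF-16LE is used so that iconv
# does not prepend a byte order mark to the output.
utf16_len() {
    iconv -f UTF-8 -t UTF-16LE | wc -c | tr -d ' '
}

printf 'a'                | utf16_len   # U+0061: one 16-bit unit, 2 bytes
printf '\342\202\254'     | utf16_len   # U+20AC "€": still one unit, 2 bytes
printf '\360\237\230\200' | utf16_len   # U+1F600: two units (a surrogate pair), 4 bytes
```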

One of the reasons why UTF-8 has become the standard encoding in the industry is that its encoding of the English alphabet coincides with ASCII. This means that you can take a plain ASCII file, add a few characters encoded in UTF-8, and the result is a valid UTF-8 file. This is not the case with UTF-16 at all: even a plain English letter takes two bytes there.
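A quick way to convince yourself is to look at the raw bytes. The sketch below, which assumes od and iconv are available and uses a made-up helper called hex_bytes, dumps the same two letters as ASCII, as UTF-8 and as UTF-16.

```shell
# Dump stdin as hexadecimal bytes on a single line (hypothetical helper).
hex_bytes() {
    od -An -tx1 | tr '\n' ' ' | tr -s ' ' | sed 's/^ //; s/ $//'
}

printf 'Hi' | hex_bytes                                # 48 69
printf 'Hi' | iconv -f ASCII -t UTF-8 | hex_bytes      # 48 69 (unchanged)
printf 'Hi' | iconv -f ASCII -t UTF-16LE | hex_bytes   # 48 00 69 00
```

The zero bytes in the last line are what confuse tools that expect one byte per English letter.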

For an enlightening experience, try forcing your browser to change the encoding with which it displays this web page (or any other) through the View menu in Firefox or Customize->Tools->Encoding in Chrome. After recovering your breath, you can change it back with no harmful consequences (simply reloading the page will do).

Another inconvenience of the two-byte encoding of UTF-16 is that, for the geeks among us who like working on the command line of a sharp Unix system, the usual text operations break down on these files, because the standard tools expect ASCII-compatible bytes. I’ve opted for detecting the encoding of the file automatically with the command

file -b --mime-encoding filename.tmx

and converting it if necessary with

iconv -f UTF-16 -t UTF-8 filename.tmx > filename_utf8.tmx

and then processing it as usual.
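The two steps can be glued together in a small shell function. This is only a sketch of how I would wire it up, not a polished tool: the name to_utf8 is made up, and it assumes that file and iconv are installed and that the UTF-16 file starts with a byte order mark, which iconv uses to work out the byte order.

```shell
# to_utf8 FILE: write FILE to stdout as UTF-8, converting only if needed.
# to_utf8 is a hypothetical helper name; it assumes file(1) and iconv(1)
# are installed and that UTF-16 input starts with a byte order mark.
to_utf8() {
    enc=$(file -b --mime-encoding "$1")
    case "$enc" in
        utf-16*) iconv -f UTF-16 -t UTF-8 "$1" ;;   # the BOM tells iconv the byte order
        *)       cat "$1" ;;                        # anything else is passed through untouched
    esac
}

# Usage: to_utf8 filename.tmx > filename_utf8.tmx
```

Anything that file does not report as UTF-16 is passed through unchanged, so running it on a file that is already UTF-8 is harmless.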
