Monday, February 7, 2011

UTF-8 good, UTF-16 bad

Lately I’ve been programming in Go. I started by writing a few popular Unix tools in Go, before moving on to a lexer that supports structural regular expressions and UTF-8.

Unicode support was easy to add because Go lives and breathes UTF-8. But why did the Go designers favour UTF-8, and not UTF-16? Widespread software such as Windows, Java, and Python use UTF-16 internally, so why not Go?

Further reading revealed the answer. UTF-16 is a historical accident that persists mainly due to inertia. UTF-16 has no practical advantages over UTF-8, and it is worse in some ways. The Go designers knew what they were doing, so they chose UTF-8.

Down with UTF-16

Before, from vague hunches, I had thought that UTF-16 was simpler because each character took exactly 2 bytes (so it handled only a strict subset of Unicode), and that UTF-16 was more compact for certain languages.

The first statement is patently false. UTF-16 is a variable-length encoding, and hence not much simpler than UTF-8. As for the second: consider HTML. The amount of whitespace and tags in a typical document is high enough that UTF-8 is more compact than UTF-16. This generalizes to other formats, such as Go source code.

So there’s no upside. But there are plenty of downsides to UTF-16:

  1. We must worry about byte ordering because UTF-16 deals with 16 bits at a time. A related drawback is the loss of self-synchronization at the byte level.

  2. We lose backwards compatibility with ASCII.

  3. We lose backwards compatibility with code treating NUL as a string terminator.

  4. We cannot losslessly convert from byte streams.

  5. The encoding is inflexible: hopefully 0x110000 code points should be enough, but if it were extended (again), UTF-16 would need major surgery. In contrast, by simply allowing up to 6 bytes, UTF-8 can handle up to 231 code points.

  6. Ugly encoding: it’s the reason why U+D800 to U+DFFF is reserved (though thankfully UTF-16 is self-synchronizing with 16-bit granularity).

I tired of thinking about the drawbacks of UTF-16, but there are likely more. Thus even if we only worked with a strange set of documents where UTF-16 is more efficient, the savings are too small to justify the complexity.

A living dead standard

How did so many projects settle on such a horrible standard? It’s a sad story. The original Unicode supported exactly 0x10000 code points, making UCS-2 (the precursor to UTF-16) attractive. Each character neatly corresponded to one 16-bit word. Although UTF-8 might still be preferable thanks to byte ordering and ASCII compatibility, UCS-2 is certifiably sane. People built systems based on UCS-2, and rejoiced that the codepage Stone Age was over.

Then an unnatural disaster struck: the Unicode Consortium decided more code points were needed. The supple UTF-8 got away unscathed, but there’s no way to squeeze more than 16 bits into UCS-2. Apart from abandoning UCS-2, the only recourse was to reserve a chunk of values for encoding the higher code points. Thus UTF-16 was born. Hopefully, developers could simply tweak masses of existing UCS-2 code for the new standard. Sadly, various UTF-16 bugs suggest that much code remains unaware UTF-16 is a variable-length encoding.

Despite the trauma it would cause, they should have killed UCS-2 when they had the chance. Now we have a grotesque encoding scheme that shambles alongside UTF-8.

Gauss was here

According to Wikipedia, UTF-8 was built on ASCII by Rob Pike and Ken Thompson. ASCII was developed from telegraph codes by a committee. The telegraph codes stemmed from a Western Union code, which evolved from a code by Donald Murray, who modified a 5-bit code invented by Émile Baudot, who based his work on a code designed by Carl Friedrich Gauss and Wilhelm Weber. It’s awe-inspiring that the Prince of Mathematicians was involved. And if he were alive today, I bet he’d disparage UTF-16!

16 comments:

Scaevolus said...

You came so close, but you seem to have missed the most obvious answer: "UTF-8 was built on ASCII by Rob Pike and Ken Thompson". Rob Pike and Ken Thompson also happen to be two of the primary designers of Go.

It's even easier to decide to use good engineering when you're directly responsible for it.

Yuhong Bao said...

You forgot that UTF-8 did not exist in the early days of Unicode, and UTF-1 sucked

fnl said...

Not that I want to advocate UTF-16, but there is a huge plus you get in UTF-16 you missed: character counting. If you ensure there are no Surrogate Chars (which is a simple byte-mask comparison using every second byte (only!) as in: "0xD800 & byte == 0xD800" [using the entire range to avoid having to think about byte order]) in your array, you can be 100% sure about the character length just by dividing the byte-length in two. With UTF-8, you have to look at each and every byte one by one and count the number of characters to establish the actual char-length. Finally, given how rare the Supplementary Range characters are, the Surrogate Range check can be safely skipped in many scenarios, making it significantly faster to establish char lengths in UTF-16 than UTF-8.

spankyhoot said...

You can't count characters. You can count code points. So again, not really a victory.

fnl said...

Really? So think about how you'd count code-points... Hint: it is somewhere hidden in my post...

Bg Porter said...

Your point about the superiority of UTF-8 as a native unicode notwithstanding, I don't think it's right to say that Python's internal representation is UTF-16:


Unicode strings are expressed as instances of the unicode type, one of Python’s repertoire of built-in types. It derives from an abstract type called basestring, which is also an ancestor of the str type; you can therefore check if a value is a string type with isinstance(value, basestring). Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled.


(from http://docs.python.org/howto/unicode.html#the-unicode-type)

mcherm said...

@fnl:

I disagree with your claim that UFT-16 supports character counting in a way that UTF-8 does not.

(1) Yes, UTF-16 supports character counting if you check every second byte to make sure it has no surrogate chars. Also, UTF-8 supports character counting if you check every byte to keep track of the multi-byte characters. The only difference is checking every byte or every OTHER byte -- either way, the entire string must be brought into memory and processed.

(2) In UTF-16, in addition to the string itself, you can also keep a boolean which tracks whether it has surrogate characters (very rare). You would create this boolean when the string was first generated. Of course, in UTF-8 you can keep a length field in addition to the string itself. You would create this integer when the string was first generated.

(3) If you don't actually need accuracy, with UFT-16 you can just *assume* there are no surrogate characters, in which case it is a simple matter of dividing the byte length by 2. And if you don't actually need accuracy, with UTF-8 you can make guesses also -- for instance, a surprisingly large number of strings are pure ASCII. This is the one place where UFT-16 does a little better: that heuristic is more likely to be accurate than the UTF-8 heuristic.

So I suppose UTF-16 is superior in cases where you are NOT able to carry around additional length information along with the string, and where you frequently need to know the number of characters, but don't mind if it is occasionally inaccurate. Oh yes, and if you don't care about using up significantly more memory.

fnl said...

Quite a simple counter-argument to all your points, even if assuming we did everything as you said. Think about how to get certain characters, say, 5 through 10, in a string. This is trivial once I've ensured my UTF-16 strings are UCS-2 compliant. But it is a pain with UTF-8, except if you work with ASCII data only (in which case the entire Unicode issue probably doesn't matter to you, anyways...). Less important, but still: I never store flags with my strings, I just ensure they all are UCS-2 compliant and treat them as such, even if they are not (but make the user aware of possibly wrong results by emitting a warning). But with UTF-8 you'd always be forced to store the string length, you cannot build around it as easily as in UTF-16. So I still don't see how UTF-8 is better than UTF-16 for text processing. UTF-8 is better for text transmission and storage in general, although. I think this is what this whole argument boils down to.

Ben Lynn said...

@Scaevolus: I'm aware Pike and Thompson work on Go (their office is on the floor above mine!), but if UTF-16 were a better choice, I like to think they would have picked it despite having designed a competitor.

@Yuhong Bao: Excellent point! If it were UTF-1 versus UCS-2, I'd take the latter.

@Bg Porter: You're right; I misremembered what I read on Wikipedia:

The Python language environment officially only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to "Unicode" produces correct UTF-16. Python can be compiled to use UCS-4 (UTF-32) but this is commonly only done on Unix systems.

So it uses UCS-2 or UTF-32 internally ("16- or 32-bit integers"), not UTF-16. But it is certainly not UTF-8, and according to the Stack Overflow post I linked to, UCS-2 Python handles higher code points incorrectly. Presumably one should compile Python in UTF-32 mode to avoid Unicode bugs.

@fnl: I had thought that the characters outside the BMP might be rare enough that perhaps you could simply ignore all other planes and pretend UTF-16 is UCS-2. If enough developers followed suit, it could force Unicode to go back to 16-bits: committees write standards, but they die if no one follows them (e.g. XHTML 2.0). A tiny minority might be annoyed, but they can use special-purpose software for Phaistos Disc symbols or whatever.

However, it looks like there are characters beyond BMP that are sufficiently common to deserve widespread support.

1. CJK characters: Wikipedia states you might need up to 40000 of these for "reasonably complete coverage", and perhaps over 100000 exist in total. UCS-2 only contains 20940 CJK characters.

2. Mathematical Alphanumeric Symbols. All other mathematical symbols lie within the BMP. The mathematical alphanumeric symbols are also useful in formulas, but lie beyond the BMP.

3. Emoji: though they might seem frivolous, these are becoming so popular that character encodings for them sprung up, supported by a growing number of phones. Thankfully, they will be part of the next major Unicode version, partially outside the BMP.

You could probably make a case for other characters, such as musical notation.

fnl said...

@Benn Yes, that is precisely my thinking. As I don't work with CJK text, it's not an issue - I work exclusively with scientific text. However, interestingly, I yet have to encounter a single piece of text I need to process that uses the math symbols from the SMP1 - all (really all!) texts I have processed in my life have used the math symbols that you already find scattered throughout the BMP; virtually all math symbols in SMP1 can be encoded in BMP characters. Therefore, I have been blissfully ignorant about the "astral planes" except for checks when I process new text the first time, and I am living quite happily with that approach and "UTF-16 as UCS-2"... :)

spankyhoot said...

@fnl - the fact that combining characters aren't generally considered "characters" by end users brings the whole character counting thing into the land of full unicode support confusion. Not to mention some scripts don't even really have a concept of characters, you need to start thinking graphemes. When counting "characters" you are normalising to a consistent unicode canonical form right? My point is, there is enough crazy stuff going on that counting characters is not a win in either direction.

Foudres said...

@rmcherm

You imagine what you say ? Inacurate length, or complicated algorithm just to justify how UTF16 could be better... And thus introduce subtle bugs. In computer science we can't tolerate to have approximation for a public API to be used by thousand or even millions peoples.

Tāṇḍava said...

I'm still waiting for the first language to natively use UTF32!

Yakov Galka said...

@fnl: All those who keep crying that UTF-32, UCS-2 or UTF-16+additional assumptions give constant-time code point counting or indexing, just miss one point—the number of real-world applications of this "code point counting" is exactly zero.

Typical counter argument I hear is, e.g., limiting a size of some field. But such limits arise when some low-level code (like databases or protocols) allocate a fixed amount of memory for that field, in which case the limit is on the number of code units, not code points.

Theodore Seeber said...

The sad part is that given 64 bit processors and operating systems, ALL lesser word encoding systems are actually obsolete. Your UTF-8 encoding is just going to end up taking up 8 bytes, just like the UTF-16 and UTF-32 will, and you're going to have a bunch of wasted bits regardless.

Shay said...

Theodore, I can't tell if you're making a joke or serious. Assuming the latter, it's true that during examination of an 8-bit value on a modern processor it "wastes" a 64-bit register, but the moment the examination is done that 64-bit register gets used for something else. If you insisted that it be stored as 64 bits in memory and on disk, you would be wasting lots of bits, and introducing endian issues.