Comments on Ben Lynn's Online Garbage: "UTF-8 good, UTF-16 bad" (18 comments)

Adams Young (2019-03-19):
I admire what you have done here. I like the part where you say you are doing this to give back, but I would assume from all the comments that this is working for you as well. <a href="https://charactercount.org/" rel="nofollow">https://charactercount.org</a>

Nightcrawler (2015-09-17):
If I am not mistaken, there are just a couple of languages, like Chinese and Korean variants, that need the 4-byte UTF-16 characters. If you do not have characters from those languages, you can safely say each character is a wchar, that an ASCII character is 256 × its ASCII value, and that the end character is \0\0. UTF-8 can use 1 to 4 bytes per character depending on the case. If you have a text with both ASCII and wide characters, UTF-8 is a pain to work with.

Shay (2014-01-19):
Theodore, I can't tell if you're making a joke or being serious. Assuming the latter: it's true that while examining an 8-bit value, a modern processor "wastes" a 64-bit register, but the moment the examination is done, that 64-bit register gets used for something else.
If you insisted that it be stored as 64 bits in memory and on disk, you would be wasting lots of bits and introducing endianness issues.

Theodore M. Seeber (2013-07-19):
The sad part is that, given 64-bit processors and operating systems, ALL smaller-word encoding systems are actually obsolete. Your UTF-8 encoding is just going to end up taking 8 bytes, just like UTF-16 and UTF-32 will, and you're going to have a bunch of wasted bits regardless.

Yakov Galka (2012-05-12):
@fnl: All those who keep crying that UTF-32, UCS-2, or UTF-16 plus additional assumptions give constant-time code point counting or indexing miss one point: the number of real-world applications of this "code point counting" is <i>exactly zero</i>.

The typical counter-argument I hear is, e.g., limiting the size of some field.
But such limits arise when some low-level code (like a database or protocol) allocates a fixed amount of memory for that field, in which case the limit is on the number of code <i>units</i>, not code <i>points</i>.

Tāṇḍava (2012-01-19):
I'm still waiting for the first language to natively use UTF-32!

Scaevolus (2011-02-07):
You came so close, but you seem to have missed the most obvious answer: "UTF-8 was built on ASCII by Rob Pike and Ken Thompson". Rob Pike and Ken Thompson also happen to be two of the primary designers of Go.

It's even easier to decide to use good engineering when you're directly responsible for it.

Yuhong Bao (2011-02-07):
You forgot that UTF-8 did not exist in the early days of Unicode, and UTF-1 sucked.

fnl (2011-02-08):
Not that I want to advocate UTF-16, but there is a huge plus you get with UTF-16 that you missed: character counting. If you ensure there are no surrogate characters (which is a simple byte-mask comparison using every second byte (only!)
as in: "0xD800 & byte == 0xD800" [using the entire range to avoid having to think about byte order]) in your array, you can be 100% sure about the character length just by dividing the byte length in two. With UTF-8, you have to look at each and every byte, one by one, and count characters to establish the actual character length. Finally, given how rare the Supplementary Range characters are, the surrogate-range check can safely be skipped in many scenarios, making it significantly faster to establish character lengths in UTF-16 than in UTF-8.

spankyhoot (2011-02-08):
You can't count characters. You can count code points. So again, not really a victory.

fnl (2011-02-08):
Really? So think about how you'd count code points... Hint: it is hidden somewhere in my post...

Bg Porter (2011-02-08):
Your point about the superiority of UTF-8 as a native Unicode encoding notwithstanding, I don't think it's right to say that Python's internal representation is UTF-16:

<i>Unicode strings are expressed as instances of the unicode type, one of Python's repertoire of built-in types. It derives from an abstract type called basestring, which is also an ancestor of the str type; you can therefore check if a value is a string type with isinstance(value, basestring).
Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled.</i>

(from http://docs.python.org/howto/unicode.html#the-unicode-type)

mcherm (2011-02-08):
@fnl: I disagree with your claim that UTF-16 supports character counting in a way that UTF-8 does not.

(1) Yes, UTF-16 supports character counting <i>if</i> you check every second byte to make sure there are no surrogate characters. But UTF-8 also supports character counting if you check every byte to keep track of the multi-byte characters. The only difference is checking every byte versus every other byte; either way, the entire string must be brought into memory and processed.

(2) In UTF-16, in addition to the string itself, you can keep a boolean that tracks whether it contains surrogate characters (very rare), created when the string is first generated. Of course, in UTF-8 you can likewise keep a length field in addition to the string itself, created when the string is first generated.

(3) If you don't actually need accuracy, with UTF-16 you can just *assume* there are no surrogate characters, in which case it is a simple matter of dividing the byte length by 2. And if you don't actually need accuracy, with UTF-8 you can make guesses too; for instance, a surprisingly large number of strings are pure ASCII.
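As a concrete sketch of the byte-by-byte scan described above: in UTF-8, continuation bytes carry the bit pattern 10xxxxxx, so counting code points just means counting the bytes that do not match that pattern. A minimal, purely illustrative Python version (the helper name is mine, not from the post):

```python
def count_code_points(data: bytes) -> int:
    # UTF-8 continuation bytes match 10xxxxxx (i.e. b & 0xC0 == 0x80);
    # every other byte starts a new code point, so counting those
    # non-continuation bytes yields the code point count.
    return sum(1 for b in data if (b & 0xC0) != 0x80)

print(count_code_points("héllo".encode("utf-8")))  # 5 code points in 6 bytes
```

The single mask test `(b & 0xC0) != 0x80` is the whole per-byte check; the analogous UTF-16 scan instead tests every second byte for the surrogate range.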
This is the one place where UTF-16 does a little better: that heuristic is more likely to be accurate than the UTF-8 heuristic.

So I suppose UTF-16 is superior in cases where you are NOT able to carry additional length information along with the string, and where you frequently need to know the number of characters but don't mind if it is occasionally inaccurate. Oh yes, and where you don't care about using up significantly more memory.

fnl (2011-02-08):
Quite a simple counter-argument to all your points, even assuming we did everything as you said: think about how to get certain characters, say 5 through 10, of a string. This is trivial once I've ensured my UTF-16 strings are UCS-2 compliant, but it is a pain with UTF-8, unless you work with ASCII data only (in which case the entire Unicode issue probably doesn't matter to you anyway). Less important, but still: I never store flags with my strings; I just ensure they are all UCS-2 compliant and treat them as such even when they are not (but make the user aware of <i>possibly</i> wrong results by emitting a warning). With UTF-8 you'd always be forced to store the string length; you cannot build around it as easily as in UTF-16. So I still don't see how UTF-8 is better than UTF-16 for text processing, though UTF-8 is better for text transmission and storage in general.
I think that is what this whole argument boils down to.

Ben Lynn (2011-02-08):
@Scaevolus: I'm aware Pike and Thompson work on Go (their office is on the floor above mine!), but if UTF-16 were the better choice, I like to think they would have picked it despite having designed a competitor.

@Yuhong Bao: Excellent point! If it were UTF-1 versus UCS-2, I'd take the latter.

@Bg Porter: You're right; I misremembered what I read on Wikipedia:

<i>The Python language environment officially only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to "Unicode" produces correct UTF-16. Python can be compiled to use UCS-4 (UTF-32), but this is commonly only done on Unix systems.</i>

So it uses UCS-2 or UTF-32 internally ("16- or 32-bit integers"), not UTF-16. But it is certainly not UTF-8, and according to the Stack Overflow post I linked to, UCS-2 Python handles higher code points incorrectly. Presumably one should compile Python in UTF-32 mode to avoid Unicode bugs.

@fnl: I had thought that the characters outside the BMP might be rare enough that you could simply ignore all the other planes and pretend UTF-16 is UCS-2. If enough developers followed suit, it could force Unicode back to 16 bits: committees write standards, but standards die if no one follows them (e.g. XHTML 2.0). A tiny minority might be annoyed, but they could use special-purpose software for Phaistos Disc symbols or whatever.

However, it looks like there are characters beyond the BMP that are sufficiently common to deserve widespread support.

1. CJK characters: Wikipedia states you might need up to 40,000 of these for "reasonably complete coverage", and perhaps over 100,000 exist in total.
UCS-2 only contains 20,940 CJK characters.

2. Mathematical Alphanumeric Symbols: all other mathematical symbols lie within the BMP, but these, which are also useful in formulas, lie beyond it.

3. Emoji: though they might seem frivolous, these are becoming so popular that character encodings for them have sprung up, supported by a growing number of phones. Thankfully, they will be part of the next major Unicode version, partially outside the BMP.

You could probably make a case for other characters as well, such as musical notation.

fnl (2011-02-08):
@Ben: Yes, that is precisely my thinking. As I don't work with CJK text, it's not an issue; I work exclusively with scientific text. Interestingly, though, I have yet to encounter a single piece of text I need to process that uses the math symbols from SMP1. All (really all!) the texts I have processed in my life have used the math symbols already scattered throughout the BMP; virtually all math symbols in SMP1 can be encoded with BMP characters. Therefore I have been blissfully ignorant of the "astral planes", apart from checks when I process new text for the first time, and I am living quite happily with that approach and "UTF-16 as UCS-2"... :)

spankyhoot (2011-02-08):
@fnl: the fact that combining characters aren't generally considered "characters" by end users drags the whole character-counting question into the land of full-Unicode-support confusion.
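A quick Python illustration of that combining-character point (the sample string is my own, hypothetical example): two code points can render as a single user-perceived character, and normalization can change the count.

```python
import unicodedata

s = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(s))                                # 2 code points, rendered as one "é"
print(len(unicodedata.normalize("NFC", s)))  # 1 after composing to U+00E9
```

So even a "correct" code point count disagrees with what an end user would call the character count until you pick a canonical form.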
Not to mention that some scripts don't really have a concept of characters at all; you need to start thinking in graphemes. And when counting "characters", you are normalising to a consistent Unicode canonical form, right? My point is, there is enough crazy stuff going on that character counting is not a win in either direction.

Foudres (2011-02-09):
@mcherm: Do you realize what you are saying? Inaccurate lengths, or complicated algorithms, just to justify how UTF-16 could be better... and thus introduce subtle bugs. In computer science we cannot tolerate approximations in a public API used by thousands or even millions of people.
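For reference, the surrogate-pair behavior the thread keeps circling can be demonstrated directly in Python. Note that the strict mask for the surrogate range D800–DFFF is 0xF800; the 0xD800 mask quoted earlier in the thread would also match ordinary code units in the 0xF800–0xFFFF range.

```python
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point outside the BMP

utf16 = clef.encode("utf-16-le")
print(len(utf16))                 # 4 bytes: two 16-bit code units (a surrogate pair)
print(len(clef.encode("utf-8")))  # also 4 bytes in UTF-8

# Split the little-endian byte string into its two 16-bit code units.
units = [int.from_bytes(utf16[i:i + 2], "little") for i in (0, 2)]
print([hex(u) for u in units])                      # ['0xd834', '0xdd1e']
print(all((u & 0xF800) == 0xD800 for u in units))   # True: both lie in D800-DFFF
```

So for this character UTF-16 has no size advantage over UTF-8, and any "divide the byte length by two" count is off by one unless the surrogate check runs.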