Saturday, January 14, 2006

Unicode on Debian GNU/Linux

Unicode assigns each character a unique code [Wikipedia article]. My webserver now serves pages in UTF-8, and I have also switched my desktop to use Unicode.
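To make "a unique code" concrete, here is a minimal sketch in Python that prints the Unicode code point of a few characters along with the bytes UTF-8 uses to store them. The helper name describe is made up for this illustration.

```python
def describe(ch: str) -> str:
    """Return the code point of a single character and its UTF-8 bytes in hex."""
    code_point = f"U+{ord(ch):04X}"                 # the number Unicode assigns
    utf8_hex = ch.encode("utf-8").hex(" ").upper()  # how UTF-8 serialises it
    return f"{code_point} -> {utf8_hex}"


if __name__ == "__main__":
    # An ASCII letter, an accented letter, a curly quote and a Chinese character.
    for ch in ("A", "\u00e9", "\u201c", "\u4e2d"):
        print(ch, describe(ch))
```

ASCII characters stay one byte long in UTF-8, while the curly quote and the Chinese character each take three bytes.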

There was no single compelling reason for this. Instead, many minor advantages together pushed me to adopt it, as small annoyances kept arising. I wanted to add Chinese characters [Chinese characters dictionary] [English-Chinese dictionary] to some of my pages, and also Unicode curly quotes [ in HTML]. Some of the other characters look useful. Unicode is rapidly becoming the dominant standard: even Windows uses it now. It is also far superior to the old way, i.e. dealing with a myriad of mutually incompatible character encodings. Every programmer ought to know about Unicode.

Migrating the server was easy. A minor tweak in a configuration file (AddDefaultCharset UTF-8 in srm.conf for Apache) and a change to the XML headers were all that was needed.
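One way to confirm that both changes took effect is to look at what a page actually declares. The sketch below is only an illustration in Python, not part of the server setup: it extracts the charset from a Content-Type header value and the encoding from an XML declaration. The function names are hypothetical, and fetching the page itself is left out.

```python
import re
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# Matches the encoding attribute of a leading <?xml ... ?> declaration.
XML_DECL = re.compile(r'^<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']')


def encoding_from_xml_declaration(document: str) -> Optional[str]:
    """Return the lower-cased encoding named in the XML declaration, if any."""
    match = XML_DECL.match(document)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(encoding_from_xml_declaration('<?xml version="1.0" encoding="UTF-8"?>'))
```

If the header and the XML declaration disagree, browsers generally trust the HTTP header, so it is worth checking both.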

The desktop side was more involved. For an in-depth explanation, see a step-by-step guide to switching Debian to UTF-8 and a guide for Unix in general.

First I reconfigured the Debian locales package (with dpkg-reconfigure locales) and selected en_AU.UTF-8.
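After logging in again, it is worth checking that the session really picked up the new locale. The following is a rough sketch in Python, assuming the usual POSIX precedence for the character-type locale (LC_ALL, then LC_CTYPE, then LANG); the function names are invented for this example.

```python
import os
from typing import Mapping, Optional


def effective_locale(env: Mapping[str, str]) -> Optional[str]:
    """Return the locale name that governs character handling, or None if unset."""
    # Empty values count as unset, so they fall through to the next variable.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return None


def uses_utf8(locale_name: Optional[str]) -> bool:
    """Return True if a name such as en_AU.UTF-8 specifies a UTF-8 codeset."""
    if not locale_name or "." not in locale_name:
        return False
    # Drop the language_TERRITORY part and any trailing @modifier.
    codeset = locale_name.split(".", 1)[1].split("@", 1)[0]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    name = effective_locale(os.environ)
    print(name, uses_utf8(name))
```

This only inspects environment variables; it does not verify that the named locale has actually been generated on the system.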

Next I checked whether the programs I frequently use could handle UTF-8. Unfortunately for me, Eterm does not support Unicode.

After a brief investigation I settled on rxvt-unicode (aka urxvt). I miss some of the Eterm eyecandy, and it took me a while to find the command-line options that make urxvt look like Eterm. On the other hand, urxvt is leaner, especially when run in client-server mode.

One minor inconvenience is that "TERM=rxvt-unicode" is not as widely recognized as it should be. For example, the dircolors supplied in Debian does not know about it, so either a custom file has to be written, or dircolors has to be fooled.
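For the custom-file route, the dircolors database lists the terminals it will colorize on lines of the form "TERM name". Below is a small sketch in Python, assuming that format, which adds a missing terminal to the database text. The function name add_term is hypothetical, and getting the text in and out of a file is left to the reader.

```python
def add_term(database: str, term: str) -> str:
    """Return dircolors database text with a "TERM <term>" line added if missing."""
    lines = database.splitlines()
    # Nothing to do if the terminal is already listed.
    if any(line.split() == ["TERM", term] for line in lines):
        return database
    # Keep the new entry next to the existing TERM lines; fall back to the top.
    term_rows = [i for i, line in enumerate(lines) if line.split()[:1] == ["TERM"]]
    insert_at = term_rows[-1] + 1 if term_rows else 0
    lines.insert(insert_at, f"TERM {term}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "# sample database\nTERM rxvt\nTERM xterm\nDIR 01;34\n"
    print(add_term(sample, "rxvt-unicode"), end="")
```

The default database can be obtained with dircolors --print-database, and the edited copy can then be passed to dircolors as a file argument. Fooling dircolors, by contrast, usually means running it with TERM temporarily set to a terminal it already knows, such as rxvt.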