Matt Neuburg 8 April 2002

Two Bytes of the Cherry: Unicode and Mac OS X, Part 2

In the first part of this article, I introduced you to Unicode, a grand unification scheme whereby every character in every writing system would be represented by a unique value, up to a potential one million distinct characters and symbols. Mac OS X has Unicode built in. In this concluding part of the article, we’ll look for it.

<https://tidbits.com/getbits.acgi?tbart=06774>

Forced Entry — To prove to yourself that Unicode is present on your computer, you can type some of its characters. Now, clearly you won’t be able to do this in the ordinary way, since the keyboard keys alone, even including the Option and Shift modifiers, can’t differentiate even 256 characters. Thus there has to be what’s called an "input method." Here’s a simple one: open the International preferences pane of Mac OS X’s System Preferences, go to the Keyboard Menu tab, and enable the Unicode Hex Input checkbox. Afterwards, a keyboard menu will appear in your menu bar (on my machine this looks, by default, like an American flag).

Now we’re ready to type. Launch TextEdit from your Applications folder. From the keyboard menu, choose Unicode Hex Input. Now hold down the Option key and type (without quotes or spaces) "042E 0440 0438". You’ll see the Russian name "Yuri" written as three Cyrillic characters. The values you typed were the Unicode hexadecimal (base-16) numeric codes for these characters.

<http://www.unicode.org/charts/PDF/U0400.pdf>
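To check the arithmetic behind those hex codes, here is a minimal Python sketch. The helper name hex_codes_to_text is hypothetical (it is not part of any Apple tool); it simply converts each base-16 code to the Unicode character with that value, which is roughly what the Unicode Hex Input method does for you as you type.

```python
# Hypothetical helper for illustration: turn space-separated hex codes,
# as typed with Unicode Hex Input, into the characters they stand for.
def hex_codes_to_text(codes: str) -> str:
    return "".join(chr(int(code, 16)) for code in codes.split())


yuri = hex_codes_to_text("042E 0440 0438")
print(yuri)  # three Cyrillic letters
print([f"U+{ord(ch):04X}" for ch in yuri])  # the code points, back in hex
```

Going the other way (from a character to its code) is just ord(), which is handy when you want to find out what to type for a character you have pasted from elsewhere.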

Observe that if you now select "Yuri" and change the font, it still reads correctly. Is this because every font in Mac OS X includes Cyrillic letters? No! It’s because, if the characters to be displayed aren’t present in the font you designate, Mac OS X automatically hunts through your installed fonts to find any font that includes them, and uses that instead. That’s important, because a font containing all Unicode characters would be huge, not to mention a lot of work to create. This way, font manufacturers can specialize, and each font can contribute just a subset of the Unicode repertoire.

Now, Unicode Hex Input, though it can generate any Unicode character if you happen to know its hex code, is obviously impractical. In real life, there needs to be a better way of typing characters. One way is through keyboard mappings. A keyboard mapping is the relationship between the key you type and the character code you generate. Normally, of course, every key generates a character from the ASCII range of characters. But consider the Symbol font. In Mac OS 9, the Symbol font was just an alternative set of characters superimposed on the ASCII range. In Mac OS X, though, Symbol characters are Unicode characters; they aren’t in the ASCII range at all. So to type using the Symbol font, you must use a different keyboard mapping: you type in the ordinary way, but your keystrokes generate different character codes than they normally would, so you reach the area of the Unicode repertoire where the Symbol characters are.

To see this, first enable the Symbol mapping in the International preference pane. Next, open Key Caps from the Applications folder’s Utilities folder, and choose Symbol from the Font menu. Now play with the keyboard menu. If you choose the U.S. keyboard mapping, Key Caps displays much of the font as blank; if you choose the Symbol keyboard mapping, the correct characters appear. In fact, it’s really the mapping (not the font) that’s important, since the Symbol characters appear in many other fonts (and, as we saw earlier, Mac OS X fetches the right character from another font if the designated font lacks it).

Another common keyboard mapping device is to introduce "dead" keys. You may be familiar with this from the normal U.S. mapping, which lets you access certain diacritical variations of vowels, such as grave, acute, circumflex, and umlaut, using dead keys. For example, in the U.S. mapping, typing Option-u followed by "u" creates u-umlaut; the Option-u tells the mapping to suspend judgment until the next typed input shows what character is intended. The Extended Roman keyboard mapping, which you can enable in the International preference pane, extends this principle to provide easy access to even more Roman diacritics; for example, Option-a becomes a dead key that puts a macron over the next vowel you type.

<http://homepage.mac.com/goldsmit/.Pictures/ExtendedRoman.jpg>
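The "suspend judgment" behavior of a dead key can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DEAD_KEYS table, the key names, and the type_keys function are all made up for the example, and the sketch composes characters using Unicode combining marks and normalization, which is not necessarily how Mac OS X's keyboard mappings are implemented.

```python
import unicodedata

# Assumed, simplified table: dead-key name -> Unicode combining mark.
DEAD_KEYS = {
    "option-u": "\u0308",  # COMBINING DIAERESIS (umlaut), as in the U.S. mapping
    "option-a": "\u0304",  # COMBINING MACRON, as in the Extended Roman mapping
}


def type_keys(keys):
    """Turn a sequence of key names into text, honoring dead keys."""
    out = []
    pending = None  # combining mark waiting for the next ordinary key
    for key in keys:
        if key in DEAD_KEYS:
            pending = DEAD_KEYS[key]  # produce nothing yet; wait for next key
            continue
        if pending is not None:
            # Compose the base letter and the mark into one character if possible.
            key = unicodedata.normalize("NFC", key + pending)
            pending = None
        out.append(key)
    return "".join(out)


u_umlaut = type_keys(["option-u", "u"])
a_macron = type_keys(["option-a", "a"])
print(u_umlaut, a_macron)
```

The point of the design is that the dead key itself emits nothing; only the following keystroke decides which character is generated.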

Various other input methods exist for various languages, some of them (as for Japanese) quite elaborate. Unfortunately, Apple’s selection of these on Mac OS X still falls short of what was available in Mac OS 9; for example, there is no Devanagari, Arabic, or Hebrew input method for Mac OS X. In some cases, the input method for a language won’t appear in Mac OS X unless a specific font is also present; to get the font, you would install the corresponding Language Kit into Classic from the Mac OS 9 CD. In other cases, the material may be available through Software Update. I won’t give further details, since if you need a specific input method you probably know a lot more about the language, and Unicode, than I do.

<http://docs.info.apple.com/article.html?artnum=106484>

<http://docs.info.apple.com/article.html?artnum=120065>

Exploring the Web — An obvious benefit of Unicode standardization is the possibility of various languages and scripts becoming universally legible over the Web. For a taste of what this will be like, I recommend the UTF-8 Sampler page of Columbia University’s Kermit project; the URL is given below. You’ll need to be using OmniGroup’s OmniWeb browser; this is the only browser I’ve found that renders Unicode fonts decently. For best results, also download James Kass’s Code2000 font and drop it into one of your Fonts folders before starting up OmniWeb. (If you’re too lazy to download Code2000 you’ll still get pretty good results thanks to the Unicode fonts already installed in Mac OS X, but some characters will be replaced by a "filler" character designed to let you know that the real character is missing.)

<http://www.omnigroup.com/applications/omniweb>

<http://home.att.net/~jameskass/CODE2000.ZIP>

<http://www.columbia.edu/kermit/utf8.html>

When you look at the Sampler using OmniWeb, you should see Runic, Middle English, Middle High German, Modern Greek, Russian, Georgian, and many others. One or two characters are missing, but the results are still amazingly good. The only major problem is that the right-to-left scripts such as Hebrew and Arabic are backwards (that is to say, uh, forwards). Note that you’re not seeing pictures! All the text is being rendered character by character from your installed fonts, just as in a word processor.

You may wonder how an HTML document can tell your browser what Unicode character to display. After all, to get an ordinary English "e" to appear in a Web page, you just type an "e" in the HTML document; but how do you specify, say, a Russian "yu" character? With Unicode, there are two main ways. One is to use the numbered entity approach; just as you’re probably aware that you can get a double-quote character in HTML by saying "&quot;", so you can get a Russian "yu" by saying "&#1102;" (because 1102 is the decimal equivalent of that character’s Unicode value). This works fine if a page contains just a few Unicode characters; otherwise, though, it becomes tedious for whoever must write and edit the HTML, and makes for large documents, since every such character requires several bytes (seven, in the case of "&#1102;"). A better solution is UTF-8.
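The numbered entity approach is easy to verify with Python's standard html module. This is just a sketch for checking the numbers, not a description of how any particular browser decodes entities; the variable names are made up for the example.

```python
import html

# Decode a decimal numbered entity, its hexadecimal equivalent,
# and a named entity.
yu = html.unescape("&#1102;")       # decimal 1102
yu_hex = html.unescape("&#x44E;")   # the same value written in hex
quote = html.unescape("&quot;")     # named entity for the double quote

# Build the decimal entity for any character from its Unicode value.
entity = f"&#{ord(yu)};"

print(yu, yu_hex, quote, entity)
print(hex(1102))  # decimal 1102 is hexadecimal 44e
```

Both entity forms name the same character, because 1102 decimal and 044E hexadecimal are the same number.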

To understand what UTF-8 is, think about how you would encode Unicode as a sequence of bytes. One obvious way would just be to have the bytes represent each character’s numeric value. For example, Russian "yu" is hexadecimal 044E, so it could be represented by a byte whose value is 04 and a byte whose value is 4E. This is perfectly possible – in fact, it has an official name, UTF-16 – but it lacks backwards compatibility. A browser or text processor that doesn’t do Unicode can’t read any characters of a UTF-16 document – even if that document consists entirely of characters from the ASCII range, since each of those is padded out with an extra 00 byte. And even worse, a UTF-16 document can’t safely be transmitted as text across the Internet, because some of its bytes (such as the 04 in our example) are control codes, not legal character values. What’s necessary is a Unicode encoding in which every byte is a legal character value and ASCII characters are encoded as themselves.
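Here is a small Python sketch of that "obvious" encoding, using the standard utf-16-be codec (UTF-16 with the high byte first, matching the byte order in the example). The hex_bytes helper is a made-up convenience for printing; it is not a library function.

```python
# Hypothetical helper for illustration: show bytes as two-digit hex values.
def hex_bytes(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)


yu = "\u044e"  # Russian "yu"
yu_utf16 = yu.encode("utf-16-be")
e_utf16 = "e".encode("utf-16-be")

print(hex_bytes(yu_utf16))  # the two bytes of the character's numeric value
print(hex_bytes(e_utf16))   # even a plain ASCII "e" takes two bytes
```

Note that the plain "e" is no longer the single byte 65 that a non-Unicode program expects, which is the backwards-compatibility problem in miniature.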

That’s the idea behind UTF-8. It’s a way of encoding Unicode character values as sequences of Internet-legal character bytes – where members of the original ASCII character set are simply encoded as themselves, and every other character becomes two or more bytes from the 8-bit range above ASCII. With this encoding, an application (such as a browser or a word processor) that doesn’t understand UTF-8 will show sequences of non-ASCII Unicode characters as gibberish, but at least it will show any ordinary ASCII characters correctly. The HTML way to let a browser know that it’s seeing a UTF-8 document is a <META> tag specifying the "charset" as "utf-8". OmniWeb sees this and interprets the Unicode sequences correctly. For example, the UTF-8 encoding of Russian "yu" is the two bytes D1 8E. Neither falls in the 7-bit ASCII range, but both are legal 8-bit character bytes: in the Mac’s traditional Mac Roman encoding they’re an em-dash followed by an e-acute. Indeed, you can just type those two characters into an HTML document that declares itself as UTF-8, and OmniWeb will show them as a Russian "yu".
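The byte values in this example can be confirmed with Python's standard codecs. This sketch assumes that "on a Mac" means the classic Mac Roman encoding, which Python calls mac_roman; the variable names are invented for the illustration.

```python
yu = "\u044e"  # Russian "yu"

# Encode as UTF-8 and look at the resulting bytes.
yu_utf8 = yu.encode("utf-8")
print(" ".join(f"{b:02X}" for b in yu_utf8))

# What a program that assumes Mac Roman would display for those bytes:
# U+2014 (em dash) followed by U+00E9 (e with acute accent).
misread = yu_utf8.decode("mac_roman")
print([f"U+{ord(ch):04X}" for ch in misread])

# Plain ASCII text is unchanged by UTF-8 encoding.
ascii_utf8 = "Yuri".encode("utf-8")
print(ascii_utf8 == b"Yuri")

# Reading the bytes back as UTF-8 recovers the original character.
round_trip = yu_utf8.decode("utf-8")
```

The same two bytes therefore mean different things depending on the declared character set, which is why the "charset" declaration matters.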

If you want to learn more about the Unicode character set and test your fonts against the standard, or if you’d like to focus on a particular language, Alan Wood’s Web pages are an extremely well-maintained portal and an excellent starting point. And TidBITS reader Tom Gewecke (who also provided some great help with this article) maintains a page with useful information about the state of languages on the Mac, with special attention to Mac OS X and Unicode.

<http://www.hclrss.demon.co.uk/unicode/index.html>

<http://hometown.aol.com/tg3907/mlingos9.html>

Exploring Your Fonts — Meanwhile, back on your own hard disk, you may be wondering what Unicode fonts you have and what Unicode characters they contain. Unfortunately, Apple provides no way to learn the answer. You can’t find out with Key Caps, since the range of characters corresponding to keys and modifiers is minuscule in comparison with the Unicode character set. Most other font utilities are blind to everything beyond ASCII. One great exception is the $15 FontChecker, from WunderMoosen. This program lets you explore the full range of Unicode characters in any font, and is an absolute must if you’re going to make any sense of Unicode fonts on your Mac. It also features drag-and-drop, which can make it helpful as an occasional input method. I couldn’t have written this article without it.

<http://www.wundermoosen.com/wmXFCHelp.html>

Also valuable is UnicodeChecker, a free utility from Earthlingsoft that displays every Unicode character. Unlike FontChecker, it isn’t organized by font, but simply shows every character in order, and can even display characters from the supplementary planes. (Download James Kass’s Code2001 font if you want to see some of these.)

<http://homepage.mac.com/earthlingsoft/apps.html#unicodechecker>

<http://www.unicode.org/Public/UNIDATA/>

<http://home.att.net/~jameskass/CODE2001.ZIP>

A Long Way To Go — Unicode is still in its infancy; Mac OS X is too. So if this overview has given you the sense that Unicode on Mac OS X is more of a toy than a tool, you’re right. There needs to be a lot of growth, on several fronts, for Mac OS X’s Unicode support to become really useful.

A big problem right now is the lack of Unicode support in applications. Already we saw that not all browsers are created equal; we had to use OmniWeb to view a Unicode Web page correctly (try the UTF-8 Sampler page in another browser to see the difference). And there’s good reason why I had you experiment with typing Unicode using TextEdit and not some other word processor. Also, be warned that you can’t necessarily tell from its documentation what an application can do. Software companies like to use the Unicode buzzword, but there’s many a slip ‘twixt the buzzword and the implementation. Microsoft Word X claims you can "enter, display, and edit text in all supported languages," but it doesn’t accept the Unicode Hex Input method and often you can’t paste Unicode characters into it. BBEdit can open and save Unicode text files, but its display of Unicode characters is poor – it often has layout problems, and it can display only a single font at a time (whereas, as we’ve seen, Unicode characters are typically drawn from various fonts). BBEdit also doesn’t accept the Unicode Hex Input method, so you can’t really use it to work with Unicode files.

The operating system itself must evolve too. The Unicode standard has requirements about bidirectional scripts and combining multiple characters that Mac OS X doesn’t yet fully handle. The installed fonts don’t represent the full character set. More input methods are required, and Apple needs to provide utilities for creating keyboard mappings, and perhaps even simple input methods, so that users can start accessing their favorite characters easily. The Unicode standard, meanwhile, is itself constantly being revised and extended. At the same time, Windows users are getting built-in language and Unicode support that in some respects is light-years ahead of Mac OS X. The hope is that as things progress, Apple will catch up, and the Unicode promise of Mac OS X will start to be fulfilled. Then the Mac will be not just a digital hub, but a textual hub as well.
