Series: Unicode & Mac OS X
Mac OS X can handle hundreds of languages using Unicode. How does that help you?
Article 1 of 2 in series
by Matt Neuburg
If you're using Mac OS X, a massive revolution is proceeding unnoticed on your computer. No, I don't mean Unix, preemptive multitasking, or any other familiar buzzwords. I'm talking about text.
How can text be revolutionary? Text is not sexy. We take text for granted, typing it, reading it, editing it, storing it. Text is one of the main reasons most people bought computers in the first place. It's a means, a medium; it's not an end, not something explicit. The keyboard lies under our hands; strike a key and the corresponding letter appears. What could be simpler?
But the more you know about text and how it works on a computer, the more amazing it is that you can do any typing at all. There are issues of what keyboard you're using, how the physical keys map to virtual keycodes, how the virtual keycodes are represented as characters, how to draw the characters on the screen, and how to store information about them in files. There are problems of languages, fonts, uppercase and lowercase, diacritics, sort order, and more.
In this article I'll focus on just one aspect of text: Unicode. Whether or not you've heard of Unicode, it affects you. Mac OS X is a Unicode system. Its native strings are Unicode strings. Many of the fonts that come with Mac OS X are Unicode fonts.
But there are problems. Mac OS X's transition to Unicode is far from complete. There are places where Unicode doesn't work, where it isn't implemented properly, where it gets in your way. Perhaps you've encountered some of these, shrugged, and moved on, never suspecting the cause. Well, from now on, perhaps you'll notice the problems a little more and shrug a little less. More important, you'll be prepared for the future, because Unicode is coming. It's heavily present on Mac OS X, and it's only going to become more so. Unicode is the future - your future. And as my favorite movie says, we are all interested in the future, since that is where we shall spend the rest of our lives.
ASCII No Questions -- To understand the future, we must start with the past.
In the beginning was writing, the printing press, books, the typewriter, and in particular a special kind of typewriter for sending information across electrical wires - the teletype. Perhaps you've seen one in an old movie, clattering out a news story or a military order. Teletype machines worked by encoding typed letters of the alphabet as electrical impulses and decoding them on the other end.
When computers started to be interactive and remotely operable, teletypes were a natural way to talk to them; and the first universal standard computer "alphabet" emerged, not without some struggle, from how teletypes worked. This was ASCII (pronounced "askey"), the American Standard Code for Information Interchange; and you can still see the teletype influence in the presence of its "control codes," so called because they helped control the teletype at the far end of the line. (For example, hitting Control-G sent a control code which made a bell ring on the remote teletype, to get the operator's attention - the ancestor of today's alert beep.)
The United States being the major economic and technological force in computing, the ASCII characters were the capital and small letters of the Roman alphabet, along with some common typewriter punctuation and the control codes. The set originally comprised 128 characters. That number is, of course, a power of 2 - no coincidence, since binary lies at the heart of computers.
When I got an Apple IIc, I was astounded to find ASCII extended by another power of 2, to embrace 256 characters. This made sense mathematically, because 256 is exactly the number of values that 8 binary bits can represent - a byte, which was the minimum unit of memory data. This was less wasteful, but it was far from clear what to do with the extra 128 characters, which were referred to as "high ASCII" to distinguish them from the original 128 "low ASCII" characters. The problem was the computer's monitor - its screen. In those days, screen representation of text was wired into the monitor's hardware, and low ASCII was all it could display.
Flaunt Your Fonts, Watch Your Language -- When the Macintosh came along in 1984, everything changed. The Mac's entire screen displayed graphics, and the computer itself, not the monitor hardware, had the job of constructing the letters when text was to be displayed. At the time this was stunning and absolutely revolutionary. A character could be anything whatever, and for the first time, people saw all 256 characters really being used. To access high ASCII, you pressed the Option key. What you saw when you did so was amazing: A bullet! A paragraph symbol! A c-cedilla! Thus arrived the MacRoman character set to which we've all become accustomed.
Since the computer was drawing the character, you also had a choice of fonts - another revolution. After the delirium of playing with the Venice and San Francisco fonts started to wear off, users saw that this had big consequences for the representation of non-Roman languages. After all, no law tied the 256 keycodes to the 256 letters of the MacRoman character set. A different font could give you 256 more letters - as the Symbol font amply demonstrated. This, in fact, is why I switched to a Mac. In short order I was typing Greek, Devanagari (the Sanskrit syllabary), and phonetic symbols. After years of struggling with international typewriters or filling in symbols by hand, I was now my own typesetter, and in seventh heaven.
Trouble in Paradise -- Heaven, however, had its limits. Suppose I wanted to print a document. Laser printers were expensive, so I had to print in a Mac lab where the computers didn't necessarily have the same fonts I did, and thus couldn't print my document properly. The same problem arose if I wanted to give a file to a colleague or a publisher who might not have the fonts I was using, and so couldn't view my document properly.
Windows users posed yet another problem. The Windows character set was perversely different from the Mac. For example, WinLatin1 (often referred to, somewhat inaccurately, as ISO 8859-1) places the upside-down interrogative that opens a Spanish question at code 191; but that character is 192 on Mac (where 191 is the Norwegian slashed-o).
And even among Mac users, "normal" fonts came in many linguistic varieties, because the 256 characters of MacRoman do not suffice for every language that uses a variation of the Roman alphabet. Consider Turkish, for instance. MacRoman includes a Turkish dotless-i, but not a Turkish s-cedilla. So on a Turkish Mac the s-cedilla replaces the American Mac's "fl" ligature. A parallel thing happens on Windows, where (for example) Turkish s-cedilla and the Old English thorn characters occupy the same numeric spot in different language systems.
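If you'd like to see these collisions for yourself, here is a minimal sketch in Python. It assumes that Python's built-in "cp1252" and "cp1254" codecs are fair stand-ins for WinLatin1 and its Turkish counterpart, and that "mac_roman" and "mac_turkish" match the Mac character sets described above; the helper name char_at is my own invention.

```python
import unicodedata

def char_at(byte_value, codec):
    """Return the character that one byte stands for in a legacy 8-bit codec."""
    return bytes([byte_value]).decode(codec)

# The Spanish inverted question mark sits at a different code on each platform.
win_191 = char_at(191, "cp1252")      # Windows: inverted question mark
mac_191 = char_at(191, "mac_roman")   # Mac: slashed o
mac_192 = char_at(192, "mac_roman")   # Mac: inverted question mark

# The same code means different letters in different language variants.
mac_us_223 = char_at(223, "mac_roman")     # "fl" ligature on a U.S. Mac
mac_tr_223 = char_at(223, "mac_turkish")   # s-cedilla on a Turkish Mac
win_west_254 = char_at(254, "cp1252")      # thorn on Western Windows
win_tr_254 = char_at(254, "cp1254")        # s-cedilla on Turkish Windows

# Print Unicode names rather than the characters, so the output is plain ASCII.
for label, ch in [("Windows 191", win_191), ("Mac 191", mac_191),
                  ("Mac 192", mac_192), ("Mac U.S. 223", mac_us_223),
                  ("Mac Turkish 223", mac_tr_223),
                  ("Windows Western 254", win_west_254),
                  ("Windows Turkish 254", win_tr_254)]:
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The point is simply that a byte value has no meaning until you know which character set it was written in.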
Tower of Babel -- None of this would count as problematic were it not for communications. If your computing is confined to your own office and your own printer and your own documents, you can work just fine. But cross-platform considerations introduce a new twist, and of course the rise of the Internet really brought things to a head. Suddenly people whose base systems differed were sending each other email and reading each other's Web pages. Conventions were established for coping, but these work only to the extent that people and software obey them. If you've ever received email from someone named "=?iso-8859-1?Q?St=E9phane?=," or if you've read a Web page where quotes appeared as a funny-looking capital O, you've experienced some form of the problem.
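Both symptoms are easy to reproduce. The following sketch uses Python's standard email.header module to undo the encoding in that mangled name, and then shows how curly quotes turn into capital O's. I'm assuming here that the quotes were typed on a Mac (MacRoman) and misread as ISO 8859-1, which is one plausible route to the problem but not the only one.

```python
from email.header import decode_header, make_header

# A mail program that obeys the convention decodes the header like this.
raw_name = "=?iso-8859-1?Q?St=E9phane?="
decoded_name = str(make_header(decode_header(raw_name)))

# Curly quotes saved as MacRoman bytes, then misread as ISO 8859-1 (Latin-1).
curly = "\u201cHello\u201d"
mac_bytes = curly.encode("mac_roman")
misread = mac_bytes.decode("latin-1")

# ascii() escapes non-ASCII characters so the output prints safely anywhere.
print(ascii(decoded_name))
print(mac_bytes.hex())
print(ascii(misread))
```

The decoded name comes out as "Stephane" with an e-acute, and the misread quotes come out as capital O's with grave and acute accents.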
Also, since fonts don't travel across the Internet, characters that depend on a particular font may not be viewable at all. HTML can ask that certain characters should appear in a certain font on your machine when you view my page, but a fat lot of good that will do if you don't have that font.
Finally, there is a major issue I haven't mentioned yet: for some writing systems, 256 characters is nowhere near enough. An obvious example is Chinese, which requires several thousand characters.
The Premise and the Promise -- What Unicode proposes is simple enough: increase the number of bytes used to represent each character. For example, if you use two bytes per character, you can have 65,536 characters - enough to represent the Roman alphabet plus various accents and diacritics, plus Greek, Russian, Hebrew, Arabic, Devanagari, the core symbols of various Asian languages, and many others.
What's new here isn't the codification of character codes to represent different languages; the various existing character sets already did that, albeit clumsily. Nor is it the use of a double-byte system; such systems were already in use to represent Asian characters. What's new is the grand unification into a single character set embracing all characters at once. In other words, Unicode would do away with character set variations across systems and fonts. In fact, in theory a single (huge) font could potentially contain all needed characters.
It turns out, actually, that even 65,536 symbols aren't enough, once you start taking into account specialized scholars' requirements for conventional markings and historical characters (about which the folks who set the Unicode standards have often proved not to be as well informed as they like to imagine). Therefore Unicode has recently been extended to a potential 16 further sets of 65,536 characters (called "supplementary planes"); the size of the potential character set thus approximates a million, with each character represented by at most 4 bytes. The first supplementary plane is already being populated with such things as Gothic; musical and mathematical symbols; Mycenaean (Linear B); and Egyptian hieroglyphics. The evolving standard is, not surprisingly, the subject of various political, cultural, technical, and scholarly struggles.
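The arithmetic is worth a quick check. This little sketch works out the size of the potential character set and then takes one Gothic letter from the first supplementary plane (U+10330, which I've picked purely as an example) to confirm that it needs four bytes. The numbers count potential code values, not characters actually assigned.

```python
# One plane is everything two bytes can count.
PLANE_SIZE = 2 ** 16                       # 65,536

# The original plane plus 16 supplementary planes.
PLANES = 1 + 16
TOTAL_CODE_VALUES = PLANES * PLANE_SIZE    # 1,114,112, roughly a million

# A sample character from the first supplementary plane: GOTHIC LETTER AHSA.
gothic_ahsa = chr(0x10330)
plane_number = ord(gothic_ahsa) // PLANE_SIZE

# Number of bytes this character takes in two common encodings.
utf8_length = len(gothic_ahsa.encode("utf-8"))
utf16_length = len(gothic_ahsa.encode("utf-16-be"))

print(PLANE_SIZE, TOTAL_CODE_VALUES, plane_number, utf8_length, utf16_length)
```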
What has all this to do with you, you ask? It's simple. As I said at the outset, if you're a Mac OS X user, Unicode is on your computer, right now. But where? In the second half of this article, I'll show you.
Article 2 of 2 in series
by Matt Neuburg
In the first part of this article, I introduced you to Unicode, a grand unification scheme whereby every character in every writing system would be represented by a unique value, up to a potential one million distinct characters and symbols. Mac OS X has Unicode built in. In this concluding part of the article, we'll look for it.
Forced Entry -- To prove to yourself that Unicode is present on your computer, you can type some of its characters. Now, clearly you won't be able to do this in the ordinary way, since the keyboard keys alone, even including the Option and Shift modifiers, can't differentiate even 256 characters. Thus there has to be what's called an "input method." Here's a simple one: open the International preferences pane of Mac OS X's System Preferences, go to the Keyboard Menu tab, and enable the Unicode Hex Input checkbox. Afterwards, a keyboard menu will appear in your menu bar (on my machine this looks, by default, like an American flag).
Now we're ready to type. Launch TextEdit from your Applications folder. From the keyboard menu, choose Unicode Hex Input. Now hold down the Option key and type (without quotes or spaces) "042E 0440 0438". You'll see the Russian name "Yuri" written as three Cyrillic characters. The values you typed were the Unicode hexadecimal (base-16) numeric codes for these characters.
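What Unicode Hex Input does with those digits can be imitated in a few lines of Python. This is only a sketch of the idea, not of Apple's implementation; the function name from_hex_input is made up for the occasion.

```python
import unicodedata

def from_hex_input(typed):
    """Turn space-separated hexadecimal codes into the characters they name."""
    return "".join(chr(int(code, 16)) for code in typed.split())

yuri = from_hex_input("042E 0440 0438")

# Look up the official Unicode name of each character typed.
names = [unicodedata.name(ch) for ch in yuri]

for ch, name in zip(yuri, names):
    print(f"U+{ord(ch):04X} {name}")
```

The three names that come back are CYRILLIC CAPITAL LETTER YU, CYRILLIC SMALL LETTER ER, and CYRILLIC SMALL LETTER I.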
Observe that if you now select "Yuri" and change the font, it still reads correctly. Is this because every font in Mac OS X includes Cyrillic letters? No! It's because, if the characters to be displayed aren't present in the font you designate, Mac OS X automatically hunts through your installed fonts to find any font that includes them, and uses that instead. That's important, because a font containing all Unicode characters would be huge, not to mention a lot of work to create. This way, font manufacturers can specialize, and each font can contribute just a subset of the Unicode repertoire.
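The hunting behavior is easier to picture with a toy model. In the sketch below each "font" is nothing but the set of character codes it can draw; the font names and their coverage are invented, and I make no claim that Mac OS X searches in this order or in this way.

```python
# Toy model only: a "font" is just the set of code values it has glyphs for.
# Both font names and their coverage are hypothetical.
FONTS = {
    "HypotheticalSerif": set(range(0x20, 0x7F)),
    "HypotheticalCyrillic": set(range(0x20, 0x7F)) | set(range(0x0400, 0x0500)),
}

def font_for(char, requested, fonts=FONTS):
    """Pick the requested font if it covers char, else the first font that does."""
    if ord(char) in fonts.get(requested, set()):
        return requested
    for name, coverage in fonts.items():
        if ord(char) in coverage:
            return name
    return None  # no installed font has the character

print(font_for("a", "HypotheticalSerif"))        # the requested font is used
print(font_for("\u042e", "HypotheticalSerif"))   # falls back to the Cyrillic font
print(font_for("\u05d0", "HypotheticalSerif"))   # Hebrew alef: nothing covers it
```

The last case, where no font has the character at all, is where a real system has to show some kind of placeholder.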
Now, Unicode Hex Input, though it can generate any Unicode character if you happen to know its hex code, is obviously impractical. In real life, there needs to be a better way of typing characters. One way is through keyboard mappings. A keyboard mapping is the relationship between the key you type and the character code you generate. Normally, of course, every key generates a character from the ASCII range of characters. But consider the Symbol font. In Mac OS 9, the Symbol font was just an alternative set of characters superimposed on the ASCII range. In Mac OS X, though, Symbol characters are Unicode characters; they aren't in the ASCII range at all. So to type using the Symbol font, you must use a different keyboard mapping: you type in the ordinary way, but your keystrokes generate different character codes than they normally would, so you reach the area of the Unicode repertoire where the Symbol characters are.
To see this, first enable the Symbol mapping in the International preference pane. Next, open Key Caps from the Applications folder's Utilities folder, and choose Symbol from the Font menu. Now play with the keyboard menu. If you choose the U.S. keyboard mapping, Key Caps displays much of the font as blank; if you choose the Symbol keyboard mapping, the correct characters appear. In fact, it's really the mapping (not the font) that's important, since the Symbol characters appear in many other fonts (and, as we saw earlier, Mac OS X fetches the right character from another font if the designated font lacks it).
Another common keyboard mapping device is to introduce "dead" keys. You may be familiar with this from the normal U.S. mapping, which lets you access certain diacritical variations of vowels, such as grave, acute, circumflex, and umlaut, using dead keys. For example, in the U.S. mapping, typing Option-u followed by "u" creates u-umlaut; the Option-u tells the mapping to suspend judgment until the next typed input shows what character is intended. The Extended Roman keyboard mapping, which you can enable in the International preference pane, extends this principle to provide easy access to even more Roman diacritics; for example, Option-a becomes a dead key that puts a macron over the next vowel you type.
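A dead key is really just a tiny piece of state: remember the accent, wait for the next key, then combine the two. Here is a minimal sketch of that idea. The key names such as "option-u" are my own labels, and I've used Unicode's combining accents and normalization to do the combining, which is a convenient shortcut and not necessarily how any real keyboard mapping is built.

```python
import unicodedata

# Hypothetical key labels mapped to the combining accent each dead key holds.
DEAD_KEYS = {
    "option-u": "\u0308",  # combining diaeresis (umlaut), as in the U.S. mapping
    "option-a": "\u0304",  # combining macron, as in the Extended Roman mapping
}

def type_keys(keystrokes):
    """Simulate typing: a dead key waits, then modifies the next character."""
    output = []
    pending = None
    for key in keystrokes:
        if key in DEAD_KEYS:
            pending = DEAD_KEYS[key]   # suspend judgment until the next key
            continue
        if pending is not None:
            # Fuse letter and accent into one precomposed character if one exists.
            key = unicodedata.normalize("NFC", key + pending)
            pending = None
        output.append(key)
    return "".join(output)

u_umlaut = type_keys(["option-u", "u"])
o_macron = type_keys(["option-a", "o"])
print(ascii(u_umlaut), ascii(o_macron))
```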
Various other input methods exist for various languages, some of them (as for Japanese) quite elaborate. Unfortunately, Apple's selection of these on Mac OS X still falls short of what was available in Mac OS 9; for example, there is no Devanagari, Arabic, or Hebrew input method for Mac OS X. In some cases, the input method for a language won't appear in Mac OS X unless a specific font is also present; to get the font, you would install the corresponding Language Kit into Classic from the Mac OS 9 CD. In other cases, the material may be available through Software Update. I won't give further details, since if you need a specific input method you probably know a lot more about the language, and Unicode, than I do.
Exploring the Web -- An obvious benefit of Unicode standardization is the possibility of various languages and scripts becoming universally legible over the Web. For a taste of what this will be like, I recommend the UTF-8 Sampler page of Columbia University's Kermit project; the URL is given below. You'll need to be using OmniGroup's OmniWeb browser; this is the only browser I've found that renders Unicode fonts decently. For best results, also download James Kass's Code2000 font and drop it into one of your Fonts folders before starting up OmniWeb. (If you're too lazy to download Code2000 you'll still get pretty good results thanks to the Unicode fonts already installed in Mac OS X, but some characters will be replaced by a "filler" character designed to let you know that the real character is missing.)
When you look at the Sampler using OmniWeb, you should see Runic, Middle English, Middle High German, Modern Greek, Russian, Georgian, and many others. One or two characters are missing, but the results are still amazingly good. The only major problem is that the right-to-left scripts such as Hebrew and Arabic are backwards (that is to say, uh, forwards). Note that you're not seeing pictures! All the text is being rendered character by character from your installed fonts, just as in a word processor.
You may wonder how an HTML document can tell your browser what Unicode character to display. After all, to get an ordinary English "e" to appear in a Web page, you just type an "e" in the HTML document; but how do you specify, say, a Russian "yu" character? With Unicode, there are two main ways. One is to use the numbered entity approach; just as you're probably aware that you can get a double-quote character in HTML by saying "&quot;", so you can get a Russian "yu" by saying "&#1102;" (because 1102 is the decimal equivalent of that character's Unicode value). This works fine if a page contains just a few Unicode characters; otherwise, though, it becomes tedious for whoever must write and edit the HTML, and makes for large documents, since every such character requires seven bytes. A better solution is UTF-8.
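You can watch the numbered entity approach at work using Python's standard html module, which decodes entities much as a browser would. This is just a sketch; the helper to_numeric_entity is my own name for the reverse operation.

```python
import html

# Decode entities the way a browser would.
quote = html.unescape("&quot;")
yu = html.unescape("&#1102;")

def to_numeric_entity(ch):
    """Write a character as a decimal numbered entity."""
    return f"&#{ord(ch)};"

entity = to_numeric_entity(yu)

# Compare the cost in bytes of the entity with the cost of UTF-8.
entity_bytes = len(entity.encode("ascii"))
utf8_bytes = len(yu.encode("utf-8"))

print(f"U+{ord(yu):04X} = decimal {ord(yu)}")
print(entity, entity_bytes, utf8_bytes)
```

The entity is pure ASCII, which is its great virtue, but it costs several times as many bytes as the same character in UTF-8.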
To understand what UTF-8 is, think about how you would encode Unicode as a sequence of bytes. One obvious way would just be to have the bytes represent each character's numeric value. For example, Russian "yu" is hexadecimal 044E, so it could be represented by a byte whose value is 04 and a byte whose value is 4E. This is perfectly possible - in fact, it has an official name, UTF-16 - but it lacks backwards compatibility. A browser or text processor that doesn't do Unicode can't properly read a UTF-16 document - even if that document consists entirely of characters from the ASCII range, since each of those characters now arrives accompanied by an extra zero byte. And even worse, a UTF-16 document can't safely pass through software that expects ordinary text, because some of its bytes (such as the 04 in our example) are control codes, not printable characters. What's necessary is a Unicode encoding in which ASCII characters remain exactly as they are, and no byte belonging to any other character can be mistaken for an ASCII character or a control code.
That's exactly what UTF-8 is. It's a way of encoding Unicode character values as sequences of bytes, where members of the original ASCII character set are simply encoded as themselves and every other character becomes a sequence of two or more bytes from the high ASCII range. With this encoding, an application (such as a browser or a word processor) that doesn't understand UTF-8 will show sequences of Unicode characters as high ASCII characters - that is, as gibberish - but at least it will show any ordinary ASCII characters correctly. The HTML way to let a browser know that it's seeing a UTF-8 document is a <META> tag specifying the "charset" as "utf-8". OmniWeb sees this and interprets the Unicode sequences correctly. For example, the UTF-8 encoding of Russian "yu" is D18E. Both D1 and 8E are high ASCII bytes: in MacRoman they're an em-dash followed by an e-acute. Indeed, you can just type those two characters into an HTML document that is saved as MacRoman text but declares itself as UTF-8, and OmniWeb will show them as a Russian "yu".
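The byte values are easy to verify. This sketch encodes the Russian "yu" both ways, then reads the UTF-8 bytes as though they were MacRoman text, which is roughly what an application ignorant of UTF-8 would do on a Mac. I've used the big-endian flavor of UTF-16 so that the bytes come out in the order described above.

```python
yu = "\u044e"  # CYRILLIC SMALL LETTER YU

# UTF-16 (big-endian): the character's numeric value, written as two bytes.
utf16 = yu.encode("utf-16-be")

# UTF-8: the same character as two bytes from the upper half of the byte range.
utf8 = yu.encode("utf-8")
all_high = all(b >= 0x80 for b in utf8)

# An ordinary ASCII letter is encoded in UTF-8 as itself.
ascii_unchanged = "e".encode("utf-8") == "e".encode("ascii")

# What software that knows only MacRoman would make of the UTF-8 bytes.
seen_as_macroman = utf8.decode("mac_roman")

# Reading the bytes correctly gets the original character back.
round_trip = utf8.decode("utf-8")

print(utf16.hex(), utf8.hex(), all_high, ascii_unchanged)
print(ascii(seen_as_macroman), ascii(round_trip))
```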
If you want to learn more about the Unicode character set and test your fonts against the standard, or if you'd like to focus on a particular language, Alan Wood's Web pages are an extremely well-maintained portal and an excellent starting point. And TidBITS reader Tom Gewecke (who also provided some great help with this article) maintains a page with useful information about the state of languages on the Mac, with special attention to Mac OS X and Unicode.
Exploring Your Fonts -- Meanwhile, back on your own hard disk, you may be wondering what Unicode fonts you have and what Unicode characters they contain. Unfortunately, Apple provides no way to learn the answer. You can't find out with Key Caps, since the range of characters corresponding to keys and modifiers is minuscule in comparison with the Unicode character set. Most other font utilities are blind to everything beyond ASCII. One great exception is the $15 FontChecker, from WunderMoosen. This program lets you explore the full range of Unicode characters in any font, and is an absolute must if you're going to make any sense of Unicode fonts on your Mac. It also features drag-and-drop, which can make it helpful as an occasional input method. I couldn't have written this article without it.
Also valuable is UnicodeChecker, a free utility from Earthlingsoft that displays every Unicode character. Unlike FontChecker, it isn't organized by font, but simply shows every character in order, and can even display characters from the supplementary planes. (Download James Kass's Code2001 font if you want to see some of these.)
A Long Way To Go -- Unicode is still in its infancy; Mac OS X is too. So if this overview has given you the sense that Unicode on Mac OS X is more of a toy than a tool, you're right. There needs to be a lot of growth, on several fronts, for Mac OS X's Unicode support to become really useful.
A big problem right now is the lack of Unicode support in applications. Already we saw that not all browsers are created equal; we had to use OmniWeb to view a Unicode Web page correctly (try the UTF-8 Sampler page in another browser to see the difference). And there's good reason why I had you experiment with typing Unicode using TextEdit and not some other word processor. Also, be warned that you can't necessarily tell from its documentation what an application can do. Software companies like to use the Unicode buzzword, but there's many a slip 'twixt the buzzword and the implementation. Microsoft Word X claims you can "enter, display, and edit text in all supported languages," but it doesn't accept the Unicode Hex Input method and often you can't paste Unicode characters into it. BBEdit can open and save Unicode text files, but its display of Unicode characters is poor - it often has layout problems, and it can display only a single font at a time (whereas, as we've seen, Unicode characters are typically drawn from various fonts). BBEdit also doesn't accept the Unicode Hex Input method, so you can't really use it to work with Unicode files.
The operating system itself must evolve too. The Unicode standard has requirements about bidirectional scripts and combining multiple characters that Mac OS X doesn't yet fully handle. The installed fonts don't represent the full character set. More input methods are required, and Apple needs to provide utilities for creating keyboard mappings, and perhaps even simple input methods, so that users can start accessing their favorite characters easily. The Unicode standard, meanwhile, is itself constantly being revised and extended. At the same time, Windows users are getting built-in language and Unicode support that in some respects is light-years ahead of Mac OS X. The hope is that as things progress, Apple will catch up, and the Unicode promise of Mac OS X will start to be fulfilled. Then the Mac will be not just a digital hub, but a textual hub as well.