Matt Neuburg 1 April 2002

Two Bytes of the Cherry: Unicode and Mac OS X, Part 1

If you’re using Mac OS X, a massive revolution is proceeding unnoticed on your computer. No, I don’t mean Unix, preemptive multitasking, or any other familiar buzzwords. I’m talking about text.

How can text be revolutionary? Text is not sexy. We take text for granted, typing it, reading it, editing it, storing it. Text is one of the main reasons most people bought computers in the first place. It’s a means, a medium; it’s not an end, not something explicit. The keyboard lies under our hands; strike a key and the corresponding letter appears. What could be simpler?

But the more you know about text and how it works on a computer, the more amazing it is that you can do any typing at all. There are issues of what keyboard you’re using, how the physical keys map to virtual keycodes, how the virtual keycodes are represented as characters, how to draw the characters on the screen, and how to store information about them in files. There are problems of languages, fonts, uppercase and lowercase, diacritics, sort order, and more.

In this article I’ll focus on just one aspect of text: Unicode. Whether or not you’ve heard of Unicode, it affects you. Mac OS X is a Unicode system. Its native strings are Unicode strings. Many of the fonts that come with Mac OS X are Unicode fonts.

But there are problems. Mac OS X’s transition to Unicode is far from complete. There are places where Unicode doesn’t work, where it isn’t implemented properly, where it gets in your way. Perhaps you’ve encountered some of these, shrugged, and moved on, never suspecting the cause. Well, from now on, perhaps you’ll notice the problems a little more and shrug a little less. More important, you’ll be prepared for the future, because Unicode is coming. It’s heavily present on Mac OS X, and it’s only going to become more so. Unicode is the future – your future. And as my favorite movie says, we are all interested in the future, since that is where we shall spend the rest of our lives.

ASCII No Questions — To understand the future, we must start with the past.

In the beginning was writing, the printing press, books, the typewriter, and in particular a special kind of typewriter for sending information across electrical wires – the teletype. Perhaps you’ve seen one in an old movie, clattering out a news story or a military order. Teletype machines worked by encoding typed letters of the alphabet as electrical impulses and decoding them on the other end.

When computers started to be interactive and remotely operable, teletypes were a natural way to talk to them; and the first universal standard computer "alphabet" emerged, not without some struggle, from how teletypes worked. This was ASCII (pronounced "askey"), the American Standard Code for Information Interchange; and you can still see the teletype influence in the presence of its "control codes," so called because they helped control the teletype at the far end of the line. (For example, hitting Control-G sent a control code which made a bell ring on the remote teletype, to get the operator’s attention – the ancestor of today’s alert beep.)
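To make the Control-G example concrete, here is a minimal Python sketch. It assumes the usual ASCII convention that holding Control keeps only the low five bits of the letter's code; the function name control_code is my own, chosen for illustration.

```python
def control_code(letter: str) -> int:
    """Return the ASCII control code sent by Control-<letter>.

    Assumption: the conventional mapping, which keeps only the
    low five bits of the uppercase letter's ASCII value.
    """
    return ord(letter.upper()) & 0x1F


bell = control_code("G")
print(bell)               # 7, the ASCII BEL ("bell") code
print(chr(bell) == "\a")  # True: Python's "\a" escape is that same character
```

On many terminals, printing that character still produces a beep, though that depends on the terminal's settings.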

The United States being the major economic and technological force in computing, the ASCII characters were the capital and small letters of the Roman alphabet, along with some common typewriter punctuation and the control codes. The set originally comprised 128 characters. That number is, of course, a power of 2 – no coincidence, since binary lies at the heart of computers.

When I got an Apple IIc, I was astounded to find ASCII extended by another power of 2, to embrace 256 characters. This made sense mathematically, because 256 is the number of values that 8 binary bits can hold, and 8 bits make a byte, which was the minimum unit of memory data. This was less wasteful, but it was far from clear what to do with the extra 128 characters, which were referred to as "high ASCII" to distinguish them from the original 128 "low ASCII" characters. The problem was the computer’s monitor – its screen. In those days, screen representation of text was wired into the monitor’s hardware, and low ASCII was all it could display.

Flaunt Your Fonts, Watch Your Language — When the Macintosh came along in 1984, everything changed. The Mac’s entire screen displayed graphics, and the computer itself, not the monitor hardware, had the job of constructing the letters when text was to be displayed. At the time this was stunning and absolutely revolutionary. A character could be anything whatever, and for the first time, people saw all 256 characters really being used. To access high ASCII, you pressed the Option key. What you saw when you did so was amazing: A bullet! A paragraph symbol! A c-cedilla! Thus arrived the MacRoman character set to which we’ve all become accustomed.

Since the computer was drawing the character, you also had a choice of fonts – another revolution. After the delirium of playing with the Venice and San Francisco fonts started to wear off, users saw that this had big consequences for the representation of non-Roman languages. After all, no law tied the 256 keycodes to the 256 letters of the MacRoman character set. A different font could give you 256 more letters – as the Symbol font amply demonstrated. This, in fact, is why I switched to a Mac. In short order I was typing Greek, Devanagari (the Sanskrit syllabary), and phonetic symbols. After years of struggling with international typewriters or filling in symbols by hand, I was now my own typesetter, and in seventh heaven.

Trouble in Paradise — Heaven, however, had its limits. Suppose I wanted to print a document. Laser printers were expensive, so I had to print in a Mac lab where the computers didn’t necessarily have the same fonts I did, and thus couldn’t print my document properly. The same problem arose if I wanted to give a file to a colleague or a publisher who might not have the fonts I was using, and so couldn’t view my document properly.

Windows users posed yet another problem. The Windows character set was perversely different from the Mac’s. For example, WinLatin1 (often referred to, somewhat inaccurately, as ISO 8859-1) places the upside-down interrogative that opens a Spanish question at code 191; but that character is 192 on the Mac (where 191 is the Norwegian slashed-o).
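The mismatch is easy to reproduce today. The sketch below assumes Python's bundled codecs, where "cp1252" is the name for the Windows Latin-1 code page and "mac_roman" is the name for MacRoman; the helper name show is hypothetical.

```python
def show(code: int) -> dict:
    """Decode one byte value under both platform character sets."""
    raw = bytes([code])
    return {
        "windows": raw.decode("cp1252"),   # Windows Latin-1
        "mac": raw.decode("mac_roman"),    # MacRoman
    }


# Code 191: inverted question mark on Windows, slashed-o on the Mac.
print(191, show(191))
# Code 192: the Mac's inverted question mark; Windows shows a different letter.
print(192, show(192))
```

The byte never changes; only the table used to interpret it does, which is exactly why a file moved between platforms could silently change its text.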

And even among Mac users, "normal" fonts came in many linguistic varieties, because the 256 characters of MacRoman do not suffice for every language that uses a variation of the Roman alphabet. Consider Turkish, for instance. MacRoman includes a Turkish dotless-i, but not a Turkish s-cedilla. So on a Turkish Mac the s-cedilla replaces the American Mac’s "fl" ligature. A parallel thing happens on Windows, where (for example) Turkish s-cedilla and the Old English thorn characters occupy the same numeric spot in different language systems.
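The same kind of check works for the Turkish case. This is a sketch under stated assumptions: it relies on Python's "mac_turkish" and "cp1254" (Windows Turkish) codecs, and the byte values 0xDF and 0xFE are my own picks from those codec tables, not figures given in this article.

```python
def decode_all(code: int, encodings: list) -> dict:
    """Decode a single byte value under each named character set."""
    raw = bytes([code])
    return {name: raw.decode(name) for name in encodings}


# On the Mac, one slot holds the "fl" ligature or the Turkish s-cedilla.
mac = decode_all(0xDF, ["mac_roman", "mac_turkish"])
# On Windows, one slot holds the thorn or the Turkish s-cedilla.
win = decode_all(0xFE, ["cp1252", "cp1254"])

for name, char in {**mac, **win}.items():
    print(f"{name:12} U+{ord(char):04X} {char}")
```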

Tower of Babel — None of this would count as problematic were it not for communications. If your computing is confined to your own office and your own printer and your own documents, you can work just fine. But cross-platform considerations introduce a new twist, and of course the rise of the Internet really brought things to a head. Suddenly people whose base systems differed were sending each other email and reading each other’s Web pages. Conventions were established for coping, but these work only to the extent that people and software obey them. If you’ve ever received email from someone named "=?iso-8859-1?Q?St=E9phane?=," or if you’ve read a Web page where quotes appeared as a funny-looking capital O, you’ve experienced some form of the problem.
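Both symptoms can be reproduced in a few lines. The sketch below uses Python's standard email.header module to decode the mangled name, then shows one plausible route to the capital-O quotes: a curly quote stored as a MacRoman byte and read back as Windows Latin-1. The helper name readable is hypothetical, and the second half is an assumption about the cause, not the only possible one.

```python
from email.header import decode_header, make_header


def readable(header_value: str) -> str:
    """Decode a MIME 'encoded-word' header into ordinary text."""
    return str(make_header(decode_header(header_value)))


# The sender's name, as software that obeys the convention shows it.
print(readable("=?iso-8859-1?Q?St=E9phane?="))  # Stéphane

# A curly opening quote written on a Mac, misread as Windows Latin-1.
mac_bytes = "\u201c".encode("mac_roman")
misread = mac_bytes.decode("cp1252")
print(misread)  # an accented capital O instead of a quotation mark
```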

Also, since fonts don’t travel across the Internet, characters that depend on a particular font may not be viewable at all. HTML can ask that certain characters should appear in a certain font on your machine when you view my page, but a fat lot of good that will do if you don’t have that font.

Finally, there is a major issue I haven’t mentioned yet: for some writing systems, 256 characters is nowhere near enough. An obvious example is Chinese, which requires several thousand characters.

Enter Unicode.

The Premise and the Promise — What Unicode proposes is simple enough: increase the number of bytes used to represent each character. For example, if you use two bytes per character, you can have 65,536 characters – enough to represent the Roman alphabet plus various accents and diacritics, plus Greek, Russian, Hebrew, Arabic, Devanagari, the core symbols of various Asian languages, and many others.
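The arithmetic is simple to check. In the sketch below I assume UTF-16 (big-endian) as the two-byte form, since that is one common way of storing Unicode text; the sample characters are my own choices, and the function name two_byte_form is hypothetical.

```python
# Two bytes are 16 bits, so there are 2**16 possible values.
print(2 ** 16)  # 65536


def two_byte_form(ch: str) -> bytes:
    """Encode one character as big-endian UTF-16 (an assumed encoding)."""
    return ch.encode("utf-16-be")


# Roman, Greek, Cyrillic, Hebrew, Devanagari, and Chinese samples.
samples = ["A", "\u03a9", "\u0436", "\u05d0", "\u0928", "\u4e2d"]
for ch in samples:
    raw = two_byte_form(ch)
    print(f"U+{ord(ch):04X} {ch} -> {raw.hex()} ({len(raw)} bytes)")
```

Every sample occupies exactly two bytes, and the Roman "A" keeps its old ASCII value (65, or hex 41) in the second byte.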

What’s new here isn’t the codification of character codes to represent different languages; the various existing character sets already did that, albeit clumsily. Nor is it the use of a double-byte system; such systems were already in use to represent Asian characters. What’s new is the grand unification into a single character set embracing all characters at once. In other words, Unicode would do away with character set variations across systems and fonts. In fact, in theory a single (huge) font could contain all needed characters.

It turns out, actually, that even 65,536 symbols aren’t enough, once you start taking into account specialized scholars’ requirements for conventional markings and historical characters (about which the folks who set the Unicode standards have often proved not to be as well informed as they like to imagine). Therefore Unicode has recently been extended to a potential 16 further sets of 65,536 characters (called "supplementary planes"); the size of the potential character set thus approximates a million, with each character represented by at most 4 bytes. The first supplementary plane is already being populated with such things as Gothic; musical and mathematical symbols; Mycenaean (Linear B); and Egyptian hieroglyphs. The evolving standard is, not surprisingly, the subject of various political, cultural, technical, and scholarly struggles.

<http://www.unicode.org/>

<http://www.unicode.org/unicode/standard/principles.html>
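The numbers behind the supplementary planes can be sketched in a few lines of Python. The example character, a Gothic letter at code point U+10330, is my own choice, and the function name plane_of is hypothetical; UTF-8 and UTF-16 are assumed as the storage forms.

```python
PLANE_SIZE = 2 ** 16   # 65,536 code points per plane
PLANES = 1 + 16        # the original plane plus 16 supplementary ones

print(PLANES * PLANE_SIZE)  # 1114112, a little over a million


def plane_of(ch: str) -> int:
    """Return the plane number that a character's code point falls in."""
    return ord(ch) // PLANE_SIZE


gothic = "\U00010330"  # a Gothic letter from the first supplementary plane
print(plane_of("A"))     # 0, the original plane
print(plane_of(gothic))  # 1, the first supplementary plane

# Either storage form needs four bytes for this character.
print(len(gothic.encode("utf-8")))      # 4
print(len(gothic.encode("utf-16-be")))  # 4
```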

What has all this to do with you, you ask? It’s simple. As I said at the outset, if you’re a Mac OS X user, Unicode is on your computer, right now. But where? In the second half of this article, I’ll show you.
