Matt Neuburg 21 August 2000

Talk Is Cheap – ViaVoice Enhanced Edition

[Note: I am indebted for technical assistance to my father, Ned Neuburg, who was on the ARPA steering committee in the 1970s; and to Erik Sea, IBM’s Development Lead for ViaVoice/Mac, for answering some key queries.]

Classic science fiction, by and large, has proven both myopic and optimistic when it comes to computers. Increased brain power was an obvious prediction, but few foresaw that computers would also become small, cheap, and ubiquitous, with all the tremendous attendant sociological implications. On the other hand, by all accounts we should long ago have been talking to our computers. Where is HAL 9000? The QWERTY keyboard is a clumsy dinosaur; of course you’d eventually like your computer to read your thoughts, but in the meantime, why can’t you just tell it what to do? Well, to a large extent, you can; you wouldn’t want to hand over control of a mission-critical task to a voice-driven computer just yet, but your computer need no longer be as deaf as a post either.

Wreck a Nice Beach — You’ve probably heard of ARPA, the advanced research wing of the U.S. Department of Defense during the Cold War; you’re certainly familiar with one of its creations, the Internet. Another ARPA project was to have computers know what people were saying – called "speech recognition". (I once proposed the term "autoglossomerolysis," but somehow it didn’t catch on.) In the early 1970s, ARPA threw massive amounts of funding at the problem.

The major obstacle was the acoustic model, which may be imagined as phonemic analysis. How can the computer work out whether a vowel is "ah" or "ee", whether a consonant is "p" or "t", or even where the phoneme boundaries are? Most researchers expected that computers would find the features of speech, corresponding to how the mouth produced the sounds: "this is a voiced guttural stop, that is a rounded front vowel". What the ARPA-funded research demonstrated, though, was that you could make more significant practical progress by doing something much more crude. First, characterize the raw sound by a minimal set of numbers; then, match those numbers against a template – e.g., this sound is a "p" because numerically it looks like a prerecorded "p".
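To make the "crude" approach concrete, here is a minimal sketch in Python of template matching: each sound is reduced to a few numbers and assigned the label of the nearest stored template. The feature values, the three-number representation, and the names TEMPLATES, distance, and classify are all invented for illustration; none of them comes from the ARPA systems themselves.

```python
import math

# Hypothetical templates: one short feature vector per phoneme.
# The numbers are made up for illustration, not measured from real speech.
TEMPLATES = {
    "p": [0.9, 0.1, 0.2],
    "t": [0.7, 0.6, 0.1],
    "ah": [0.1, 0.8, 0.9],
    "ee": [0.2, 0.3, 0.95],
}


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(features, templates=TEMPLATES):
    """Return the label of the template that lies closest to the input."""
    return min(templates, key=lambda label: distance(features, templates[label]))


if __name__ == "__main__":
    # A sound whose numbers resemble the stored "p" is labeled "p".
    print(classify([0.85, 0.15, 0.25]))
```

Note that nothing here knows what a "voiced stop" or a "front vowel" is; the program only compares numbers, which is exactly the point of the approach.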

The trick here lies in the notion "looks like." James Baker, then a graduate student at Carnegie-Mellon University, applied a probabilistic mathematical device called a "hidden Markov model" (HMM) to the pattern-matching side of speech recognition. The results proved so superior in that first ARPA funding round that all modern speech recognition uses HMM – a fact which is astounding for two reasons. First, HMM is fundamentally not only crude but almost certainly wrong – however our ears and brains hear and analyze speech, HMM is surely not it. Second, it’s amazing that we’ve been doing speech recognition the same way for so long. To be sure, modern HMM is vastly more sophisticated than in those days; and one should not underestimate the importance of software optimization, a direction pioneered, again, by James Baker, who went on to found Dragon Systems. But the really important development has been in hardware. Computers are now about a thousand times faster and a thousand times larger in resources, and a thousand times smaller in size and cost, than in those early days, so they have at last begun to meet speech recognition’s mathematical demands.

<http://www.dragonsystems.com/about/>
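For readers curious what a hidden Markov model actually computes, the following Python sketch runs the standard Viterbi algorithm over a toy model: given a sequence of observed sound symbols, it finds the most probable sequence of hidden phonemes. The two phonemes, the "burst" and "hum" observation symbols, and every probability below are assumptions chosen for illustration; a real recognizer uses far larger models, and nothing here reflects Dragon's or IBM's actual implementation.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the likeliest hidden-state sequence."""
    # best[s] holds the probability and path of the best route ending in s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that maximizes the route into s.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Toy model (all values hypothetical): two phonemes, two kinds of sound.
STATES = ["p", "ah"]
START_P = {"p": 0.6, "ah": 0.4}
TRANS_P = {"p": {"p": 0.3, "ah": 0.7}, "ah": {"p": 0.2, "ah": 0.8}}
EMIT_P = {"p": {"burst": 0.8, "hum": 0.2}, "ah": {"burst": 0.1, "hum": 0.9}}

if __name__ == "__main__":
    prob, path = viterbi(["burst", "hum", "hum"], STATES, START_P, TRANS_P, EMIT_P)
    print(path, prob)
```

The model never "hears" a phoneme directly; it only weighs how likely each hidden sequence is to have produced the observed sounds, which is why the approach is both crude and computationally hungry.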

In the early 1990s, Apple created its own system-level speech recognition component, PlainTalk. But PlainTalk’s genius lies in its compromises: it doesn’t need training for a particular user, but it does only discrete speech recognition, matching a short phrase to a finite list of predefined possibilities. The holy grail is continuous speech recognition (CSR) – basically, you talk and the computer types. And CSR is definitely here, thanks to IBM’s ViaVoice Enhanced Edition.

<http://www-4.ibm.com/software/speech/mac/newmac/>

<http://www-4.ibm.com/software/speech/support/faqmacenh.html>

Hail CSR — HAL 9000 notwithstanding, the obstacles to continuous speech recognition are severe, as the history of IBM’s research illustrates. They started as early as the 1950s and were among the recipients of ARPA’s early funding; yet only within the last five years has IBM marketed consumer-level dictation software. Just consider: The acoustic model must find your phonemes despite the way sounds are disguised by word boundaries and sentence stress. Yet unlike discrete speech recognition, your "command" is never clearly over, so the acoustic model must also be extremely fast, to keep up with you. Plus, it isn’t the only model involved: there must be a linguistic model to group your phonemes into words, matched not from some tiny list but from a possible vocabulary of tens of thousands of words.

<http://www.research.ibm.com/hlt/html/history.html>

Thus, to be at all practical, present-day continuous speech recognition requires that the acoustic model be trained for the particular speaker’s voice quality and pronunciation and the characteristics of the microphone and the environment. ViaVoice handles this by having you read certain stories that it presents to you when the program first starts up. (You can repeat this procedure later to refine your model, and ViaVoice maintains multiple models so it can be used by different people, or by the same person in different surroundings.) The linguistic model, meanwhile, requires a dictionary: ViaVoice includes a default dictionary, and presumably calculates initial pronunciations based on your acoustic model; it also includes five specialty dictionaries, such as cooking or finance, of which you can turn on one at a time.

Even so, ViaVoice clearly cannot know every word you’ll say or every quirk of your pronunciation, so it provides three features for expanding and refining the models:

You can add to your vocabulary directly through a dialog where you type a word and record a pronunciation for it.
You can have ViaVoice scour a text document for unknown words; it asks you which of these you’re likely to use and prompts you to record pronunciations.
In the course of dictating, as you correct ViaVoice’s mistakes, it learns. In particular, this happens when you select a word and dictate it again, and when you use the Correction Window, which lists alternatives to the selected problem word. Also, when you save, you are again prompted for pronunciation of unknown words.

ViaVoice also extends your vocabulary through macros and commands. Macros are expressions typed differently from their pronunciation, such as punctuation ("comma" and "period") and boilerplate like "[email protected]" (whose pronounced phrase might be "my email address"). Macros can have rules for automatically interacting with their surroundings; that’s how you ensure, for example, that a period is snug against the preceding word, has a space after, and the next word is capitalized. Commands trigger actions, not typing; they are mostly built-in, and what commands are available depends upon what environment you’re in.
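The macro rules described above can be sketched in a few lines of Python. This is only my guess at the general idea, not ViaVoice's actual mechanism: the MACROS table, the attach_left and capitalize_next flags, the render function, and the placeholder address user@example.com are all hypothetical, and the input is assumed to be a list of already-recognized spoken units.

```python
# Hypothetical macro table: spoken phrase -> typed text plus spacing rules.
MACROS = {
    "comma": {"text": ",", "attach_left": True, "capitalize_next": False},
    "period": {"text": ".", "attach_left": True, "capitalize_next": True},
    # Placeholder address, invented for this example.
    "my email address": {
        "text": "user@example.com",
        "attach_left": False,
        "capitalize_next": False,
    },
}


def render(units, macros=MACROS):
    """Turn recognized spoken units into typed text, applying macro rules."""
    out = ""
    capitalize = True  # the first word of the dictation starts a sentence
    for unit in units:
        macro = macros.get(unit)
        if macro:
            text = macro["text"]
            attach_left = macro["attach_left"]
            capitalize_next = macro["capitalize_next"]
        else:
            # Ordinary word: capitalize only at the start of a sentence.
            text = unit[:1].upper() + unit[1:] if capitalize else unit
            attach_left = False
            capitalize_next = False
        if out and not attach_left:
            out += " "  # normal units are separated by one space
        out += text
        capitalize = capitalize_next
    return out


if __name__ == "__main__":
    print(render(["hello", "comma", "world", "period", "next", "sentence", "period"]))
```

In this sketch, saying "period" types a period snug against the preceding word and flags the next word for capitalization, while the space after it is supplied when the next ordinary word arrives.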

Seven, They Are Seven — ViaVoice’s functionality is divided among seven main applications (and about a dozen minor ones). This sounds confusing, but the implementation isn’t: "packages" (locked folders) conceal the various applications in the Finder, and they start up and shut down automatically as necessary. In the description that follows, I give approximate RAM footprints with virtual memory off, because ViaVoice is so much faster that way.

You initiate a session by opening SpeakPad (12 MB); this starts up Background Engine (3 MB, invisible) and VoiceCenter (3 MB).

VoiceCenter appears as a windoid floating over everything on your computer, and is the command center for ViaVoice as a whole. It contains some buttons and a pop-up menu, and is where you turn the microphone on and off, and initiate management of your macros, dictionary, and acoustic model, as well as bring up the correction window.

SpeakPad looks like a rudimentary word processor, but it accepts dictation and can obey a lot of vocal commands for cursor selection and movement, cutting and pasting, and so forth. Since you can also manage the correction window vocally, a dictation session, if you’re patient, can be virtually hands-free. Furthermore, SpeakPad is scriptable, and ViaVoice has a cool feature similar to PlainTalk: you can expand its command set through AppleScripts, where a script is triggered when you say its name. I use this to increase ViaVoice’s cohesion with other applications; for example, while writing parts of this review, I dictated into SpeakPad and then said "Transfer to Nisus" to trigger a custom script which copied the text from SpeakPad and pasted it into Nisus Writer.

Besides SpeakPad, you can dictate into Microsoft Word, Internet Explorer, Outlook Express, or AppleWorks. To invoke this feature, you start up the Direct Dictation application (1 MB, invisible), which invokes Dictation Manager (4 MB, invisible), as well as Background Engine and VoiceCenter if they aren’t up already. Once VoiceCenter is floating over (let’s say) Microsoft Word, you turn on the microphone and say "Begin direct dictation", and then you can speak to type into Word.

To set up your microphone volume level and test for background noise, you run Setup Assistant (9 MB), a single window consisting of a sequence of panels you navigate through arrow buttons. You also use Setup Assistant to analyze your documents or create your voice model, in each case with a different set of panels. User and voice model management is performed through ViaVoice Settings (6 MB), which presents a control panel-type window and lets you edit your macros or vocabulary, again through a different window in each case. Each of these programs quits automatically when you close its window.

I Come To Bury CSR… — From installation onwards, I have found ViaVoice buggy, bizarre, or downright infuriating. On one of my computers, it wouldn’t install; on the other, it would install but it crashed when I tried to create my acoustic model. So I sneakily installed it on the second computer and copied it to the first, where it runs great; there, I trained the model and copied the data back to the second. Direct Dictation also crashes on that computer (both crashes are due to the highly machine-specific way ViaVoice tries to tell your computer not to sleep during dictation); but I don’t miss it, as this feature is rather dubious anyway – it’s much slower than dictating into SpeakPad, and ViaVoice easily gets out of sync with what’s in the document.

As you read a story to create your acoustic model, ViaVoice highlights words to show where the computer thinks you are, but sometimes it highlights the wrong word and you can’t figure out what it wants from you. Preferences that you set are sometimes forgotten before you even click the OK button. Your Keyboard menu can end up set to the wrong keyboard after using Direct Dictation. Often the microphone won’t come on, or ViaVoice refuses to quit. If you dictate with lots of text selected, a dialog asks if you really want to overwrite the selection; if you say yes, your dictated words appear backwards!

In SpeakPad, ViaVoice insists on controlling capitalization and spacing, and often gets them wrong. Extra spaces or other characters sometimes mysteriously appear. Saying a punctuation mark sometimes causes the preceding several words to be omitted from the typescript. Little things like double-click-and-drag to select words don’t work quite right. You can’t examine any of the included dictionaries, so you can’t intelligently add a vocabulary item in advance: you must wait until ViaVoice errs.

ViaVoice initially involves some 80 MB of disk space, and hundreds of files whose purpose you’re not told; its Temp folder then grows and grows (I’m told it gets cleaned up when it hits 250 MB). The manual is cheesy, ugly, and uninformative; the command reference sheet is inaccurate and incomplete. In short, this is a huge, rather inflexible program that takes over your computer and exhibits a poor sense of design, little understanding of Mac interface and conventions, and not much idea of the user’s needs.

…And To Praise It — And yet, unless you are utterly naive, under 12, or raised entirely on science fiction, ViaVoice in action seems nothing short of miraculous. You speak, and by golly, words appear on the screen – for the most part, the right words! Certainly the recognition engine has its limitations, but these afflict all recognition engines to date. For instance, despite its showpiece examples of correctly detected homonyms ("Write the right letter to Mr. Wright"), ViaVoice often makes mistakes that even a modicum of grammatical or syntactic knowledge would have eliminated – because it has no such knowledge: it knows some likely contexts for some words, but it doesn’t know English. Also, as my father points out, the worst speech recognition problem is that when things go wrong the computer can’t tell you why ("speak louder / slower," or whatever), for the simple reason that it doesn’t know: the models being automatic and probabilistic, we can construct them and match against them, but cannot know how they actually work (like HAL 9000!).
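To illustrate what "knowing some likely contexts for some words" might mean, here is a deliberately simple Python sketch that chooses among homonyms by looking only at the previous word. It is an assumption about the general technique (a greedy word-pair model), not a description of ViaVoice's linguistic model; the counts, the "RITE" sound label, and the names BIGRAM_COUNTS, HOMOPHONES, and transcribe are invented.

```python
# Hypothetical counts of how often one word follows another ("<s>" = sentence start).
BIGRAM_COUNTS = {
    ("<s>", "write"): 5,
    ("the", "right"): 8,
    ("mr.", "wright"): 6,
}

# One sound, several spellings; "RITE" is an invented label for the shared sound.
HOMOPHONES = {"RITE": ["write", "right", "wright"]}


def transcribe(sounds):
    """Greedily pick, for each sound, the spelling most often seen after the previous word."""
    words = []
    previous = "<s>"
    for sound in sounds:
        # Unambiguous sounds stand for themselves.
        candidates = HOMOPHONES.get(sound, [sound])
        # Unseen word pairs count as zero; ties go to the first candidate.
        word = max(candidates, key=lambda w: BIGRAM_COUNTS.get((previous, w), 0))
        words.append(word)
        previous = word
    return words


if __name__ == "__main__":
    print(" ".join(transcribe(["RITE", "the", "RITE", "letter", "to", "mr.", "RITE"])))
```

The showpiece sentence comes out correctly only because the table happens to contain the right word pairs. Give the same code a context it has never counted and it simply falls back on the first spelling in the list, with no grammar to tell it otherwise.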

For increased accuracy, some simple precautions are helpful. When you first train your acoustic model, read sufficient material, and use the same tone of voice in which you’ll be dictating; I find a neutral monotone works best (like HAL 9000!). Each time you start up ViaVoice, do the audio setup; this takes only a minute. When ViaVoice errs, correct it, because that’s how it learns. Finally, let ViaVoice train you: you must speak continuously but not too quickly, naturally but not sloppily, carefully but not exaggeratedly – if you force your final consonants, for example, ViaVoice will hear not a clearer consonant but an extra word. Remember, it’s only a machine!

Perhaps the hardest thing for me has been learning to dictate at all. When I start talking, I usually have only the vaguest idea what I’m going to say; so I tend to choke under the pressure of improvising a constant flow of slow, clear, well-formed phrases. It’s good practice, I’ve found, to read aloud; and one of my uses for ViaVoice has been to transcribe some old hand-written letters. However, I do often use it to compose email messages, and I did use it to draft parts of this review.

The Last Word — Computer speech recognition is here, and although I wouldn’t like to predict just how, I believe it will change everything. Perhaps certain common speech recognition homonym errors will become accepted spellings. Perhaps computer input will soon be a hybrid of mouse, keyboard, and voice. In any case, we’re on the brink of a new age, and anyone who likes can step across and put a foot into it. Now – open the pod bay doors, please, HAL.

ViaVoice Enhanced requires Mac OS 9.0.4 and a Power Mac G3/300 or better; the faster the processor and the more RAM, the better – but this will improve only speed, not accuracy. It costs $130 and comes with an Andrea USB headset, but any noise-cancelling microphone will do, such as the iParrott or the Andrea PlainTalk headset that came with the previous version.

<http://www.macsense.com/Product/iParrott103_b.html>

If your computer doesn’t meet these requirements, you might like to try the previous version, ViaVoice Millennium. It isn’t quite as good, but it works decently, requires only Mac OS 8.5.1 and at least a Power Mac G3/233, and at $75, which isn’t much more than the value of the included headset, must be termed a bargain.
