For the last few years, the Macintosh community has watched helplessly as the Mac has fallen painfully behind in a field it helped pioneer on personal computers: speech recognition. Apple introduced PlainTalk in 1993 with the original AV Macs, and not long afterwards speech dictation products like Articulate Systems’ Power Secretary appeared for the Macintosh. But the situation turned bleak: Dragon Systems (the parent company of Articulate Systems) went on to develop NaturallySpeaking, a successful continuous speech recognition product for Windows, while ignoring the Mac. At Apple, PlainTalk stagnated and was updated recently only so it would continue to function with current system software. Further, Dragon Systems ceased Mac development and discontinued Power Secretary after a long period of neglect (although a company called One Stop Direct is re-packaging Power Secretary for Power Macs as VoicePower Pro, at least for the U.K. market).
There have been a few glimmers of hope. Andrew Taylor and other engineers who produced Power Secretary are working on a new speech dictation product for the Macintosh. They seem to have nailed down some funding and hope to have something to show by this July’s Macworld Expo in New York City.
More interesting have been persistent rumors claiming that Apple has been working on speech recognition software. Apple has always faced a difficult situation with regard to speech recognition (and many other fundamental technologies). If Apple develops, or publicly considers developing, its own solution, it eliminates opportunities for third-party products, just as PlainTalk essentially destroyed the market for Articulate Systems’ Voice Navigator and AppleScript delivered a serious blow to UserLand’s Frontier. If, on the other hand, Apple stays out of a market to allow room for third-party development, the Macintosh platform suffers if no third parties step into the arena.
We’ve learned that Apple is playing both sides: it is staying out of the speech recognition field to leave room for third-party development, while building its own ambitious technology that developers can integrate into future products.
The results are stunning.
Sullivan — The bad news is that Apple is not developing a continuous speech recognition technology. Although the sheer processing power of G3-based systems is more than sufficient, Apple considers development of speech software for the Mac OS beyond PlainTalk’s current capabilities to be strictly a third-party opportunity. In fact, Apple employees have privately confided hopes that the MacSpeech development effort succeeds: it would provide a viable speech solution for customers Apple can’t help directly.
The good news is that Apple has been working quietly on the Apple Media Translation Engine (AMTE), an all-new technology for Mac OS X mostly known by the codename Sullivan (after Anne Sullivan, Helen Keller’s teacher and long-time friend). Sullivan is more than a speech engine; it’s best described as a “data translation matrix,” in that it can accept input in a variety of formats (audio, video, text, MIDI, etc.), interpret the data using specially developed Media Description Templates (MDTs), and output the results to similarly compiled Media Output Streams (MOSs). Both MDTs and MOSs are extensible; support dynamic inheritance and scaling, as well as arbitrary data types and framing; and offer special copy protection, encryption, and registration schemes that let developers protect proprietary data formats while permitting interoperability with other applications and media types.
If all this sounds abstract, it is – and that’s precisely the power of Sullivan. By divorcing itself from the specifics of a particular application space – like speech recognition – Sullivan can focus on the fundamentals of a data engine: wicked fast transformation algorithms, support for multiple processors (as well as the PowerPC G4’s AltiVec vector processing), optimized memory usage, rapid data transfer, and a modular multithreaded translation engine.
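To make the "matrix" idea a little more concrete, here is a minimal Python sketch of how description templates and output streams might pair up. It is only an illustration under our own assumptions: the class name, the method names, and the toy "note-list" format are hypothetical, and nothing here reflects how Sullivan is actually built.

```python
from typing import Any, Callable, Dict


class TranslationEngine:
    """Toy engine: any installed template can feed any installed output stream."""

    def __init__(self) -> None:
        # Hypothetical stand-ins for MDTs (input side) and MOSs (output side).
        self.templates: Dict[str, Callable[[str], Any]] = {}
        self.streams: Dict[str, Callable[[Any], str]] = {}

    def install_template(self, fmt: str, interpret: Callable[[str], Any]) -> None:
        self.templates[fmt] = interpret

    def install_stream(self, fmt: str, render: Callable[[Any], str]) -> None:
        self.streams[fmt] = render

    def translate(self, source: str, target: str, data: str) -> str:
        # Translation depends entirely on which modules are installed.
        if source not in self.templates:
            raise LookupError(f"no template installed for {source!r}")
        if target not in self.streams:
            raise LookupError(f"no output stream installed for {target!r}")
        interpreted = self.templates[source](data)
        return self.streams[target](interpreted)


engine = TranslationEngine()
engine.install_template("note-list", lambda raw: raw.split(","))
engine.install_stream("text", lambda notes: " ".join(notes))
engine.install_stream("count", lambda notes: f"{len(notes)} notes")

result = engine.translate("note-list", "text", "C4,E4,G4")
```

Because the input and output sides are registered separately, adding one new template makes it usable with every output stream already installed, which is one plausible reading of the "matrix" label.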
You Can Quote Me — In short, you can feed Sullivan data and it translates the data into another format, contingent upon the translation modules you have installed. A contact at DWIS, Inc. (Do What I Say), a small San Jose-based company made up of former Apple, Radius, and Silicon Graphics employees, showed us what Sullivan can do using Court Reporter, a server-side module DWIS is working on for the Apache Web server. Court Reporter currently translates QuickTime movies into text, essentially providing real-time transcription. It’s quite accurate. Note the one error – “jed eye” – in the sample transcript from the Apple-promoted Star Wars movie trailer below.
Female voice 1: I will not condone a course of action that will lead us to war.
Male voice 1: A communications disruption can mean only one thing. Invasion.
Deep male voice 1: At last we will reveal ourselves to the jed eye. [pause] At last we will have revenge.
Deep male voice 2: Begin landing your troops.
Male voice 3: We haven’t much time.
[Explosions, loud music]
Female voice 1: The federation has gone too far!
Male voice 1 [distant]: The death toll is catastrophic!
Female voice 1: Our people are dying, senator. We must do something quickly!
Male voice 1 [distant]: You must contact me!
Male voice 4: There is something else behind all this, your highness. They will kill you if you stay.
Male voice 4: I can only protect you. [Noise] I can’t fight a war for you.
Male voice 5: I think we’re going to have to accept federation control for the time being.
Male voice 6: This is a battle I do not think that we can win.
Female voice 1: I will sign no treaty, senator.
Court Reporter’s MDTs provide for content profiles, which let an administrator assign specific descriptions to media element descriptors. For instance, “[Humming]” could be transcribed as “[Light sabers]”, and sounds falling within a user-definable range of similarity would be identically labeled. Along those lines, Court Reporter can track specific speakers within an audio stream, using matching techniques to identify them throughout. Once a speaker has been identified, Court Reporter enables users or administrators to assign names and other information to them. So, “Female voice 1” would become “Queen Amidala” and “Male voice 3” would become “Obi-Wan Kenobi.” Court Reporter is often able to distinguish music (which typically has distinctive pitch relationships and ranges) from percussive sounds and other noises (like explosions). Although transcribing a movie in this fashion would be a lot of work, imagine a live transcription of a keynote address, with only a few speakers and no scene changes. One DWIS developer noted, “Court Reporter could deliver 90 percent of a webcast keynote’s content in about one one-thousandth of the bandwidth. Plus you could copy and paste quotes into an article without retyping.”
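The relabeling step described above is easy to picture in code. The following Python sketch applies a content profile to transcript lines like the ones in the sample; the function name, the profile structure, and the handling of bracketed notes such as "[distant]" are our own assumptions for illustration, not Court Reporter's actual behavior.

```python
import re

# Hypothetical content profile: generic labels mapped to assigned names.
PROFILE = {
    "Female voice 1": "Queen Amidala",
    "Male voice 3": "Obi-Wan Kenobi",
    "[Humming]": "[Light sabers]",
}


def apply_profile(line: str, profile: dict) -> str:
    """Replace a generic speaker or sound label with its assigned name."""
    if line in profile:  # a whole-line sound descriptor such as "[Humming]"
        return profile[line]
    head, sep, rest = line.partition(":")
    if not sep:  # no speaker label on this line; leave it alone
        return line
    # Split "Male voice 1 [distant]" into the speaker and the trailing note.
    match = re.match(r"^(.*?)(\s*\[[^\]]*\])?$", head)
    speaker, note = match.group(1), match.group(2) or ""
    return f"{profile.get(speaker, speaker)}{note}:{rest}"


named = apply_profile("Female voice 1: I will sign no treaty, senator.", PROFILE)
sound = apply_profile("[Humming]", PROFILE)
unknown = apply_profile("Male voice 1 [distant]: You must contact me!", PROFILE)
```

Labels without a profile entry pass through unchanged, so a partially filled-in profile still yields a readable transcript.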
DWIS also revealed that it is developing a series of related modules for Sullivan, including ones that translate RealVideo and RealAudio, MP3 (for fun, they ran “Louie Louie” through it), and Windows Media to text.
Mum’s the Word — Neither Apple nor DWIS would comment on when the Sullivan foundation technologies might be available to consumers, but Sullivan isn’t likely to appear until a year or more after Mac OS X ships. However, derivative applications – up to and including continuous speech recognition – could be available sooner in stand-alone form. Other companies working with Apple on Sullivan are reported to be developing real-time language translation, high-end media servers, file format converters, and music education software.
The burden of Sullivan’s effectiveness in a particular application comes down to the quality and sophistication of the MDTs and MOSs. MDTs and MOSs can function like plug-ins for the core Sullivan engine (drop them in the correct folder and Sullivan immediately becomes aware of the newly added media “flavors”) or as part of a specific application designed to run on top of the Sullivan engine. So, support for translating to or from HTML or XML would be best implemented as a plug-in intended for wide use by Sullivan-savvy applications, while MDTs that handle a proprietary data format might be available only within the context of a single program.
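One way to picture the split between widely shared plug-ins and application-private templates is a layered lookup. The Python sketch below uses the standard library's ChainMap for that purpose; the format names and the "acme-doc" example are hypothetical, and this is only a guess at the scoping behavior, not a description of Sullivan's plug-in mechanism.

```python
from collections import ChainMap

# Hypothetical shared plug-ins, visible to every Sullivan-savvy application.
SHARED_PLUGINS = {"html": "HTML template", "xml": "XML template"}


def formats_for(private_templates=None):
    """Return the formats one application can see: its own first, then shared."""
    return ChainMap(private_templates or {}, SHARED_PLUGINS)


# An application that bundles a template for its own proprietary format.
acme_formats = formats_for({"acme-doc": "Acme proprietary template"})

# Some other application with no private templates.
other_formats = formats_for()

acme_sees_private = "acme-doc" in acme_formats
other_sees_private = "acme-doc" in other_formats
both_see_html = "html" in acme_formats and "html" in other_formats
```

Lookups check the private layer before the shared one, so a program could also override a shared template with its own version without affecting anyone else.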
Sullivan seems like a breakthrough technology for Apple, providing both a solid foundation for the Macintosh platform and ample opportunity for third-party development.