In the previous chapter, we looked at the core technologies that had to be developed first. As they matured, popular products built on those advances began to appear.
In this section, we’ll look at some of these products.
Apple Speech-to-Text
In 1984, Steve Jobs introduced the Macintosh by having it speak out loud and crack pre-written jokes.
Apple’s love affair with voice synthesis dates back to the Apple IIe and the Votrax Voice Synthesizer.
The voice of WOPR/Joshua computer in the 1983 movie War Games was not generated by a Votrax, but was a recording of the actor John Wood (who played Professor Falken). He recorded the words by reading them backward. The audio was then post-processed to sound more computerized.
To understand the origins of voice synthesizers, however, we must travel a little further back in time…
Press * for operator
In November 1963, the Bell System introduced dual-tone multi-frequency (DTMF) signaling under the trademark Touch-Tone. Originally, placing a call meant asking an operator for assistance; automatic switching and the rotary dial had eliminated that need, allowing users to enter a sequence of digits themselves. Touch-Tone replaced the rotary dial's pulses with pairs of audible tones that, unlike pulses, could travel all the way to the other end of a call.
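Each key on a Touch-Tone pad plays two sine waves at once: one from a low-frequency "row" group and one from a high-frequency "column" group. As a rough sketch, here is how those standard frequency pairs might be synthesized; the sample rate and duration are arbitrary choices for illustration:

```swift
import Foundation

// Standard DTMF frequency pairs (Hz): each key mixes one low-group (row)
// tone with one high-group (column) tone.
let lowGroup: [Character: Double]  = ["1": 697, "2": 697, "3": 697,
                                      "4": 770, "5": 770, "6": 770,
                                      "7": 852, "8": 852, "9": 852,
                                      "*": 941, "0": 941, "#": 941]
let highGroup: [Character: Double] = ["1": 1209, "2": 1336, "3": 1477,
                                      "4": 1209, "5": 1336, "6": 1477,
                                      "7": 1209, "8": 1336, "9": 1477,
                                      "*": 1209, "0": 1336, "#": 1477]

/// Raw PCM samples (in the range -1...1) for a single key press.
func dtmfSamples(for key: Character,
                 duration: Double = 0.2,
                 sampleRate: Double = 8_000) -> [Double] {
    guard let low = lowGroup[key], let high = highGroup[key] else { return [] }
    return (0 ..< Int(duration * sampleRate)).map { i in
        let t = Double(i) / sampleRate
        // Equal-amplitude mix of the two tones, scaled to stay within -1...1.
        return 0.5 * (sin(2 * Double.pi * low * t) + sin(2 * Double.pi * high * t))
    }
}

let samples = dtmfSamples(for: "5")   // mixes 770 Hz and 1336 Hz
```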
Touch-Tone research had been conducted by a team led by Bell Labs industrial psychologist John E. Karlin. Karlin was head of the Human Factors Engineering group – a first of its kind in an American company.
Factors like the keypad’s rectangular shape, the order of buttons, and their shape and size had all been considered through meticulous user testing before settling on the final form factor.
Bell Labs
Dr. Karlin, considered the father of human-factors engineering, had an eclectic range of interests. He had obtained a bachelor’s degree in philosophy, psychology, and music and a master’s degree in psychology, going on to earn a doctorate in mathematical psychology. In addition to training as an electrical engineer, he was a professional violinist!
John E. Karlin - Alcatel Lucent
Karlin was also fond of recounting being called "[T]he most hated man in America" for his work on Touch-Tone. DTMF went on to unleash a flood of creative uses as a ubiquitous man-machine interface accessible to the masses.
Tone signaling was also instrumental in bringing together Steve Jobs and Steve Wozniak, Apple's future co-founders, whose first joint venture was building "blue boxes" that mimicked the phone network's control tones.
IVR
In the 1990s, Visioneer and ScanSoft (a Xerox spinoff) were the largest competitors in the sheet-fed document scanning business. And they were adrift. The scanning business was decent enough, but growth had slowed, eventually pushing the two to merge.
Enterprises had an insatiable appetite for ways to convert paper records into digital data and store them in structured databases. Once the data was digital, it could be served back to customers over the phone through Interactive Voice Response (IVR): callers dialed a number, a synthesized voice read out a menu of options, and they navigated it by pressing phone buttons, which generated DTMF tones the system could decode.
The navigation workflow for IVR systems could get complex, and a large industry had sprung up around helping create and manage these interactions.
Sample IVR Flowchart
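Behind a flowchart like this sits a simple tree of prompts keyed by DTMF digits. The sketch below is a toy version; the prompts, options, and menu structure are invented for illustration:

```swift
// A toy IVR menu: each node is either another menu or a terminal action.
indirect enum IVRNode {
    case menu(prompt: String, options: [Character: IVRNode])
    case action(String)
}

let mainMenu = IVRNode.menu(
    prompt: "For account balances, press 1. For business hours, press 2. Press 0 for an operator.",
    options: [
        "1": .action("Read the caller's account balance"),
        "2": .action("Read the business hours"),
        "0": .action("Transfer to a live operator")
    ]
)

// Walk the tree using the caller's key presses; replay the prompt on bad input.
func navigate(_ node: IVRNode, keys: [Character]) -> String {
    switch node {
    case .action(let result):
        return result
    case .menu(let prompt, let options):
        guard let key = keys.first, let next = options[key] else { return prompt }
        return navigate(next, keys: Array(keys.dropFirst()))
    }
}

print(navigate(mainMenu, keys: ["1"]))   // "Read the caller's account balance"
```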
These applications were dubbed Customer Service Assistants, though there wasn't much 'assistance' involved, since they were self-serve. This was especially true for everyday tasks like getting directions, finding business hours, and checking account balances, as well as requests for dynamic data like movie times, traffic reports, and current weather.
Tasks that required multiple steps or were too complex for the system were shunted to trained operators who, as the late-night infomercials blared, were standing by.
In the U.S., calling 1-800 toll-free numbers encouraged customer engagement. But there was also a burgeoning market in 1-900 numbers that charged callers by the minute or transaction.
The popularity of 1-900 numbers proved consumers were willing to pay for services on a per-transaction basis.
The Ghost in the Machine
In 2003, the U.S. military took notice of advances in speaker-independent speech recognition.
To turn those advances into a working assistant, DARPA approached SRI International about leading a five-year, 500-person research effort. At the time, it was the largest AI project in history.
DARPA called its project CALO (short for Cognitive Assistant that Learns and Organizes). The name was inspired by the Latin word calonis, meaning “soldier’s servant.”
…
After a half-decade of research, SRI International decided to spin off a startup called “Siri” (a phonetic version of the company’s name).
Project CALO was funded under DARPA's PAL (Personalized Assistant that Learns) program.
One of the offshoots of the CALO project was the CALO Meeting Assistant (CALO-MA). CALO-MA was used to digitally assist with business meetings and to test natural language and speech processing technologies. Meetings are multiparty and full of domain-specific language, which made the system challenging to build. Its components included speech recognition based on hidden Markov models (HMMs), the same approach behind another SRI creation, the DECIPHER recognizer.
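To give a flavor of what "HMM-based" means here: recognizers of the DECIPHER era modeled speech as a sequence of hidden states (roughly, sound units) and searched for the state sequence that best explained the observed audio. The toy Viterbi decoder below runs over a small discrete HMM; the states, observations, and probabilities are invented, and real recognizers use far richer acoustic models:

```swift
import Foundation

// Minimal discrete HMM: transition[i] is P(next state | state i),
// emission[i] is P(observation | state i).
struct HMM {
    let initial: [Double]
    let transition: [[Double]]
    let emission: [[Double]]
}

// Viterbi search: the most likely hidden-state sequence for the observations.
func viterbi(_ hmm: HMM, observations: [Int]) -> [Int] {
    let n = hmm.initial.count
    var score = (0..<n).map { log(hmm.initial[$0]) + log(hmm.emission[$0][observations[0]]) }
    var backpointers: [[Int]] = []

    for obs in observations.dropFirst() {
        var next = [Double](repeating: -Double.infinity, count: n)
        var back = [Int](repeating: 0, count: n)
        for j in 0..<n {
            for i in 0..<n {
                let candidate = score[i] + log(hmm.transition[i][j]) + log(hmm.emission[j][obs])
                if candidate > next[j] { next[j] = candidate; back[j] = i }
            }
        }
        score = next
        backpointers.append(back)
    }

    // Trace the best final state back to the start.
    var state = score.firstIndex(of: score.max()!)!
    var path = [state]
    for back in backpointers.reversed() {
        state = back[state]
        path.insert(state, at: 0)
    }
    return path
}

// Two-state toy model over three acoustic symbols.
let toy = HMM(initial: [0.6, 0.4],
              transition: [[0.7, 0.3], [0.4, 0.6]],
              emission: [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
viterbi(toy, observations: [0, 1, 2])   // [0, 0, 1]
```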
Perhaps the most famous CALO descendant is the phone-based digital assistant Siri, which is now part of Apple iOS but was originally another SRI International development.
The CALO Meeting Assistant (MA) provides for distributed meeting capture, annotation, automatic transcription, and semantic analysis of multiparty meetings, and is part of the larger CALO architecture. Its speech recognition and understanding components include real-time and offline speech transcription, dialog act segmentation and tagging, topic identification and segmentation, question-answer pair identification, action item recognition, decision extraction, and summarization.
This was far beyond simple IVR applications, venturing into free-form speech, multi-step interactions, and access to private information. Integrating voice control with personal calendars and email went on to become one of Siri's key selling points.
Enter Vlingo
In 2006, Vlingo of Cambridge, Massachusetts, began offering a voice-recognition system that could be embedded into other products. Vlingo had great success, gaining clients such as BlackBerry, Nokia, and Samsung, whose phones embedded its technology, along with a number of TV platforms.
The combined Visioneer/ScanSoft company went on to acquire another SRI voice spinoff and adopted its name: the merged entity became known as Nuance.
SRI had also invented the acoustically coupled modem, an early way to connect computers and terminals over ordinary telephone lines. Many of us old-timers remember the screeching handshake as a dial-up modem connected the home computer to the internet.
Dialup Modem
Voice recognition was a logical extension of IVR/DTMF Touch-Tone input. Why limit yourself to a dial pad if you could convert spoken user requests into text and then into structured commands (a.k.a. intents)?
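The "intent" framing simply means that once speech has become text, the text is reduced to a structured command, much as a menu digit once was. The keyword-matching sketch below makes the idea concrete; the intent names and rules are invented, and real assistants use trained language-understanding models instead:

```swift
// A spoken request, once transcribed, gets mapped to a structured command.
enum Intent {
    case setReminder, sendMessage, getWeather, unknown
}

// Toy rule-based classifier; production systems use statistical NLU models.
func intent(for utterance: String) -> Intent {
    let text = utterance.lowercased()
    if text.contains("remind")                            { return .setReminder }
    if text.contains("text ") || text.contains("message") { return .sendMessage }
    if text.contains("weather")                           { return .getWeather }
    return .unknown
}

intent(for: "Remind me to call Mom at 5")     // .setReminder
intent(for: "What's the weather in Boston?")  // .getWeather
```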
Nuance itself went on to be acquired by Microsoft in 2022 for $19.7B, driven largely by its dominant medical voice-transcription business.
Siri App
Before it acquired Vlingo, Nuance already had its own Automatic Speech Recognition (ASR) speech-to-text engine. That engine handled the speech recognition in early versions of Siri, which the SRI spinoff released as a standalone iOS app in 2010. Acquiring Vlingo later left Nuance with two overlapping voice-recognition stacks of its own, but by then Siri had already found a new home.
A mere two months after the Siri app was released on the App Store, Apple announced that it had purchased the startup and its underlying technology, after Steve Jobs reportedly called Siri's CEO for 37 days in a row.
The positive Siri reviews must have been gratifying, given the high bar set by Apple two decades earlier.
The vision had been laid out in then-CEO John Sculley's keynote address at the 1987 Educom conference. That is where he introduced the concept of Intelligent Agents, followed by the concept video for The Knowledge Navigator.
A lesser-known follow-up video was designed by the same team.
Sculley also presided over the introduction of the Apple Newton MessagePad, the first Personal Digital Assistant (PDA) with handwriting recognition, yet another form of human-machine interaction.
The blast radius of the Newton failure had a long-lasting effect. Apple chose not to make a single mention of the Knowledge Navigator during the Siri announcement in 2011. Another factor may have been Jobs’ lingering animosity towards Sculley, who had been instrumental in ousting Jobs from Apple.
Years later, Jobs offered this explanation to his biographer, Walter Isaacson, on his decision to kill the Newton project:
If Apple had been in a less precarious situation, I would have drilled down myself to figure out how to make it [Newton] work. I didn’t trust the people running it. My gut was that there was some really good technology, but it was fucked up by mismanagement. By shutting it [Newton] down, I freed up some good engineers who could work on new mobile devices. And eventually, we got it right when we moved on to iPhones and the iPad.
I left my first job in Palo Alto to move to San Francisco, then began commuting to Cupertino as a consultant for Apple, working on the MPW C++ compiler.
That opened the door to joining Taligent (a joint Apple, IBM, HP venture) in the mid-90s, building a common operating system that could run on any number of hardware platforms.
I left to start a startup above a swimsuit shop in a Los Altos strip mall, alongside three other former Apple employees. We built a web browser with a custom Animation Markup Language that went beyond what HTML could do at the time. This was before Flash took over web multimedia. The company was later sold to Microsoft.
Along the way, I ended up meeting Steve Jobs at NeXT HQ in Redwood City and getting yelled at, mostly about how awful Apple had been to him.
A decade ago, on October 4, 2011, a remarkable thing happened: Apple launched Siri.
It started off a bit shaky, but with 10 years of technological advancement, it defied all odds. Instead of fixing any of its problems, creating anything new, or actually answering any of our questions with helpful answers, Siri simply maintained. For a decade, it’s continued to suck.
Extending Siri
Siri's abilities were designed to integrate with the iPhone operating system. During a conversation, you could ask it to set reminders or send text messages, and it offered deep integration with Apple's bundled applications. For anything beyond that, Siri would get hopelessly lost and simply recite what it had found on the web.
For the first few years, Siri was a closed system. It wasn't until iOS 10 in 2016 and the introduction of SiriKit that Siri was opened up to third parties, and even then the first version of SiriKit supported only a fixed list of categories (domains such as messaging, VoIP calling, payments, ride booking, and workouts).
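Concretely, a SiriKit app extension declares which of those fixed domains it handles and implements the matching intent-handling protocol. Below is a bare-bones sketch for the messaging domain; a real extension would also implement the resolve and confirm steps, which are omitted here:

```swift
import Intents

// Minimal handler for the SiriKit messaging domain.
// Siri performs the speech recognition and parsing; the extension only
// receives a structured INSendMessageIntent to act on.
class SendMessageHandler: NSObject, INSendMessageIntentHandling {
    func handle(intent: INSendMessageIntent,
                completion: @escaping (INSendMessageIntentResponse) -> Void) {
        // Hand the parsed recipients and content to the app's own messaging
        // code here, then report success back to Siri.
        completion(INSendMessageIntentResponse(code: .success, userActivity: nil))
    }
}
```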
In this section, we looked at how Intelligent Assistants started to appear, built on top of core speech-to-text, knowledge representation, and text-to-speech advances.
Next, we’ll look at hardware devices that began implementing these services. This was the beginning of Connected Device technologies, including Internet of Things (IoT) and SmartHome devices.