In the previous chapter, we looked at the core technologies that had to be developed first. As they matured, popular products built on those advances began to appear.
In this section, we’ll look at some of these products.
Apple Speech-to-Text
In 1984, Steve Jobs introduced the Macintosh by having it speak out loud and crack pre-written jokes.
Apple’s love affair with voice synthesis dates back to the Apple IIe and the Votrax Voice Synthesizer.
The voice of WOPR/Joshua computer in the 1983 movie War Games was not generated by a Votrax, but was a recording of the actor John Wood (who played Professor Falken). He recorded the words by reading them backward. The audio was then post-processed to sound more computerized.
To understand the origins of voice synthesizers, however, we must travel a little further back in time…
Press * for operator
In November 1963, the Bell System introduced dual-tone multi-frequency (DTMF) signaling under the trademark Touch-Tone. Originally, placing a call meant asking an operator for assistance; automatic switching and the rotary dial had eliminated that need, allowing users to enter a sequence of digits themselves. Touch-Tone replaced the rotary dial's pulses with pairs of audible tones that, unlike pulses, could travel all the way to the other end of a call.
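Each key on a Touch-Tone pad plays two sine waves at once: one from a low-frequency "row" group and one from a high-frequency "column" group. As a rough sketch, here is how those standard frequency pairs might be synthesized; the sample rate and duration are arbitrary choices for illustration:

```swift
import Foundation

// Standard DTMF frequency pairs (Hz): each key mixes one low-group (row)
// tone with one high-group (column) tone.
let lowGroup: [Character: Double]  = ["1": 697, "2": 697, "3": 697,
                                      "4": 770, "5": 770, "6": 770,
                                      "7": 852, "8": 852, "9": 852,
                                      "*": 941, "0": 941, "#": 941]
let highGroup: [Character: Double] = ["1": 1209, "2": 1336, "3": 1477,
                                      "4": 1209, "5": 1336, "6": 1477,
                                      "7": 1209, "8": 1336, "9": 1477,
                                      "*": 1209, "0": 1336, "#": 1477]

/// Raw PCM samples (in the range -1...1) for a single key press.
func dtmfSamples(for key: Character,
                 duration: Double = 0.2,
                 sampleRate: Double = 8_000) -> [Double] {
    guard let low = lowGroup[key], let high = highGroup[key] else { return [] }
    return (0 ..< Int(duration * sampleRate)).map { i in
        let t = Double(i) / sampleRate
        // Equal-amplitude mix of the two tones, scaled to stay within -1...1.
        return 0.5 * (sin(2 * Double.pi * low * t) + sin(2 * Double.pi * high * t))
    }
}

let samples = dtmfSamples(for: "5")   // mixes 770 Hz and 1336 Hz
```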
Touch-Tone research had been conducted by a team led by Bell Labs industrial psychologist John E. Karlin. Karlin was head of the Human Factors Engineering group – a first of its kind in an American company.
Factors like the keypad’s rectangular shape, the order of buttons, and their shape and size had all been considered through meticulous user testing before settling on the final form factor.
Bell Labs
Dr. Karlin, considered the father of human-factors engineering, had an eclectic range of interests. He had obtained a bachelor’s degree in philosophy, psychology, and music and a master’s degree in psychology, going on to earn a doctorate in mathematical psychology. In addition to training as an electrical engineer, he was a professional violinist!
John E. Karlin - Alcatel Lucent
Karlin was also fond of recounting being called "[T]he most hated man in America" for his work on Touch-Tone. DTMF went on to unleash a flood of creative uses as a ubiquitous man-machine interface accessible to the masses.
Tone signaling was also instrumental in bringing together Steve Jobs and Steve Wozniak, Apple's future co-founders, whose first joint venture was building "blue boxes" that mimicked the phone network's control tones.
IVR
In the 1990s, Visioneer and ScanSoft (a Xerox spinoff) were the largest competitors in the sheet-fed document scanning business. And they were adrift. The scanning business was decent enough, but growth had slowed, eventually pushing the two to merge.
Enterprises had an insatiable appetite for ways to convert paper records into digital data and store them in structured databases. Once the data was digital, it could be served back to customers over the phone through Interactive Voice Response (IVR): callers dialed a number, a synthesized voice read out a menu of options, and they navigated it by pressing phone buttons, which generated DTMF tones the system could decode.
The navigation workflow for IVR systems could get complex, and a large industry had sprung up around helping create and manage these interactions.
Sample IVR Flowchart
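Behind a flowchart like this sits a simple tree of prompts keyed by DTMF digits. The sketch below is a toy version; the prompts, options, and menu structure are invented for illustration:

```swift
// A toy IVR menu: each node is either another menu or a terminal action.
indirect enum IVRNode {
    case menu(prompt: String, options: [Character: IVRNode])
    case action(String)
}

let mainMenu = IVRNode.menu(
    prompt: "For account balances, press 1. For business hours, press 2. Press 0 for an operator.",
    options: [
        "1": .action("Read the caller's account balance"),
        "2": .action("Read the business hours"),
        "0": .action("Transfer to a live operator")
    ]
)

// Walk the tree using the caller's key presses; replay the prompt on bad input.
func navigate(_ node: IVRNode, keys: [Character]) -> String {
    switch node {
    case .action(let result):
        return result
    case .menu(let prompt, let options):
        guard let key = keys.first, let next = options[key] else { return prompt }
        return navigate(next, keys: Array(keys.dropFirst()))
    }
}

print(navigate(mainMenu, keys: ["1"]))   // "Read the caller's account balance"
```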
These applications were dubbed Customer Service Assistants, though there wasn't much 'assistance' involved, since they were self-serve. This was especially true for everyday tasks like getting directions, finding business hours, and checking account balances, as well as requests for dynamic data like movie times, traffic reports, and current weather.
Tasks that required multiple steps or were too complex for the system were shunted to trained operators who, as the late-night infomercials blared, were standing by.
In the U.S., calling 1-800 toll-free numbers encouraged customer engagement. But there was also a burgeoning market in 1-900 numbers that charged callers by the minute or transaction.
The popularity of 1-900 numbers proved consumers were willing to pay for services on a per-transaction basis.
The Ghost in the Machine
In 2003, the U.S. military took notice of advances in speaker-independent speech recognition.
To turn those advances into a working assistant, DARPA approached SRI International about leading a five-year, 500-person research effort. At the time, it was the largest AI project in history.
DARPA called its project CALO (short for Cognitive Assistant that Learns and Organizes). The name was inspired by the Latin word calonis, meaning “soldier’s servant.”
…
After a half-decade of research, SRI International decided to spin off a startup called “Siri” (a phonetic version of the company’s name).
Project CALO was funded under DARPA's PAL (Personalized Assistant that Learns) program.
One of the offshoots of the CALO project was the CALO Meeting Assistant (CALO-MA). CALO-MA was used to digitally assist with business meetings and to test natural language and speech processing technologies. Meetings are multiparty and full of domain-specific language, which made the system challenging to build. Its components included speech recognition based on hidden Markov models (HMMs), the same approach behind another SRI creation, the DECIPHER recognizer.
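To give a flavor of what "HMM-based" means here: recognizers of the DECIPHER era modeled speech as a sequence of hidden states (roughly, sound units) and searched for the state sequence that best explained the observed audio. The toy Viterbi decoder below runs over a small discrete HMM; the states, observations, and probabilities are invented, and real recognizers use far richer acoustic models:

```swift
import Foundation

// Minimal discrete HMM: transition[i] is P(next state | state i),
// emission[i] is P(observation | state i).
struct HMM {
    let initial: [Double]
    let transition: [[Double]]
    let emission: [[Double]]
}

// Viterbi search: the most likely hidden-state sequence for the observations.
func viterbi(_ hmm: HMM, observations: [Int]) -> [Int] {
    let n = hmm.initial.count
    var score = (0..<n).map { log(hmm.initial[$0]) + log(hmm.emission[$0][observations[0]]) }
    var backpointers: [[Int]] = []

    for obs in observations.dropFirst() {
        var next = [Double](repeating: -Double.infinity, count: n)
        var back = [Int](repeating: 0, count: n)
        for j in 0..<n {
            for i in 0..<n {
                let candidate = score[i] + log(hmm.transition[i][j]) + log(hmm.emission[j][obs])
                if candidate > next[j] { next[j] = candidate; back[j] = i }
            }
        }
        score = next
        backpointers.append(back)
    }

    // Trace the best final state back to the start.
    var state = score.firstIndex(of: score.max()!)!
    var path = [state]
    for back in backpointers.reversed() {
        state = back[state]
        path.insert(state, at: 0)
    }
    return path
}

// Two-state toy model over three acoustic symbols.
let toy = HMM(initial: [0.6, 0.4],
              transition: [[0.7, 0.3], [0.4, 0.6]],
              emission: [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
viterbi(toy, observations: [0, 1, 2])   // [0, 0, 1]
```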
Perhaps the most famous CALO descendant is the phone-based digital assistant Siri, which is now part of Apple iOS but was originally another SRI International development.
The CALO Meeting Assistant (MA) provides for distributed meeting capture, annotation, automatic transcription, and semantic analysis of multiparty meetings, and is part of the larger CALO architecture. Its speech recognition and understanding components include real-time and offline speech transcription, dialog act segmentation and tagging, topic identification and segmentation, question-answer pair identification, action item recognition, decision extraction, and summarization.
This was far beyond simple IVR applications, venturing into free-form speech, multi-step interactions, and access to private information. Integrating voice control with personal calendars and email went on to become one of Siri's key selling points.
Enter Vlingo
In 2006, Vlingo of Cambridge, Massachusetts, began offering a voice-recognition system that could be embedded into other products. Vlingo had great success, gaining clients such as BlackBerry, Nokia, and Samsung, whose phones embedded its technology, along with a number of TV platforms.
The combined Visioneer/ScanSoft company went on to acquire another SRI voice spinoff and adopted its name: the merged entity became known as Nuance.
SRI had also invented the acoustically coupled modem, an early way to connect computers and terminals over ordinary telephone lines. Many of us old-timers remember the screeching handshake as a dial-up modem connected the home computer to the internet.
Dialup Modem
Voice recognition was a logical extension of IVR/DTMF Touch-Tone input. Why limit yourself to a dial pad if you could convert spoken user requests into text and then into structured commands (a.k.a. intents)?
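The "intent" framing simply means that once speech has become text, the text is reduced to a structured command, much as a menu digit once was. The keyword-matching sketch below makes the idea concrete; the intent names and rules are invented, and real assistants use trained language-understanding models instead:

```swift
// A spoken request, once transcribed, gets mapped to a structured command.
enum Intent {
    case setReminder, sendMessage, getWeather, unknown
}

// Toy rule-based classifier; production systems use statistical NLU models.
func intent(for utterance: String) -> Intent {
    let text = utterance.lowercased()
    if text.contains("remind")                            { return .setReminder }
    if text.contains("text ") || text.contains("message") { return .sendMessage }
    if text.contains("weather")                           { return .getWeather }
    return .unknown
}

intent(for: "Remind me to call Mom at 5")     // .setReminder
intent(for: "What's the weather in Boston?")  // .getWeather
```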
Nuance itself went on to be acquired by Microsoft in 2022 for $19.7B, driven largely by its dominant medical voice-transcription business.
Siri App
Before it acquired Vlingo, Nuance already had its own Automatic Speech Recognition (ASR) speech-to-text engine. That engine handled the speech recognition in early versions of Siri, which the SRI spinoff released as a standalone iOS app in 2010. Acquiring Vlingo later left Nuance with two overlapping voice-recognition stacks of its own, but by then Siri had already found a new home.
A mere two months after the Siri app was released on the App Store, Apple announced that it had purchased the startup and its underlying technology, after Steve Jobs reportedly called Siri's CEO for 37 days in a row.
The positive Siri reviews must have been gratifying, given the high bar set by Apple two decades earlier.
The vision had been laid out in then-CEO John Sculley's keynote address at the 1987 Educom conference. That is where he introduced the concept of Intelligent Agents, followed by the concept video for The Knowledge Navigator.
A lesser-known follow-up video was designed by the same team.
Sculley also presided over the introduction of the Apple Newton MessagePad, the first Personal Digital Assistant (PDA) with handwriting recognition, yet another form of human-machine interaction.
The blast radius of the Newton failure had a long-lasting effect. Apple chose not to make a single mention of the Knowledge Navigator during the Siri announcement in 2011. Another factor may have been Jobs’ lingering animosity towards Sculley, who had been instrumental in ousting Jobs from Apple.
Years later, Jobs offered this explanation to his biographer, Walter Isaacson, on his decision to kill the Newton project:
If Apple had been in a less precarious situation, I would have drilled down myself to figure out how to make it [Newton] work. I didn’t trust the people running it. My gut was that there was some really good technology, but it was fucked up by mismanagement. By shutting it [Newton] down, I freed up some good engineers who could work on new mobile devices. And eventually, we got it right when we moved on to iPhones and the iPad.
I left my first job in Palo Alto to move to San Francisco, then began commuting to Cupertino as a consultant for Apple, working on the MPW C++ compiler.
That opened the door to joining Taligent (a joint Apple, IBM, HP venture) in the mid-90s, building a common operating system that could run on any number of hardware platforms.
I left to start a startup above a swimsuit shop in a Los Altos strip mall, alongside three other former Apple employees. We built a web browser with a custom Animation Markup Language that went beyond what HTML could do at the time. This was before Flash took over web multimedia. The company was later sold to Microsoft.
Along the way, I ended up meeting Steve Jobs at NeXT HQ in Redwood City and getting yelled at, mostly about how awful Apple had been to him.
A decade ago, on October 4, 2011, a remarkable thing happened: Apple launched Siri.
It started off a bit shaky, but with 10 years of technological advancement, it defied all odds. Instead of fixing any of its problems, creating anything new, or actually answering any of our questions with helpful answers, Siri simply maintained. For a decade, it’s continued to suck.
Extending Siri
Siri's abilities were designed to integrate with the iPhone operating system. During a conversation, you could ask it to set reminders or send text messages, and it offered deep integration with Apple's bundled applications. For anything beyond that, Siri would get hopelessly lost and simply recite what it had found on the web.
For the first few years, Siri was a closed system. It wasn't until iOS 10 in 2016 and the introduction of SiriKit that Siri was opened up to third parties, and even then the first version of SiriKit supported only a fixed list of categories (domains such as messaging, VoIP calling, payments, ride booking, and workouts).
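Concretely, a SiriKit app extension declares which of those fixed domains it handles and implements the matching intent-handling protocol. Below is a bare-bones sketch for the messaging domain; a real extension would also implement the resolve and confirm steps, which are omitted here:

```swift
import Intents

// Minimal handler for the SiriKit messaging domain.
// Siri performs the speech recognition and parsing; the extension only
// receives a structured INSendMessageIntent to act on.
class SendMessageHandler: NSObject, INSendMessageIntentHandling {
    func handle(intent: INSendMessageIntent,
                completion: @escaping (INSendMessageIntentResponse) -> Void) {
        // Hand the parsed recipients and content to the app's own messaging
        // code here, then report success back to Siri.
        completion(INSendMessageIntentResponse(code: .success, userActivity: nil))
    }
}
```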
In this section, we looked at how Intelligent Assistants started to appear, built on top of core speech-to-text, knowledge representation, and text-to-speech advances.
Next, we’ll look at hardware devices that began implementing these services. This was the beginning of Connected Device technologies, including Internet of Things (IoT) and SmartHome devices.