How AI companions understand and categorize human requests
In June 2025, Apple announced their upcoming Apple Intelligence features at their Worldwide Developers Conference, WWDC25. The year before, at WWDC24, there had been much talk of integrating apps with Siri.
Not so much this year. And the mood wasn't gentle.
In a post-keynote interview, Apple’s Senior Vice President of Software Engineering, Craig Federighi, admitted:
“We also talked about […] things like being able to invoke a broader range of actions across your device by app intents being orchestrated by Siri to let it do more things,” added Federighi. “We also talked about the ability to use personal knowledge from that semantic index so if you ask for things like, ‘What’s that podcast that Joz sent me?’ that we could find it, whether it was in your messages or in your email, and call it out, and then maybe even act on it using those app intents. That piece is the piece that we have not delivered, yet.”
Remember Intents? They’re an encapsulation of a user’s request so that software can invoke it on behalf of the user.
Apple’s version, but for applications, is called App Intents. It defines the interaction between an app and the operating system, allowing the system to do app discovery and deep interaction. Apple has been rolling this out since iOS 16 (2022), but its roots go further back to SiriKit and iOS 10 (2016).
The key feature of intent systems like these has been support for:
Implicit Intents: these describe high-level actions to be performed (e.g., capture a photo or show a map). The system locates a component that can best handle the task.
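To make Apple's flavor concrete, here is a minimal sketch of an App Intent in Swift. The intent and its names (ShowLatestPodcastIntent, the sender parameter) are hypothetical, not from any shipping app, but the shape is what the framework asks for: a title, optional parameters, and a perform() method the system can invoke on the user's behalf.

```swift
import AppIntents

// A minimal, hypothetical App Intent. Once compiled into the app, the system
// (and Siri) can discover this action and invoke it for the user.
struct ShowLatestPodcastIntent: AppIntent {
    static var title: LocalizedStringResource = "Show Latest Podcast"
    static var description = IntentDescription("Opens the most recently received podcast link.")

    // Parameters let the system fill in details from the user's request.
    @Parameter(title: "Sender")
    var sender: String?

    func perform() async throws -> some IntentResult & ProvidesDialog {
        // The app-specific lookup would go here.
        return .result(dialog: "Here's the most recent podcast link I found.")
    }
}
```

Note that the capability is declared in code and shipped inside the app, a detail that will matter later when we get to discovery.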
Whether it’s App Intents, App Actions, Alexa AI Actions, or Model Context Protocol, the problems are the same. You want an AI Assistant to smash out of its closed world and peek at what’s outside.
This might seem simple, but in fact, it opens a large can of worms around privacy, discovery, malware, security, and so much more. Incidentally, these are all issues that Microsoft discovered painfully back in 1996 when it first created ActiveX Controls.
What happens if you end up with A LOT of them? You’ll need a way to organize them.
Let’s step back, say, 1800 years. There is an innate human desire to place messy knowledge into well-ordered, defined boxes. We owe this to a long line of philosophers, naturalists, and taxonomists.
In tech, a classification scheme is often used to create a central choke point where everything can be validated, vetted,
and verified. Anything that does not fall into place may be discarded.
Unfortunately, reality leaves a lot of room for interpretation.
A secondary tool of organization is the use of hierarchy: boxes inside boxes inside other boxes. Technologies like object-oriented programming, DNS, and SSL certificates all build on root objects from which everything else in the hierarchy derives.
Going back to App Intents (also App Actions, Alexa AI Actions, or MCP), the problem is how to harness and categorize all these capabilities while allowing AI Assistants to make a decision on a user's behalf.
Remember Implicit Intents? The system is responsible for locating a component that best handles the task. What if there are multiple candidates? Or none installed, but a dozen available on the Play Store?
Turns out matching a user request to an Intent is a Really Hard Problem if you don’t want the assistant to keep coming back to the user empty-handed.
Rigid taxonomies (and their siblings, hierarchies) actually reduce the benefits of free-form user interactions, which these LLM-based AI Assistants offer.
Alexa researchers covered this problem in their paper on dynamic arbitration:
Alexa now has more than 100,000 skills, and to make them easier to navigate, we’ve begun using a technique called dynamic arbitration.
For thousands of those skills, it's no longer necessary to remember specific skill names
or invocation patterns (“Alexa, tell X to do Y”). Instead, the customer just makes a request, and the dynamic arbitration system finds the best skill to handle it.
Naturally, with that many skills, there may be several that could handle a given customer utterance. The request “make an elephant sound,” for instance, could be processed by the skills AnimalSounds, AnimalNoises, ZooKeeper, or others.
…
Accurate multilabel annotation is difficult to achieve, however, because it would require annotators familiar with the functionality of all 100,000-plus Alexa skills. Moreover, Alexa’s repertory of skills changes over time, as do individual skills’ functionality, so labeled data can quickly become out of date.
One option is to put the burden on the user. Have them explicitly ask that a particular action be performed by a specific skill.
However, Amazon pointed out in 2018
that asking users to name the best skill to match their request wasn’t user-friendly (and knowing Amazon, they likely had a ton of data to back that assertion). A better way would be to adopt what they call Name-free Interactions:
When Alexa receives a request from a customer without a skill name, such as “Alexa, play relaxing sounds with crickets,” Alexa looks for skills that might fulfill the request. Alexa determines the best choice among eligible skills and hands the request to the skill. To provide a signal for Alexa to consider when routing name-free requests and enable customers to launch your skill and intents without knowing or having to remember the skill’s name, consider adding Name Free Interactions (NFI) container to the skill.
…
Skill Launch Phrases are an optional new way to teach Alexa how an end-customer might invoke your skill as a modal launch, along with the standard invocation pattern of “Alexa, open <skill name>”. For example, along with “Alexa, open Tasty Recipes Skill” you might have something like “can you give me a tasty recipe.”
Finding the most relevant skill to handle a natural utterance is an open scientific and engineering challenge
for two reasons:
The sheer number of potential skills makes the task difficult. Unlike traditional digital assistants that have on the order
of 10 to 20 built-in domains, Alexa must navigate more than 40,000. And that number increases each week.
Unlike traditional built-in domains that are carefully designed to stay in their swim lanes, Alexa skills can cover
overlapping functionalities. For instance, there are dozens of skills that can respond to recipe-related utterances.
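To see the arbitration problem in miniature, here is a toy sketch in Swift. It is entirely hypothetical (real arbitration systems use learned models over much richer signals), but it shows why overlap is the enemy: several skills advertise similar capabilities, and simple keyword matching cannot break the tie for “make an elephant sound.”

```swift
// Hypothetical, over-simplified arbitration. The skill names come from the
// Alexa example above; the capability keywords are made up.
struct Skill {
    let name: String
    let capabilities: Set<String>
}

let skills = [
    Skill(name: "AnimalSounds", capabilities: ["animal", "sound", "elephant", "lion"]),
    Skill(name: "AnimalNoises", capabilities: ["animal", "noise", "sound", "elephant"]),
    Skill(name: "ZooKeeper",    capabilities: ["zoo", "animal", "elephant", "feed"]),
]

// Rank skills by how many words of the utterance they claim to cover.
func rank(utterance: String) -> [(skill: String, score: Int)] {
    let words = Set(utterance.lowercased().split(separator: " ").map(String.init))
    return skills
        .map { (skill: $0.name, score: $0.capabilities.intersection(words).count) }
        .sorted { $0.score > $1.score }
}

// "make an elephant sound" scores AnimalSounds and AnimalNoises identically,
// so this arbiter has no principled way to choose between them.
print(rank(utterance: "make an elephant sound"))
```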
According to 42 Matters, there are over 1.95 million apps on the iOS App Store and over 2 million on Google Play.
Now consider that not all these apps will be candidates for App Intents and AI Assistant invocation. But still, the number is non-trivially large. Apple and Google can restrict your use of Siri and Google Assistant to the list of compatible apps you have installed on your phone or gateway, but that means the user is not free to just throw out a question and have it answered.
Intent Discovery turns out to be a user burden. And as Alexa showed us, that creates friction.
Google’s smart home platform runs into a similar problem with its device types and traits:
Device types harness the power of the Google Assistant’s natural language processing. For example, a device with a type light can be turned on in different ways:
Turn on the light.
Turn my light on.
Turn on my living room light.
But what if your light could take a dimming range? Or had a name (for example, Cancun Sun or Harvest Moon)? What if those are not part of the standard spec?
The developer can add their own custom categories and traits. But then their device will no longer be easily accessible via The One Right Way. They must either drop the feature or provide a custom app and skill, which defeats the purpose of a single, built-in experience.
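Here is a rough sketch of that tension in Swift, with made-up types loosely modeled on the device-type-and-trait idea (none of these names come from Google’s actual schema): standard traits map onto the assistant’s built-in phrasing, while anything custom falls outside it.

```swift
// Hypothetical model of device types and traits. Standard traits get the
// standard grammar; custom ones have no built-in phrasing at all.
enum Trait {
    case onOff
    case brightness(range: ClosedRange<Int>)
    case custom(name: String)   // e.g. a "Cancun Sun" effect the spec doesn't cover
}

struct Device {
    let type: String            // "light"
    let name: String            // "living room light"
    let traits: [Trait]
}

// Only the standard traits translate into phrases the assistant understands.
func builtInPhrases(for device: Device) -> [String] {
    device.traits.flatMap { trait -> [String] in
        switch trait {
        case .onOff:
            return ["Turn on my \(device.name)", "Turn my \(device.name) off"]
        case .brightness:
            return ["Set my \(device.name) to 50 percent"]
        case .custom:
            return []            // drop the feature, or ship a custom app/skill
        }
    }
}
```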
The problem isn’t unique to technology and innovation. In the creative arts, you need a way to allow for a stream of fresh, new ideas.
“We work in an office. You know, it’s funny. People always say to me, Ah, man, you guys—it probably must be so much fun, sitting around! And it’s like, Yeah, our morning meeting starts at nine. We have to pitch out our ideas—and in some ways that is the challenge of a show. It’s to create a factory that doesn’t kill inspiration and imagination. You try to create a process that includes all the aspects of a mechanized process that we recognize as soul killing while not actually killing souls.”
Once an item has been categorized into a taxonomy, it tends to stay there, often in the name of backward compatibility. For example, let’s say you’ve classified a new type of Marine Red Algae as a sort of catch-all, plant-like Protist. It performs photosynthesis, converting carbon dioxide and water into sugar and energy.
But then, one day, someone notices it doesn’t really have roots, stems, or, for that matter, leaves. So it gets filed into a separate category, the Thallophytes.
Later still, it gets re-classified as a Rhodophyceae. That’s fine, but it means all the books and journals that referenced the old classification now present out-of-date or technically incorrect information. There is even a mechanism for regrouping by shared ancestry, the Clade. But if the grouping and the ancestry don’t exactly match up, you may end up with a Polyphyletic or Paraphyletic group.
And what if something is part of multiple structures? The single-rooted taxonomy gets complicated very quickly.
Now take all that, and add real-time, dynamic categorization, as Hofstadter and Sander do in Surfaces and Essences:
What, then, do we mean in this book by “category” and “categorization”? For us, a category is a mental structure that is created over time and that evolves, sometimes slowly and sometimes quickly, and that contains information in an organized form, allowing access to it under suitable conditions. The act of categorization is the tentative and gradated, gray-shaded linking of an entity or a situation to a prior category in one’s mind.
…
If a category has fringe members, they are made to seem extremely quirky and unnatural, suggesting that nature is really cut precisely at the joints by the categories that we know. The resulting illusory sense of the near-perfect certainty and clarity of categories gives rise to much confusion about categories and the mental processes that underlie categorization. The idea that category membership always comes in shades of gray rather than in just black and white runs firmly against ancient cultural conventions and is therefore disorienting and even disturbing; accordingly, it gets swept under the rug most of the time.
The authors recognize that classification is a way to help us create an orderly and predictable view of our universe. But they go one step further, and recognize it as a continuous process instead of a fixed stage:
In short, nonstop categorization is every bit as indispensable to our survival in the world as is the nonstop beating of our hearts. Without the ceaseless pulsating heartbeat of our “categorization engine”, we would understand nothing around us, could not reason in any form whatever, could not communicate with anyone else, and would have no basis on which to take any action.
A continuously categorizing engine, updating via external observations and events. Let’s keep that in mind.
It did more to blow the doors off my young mind than any other book I had read up to that point. Everyone should read it.
Back To The Business
Discoverability is a big problem for the next version of Siri. Apps have to provide hints as to what category they fit in. But the meaning of what they can actually do may vary depending on the semantics of the user request. Mix into that access to private data that is only available on your own device (and not necessarily the cloud).
See how hard the problem gets?
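For what it’s worth, the hint mechanism Apple ships today is App Shortcuts: the developer declares a fixed list of invocation phrases, and each phrase has to include the application’s name. A minimal sketch, reusing the hypothetical intent from earlier:

```swift
import AppIntents

// Hypothetical App Shortcuts provider: a static list of phrases telling Siri
// how the intent might be invoked. Every phrase carries the app's name, which
// is the opposite of Alexa's name-free interactions.
struct PodcastAppShortcuts: AppShortcutsProvider {
    static var appShortcuts: [AppShortcut] {
        AppShortcut(
            intent: ShowLatestPodcastIntent(),
            phrases: [
                "Show my latest podcast in \(.applicationName)",
                "What did someone send me on \(.applicationName)"
            ],
            shortTitle: "Latest Podcast",
            systemImageName: "headphones"
        )
    }
}
```

The phrases are fixed at build time, so discovery is only as good as what the developer thought to declare up front.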
Let’s go back to what Apple’s Federighi said at WWDC25. Here’s the extended quote:
Federighi: About half of our Siri section talked about things like being able to invoke a broader range of actions across your device by App Intents being orchestrated by Siri to let it do more things. We also talked about the ability to use personal knowledge from that semantic index, so if you asked for things like, you know, ‘What’s that podcast that Joz sent me,’ that we could find it whether it was in your messages or in your email and call it out and then maybe even act on it using those App Intents.
That piece is the piece that we have not delivered yet.
What comes next is a familiar refrain from the days of the AI Winters. The first generation of a technology only takes you so far. Promises are made and time estimates are offered (and missed).
Climbing a hill should not give one any assurance that if he keeps going he will reach the sky. Perhaps one may have overlooked some serious problem lying ahead.
Federighi continues:
We found that when we were developing this feature that we had really two phases, two versions of the ultimate architecture that we were going to create, and version one we had sort of working here at the time that we were getting close to the [WWDC24] conference and had at the time high confidence that we could deliver it; we thought by December [2024], and if not, we figured by Spring [2025].
So we announced it as part of [2024] WWDC because we knew the world wanted a really complete picture of what’s Apple thinking about, the implications of Apple Intelligence, and where’s it going. We also had a v2 that was a deeper end-to-end architecture that we knew was ultimately what we wanted to create to get to the full set of capabilities that we wanted for Siri.
The part about Siri-orchestrated App Intents sure sounds similar to Alexa’s Name-free Interactions. Same wolf (er, coyote), different sheep’s clothing.
The part about the semantic index may seem unrelated, but it’s there because Apple wants to search content associated with you. If you’re not using Apple’s apps (e.g., Mail, Messages, or Calendar), your personal knowledge may be out of reach, locked inside a third-party app. Then it won’t work, and the user experience is…
Federighi’s “What’s that podcast that Joz sent me?” could be sent via Signal, WhatsApp, or Messenger. That podcast could be played inside Apple Podcasts, Spotify, or many other third-party podcast apps.
If Apple wants to let you have a seamless Siri experience and answer that question (without you explicitly naming apps), they will have to implement a version of Name-free Interactions. What’s more, they’ll have to do it without relying on users running everything through the App Store installer (Hello, Third-Party App Stores).
As of this writing, Apple’s App Intents framework requires that developers implement and offer their capabilities as code, hard-coded into their app. Also note that as of early June 2025, Apple’s App Intents page issues a warning:
Note
Siri’s personal context understanding, onscreen awareness, and in-app actions are in development and will be available with a future software update.
This is a clue as to why Apple was mum on Siri at WWDC25. Apple did announce that developers could make use of Apple’s on-device Foundation Models inside their apps. This was a HUGE bit of functionality most reporters and analysts missed.
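For the curious, here is roughly what that looks like from an app’s point of view, a minimal sketch assuming the FoundationModels API Apple previewed (availability checks, generation options, and error handling omitted):

```swift
import FoundationModels

// Minimal sketch of calling Apple's on-device foundation model from an app,
// assuming the FoundationModels framework as previewed at WWDC25.
func summarize(_ text: String) async throws -> String {
    let session = LanguageModelSession(
        instructions: "Summarize the user's text in one sentence."
    )
    let response = try await session.respond(to: text)
    return response.content
}
```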
But this will make the problem worse by complicating how apps and on-device LLMs interact
(we’ll get to that next).
What To Do?
The problem actually goes even deeper.
We haven’t even gotten to fun topics like Accessibility, Dynamic Discovery, Security, and my favorite, Payments (i.e., who pays for all this).