How Siri Works – Interview with Tom Gruber, CTO of SIRI

January 26th, 2010

Sneak Preview of Siri – The Virtual Assistant that will Make Everyone Love the iPhone, Part 2: The Technical Stuff

In Part-One of this article on TechCrunch, I covered the emerging paradigm of Virtual Assistants and explored a first look at a new product in this category called Siri. In this article, Part-Two, I interview Tom Gruber, CTO of Siri, about the history, key ideas, and technical foundations of the product:

Nova Spivack: Can you give me a more precise definition of a Virtual Assistant?

Tom Gruber: A virtual personal assistant is a software system that

  • Helps the user find or do something (focus on tasks, rather than information)
  • Understands the user’s intent (interpreting language) and context (location, schedule, history)
  • Works on the user’s behalf, orchestrating multiple services and information sources to help complete the task

In other words, an assistant helps me do things by understanding me and working for me. This may seem quite general, but it is a fundamental shift from the way the Internet works today. Portals, search engines, and web sites are helpful but they don’t do things for me – I have to use them as tools to do something, and I have to adapt to their ways of taking input.

Nova Spivack: Siri is hoping to kick-start the revival of the Virtual Assistant category, for the Web. This is an idea which has a rich history. What are some of the past examples that have influenced your thinking?

Tom Gruber: The idea of interacting with a computer via a conversational interface with an assistant has excited the imagination for some time. Apple’s famous Knowledge Navigator video offered a compelling vision, in which a talking head agent helped a professional deal with schedules and access information on the net. The late Michael Dertouzos, head of MIT’s Computer Science Lab, wrote convincingly about the assistant metaphor as the natural way to interact with computers in his book “The Unfinished Revolution: Human-Centered Computers and What They Can Do For Us”. These accounts of the future say that you should be able to talk to your computer in your own words, saying what you want to do, with the computer talking back to ask clarifying questions and explain results. These are hallmarks of the Siri assistant. Some of the elements of these visions are beyond what Siri does, such as general reasoning about science in the Knowledge Navigator. Or self-awareness a la Singularity. But Siri is the real thing, using real AI technology, just made very practical on a small set of domains. The breakthrough is to bring this vision to a mainstream market, taking maximum advantage of the mobile context and internet service ecosystems.

Nova Spivack: Tell me about the CALO project, that Siri spun out from. (Disclosure: my company, Radar Networks, consulted to SRI in the early days on the CALO project, to provide assistance with Semantic Web development)

Tom Gruber: Siri has its roots in the DARPA CALO project (“Cognitive Agent that Learns and Organizes”) which was led by SRI. The goal of CALO was to develop AI technologies (dialog and natural language understanding,s understanding, machine learning, evidential and probabilistic reasoning, ontology and knowledge representation, planning, reasoning, service delegation) all integrated into a virtual assistant that helps people do things. It pushed the limits on machine learning and speech, and also showed the technical feasibility of a task-focused virtual assistant that uses knowledge of user context and multiple sources to help solve problems.

Siri is integrating, commercializing, scaling, and applying these technologies to a consumer-focused virtual assistant. Siri was under development for several years during and after the CALO project at SRI. It was designed as an independent architecture, tightly integrating the best ideas from CALO but free of the constraints of a national distributed research project. The Siri.com team has been evolving and hardening the technology since January 2008.

Nova Spivack: What are primary aspects of Siri that you would say are “novel”?

Tom Gruber: The demands of the consumer internet focus — instant usability and robust interaction with the evolving web — has driven us to come up with some new innovations:

  • A conversational interface that combines the best of speech and semantic language understanding with an interactive dialog that helps guide people toward saying what they want to do and getting it done. The conversational interface allows for much more interactivity that one-shot search style interfaces, which aids usability and improves intent understanding. For example, if Siri didn’t quite hear what you said, or isn’t sure what you meant, it can ask for clarifying information. For example, it can prompt on ambiguity: did you mean pizza restaurants in Chicago or Chicago-style pizza places near you? It can also make reasonable guesses based on context. Walking around with the phone at lunchtime, if the speech interpretation comes back with something garbled about food you probably meant “places to eat near my current location”. If this assumption isn’t right, it is easy to correct in a conversation.
  • Semantic auto-complete – a combination of the familiar “autocomplete” interface of search boxes with a semantic and linguistic model of what might be worth saying. The so-called “semantic completion” makes it possible to rapidly state complex requests (Italian restaurants in the SOMA neighborhood of San Francisco that have tables available tonight) with just a few clicks. It’s sort of like the power of faceted search a la Kayak, but packaged in a clever command line style interface that works in small form factor and low bandwidth environments.
  • Service delegation – Siri is particularly deep in technology for operationalizing a user’s intent into computational form, dispatching to multiple, heterogeneous services, gathering and integrating results, and presenting them back to the user as a set of solutions to their request. In a restaurant selection task, for instance, Siri combines information from many different sources (local business directories, geospatial databases, restaurant guides, restaurant review sources, online reservation services, and the user’s own favorites) to show a set of candidates that meet the intent expressed in the user’s natural language request.

Nova Spivack: Why do you think Siri will succeed when other AI-inspired projects have failed to meet expectations?

Tom Gruber: In general my answer is that Siri is more focused. We can break this down into three areas of focus:

  • Task focus. Siri is very focused on a bounded set of specific human tasks, like finding something to do, going out with friends, and getting around town. This task focus allows it to have a very rich model of its domain of competence, which makes everything more tractable from language understanding to reasoning to service invocation and results presentation
  • Structured data focus. The kinds of tasks that Siri is particularly good at involve semistructured data, usually on tasks involving multiple criteria and drawing from multiple sources. For example, to help find a place to eat, user preferences for cuisine, price range, location, or even specific food items come into play. Combining results from multiple sources requires reasoning about domain entity identity and the relative capabilities of different information providers. These are hard problems of semantic information processing and integration that are difficult but feasible today using the latest AI technologies.
  • Architecture focus. Siri is built from deep experience in integrating multiple advanced technologies into a platform designed expressly for virtual assistants. Siri co-founder Adam Cheyer was chief architect of the CALO project, and has applied a career of experience to design the platform of the Siri product. Leading the CALO project taught him a lot about what works and doesn’t when applying AI to build a virtual assistant. Adam and I also have rather unique experience in combining AI with intelligent interfaces and web-scale knowledge integration. The result is a “pure play” dedicated architecture for virtual assistants, integrating all the components of intent understanding, service delegation, and dialog flow management. We have avoided the need to solve general AI problems by concentrating on only what is needed for a virtual assistant, and have chosen to begin with a finite set of vertical domains serving mobile use cases.

Nova Spivack: Why did you design Siri primarily for mobile devices, rather than Web browsers in general?

Tom Gruber: Rather than trying to be like a search engine to all the world’s information, Siri is going after mobile use cases where deep models of context (place, time, personal history) and limited form factors magnify the power of an intelligent interface. The smaller the form factor, the more mobile the context, the more limited the bandwidth : the more it is important that the interface make intelligent use of the user’s attention and the resources at hand. In other words, “smaller needs to be smarter.” And the benefits of being offered just the right level of detail or being prompted with just the right questions can make the difference between task completion or failure. When you are on the go, you just don’t have time to wade through pages of links and disjoint interfaces, many of which are not suitable to mobile at all.

Nova Spivack: What language and platform is Siri written in?

Tom Gruber: Java, Javascript, and Objective C (for the iPhone)

Nova Spivack: What about the Semantic Web? Is Siri built with Semantic Web open-standards such as RDF and OWL, Sparql?

Tom Gruber: No, we connect to partners on the web using structured APIs, some of which do use the Semantic Web standards. A site that exposes RDF usually has an API that is easy to deal with, which makes our life easier. For instance, we use geonames.org as one of our geospatial information sources. It is a full-on Semantic Web endpoint, and that makes it easy to deal with. The more the API declares its data model, the more automated we can make our coupling to it.

Nova Spivack: Siri seems smart, at least about the kinds of tasks it was designed for. How is the knowledge represented in Siri – is it an ontology or something else?

Tom Gruber: Siri’s knowledge is represented in a unified modeling system that combines ontologies, inference networks, pattern matching agents, dictionaries, and dialog models. As much as possible we represent things declaratively (i.e., as data in models, not lines of code). This is a tried and true best practice for complex AI systems. This makes the whole system more robust and scalable, and the development process more agile. It also helps with reasoning and learning, since Siri can look at what it knows and think about similarities and generalizations at a semantic level.

Nova Spivack: Will Siri be part of the Semantic Web, or at least the open linked data Web (by making open API’s, sharing of linked data, RDF, available, etc.)?

Tom Gruber: Siri isn’t a source of data, so it doesn’t expose data using Semantic Web standards. In the Semantic Web ecosystem, it is doing something like the vision of a semantic desktop – an intelligent interface that knows about user needs and sources of information to meet those needs, and intermediates. The original Semantic Web article in Scientific American included use cases that an assistant would do (check calendars, look for things based on multiple structured criteria, route planning, etc.). The Semantic Web vision focused on exposing the structured data, but it assumes APIs that can do transactions on the data. For example, if a virtual assistant wants to schedule a dinner it needs more than the information about the free/busy schedules of participants, it needs API access to their calendars with appropriate credentials, ways of communicating with the participants via APIs to their email/sms/phone, and so forth. Siri is building on the ecosystem of APIs, which are better if they declare the meaning of the data in and out via ontologies. That is the original purpose of ontologies-as-specification that I promoted in the 1990s – to help specify how to interact with these agents via knowledge-level APIs.

Siri does, however, benefit greatly from standards for talking about space and time, identity (of people, places, and things), and authentication. As I called for in my Semantic Web talk in 2007, there is no reason we should be string matching on city names, business names, user names, etc.

All players near the user in the ecommerce value chain get better when the information that the users need can be unambiguously identified, compared, and combined. Legitimate service providers on the supply end of the value chain also benefit, because structured data is harder to scam than text. So if some service provider offers a multi-criteria decision making service, say, to help make a product purchase in some domain, it is much easier to do fraud detection when the product instances, features, prices, and transaction availability information are all structured data.

Nova Spivack: Siri appears to be able to handle requests in natural language. How good is the natural language processing (NLP) behind it? How have you made it better than other NLP?

Tom Gruber: Siri’s top line measure of success is task completion (not relevance). A subtask is intent recognition, and subtask of that is NLP. Speech is another element, which couples to NLP and adds its own issues. In this context, Siri’s NLP is “pretty darn good” — if the user is talking about something in Siri’s domains of competence, its intent understanding is right the vast majority of the time, even in the face of noise from speech, single finger typing, and bad habits from too much keywordese. All NLP is tuned for some class of natural language, and Siri’s is tuned for things that people might want to say when talking to a virtual assistant on their phone. We evaluate against a corpus, but I don’t know how it would compare to standard message and news corpuses using by the NLP research community.

Nova Spivack: Did you develop your own speech interface, or are you using third-party system for that? How good is it? Is it battle-tested?

Tom Gruber: We use third party speech systems, and are architected so we can swap them out and experiment. The one we are currently using has millions of users and continuously updates its models based on usage.

Nova Spivack: Will Siri be able to talk back to users at any point?

Tom Gruber: It could use speech synthesis for output, for the appropriate contexts. I have a long standing interest in this, as my early graduate work was in communication prosthesis. In the current mobile internet world, however, iPhone-sized screens and 3G networks make it possible to do so more much than read menu items over the phone. For the blind, embedded appliances, and other applications it would make sense to give Siri voice output.

Nova Spivack: Can you give me more examples of how the NLP in Siri works?

Tom Gruber: Sure, here’s an example, published in the Technology Review, that illustrates what’s going on in a typical dialogue with Siri. (Click link to view the table)

Nova Spivack: How personalized does Siri get – will it recommend different things to me depending on where I am when I ask, and/or what I’ve done in the past? Does it learn?

Tom Gruber: Siri does learn in simple ways today, and it will get more sophisticated with time. As you said, Siri is already personalized based on immediate context, conversational history, and personal information such as where you live. Siri doesn’t forget things from request to request, as do stateless systems like search engines. It always considers the user model along with the domain and task models when coming up with results. The evolution in learning comes as users have a history with Siri, which gives it a chance to make some generalizations about preferences. There is a natural progression with virtual assistants from doing exactly what they are asked, to making recommendations based on assumptions about intent and preference. That is the curve we will explore with experience.

Nova Spivack: How does Siri know what is in various external services – are you mining and doing extraction on their data, or is it all just real-time API calls?

Tom Gruber: For its current domains Siri uses dozens of APIs, and connects to them in both realtime access and batch data synchronization modes. Siri knows about the data because we (humans) explicitly model what is in those sources. With declarative representations of data and API capabilities, Siri can reason about the various capabilities of its sources at run time to figure out which combination would best serve the current user request. For sources that do not have nice APIs or expose data using standards like the Semantic Web, we can draw on a value chain of players that do extract structure by data mining and exposing APIs via scraping.

Nova Spivack: Thank you for the information, Siri might actually make me like the iPhone enough to start using one again.

Tom Gruber: Thank you, Nova, it’s a pleasure to discuss this with someone who really gets the technology and larger issues. I hope Siri does get you to use that iPhone again. But remember, Siri is just starting out and will sometimes say silly things. It’s easy to project intelligence onto an assistant, but Siri isn’t going to pass the Turing Test. It’s just a simpler, smarter way to do what you already want to do. It will be interesting to see how this space evolves, how people will come to understand what to expect from the little personal assistant in their pocket.