How does Google Assistant understand your questions?

November 3, 2022

By Valentina Tuta

Talking about the Google Assistant can feel like talking about the future of technology, which is probably why we sometimes ask ourselves: how do voice-activated virtual assistants work? Specifically, how do they understand what someone is asking, then provide a correct, useful and even delightful response?

The Assistant can respond to many different types of queries. Whether you’re curious about the biggest mammal in the world or whether your favorite ice cream shop is open, chances are the Assistant can answer that for you. And that’s not all: Google’s team works every day to make its responses better, faster and more helpful.

In this article, we share what Distinguished Scientist Françoise Beaufays, an engineer and researcher on Google’s speech team, told us about her work and about how the Assistant understands voice queries and then delivers satisfying (and often charming) answers.

What exactly does Françoise do at Google?

Françoise leads the speech recognition team at Google. Her job is to build speech recognition systems for all the Google products that are powered by voice. She told us that her team’s work allows the Assistant to hear its users, try to understand what they want and then take action.

The same work makes it possible to caption YouTube videos and Meet calls as people speak, and it allows users to dictate text messages to their friends and family. Speech recognition technology is behind all of those experiences.

Why is it so important for speech recognition to work as well as possible with the Assistant?

The Assistant is based on understanding what someone said and then taking action based on that understanding, so it’s critical that the interaction is very smooth. You would only choose to do something by voice that you could do with your fingers if it provides a benefit. And if you speak to a machine and you’re not confident it can understand you quickly, the delight disappears.

So how does the machine understand what you’re asking? How did it learn to recognize spoken words in the first place?

Everything in speech recognition is machine learning. Machine learning is a type of technology where an algorithm is used to help a “model” learn from data. It does not rely on hand-written rules, such as trying to deduce words from the number of milliseconds it takes to pronounce each phoneme.

Machine learning is much smarter than that. Instead, the team presents a bunch of audio snippets to the model and tells it: here, somebody said, “This cat is happy.” Here, someone said, “That dog is tired.”

The model progressively learns the difference between these examples, and it also comes to understand variations of the original snippets, such as “This cat is tired” or “This dog is not happy,” no matter who says them. The models currently used for the Assistant are called deep neural networks.

But what’s a deep neural network?

It’s a kind of model inspired by how the human brain works. As you know, your brain uses neurons to share information and cause the rest of your body to act. In artificial neural networks, the “neurons” are what we call computational units, or bits of code that communicate with each other.

These computational units are grouped into layers and these layers can stack on top of each other to create more complex possibilities for understanding and action. You end up with these “neural networks” that can get big and involved (hence, deep neural networks).
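To make the idea of stacked layers concrete, here is a minimal sketch in Python. It is only an illustration: the layer sizes, the random weights and the function names are assumptions made for this example, and a real speech model is vastly larger and is trained on data rather than filled with random numbers.

```python
import math
import random


def make_layer(n_inputs, n_units, rng):
    """Create one layer: a row of weights and a bias for each computational unit."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_units)]
    biases = [0.0] * n_units
    return weights, biases


def layer_forward(layer, inputs):
    """Each unit computes a weighted sum of its inputs and squashes it with tanh."""
    weights, biases = layer
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def network_forward(layers, inputs):
    """Feed the output of each layer into the next layer in the stack."""
    activations = inputs
    for layer in layers:
        activations = layer_forward(layer, activations)
    return activations


rng = random.Random(0)  # fixed seed so the sketch is repeatable
layer_sizes = [4, 8, 8, 3]  # 4 inputs, two hidden layers of 8 units, 3 outputs
layers = [make_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

output = network_forward(layers, [0.1, -0.3, 0.7, 0.2])
print(len(layers), len(output))  # prints: 3 3
```

Adding more entries to layer_sizes makes the stack deeper, which is all that “deep” means in this toy version.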

For the Assistant, a deep neural network receives an input, such as audio from someone speaking, and processes that information through a stack of layers to convert it into text. This is what we call “speech recognition.”

Then, the text is processed by another stack of layers to parse it into pieces of information that help the Assistant understand what you need and help you by displaying a result or taking an action on your behalf. This is what we call “natural language processing.”

Let’s say I ask the Assistant something pretty straightforward, like, “Hey Google, where’s the closest dog park?” How would the Assistant understand what I’m saying and respond to my query?

In this section, Françoise describes the process in detail. We have tried to relay her explanation as clearly as possible, so read on to catch every detail.

The first step is for the Assistant to process that “Hey Google” and realize: “Ah, it sounds like this person is talking to me and wants something from me”.

The Assistant then picks up the rest of the audio, processes the question and produces the text. As it does so, it tries to understand what your sentence is about: what kind of intention do you have?

To determine this, Assistant analyzes the text of your question with another neural network that tries to identify the semantics, i.e. the meaning, of your question.

In this case, it will figure out that the question calls for a search. Since it’s a location-based question, and if your settings allow it, the Assistant can send your device’s geographic data to Google Maps to return results for the dog park nearest you.
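The routing decision described above can be sketched in a few lines of Python. This is a toy under stated assumptions: the Assistant uses a neural network to identify meaning, whereas the sketch substitutes a handful of keyword cues, and the function names, intent labels and backend names (“maps”, “search”) are hypothetical placeholders, not Google APIs.

```python
def classify_intent(text):
    """Toy stand-in for the semantic model: keyword cues, NOT a neural network."""
    lowered = text.lower()
    location_cues = ("where", "closest", "nearest", "near me")
    if any(cue in lowered for cue in location_cues):
        return "local_search"
    return "web_search"


def route_query(text, location_allowed):
    """Pick a hypothetical backend from the intent and the user's location setting."""
    intent = classify_intent(text)
    if intent == "local_search" and location_allowed:
        return "maps"
    return "search"


print(route_query("Where's the closest dog park?", location_allowed=True))   # maps
print(route_query("Where's the closest dog park?", location_allowed=False))  # search
print(route_query("What is the most popular dog?", location_allowed=True))   # search
```

Note how the location setting acts as a gate: the same question falls back to a general search when location sharing is not allowed.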

Once it has the information it needs to answer you, the Assistant ranks the possible answers based on how confident it is that it understood you correctly and on how relevant each potential answer is.

Finally, it decides which response is best and provides it in the appropriate format for your device. For example, if the device is just a speaker, it might give you spoken information. If you have a screen in front of you, it might show you a map with walking directions.
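The ranking and formatting steps can be illustrated with a short Python sketch. Everything specific in it is an assumption made for the example: the article does not say how confidence and relevance are combined, so the sketch simply multiplies them, and the candidate answers, scores and field names are invented.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    recognition_confidence: float  # how sure the system is that it heard the query right
    relevance: float               # how well this answer fits the query


def rank_candidates(candidates):
    """Sort best-first by confidence times relevance (an assumed scoring rule)."""
    return sorted(candidates,
                  key=lambda c: c.recognition_confidence * c.relevance,
                  reverse=True)


def format_response(candidate, has_screen):
    """Choose a presentation that suits the device."""
    if has_screen:
        return {"mode": "display", "content": f"Map with directions to {candidate.answer}"}
    return {"mode": "speech", "content": f"The closest one is {candidate.answer}."}


# Invented candidates for the query "where's the closest dog park?"
candidates = [
    Candidate("Riverside Bark Cafe", recognition_confidence=0.6, relevance=0.9),
    Candidate("Riverside Dog Park", recognition_confidence=0.9, relevance=0.8),
    Candidate("Hillside Dog Run", recognition_confidence=0.9, relevance=0.5),
]

best = rank_candidates(candidates)[0]
print(best.answer)                                      # Riverside Dog Park
print(format_response(best, has_screen=False)["mode"])  # speech
print(format_response(best, has_screen=True)["mode"])   # display
```

Keeping ranking separate from formatting means the same best answer can be spoken on a speaker or drawn on a screen without being recomputed.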

What if I asked something more ambiguous, like “Ok Google, what is the most popular dog?” How would the Assistant know whether I meant the most popular dog breed, the most popular dog name or the most popular famous dog?

Françoise points out that the Assistant needs to understand what you are looking for. In the first example, that was easy to infer: because the question asked “where,” it was clearly about a location, so it made sense to use Maps to help.

For the second example, Françoise says the Assistant would recognize that this is a more open-ended question and would simply call on Search to answer it. What this really comes down to is identifying the best interpretation.

She adds that one very useful signal is how satisfied previous users were with similar answers to similar questions. The Assistant can take this into account when ranking, which helps it decide how confident it is in its interpretation. Ultimately, the question would go to Search, and the results would be presented to you in the format that works best for your device.

Lastly, if someone asks a question that has bits and bobs of different languages, how does the Assistant understand them?

From Françoise’s perspective, this is the most complicated aspect of Google’s voice assistants. Nevertheless, she mentions that the easiest way to handle a user who speaks two languages is for the Assistant to listen to a little of what they say and try to recognize which language they are speaking. It can do this using different models, each one dedicated to understanding a specific language.
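The “listen to a bit, identify the language, then use that language’s model” approach can be sketched in Python. This is a toy under heavy assumptions: the real system works on audio, while the sketch works on text and guesses the language from a few common words, and the word lists and recognizer functions are hypothetical stand-ins for real language-specific models.

```python
# Tiny, hand-picked word lists used only to guess the language in this sketch.
COMMON_WORDS = {
    "en": {"the", "is", "where", "what", "and"},
    "es": {"el", "es", "dónde", "qué", "y", "está"},
    "fr": {"le", "est", "où", "quel", "et"},
}


def identify_language(snippet):
    """Guess the language of a short snippet by counting known common words."""
    words = [w.strip("?¿!.,") for w in snippet.lower().split()]
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in COMMON_WORDS.items()}
    return max(scores, key=scores.get)


def make_recognizer(lang):
    """Build a placeholder for a model dedicated to one language."""
    def recognize(utterance):
        return f"[{lang} model] {utterance}"
    return recognize


RECOGNIZERS = {lang: make_recognizer(lang) for lang in COMMON_WORDS}


def transcribe(utterance, probe_words=3):
    """Inspect only the first few words, then hand everything to that language's model."""
    probe = " ".join(utterance.split()[:probe_words])
    lang = identify_language(probe)
    return lang, RECOGNIZERS[lang](utterance)


print(transcribe("where is the closest dog park")[0])      # en
print(transcribe("dónde está el parque para perros")[0])   # es
print(transcribe("où est le parc pour chiens")[0])         # fr
```

Because the choice is made from the opening words alone, this design struggles when a speaker changes language partway through a sentence.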

She went on to say that there is another way to achieve this: training a single model to understand many languages at the same time. The team is still developing this alternative, but it is expected to become a reality sooner or later.

“In many cases, people switch from one language to the other within the same sentence. Having a single model that understands what those languages are is a great solution to that,” the engineer said, closing the interview.