Research workshop. Voice-activated virtual assistants - what's wrong with them?

Introduction



Analysts covering chatbot and virtual assistant services forecast market growth of at least 30% per year. In absolute terms, as of 2019 the market was valued at over $2 billion a year. Virtually all of the world's leading IT companies have released voice assistants, and Apple, Google and Amazon have already done the bulk of the promotion.






The Russian market also has its own leaders in this area. Yandex became the first major player to launch its own voice assistant in Russia. According to the company's officially published data, 45 million users interact with Alice every month, and the assistant handles more than 1 billion requests monthly. According to experts, 2020 could be a turning point for the voice assistant market: competition between platforms and brands will lead to an increase in the recognition of assistants.



In general, there is no doubt that the voice assistant market is an attractive niche. And the first idea that comes to mind is to take any of the available ASR (Automatic Speech Recognition) and TTS (Text To Speech) services, wire them up to a bot builder with NLU (Natural Language Understanding) support, and that's it! Moreover, all of this can be put together quite easily and quickly on cloud platforms such as Twilio and VoxImplant.



The only problem is that the result will be very mediocre. Why? First of all, let's try to understand why a combination of individually good technologies, put together, gives such a mediocre result. This matters because in real life the client will always prefer the service whose voice interface is more convenient, more interesting, smarter and faster than the others.



How a typical voice assistant works



First of all, note that our speech is a sequence of sounds. Sound, in turn, is a superposition of acoustic vibrations (waves) of different frequencies. A wave, as we know from physics, is characterized by two attributes: amplitude and frequency.



image

Speech signal
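To make the "superposition of waves" idea concrete, here is a minimal sketch that builds a toy speech-like signal as the sum of two sinusoids. The frequencies and amplitudes are illustrative assumptions, not real speech formants:

```python
import math

SAMPLE_RATE = 8000  # samples per second (typical for telephony audio)

def sample(t: float) -> float:
    """Signal value at time t: a superposition of two sine waves,
    a 220 Hz fundamental plus a quieter 440 Hz overtone."""
    return 1.0 * math.sin(2 * math.pi * 220 * t) + 0.5 * math.sin(2 * math.pi * 440 * t)

# 100 ms of "audio": each sample is just the sum of the two waves.
signal = [sample(n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 10)]

# Peak amplitude of the combined wave (larger than either component alone
# where the waves reinforce each other, smaller where they cancel).
peak = max(abs(v) for v in signal)
print(round(peak, 2))
```

An ASR system works with exactly such sampled amplitude sequences, extracting the frequency content frame by frame.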



A typical assistant works according to the following algorithm:



  1. , , – . , «», .. .



    , , , - . ( ), «» . , , — , — . , , . , , , , .



    , , , , . , ASR .



    , – . , .



    , .
  2. The text obtained at the first stage is passed to a bot with NLU support, which identifies intents and entities, fills slots and forms the response text.



    As a result, at the output we get a text representation of the response phrase, which is the voice assistant's reaction to the received request.
  3. The assistant's response is passed to the speech synthesis (TTS) service, and the resulting audio is played back to the person.
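The three steps above can be sketched as a chain of sequential calls. Here `recognize`, `understand` and `synthesize` are hypothetical stubs standing in for real cloud ASR, NLU-bot and TTS services:

```python
# Minimal sketch of the classical "sequential" assistant pipeline.
# The three service functions are stubs for illustration only.

def recognize(audio: bytes) -> str:
    """ASR stub: turn the fully recorded phrase into text."""
    return audio.decode("utf-8")  # pretend the audio is already text

def understand(text: str) -> str:
    """NLU stub: detect the intent and form the response text."""
    if "atm" in text.lower():
        return "The nearest ATM is about a hundred meters to the southeast."
    return "Sorry, I did not understand you."

def synthesize(text: str) -> bytes:
    """TTS stub: turn the response text into audio."""
    return text.encode("utf-8")

def handle_request(audio: bytes) -> bytes:
    # Step 1: speech -> text. Note that in this scheme ASR starts
    # only after the user has finished speaking.
    text = recognize(audio)
    # Step 2: text -> response text (intents, entities, slots).
    response_text = understand(text)
    # Step 3: response text -> audio to play back to the user.
    return synthesize(response_text)

reply = handle_request(b"Where is the nearest ATM?")
print(reply.decode("utf-8"))
```

The key property of this scheme is that the three stages run strictly one after another, so their latencies add up on every turn of the dialogue.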


Emerging problems



Despite the apparent correctness of this approach, in the case of a voice assistant it brings a number of problems. Here are the main ones:



  1. Delays. Each component in the chain adds its own latency: uploading audio to the cloud, recognition, NLU processing, synthesis. Even if a single component takes about 500 ms, the total response time easily exceeds 1 second.

    In live speech, a pause of more than a second already feels unnatural: instead of a quick "Hello!" in reply to a greeting, the person hears silence and starts to wonder whether the assistant heard him at all.

    The main sources of delay are:

    • Transferring audio to and from the cloud services.
    • The ASR service, which waits for the end of the phrase before returning the text.
    • NLU processing of the recognized text.
    • Synthesis of the complete response by the TTS service.

    And all of this is repeated on every turn of the dialogue!

  2. The assistant is "deaf" while it speaks. In the sequential scheme the system either listens or responds, so the user cannot interrupt a long answer, ask a clarifying question or correct the assistant mid-phrase.
  3. Loss of dialogue context. Each request is processed in isolation, so an exchange like the following breaks down:

    — Where is the nearest ATM?

    — Here is the address. Anything else, shall I repeat it?

    If the user then says simply "Repeat", the assistant must understand that this refers to the address from the previous turn; without stored context it cannot.
  4. Pauses and filler words. A person rarely pronounces a request as one smooth phrase:

    — Umm… I'd like to… er…

    — Well, you see, I… m-m…

    — So, about that… uh… never mind…

    If any pause is treated as the end of the phrase, the user is cut off mid-thought; if the system waits longer, it adds yet another delay.
  5. The monotonous, lifeless intonation of the voice produced by the TTS service.
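The delay problem is easy to feel with simple arithmetic. The per-component figures below are assumptions for the sake of the example, not measurements of any particular service:

```python
# Illustrative latency budget for the sequential pipeline.
# Every number here is a made-up but plausible figure.
latency_ms = {
    "audio upload to cloud": 100,
    "ASR (waits for end of phrase)": 500,
    "NLU / bot processing": 200,
    "TTS synthesis of full response": 300,
    "audio download and playback start": 100,
}

total = sum(latency_ms.values())
print(total)  # 1200 ms before the user hears the first sound of the reply
```

Even with optimistic numbers the user waits over a second in silence, and this cost is paid on every single turn of the conversation.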


What can be done?



First, when implementing a voice assistant, it is imperative to keep "listening" to the interlocutor, including at the moments when the assistant itself is voicing an outgoing message. An either/or choice between listening and responding is an extremely poor implementation and should be avoided in real life.



Second, the speed of all system components should be optimized. However, at some point we will inevitably hit the limits of latency reduction and of further complication of the natural-language processing scenarios. This leads to the understanding that the approach to implementing the voice service must be changed fundamentally.



The main idea behind the new approach is to mimic what the human brain does. Have you noticed that in a conversation a person begins to analyze the interlocutor's message not at the moment it is completely finished, but almost immediately, from the very beginning of its sounding, refining his understanding with each new word? For this reason we are often ready to answer even before the interlocutor has finished speaking.



Returning to the algorithm that the voice assistant should implement, it may now look like this (for illustration, consider the incoming question: "Where is the nearest ATM?"):



  1. The ASR service processes the incoming audio stream continuously, returning each recognized word as soon as it is detected, without waiting for the end of the phrase.

    Sequence of partial results:

    a) «Where»

    b) «is»

    c) «the nearest»

    d) «ATM»
  2. Each new fragment is appended to the text accumulated so far:

    a) «Where»

    b) «Where is»

    c) «Where is the nearest»

    d) «Where is the nearest ATM»
  3. Every accumulated variant of the phrase is immediately passed to the NLU, which updates the list of candidate intents and their probabilities.

    For example:

    a) Input: «Where». Detected intents: none yet.

    b) Input: «Where is». Detected intents: «find place» 50%, «find address» 50%.

    c) Input: «Where is the nearest». Detected intents: «find place» 50%, «find address» 50%, slot «object» = «nearest».

    d) Input: «Where is the nearest ATM». Detected intents: «find place» 100%, slot «object» = «ATM».

  4. With each new word received as in step 1, we repeat the cycle and update:

    • the list of candidate intents;
    • their probabilities;
    • the filled slots, as in step 3.

    Intents that no longer match the accumulated phrase are discarded (their probability drops to 0%).

    Meanwhile, for the most probable intents, response texts can already be prepared, so that by the moment the user finishes speaking the answer is almost ready.
  5. As soon as we detect that the user has finished the message (by the pause in the input stream), we flush the response corresponding to the most probable detected intent into the output buffer. Better yet, to optimize for speed, keep in the output buffer not the text of the response but the audio fragment already received from the TTS, thereby accumulating the full audio of the reply in advance.
  6. We play the contents of the output buffer back to the user.
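The incremental scheme above can be sketched as follows. The intent names, keyword rules and responses are hypothetical, chosen only to show how probabilities are re-estimated on every partial result:

```python
# Sketch of incremental processing: toy NLU is re-run on every partial
# ASR result, and the response for the currently most probable intent
# is prepared before the user finishes speaking.

INTENT_KEYWORDS = {
    "find_atm": ["where", "nearest", "atm"],
    "find_branch": ["where", "nearest", "branch"],
}
RESPONSES = {
    "find_atm": "The nearest ATM is about a hundred meters to the southeast.",
    "find_branch": "The nearest branch is two blocks north of you.",
}

def score_intents(partial_phrase: str) -> dict:
    """Toy NLU: score each intent by the share of its keywords seen so far."""
    words = set(partial_phrase.lower().replace("?", "").split())
    scores = {}
    for intent, keywords in INTENT_KEYWORDS.items():
        matched = sum(1 for kw in keywords if kw in words)
        if matched:
            scores[intent] = matched / len(keywords)
    return scores

phrase = ""
output_buffer = ""
# Words arrive one by one from the streaming ASR.
for word in ["Where", "is", "the", "nearest", "ATM?"]:
    phrase = (phrase + " " + word).strip()   # step 2: accumulate the prefix
    scores = score_intents(phrase)           # step 3: re-score intents
    if scores:
        # Steps 4-5: pre-select (and, in a real system, pre-synthesize)
        # the response for the currently most probable intent.
        best = max(scores, key=scores.get)
        output_buffer = RESPONSES[best]

# Step 6: the moment the user falls silent, the buffer is ready to play.
print(output_buffer)
```

By the time the last word arrives, `find_atm` has matched all of its keywords and wins outright, so the answer can be voiced with almost no pause.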


Ways to improve the quality of the assistant's work



Let's look at what methods are available to further improve the quality of our voice virtual assistant:







  1. Keep listening while speaking. The ASR must keep processing the incoming stream even while the TTS is playing the response, so that the user can interrupt the assistant at any moment; as soon as meaningful incoming speech is detected, playback is stopped.
  2. "Warming up" the responses.

    While the user is still speaking, the response texts for the most probable intents can already be sent to the TTS, so that the finished audio is waiting in the output buffer.

    As a result, the "warmed-up" answer starts playing almost immediately after the person falls silent, and the perceived delay shrinks dramatically.

  3. Filter out filler words, hesitation sounds and self-corrections before passing the accumulated text to the NLU, so that they do not distort intent detection.

  4. Preserve the context of the dialogue, so that short follow-up phrases like "repeat" or "the second one" are interpreted relative to the previous turns.

  5. Accumulate and use data about the interlocutor, i.e. personalize the dialogue.

  6. Process everything online, word by word, rather than phrase by phrase: the earlier the system starts working on the request, the more of its latency is hidden inside the time the user spends speaking.


Business cases



Until now, we have considered only the technical aspects of implementing virtual voice assistants. But success does not always depend on technical perfection alone. Let's revisit the example already considered, "Where is the nearest ATM?", and see what is special about implementing it in a voice interface.



There is a rule that sales managers know well: "What cannot be sold over the phone should not be sold over the phone." For this very reason, an answer of the form "The nearest ATM is located at ..." is not informative for a person. If he knew the area well, i.e. knew the names of all the nearby streets and the house numbers, he would most likely already know where the nearest ATM is. Such an answer will most likely immediately provoke another question: "And where is the address you just named?" A much more informative answer would be: "The nearest ATM is about a hundred meters from you to the southeast", and better still, also send the person the location on Yandex or Google maps.
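Turning two coordinates into such a listener-friendly phrase is straightforward with the standard haversine-distance and initial-bearing formulas. The coordinates below are made up for illustration:

```python
import math

def distance_and_direction(lat1, lon1, lat2, lon2):
    """Distance in meters and 8-point compass direction from point 1 to point 2."""
    R = 6371000  # mean Earth radius, meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    # Haversine distance
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    dist = 2 * R * math.asin(math.sqrt(a))
    # Initial bearing, mapped onto an 8-point compass rose
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    bearing = (math.degrees(math.atan2(y, x)) + 360) % 360
    points = ["north", "northeast", "east", "southeast",
              "south", "southwest", "west", "northwest"]
    return dist, points[round(bearing / 45) % 8]

# Hypothetical coordinates: the user and an ATM roughly 100 m away.
dist, direction = distance_and_direction(55.7558, 37.6173, 55.7552, 37.6185)
answer = f"The nearest ATM is about {round(dist, -1):.0f} meters from you to the {direction}."
print(answer)
```

A real assistant would feed the actual geolocation and the branch database into such a function, and attach the map location as a separate message.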



The general rule here: if using the information requires transferring it to another perception channel, it is a poor candidate for direct implementation in a voice interface. The answer must be reformulated into a form convenient for listening.



For a number of services, implementation within a voice assistant is in fact the most successful solution. For example, if a person is in a stressful situation, it is usually hard for him to concentrate and quickly describe the problem in text chat; he will always prefer to explain everything by voice. This can be an important criterion when choosing business cases to implement in a virtual voice assistant.



The second obvious choice of cases for "voice" is situations where other channels are either legally restricted (for example, text messaging is forbidden while driving a car) or simply inconvenient (for example, during work or sports, when a person's hands are busy).



There are no limits to perfection



Voice is more convenient than any other interface when the user needs a very specific function to solve a very specific task. Why? It is very simple: in such a situation, waiting for a site to load, scrolling a page, searching through an application menu and pressing buttons is always less convenient than a quickly spoken voice command. Websites and applications are multifunctional, which is both their advantage and their disadvantage. A voice skill should be tailored to the function needed "here and now".



It is important to avoid situations where voice commands must be accompanied by additional actions in other interfaces. Otherwise the voice channel stops working: the eyes-free principle is violated if the user has to read something, and the hands-free principle if he still has to press something.



Another important recommendation: do not try to teach a person how to speak. He manages perfectly well without us, because language is an interface he already knows. An illustrative example of bad style: "To listen to this message again, say: Listen again." You and I do not talk like that in everyday life, do we? It is better to simply ask: "Will you listen to the message again or go to the next one?"



It is good practice, when implementing a voice assistant, to avoid open-ended questions altogether and to steer the interlocutor towards specific actions. This is especially valuable where the assistant acts as a navigator or a recommendation system. A voice assistant should not demand too much detailed information from a person at once; missing details can be clarified as the conversation progresses.



Finally, I would like to note that personalization is perhaps the main thing lacking in existing voice dialogue interfaces. Without it, a more or less lengthy dialogue is impossible. The assistant must collect data about the interlocutor, and structure and verify the information received. It is important not to lose the thread of the dialogue and to preserve and take into account the context of the conversation. Otherwise the assistant will only be able to handle short and fairly simple queries, which will never allow a truly lively dialogue between the voice assistant and the user.


