Make it talk. A Yandex talk

The standard speech recognition and text-to-speech libraries in iOS offer a lot of possibilities. From this talk by Roman Volkov you will learn how to teach your application to speak text aloud and how to customize the voice output with a minimal amount of code. Roman reviews the speech recognition API, its limitations and features, the lifecycle of a recognition request, and ways of working in offline mode. UX examples, workarounds for existing bugs, and the details of working with an audio session await you.





- Hello everyone, my name is Roman Volkov. Today we are going to talk about how to teach your mobile application to communicate with your users.



Before we start, a few words about me. Prior to iOS development, I built integration systems in the banking sector and analytical systems in the oil industry. I know firsthand what the PCI DSS standard is and, for example, how engineers figure out what is happening in a well during drilling based on temperature data alone.



Since 2016 I have been doing iOS development. I have experience with both freelancing and remote work, and I have taken part in launching several startups. I also built a branded application for Rolls-Royce.



In 2018, I joined the Prisma team, worked on the Prisma app, and participated in the development and launch of Lensa Photo Editor. In 2019, I moved to Yandex as an iOS developer. Since 2020, I have been leading the Yandex.Translator mobile development group. The Translator app is no longer just an application for working with text: it has many features such as photo translation, a dialogue mode, voice input, speech output, and more.



When I was just starting to dive into working with sound on iOS, I could not find any compact material that covered the audio session, speech synthesis, and speech recognition all at once. That is why I decided to give this talk.







The talk is in four parts. First, we will discuss what an audio session is, how to work with it correctly, and how it affects the behavior of your application. Next, we will move on to speech synthesis and see how you can speak text to users in a few lines of code right on the phone. Then we will switch to speech recognition. And in conclusion, we will see how all these capabilities can be offered to users offline and what peculiarities that involves.







There are quite a few use cases for speech synthesis and speech recognition. My favorite is converting other people's audio messages to text. I am glad that, for example, the Yandex.Messenger team has shipped such a feature. Hopefully, other messengers will catch up and build it too.



Let's move on to the first part of the talk: AVAudioSession.







An audio session is a layer between your application and the operating system; more precisely, between your application and the audio hardware: the speaker and the microphone. On iOS, watchOS, and tvOS, each app has a preconfigured default audio session, and this default varies from one OS to another.



Speaking specifically about iOS: by default, the audio session supports audio playback but prohibits any recording. Second, if the silent switch is set to silent, all sounds inside your application are muted. And third, locking the device stops all audio playback inside your application.







Setting up an audio session consists of three parts: choosing a category, a mode, and additional options. We will look at each of them separately.



Let's start with the category. A category is a set of settings that defines the basic behavior of an audio session: a set of parameters that lets the operating system match the behavior to, let's say, the name of that category as closely as possible. Apple therefore recommends choosing the available category that is closest to what your application does. There are currently six categories available in iOS 13. There is also a seventh one, but it is marked as deprecated and should not be used.



In this talk we will look at three categories: playback, record, and playAndRecord. The mode lets you refine the behavior of the chosen category, and some modes are only available for certain categories.







For example, on the slide you see the moviePlayback mode, and it can only be set for the Playback category.



Setting the moviePlayback mode lets the audio session automatically improve playback quality for the built-in speakers and for headphones. In this talk we will only use the default mode. Note, however, that if you use an incompatible category-and-mode pair, the default mode will be used instead.







The third part is options: fine-grained settings for the audio session. For example, you can customize how audio from your app mixes with audio from other apps, or set up proper deactivation of the audio session so that other apps know your app is done with audio.
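A minimal sketch of how these three parts come together in code, using the playback category and the moviePlayback mode mentioned above; error handling is reduced to a simple print for brevity:

```swift
import AVFoundation

do {
    // Choose the category, mode, and options in a single call...
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playback, mode: .moviePlayback, options: [])
    // ...and then activate the session.
    try session.setActive(true)
} catch {
    print("Failed to configure the audio session: \(error)")
}
```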







First, let's look at the category for playback, namely playback. It is one of the playback-only categories. When it is set, activating the audio session interrupts other audio that is already playing, for example from other applications.



It is also important that the audio will play even if the mute switch is set to mute.



This category can also play audio in the background, but for that your application must have the "Audio, AirPlay, and Picture in Picture" background mode enabled.



Consider the two options shown on the slide. The first is mixWithOthers: if you activate the audio session with this option, the audio inside your application is mixed with the audio that is already playing, for example music, at the same volume level. If you want your audio to dominate over the current playback, you can use the duckOthers option instead: it lowers the volume of the background audio and restores it when the audio inside your application has finished.



For example, you can observe this in navigation applications: for a route announcement, whatever you are listening to is turned down, the announcement is played, and then everything returns to its original state.







Let's consider configuring an audio session for recognition from a microphone. The record category silences all other playing audio while an audio session with this category is active in the application. However, record cannot silence system sounds such as calls or alarms, that is, standard sounds that have a higher priority.



You can also add the allowBluetoothA2DP option, which lets you use headsets like AirPods to record sound from the microphone and to play audio through them. There is an older option for this, simply called allowBluetooth, but it degrades the sound quality a lot.



We used to use the old option, and users complained about the quality of the recorded and played-back audio inside the application. We switched the option, and everything got better.



If you want to use both speech recognition and speech synthesis at the same time, use the playAndRecord category. Then, within the activated audio session, you can use both recording and playback of sound.
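A sketch of what such a configuration might look like, combining playAndRecord with the allowBluetoothA2DP option mentioned above (duckOthers is added here only as an illustration of mixing it with the options discussed earlier):

```swift
import AVFoundation

do {
    let session = AVAudioSession.sharedInstance()
    // playAndRecord allows both microphone recording and playback within one session;
    // allowBluetoothA2DP keeps good audio quality on headsets such as AirPods.
    try session.setCategory(.playAndRecord,
                            mode: .default,
                            options: [.allowBluetoothA2DP, .duckOthers])
    try session.setActive(true)
} catch {
    print("Failed to configure the audio session: \(error)")
}
```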



The notifyOthersOnDeactivation option should be considered separately. It is passed to the audio session's setActive method. Why is it so important?







If the audio session is deactivated with this option, other applications receive an AVAudioSessionInterruptionNotification with the AVAudioSessionInterruptionTypeEnded parameter, indicating that the interruption of their audio session has ended and they can resume the audio they were playing before being interrupted.



This scenario applies if you use the playback category without the mixWithOthers option, because otherwise you do not interrupt the other application's audio at all: your audio is simply mixed with it.



Using this option and handling the notification correctly lets you give users a comfortable experience when working with your application.
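A sketch of the deactivation call with this option (setActive can throw, for example if audio I/O is still in progress):

```swift
import AVFoundation

do {
    // Deactivate the session and let other apps know they may resume their audio.
    try AVAudioSession.sharedInstance().setActive(false, options: [.notifyOthersOnDeactivation])
} catch {
    print("Failed to deactivate the audio session: \(error)")
}
```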







On the slide you can see an example of how to properly handle the notification that your application's audio session was interrupted by another app, and the situation when the interruption ended. That is, we subscribe to a single notification that comes in two types: when the interruption has just begun and when it has ended.



In the first case, you can save the state, and in the second, you can continue playing the sound that was interrupted by another application.
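Since the slide itself is not reproduced in the text, here is a minimal sketch of such handling; pausePlayback() and resumePlayback() stand for hypothetical methods of your own player:

```swift
import AVFoundation

final class InterruptionObserver {
    init() {
        NotificationCenter.default.addObserver(
            self,
            selector: #selector(handleInterruption(_:)),
            name: AVAudioSession.interruptionNotification,
            object: AVAudioSession.sharedInstance()
        )
    }

    @objc private func handleInterruption(_ notification: Notification) {
        guard
            let info = notification.userInfo,
            let typeValue = info[AVAudioSessionInterruptionTypeKey] as? UInt,
            let type = AVAudioSession.InterruptionType(rawValue: typeValue)
        else { return }

        switch type {
        case .began:
            // Another app interrupted our audio: save state and pause.
            // pausePlayback()
            break
        case .ended:
            // The interruption ended; resume if the system suggests it.
            let optionsValue = info[AVAudioSessionInterruptionOptionsKey] as? UInt ?? 0
            let options = AVAudioSession.InterruptionOptions(rawValue: optionsValue)
            if options.contains(.shouldResume) {
                // resumePlayback()
            }
        @unknown default:
            break
        }
    }
}
```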



Here's an example of how this might work:



The video will play from the moment where the example is demonstrated.



In this example, music was playing in another application, namely VLC; then I started speech synthesis inside our application. The music was interrupted, the synthesized speech played, and then the music automatically resumed.



I'd like to point out that not all applications correctly handle the situation when their sound is interrupted. For example, some popular instant messengers do not resume audio playback.







Let's summarize. We have gone over how the audio session works, examined how to configure it to the needs of your application, and learned how to activate and deactivate it in a way that is convenient for the user.



Moving on: speech synthesis.







The slide shows a diagram of the classes involved in the speech synthesis process. The main classes are AVSpeechSynthesizer, AVSpeechUtterance, and AVSpeechSynthesisVoice with its settings.



Separately, I will note that there is an AVSpeechSynthesizerDelegate that lets you track the life cycle of the entire request. Since speaking the text means playing audio, the previously discussed AVAudioSession is an implicit dependency here.



You can make a synthesis request without configuring the audio session, but for any production application it is important to understand how to set it up; we covered that earlier.







Here is the shortest example of how to quickly make a speech synthesis request. You create an AVSpeechUtterance object, specifying the text you want to speak and the desired voice and language. If you do not specify a language locale when creating the voice, your phone's default locale is used. We will talk about choosing voices and working with them on the next slides.



Then you create an AVSpeechSynthesizer object and call the speak method. That's it: the text is synthesized and played, and you hear the result.
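A minimal sketch of that request (the text and the "ru-RU" locale are just example values):

```swift
import AVFoundation

// Keep a strong reference to the synthesizer, otherwise it may be deallocated mid-speech.
let synthesizer = AVSpeechSynthesizer()

let utterance = AVSpeechUtterance(string: "Привет, мир!")
utterance.voice = AVSpeechSynthesisVoice(language: "ru-RU")   // if nil, the default locale's voice is used

synthesizer.speak(utterance)
```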



But in fact, this is just the beginning. Speech synthesis has a lot more possibilities, which we will now talk about.







First, you can set the playback rate. The rate is specified as a real number between zero and one. Values of the rate property from 0 to 0.5 map to actual speeds from 0x to 1x.



Values of the rate property from 0.5 to 1 map proportionally to speeds from 1x to 4x.



An example of how you can work with speed.







In AVFoundation, there is a constant AVSpeechUtteranceDefaultSpeechRate, which equals 0.5 and corresponds to normal playback speed.



To get half the normal speed you specify 0.25; if you specify 0.75, the speed is 2.5 times the normal one. For convenience, there are also constants for the minimum and maximum rates.
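A short sketch of these rate values applied to an utterance; the constants come from AVFoundation:

```swift
import AVFoundation

let utterance = AVSpeechUtterance(string: "Speech rate example")

utterance.rate = AVSpeechUtteranceDefaultSpeechRate      // 0.5, normal speed
// utterance.rate = 0.25                                 // about half the normal speed
// utterance.rate = 0.75                                 // about 2.5x the normal speed
// utterance.rate = AVSpeechUtteranceMinimumSpeechRate   // slowest possible
// utterance.rate = AVSpeechUtteranceMaximumSpeechRate   // fastest possible
```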



I will now play a few examples:



The video will play from the moment the example is shown.

This was an example of how the Macintosh spoke for the first time in its own voice at an Apple presentation, and that was the normal synthesized speech speed.





This is 2 times slower.





This is 2.5 times faster.



Separately, on the last lines of the slide, I highlighted the preUtteranceDelay and postUtteranceDelay properties. These are the delay before the audio starts playing and the delay after it has finished. They are convenient when your app's audio is mixed with other apps' audio with ducking: the other app's volume goes down, then after a short pause your audio plays, the system waits a little longer, and only after that does the other app's volume return to its original level.
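A sketch of these two properties for the navigation-style scenario described above; the 0.3-second values are arbitrary:

```swift
import AVFoundation

let utterance = AVSpeechUtterance(string: "In 300 meters, turn right")
utterance.preUtteranceDelay = 0.3    // pause after other audio is ducked, before speech starts
utterance.postUtteranceDelay = 0.3   // pause after speech ends, before other audio is restored
```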







Let's look at the next parameter: voice selection. Voices for speech synthesis differ mainly by locale, language, and quality. AVFoundation offers several ways to create or obtain an AVSpeechSynthesisVoice object. The first is by voice identifier: each voice has its own unique identifier, and the list of all available voices can be obtained from the static speechVoices() method. Retrieving this list has some peculiarities, which we will discuss later.



It is worth noting that if you pass an invalid identifier to the initializer, it returns nil.



The second option is to get a voice by language or locale code. I also want to note that Apple says Siri voices are not available, but this is not entirely true: on some devices we managed to obtain the identifiers of some of the voices used by Siri. Perhaps this is a bug.



Voices come in two quality levels: default and enhanced. For some voices you can download an enhanced version; we will discuss how to download the voices you need in the last section.







Here is an example of how you can select a specific voice: first by a specific identifier, second by a string with the language code, and third by a specific locale.
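A sketch of the three ways to pick a voice; the Milena identifier is only an example and may not exist on every device:

```swift
import AVFoundation

// 1. By a specific identifier (returns nil if this identifier is unknown on the device).
let byIdentifier = AVSpeechSynthesisVoice(identifier: "com.apple.ttsbundle.Milena-compact")

// 2. By a language code: the preferred voice for that language is used.
let byLanguage = AVSpeechSynthesisVoice(language: "ru")

// 3. By a specific locale code.
let byLocale = AVSpeechSynthesisVoice(language: "en-GB")
```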



Now I want to play two examples of the same text voiced with different locales.



The video will be played from the moment where the example is demonstrated.

The second option, it seems to me, is closer to the Russian pronunciation.







Voice gender also appeared in iOS 13. The property is only available on iOS 13 and above, and it is only populated for voices that were added in iOS 13. Gender is exposed as an enum with three cases: female, male, and unspecified.



In our application, you can choose the gender of the voice that reads the text. For the old voices, for which the system returns unspecified, we built and maintain our own list in the app that says which voices we consider male and which female.
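A sketch of filtering voices by gender on iOS 13, assuming you fall back to your own mapping for voices that report unspecified:

```swift
import AVFoundation

if #available(iOS 13.0, *) {
    let femaleRussianVoices = AVSpeechSynthesisVoice.speechVoices().filter { voice in
        voice.language.hasPrefix("ru") && voice.gender == .female
    }
    // Voices with gender == .unspecified (mostly the older ones)
    // would be classified with a hand-maintained list instead.
    print(femaleRussianVoices.map { $0.name })
}
```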







In iOS 13.1, the list of voices may come back empty on the first call. The solution: re-query the entire list every few seconds. As soon as it comes back non-empty, we consider it the actual list of voices.



This bug has been fixed in subsequent iOS versions, but don't be surprised if you see this in your apps.
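A minimal sketch of such a workaround; the one-second retry interval is an arbitrary choice:

```swift
import AVFoundation

// Poll speechVoices() until it returns a non-empty list (works around the iOS 13.1 bug).
func fetchVoices(retryInterval: TimeInterval = 1.0,
                 completion: @escaping ([AVSpeechSynthesisVoice]) -> Void) {
    let voices = AVSpeechSynthesisVoice.speechVoices()
    guard voices.isEmpty else {
        completion(voices)
        return
    }
    DispatchQueue.main.asyncAfter(deadline: .now() + retryInterval) {
        fetchVoices(retryInterval: retryInterval, completion: completion)
    }
}
```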



An interesting point I came across while studying the documentation: there is a constant AVSpeechSynthesisVoiceIdentifierAlex. It is a curious identifier because, first, not every device can create a voice with it; second, it is not clear to me why it is singled out; and third, if you do get a voice with this identifier, that voice has its own distinct class.



Studying the framework headers did not give me anything useful either. If you know anything about this identifier, why it is needed and why it appeared, please tell me; I have not been able to find an answer.







Here you can see how we built the interface for choosing a voice by locale and gender, and how we let the user set the speech rate for a particular language.







I will briefly talk about phonetic transcription, a notation based on the Latin script. When you hand text over for synthesis, you can specify the pronunciation of specific words inside it. On iOS this is done through NSAttributedString with a special attribute key. You can generate such pronunciations directly on an iOS device in the Accessibility settings, but for large volumes that seems very inconvenient to me, and you can automate the generation of phonetic transcriptions in other ways.



For example, there is a repository for English with a large dictionary mapping words to their pronunciations.







The slide shows an example of how you can override the pronunciation of a particular word for one locale. In this case it is | təˈmɑːtəʊ |, "tomato".
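A sketch of overriding pronunciation via an attributed string, using the transcription from the slide; in practice the value Apple expects is the notation produced by the pronunciation editor in the Accessibility settings, so treat the literal below as illustrative:

```swift
import AVFoundation

let text = NSMutableAttributedString(string: "I bought a tomato")
let range = (text.string as NSString).range(of: "tomato")

// Attach the phonetic transcription to the word "tomato".
text.addAttribute(NSAttributedString.Key(AVSpeechSynthesisIPANotationAttribute),
                  value: "təˈmɑːtəʊ",
                  range: range)

let utterance = AVSpeechUtterance(attributedString: text)
utterance.voice = AVSpeechSynthesisVoice(language: "en-GB")
```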



The video will be played from the moment where the example is shown.

You have just heard the version with the pronunciation attribute set and the version without it.







To sum up, we examined how to create a speech synthesis request, learned to work with voices, looked at a workaround for one of the bugs you may come across, and saw how phonetic transcription can be used.







Let's move on to speech recognition. It is presented in iOS as a framework called Speech, and it enables speech recognition right on your devices.



Approximately 50 languages and dialects are supported, starting with iOS 10. Speech recognition usually requires an internet connection, but on some devices and for some languages recognition can work offline. We will talk about this in the fourth part of the talk.



Speech recognition works both from the microphone and from an audio file. If you want to let the user recognize speech from the microphone, the user must grant two permissions: the first for access to the microphone, the second for the fact that their speech will be sent to Apple's servers for recognition.



Unfortunately, even where only offline recognition is used, you cannot avoid requesting this permission; it must be requested anyway.
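A sketch of requesting both permissions; the corresponding NSSpeechRecognitionUsageDescription and NSMicrophoneUsageDescription keys must also be present in Info.plist:

```swift
import Speech
import AVFoundation

// Ask for speech recognition first (this covers sending audio to Apple's servers)...
SFSpeechRecognizer.requestAuthorization { status in
    guard status == .authorized else { return }
    // ...then for access to the microphone.
    AVAudioSession.sharedInstance().requestRecordPermission { granted in
        print("Microphone allowed:", granted)
    }
}
```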







This list is taken from Apple's website: the languages and locales available for speech recognition. In fact, it is the list of languages and locales available for dictation on the standard keyboard, and under the hood the Speech framework API relies on the standard keyboard's dictation implementation.







Speech recognition is free for us as developers, but it has usage limits. The first is a per-device limit on the number of requests per day. The second is an overall limit for the application. And third, you can recognize at most one minute of audio. The only exception is offline mode, where you can recognize long recorded audio messages.



Apple, of course, does not give specific numbers for these limits, and as was said in the WWDC session, you need to be prepared to handle errors and write to Apple if you hit the limits often. We do not have such a problem: for Russian we use SpeechKit as the speech recognition engine, and most of our users are Russian-speaking, so we have not run into the limits.



Also, be sure to think about privacy. Do not send passwords, credit card numbers, or any other sensitive or private information for recognition.







On the slide you can see a rough diagram of the classes involved in the speech recognition process. As with synthesis, recognition means working with the audio hardware, so AVAudioSession is a dependency here as well.







Locale support. To get the set of all supported locales, use the static supportedLocales() method. Support for a particular locale does not by itself guarantee that speech recognition is available: for example, a connection to Apple's servers may be required.



The set of locales supported for recognition matches the list of dictation locales for the iOS keyboard; the complete list is published on Apple's website. To check that a given locale can be handled right now, use the isAvailable property.
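A sketch of both checks; the ru-RU locale is just an example:

```swift
import Speech

// All locales the framework knows about (mirrors the keyboard dictation locales).
let supported = SFSpeechRecognizer.supportedLocales()

// Whether recognition for a specific locale can actually run right now.
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "ru-RU"))
let canRecognizeNow = recognizer?.isAvailable ?? false
print(supported.count, canRecognizeNow)
```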







Unlike synthesis, speech recognition on iOS has no notion of a priority locale for each language. So if you simply take the first locale for a given language from the list of all locales, it may turn out to be far from the most popular one. That is why, for some languages in Translator, we assigned a priority locale ourselves.



For example, for English we use en-US. The first time a user tries to recognize something in English, we use the American locale.







A recognition request from a file. Everything is simple here: you need a URL to the file, then you create an SFSpeechRecognizer object with the locale you want to use, check that recognition is currently available, create an SFSpeechURLRecognitionRequest using the initializer that takes the file URL, and start the recognition task.



As a result, you will receive either a recognition error or a result. The result has an isFinal property, which indicates that this result is final and can be used further.
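A sketch of such a request under the assumptions above (an en-US locale and a caller-supplied file URL):

```swift
import Speech

func recognizeFile(at url: URL) {
    guard
        let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
        recognizer.isAvailable
    else { return }

    let request = SFSpeechURLRecognitionRequest(url: url)
    recognizer.recognitionTask(with: request) { result, error in
        if let error = error {
            print("Recognition failed: \(error)")
        } else if let result = result, result.isFinal {
            // isFinal means this transcription will not change anymore.
            print(result.bestTranscription.formattedString)
        }
    }
}
```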







Here is a slightly more complex example: a recognition request from the microphone. For this we also need an AVAudioEngine object, which is responsible for working with the microphone; we will not go into the details of how it works. You set the category you need, either .record or .playAndRecord, activate the audio session, configure the audio engine, and subscribe to audio buffers from the microphone. You append them to the recognition request, and when you are done recognizing, you stop capturing from the microphone.
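A sketch of those steps gathered into one class; error handling and UI updates are omitted, and the en-US locale is an example:

```swift
import Speech
import AVFoundation

final class MicrophoneRecognizer {
    private let audioEngine = AVAudioEngine()
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    func start() throws {
        // 1. Configure and activate the audio session for recording.
        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.record, mode: .default)
        try session.setActive(true)

        // 2. Create the request; partial results let us show text while the user speaks.
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        self.request = request

        // 3. Feed microphone buffers from AVAudioEngine into the request.
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }
        audioEngine.prepare()
        try audioEngine.start()

        // 4. Start the recognition task and consume partial and final results.
        task = recognizer?.recognitionTask(with: request) { [weak self] result, error in
            if let result = result {
                print(result.bestTranscription.formattedString)
            }
            if error != nil || result?.isFinal == true {
                self?.stop()
            }
        }
    }

    func stop() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
    }
}
```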



It is worth noting that the shouldReportPartialResults property, which is responsible for returning intermediate recognition results, is set to true. Let's compare what an application might look like with and without the shouldReportPartialResults flag.



The video will play from the moment where the example is demonstrated.

In the example on the left, I kept only the microphone indicator reacting to sound, to changes in volume. You can see that I am saying something, but until I finish speaking you see no text at all. It takes a long time for the user to get the result of what they dictated.



If you set shouldReportPartialResults to true and handle it correctly, the user sees what they are saying as they speak. This is very convenient and is the right way to build a dictation interface.







Here is an example of how we manage the audio session. Inside Translator we rely not only on our own audio code but also on other frameworks that can touch the audio session.



We wrote a controller that, first, checks that the settings and categories are the ones we need, and second, avoids constantly activating and deactivating the audio session.



Before we built the dialogue mode, we activated and deactivated the audio session ourselves for voice input and speech output. When we started building the dialogue mode, it turned out that this switching on and off adds extra latency between the moment you say something and the moment you hear the spoken result.







For a speech recognition request you can specify a hint: the type of speech being recognized. It can be unspecified, dictation, search, or a short confirmation. More often than not, if your user is going to say something long, dictation is the better choice.
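A sketch of setting the hint on a request:

```swift
import Speech

let request = SFSpeechAudioBufferRecognitionRequest()
// Possible values: .unspecified, .dictation, .search, .confirmation.
request.taskHint = .dictation
```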







Starting with iOS 13, voice analytics is available to us. The slide shows the parameters you can obtain along with the recognized speech: you receive not only what the user said, but also characteristics of the voice they said it with.







We will not dwell on this for long. Here is an example of how you can get the analytics together with the recognized text.
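A minimal sketch of reading the analytics from a recognition result on iOS 13 and later:

```swift
import Speech

func logVoiceAnalytics(from result: SFSpeechRecognitionResult) {
    guard #available(iOS 13.0, *) else { return }
    for segment in result.bestTranscription.segments {
        guard let analytics = segment.voiceAnalytics else { continue }
        // Per-frame acoustic features of the speaker's voice.
        print("pitch:",   analytics.pitch.acousticFeatureValuePerFrame)
        print("jitter:",  analytics.jitter.acousticFeatureValuePerFrame)
        print("shimmer:", analytics.shimmer.acousticFeatureValuePerFrame)
        print("voicing:", analytics.voicing.acousticFeatureValuePerFrame)
    }
}
```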







To sum up, we studied the capabilities of the Speech framework, learned how to give recognition hints, and took a quick look at voice analytics.



And last but not least: working offline. The first thing I want to talk about is the list of offline languages for speech synthesis. Nowhere in the documentation did I find a mention of how to explicitly download voices for offline use. Both the talks and the documentation say these voices can be downloaded, but not where.



I dug around in the system and found that if you go to Settings, then Accessibility, then Spoken Content and Voices, you will see, first, the list of languages for which it is available, and second, by opening a specific language you can download new voices.



And that list exactly matches what AVSpeechSynthesisVoice.speechVoices() returns inside the application. This means you can tell your users that they can download the languages they need to use text-to-speech offline.
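A sketch of inspecting which voices are currently installed for a language and whether an enhanced version is among them; Russian is used as an example:

```swift
import AVFoundation

let russianVoices = AVSpeechSynthesisVoice.speechVoices().filter { $0.language.hasPrefix("ru") }
for voice in russianVoices {
    // quality is .enhanced for the downloadable higher-quality versions.
    print(voice.name, voice.identifier, voice.quality == .enhanced ? "enhanced" : "default")
}
```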







The list of offline languages for recognition. It is not explicitly stated anywhere in the documentation, but judging by various forums and by what we have run into ourselves, this is the list of languages and locales that can work offline, without internet access.



It should be noted that offline recognition is available on devices with an A9 chip or newer.







Now comes the fun part: offline language packs for speech recognition. Unlike synthesis, there is generally no way to download them explicitly. If you add a language to the standard keyboard, an offline package may be downloaded for it; unfortunately, this is not deterministic. Go to Settings > General > Keyboard > Dictation. For example, I added Spanish. After that, a small hint appears under Dictation saying that dictation may be available for these languages, and Spanish showed up there.



Then I went to our application, turned off the internet, and to my delight offline recognition in Spanish worked.



Unfortunately, you can only influence this indirectly: the only way is to add the language to the standard keyboard, and even that does not guarantee that the offline recognition package will be downloaded.







On iOS, even if your phone has internet access, you can force speech recognition to run entirely on the device, provided on-device recognition is available, of course.



There is a supportsOnDeviceRecognition property, available since iOS 13. But this property does not work correctly; a screenshot of the bug is at the bottom right. The bug was only fixed in iOS 13.2: the property always returns false on the first request, and according to Apple it returns the correct value a few seconds later.



Moreover, the property may return false while setting the requiresOnDeviceRecognition flag to true still works: recognition runs entirely on the device even though the check returns false.



There are several possible solutions. First, you can offer offline recognition only on iOS 13.2 and later. Second, you can re-query this property after a certain number of seconds and update the user interface. And third, you can forget about the property altogether: try to recognize speech offline and, in case of an error, simply show it to the user.
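A sketch of the third approach, forcing on-device recognition and simply surfacing any error; the es-ES locale is an example:

```swift
import Speech

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "es-ES"))
let request = SFSpeechAudioBufferRecognitionRequest()

if #available(iOS 13.0, *) {
    // Don't rely on supportsOnDeviceRecognition (it may wrongly return false before iOS 13.2);
    // just require on-device recognition and show the error to the user if recognition fails.
    request.requiresOnDeviceRecognition = true
}

// Audio buffers are then appended and the task is started
// exactly as in the microphone example earlier.
```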







We looked at how you can explicitly download offline speech synthesis packages, and found a way to try forcing iOS to download offline speech recognition packages.



Now you know how to quickly add speech synthesis and recognition to your applications. That's all I have; thank you for your attention.


