How the idea came about
It all started with a post I saw about Maxine, NVIDIA's new AI platform for enhancing video communications. One of its features is simultaneous translation in the form of subtitles, implemented with Jarvis, NVIDIA's framework for multimodal conversational AI services with real-time performance on GPUs. It is this concept of simultaneous interpretation that forms the basis of our audio and video communication platform. Since the platform is new, it needs features that set it apart from similar ones, so we decided to give those subtitles a voice: we build a voice profile for each user and synthesize speech that takes into account the tone and timbre of the person speaking.
Speech-to-text, or speech recognition
Is it better to use Google, Yandex or Mozilla?
Compared with Yandex, Google has higher recognition accuracy. We ran 5 test voice messages (3 in English and 2 in Russian) through both APIs: Google recognized 100% of them correctly (5/5), Yandex 60% (3/5). Google supports 125 languages; Yandex supports 3.
The advantage of Mozilla DeepSpeech is its recognition accuracy: 92.5%, versus roughly 94.2% for a human listener; accordingly, it recognized 100% of our test voice messages (5/5). Another advantage is that the engine is open source, unlike Google's and Yandex's. Its disadvantage is the number of supported languages: only English, Russian, and French.
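For reference, offline recognition with DeepSpeech's Python bindings looks roughly like this (a minimal sketch; the model and scorer file names refer to the pretrained files shipped with a DeepSpeech release):

```python
import wave

import numpy as np
import deepspeech

# Placeholder paths: download a pretrained model and scorer
# from a DeepSpeech release for the target language.
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16-bit mono PCM at 16 kHz.
with wave.open("message.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))
```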
In the end, the choice fell on Google Speech-to-Text for its balance between the number of supported languages and recognition accuracy.
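This is roughly how a single test message is recognized with the Google Cloud Speech-to-Text Python client (a minimal sketch; the file name and language code are example values):

```python
from google.cloud import speech

client = speech.SpeechClient()

with open("message.wav", "rb") as f:
    content = f.read()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="ru-RU",  # example; "en-US" for the English messages
)
response = client.recognize(
    config=config, audio=speech.RecognitionAudio(content=content)
)

for result in response.results:
    print(result.alternatives[0].transcript)
```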
Text translation
The first thing that comes to mind for this task is a ready-made API from Google or Yandex. The first problem we encountered was translation inaccuracy, especially with idioms. Take, for example, the Russian sentence «В Китае людей видимо-невидимо», where the idiom «видимо-невидимо» means "a great many" (literally, "visibly-invisibly"). Yandex Translator rendered it as "People in China are apparently invisible", while Google Translator produced "There are a lot of people in China"; in this case Google did better.
There is currently no panacea for this problem. The main task for these translators today is to teach the algorithm to understand the meaning of a sentence or text: once the algorithm grasps the meaning, translation quality improves considerably.
Translating a set of business-related sentences through Google Translate and Yandex Translator showed that Google handles them more competently, so we will use Google Translate.
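For completeness, a translation call with the Cloud Translation Python client looks roughly like this (a minimal sketch using the idiom example from above):

```python
from google.cloud import translate_v2 as translate

client = translate.Client()

result = client.translate(
    "В Китае людей видимо-невидимо",  # the idiomatic example discussed above
    source_language="ru",
    target_language="en",
)
print(result["translatedText"])  # e.g. "There are a lot of people in China"
```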
Analyzing and obtaining a voice profile
To obtain a voice profile, we need to collect a dataset, and since the task is to synthesize the translated text in the speaker's own voice, we need a dataset from each user. The user reads a specialized text containing the required set of letter combinations, syntactic constructions, and punctuation marks. Reading the text takes about 15 minutes, which gives us a sufficient amount of information about the frequency and intonation characteristics of each user. The reading can be repeated to improve the final results.
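We have not fixed a specific toolchain for this analysis yet; as one possible sketch, the pitch and timbre statistics could be extracted with librosa (the file name and the choice of features here are our assumptions, not a settled design):

```python
import librosa
import numpy as np

# Load the user's recorded reading, resampled to 16 kHz mono.
audio, sr = librosa.load("user_reading.wav", sr=16000)

# Fundamental frequency track (intonation) via the pYIN pitch tracker.
f0, voiced_flag, voiced_prob = librosa.pyin(
    audio, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Timbre features: 13 MFCCs summarized over the whole recording.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)

voice_profile = {
    "f0_mean": float(np.nanmean(f0)),  # average pitch
    "f0_std": float(np.nanstd(f0)),    # pitch variability (intonation range)
    "mfcc_mean": mfcc.mean(axis=1),    # coarse timbre summary
}
```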
Speech synthesis that takes the voice profile into account
Synthesizing a person's speech in a language they have never spoken is not an easy task. To do this, we first need to collect a primary dataset with the help of bilingual speakers, who read the specialized text, then a similar text in the other language, and then additional texts to expand the dataset. Based on this training data and the relationships identified in it, users' speech in another language will later be generated. Existing automated speech synthesis solutions for various languages will also help here, since collecting a completely independent dataset of the required scale seems neither efficient nor realistic.
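As a rough illustration of the idea, not a final architecture, a text-to-spectrogram model can be conditioned on a fixed-size speaker embedding derived from the voice profile, so the same embedding can drive synthesis in either language (a toy PyTorch sketch; all module names are placeholders):

```python
import torch
from torch import nn

class CrossLingualTTS(nn.Module):
    """Toy sketch: predict mel frames from text, conditioned on a speaker embedding."""

    def __init__(self, n_symbols: int, emb_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.text_encoder = nn.Embedding(n_symbols, emb_dim)
        self.speaker_proj = nn.Linear(emb_dim, emb_dim)
        self.decoder = nn.GRU(emb_dim, n_mels, batch_first=True)

    def forward(self, text_ids: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        x = self.text_encoder(text_ids)                     # (batch, time, emb_dim)
        x = x + self.speaker_proj(speaker_emb)[:, None, :]  # inject the voice profile
        mel, _ = self.decoder(x)                            # (batch, time, n_mels)
        return mel

model = CrossLingualTTS(n_symbols=100)
text_ids = torch.randint(0, 100, (1, 20))  # a dummy 20-symbol phrase
speaker_emb = torch.randn(1, 256)          # would come from the user's voice profile
mel = model(text_ids, speaker_emb)         # (1, 20, 80) mel frames
```

Training such a model on the bilingual dataset is what would teach it that the same embedding should sound like the same speaker in both languages.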
Conclusion
Our main task at the moment is to combine the voice profile with speech synthesis. Transferring a voice profile to another language is not easy: we need to train a neural network to do it given only two datasets in different languages.
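Putting it all together, the target pipeline is recognition, then translation, then profile-conditioned synthesis. The helpers below are hypothetical stand-ins for the components sketched in the previous sections:

```python
import numpy as np

# Hypothetical helper names; each would wrap one of the components above.
def recognize(wav_path: str, language_code: str) -> str: ...
def translate_text(text: str, src: str, dst: str) -> str: ...
def synthesize(text: str, language: str, speaker_emb: np.ndarray) -> np.ndarray: ...

def translate_speech(wav_path: str, src_lang: str, dst_lang: str,
                     speaker_emb: np.ndarray) -> np.ndarray:
    text = recognize(wav_path, language_code=src_lang)     # speech-to-text
    translated = translate_text(text, src_lang, dst_lang)  # text translation
    return synthesize(translated, dst_lang, speaker_emb)   # voice-profile synthesis
```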
As the project develops, we will publish posts about more specific tasks and the ways we solve them.