Open Source SOVA Dataset: Audio for Speech Recognition and Synthesis

Hello everyone! We are the Nanosemantics team. We recently launched the SOVA project, in which we are collecting a dataset for training neural networks and building AI-based virtual assistants.





We have prepared a large dataset for training speech recognition engines, and we want to share it so that companies can use it in their own countries to solve a variety of business problems. Data is the new oil, and the availability of labeled datasets is one of the main drivers of progress in machine-learning recognition systems. If you are interested in research and development in speech analytics, read on.



In 2019, Nanosemantics received a grant from the RVC Foundation, under which we are to prepare one of the largest open datasets in Russia by the end of 2022. This is a great opportunity for us to build a genuinely useful dataset. It will include 30,000 hours of audio recordings with text transcripts in three languages (Russian, English and Chinese), recorded by a large number of speakers. The dataset will be released publicly in stages, free of charge, so that developers around the world can use it to train neural networks, build their own AI virtual assistants and train speech recognition systems.














( ) โ€“ , - . , , , , , , , . . () Wikipedia


























:





  • -









  • Creative Commons Attribution (CC BY)
  • Creative Commons Zero (CC0)
  • WTFPL (Do What The Fuck You Want To Public License)










5.1. 1235 , .










 





. , . 20 . , โ€“ - .










. , 20 . - , - . , , ; , , ? . .










Voice-over recording software






VoicyBot, ยซยป . , , . , , . 





. , , โ€” , . Open Source : . : , , , . , , , . . 





YouTube





. Youtube (), . , , .





. , (FEFU) , .





, , Creative Commons โ€“ CC BY. .





YouTube โ€œ Creative Commonsโ€. API Youtube. 





EngAudiobooksOriginal

EngAudiobooksNoisy

RuAudiobooksDevices

RuDevices





Open Source dataset SOVA

โ€” , . .





CER: Character Error Rate.
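As a rough illustration, CER is commonly computed as the Levenshtein (edit) distance between the reference text and the recognized text, divided by the length of the reference in characters. The sketch below assumes that common definition. The function names `levenshtein` and `cer` are illustrative and are not taken from the SOVA code.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between the processed prefix of `ref`
    # and the first j characters of `hyp`.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from ref
                curr[j - 1] + 1,     # insert a character from hyp
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, `cer("hello", "hallo")` is 0.2: one substitution over five reference characters. Under this definition CER can exceed 1.0 when the hypothesis is much longer than the reference.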





CER โ€” 5.





, , 95% - โ€” .





, : 





Standard settings for all audio recordings

, , : -, .





.





: . , Youtube ( ), โ€”  . .









, , .





โ€”  forced alignment ยซยป , . , , , . , , , . ยซยป .   : NLab Speech ยซยป . -.





, ยซยป, . , - .









, โ€”  , . Voice Activity Detector โ€” , . : 30 100 . - , 100 10 . โ€”  , : .





: , , .





ยซยป, . , : , , .





/

. .





Common Voice. , . 7 335 60





Russian Speech Database (STC Russian). 1996-1998 89 . 5 . 15 1-3 . , 200 4000 EUR . . , 10-30 .





CSS10 Russian: Single Speaker Speech Dataset. CSS10 (A Collection of Single Speaker Speech Datasets for 10 Languages) 22 , LibriVox. CC0: Public Domain.





M-AILABS Speech Dataset. 46 , LibriVox. .





Russian LibriSpeech (RuLS). , LibriVox. 98 .





Russian Open Speech To Text (STT/ASR) Dataset, OpenSTT. , . 20000 ( 2,3 TB .wav). , , YouTube, , . . CC-BY-NC ( ).





, :





  • , OpenSTT, , ,





  • OpenSTT , . , .





  • OpenSTT : + .





, . , SOVA . , SOVA .





, ,  .





2021 SOVA Dataset 11,402 . 1,1 TB .wav. , .





Open Source CC-BY 4.0. , , .





SOVA Dataset GitHub.





, . .





2021 . 10000 , . , , Youtube .





, 2022 30000 .





SOVA Dataset โ€“ Open Source SOVA.ai: . . Open Source , , ยซ ยป. , , - Open Source .





. , SOVA Dataset , . 





, . , , , partnership@sova.ai.







