We've made our public speech synthesis even better








Update: we forgot the links to the repository (https://github.com/snakers4/silero-models#text-to-speech) and to the colab with examples (https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb).







We were very glad that Habr liked our last article. We received a lot of feedback, both positive and negative. In that article we also made a number of promises about developing our synthesis further.







We have made significant progress on these points, but the ultimate release with all the new features and speakers may take a relatively long time, and I would not want to go radio silent until then. In this article we answer the criticism, both fair and not so fair, and share the good news about the development of our synthesis.







In short:







  • We made our vocoder 4x faster;
  • We made model packaging more convenient;
  • We made a multi-speaker / multilingual model and "forced" the speakers to speak "foreign" languages;
  • ;
  • 15 — 1 ( 3-7 ) 5 ( ). ;
  • , . (, , , , ). — ;
  • , , - 5-10 , ;




, ,



. . . — ( ).







, ( — ). … . .







warning



, ( ), . ( , ).







(, , , ), :







  • ;
  • ( , , , );
  • ( 2 - 1-2);




silero-vad



, , silero-models



. , :







  • torch.hub



    , (omegaconf



    yaml- torchaudio



    ). PyTorch. , , ( , c " "). colab, standalone . - # Minimal Example to Run Locally



    ;
  • , , . ;


, ( - ), PyTorch 1.9. , .







torch.hub



:







import torch

language = 'ru'
speaker = 'kseniya_v2'
sample_rate = 16000
device = torch.device('cpu')

# Download the model via torch.hub (or reuse the cached copy)
model, example_text = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                     model='silero_tts',
                                     language=language,
                                     speaker=speaker)
model.to(device)  # gpu or cpu

# Synthesize audio for each input text
audio = model.apply_tts(texts=[example_text],
                        sample_rate=sample_rate)
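To listen to the result of `apply_tts`, the samples have to be written to a file. The sketch below does this with the Python standard library only. It assumes the audio is a one-dimensional sequence of floats in the range [-1, 1]; that is an assumption about the model output, not something the article states. The helper name `write_wav` and the file name are hypothetical, and a generated tone stands in for real synthesized speech.

```python
import math
import struct
import wave


def write_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM (hypothetical helper)."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack('<h', int(round(s * 32767)))
    with wave.open(path, 'wb') as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(samples) / sample_rate  # duration in seconds


# Stand-in for a synthesized utterance: 0.5 s of a 440 Hz tone.
sample_rate = 16000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate // 2)]
duration = write_wav('tone_example.wav', tone, sample_rate)
```

If the assumption about the output format holds, a call such as `write_wav('out.wav', audio[0].tolist(), sample_rate)` would save the first synthesized utterance.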
      
      





standalone :







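import os
import torch

device = torch.device('cpu')
torch.set_num_threads(4)
local_file = 'model.pt'

# Download the packaged model once and reuse the local copy afterwards
if not os.path.isfile(local_file):
    torch.hub.download_url_to_file('https://models.silero.ai/models/tts/ru/v2_kseniya.pt',
                                   local_file)

# The model is distributed as a torch.package archive
model = torch.package.PackageImporter(local_file).load_pickle("tts_models", "model")
model.to(device)

# '+' precedes the stressed vowel in the input text
example_batch = ['В недрах тундры выдры в г+етрах т+ырят в вёдра ядра кедров.',
                 'Котики - это жидкость!',
                 'М+ама М+илу м+ыла с м+ылом.']
sample_rate = 16000

# Synthesize the batch and write one wav file per text
audio_paths = model.save_wav(texts=example_batch,
                             sample_rate=sample_rate)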
      
      







, 15 — 20 . , , , . 2-3 .







:







15 — 20
5-6 , ,
3 ,
1
-------------------------- ------------------------ ---------------------------------------------------------
5-6 , , ,
5-6 , , ,
5-6 , , ,
3 , , ,
1 , , ,
3 — 15 , , 3


, 6 :











, 6 :









, 3 :







, 3 .









, 3 :









, 1 :







1 .









, 1 :







1 .









, 3 — 15 :







, ? , - 3 .













:







, - ( ).







, , .







, , .









, , . . , . , , , .







, , "" ( LTV , - ), , , .







:







, :













, .







:







-. , : Mein König, das Fichtenbaum, Bundesausbildungsförderungsgesetz, die Ubüng.



.









. . , . , .







:







  • 5-6 (, );
  • , , 15 — 1 ;
  • , ;
  • ;
  • , , (40 — 100 ) ;




, "" -. . , .







, .































, , / , , .














, . . , , .







, .









, . , "" . , , " ".







4 ( 0.1 — 0.2 MOS ) :







            8 kHz    16 kHz
v1 , 1      18       8
v2 , 1      70       35
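The figures in the table can be checked against the "4x faster" claim from the summary with a line of arithmetic. The sketch below only divides the v2 numbers by the v1 numbers. The units did not survive in this copy of the article, so it assumes that both rows use the same unit and that higher is better.

```python
# Figures copied from the table above; the units are unknown in this copy.
v1 = {'8 kHz': 18, '16 kHz': 8}
v2 = {'8 kHz': 70, '16 kHz': 35}

# Ratio of v2 to v1 for each sample rate.
speedup = {rate: v2[rate] / v1[rate] for rate in v1}

for rate, ratio in speedup.items():
    print(f'{rate}: v2 / v1 = {ratio:.2f}')
```

The ratios come out at roughly 3.9 for 8 kHz and 4.4 for 16 kHz, which is consistent with the rounded "4x" figure.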


, . , 10 . v2



.









, ( abandonware). , , " ".







:







  • . 100 — 200 . ;
  • , - ;
  • , ;


, , :







  • ( 130 ), 99.9%;
  • ( 540 ), 99.9%;
  • 1,300 ( );
  • ( , 2 ), 99%;
  • 3% -, (







    ,







    ). , ;


3% , , ( ). , — . .











:







  • ,



    (







    ,







    ), , . (



    , hard negative );




  • 99% ( hard positive, hard negative );
  • , . - —



    ;
  • ,



    .



    ;


, :



(



, -



,



).

(



) , .









, "", , .







:







  • middleware



    ;
  • / - - ;
  • , ;
  • ;


middleware



. / - , — + (1-2 ), .







— . , - . : " — ". "" , ( 0.25 — 0.5 ).







— , 1 1 . , STT . , . , ( ), — .







.







, - — . , - — .









:







  • 4 ;
  • ( );
  • ;




  • ;
  • ;


:







  • ;
  • (10+ );
  • , ;
  • ;
  • ;
  • ;




Repository: https://github.com/snakers4/silero-models#text-to-speech
Colab with examples: https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb







— , . . 1 . .







