Update: forgot the link to the repository, https://github.com/snakers4/silero-models#text-to-speech, and to the colab with examples, https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb
We were very glad that Habr liked our last article. We received a lot of feedback, both positive and negative. In that article we also made a number of promises about developing our synthesis.
We have made significant progress on these points, but the ultimate release with all the new features and speakers may take a relatively long time, and we would not want to go into radio silence in the meantime. In this article, we answer the criticism, fair and not so fair, and share the good news about the development of our synthesis.
In short:
- We made our vocoder 4x faster;
- We've made packaging models more convenient;
- We made a multi-speaker / multilingual model and "forced" the speakers to speak "foreign" languages;
- ;
- 15 — 1 ( 3-7 ) 5 ( ). ;
- , . (, , , , ). — ;
- , , - 5-10 , ;
, ,
. . . — ( ).
, ( — ). … . .
warning
, ( ), . ( , ).
- ;
- ( , , , );
- ( 2 - 1-2);
silero-vad
, , silero-models
. , :
-
torch.hub
, (omegaconf
yaml-torchaudio
). PyTorch. , , ( , c " "). colab, standalone . -# Minimal Example to Run Locally
; - , , . ;
torch.hub
:
import torch
language = 'ru'
speaker = 'kseniya_v2'
sample_rate = 16000
device = torch.device('cpu')
model, example_text = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                     model='silero_tts',
                                     language=language,
                                     speaker=speaker)
model.to(device)  # gpu or cpu
audio = model.apply_tts(texts=[example_text],
                        sample_rate=sample_rate)
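The snippet above leaves the synthesized audio in memory. As a minimal sketch of writing such a waveform to disk, the helper below saves mono float samples as a 16-bit PCM WAV file using only the standard library. The name `write_wav_16bit` is hypothetical, and the sketch assumes the samples are floats in the range [-1, 1]. A generated tone stands in for the model output so that the block runs without downloading anything.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_16bit(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Returns the number of audio frames written.
    """
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip so the int16 conversion cannot overflow
        frames += struct.pack('<h', int(round(s * 32767)))
    with wave.open(path, 'wb') as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample = 16 bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return len(frames) // 2


# Stand-in for a synthesized waveform: 0.5 s of a 440 Hz tone at 16 kHz.
sample_rate = 16000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate // 2)]
out_path = os.path.join(tempfile.gettempdir(), 'tone_example.wav')
n_written = write_wav_16bit(out_path, tone, sample_rate)
```

If `apply_tts` returns a list of one-dimensional tensors (an assumption about the return type, not something stated here), a call such as `write_wav_16bit('out.wav', audio[0].tolist(), sample_rate)` should work the same way.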
standalone :
import os
import torch
device = torch.device('cpu')
torch.set_num_threads(4)
local_file = 'model.pt'
if not os.path.isfile(local_file):
    # download the packaged model once and reuse the local copy afterwards
    torch.hub.download_url_to_file('https://models.silero.ai/models/tts/ru/v2_kseniya.pt',
                                   local_file)
model = torch.package.PackageImporter(local_file).load_pickle("tts_models", "model")
model.to(device)
example_batch = [' + + .',
' - !',
'+ + + +.']
sample_rate = 16000
audio_paths = model.save_wav(texts=example_batch,
sample_rate=sample_rate)
, 15 — 20 . , , , . 2-3 .
:
15 — 20 | ||
5-6 | , , | |
3 | , | |
1 | ||
-------------------------- | ------------------------ | --------------------------------------------------------- |
5-6 | , | , , |
5-6 | , | , , |
5-6 | , | , , |
3 | , | , , |
1 | , | , , |
3 — 15 | , | , 3 |
, 6 :
, 6 :
, 3 :
, 3 .
, 3 :
, 1 :
1 .
, 1 :
1 .
, 3 — 15 :
, ? , - 3 .
:
, - ( ).
, , .
, , .
, , . . , . , , , .
, , "" ( LTV , - ), , , .
:
, :
, .
:
-. , : Mein König, das Fichtenbaum, Bundesausbildungsförderungsgesetz, die Ubüng.
.
. . , . , .
:
- 5-6 (, );
- , , 15 — 1 ;
- , ;
- ;
- , , (40 — 100 ) ;
, "" -. . , .
, .
, , / , , .
, . . , , .
, .
, . , "" . , , " ".
4 ( 0.1 — 0.2 MOS ) :
| | 8 kHz | 16 kHz |
|---|---|---|
| v1 , 1 | 18 | 8 |
| v2 , 1 | 70 | 35 |
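As a quick arithmetic check of the "4x faster" claim, the sketch below computes the v2-to-v1 ratio from the figures in the table. It assumes that higher numbers mean faster synthesis. The ratio itself does not depend on the unit.

```python
# Figures copied from the table above; higher is assumed to mean faster.
speed = {
    'v1': {'8 kHz': 18, '16 kHz': 8},
    'v2': {'8 kHz': 70, '16 kHz': 35},
}

# Speed-up of v2 over v1 for each sample rate.
speedup = {rate: speed['v2'][rate] / speed['v1'][rate] for rate in speed['v1']}

for rate, ratio in speedup.items():
    print(f'{rate}: v2 / v1 = {ratio:.1f}x')
```

This gives roughly 3.9x at 8 kHz and 4.4x at 16 kHz, which is consistent with the rounded figure of 4x.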
, . , 10 . v2
.
, ( abandonware). , , " ".
:
- . 100 — 200 . ;
- , - ;
- , ;
, , :
- ( 130 ), 99.9%;
- ( 540 ), 99.9%;
- 1,300 ( );
- ( , 2 ), 99%;
- 3% -, (
—
,
—
). , ;
3% , , ( ). , — . .
:
- ,
(
—
,
—
), , . (
, hard negative ); -
99% ( hard positive, hard negative ); - , . - —
; - ,
.
;
, :
(
, -
,
).
(
) , .
, "", , .
:
-
middleware
; - / - - ;
- , ;
- ;
— middleware
. / - , — + (1-2 ), .
— . , - . : " — ". "" , ( 0.25 — 0.5 ).
— , 1 1 . , STT . , . , ( ), — .
.
, - — . , - — .
:
- 4 ;
- ( );
- ;
-
; - ;
:
- ;
- (10+ );
- , ;
- ;
- ;
- ;
Repository: https://github.com/snakers4/silero-models#text-to-speech
Colab with examples: https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb
— , . . 1 . .