We have published a modern Voice Activity Detector and more

When working with speech, a few seemingly "simple" tasks always come up, yet there are few convenient, open, and simple tools for solving them: detecting the presence of a voice (or music), detecting the presence of numbers, and classifying languages.

For voice detection (Voice Activity Detector, VAD) there is a fairly popular tool from Google: WebRTC VAD. It is compact and undemanding in resources, but its main disadvantages are poor robustness to noise, a large number of false positives, and the impossibility of fine-tuning. Of course, if the problem is reformulated as silence detection rather than voice detection (silence being the absence of both voice and noise), it can be solved in trivial ways (an energy threshold, for example), but with the same disadvantages and limitations. The most unpleasant part is that such solutions are often fragile, and hard-coded thresholds do not transfer to other domains.
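To make the fragility of the "trivial" approach concrete, here is a minimal sketch of an energy-threshold silence detector. The frame length, the threshold value, and the synthetic signals are illustrative assumptions and do not come from any particular tool.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def silence_mask(samples, frame_len=160, threshold=1e-3):
    """True for every frame whose energy is below the hard-coded threshold."""
    return [e < threshold for e in frame_energies(samples, frame_len)]

SAMPLE_RATE = 16000  # assumed sampling rate for the synthetic signal

# Three frames of a loud 440 Hz tone followed by two frames of near-silence
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(480)]
quiet = [0.001 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(320)]

mask = silence_mask(tone + quiet)
print(mask)

# The same tone recorded at a much lower gain falls under the same threshold,
# so it is wrongly labelled as silence: the threshold does not transfer.
low_gain_mask = silence_mask([0.04 * s for s in tone])
print(low_gain_mask)
```

The second call shows why hard-coded thresholds break across domains: nothing in the signal changed except its gain, yet every frame flips to "silence".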

STT ( PyTorch ONNX), , , , VAD , MIT. .

"VAD"?

VAD — , ;
Number detector — , ;
Language classifier — ;
4 (, , , ), VAD ( — - , , VAD !);

"" :

4 ;
VAD WebRTC ;
;
, 1 ;
edge ;
(PyTorch, ONNX);
WebRTC , ;
PyTorch (JIT), ONNX;

;
;
(- , , STT);
edge ;
ONNX ;
VAD 16 kHz, 8 kHz;

colab . , :

PyTorch ONNX;
— VAD — , / ;
— . VAD ;
, ( , 1 , - );

, VAD :

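import torch

# Limit PyTorch to a single CPU thread
torch.set_num_threads(1)

# Fetch the model and its helper utilities from the repository via torch.hub
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)

# Only two of the returned helpers are needed here
(get_speech_ts,
 _, read_audio,
 _, _, _) = utils

# Example audio files shipped with the repository
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

# Read the example file and compute the speech timestamps
wav = read_audio(f'{files_dir}/en.wav')
speech_timestamps = get_speech_ts(wav, model,
                                  num_steps=4)
print(speech_timestamps)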
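As a small follow-up, the printed timestamps can be converted to seconds. This is only a sketch: it assumes the result is a list of dicts with 'start' and 'end' given as sample indices, and that the audio is sampled at 16 kHz. The helper name `to_seconds` and the example values are hypothetical and are not real model output.

```python
SAMPLE_RATE = 16000  # assumption: audio sampled at 16 kHz

def to_seconds(timestamps, sample_rate=SAMPLE_RATE):
    """Convert sample-index timestamps to seconds (hypothetical helper)."""
    return [
        {'start': ts['start'] / sample_rate, 'end': ts['end'] / sample_rate}
        for ts in timestamps
    ]

# Hypothetical values for illustration only, not produced by the model
example_timestamps = [{'start': 0, 'end': 32000}, {'start': 40000, 'end': 56000}]

segments = to_seconds(example_timestamps)
total_speech = sum(seg['end'] - seg['start'] for seg in segments)
print(segments, total_speech)
```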

VAD

, VAD. .

250 . , 100 , 30-50. ( 100 250 );
VAD ( );
500 ( 200 ) 4 8 ;
;
, "" "". - ;

1 AMD Ryzen Threadripper 3960X. :

import onnxruntime
import torch
torch.set_num_threads(1)  # PyTorch
opts = onnxruntime.SessionOptions()  # ONNX Runtime: pass as sess_options to InferenceSession
opts.intra_op_num_threads = 1
opts.inter_op_num_threads = 1

, :

num_steps — "";
number of audio streams — ;
, num_steps * number of audio streams;

Batch size	PyTorch latency, ms	ONNX latency, ms
2	9	2
4	11	4
8	14	7
16	19	12
40	36	29
80	64	55
120	96	85
200	157	137

, 1 :

Batch size	num_steps	PyTorch model RTS	ONNX model RTS
40	4	68	86
40	8	34	43
80	4	78	91
80	8	39	45
120	4	78	88
120	8	39	44
200	4	80	91
200	8	40	46
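The two tables appear to be linked by simple arithmetic. The sketch below is one plausible reading, not a definition taken from the text. It assumes that each batch element is a 250 ms audio chunk, that batch size equals num_steps times the number of audio streams, and that RTS means seconds of audio processed per second of compute on one thread. Under these assumptions the latency figures reproduce the RTS figures to within about two units.

```python
CHUNK_SECONDS = 0.25  # assumption: each batch element is a 250 ms chunk

def estimated_rts(batch_size, num_steps, latency_ms, chunk_seconds=CHUNK_SECONDS):
    """Seconds of audio processed per second of compute (assumed meaning of RTS)."""
    streams = batch_size / num_steps          # assumed: batch = num_steps * streams
    audio_seconds = streams * chunk_seconds   # audio covered by one forward pass
    return audio_seconds / (latency_ms / 1000.0)

# (batch size, num_steps, PyTorch latency ms, ONNX latency ms) from the tables above
rows = [
    (40, 4, 36, 29), (40, 8, 36, 29),
    (80, 4, 64, 55), (80, 8, 64, 55),
    (120, 4, 96, 85), (120, 8, 96, 85),
    (200, 4, 157, 137), (200, 8, 157, 137),
]

for batch, steps, pt_ms, onnx_ms in rows:
    print(batch, steps,
          round(estimated_rts(batch, steps, pt_ms), 1),
          round(estimated_rts(batch, steps, onnx_ms), 1))
```

Read this way, doubling num_steps halves the throughput at a fixed batch size, because the same batch then covers half as many streams.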

, , VAD . WebRT, 0 1?

WebRTC 0 1. - 30 , 250 8 . , 0 1 .

, VAD , . , . VAD.
