When working with speech, a few seemingly "simple" tasks come up again and again, yet there are few convenient, open, and simple tools for them: detecting the presence of a voice (or music), detecting spoken numbers, and classifying languages.
For voice detection (Voice Activity Detector, VAD) there is a fairly popular tool from Google, WebRTC VAD. It is compact and undemanding in resources, but its main drawbacks are poor robustness to noise, a large number of false positives, and the lack of any fine tuning. Of course, if the problem is reformulated as silence detection rather than voice detection (silence being the absence of both voice and noise), it can be solved by trivial means such as an energy threshold, but with the same drawbacks and limitations. Worst of all, such solutions are often fragile, and hard-coded thresholds do not transfer to other domains.
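To make the "energy threshold" approach concrete, here is a minimal sketch of such a silence detector. The function names, the 480-sample frame (30 ms at 16 kHz), and the threshold of 0.01 are illustrative assumptions, not values taken from WebRTC or from any particular library.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each full, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def is_silence(samples, frame_len=480, threshold=0.01):
    """Return one flag per frame: True where the RMS energy is below the threshold.

    Both defaults are hypothetical and would need tuning per recording setup.
    """
    return [energy < threshold for energy in frame_rms(samples, frame_len)]


# Synthetic demo: one very quiet frame followed by one loud 440 Hz tone frame.
quiet = [0.001 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(480)]
loud = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(480)]
flags = is_silence(quiet + loud)
print(flags)
```

The sketch also shows why the approach is fragile: the loud frame here is a pure tone, not speech, yet it is reported as "not silence", and the threshold depends entirely on the recording gain.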
STT ( PyTorch ONNX), , , , VAD , MIT. .
"VAD"?
- VAD — , ;
- Number detector — , ;
- Language classifier — ;
- 4 (, , , ), VAD ( — - , , VAD !);
"" :
- 4 ;
- VAD WebRTC ;
- ;
- , 1 ;
- edge ;
- (PyTorch, ONNX);
- WebRTC , ;
- PyTorch (JIT), ONNX;
- ;
- ;
- (- , , STT);
- edge ;
- ONNX ;
- VAD 16 kHz, 8 kHz;
colab . , :
- PyTorch ONNX;
- — VAD — , / ;
- — . VAD ;
- , ( , 1 , - );
, VAD :
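import torch
torch.set_num_threads(1)  # limit PyTorch to a single CPU thread

# Download the model and its helper utilities from torch.hub
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)

# Only get_speech_ts and read_audio are needed here
(get_speech_ts,
 _, read_audio,
 _, _, _) = utils

# Example audio files shipped with the repository
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

wav = read_audio(f'{files_dir}/en.wav')
speech_timestamps = get_speech_ts(wav, model,
                                  num_steps=4)
print(speech_timestamps)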
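A possible way to post-process the result is sketched below. It assumes that each entry of `speech_timestamps` is a dict with `start` and `end` offsets expressed in samples, and that the audio is sampled at 16 kHz. Both are assumptions, and the helper names and example values are hypothetical.

```python
SAMPLE_RATE = 16000  # assumed sampling rate of the audio, in Hz


def to_seconds(timestamps, sample_rate=SAMPLE_RATE):
    """Convert [{'start': int, 'end': int}, ...] sample offsets to (start_s, end_s) tuples."""
    return [(ts['start'] / sample_rate, ts['end'] / sample_rate) for ts in timestamps]


def total_speech(timestamps, sample_rate=SAMPLE_RATE):
    """Total duration of all detected speech segments, in seconds."""
    return sum(end - start for start, end in to_seconds(timestamps, sample_rate))


# Hypothetical detector output, not real results from the model
example = [{'start': 0, 'end': 32000}, {'start': 48000, 'end': 64000}]
segments = to_seconds(example)
print(segments)
print(total_speech(example))
```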
VAD
, VAD. .
Measured on 1 thread of an AMD Ryzen Threadripper 3960X, with the following settings:
torch.set_num_threads(1) # pytorch
ort_session.intra_op_num_threads = 1 # onnx
ort_session.inter_op_num_threads = 1 # onnx
, :
- num_steps — "";
- number of audio streams — ;
- the batch size is therefore num_steps * number of audio streams;
:
Batch size | PyTorch latency, ms | ONNX latency, ms |
---|---|---|
2 | 9 | 2 |
4 | 11 | 4 |
8 | 14 | 7 |
16 | 19 | 12 |
40 | 36 | 29 |
80 | 64 | 55 |
120 | 96 | 85 |
200 | 157 | 137 |
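The figures in the table above can be turned into a per-item cost to see how batching amortizes the overhead. This is only an arithmetic sketch over the published numbers. The `streams_for_batch` helper additionally assumes that the batch size equals num_steps * number of audio streams, and both helper names are hypothetical.

```python
# (batch size, PyTorch latency in ms, ONNX latency in ms), copied from the table above
LATENCY = [
    (2, 9, 2), (4, 11, 4), (8, 14, 7), (16, 19, 12),
    (40, 36, 29), (80, 64, 55), (120, 96, 85), (200, 157, 137),
]


def per_item_ms(rows):
    """Latency divided by batch size, for each backend."""
    return [(batch, pt / batch, onnx / batch) for batch, pt, onnx in rows]


def streams_for_batch(batch_size, num_steps):
    """Number of audio streams in a batch, assuming batch = num_steps * streams."""
    return batch_size // num_steps


per_item = per_item_ms(LATENCY)
for batch, pt, onnx in per_item:
    print(f'batch {batch:>3}: PyTorch {pt:.3f} ms/item, ONNX {onnx:.3f} ms/item')
print(streams_for_batch(40, 4))
```

Per item, PyTorch drops from 4.5 ms at batch size 2 to about 0.79 ms at batch size 200, while ONNX goes from 1.0 ms to about 0.69 ms, so the gap between the two backends narrows as the batch grows.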
, 1 :
Batch size | num_steps | PyTorch model RTS | ONNX model RTS |
---|---|---|---|
40 | 4 | 68 | 86 |
40 | 8 | 34 | 43 |
80 | 4 | 78 | 91 |
80 | 8 | 39 | 45 |
120 | 4 | 78 | 88 |
120 | 8 | 39 | 44 |
200 | 4 | 80 | 91 |
200 | 8 | 40 | 46 |
, , VAD . WebRT, 0 1?
WebRTC 0 1. - 30 , 250 8 . , 0 1 .
: