We have published a modern Voice Activity Detector and more








When working with speech, a few seemingly "simple" questions always come up, and there are few convenient, open, and simple tools for answering them: detecting the presence of voice (or music), detecting the presence of numbers, and classifying languages.







For voice detection (Voice Activity Detector, VAD), there is a fairly popular tool from Google: WebRTC VAD. It is compact and undemanding in resources, but its main disadvantages are poor robustness to noise, a large number of false positives, and no way to fine-tune it. If the problem is reformulated as silence detection rather than voice detection (silence being the absence of both voice and noise), it can be solved by trivial means (an energy threshold, for example), but with the same disadvantages and limitations. Worst of all, such solutions are often fragile, and hard-coded thresholds do not transfer to other domains.
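To make the "trivial" approach concrete, here is a minimal sketch of an energy-threshold silence detector. The function names, the 480-sample frame length, and the -40 dB threshold are all illustrative assumptions and do not come from WebRTC or any other library. The hand-picked threshold is exactly the fragile part: it has to be re-tuned for every new recording condition.

```python
import math


def frame_is_silent(samples, threshold_db=-40.0):
    """Return True if the RMS level of a frame is below threshold_db.

    samples: floats in [-1.0, 1.0]; threshold_db is a hand-picked assumption.
    """
    if not samples:
        return True
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return True  # digital silence, log10 would be undefined
    return 20.0 * math.log10(rms) < threshold_db


def silence_mask(wav, frame_len=480, threshold_db=-40.0):
    """Split wav into non-overlapping frames and flag each one as silent or not."""
    return [
        frame_is_silent(wav[i:i + frame_len], threshold_db)
        for i in range(0, len(wav), frame_len)
    ]


# Synthetic input: one frame of digital silence, then one frame of a loud 440 Hz tone.
sr = 16000
quiet = [0.0] * 480
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(480)]
mask = silence_mask(quiet + tone)
print(mask)
```

Note that the tone is flagged as "not silent" even though it contains no voice. An energy threshold cannot tell speech from loud noise or music, which is why this reformulation inherits the same limitations.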







STT ( PyTorch ONNX), , , , VAD , MIT. .









"VAD"?







  • VAD — , ;
  • Number detector — , ;
  • Language classifier — ;
  • 4 (, , , ), VAD ( — - , , VAD !);


"" :







  • 4 ;
  • VAD WebRTC ;
  • ;
  • , 1 ;
  • edge ;
  • (PyTorch, ONNX);
  • WebRTC , ;
  • PyTorch (JIT), ONNX;




  • ;
  • ;
  • (- , , STT);
  • edge ;
  • ONNX ;
  • VAD 16 kHz, 8 kHz;




colab . , :







  • PyTorch ONNX;
  • — VAD — , / ;
  • — . VAD ;
  • , ( , 1 , - );


, VAD :







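import torch
torch.set_num_threads(1)  # run inference on a single CPU thread

# Download the model and its helper functions via torch.hub
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)

# Only two of the bundled helpers are needed here
(get_speech_ts,
 _, read_audio,
 _, _, _) = utils

# Example audio files downloaded together with the repository
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

wav = read_audio(f'{files_dir}/en.wav')
# Find the speech segments in the whole file
speech_timestamps = get_speech_ts(wav, model,
                                  num_steps=4)
print(speech_timestamps)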
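As a possible follow-up, the sketch below converts such timestamps to seconds. It assumes that the result is a list of dictionaries with `start` and `end` keys expressed in samples and that the audio was read at 16 kHz. Both are assumptions to verify against the actual output. The helper names and the example values are made up for illustration.

```python
SAMPLE_RATE = 16000  # assumption: the audio was read at 16 kHz


def to_seconds(timestamps, sample_rate=SAMPLE_RATE):
    """Convert [{'start': n, 'end': m}, ...] from sample indices to seconds."""
    return [
        {'start': ts['start'] / sample_rate, 'end': ts['end'] / sample_rate}
        for ts in timestamps
    ]


def total_speech_seconds(timestamps, sample_rate=SAMPLE_RATE):
    """Sum the duration of all detected speech segments, in seconds."""
    return sum(ts['end'] - ts['start'] for ts in timestamps) / sample_rate


# Made-up values for illustration only, not real model output
example = [{'start': 16000, 'end': 48000}, {'start': 64000, 'end': 80000}]
print(to_seconds(example))
print(total_speech_seconds(example))
```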
      
      





VAD



, VAD. .







  • 250 . , 100 , 30-50. ( 100 250 );
  • VAD ( );
  • 500 ( 200 ) 4 8 ;
  • ;
  • , "" "". - ;




All measurements were taken on 1 thread of an AMD Ryzen Threadripper 3960X. The thread settings were:







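import onnxruntime

torch.set_num_threads(1)  # PyTorch
opts = onnxruntime.SessionOptions()  # ONNX Runtime reads thread counts from SessionOptions
opts.intra_op_num_threads = 1
opts.inter_op_num_threads = 1  # pass opts as sess_options= when creating the InferenceSession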
      
      





, :







  • num_steps — "";
  • number of audio streams — ;
  • the batch size is therefore num_steps * number of audio streams;
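The relationship between these two variables can be sketched as simple arithmetic. This assumes, as the batch sizes in the measurements suggest, that the effective batch size is the product of `num_steps` and the number of audio streams. The function name is hypothetical.

```python
def streaming_batch_size(num_steps, num_streams):
    """Effective batch size under the assumption batch = num_steps * streams."""
    if num_steps <= 0 or num_streams <= 0:
        raise ValueError("num_steps and num_streams must be positive")
    return num_steps * num_streams


# A few illustrative combinations
for num_streams in (1, 10, 50):
    for num_steps in (4, 8):
        batch = streaming_batch_size(num_steps, num_streams)
        print(f"{num_streams:>2} streams x num_steps={num_steps} -> batch size {batch}")
```

Under this assumption, a batch size of 40 corresponds to 10 streams at num_steps=4 or 5 streams at num_steps=8.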


:







Batch size    PyTorch latency, ms    ONNX latency, ms
2             9                      2
4             11                     4
8             14                     7
16            19                     12
40            36                     29
80            64                     55
120           96                     85
200           157                    137
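One way to read the table above is to divide each latency by its batch size, which gives the amortized cost per batch item. The short sketch below does this with the numbers copied from the table. The function name is illustrative.

```python
# (batch size, PyTorch latency in ms, ONNX latency in ms), copied from the table
LATENCY_MS = [
    (2, 9, 2), (4, 11, 4), (8, 14, 7), (16, 19, 12),
    (40, 36, 29), (80, 64, 55), (120, 96, 85), (200, 157, 137),
]


def per_item_latency(rows):
    """Return (batch size, PyTorch ms per item, ONNX ms per item) for each row."""
    return [(batch, pt / batch, onnx / batch) for batch, pt, onnx in rows]


for batch, pt, onnx in per_item_latency(LATENCY_MS):
    print(f"batch {batch:>3}: PyTorch {pt:.2f} ms/item, ONNX {onnx:.2f} ms/item")
```

For PyTorch the per-item cost falls from 4.5 ms at batch size 2 to under 1 ms at batch size 200, so larger batches are processed far more efficiently than small ones.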


, 1 :







Batch size    num_steps    PyTorch model RTS    ONNX model RTS
40            4            68                   86
40            8            34                   43
80            4            78                   91
80            8            39                   45
120           4            78                   88
120           8            39                   44
200           4            80                   91
200           8            40                   46
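The numbers above show a regular pattern: at every batch size, doubling num_steps from 4 to 8 roughly halves the RTS. The sketch below checks this on the table values. It also includes a small helper that assumes RTS means seconds of audio processed per second of compute. That reading is an assumption, and both function names are hypothetical.

```python
# (batch size, num_steps, PyTorch RTS, ONNX RTS), copied from the table
RTS = [
    (40, 4, 68, 86), (40, 8, 34, 43),
    (80, 4, 78, 91), (80, 8, 39, 45),
    (120, 4, 78, 88), (120, 8, 39, 44),
    (200, 4, 80, 91), (200, 8, 40, 46),
]


def steps_ratio(rows):
    """Per batch size: RTS at num_steps=4 divided by RTS at num_steps=8."""
    by_key = {(batch, steps): (pt, onnx) for batch, steps, pt, onnx in rows}
    ratios = {}
    for batch in sorted({row[0] for row in rows}):
        pt4, onnx4 = by_key[(batch, 4)]
        pt8, onnx8 = by_key[(batch, 8)]
        ratios[batch] = (pt4 / pt8, onnx4 / onnx8)
    return ratios


def hours_to_process(audio_hours, rts):
    """Compute time on one thread, ASSUMING RTS = audio seconds per compute second."""
    return audio_hours / rts


for batch, (pt_ratio, onnx_ratio) in steps_ratio(RTS).items():
    print(f"batch {batch:>3}: PyTorch x{pt_ratio:.2f}, ONNX x{onnx_ratio:.2f}")
print(hours_to_process(100, 80))
```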




, , VAD . WebRT, 0 1?







WebRTC 0 1. - 30 , 250 8 . , 0 1 .







:







image









, VAD , . , . VAD.







