Ultimate comparison of speech recognition systems: Ashmanov, Google, Sber, Silero, Tinkoff, Yandex








Some time ago we wrote a series of articles on how to measure the quality of speech recognition systems correctly, and we collected metrics for the solutions available at the time, both commercial and non-commercial (articles 1, 2, 3). An excerpt from that series appeared on Habr as part of this article, but we never got around to a large-scale update of the study worthy of publication on Habr, since that takes a great deal of effort and preparation.







Some time has passed, and it is time to update our research and make it truly ultimate. Compared with the earlier studies, the following has changed or been added:







  • Many validation sets have been added from different real domains;
  • , ;
  • , ;
  • (, );
  • , - "", "";




(. ) :







  • wav ( PCM);
  • 8 ( , );
  • - -, "" , , ;
  • — WER. 20% WER, 5% WER ( , );
  • 1 . 2-3 ( "" ). 500 !;
  • ( , " "), ;
  • , . 1 .. WER, ;
  • ogg/opus , , , "" ;
  • (8 16 kHz), ;
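
The metric named in the list above is WER (word error rate). As a reminder of what that number measures, here is a minimal sketch of the standard word-level definition: the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference, divided by the number of words in the reference. The function name `wer` and the lowercase-and-split normalization are assumptions made for illustration; the normalization actually used in this study is not shown here and may differ.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words.

    Normalization here (lowercase + whitespace split) is an illustrative
    assumption, not the procedure used in the study.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("three" -> "tree") and one deletion ("five"): 2 / 5.
    print(wer("one two three four five", "one two tree four"))
```

Multiplying the result by 100 gives the value as a percentage. Note that WER can exceed 100% when the hypothesis contains many inserted words.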




, Silero bleeding edge, production . — WER ( WER ).







Columns: Ashmanov, Google (default), Google (enhanced), Sber, Sber (IVR), Silero (prod), Silero new (bleeding edge), Tinkoff, Yandex
10 11 10 7 7 6 8 13
35 24 6 30 27 27 14
24 39 41 20 16 11 15 13
() 47 16 18 22 32 13 12 21 15
28 27 24 18 14 12 20 21
() 31 37 37 24 33 25 24 23 22
31 36 37 26 21 22 25 21
22 60 54 19 24 20 28 22
24 61 40 26 18 15 27 23
() 42 49 8 41 27 52 18
62 30 32 24 28 39 35 28 25
(e-commerce) 34 45 43 34 45 29 29 31 28
34 29 29 31 20 20 31 29
Yellow pages 45 43 49 41 32 29 31 30
() 43 55 59 41 67 38 37 33 32
YouTube 32 50 41 34 28 25 38 32
() 44 72 66 46 41 35 38 35
50 37 40 50 35 33 42 38
61 68 68 54 41 32 43 42
, 54 70 60 61 43 41 56 54
39 50 53 32 25 20 27


WER, .







( , , , - ). . ( , ).







Columns: Ashmanov, Google (default), Google (enhanced), Sber, Sber (IVR), Silero, Tinkoff, Yandex
0% 0% 0% 0% 0% 5% 4%
0% 2% 0% 0% 4% 0%
1% 12% 13% 6% 0% 2% 1%
() 0% 0% 0% 1% 0% 0% 7% 0%
0% 1% 0% 0% 0% 2% 0%
() 0% 0% 0% 2% 0% 0% 6% 0%
0% 8% 10% 4% 0% 4% 0%
0% 22% 6% 2% 0% 1% 0%
0% 19% 2% 3% 1% 4% 0%
() 0% 12% 0% 0% 1% 0%
0% 2% 3% 1% 1% 0% 5% 1%
(e-commerce) 0% 0% 0% 7% 1% 0% 7% 0%
0% 0% 0% 1% 0% 4% 0%
Yellow pages 1% 13% 9% 14% 0% 2% 2%
() 0% 0% 7% 35% 9% 0% 5% 0%
YouTube 0% 13% 1% 6% 0% 1% 0%
() 1% 33% 12% 17% 5% 1% 1%
0% 1% 0% 7% 0% 6% 1%
3% 26% 28% 25% 0% 2% 4%
, 2% 19% 3% 25% 0% 1% 1%
1% 12% 14% 9% 0% 3% 0%


, .









, , . Tinkoff — , , . " " (, 1/10 ) . IVR , 8 kHz, , . — , , . — Google, .







, production / ( "" 10% ):







Ashmanov 0 7
Google 1 13 (9 enhanced)
Sber 2 0
Sber IVR 4 4
Silero 13 0
Tinkoff 6 2
Yandex 10 1


— , . " " — . bleeding edge ( ), " " , 17 21. , .









gRPC API. SMB , . ( , ). , "" , . 40 ( PDF), .







. , , . . , .







Tinkoff gRPC, ( , ). enterprise ( , ) , , . , .







… , , . , b2b , , . 500- 200 . -, "" .







ashmanov







2 ( gRPC ) . gRPC , . , / / .









, ( ) ( — ). 1 (RTS = 1 / RTF):







RTS per Thread Threads
Ashmanov 0.2 8
Ashmanov 1.7 1
Google 4.3 8
Google enhanced 2.9 8
Sber 13.6 8
Sber 14.1 1
Silero 2.5 8 4-core, 1080
Silero 3.8 4 4-core, 1080
Silero 6.0 8 12 cores, 2x1080 Ti
Silero 9.7 1 12 cores, 2x1080 Ti
Tinkoff 1.4 8
Tinkoff 2.2 1
Yandex 5.5 2 8 —


RTS, .
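
The relation RTS = 1 / RTF used for the table above can be sketched as follows, assuming the usual definition of the real-time factor (RTF) as processing time divided by audio duration. RTS is then the number of seconds of audio processed per second of wall-clock time. The function names and the sample numbers are hypothetical and are not taken from the measurements.

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent processing per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def rts(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time speed: seconds of audio processed per second, RTS = 1 / RTF."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # Hypothetical example: a 60-second file recognized in 10 seconds.
    print(rtf(10.0, 60.0))  # fraction of real time spent on processing
    print(rts(10.0, 60.0))  # seconds of audio handled per second
```

A higher RTS means faster recognition; an RTS below 1 means the system is slower than real time.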







( , ) ( ), . VDS, Nvidia Tesla, - ( — ). .







, EX51-SSD-GPU, . , , .







. 12 + GPU ~150 RTS. , 12+ , . , - . aspirational 2-3 .







( ), ( ) . — ( ), . - … 60 !
















, Open STT, , , . - . , . , .







/



, 1080 Ti, 2080 Ti. , .







Yandex was the one system to which we sent data in the opus format. We tested this briefly, and for Yandex there appears to be no particular difference between wav and opus.







