Some time ago we wrote a series of articles (1 , 2 , 3 ) on how to correctly measure the quality of speech recognition systems, and we benchmarked the solutions available at the time, both commercial and non-commercial. A digest of that series appeared on Habr as part of this article , but we never got around to a large-scale update of the study worthy of publication on Habr, since that takes a lot of effort and preparation.
Time has passed, and it is time to update our research and make it truly definitive. Compared to past studies, the following has changed or been added:
- Many validation sets have been added from different real domains;
- , ;
- , ;
- (, );
- , - "", "";
(. ) :
-
wav
( PCM); - 8 ( , );
- - -, "" , , ;
- — WER. 20% WER, 5% WER ( , );
- 1 . 2-3 ( "" ). 500 !;
- ( , " "), ;
- , . 1 .. WER, ;
-
ogg/opus
, , , "" ; - (8 16 kHz), ;
, Silero bleeding egde, production . — WER ( WER ).
Ashmanov | Sber | Sber | Silero | Silero new | Tinkoff | Yandex | |||
---|---|---|---|---|---|---|---|---|---|
default | enhanced | IVR | prod | bleeding edge | |||||
10 | 11 | 10 | 7 | 7 | 6 | 8 | 13 | ||
35 | 24 | 6 | 30 | 27 | 27 | 14 | |||
24 | 39 | 41 | 20 | 16 | 11 | 15 | 13 | ||
() | 47 | 16 | 18 | 22 | 32 | 13 | 12 | 21 | 15 |
28 | 27 | 24 | 18 | 14 | 12 | 20 | 21 | ||
() | 31 | 37 | 37 | 24 | 33 | 25 | 24 | 23 | 22 |
31 | 36 | 37 | 26 | 21 | 22 | 25 | 21 | ||
22 | 60 | 54 | 19 | 24 | 20 | 28 | 22 | ||
24 | 61 | 40 | 26 | 18 | 15 | 27 | 23 | ||
() | 42 | 49 | 8 | 41 | 27 | 52 | 18 | ||
62 | 30 | 32 | 24 | 28 | 39 | 35 | 28 | 25 | |
(e-commerce) | 34 | 45 | 43 | 34 | 45 | 29 | 29 | 31 | 28 |
34 | 29 | 29 | 31 | 20 | 20 | 31 | 29 | ||
Yellow pages | 45 | 43 | 49 | 41 | 32 | 29 | 31 | 30 | |
() | 43 | 55 | 59 | 41 | 67 | 38 | 37 | 33 | 32 |
YouTube | 32 | 50 | 41 | 34 | 28 | 25 | 38 | 32 | |
() | 44 | 72 | 66 | 46 | 41 | 35 | 38 | 35 | |
50 | 37 | 40 | 50 | 35 | 33 | 42 | 38 | ||
61 | 68 | 68 | 54 | 41 | 32 | 43 | 42 | ||
, | 54 | 70 | 60 | 61 | 43 | 41 | 56 | 54 | |
39 | 50 | 53 | 32 | 25 | 20 | 27 |
WER, .
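For readers who want to reproduce numbers of this kind, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the system output, divided by the number of reference words. The function name `wer` and the plain whitespace tokenization are our assumptions for illustration; the exact text normalization behind the table above is not shown here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which is an assumption, not the study's pipeline.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(wer("a b c d", "a x c"))
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, so a value in the table is best read as "errors per hundred reference words" rather than a share of wrong words.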
( , , , - ). . ( , ).
Ashmanov | Sber | Sber | Silero | Tinkoff | Yandex | |||
---|---|---|---|---|---|---|---|---|
default | enhanced | IVR | ||||||
0% | 0% | 0% | 0% | 0% | 5% | 4% | ||
0% | 2% | 0% | 0% | 4% | 0% | |||
1% | 12% | 13% | 6% | 0% | 2% | 1% | ||
() | 0% | 0% | 0% | 1% | 0% | 0% | 7% | 0% |
0% | 1% | 0% | 0% | 0% | 2% | 0% | ||
() | 0% | 0% | 0% | 2% | 0% | 0% | 6% | 0% |
0% | 8% | 10% | 4% | 0% | 4% | 0% | ||
0% | 22% | 6% | 2% | 0% | 1% | 0% | ||
0% | 19% | 2% | 3% | 1% | 4% | 0% | ||
() | 0% | 12% | 0% | 0% | 1% | 0% | ||
0% | 2% | 3% | 1% | 1% | 0% | 5% | 1% | |
(e-commerce) | 0% | 0% | 0% | 7% | 1% | 0% | 7% | 0% |
0% | 0% | 0% | 1% | 0% | 4% | 0% | ||
Yellow pages | 1% | 13% | 9% | 14% | 0% | 2% | 2% | |
() | 0% | 0% | 7% | 35% | 9% | 0% | 5% | 0% |
YouTube | 0% | 13% | 1% | 6% | 0% | 1% | 0% | |
() | 1% | 33% | 12% | 17% | 5% | 1% | 1% | |
0% | 1% | 0% | 7% | 0% | 6% | 1% | ||
3% | 26% | 28% | 25% | 0% | 2% | 4% | ||
, | 2% | 19% | 3% | 25% | 0% | 1% | 1% | |
1% | 12% | 14% | 9% | 0% | 3% | 0% |
, .
, , . Tinkoff — , , . " " (, 1/10 ) . IVR , 8 kHz, , . — , , . — Google, .
, production / ( "" 10% ):
Ashmanov | 0 | 7 |
1 | 13 (9 enhanced) | |
Sber | 2 | 0 |
Sber IVR | 4 | 4 |
Silero | 13 | 0 |
Tinkoff | 6 | 2 |
Yandex | 10 | 1 |
— , . " " — . bleeding edge ( ), " " , 17 21. , .
gRPC API. SMB , . ( , ). , "" , . 40 ( PDF), .
Tinkoff gRPC, ( , ). enterprise ( , ) , , . , .
… , , . , b2b , , . 500- 200 . -, "" .
, ( ) ( — ). 1 (RTS = 1 / RTF):
System | RTS per Thread | Threads | Notes |
---|---|---|---|
Ashmanov | 0.2 | 8 | |
Ashmanov | 1.7 | 1 | |
 | 4.3 | 8 | |
Google enhanced | 2.9 | 8 | |
Sber | 13.6 | 8 | |
Sber | 14.1 | 1 | |
Silero | 2.5 | 8 | 4-core, 1080 |
Silero | 3.8 | 4 | 4-core, 1080 |
Silero | 6.0 | 8 | 12 cores, |
Silero | 9.7 | 1 | 12 cores, |
Tinkoff | 1.4 | 8 | |
Tinkoff | 2.2 | 1 | |
Yandex | 5.5 | 2 | 8 — |
RTS, .
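The relation RTS = 1 / RTF used above can be sketched as follows. We assume the usual definition of the real-time factor (processing time divided by audio duration), so RTS is the number of seconds of audio recognized per second of wall-clock time. The function names `rtf` and `rts` are hypothetical and only serve the illustration.

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: how long it takes to process one second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def rts(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time speed: seconds of audio processed per second, i.e. 1 / RTF."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # Example: a 60-second recording recognized in 10 seconds.
    print(rtf(10.0, 60.0))  # RTF below 1 means faster than real time
    print(rts(10.0, 60.0))  # the same measurement expressed as RTS
```

Higher RTS is better, while for RTF lower is better, which is why the two are easy to confuse when comparing vendors.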
( , ) ( ), . VDS, Nvidia Tesla, - ( — ). .
, EX51-SSD-GPU, . , , .
. 12 + GPU ~150 RTS. , 12+ , . , - . aspirational 2-3 .
( ), ( ) . — ( ), . - … 60 !
, Open STT, , , . - . , . , .
/
, 1080 Ti, 2080 Ti. , .
It was Yandex that we sent data to in opus format. We tested a little, and for Yandex there seems to be no particular difference between wav and opus.