Speech recognition with machine learning tools

In my work, I needed to check recordings of calls with clients for employees' compliance with the conversation script. Usually a dedicated employee is assigned to this and spends a large amount of time listening to the recordings. We set ourselves the task of reducing the time spent on verification by using automatic speech recognition (ASR) tools. Below we take a closer look at one of these tools.





Nvidia NeMo is a set of machine learning tools for building and training GPU-powered models.





The models in NeMo use a modern approach to speech recognition called Connectionist Temporal Classification (CTC).





Before CTC, the typical approach was to split the input audio into separate speech segments and predict a token for each segment. The tokens were then combined, consecutive duplicates were collapsed into one, and the result was returned as the model output.





This hurt recognition accuracy, because a word with genuinely repeated letters could not be recognized 100% correctly: for example, "cooperation" would be collapsed to "coperation".





CTC still predicts one token per time segment of speech, but it additionally uses a special blank token to decide where duplicate tokens should be collapsed. A blank token between two identical tokens separates genuinely repeated letters so that they are not merged.
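To make the collapsing rule concrete, here is a minimal sketch of greedy CTC decoding (my own illustration, not code from NeMo): take the most likely token for each frame, merge consecutive duplicates, then drop the blank tokens.

# Minimal sketch of greedy CTC collapsing (illustration only, not NeMo code).
# 'frames' is the most likely token per time step; '_' plays the role of the blank token.
def ctc_collapse(frames, blank='_'):
    out = []
    prev = None
    for token in frames:
        if token != prev and token != blank:
            out.append(token)   # keep a token only when it starts a new run
        prev = token
    return ''.join(out)

# The blank between the two 'l' frames preserves the double letter,
# while repeated frames of the same letter are merged into one.
print(ctc_collapse(list('hheel_lloo')))  # -> 'hello'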





For my task, I took one of the models (Jasper 10×5) and trained it from scratch. For training, a public dataset of telephone conversations was chosen, containing pre-cut audio recordings and their transcriptions.





To train the model, you need to prepare a manifest file that describes each audio file and its transcription. The manifest has its own format: one JSON object per line.





{"audio_filepath": "path/to/audio.wav", "duration": 3.45, "text": "sometext"}
…
{"audio_filepath": "path/to/audio.wav", "duration": 3.45, "text": "sometext"}



The model accepts audio files only in *.wav format. You need to loop through the entire list of audio files and use a console utility to re-encode files whose extension differs from the required one:





# Requires: import glob, os, subprocess, tarfile
def convertToWav(self, ext):
    # Unpack the dataset archive on first run
    if not os.path.exists(self.datadir + '/dataset'):
        tar = tarfile.open(self.an4Path)
        tar.extractall(path=self.datadir)
    # Find every audio file with the given extension and convert it to .wav via sox
    sphList = glob.glob(self.datadir + '/dataset/**/*' + ext, recursive=True)
    for sph in sphList:
        wav = sph[:-4] + '.wav'
        cmd = ["sox", sph, wav]
        subprocess.run(cmd)
        print('converted ' + ext + ' to ' + wav)
      
      
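Since the config below expects 16 kHz audio, it may also be worth forcing the sample rate and channel count during conversion. A possible variant of the sox command from the loop above (this is an assumption about the source recordings, not part of the original script; sox resamples automatically when the output rate differs from the input):

# Variant of the conversion command: besides changing the container to .wav,
# force 16 kHz mono so the files match the model's sample_rate.
# Assumption: the source recordings may have a different rate or channel count.
cmd = ["sox", sph, "-r", "16000", "-c", "1", wav]
subprocess.run(cmd)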



To build the training and test manifests, I used the following function. The paths to the transcription files and audio files are known, and the duration of each audio file is obtained with the get_duration(filename=audio_path) function from the librosa library:





# Requires: import json, os and librosa
def buildManifest(self, transcript_path, manifest_path, wav_path):
    with open(transcript_path, 'r') as fin:
        with open(manifest_path, 'w') as fout:
            for line in fin:
                # The transcription text comes before the '(file_id)' suffix
                transcript = line[: line.find('(') - 1].lower()
                transcript = transcript.replace('<s>', '').replace('</s>', '')
                transcript = transcript.strip()
                # The file id inside the parentheses points to the audio file
                file_id = line[line.find('(') + 1 : -2]
                audio_path = os.path.join(
                    self.datadir, wav_path,
                    file_id[file_id.find('-') + 1 : file_id.rfind('-')],
                    file_id + '.wav')
                duration = librosa.core.get_duration(filename=audio_path)
                # One JSON object per line, as required by the manifest format
                metadata = {
                    "audio_filepath": audio_path,
                    "duration": duration,
                    "text": transcript
                }
                print(metadata)
                json.dump(metadata, fout)
                fout.write('\n')
      
      



The model parameters are described in a configuration file:





config.yaml:
name: &name "Jasper10x5"
model:
  sample_rate: &sample_rate 16000
  labels: &labels [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
                   "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    normalize: "per_feature"
    sample_rate: *sample_rate
    features: &n_mels 64
    n_fft: 512
    frame_splicing: 1
    dither: 0.00001
    stft_conv: false
      
      
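The training code below also expects train_ds and validation_ds sections in config.yaml (their manifest_filepath values are filled in from Python). The excerpt above omits them; in NeMo's example Jasper configs they look roughly like this, and the exact fields may differ in your setup:

  train_ds:
    manifest_filepath: ???   # set from Python before training
    sample_rate: *sample_rate
    labels: *labels
    batch_size: 32
    shuffle: true
  validation_ds:
    manifest_filepath: ???   # set from Python before training
    sample_rate: *sample_rate
    labels: *labels
    batch_size: 32
    shuffle: false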



The model is trained with the help of pytorch_lightning:





import os
import nemo.collections.asr as nemo_asr
import pytorch_lightning as pl
from omegaconf import DictConfig
from ruamel.yaml import YAML

class NemoASR:
    def __init__(self, dataDir):
        self.datadir = dataDir
        self.CONF_PATH = './config.yaml'
        # Load the model configuration described above
        yaml = YAML(typ="safe")
        with open(self.CONF_PATH) as f:
            self.CONFIG = yaml.load(f)

    def train(self, transcriptionPATH, manifestPATH, wavPATH, testTranscriptionPATH, testManifestPATH, testWavPATH):
        print("begin train")
        # Build the training manifest if it does not exist yet
        train_transcripts = self.datadir + transcriptionPATH
        train_manifest = self.datadir + manifestPATH
        if not os.path.isfile(train_manifest):
            self.buildManifest(train_transcripts, train_manifest, wavPATH)
        # Build the test (validation) manifest if it does not exist yet
        test_transcripts = self.datadir + testTranscriptionPATH
        test_manifest = self.datadir + testManifestPATH
        if not os.path.isfile(test_manifest):
            self.buildManifest(test_transcripts, test_manifest, testWavPATH)
        # Point the dataset sections of ./config.yaml at the generated manifests
        self.CONFIG['model']['train_ds']['manifest_filepath'] = train_manifest
        self.CONFIG['model']['validation_ds']['manifest_filepath'] = test_manifest
        # Train the CTC model on one GPU with pytorch_lightning
        trainer = pl.Trainer(max_epochs=500, gpus=1)
        self.model = nemo_asr.models.EncDecCTCModel(cfg=DictConfig(self.CONFIG['model']), trainer=trainer)
        trainer.fit(self.model)
        print("end train")
#-------------------------------------------------------------
nemoASR = NemoASR('.')
if nemoASR.checkExistsDataSet():  # helper that checks/downloads the dataset (not shown)
    print('dataset loaded')
    nemoASR.train('./dataset/etc/train.transcription', './dataset/train_manifest.json', './dataset/wav/an4_clstk',
                  './dataset/etc/test.transcription', './dataset/test_manifest.json', './dataset/wav/an4test_clstk')
    nemoASR.model.save_to('./model.sbc')
      
      
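The model saved with save_to can later be reloaded without retraining. A minimal sketch using NeMo's restore_from (the path mirrors the save_to call above):

import nemo.collections.asr as nemo_asr

# Reload the trained model from disk for later inference
model = nemo_asr.models.EncDecCTCModel.restore_from('./model.sbc')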



Let's check how the trained model recognizes an audio file:





files = ['./an4/wav/an4_clstk/mgah/cen2-mgah-b.wav']
# transcribe() returns the recognized text for each file in the list
for fname, transcription in zip(files, nemoASR.model.transcribe(paths2audio_files=files)):
    print(f"Audio in {fname} was recognized as: {transcription}")
      
      



As a result, we get the recognized text, which can then be checked against the conversation script.





NeMo has a number of advantages:





  • GPU-accelerated training and inference;





  • ready-made model architectures, such as Jasper, that can be trained from scratch;





  • support for related tasks such as text-to-speech (TTS) and speaker recognition.





Among the shortcomings, we can note the need to pull in a large number of heavyweight libraries, as well as the fact that the tool is relatively new and some of its features are still in beta.





While solving this speech recognition problem, I gained interesting experience with ASR models: I was able to train a model from scratch on an arbitrary dataset and achieved accuracy sufficient for reliable recognition of telephone conversations.





We suggest using this tool not only for speech recognition, but also for generating audio files based on text (TTS) and speaker recognition.







