Voice bot + telephony, fully open source. Part 1: creating and training a text bot



Voice robots are hugely popular these days, handling everything from taking taxi orders to making sales calls. Building a voice bot comes down to three basic steps.



  1. Speech recognition (ASR).
  2. Extracting the meaning of what was said and finding the necessary entities in the text (for example, an address, an amount, a full name, etc.).
  3. Generating a response and converting the text to speech (TTS).

We will go from creating a simple text bot to integrating it with the FreeSWITCH telephony system, with voice recognition and playback of prepared responses. This article describes the tools used and how to integrate them into a voice robot.


In the first part, we'll talk about creating a simple text bot that you can embed in a chat.



Example of a conversation (B is the bot, H is the human). The sample dialogue in the original article is in Russian and is not reproduced here.


A bit of theory

The bot works on the principle of user intents. Each intent has a list of prepared answers. For the bot to understand the user's intent, the model must be trained on a dataset of intents and the phrases that can trigger each intent.



For example



Intention: Say hello

Possible phrases: hello, good afternoon, greetings ...

Answer: Hello



Intention: Say goodbye

Possible phrases: bye, see you, farewell ...

Answer: Bye
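An intent dataset like the examples above can be kept in a plain dictionary. The structure below is a hypothetical sketch (all names are illustrative), not the format of the original dataset:

```python
# A minimal intent dataset: each intent maps trigger phrases to a canned answer.
intents = {
    "hello": {
        "phrases": ["hello", "good afternoon", "greetings"],
        "answer": "Hello",
    },
    "goodbye": {
        "phrases": ["bye", "see you", "farewell"],
        "answer": "Bye",
    },
}

# Flatten into (phrase, intent) training pairs for the classifier.
training_pairs = [
    (phrase, intent)
    for intent, data in intents.items()
    for phrase in data["phrases"]
]
```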



Step 1: preprocessing the dataset



The bot is based on a dataset from the Skillbox open course on writing a Telegram chat bot that can talk to you about cinema. I can't publish the dataset for obvious reasons.

Pre-processing is a very important step.



The first step is to remove all punctuation marks and digits from the text and convert everything to lower case.
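A minimal sketch of this cleanup step, assuming we want to keep only Latin and Cyrillic letters and spaces:

```python
import re

def clean_text(text: str) -> str:
    """Lower-case the text and drop everything except letters and spaces."""
    text = text.lower()
    # Keep Latin and Cyrillic letters plus spaces; drop digits and punctuation.
    text = re.sub(r"[^a-zа-яё ]", " ", text)
    # Collapse runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()
```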



Next, you need to correct typos and mistakes in words.





This is not an easy task. Yandex has a good tool for it called Speller, but it is limited in the number of requests per day, so we will look for free alternatives.

For Python there is a wonderful library, jamspell, that corrects typos well, and there is a good pre-trained Russian-language model for it. We run all input data through this library. For a voice bot this step is less relevant: a speech recognition system should not output misspelled words, although it can output the wrong word. For a chat bot, however, this step is necessary. Also, to minimize the influence of typos, you can train the network on n-grams instead of whole words.



N-grams are n-letter fragments of words. For example, the 3-grams of the Russian word привет are при, рив, иве, вет (for the English word hello they would be hel, ell, llo). This makes the bot less dependent on typos and increases recognition accuracy.
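Splitting a word into overlapping character n-grams takes only a few lines; this helper is an illustrative sketch:

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return all overlapping character n-grams of a word.

    Words shorter than n are returned as-is in a single-element list.
    """
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```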



Next, the words must be reduced to their normal form, a process known as lemmatization.





The rulemma library is well suited for this task.



You can also remove stop words, which carry little semantic meaning but increase the size of the neural network (I took stopwords.words("russian") from the nltk library). In our case, however, it is better to keep them: the user may answer the robot with a single word, and that word may well be on the stop-word list.



Step 2: converting the dataset into a form the neural network understands



First, you need to create a dictionary of all the words in the dataset.



To train the model, every word must be translated into a one-hot vector.

This is an array whose length equals the size of the word dictionary, in which all values are 0 and a single 1 stands at the word's position in the dictionary.



Next, all input phrases are converted into a 3-dimensional array: it holds every phrase, and each phrase is the list of its words in one-hot format. This will be our input dataset X_train.



Each input phrase must be matched with the corresponding intent, encoded in the same one-hot format; this is our output y_train.
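Assuming a word dictionary built from the dataset, steps like these produce X_train (names here are illustrative, not the original code):

```python
import numpy as np

def build_vocab(phrases):
    """Map each unique word in the dataset to an integer index."""
    words = sorted({w for p in phrases for w in p.split()})
    return {w: i for i, w in enumerate(words)}

def encode_phrase(phrase, vocab, max_len):
    """Turn a phrase into a (max_len, vocab_size) matrix of one-hot rows."""
    mat = np.zeros((max_len, len(vocab)), dtype=np.float32)
    for pos, word in enumerate(phrase.split()[:max_len]):
        if word in vocab:
            mat[pos, vocab[word]] = 1.0
    return mat

phrases = ["hello there", "bye now"]
vocab = build_vocab(phrases)
# Stack per-phrase matrices into a 3-D array: (phrases, words, vocab_size).
X_train = np.stack([encode_phrase(p, vocab, max_len=4) for p in phrases])
```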



Step 3: creating the model



For a small bot, a compact model with two LSTM layers and two fully connected layers is enough:



from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

model = Sequential()
# Two stacked LSTM layers read the sequence of one-hot word vectors.
model.add(LSTM(64, return_sequences=True, input_shape=(description_length, num_encoder_tokens)))
model.add(LSTM(32))
model.add(Dropout(0.25))
# Two fully connected layers turn the sequence summary into intent scores.
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(len(set(y)), activation='softmax'))  # one output per intent


We compile the model and choose an optimizer; I picked adam, as it gave the best result.



Step 4: train the model



After preparing the dataset and compiling the model, you can start training. Since the dataset is small, I had to train the model for 250-500 epochs, after which it began to overfit.



Step 5: trying to talk to our bot



To talk to our bot, you need to feed correctly prepared data into the trained model. The user's input must be processed in the same way as the dataset in step one, then transformed into a form the network understands, as in step two, using the same dictionary of words and indices so that the input words match the words the model was trained on.
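A hypothetical sketch of encoding user input with the same training-time vocabulary (the helper name is illustrative; typo correction and lemmatization would be applied before this call):

```python
import numpy as np

def prepare_input(user_text: str, vocab: dict, max_len: int) -> np.ndarray:
    """Encode user text with the SAME vocabulary used during training."""
    mat = np.zeros((1, max_len, len(vocab)), dtype=np.float32)  # batch of one
    for pos, word in enumerate(user_text.lower().split()[:max_len]):
        if word in vocab:                      # unknown words stay all-zero
            mat[0, pos, vocab[word]] = 1.0
    return mat
```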



The processed input is fed into the model, which returns an array of probabilities of the phrase matching each intent. We need the intent with the highest probability, which can be found with the numpy library:



np.argmax(results)


You also need to assess the network's confidence in this answer and pick a threshold below which the bot falls back to a failure phrase such as "I don't understand you". For my purposes I set the threshold at 50% confidence; below it, the bot says it did not understand you.
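The argmax-plus-threshold logic can be sketched like this (the function name and intent labels are illustrative):

```python
import numpy as np

def pick_intent(probs, intents, threshold=0.5):
    """Return the most likely intent, or None when confidence is too low."""
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return None                    # caller answers "I don't understand you"
    return intents[best]

intents = ["greet", "goodbye", "movies"]
print(pick_intent(np.array([0.9, 0.05, 0.05]), intents))  # prints: greet
print(pick_intent(np.array([0.4, 0.35, 0.25]), intents))  # prints: None
```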



Next, we select the appropriate answer from our list of intents and return it to the user.



P.S.: The model can be trained not only on words but also on phrases split into letters or n-grams; in that case a heavier model will be needed.



model = Sequential()
model.add(LSTM(512,return_sequences=True,input_shape=(description_length, num_encoder_tokens)))
model.add(LSTM(256))
model.add(Dropout(0.25))
model.add(Dense(len(set(y)), activation='softmax'))


