15 best datasets for chatbot training

To resolve user issues quickly without human intervention, an effective chatbot needs a large amount of training data. The main bottleneck in chatbot development, however, is obtaining realistic, task-oriented conversational data for training these systems with machine learning. To mark the start of a new cohort of the Machine Learning course, I am sharing a list of the best conversational datasets, grouped into question-answer data, customer support data, dialogue data, and multilingual data.

















Q&A datasets for training chatbots



Link. This corpus includes Wikipedia articles, manually written factual questions about them, and manually written answers to those questions, intended for use in academic research.



WikiQA corpus. A publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. To reflect the real information needs of ordinary users, the authors used Bing query logs as the source of questions. Each question is linked to a Wikipedia page that potentially contains the answer.



Yahoo Language Data. This page presents manually curated QA datasets drawn from Yahoo Answers.



TREC (Text REtrieval Conference) QA Collection. TREC has run a question answering track since 1999. In each track, the task was defined so that systems had to retrieve small fragments of text containing the answer to open-domain, closed-class questions, that is, fact-based questions with short answers.



Customer support datasets



Ubuntu Dialogue Corpus. Consists of nearly a million two-person conversations extracted from Ubuntu chat logs, in which users sought technical support on various Ubuntu-related issues. The set contains 930,000 dialogues and over 100,000,000 words.
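
Two-person support conversations like these are often converted into (context, response) examples before training. The sketch below shows one possible way to do that; the sample dialogue, the `SEPARATOR` string, and the `max_context` parameter are assumptions made for illustration and are not part of the corpus or its tooling.

```python
from typing import List, Tuple

# Arbitrary marker placed between turns in the context string (an assumption).
SEPARATOR = " | "


def context_response_pairs(turns: List[str], max_context: int = 3) -> List[Tuple[str, str]]:
    """Turn one dialogue into (context, response) training examples.

    Each turn after the first becomes a response; the context is made of
    up to `max_context` turns that immediately precede it.
    """
    pairs = []
    for i in range(1, len(turns)):
        context = SEPARATOR.join(turns[max(0, i - max_context):i])
        pairs.append((context, turns[i]))
    return pairs


# Hypothetical dialogue, not taken from any dataset.
dialogue = [
    "my wifi stopped working after the update",
    "which release are you on?",
    "20.04",
    "try reinstalling the driver package",
]

pairs = context_response_pairs(dialogue, max_context=2)
for context, response in pairs:
    print(repr(context), "->", repr(response))
```

A small `max_context` keeps inputs short; a larger one gives the model more history at the cost of longer sequences.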



Relational Strategies in Customer Service Dataset. A collection of travel-related customer service data from four sources: conversation logs from three commercial customer service IVAs (intelligent virtual assistants) and the airline forums on TripAdvisor.com during August 2016.



Twitter customer support. This dataset on Kaggle includes over 3,000,000 tweets and replies from the biggest brands on Twitter.
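
For tweet-and-reply data of this kind, a common first step is to join each brand reply to the customer message it answers. The sketch below assumes a CSV with the columns `tweet_id`, `author_id`, `inbound`, `text`, and `in_response_to_tweet_id`; these column names and the sample rows are assumptions for illustration, so check them against the header of the actual file before relying on them.

```python
import csv
import io

# Hypothetical sample in the assumed layout; not real data.
SAMPLE_CSV = """tweet_id,author_id,inbound,text,in_response_to_tweet_id
1,customer_a,True,My order never arrived,
2,BrandSupport,False,Sorry to hear that! Can you DM us your order number?,1
3,customer_b,True,App keeps crashing on login,
"""


def pair_requests_with_replies(csv_text):
    """Return (customer_message, brand_reply) pairs from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    by_id = {row["tweet_id"]: row for row in rows}
    pairs = []
    for row in rows:
        parent = by_id.get(row["in_response_to_tweet_id"])
        # Keep only brand replies (inbound == "False") to customer tweets.
        if row["inbound"] == "False" and parent and parent["inbound"] == "True":
            pairs.append((parent["text"], row["text"]))
    return pairs


support_pairs = pair_requests_with_replies(SAMPLE_CSV)
print(support_pairs)
```

Customer tweets that never received a reply, such as the third sample row, simply produce no pair.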



Dialogue datasets for chatbot training



Semantic Web Interest Group IRC Chat Logs. These automatically generated IRC chat logs are available in RDF. They have been updated daily since 2004 and include timestamps and nicknames.



Cornell Movie-Dialogs Corpus. This corpus contains a large, metadata-rich collection of fictional conversations extracted from movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies.
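
As a rough illustration of how exchanges between pairs of characters can be rebuilt from such a corpus, the sketch below assumes two kinds of records: utterance lines with five fields (line ID, character ID, movie ID, character name, text) and conversation records whose last field is a list of line IDs, with fields separated by ` +++$+++ `. This layout and the sample records are assumptions for the example; verify them against the README shipped with the corpus.

```python
import ast

# Assumed field separator; confirm against the corpus README.
FIELD_SEP = " +++$+++ "

# Hypothetical records in the assumed layout; not taken from the corpus.
SAMPLE_LINES = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Are you coming tonight?",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ I would not miss it.",
    "L3 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Good. Bring the map.",
]
SAMPLE_CONVERSATIONS = [
    "u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2', 'L3']",
]


def load_lines(raw_lines):
    """Map each line ID to its utterance text, skipping malformed records."""
    id_to_text = {}
    for raw in raw_lines:
        parts = raw.split(FIELD_SEP)
        if len(parts) == 5:
            line_id, _char_id, _movie_id, _name, text = parts
            id_to_text[line_id] = text
    return id_to_text


def exchanges(raw_conversations, id_to_text):
    """Return (utterance, reply) pairs for consecutive lines in each conversation."""
    pairs = []
    for raw in raw_conversations:
        # The last field is a Python-style list literal of line IDs.
        line_ids = ast.literal_eval(raw.split(FIELD_SEP)[-1])
        texts = [id_to_text[i] for i in line_ids if i in id_to_text]
        pairs.extend(zip(texts, texts[1:]))
    return pairs


movie_pairs = exchanges(SAMPLE_CONVERSATIONS, load_lines(SAMPLE_LINES))
for utterance, reply in movie_pairs:
    print(utterance, "->", reply)
```

`ast.literal_eval` is used instead of `eval` so that only plain literals in the ID list are accepted.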



ConvAI2 Dataset. This dataset contains over 2,000 dialogues from the PersonaChat competition, in which crowdworkers on the Yandex.Toloka platform chatted with bots submitted by participating teams.



Santa Barbara Corpus of Spoken American English. This dataset includes approximately 249,000 words of transcription, together with audio and timestamps at the level of individual intonation units.



NPS Chat Corpus. This corpus consists of 10,567 messages selected from approximately 500,000 messages that were collected in various online chat rooms in accordance with their terms of service.



Maluuba goal-oriented dialogues. A dataset of conversations focused on completing a task or making a decision, such as finding flights and hotels. It contains comprehensive information covering over 250 hotels, flights, and destinations.



Multi-Domain Wizard-of-Oz dataset (MultiWOZ). A fully annotated collection of written conversations spanning multiple domains and topics. The set contains 10,000 dialogues, which makes it at least an order of magnitude larger than all previous annotated task-oriented corpora.



Datasets for training multilingual chatbots



NUS Corpus. This corpus was created for normalizing and translating text from social networks. It was built by randomly selecting 2,000 messages from the NUS English SMS corpus, which were then translated into formal Chinese.



EXCITEMENT datasets. Available in English and Italian, these datasets contain negative customer feedback in which customers state their reasons for dissatisfaction with a company.



Still can't find the data you're looking for? Lionbridge AI provides custom chatbot training data in 300 languages to make your conversations more interactive and to support customers around the world. And if you want to improve your machine learning skills, come to our advanced ML course, and don't forget the HABR promo code, which adds 10% to the discount shown on the banner.


