To quickly resolve user issues without human intervention, an effective chatbot requires a huge amount of training data. However, the main bottleneck in chatbot development is getting realistic, task-oriented conversational data to train these systems using machine learning techniques. We have compiled a list of the best conversation datasets for chatbots, broken down into Q&A, customer service, dialogue, and multilingual data.
Q&A datasets for training chatbots
Link . This corpus includes Wikipedia articles, hand-generated factual questions, and hand-generated answers to those questions for use in scientific research.
WikiQA corpus. A publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. To reflect the real information needs of ordinary users, the authors used Bing query logs as the source of questions. Each question is linked to a Wikipedia page that potentially contains the answer.
Yahoo Language Data. This page presents hand-curated QA datasets from Yahoo Answers.
TREC QA Collection. TREC (the Text REtrieval Conference) has had a question answering track since 1999. In each track, the task was defined so that systems had to retrieve small fragments of text containing the answer to open-domain, closed-class questions, that is, fact-based questions with short answers.
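Corpora built from question and sentence pairs, as described above, are often regrouped so that each question carries its own list of candidate sentences. The following is a minimal sketch in Python, assuming the pairs have already been loaded as (question, sentence, label) tuples with label 1 marking a correct answer. The function names and the sample rows are hypothetical and do not reflect the actual file layout of any dataset in this list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Candidate = Tuple[str, int]  # (candidate sentence, label); label 1 = answers the question


def group_candidates(rows: Iterable[Tuple[str, str, int]]) -> Dict[str, List[Candidate]]:
    """Group (question, sentence, label) rows by question text."""
    grouped = defaultdict(list)
    for question, sentence, label in rows:
        grouped[question].append((sentence, label))
    return dict(grouped)


def keep_answerable(grouped: Dict[str, List[Candidate]]) -> Dict[str, List[Candidate]]:
    """Keep only questions that have at least one correct candidate sentence."""
    return {
        question: candidates
        for question, candidates in grouped.items()
        if any(label == 1 for _, label in candidates)
    }


# Made-up rows for illustration only; not taken from any corpus.
rows = [
    ("What is the capital of France?", "Paris is the capital of France.", 1),
    ("What is the capital of France?", "France is in Western Europe.", 0),
    ("Who wrote Hamlet?", "Hamlet is a tragedy.", 0),
]

grouped = group_candidates(rows)
answerable = keep_answerable(grouped)
print(len(grouped), "questions,", len(answerable), "with a correct candidate")
```

Filtering out questions without a correct candidate is one possible design choice; whether to keep them depends on how the model is evaluated.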
Customer support datasets for training chatbots
Ubuntu Dialogue Corpus. Consists of nearly a million two-person conversations extracted from Ubuntu chat logs, in which users seek technical support on various Ubuntu-related issues. The set contains 930,000 dialogues and over 100,000,000 words.
Relational Strategies in Customer Service Dataset. A collection of travel-related customer service data from four sources: conversation logs from three commercial customer service IVAs (intelligent virtual assistants) and the Airline forums on TripAdvisor.com during August 2016.
Twitter customer support. This dataset on Kaggle includes over 3,000,000 tweets and replies from the biggest brands on Twitter.
Dialogue datasets for training chatbots
Semantic Web Interest Group IRC Chat Logs. These automatically generated IRC chat logs are available in RDF and have been updated daily since 2004, including timestamps and nicknames.
Cornell Movie-Dialogs Corpus. This corpus contains a large, metadata-rich collection of fictional conversations from movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 films.
ConvAI2 Dataset. This dataset contains over 2,000 conversations collected for the PersonaChat contest, in which people working on the Yandex.Toloka crowdsourcing platform chatted with bots from participating teams.
Santa Barbara Corpus of Spoken American English. This dataset includes approximately 249,000 words of transcription, audio, and timestamps at the level of individual intonation units.
NPS chat corpus. This corpus consists of 10,567 posts out of approximately 500,000 posts collected from various online chat rooms in accordance with their terms of service.
Maluuba goal-oriented dialogues. A dataset of conversations focused on completing a task or making a decision, such as finding flights and hotels. It contains comprehensive information covering over 250 hotels, flights, and destinations.
Multi-Domain Wizard-of-Oz dataset (MultiWOZ). A fully labeled collection of written conversations spanning multiple domains and topics. The set contains 10,000 dialogues, at least an order of magnitude more than any previous annotated task-oriented corpus.
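Dialogue corpora like the ones above are commonly turned into (context, response) pairs before training a response model. Here is a minimal sketch in Python, assuming a dialogue is already available as a list of utterance strings. The function name, the `SEP` separator token, and the sample dialogue are all hypothetical; none of the listed corpora ships in exactly this form, and each needs its own loader.

```python
from typing import List, Tuple

# Hypothetical separator placed between context utterances; real models
# usually define their own special token for this.
SEP = " [SEP] "


def dialogue_to_pairs(turns: List[str], max_context: int = 3) -> List[Tuple[str, str]]:
    """Turn one dialogue into (context, response) training pairs.

    Every utterance after the first becomes a response, and up to
    `max_context` preceding utterances are joined to form its context.
    """
    pairs = []
    for i in range(1, len(turns)):
        start = max(0, i - max_context)
        context = SEP.join(turns[start:i])
        pairs.append((context, turns[i]))
    return pairs


# Made-up dialogue for illustration only; not taken from any corpus.
sample = [
    "My wifi stopped working after the update.",
    "Which release are you on?",
    "The latest LTS.",
    "Try reinstalling the driver package.",
]

for context, response in dialogue_to_pairs(sample, max_context=2):
    print(repr(context), "->", repr(response))
```

Limiting the context window keeps inputs short; a larger `max_context` gives the model more history at the cost of longer sequences.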
Datasets for training multilingual bots
NUS Corpus. This corpus was created for normalizing and translating social media text. It was built by randomly selecting 2,000 messages from the NUS English SMS corpus, which were then translated into formal Chinese.
EXCITEMENT datasets. Available in English and Italian, these datasets contain negative customer feedback in which customers state their reasons for dissatisfaction with a company.
Still can't find the data you're looking for? Lionbridge AI provides custom data for training chatbots with machine learning in 300 languages, to make your conversations more interactive and to support customers around the world.