MilkQA Dataset

Introduction

MilkQA is a dataset of dense questions for the task of answer selection. It contains questions and answers from the dairy farming domain that were collected by the customer service of Embrapa Dairy Cattle between 2003 and 2012.

The dataset currently contains 2,657 anonymized question-answer pairs and is organized in three partitions: training, development and test, which contain 2,307, 50 and 300 questions, respectively. Each question is associated with a pool of 50 candidate answers, of which only one is correct.
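As an illustration of how a pool-based answer selection setup like the one above can be scored, the following is a minimal Python sketch. It is not taken from the MilkQA distribution or the paper: the `Example` record, the `rank_of_correct` and `evaluate` helpers, and the choice of accuracy@1 and mean reciprocal rank (MRR) are all assumptions made for this example, and the record layout is a hypothetical in-memory structure rather than the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Example:
    """Hypothetical in-memory record; NOT the dataset's actual file format."""
    question: str
    candidates: List[str]  # pool of candidate answers (50 per question in MilkQA)
    correct_index: int     # position of the single correct answer in the pool


def rank_of_correct(scores: Sequence[float], correct_index: int) -> int:
    """Return the 1-based rank of the correct answer (higher score is better).

    Ties are resolved pessimistically: any other candidate with an equal
    score is counted as ranked ahead of the correct answer.
    """
    target = scores[correct_index]
    ahead = sum(
        1 for i, s in enumerate(scores) if i != correct_index and s >= target
    )
    return ahead + 1


def evaluate(
    examples: Sequence[Example],
    score_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score every candidate in each pool and aggregate ranking metrics."""
    ranks = []
    for ex in examples:
        scores = [score_fn(ex.question, cand) for cand in ex.candidates]
        ranks.append(rank_of_correct(scores, ex.correct_index))
    n = len(ranks)
    return {
        "accuracy@1": sum(1 for r in ranks if r == 1) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
    }
```

Here `score_fn` stands for any model that assigns a relevance score to a (question, candidate answer) pair. Because each pool contains exactly one correct answer, a single rank per question is enough to compute both metrics.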

MilkQA is composed of challenging questions that differ from those typically addressed in Question Answering. In our work, we call them consumer questions. These questions usually arise when people seek solutions to a problem, and they have distinctive characteristics, such as greater length and a lack of objectivity.

Details about MilkQA are provided in the paper referenced below (available here). Please consider citing this paper if you use our dataset.

Citation

@inproceedings{criscuolo2017milkqa,
    author = {Marcelo Criscuolo and Erick Rocha Fonseca and Sandra Maria Aluísio and Ana Carolina Sperança-Criscuolo},
    title = {{MilkQA}: a Dataset of Consumer Questions for the Task of Answer Selection},
    booktitle = {Proceedings of the 6th Brazilian Conference on Intelligent Systems (BRACIS)},
    year = {2017},
    month = {October},
    date = {2-5},
    address = {Uberlândia, Brazil},
    publisher = {IEEE},
    isbn = {978-1-5386-2407-4},
    pages = {354--359},
    volume = {1},
    doi = {10.1109/BRACIS.2017.12},
}

License

MilkQA is published by the Interinstitutional Center for Computational Linguistics (NILC) of the University of São Paulo (USP) under a Creative Commons license with the Attribution, NonCommercial and NoDerivatives clauses (CC BY-NC-ND).

Download

Download MilkQA here.


© 2017 NILC - Núcleo Interinstitucional de Linguística Computacional

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.