NOTE: This dataset is a newer version than the one referred in our EMNLP paper.
Please cite the following paper if you use this data or baseline code:
@inproceedings{feng-etal-2020-doc2dial,
title = "doc2dial: A Goal-Oriented Document-Grounded Dialogue Dataset",
author = "Feng, Song and Wan, Hui and Gunasekara, Chulaka and Patel, Siva and Joshi, Sachindra and Lastras, Luis",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.652",
}
"doc2dial_doc.json" contains the document data that are indexed by key domain
and doc_id
. Each document instance includes the following,
doc_id
: the ID of a document;
title
: the title of the document;
domain
: the domain of the document;
doc_text
: the text content of the document (without HTML markups);
doc_html_ts
: the document content with HTML markups and the annotated spans that are indicated by text_id
attribute, which corresponds to id_sp
.
doc_html_raw
: the document content with HTML markups and without span annotations.
spans
: key-value pairs of all spans in the document, with id_sp
as key. Each span includes the following,
id_sp
: the id of a span as noted by text_id
in doc_html_ts
;start_sp
/ end_sp
: the start/end position of the text span in doc_text
;text_sp
: the text content of the span.id_sec
: the id of the (sub)section (e.g. <p>
) or title (<h2>
) that contains the span.start_sec
/ end_sec
: the start/end position of the (sub)section in doc_text
.text_sec
: the text of the (sub)section.title
: the title of the (sub)section.parent_titles
: the parent titles of the title
."doc2dial_dial_train.json" and "doc2dial_dial_dev.json" contain the training and dev split of dialogue data that are indexed by key domain
and doc_id
. The dialogues in "wOOD" folder includes irrelevant turns. Each dialogue instance includes the following,
dial_id
: the ID of a dialogue;
doc_id
: the ID of the associated document;
domain
: domain of the document;
turns
: a list of dialogue turns. Each turn includes,
turn_id
: the time order of the turn;role
: either "agent" or "user";da
: dialogue act;reference
: the corresponding span (id_sp
) in the associated document. If a turn is an irrelevant turn, i.e., da
ends with "ood", reference
is empty. Note that spans with labels "precondition"/"solution" are the actual grounding spans. Spans with label "reference" are the related titles or contextual reference, which is used for the purpose of describing a dialogue scene better to crowd contributors.utterance
: the human-generated utterance based on the dialogue scene.