## Doc2Dial Dataset - v0.9.1
### Reference
Please cite this paper paper if you use the dataset or baseline code.
```bibtex
@inproceedings{feng-etal-2020-doc2dial,
title = "doc2dial: A Goal-Oriented Document-Grounded Dialogue Dataset",
author = "Feng, Song and Wan, Hui and Gunasekara, Chulaka and Patel, Siva and Joshi, Sachindra and Lastras, Luis",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.652",
}
```
### Dataset Description
- *doc2dial_doc.json* contains the documents that are indexed by key domain and doc_id . Each document instance includes the following,
- doc_id: the ID of a document;
- title: the title of the document;
- domain: the domain of the document;
- doc_text: the text content of the document (without HTML markups);
- doc_html_ts: the document content with HTML markups and the annotated spans that are indicated by text_id attribute, which corresponds to id_sp.
- doc_html_raw: the document content with HTML markups and without span annotations.
- spans: key-value pairs of all spans in the document, with id_sp as key. Each span includes the following,
- id_sp: the id of a span as noted by text_id in doc_html_ts;
- start_sp/ end_sp: the start/end position of the text span in doc_text;
- text_sp: the text content of the span.
- id_sec: the id of the (sub)section (e.g. <p>) or title (<h2>) that contains the span.
- start_sec / end_sec: the start/end position of the (sub)section in doc_text.
- text_sec: the text of the (sub)section.
- title: the title of the (sub)section.
- parent_titles: the parent titles of the title.
- *doc2dial_dial_train.json* and *doc2dial_dial_validation.json* contain the training and dev split of dialogue data that are indexed by key domain and doc_id. Each dialogue instance includes the following,
- dial_id: the ID of a dialogue;
- doc_id: the ID of the associated document;
- domain: domain of the document;
- turns: a list of dialogue turns. Each turn includes,
- turn_id: the time order of the turn;
- role: either "agent" or "user";
- da: dialogue act;
- reference: the corresponding span (id_sp) in the associated document. If a turn is an irrelevant turn, i.e., da ends with "ood", reference is empty. *Note* that spans with labels "precondition"/"solution" are the actual grounding spans. Spans with label "reference" are the related titles or contextual reference, which is used for the purpose of describing a dialogue scene better to crowd contributors.
- utterance: the human-generated utterance based on the dialogue scene.