MultiDoc2Dial Dataset

MultiDoc2Dial Dataset - v1.0

Reference

Please cite this paper if you use the dataset or baseline code.


@inproceedings{feng2021multidoc2dial,
    title={MultiDoc2Dial: Modeling Dialogues Grounded in Multiple Documents},
    author={Feng, Song and Patel, Siva Sankalp and Wan, Hui and Joshi, Sachindra},
    booktitle={EMNLP},
    year={2021}
}

Dataset Description

mutldoc2dial_doc.json contains the documents that are indexed by key domain and doc_id . Each document instance includes the following,
- doc_id: the ID of a document;
- title: the title of the document;
- domain: the domain of the document;
- doc_text: the text content of the document (without HTML markups);
- doc_html_ts: the document content with HTML markups and the annotated spans that are indicated by text_id attribute, which corresponds to id_sp.
- doc_html_raw: the document content with HTML markups and without span annotations.
- spans: key-value pairs of all spans in the document, with id_sp as key. Each span includes the following,
  - id_sp: the id of a span as noted by text_id in doc_html_ts;
  - start_sp/ end_sp: the start/end position of the text span in doc_text;
  - text_sp: the text content of the span.
  - id_sec: the id of the (sub)section (e.g. <p>) or title (<h2>) that contains the span.
  - start_sec / end_sec: the start/end position of the (sub)section in doc_text.
  - text_sec: the text of the (sub)section.
  - title: the title of the (sub)section.
  - parent_titles: the parent titles of the title.
multidoc2dial_dial_train.json and multidoc2dial_dial_validation.json contain the training and dev split of dialogue data that are indexed by key domain . Please note: For test split, we only include a dummy file in this version.
Each dialogue instance includes the following,
- dial_id: the ID of a dialogue;
- turns: a list of dialogue turns. Each turn includes,
  - turn_id: the time order of the turn;
  - role: either "agent" or "user";READ
  - da: dialogue act;
  - references: a list of spans with id_sp , label and doc_id. references is empty if a turn is for indicating previous user query not answerable or irrelevant to the document. Note that labels "precondition"/"solution" are fuzzy annotations that indicate whether a span is for describing a conditional context or a solution.
  - utterance: the human-generated utterance based on the dialogue scene.