The LUNA Corpus Discourse Data Set consists of 60 dialogs from the Italian LUNA Human-Human Corpus in the hardware/software help desk domain, annotated following the Penn Discourse Treebank (PDTB) guidelines. The data set contains a total of 1,606 discourse relations; 1,052 are explicit discourse relations.
The dialogs are split into training (section 02), development (section 01), and test (section 03) sets of 42, 6, and 12 dialogs, respectively.
Each dialog (file) is stored as a JSON file that has the following structure:
{
"DOC_ID": "numeric part of a filename",
"tokens": "flat list of tokens",
"blocks": "list of token start & end indices for blocks in text file (tab-separated)",
"groups": "list of token start & end indices for groups in text file (newline-separated)",
"relations": "list of discourse relations"
}
For example (reduced):
{
"DOC_ID": "0703000001",
"tokens": [
"helpdesk", "buongiorno", "sono", "<PER>",
"s\u00ec", "sono", "<PER>", "un", "collega",
"ho", "il",
"PC",
"che", "presumibilmente", "non", "funziona", "da",
"s\u00ec", "stamattina"
],
"blocks": [[0, 4], [4, 9], [9, 11], [11, 12], [12, 17]],
"groups": [[0, 4], [4, 17]],
"relations": [
{
"label": "Implicit",
"sense": "Expansion.Conjunction",
"conns": "e",
"conn": [],
"arg1": [[5, 9]],
"arg2": [[9, 17], [18, 19]],
"sup1": [],
"sup2": []
},
{
"label": "Explicit",
"sense": "Expansion.Restatement.Equivalence",
"conn": [[59, 60]],
"arg1": [[5, 7]],
"arg2": [[60, 63]],
"sup1": [],
"sup2": []
},
{
"label": "AltLex",
"sense": "Expansion.Restatement",
"conn": [[159, 161]],
"arg1": [[141, 144], [151, 154], [169, 171]],
"arg2": [[157, 164]],
"sup1": [[137, 141]],
"sup2": []
}
]
}
Below are the schemas for a relation and a dialog (in dataclass format).
import typing as t
from dataclasses import dataclass


@dataclass
class DiscourseRelation:
    # label(s)
    label: str  # relation type: Explicit, Implicit, AltLex, or EntRel
    sense: str  # relation sense
    conns: str  # connective string (for Implicit)
    # spans: lists of (start, end) token indices
    conn: t.Optional[t.List[t.Tuple[int, int]]] = None
    arg1: t.Optional[t.List[t.Tuple[int, int]]] = None
    arg2: t.Optional[t.List[t.Tuple[int, int]]] = None
    sup1: t.Optional[t.List[t.Tuple[int, int]]] = None
    sup2: t.Optional[t.List[t.Tuple[int, int]]] = None


@dataclass
class Dialog:
    doc_id: str
    tokens: t.List[str]
    blocks: t.Optional[t.List[t.Tuple[int, int]]] = None
    groups: t.Optional[t.List[t.Tuple[int, int]]] = None
    relations: t.Optional[t.List[DiscourseRelation]] = None

A Discourse Relation can contain 5 spans: a discourse connective (conn),
its arguments (arg1 and arg2), and supplementary materials to the arguments (sup1 and sup2).
Each span can be composed of 0 or more non-adjacent segments.
Consequently, all spans are lists of start & end indices with respect to tokens;
e.g. [[141, 144], [151, 154], [169, 171]].
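
As an illustration of how spans map back to text, here is a minimal sketch that collects the tokens covered by a span. It assumes that the end index of each segment is exclusive, which is what the contiguous `blocks` in the reduced example suggest; `span_tokens` is a hypothetical helper, not part of the data set.

```python
import typing as t

Span = t.List[t.Tuple[int, int]]


def span_tokens(tokens: t.List[str], span: Span) -> t.List[str]:
    """Collect the tokens covered by every segment of a span, in order."""
    covered: t.List[str] = []
    for start, end in span:
        # Assumption: segments are [start, end), i.e. the end index is exclusive.
        covered.extend(tokens[start:end])
    return covered


# Tokens and spans copied from the reduced example above.
tokens = [
    "helpdesk", "buongiorno", "sono", "<PER>",
    "s\u00ec", "sono", "<PER>", "un", "collega",
    "ho", "il", "PC",
    "che", "presumibilmente", "non", "funziona", "da",
    "s\u00ec", "stamattina",
]
arg1 = [[5, 9]]
arg2 = [[9, 17], [18, 19]]

print(" ".join(span_tokens(tokens, arg1)))
print(" ".join(span_tokens(tokens, arg2)))
print(span_tokens(tokens, []))  # an empty span covers no tokens
```

Under this reading, the second segment of `arg2` skips the interjected token at index 17, which is how a span with non-adjacent segments excludes intervening material.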
Since LUNA follows the PDTB format, the Discourse Relation types are the same. Their distribution is given below.
| Type | ALL | TRN | DEV | TST |
|---|---|---|---|---|
| Explicit | 1,052 | 659 | 135 | 258 |
| Implicit | 490 | 294 | 74 | 122 |
| AltLex | 11 | 8 | 2 | 1 |
| EntRel | 56 | 33 | 7 | 16 |
A Discourse Relation can have several senses, depending on the Relation Type:
- Explicit relations can have up to 2 senses.
- Implicit relations can have up to 4 senses: 2 connectives with 2 senses each.
- AltLex relations are treated as Explicit relations.
- EntRel relations have no senses.

The observed sense counts are the following:
- 0: no sense (errors)
- 1s: 1 sense
- 2s: 2 senses
- 2c: 2 connectives, 1 sense each

| Type | ALL | 0 | 1s | 2s | 2c |
|---|---|---|---|---|---|
| Explicit | 1,052 | 4 | 1,045 | 3 | NA |
| Implicit | 490 | 3 | 481 | 3 | 3 |
| AltLex | 11 | 1 | 10 | NA | NA |
| EntRel | 56 | NA | NA | NA | NA |
Since very few discourse relations have a second sense
(3 Explicit and 3 Implicit with a second sense, and 3 Implicit with a second connective),
all the discourse relations have been "simplified" to have exactly 1 sense (or 0, if missing).
If more than 1 sense is available, the first one is selected.
For Implicit relations with 2 connectives, it is the 1st sense of the 1st connective.
LUNA (and PDTB) Discourse Relation senses have 3+ levels;
e.g. Comparison.Concession.Epistemic concession.
Relations are often annotated only up to a certain level;
i.e. not all relations have all 3 levels.
PDTB has 4 Level 1 senses: Comparison, Contingency, Expansion, and Temporal.
LUNA adds 3 more, which have only 1 level:
- Discourse Marker
- Interrupted
- Repetition

While the Interrupted and Repetition senses are quite frequent, Discourse Marker appears only once.
| Sense | Explicit | Implicit | AltLex |
|---|---|---|---|
| Comparison | 187 | 47 | 0 |
| Contingency | 462 | 106 | 3 |
| Expansion | 213 | 161 | 4 |
| Temporal | 156 | 64 | 0 |
| Interrupted | 29 | 1 | 0 |
| Repetition | 0 | 108 | 0 |
| Discourse Marker | 1 | 0 | 0 |
| MISSING | 4 | 3 | 1 |
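
A table like the one above could be recomputed by truncating each dot-separated sense to its first level. The sketch below is one way to do it; `sense_at_level` and `count_senses` are hypothetical helpers, the sample relations are illustrative rather than taken from the data, and treating an empty or absent `sense` as `MISSING` is an assumption about how missing senses are stored.

```python
import typing as t
from collections import Counter


def sense_at_level(sense: t.Optional[str], level: int = 1) -> str:
    """Truncate a dot-separated sense to its first `level` levels."""
    if not sense:
        return "MISSING"  # assumption: a missing sense is empty or None
    return ".".join(sense.split(".")[:level])


def count_senses(relations: t.Iterable[dict], level: int = 1) -> Counter:
    """Count (relation type, truncated sense) pairs, skipping EntRel."""
    counts: Counter = Counter()
    for rel in relations:
        if rel["label"] == "EntRel":
            continue  # EntRel relations have no senses
        counts[(rel["label"], sense_at_level(rel.get("sense"), level))] += 1
    return counts


# Illustrative relations (not real corpus rows); only label and sense are needed.
sample = [
    {"label": "Implicit", "sense": "Expansion.Conjunction"},
    {"label": "Explicit", "sense": "Expansion.Restatement.Equivalence"},
    {"label": "Explicit", "sense": "Comparison.Concession.Epistemic concession"},
    {"label": "Explicit", "sense": "Interrupted"},
    {"label": "EntRel"},
]

for (label, sense), n in sorted(count_senses(sample).items()):
    print(label, sense, n)
```

Passing `level=2` gives the finer-grained grouping; senses that stop at an earlier level are returned unchanged.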
Even though most relations have a level 2 sense, a relation can have a level 1 sense only.
The 3rd level further categorizes L2 relations into the following types
(as in Comparison.Concession.Epistemic concession, Contingency.Cause.Semantic cause, etc.):
- Epistemic
- Inferential
- Pragmatic
- Propositional
- Semantic
- Speech act

Refer to Tonelli et al. (2010) for further detail.
Temporal sense has no 3rd level, i.e. only:
- Temporal.Asynchronous
- Temporal.Synchrony

Expansion.Restatement on level 3 is further categorized into:
- Expansion.Restatement.Equivalence
- Expansion.Restatement.Specification

The table below contains sense counts as they appear in the data.
| Sense | Explicit | Implicit | AltLex |
|---|---|---|---|
| Comparison (no L2) | 1 | 0 | 0 |
| Comparison.Concession | 144 | 27 | 0 |
| Comparison.Contrast | 42 | 20 | 0 |
| Contingency (no L2) | 1 | 0 | 0 |
| Contingency.Cause | 265 | 88 | 2 |
| Contingency.Condition | 124 | 8 | 1 |
| Contingency.Goal | 73 | 10 | 1 |
| Expansion (no L2) | 1 | 0 | 0 |
| Expansion.Alternative | 28 | 3 | 1 |
| Expansion.Conjunction | 111 | 70 | 1 |
| Expansion.Instantiation | 8 | 3 | 1 |
| Expansion.Restatement (no L3) | 4 | 8 | 1 |
| Expansion.Restatement.Equivalence | 25 | 22 | 0 |
| Expansion.Restatement.Specification | 36 | 55 | 2 |
| Temporal (no L2) | 0 | 0 | 0 |
| Temporal.Asynchronous | 128 | 55 | 3 |
| Temporal.Synchrony | 28 | 9 | 3 |
| Interrupted | 29 | 1 | 0 |
| Repetition | 0 | 108 | 0 |
| Discourse Marker | 1 | 0 | 0 |
| MISSING | 4 | 3 | 1 |
The data has been anonymized at token-level using the following conversions:
| Replacement | Freq | Description |
|---|---|---|
| `<NUM>` | 337 | number-words; e.g. duomilasei |
| `<ORD>` | 29 | ordinals; e.g. quarto |
| `<DIGIT>` | 740 | digit-words; e.g. due |
| `<CHAR>` | 86 | letter; e.g. C |
| `<PUNC>` | 18 | punctuation; e.g. barra |
| `<WORD>` | 11 | a word to be masked; e.g. password, spelling |
| `<CHARS>` | 5 | a sequence of letters (abbreviation); e.g. SG |
| `<BRAND>` | 36 | brands (hardware); e.g. Fujitsu |
| `<SW>` | 159 | software; e.g. Windows |
| `<PER>` | 278 | person names; e.g. Monica |
| `<ORG>` | 54 | named organizations; e.g. CSI |
| `<LOC>` | 126 | locations; e.g. Italia |
| `<LOC.SPELL>` | 25 | locations for spelling; e.g. Ancona |
| `<WD>` | 13 | week days; e.g. domenica |
| `<MM>` | 13 | month names; e.g. gennaio |
| `<MISC>` | 2 | other; not covered above |
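
To inspect the anonymization in a given dialog, the replacement tokens can be counted with a simple pattern match. This is a minimal sketch: it assumes that each replacement occupies a whole token (as `<PER>` does in the reduced example) and that every tag consists of uppercase letters, optionally dot-separated, inside angle brackets; `count_placeholders` is a hypothetical helper.

```python
import re
import typing as t
from collections import Counter

# Matches tags such as <NUM> or <LOC.SPELL>; the pattern is inferred from the table above.
PLACEHOLDER = re.compile(r"<[A-Z]+(?:\.[A-Z]+)*>")


def count_placeholders(tokens: t.Iterable[str]) -> Counter:
    """Count tokens that consist entirely of an anonymization tag."""
    return Counter(tok for tok in tokens if PLACEHOLDER.fullmatch(tok))


# First tokens of the reduced example dialog.
tokens = [
    "helpdesk", "buongiorno", "sono", "<PER>",
    "s\u00ec", "sono", "<PER>", "un", "collega",
]
print(count_placeholders(tokens))
```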
- 0704000020: conn and arg2 spans overlap in Explicit relation (DONE)
- 0 sense relations (8):
  - Relation Types
    - Explicit: 4
    - Implicit: 3
    - AltLex: 1
  - IDs
    - 0703000006: 1
    - 0704000001: 1
    - 0704000025: 1
    - 0704000031: 1
    - 0704000034: 1
    - 0704000051: 2
    - 0705000003: 1
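
Both kinds of issue listed above can be detected automatically. The following is a rough sketch of such a check, not the procedure used to produce the list: `spans_overlap` and `check_relation` are hypothetical helpers, segments are assumed to be end-exclusive, and a missing sense is assumed to be stored as an empty or absent `sense` value.

```python
import typing as t

Span = t.List[t.Tuple[int, int]]


def spans_overlap(a: Span, b: Span) -> bool:
    """True if any segment of `a` shares at least one token with a segment of `b`."""
    # Assumption: segments are [start, end), so adjacent segments do not overlap.
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def check_relation(rel: dict) -> t.List[str]:
    """Return a list of problems found in one relation dictionary."""
    problems = []
    if rel.get("label") == "Explicit" and spans_overlap(
        rel.get("conn") or [], rel.get("arg2") or []
    ):
        problems.append("conn and arg2 overlap")
    # EntRel relations have no senses, so they are not flagged.
    if rel.get("label") != "EntRel" and not rel.get("sense"):
        problems.append("missing sense")
    return problems


# The Explicit relation from the reduced example: conn and arg2 are adjacent.
ok = {
    "label": "Explicit",
    "sense": "Expansion.Restatement.Equivalence",
    "conn": [[59, 60]],
    "arg2": [[60, 63]],
}
# A made-up relation exhibiting both problems.
bad = {"label": "Explicit", "sense": "", "conn": [[59, 61]], "arg2": [[60, 63]]}

print(check_relation(ok))
print(check_relation(bad))
```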

If you use this dataset for publication, please cite the following papers:
- Sara Tonelli, Giuseppe Riccardi, Rashmi Prasad, and Aravind K. Joshi. "Annotation of discourse relations for conversational spoken dialogs." In Proceedings of the International Conference on Language Resources and Evaluation (LREC), 2010.
- Giuseppe Riccardi, Evgeny A. Stepanov, and Shammur Absar Chowdhury. "Discourse connective detection in spoken conversations." IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2016.