This repository contains the text of De Latinae Linguae Reparatione authored by Marcus Antonius Sabellicus (1436–1506), annotated with respect to lemmas, part-of-speech tags, morphological features and syntactic dependencies.
A first version of the text was included in the test data of the EvaLatin shared task (Sprugnoli et al., 2022), which, however, did not focus on syntactic dependencies. Some changes have been implemented since that first version; see Annotation formalism and choices below for further details.
Acknowledgments
Annotator: Federica Gamba
Editor: Flavio Massimiliano Cecchini
We thank Timo Korkiakangas (Helsingin yliopisto, Helsinki, Finland) for funding the annotation through the Suomen Akatemia (Research Council of Finland) project grant no. 315176, Digital philology and Latin text production: a multimodal analysis of writing in the past.
The raw text was originally downloaded from ALIM - Archivio della Latinità Italiana del Medioevo as a txt file, and is reproduced in this repository.
The text in ALIM is based on the critical edition by G. Bottari (1999).
The text is composed of four sections:
- Epistola a Marcantonio Morosini
- De Latinae Linguae Reparatione Marci Antonii Sabellici Dialogus, Qui Et Latinae Linguae Reparatio Inscribitur
- Baptistae Guarini Dissertatio
- Epistola ad Antonio Moretto
The document is annotated according to the typological formalism of Universal Dependencies (UD), in particular as it is applied to Latin. It follows the state of the art in annotation as represented especially by the UDante treebank (Cecchini et al., 2020) and the harmonisation effort by Gamba & Zeman (2023a, 2023b), while introducing some slight (and compatible) variations. These are highlighted below, along with more general annotation choices which might be of interest to the user.
- Lemmas are normalised: they are always lowercase and the pair v/u is neutralised in favour of u.
- The annotation follows the current UD praxis of assigning "adverbs" (`ADV`) their own adverbial lemma.
  - This means that we observe some etymologically related pairs treated as different words (e.g. facile/faciliter, nimis/nimium, ...), or also, to keep compatibility with other Latin treebanks, elements possibly belonging to other parts of speech not annotated as such (e.g. `PRON` for ubi, `DET` for eo, `NOUN` for sponte).
- The current convention of other Latin treebanks is followed in mostly not splitting (i.e. treating as multi-word tokens) compounds of any kind (orthographical or morphological).
  - This means for example that tokens like admodum or praeterquam are analysed as unitary elements. The only exception to this is the token rempublicam, consisting of two morphologically independent words. Clitics such as que and ue are instead regularly split (also see below Statistics - Words).
- A complete annotation for foreign words is implemented, choosing the code-switching strategy as detailed in the UD guidelines. Lemmas are represented as in their original language, including possibly different writing systems. On the other hand, the feature `OrigLang` for integrated foreign words has not yet been implemented.
  - We note that while this CoNLL-U overall passes the official UD validation process, some morphological features used for `Foreign` words are not (yet) recognised by their respective treebanks (e.g. `InflClass` for `grc`).
- The lexical features `NameType`, `NumValue` and `Proper` are moved from the features to the miscellaneous (`MISC`) field.
  - The motivation with regard to `NameType` and `Proper` is that these features are of purely semantic type, with no reflexes whatsoever on morphosyntax, and they actually represent a different, independent layer of annotation (which will need to be better formalised in future), i.e. the one for named entities and multi-word expressions.
  - As for `NumValue`, the main reason is technical, as its values correspond to the infinite set of natural numbers, and so cannot be listed exhaustively in a UD documentation page.
- `NumValue` is added to all words which are assigned a `NumType`.
- The tripartite system of `VerbForm`s as described in Cecchini (2021) is used.
  - This excludes the values `Gdv`, `Ger` and `Sup`. The value `Inf` for infinitives instead of a typologically more transparent `Vnoun` is maintained for compatibility with the other Latin treebanks (we note that they are in fact equivalent notations for the same object).
  - The `MISC` field stores traditional denominations for Latin tenses and moods by means of the fields `TraditionalMood` and `TraditionalTense`.
- The Latin Perfect tense (corresponding to forms with Perfectum stems) is analysed as having `Aspect=Perf` and `Tense=Pres`. This implies that Plusquamperfectum is represented by `Aspect=Perf` and `Tense=Past` (as opposed to Imperfectum, with `Aspect=Imp` and `Tense=Past`), and so `Tense=Pqp` is not used.
- So-called "Inchoative" verbs (i.e. sc-verbs) are always annotated with `Aspect=Inch` in their imperfective forms. This is also valid for lemmas which are sometimes considered to "have lexicalised", e.g. cresco.
- The features `VerbForm` and `PronType` are also used in an "etymological" sense: that is, they can be assigned to words which are not tagged as `VERB` or `PRON`/`DET`/`ADV` respectively, meaning that the original form was one, even if its current distribution is that of another part of speech.
- The semantic transversal subtypes `lmod` and `tmod` for space and time arguments are applied consistently throughout the annotation, and so are the corresponding values for `AdvType`.
- So-called "free relatives", or also "double pronouns", are annotated as clauses containing their relative element (e.g. quando or quis) and as depending on the root of their main clause with the needed clausal relation type (`advcl`, `ccomp`/`xcomp`, `csubj`), using the `relcl` subtype for distinction.
  - This is in contrast to another style of annotation where the relative element is made dependent as a nominal argument of the main clause, and the rest of the free relative depends on the relative element as a "regular" relative clause (`acl:relcl`).
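Two of the choices listed above, lemma normalisation and the analysis of Latin past tenses, can be illustrated with a minimal Python sketch. The function names and the dictionary keys are assumptions made for this example only; they are not part of the treebank or of any tool used to build it, and the keys are not necessarily the values stored under `TraditionalTense`.

```python
def normalise_lemma(lemma: str) -> str:
    """Lowercase a lemma and neutralise the v/u pair in favour of u.

    Hypothetical helper: it mirrors the normalisation rule stated above,
    not the annotators' actual procedure.
    """
    return lemma.lower().replace("v", "u")


# Aspect/Tense analysis of three traditional Latin tenses, as described above.
# The keys are illustrative labels chosen for this sketch.
TENSE_ANALYSIS = {
    "Perfect": {"Aspect": "Perf", "Tense": "Pres"},
    "Plusquamperfectum": {"Aspect": "Perf", "Tense": "Past"},
    "Imperfectum": {"Aspect": "Imp", "Tense": "Past"},
}


def feats_string(features: dict) -> str:
    """Serialise a feature dictionary in CoNLL-U style (sorted, pipe-separated)."""
    return "|".join(f"{name}={value}" for name, value in sorted(features.items()))


if __name__ == "__main__":
    print(normalise_lemma("Vergilius"))
    print(feats_string(TENSE_ANALYSIS["Plusquamperfectum"]))
```

Since no tense receives `Tense=Pqp`, the pluperfect is distinguished from the imperfect by `Aspect` alone.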
The annotated text consists of (for the definitions of tokens, syntactic words and multi-word tokens, please refer to the UD documentation):
- 246 sentences
- 10651 tokens, or 9070 without considering the 1581 punctuation marks (i.e. part of speech `PUNCT`)
- 10755 syntactic words, or 9174 without considering the 1581 punctuation marks (i.e. part of speech `PUNCT`)
The difference is given by 104 multi-word tokens, which are always composed of two elements. The second element is nearly always a functional clitic, distributed as follows:
- 94 que (`CCONJ`)
- 3 cum (`ADP`)
- 2 ue (`CCONJ`)
- 2 quis (`PRON`)
- 1 quidem (`PART`)
- 1 uero (`ADV`)
The only case where a token is split into two lexical components is rempublicam, made out of forms of res (NOUN) and publicus (ADJ) respectively.
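The relation between tokens, syntactic words and multi-word tokens can be checked directly on a CoNLL-U file. The following is a minimal sketch that relies only on the ID column conventions of the CoNLL-U format (ranges such as `2-3` for multi-word tokens, decimal IDs for empty nodes). The function name `count_units` and the two-word sample are assumptions made for illustration; the sample is not taken from the treebank.

```python
def count_units(conllu_text: str) -> dict:
    """Count sentences, tokens, syntactic words and multi-word tokens."""
    sentences = words = mwts = covered = 0
    in_sentence = False
    for raw_line in conllu_text.splitlines():
        line = raw_line.strip()
        if not line:                 # a blank line closes a sentence
            in_sentence = False
            continue
        if line.startswith("#"):     # sentence-level comment
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        token_id = line.split("\t")[0]
        if "-" in token_id:          # multi-word token range, e.g. "2-3"
            start, end = token_id.split("-")
            mwts += 1
            covered += int(end) - int(start) + 1
        elif "." in token_id:        # empty node: neither token nor word
            continue
        else:
            words += 1
    # Each multi-word token replaces the words it covers with one token.
    tokens = words - covered + mwts
    return {"sentences": sentences, "tokens": tokens, "words": words, "mwts": mwts}


# Invented sample with one clitic split (populus + que); annotation is illustrative.
SAMPLE = "\n".join([
    "# text = senatus populusque",
    "1\tsenatus\tsenatus\tNOUN\t_\t_\t0\troot\t_\t_",
    "2-3\tpopulusque\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tpopulus\tpopulus\tNOUN\t_\t_\t1\tconj\t_\t_",
    "3\tque\tque\tCCONJ\t_\t_\t2\tcc\t_\t_",
    "",
])

if __name__ == "__main__":
    print(count_units(SAMPLE))
```

If the statistics above hold, running the same function on the treebank file should report 246 sentences, 10651 tokens, 10755 words and 104 multi-word tokens.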
There are 1866 different lemmas, 1857 if punctuation marks are not included. This leads to a lexical richness (i.e. "type-token ratio") of ca. 20.24%. Please note that words assigned the same lemma might still be different words, e.g. when they belong to different parts of speech.
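The reported type-token ratio is consistent with dividing the number of lemmas by the number of syntactic words, both taken without punctuation. The short check below uses only figures stated in this section; the constant and function names are assumptions of this sketch, and the choice of denominator is inferred from the arithmetic rather than stated by the authors.

```python
# Figures as reported in this section.
TOKENS = 10651
WORDS = 10755
PUNCT = 1581
MULTIWORD_TOKENS = 104
LEMMAS_WITHOUT_PUNCT = 1857


def type_token_ratio(types: int, tokens: int) -> float:
    """Return the type-token ratio as a percentage."""
    return 100 * types / tokens


if __name__ == "__main__":
    # Each two-element multi-word token adds one word beyond the token count.
    print(WORDS - TOKENS == MULTIWORD_TOKENS)
    print(TOKENS - PUNCT, WORDS - PUNCT)
    print(f"{type_token_ratio(LEMMAS_WITHOUT_PUNCT, WORDS - PUNCT):.2f}%")
```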
There are 13 foreign words (marked with Foreign=Yes) in the text: 8 belong to Italian (it), 5 to Ancient Greek (grc). Especially the latter represent cases of code-switching.
All UD parts of speech apart from SYM and X are used. Their distribution by form type and by lemma is as follows:
| Part of speech | per form type | per lemma |
|---|---|---|
| `NOUN` + `PROPN` | 1572 + 404 | 454 + 221 |
| `VERB` | 1691 | 508 |
| `ADJ` | 862 | 344 |
| `ADV` | 845 | 208 |
| `DET` + `NUM` | 863 + 25 | 45 + 8 |
| `ADP` | 608 | 29 |
| `PRON` | 837 | 19 |
| `SCONJ` | 366 | 19 |
| `PART` | 276 | 16 |
| `CCONJ` | 560 | 13 |
| `AUX` | 264 | 1 |
| `INTJ` | 1 | 1 |
| `PUNCT` | 1581 | 9 |
Please refer to UD's guidelines for parts of speech.
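As a consistency check, the "per form type" column of the table can be summed and compared with the total number of syntactic words given in the statistics. The sketch below simply transcribes that column; the variable names are assumptions made for this example.

```python
# "Per form type" counts transcribed from the table above.
OCCURRENCES = {
    "NOUN": 1572, "PROPN": 404, "VERB": 1691, "ADJ": 862, "ADV": 845,
    "DET": 863, "NUM": 25, "ADP": 608, "PRON": 837, "SCONJ": 366,
    "PART": 276, "CCONJ": 560, "AUX": 264, "INTJ": 1, "PUNCT": 1581,
}

total = sum(OCCURRENCES.values())
total_without_punct = total - OCCURRENCES["PUNCT"]
most_frequent = max(OCCURRENCES, key=OCCURRENCES.get)

if __name__ == "__main__":
    print(total, total_without_punct, most_frequent)
```

The totals match the 10755 syntactic words (9174 without punctuation) reported above.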
A total of 61 dependency relations is used, 31 of which represent subtypes of universal relations, as follows:
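- `acl`
  - `acl:relcl`
- `advcl`
  - `advcl:abs`
  - `advcl:cmp`
  - `advcl:pred`
  - `advcl:relcl`
- `advmod`
  - `advmod:emph`
  - `advmod:lmod`
  - `advmod:neg`
  - `advmod:tmod`
- `amod`
- `appos`
- `aux`
  - `aux:pass`
- `case`
- `cc`
- `ccomp`
- `conj`
  - `conj:expl`
- `csubj`
  - `csubj:cleft`
  - `csubj:pass`
  - `csubj:relcl`
- `det`
  - `det:numgov`
- `discourse`
- `dislocated` (NB: never used without subtype)
  - `dislocated:csubj`
  - `dislocated:nsubj`
  - `dislocated:obj`
- `flat`
  - `flat:gov`
  - `flat:name`
- `mark`
- `nmod`
- `nsubj`
  - `nsubj:cleft`
  - `nsubj:outer`
  - `nsubj:pass`
- `nummod`
- `obj`
- `obl`
  - `obl:agent`
  - `obl:arg`
  - `obl:cmp`
  - `obl:lmod`
  - `obl:tmod`
- `orphan`
- `parataxis` (NB: never used without subtype)
  - `parataxis:rep`
  - `parataxis:reporting`
  - `parataxis:speaker`
- `punct`
- `root`
- `vocative`
- `xcomp`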
Please refer to UD's guidelines for dependency relations and to the specific guidelines for Latin, where present.
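Because subtypes are written as `universal:subtype`, a relation label can be split at the first colon to recover the universal relation it specialises. The following sketch shows this on a handful of labels taken from the list above; the helper name `split_relation` is an assumption of this example.

```python
from collections import defaultdict
from typing import Optional, Tuple


def split_relation(deprel: str) -> Tuple[str, Optional[str]]:
    """Split a relation label into its universal part and optional subtype."""
    universal, _, subtype = deprel.partition(":")
    return universal, subtype or None


def group_subtypes(deprels):
    """Map each universal relation to the sorted subtypes seen for it."""
    groups = defaultdict(set)
    for deprel in deprels:
        universal, subtype = split_relation(deprel)
        if subtype is not None:
            groups[universal].add(subtype)
    return {universal: sorted(subtypes) for universal, subtypes in groups.items()}


if __name__ == "__main__":
    sample = ["nsubj", "obl:tmod", "obl:lmod", "advcl:relcl", "csubj:relcl"]
    print(group_subtypes(sample))
```

Grouping in this way makes it easy, for instance, to collect all clauses annotated with the `relcl` subtype regardless of their universal relation.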
The following 24 morpholexical features and their values (given as pipe-separated strings after the equal sign) are used:
- `Abbr=Yes`
- `AdvType=Loc|Tim`
- `Aspect=Imp|Inch|Perf|Prosp`
- `Case=Abl|Acc|Dat|Gen|Loc|Nom|Voc`
- `Compound=Yes`
- `Degree=Abs|Cmp|Dim`
- `Foreign=Yes`
- `Form=Emp`
- `Gender=Fem|Masc|Neut`
- `InflClass=IndEurA|IndEurE|IndEurI|IndEurO|IndEurU|IndEurX|LatA|LatAnom|LatE|LatI|LatI2|LatPron|LatX`
  - also as layer `[nominal]`
- `Mood=Imp|Ind|Sub`
- `Number=Plur|Sing`
  - also as layer `[psor]`
- `NumForm=Roman|Word`
- `NumType=Card|Mult|Ord`
- `PartType=Int`
- `Person=1|2|3`
  - also as layer `[psor]`
- `Polarity=Neg`
- `Poss=Yes`
- `PronType=Con|Dem|Ind|Int|Neg|Prs|Rel|Tot`
- `Reflex=Yes`
- `Tense=Fut|Past|Pres`
- `Variant=Greek`
- `VerbForm=Conv|Fin|Inf|Part`
- `Voice=Act|Pass`
Please refer to UD's guidelines for morpholexical features, layered features, and to the specific guidelines for Latin, where present.
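Note that in the list above the pipe separates the possible values of one feature, whereas in the FEATS column of a CoNLL-U file the pipe separates different features (and a comma separates multiple values of one feature). A minimal sketch of a FEATS parser that also handles layered features such as `Number[psor]` follows. The function name and the example string are assumptions made for illustration; the string is not taken from the treebank.

```python
import re

# Feature name with an optional layer in square brackets, e.g. "Number[psor]".
FEATURE_NAME = re.compile(r"(\w+)(?:\[(\w+)\])?")


def parse_feats(feats: str) -> dict:
    """Parse a CoNLL-U FEATS string into {(name, layer): [values]}."""
    if feats == "_":                      # "_" marks an empty FEATS column
        return {}
    parsed = {}
    for item in feats.split("|"):
        name, _, values = item.partition("=")
        match = FEATURE_NAME.fullmatch(name)
        if match is None:
            raise ValueError(f"Malformed feature name: {name!r}")
        parsed[(match.group(1), match.group(2))] = values.split(",")
    return parsed


if __name__ == "__main__":
    # Invented example of a possessive form with layered possessor features.
    example = "Case=Nom|Number=Sing|Number[psor]=Plur|Person[psor]=1|Poss=Yes"
    for (name, layer), values in parse_feats(example).items():
        print(name, layer, values)
```

Keeping the layer as a separate key component lets the plain `Number` of a form and the `Number[psor]` of its possessor coexist without overwriting each other.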
- Federica Gamba, ÚFAL, MFF, Univerzita Karlova, Prague, Czech Republic: `gamba at ufal.mff.cuni.cz`
- Flavio Massimiliano Cecchini, KU Leuven, Belgium (formerly CIRCSE, Università Cattolica del Sacro Cuore, Milan, Italy): `flaviomassimiliano.cecchini at kuleuven.be`
- Federica Gamba and Daniel Zeman. 2023a. Universalising Latin Universal Dependencies: a harmonisation of Latin treebanks in UD. In Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023), Washington, DC, USA, March. Association for Computational Linguistics (ACL).
- Federica Gamba and Daniel Zeman. 2023b. Latin Morphology through the Centuries: Ensuring Consistency for Better Language Processing. In Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing RANLP 2023, Varna, Bulgaria, September.
- Flavio Massimiliano Cecchini, Rachele Sprugnoli, Giovanni Moretti, Marco Passarotti. 2020. UDante: First Steps Towards the Universal Dependencies Treebank of Dante's Latin Works. In: Johanna Monti, Felice Dell'Orletta, Fabio Tamburini (eds.), Proceedings of the Seventh Italian Conference on Computational Linguistics. CEUR Workshop Proceedings, pp. 1-7.
- Flavio Massimiliano Cecchini. 2021. Formae reformandae: for a reorganisation of verb form annotation in Universal Dependencies illustrated by the specific case of Latin. In Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021), pages 1–15, Sofia, Bulgaria. Association for Computational Linguistics.
- Rachele Sprugnoli, Marco Passarotti, Flavio Massimiliano Cecchini, Margherita Fantoli, and Giovanni Moretti. 2022. Overview of the EvaLatin 2022 Evaluation Campaign. In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, pages 183–188, Marseille, France. European Language Resources Association.
- Marco Antonio Sabellico, De Latinae Linguae Reparatione, edited by G. Bottari, Messina, Italy. Università degli studi di Messina, Centro interdipartimentale di studi umanistici, 1999 (Percorsi dei classici, 2).