The CLARIN PressMint project plans to compile corpora of historical newspapers for a number of countries and languages.
PressMint corpora are to be interoperable, i.e. encoded according to a common PressMint schema, a customisation of the TEI Guidelines, with various downstream formats (TSV, CoNLL-U, JSON, etc.) also available. The same scripts should be able to process the common data in any PressMint corpus, despite the different kinds of information the individual corpora include.
The PressMint Git workflow, scripts and documentation will be based on the ParlaMint project, which builds richly annotated corpora of parliamentary proceedings for a large number of countries and autonomous regions.
This Git repository is, as yet, a stub with content still to be added. Note that it has several branches, each covering a different part of the development.
The repository contains the following directories:
- The Samples directory contains subdirectories organised by contributing (CLARIN) country. It will eventually include samples for all variants and formats of the PressMint corpora.