A ground truth (GT) dataset created within the OCR-D project and consisting of 348 pages extracted from historical documents pertaining to the "Verzeichnis der im deutschen Sprachraum erschienenen Drucke" (VD), all of which have been digitised by Staatsbibliothek zu Berlin – Berlin State Library (SBB).

OCR-D/OCR-D-GT-VD-SBB

OCR-D-GT-VD-SBB

A ground truth (GT) dataset created within the OCR-D project, consisting of 348 pages extracted from historical documents pertaining to the "Verzeichnis der im deutschen Sprachraum erschienenen Drucke" (VD), all of which have been digitised by Staatsbibliothek zu Berlin – Berlin State Library (SBB).

The data publication consists of 348 .xml files with transcriptions for 348 .tif facsimile image files. The images come from 67 distinct works: four images were extracted from each of 65 works, and 49 and 39 images, respectively, from the two remaining works. The dataset is complemented by a .csv file that maps the identifiers used in this dataset to the unique identifiers used in the digitised collections of Staatsbibliothek zu Berlin – Berlin State Library, and by a file listing in .csv format.

Data selection was performed within the OCR-D project at Staatsbibliothek zu Berlin – Berlin State Library. The project is funded by the German Research Foundation (DFG), project grant no. 460675868. The ground truth data were produced by a digitisation service provider and post-corrected by staff members of the Berlin State Library. Data curation and publication were carried out by two members of the research project "Mensch.Maschine.Kultur – Künstliche Intelligenz für das Digitale Kulturelle Erbe" at Staatsbibliothek zu Berlin – Berlin State Library. That research project was funded by the Federal Government Commissioner for Culture and the Media (BKM), project grant no. 2522DIG002.
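The page counts stated above can be checked against a local copy of the data. The following is a minimal sketch, not part of the dataset: it assumes that each .xml transcription shares its base name with its .tif image, which the description does not state explicitly. The function name `unpaired_stems` and the file names in the usage note are hypothetical.

```python
from pathlib import PurePath

# Page count implied by the description:
# 65 works with 4 images each, plus two works with 49 and 39 images.
EXPECTED_PAGES = 65 * 4 + 49 + 39


def unpaired_stems(filenames):
    """Return (xml_without_tif, tif_without_xml) as sorted lists of base names.

    Assumption: a transcription and its facsimile share the same base name,
    e.g. "page_0001.xml" and "page_0001.tif" (hypothetical naming).
    """
    xml = {PurePath(f).stem for f in filenames if f.lower().endswith(".xml")}
    tif = {PurePath(f).stem for f in filenames if f.lower().endswith(".tif")}
    return sorted(xml - tif), sorted(tif - xml)
```

For example, `unpaired_stems(["a_0001.xml", "a_0001.tif", "b_0002.xml"])` reports `b_0002` as a transcription without an image. On a complete copy of the dataset, both returned lists should be empty if the naming assumption holds, and the number of paired files should equal `EXPECTED_PAGES`.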

Metadata

Language: fra, deu, lat, nds
Format: Page-XML
Time: 1509-1827
GT Type: data_structure_and_text
License: CC-BY-4.0
Transcription Guidelines: https://ocr-d.de/en/gt-guidelines/trans/
Project: OCR-D/MMK

Sources

The volume of transcriptions:

TextLine Page
0 0

List of transcriptions

document TxtRegion ImgRegion LineDrawRegion GraphRegion TabRegion ChartRegion SepRegion MathRegion ChemRegion MusicRegion AdRegion NoiseRegion UnknownRegion CustomRegion TextLine Page
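The column headers above appear to abbreviate PAGE-XML element names (for instance, TxtRegion for `TextRegion` and SepRegion for `SeparatorRegion`); this reading is an assumption, since the page does not define the columns. Under that assumption, per-document counts could be computed along these lines. The helper names `local_name` and `count_page_elements` and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(xml_text):
    """Count Page, TextLine, and all *Region elements in one PAGE-XML document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        name = local_name(element.tag)
        if name.endswith("Region") or name in ("TextLine", "Page"):
            counts[name] += 1
    return counts


# Illustrative sample, not taken from the dataset.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.tif">
    <TextRegion id="r1">
      <TextLine id="l1"/>
      <TextLine id="l2"/>
    </TextRegion>
    <SeparatorRegion id="r2"/>
  </Page>
</PcGts>"""
```

Calling `count_page_elements(SAMPLE)` counts one `Page`, one `TextRegion`, two `TextLine` elements, and one `SeparatorRegion`. Matching on the local name rather than a fixed namespace keeps the sketch independent of the PAGE schema version used by the files.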
