The HIPE-2026 shared task is a CLEF 2026 Evaluation Lab on the extraction and qualification of person–place relations in multilingual historical documents.
Building on the success of HIPE-2020 and HIPE-2022, which focused on entity recognition and linking, HIPE-2026 aims to support answering the question Who was where, when? and to deepen our understanding of how people and places were connected in historical media. This will enable the reconstruction of life trajectories, the tracing of mobility patterns, and the identification of actors within local contexts.
Key information
Data
HIPE-2026 Data Releases
HIPE-2026 Evaluation
Acknowledgements
References
- 💻 Visit the website for general information on the shared task and registration.
- 📓 Read the Participation Guidelines for detailed information about the tasks, datasets and evaluation.
- License: HIPE-2026 data is released under a CC BY-NC-SA 4.0 License
- Where to find the data:
- Release history:
- 04.12.2025: data sample + data JSON schema.
- 19.12.2025: extended data sample release v1.0 and sandbox release (high-quality automatic annotations)
- 19.01.2026: full training and dev data release v2.0
- xx.xx.2026: masked test data release
- xx.xx.2026: unmasked test data release
Contents and preparation
HIPE-2026 builds on the HIPE-2022 v2.1 NE-annotated historical newspaper datasets.
The primary datasets included in the HIPE-2026 data are those that carry PERS and LOC annotations, namely: impresso-hipe-2020, newseye, sonar, and letemps.
HIPE-2022 data in IOB format, containing NE mentions and Wikidata QIDs, is converted into JSON, preserving the document text and metadata, and enabling the extraction of person–place pairs.
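As a rough illustration of this kind of conversion, the sketch below groups IOB-tagged tokens into entity mentions. The row layout (token, tag, QID), the helper name `iob_to_mentions`, the output keys, and the placeholder identifiers are all assumptions made for the example; they do not reproduce the HIPE-2026 intermediate schema or the actual conversion script.

```python
from typing import Any, Dict, List, Tuple


def iob_to_mentions(rows: List[Tuple[str, str, str]]) -> List[Dict[str, Any]]:
    """Group (token, IOB tag, QID) rows into entity mentions.

    Hypothetical helper: row layout and output keys are illustrative only.
    """
    mentions: List[Dict[str, Any]] = []
    current = None
    for index, (token, tag, qid) in enumerate(rows):
        if tag.startswith("B-"):
            # A B- tag always opens a new mention.
            if current is not None:
                mentions.append(current)
            current = {"type": tag[2:], "tokens": [token], "start": index, "qid": qid}
        elif tag.startswith("I-") and current is not None and current["type"] == tag[2:]:
            # An I- tag of the same type extends the open mention.
            current["tokens"].append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open mention.
            if current is not None:
                mentions.append(current)
            current = None
    if current is not None:
        mentions.append(current)
    for mention in mentions:
        # Replace the token list with the surface form of the mention.
        mention["text"] = " ".join(mention.pop("tokens"))
    return mentions


# Placeholder identifiers, not real Wikidata QIDs.
rows = [
    ("Victor", "B-pers", "Q_HUGO"),
    ("Hugo", "I-pers", "Q_HUGO"),
    ("arrive", "O", "_"),
    ("à", "O", "_"),
    ("Bruxelles", "B-loc", "Q_BRUSSELS"),
]
mentions = iob_to_mentions(rows)
print(mentions)
```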
The preparation process involved roughly the following steps:
- Representation transformation: convert IOB-encoded annotations into structured JSON (intermediate JSON schema).
- Data cleaning & filtering: merge NIL entities and remove overly long documents.
- Extraction of candidate person–location pairs: identify potential pairs within each document and filter.
- Annotation: pre-annotate with an LLM, then manually review and correct collaboratively.
- Final dataset creation: assemble dataset splits and package for release (final JSON schema).
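The candidate-extraction step can be pictured as a cross product of the persons and locations found in one document. The sketch below is a minimal version under stated assumptions: entities are dictionaries with `type` and `qid` keys, the type labels are `pers` and `loc`, and duplicates are collapsed by identifier. The filtering criteria actually used for HIPE-2026 are not described here and are not modelled.

```python
from itertools import product
from typing import Dict, List, Tuple


def candidate_pairs(entities: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return unique (person, location) identifier pairs for one document.

    Hypothetical helper: the entity layout is an assumption for illustration.
    """
    # Sets collapse repeated mentions; sorting keeps the output deterministic.
    persons = sorted({e["qid"] for e in entities if e["type"] == "pers"})
    places = sorted({e["qid"] for e in entities if e["type"] == "loc"})
    return list(product(persons, places))


doc_entities = [
    {"type": "pers", "qid": "P1"},
    {"type": "pers", "qid": "P2"},
    {"type": "loc", "qid": "L1"},
    {"type": "pers", "qid": "P1"},  # repeated mention of the same person
]
pairs = candidate_pairs(doc_entities)
print(pairs)
```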
Format and data representation
- HIPE-2026 data follows this JSON schema.
- All documents from different primary datasets of HIPE-2022 are gathered in the same language-dependent JSON Line file.
- Information on the source document and its metadata is in the media property.
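Since each file holds one JSON document per line, it can be read with the standard library alone. The sketch below is one possible reader; `read_jsonl` is a hypothetical helper, and the sample record (including every key other than `media`) is invented for the example rather than taken from the schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON document per non-empty line of a UTF-8 file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write a tiny invented sample so the example runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    record = {"document_id": "doc-1", "media": {"source": "example"}}
    sample.write_text(json.dumps(record) + "\n", encoding="utf-8")
    docs = list(read_jsonl(sample))

print(docs[0]["media"])
```

Reading lazily with a generator keeps memory use flat for large files; wrap the call in `list(...)` only when the whole split is needed at once.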
Directory structure and naming convention
- Training and development datasets consist of UTF-8 JSON Line files. There is one .jsonl file per language and split.
- Files are named according to this schema: HIPE-2026-vx.x-<dataset>-<train|dev|test>-<lg1>.jsonl.
- Data directory is organised per HIPE release version and language:

    data
    └── newspapers
        ├── vx.x
            ├── lg1
            │   ├── HIPE-2026-vx.x-newspapers-train-lg1.jsonl
            │   ├── HIPE-2026-vx.x-newspapers-dev-lg1.jsonl
            └── lg2
            │   ├── HIPE-2026-vx.x-newspapers-train-lg2.jsonl
            │   ├── HIPE-2026-vx.x-newspapers-dev-lg2.jsonl
        ├── ...
        └── vx.x
            ├── lg1
            ...
    └── literaryworks
        ├── vx.x
            ├── lg1
            │   ├── HIPE-2026-vx.x-literaryworks-test-lg1.jsonl
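The naming schema lends itself to a regular expression. The sketch below is one way to split a filename into its parts; `parse_filename` is a hypothetical helper, the pattern assumes dataset names and language codes contain no hyphens, and the language code `fr` in the usage line is only an example value.

```python
import re
from typing import Dict, Optional

# Mirrors HIPE-2026-vx.x-<dataset>-<train|dev|test>-<lg1>.jsonl
FILENAME_RE = re.compile(
    r"^HIPE-2026-v(?P<version>\d+\.\d+)-(?P<dataset>[A-Za-z0-9]+)"
    r"-(?P<split>train|dev|test)-(?P<language>[a-z0-9]+)\.jsonl$"
)


def parse_filename(name: str) -> Optional[Dict[str, str]]:
    """Return version, dataset, split and language, or None if no match."""
    match = FILENAME_RE.match(name)
    return match.groupdict() if match else None


info = parse_filename("HIPE-2026-v1.0-newspapers-train-fr.jsonl")
print(info)
```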
Versioning
- HIPE-2026 releases are versioned Major.Minor. Version information is present in the data directory structure and data filenames.
- Each HIPE-2026 release has an equivalent git repository release, with release notes.
Coming soon: link to a notebook, when available.
To validate that your .jsonl files conform to the HIPE-2026 schema:
- Create a virtual environment and install dependencies:

    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

- Run the validator using the provided Makefile:

    make validate

This will check all .jsonl files in data/v1.0/ against the schema at hipe-2026-data.schema.json.
To clean up the virtual environment and cache files:

    make clean

To (re)install dependencies:

    make install

Alternatively, you can run the validator script directly:

    python scripts/check_jsonlschema.py \
        --schemafile schemas/hipe-2026-data.schema.json \
        data/v1.0/*.jsonl

Score the performance of a single submission file:

    python scripts/file_scorer_evaluation.py \
        --gold_data_file GOLD_DATA_FILE_TO_EVALUATE_AGAINST.jsonl \
        --predictions_file YOUR_PREDICTION_FILE.jsonl

Score the performance of an entire submission folder:

    # "data/newspapers/v0.9" is also the default value of --gold_data_folder.
    python scripts/folder_scorer_evaluation.py \
        --gold_data_folder "data/newspapers/v0.9" \
        --team_name TEAM_NAME \
        --submission_folder SUBMISSION_FOLDER_NAME

Note: For the currently supported sample evaluations, the submission files should follow the naming {team_name}_{gold_file_stem}_run1.jsonl.
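To avoid typos when applying this naming pattern, the expected filename can be derived from the gold file path. The sketch below is a small convenience, not part of the scorer: `submission_filename` is a hypothetical helper, the team name and gold path are example values, and the `run` parameter is an assumption, since only `run1` is stated above.

```python
from pathlib import Path


def submission_filename(team_name: str, gold_file: str, run: int = 1) -> str:
    """Build {team_name}_{gold_file_stem}_run{run}.jsonl from a gold file path."""
    # Path.stem drops the directory and the final ".jsonl" suffix.
    gold_stem = Path(gold_file).stem
    return f"{team_name}_{gold_stem}_run{run}.jsonl"


name = submission_filename(
    "myteam", "data/newspapers/v0.9/HIPE-2026-v0.9-newspapers-dev-fr.jsonl"
)
print(name)
```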
The HIPE-2026 organising team expresses its sincere appreciation to the CLEF-2026 Lab Organising Committee for the overall coordination and support. HIPE-eval editions are organised within the framework of the Impresso - Media Monitoring of the Past project, funded by the Swiss National Science Foundation under grant No. CRSII5_213585 and by the Luxembourg National Research Fund under grant No. 17498891.
To be updated
- HIPE-2022 Participant Papers are in Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, edited by Guglielmo Faggioli, Nicola Ferro, Allan Hanbury, and Martin Potthast.
- HIPE-2022 Extended Overview Paper:
  M. Ehrmann, M. Romanello, S. Najem-Meyer, A. Doucet, and S. Clematide (2022). Extended Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents. In Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, edited by Guglielmo Faggioli, Nicola Ferro, Allan Hanbury, and Martin Potthast, Vol. 3180. CEUR-WS, 2022. https://doi.org/10.5281/zenodo.6979577.

  BibTeX:

    @inproceedings{ehrmann_extended_2022,
      title = {Extended Overview of {{HIPE-2022}}: {{Named Entity Recognition}} and {{Linking}} in {{Multilingual Historical Documents}}},
      booktitle = {Proceedings of the {{Working Notes}} of {{CLEF}} 2022 - {{Conference}} and {{Labs}} of the {{Evaluation Forum}}},
      author = {Ehrmann, Maud and Romanello, Matteo and {Najem-Meyer}, Sven and Doucet, Antoine and Clematide, Simon},
      editor = {Faggioli, Guglielmo and Ferro, Nicola and Hanbury, Allan and Potthast, Martin},
      year = {2022},
      volume = {3180},
      publisher = {{CEUR-WS}},
      doi = {10.5281/zenodo.6979577},
      url = {http://ceur-ws.org/Vol-3180/paper-83.pdf}
    }

- HIPE-2020 Participant Papers are in Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, edited by Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol.
- HIPE-2020 Extended Overview Paper:
  M. Ehrmann, M. Romanello, A. Flückiger, and S. Clematide (2020). Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers. In Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 2020, vol. 2696, p. 38. doi: 10.5281/zenodo.4117566.

  BibTeX:

    @inproceedings{ehrmann_extended_2020,
      ids = {ehrmann2020extended,ehrmann_extended_2020a},
      title = {Extended {{Overview}} of {{CLEF HIPE}} 2020: {{Named Entity Processing}} on {{Historical Newspapers}}},
      booktitle = {Working {{Notes}} of {{CLEF}} 2020 - {{Conference}} and {{Labs}} of the {{Evaluation Forum}}},
      author = {Ehrmann, Maud and Romanello, Matteo and Fl{\"u}ckiger, Alex and Clematide, Simon},
      editor = {Cappellato, Linda and Eickhoff, Carsten and Ferro, Nicola and N{\'e}v{\'e}ol, Aur{\'e}lie},
      year = 2020,
      volume = {2696},
      pages = {38},
      publisher = {CEUR-WS},
      address = {Thessaloniki, Greece},
      url = {https://infoscience.epfl.ch/record/281054},
      keywords = {cited}
    }