Tools for unpacking and viewing digitized newspaper archives with OCR data.
This repo includes the following newspaper issue as sample data: https://www.nb.no/items/URN:NBN:no-nb_digavis_fremtiden_null_null_19770426_69_94_1
It is shared under the CC BY-NC license.
unpack.py - Extracts .tar files from a newspaper archive directory into unpacked/.
page_viewer.py - GUI viewer that shows newspaper pages side by side with the OCR bounding boxes overlaid.
block_viewer.py - GUI viewer for individual TextBlocks. Shows each cropped image region with its TextLine (blue) and String (red) bounding boxes alongside the extracted text. Select an XML file from the dropdown and navigate between blocks with the arrow keys.
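The TextBlock, TextLine and String names match elements of the ALTO OCR format. As a rough illustration of the data the viewers work with, here is a minimal sketch that reads bounding boxes and text from such a file. It assumes ALTO-style attributes (HPOS, VPOS, WIDTH, HEIGHT, CONTENT); the helper names (Box, iter_boxes, block_text) and the SAMPLE document are hypothetical and not taken from the scripts in this repo.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, NamedTuple

# Tiny made-up ALTO-like document, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" HPOS="10" VPOS="20" WIDTH="200" HEIGHT="50">
      <TextLine HPOS="10" VPOS="20" WIDTH="200" HEIGHT="25">
        <String HPOS="10" VPOS="20" WIDTH="90" HEIGHT="25" CONTENT="Hello"/>
        <SP WIDTH="10"/>
        <String HPOS="110" VPOS="20" WIDTH="100" HEIGHT="25" CONTENT="world"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


class Box(NamedTuple):
    kind: str      # "TextBlock", "TextLine" or "String"
    x: float
    y: float
    width: float
    height: float
    text: str      # CONTENT attribute (String elements only), else ""


def local_name(tag: str) -> str:
    """Drop the namespace part of a tag such as '{http://...}TextLine'."""
    return tag.rsplit("}", 1)[-1]


def iter_boxes(root: ET.Element,
               kinds=("TextBlock", "TextLine", "String")) -> Iterator[Box]:
    """Yield a Box for every matching element, in document order."""
    for elem in root.iter():
        kind = local_name(elem.tag)
        if kind not in kinds:
            continue
        try:
            x = float(elem.attrib["HPOS"])
            y = float(elem.attrib["VPOS"])
            w = float(elem.attrib["WIDTH"])
            h = float(elem.attrib["HEIGHT"])
        except (KeyError, ValueError):
            continue  # skip elements without usable coordinates
        yield Box(kind, x, y, w, h, elem.get("CONTENT", ""))


def block_text(block: ET.Element) -> str:
    """Join String contents with spaces, one output line per TextLine."""
    lines = []
    for line in block.iter():
        if local_name(line.tag) != "TextLine":
            continue
        words = [s.get("CONTENT", "") for s in line.iter()
                 if local_name(s.tag) == "String"]
        lines.append(" ".join(words))
    return "\n".join(lines)
```

A viewer could then draw each Box as a rectangle on the page image, choosing the colour by kind (for example blue for TextLine and red for String, as block_viewer.py does).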
With uv
# Install dependencies
uv sync
# Unpack the archive
uv run unpack.py
# Launch the full page viewer
uv run page_viewer.py
# Launch the TextBlock viewer
uv run block_viewer.py
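
For readers curious what the unpacking step involves, the following is a minimal sketch of extracting every .tar file under an archive directory into unpacked/. It is not the actual contents of unpack.py: the function name unpack_archive, the archive_dir parameter and the one-subfolder-per-archive layout are assumptions made for illustration.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_dir: Path, out_dir: Path = Path("unpacked")) -> list[Path]:
    """Extract each .tar under archive_dir into its own folder in out_dir.

    Returns the list of folders that were created or reused.
    """
    extracted = []
    for tar_path in sorted(archive_dir.rglob("*.tar")):
        # Assumption: one output folder per archive, named after the file.
        target = out_dir / tar_path.stem
        target.mkdir(parents=True, exist_ok=True)
        with tarfile.open(tar_path) as tar:
            if hasattr(tarfile, "data_filter"):
                # Safer extraction on Python versions that support filters.
                tar.extractall(target, filter="data")
            else:
                tar.extractall(target)
        extracted.append(target)
    return extracted
```

Calling unpack_archive(Path("path/to/archive")) would leave the extracted files under unpacked/, where the two viewers could then pick them up.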