titaenstad/newspaper_example

Newspaper OCR Viewer

Tools for unpacking and viewing digitized newspaper archives with OCR data.

Data

This repo contains a single digitized newspaper issue: https://www.nb.no/items/URN:NBN:no-nb_digavis_fremtiden_null_null_19770426_69_94_1
It is shared under the CC BY-NC license.

Scripts

unpack.py - Extracts .tar files from a newspaper archive directory into unpacked/.
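The extraction step can be sketched roughly as follows. This is not the code in unpack.py, only a minimal illustration of the idea; the function name `unpack_archives`, the recursive search, and the one-folder-per-archive layout are assumptions.

```python
import tarfile
from pathlib import Path


def unpack_archives(archive_dir: Path, out_dir: Path) -> list[Path]:
    """Extract every .tar file under archive_dir into its own folder in out_dir.

    Hypothetical helper: the real unpack.py may use a different layout.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    # sorted() collects all matches before extraction starts, so files
    # written to out_dir during the loop are not picked up.
    for tar_path in sorted(archive_dir.rglob("*.tar")):
        target = out_dir / tar_path.stem
        target.mkdir(parents=True, exist_ok=True)
        # Use the safer "data" extraction filter where the interpreter has it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        with tarfile.open(tar_path) as tar:
            tar.extractall(target, **kwargs)
        extracted.append(target)
    return extracted
```

Calling `unpack_archives(Path("archive"), Path("unpacked"))` would return the list of folders it created, where `archive` stands in for whatever the archive directory is actually called.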

page_viewer.py - GUI viewer showing newspaper pages side-by-side with an overlay of OCR bounding boxes.

block_viewer.py - GUI viewer for individual TextBlocks. Shows cropped image regions with TextLine (blue) and String (red) bounding boxes alongside the extracted text. Select an XML file from the dropdown and navigate between blocks with the arrow keys.
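The element names TextBlock, TextLine and String suggest that the OCR files follow the ALTO XML schema. Under that assumption, collecting the bounding boxes the viewers draw could look roughly like the sketch below. The `Box` class and `read_boxes` function are hypothetical and not taken from the scripts; HPOS, VPOS, WIDTH, HEIGHT and CONTENT are standard ALTO attribute names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Overlay colours as described for block_viewer.py.
COLORS = {"TextLine": "blue", "String": "red"}
KINDS = {"TextBlock", "TextLine", "String"}


@dataclass
class Box:
    kind: str       # "TextBlock", "TextLine" or "String"
    x: float        # HPOS: distance from the left page edge
    y: float        # VPOS: distance from the top page edge
    width: float
    height: float
    text: str = ""  # CONTENT, only present on String elements


def read_boxes(xml_text: str) -> list[Box]:
    """Return boxes for all TextBlock, TextLine and String elements, in document order."""
    root = ET.fromstring(xml_text)
    boxes = []
    for elem in root.iter():
        # Drop the "{namespace}" prefix so any ALTO version matches.
        kind = elem.tag.rsplit("}", 1)[-1]
        if kind not in KINDS:
            continue
        boxes.append(
            Box(
                kind=kind,
                x=float(elem.get("HPOS", 0)),
                y=float(elem.get("VPOS", 0)),
                width=float(elem.get("WIDTH", 0)),
                height=float(elem.get("HEIGHT", 0)),
                text=elem.get("CONTENT", ""),
            )
        )
    return boxes
```

A viewer could then crop the page image to a TextBlock's box and draw each nested TextLine and String rectangle in the colour given by `COLORS`. Coordinates may need scaling if the measurement unit of the XML differs from the image's pixel grid.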

Usage

With uv

# Install dependencies
uv sync

# Unpack the archive
uv run unpack.py

# Launch the full page viewer
uv run page_viewer.py

# Launch the TextBlock viewer
uv run block_viewer.py

About

Example of a newspaper file structure
