vamoosev/eums

Legal Document to MediaWiki Converter

Converts PDF and DOCX legal documents into MediaWiki markup.

Installation

Using a virtual environment is recommended; it avoids the restrictions some systems place on installing packages into the system Python.

python3 -m venv .venv
. .venv/bin/activate
pip install pdfplumber python-docx tqdm

Usage

python main.py <input_dir> <output_dir>

Example:

python main.py attachments laws_output

Importing .wiki pages into MediaWiki

This repository includes a simple dockerized MediaWiki setup and a helper script to bulk import the generated .wiki files as pages.

Steps:

  1. Start MediaWiki with the database
docker-compose up -d
  2. Complete the web installer once (first time only)
  • Open http://localhost:8080
  • Database settings to use during install:
    • DB host: db
    • DB name: mediawiki
    • DB user: wiki
    • DB password: change_me
  • When the installer offers LocalSettings.php, place it in the wiki root, /var/www/html, which is a container volume and therefore persists the file. With this compose setup the file should be saved to the volume automatically; if the installer prompts you to download it instead, copy it back into the container or provide it through a bind mount.
  3. Import all .wiki files
  • The folder laws_output/ is mounted read-only inside the MediaWiki container at /import.
  • Run the helper script to import all files. It uses each filename (without .wiki) as the page title, converting underscores to spaces.
chmod +x import_wiki.sh
./import_wiki.sh

Optionally, import into a specific namespace (e.g., Laws):

./import_wiki.sh Laws
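
The filename-to-title rule described above can be sketched in Python. This is only an illustration, not the code in import_wiki.sh: the function name page_title is hypothetical, and prefixing the title with "Namespace:" is an assumption based on the usual MediaWiki title convention.

```python
from pathlib import Path
from typing import Optional


def page_title(filename: str, namespace: Optional[str] = None) -> str:
    """Derive a page title from a .wiki filename (hypothetical helper)."""
    name = Path(filename).name  # drop any directory part, e.g. /import/
    if name.endswith(".wiki"):
        name = name[: -len(".wiki")]  # remove only the .wiki extension
    title = name.replace("_", " ")  # underscores become spaces
    # Assumption: a namespace is expressed as a "Namespace:" prefix.
    return f"{namespace}:{title}" if namespace else title


print(page_title("Clean_Air_Act.wiki"))
print(page_title("/import/Clean_Air_Act.wiki", "Laws"))
```

Under these assumptions, a file named Clean_Air_Act.wiki would become the page "Clean Air Act", or "Laws:Clean Air Act" when the Laws namespace is given.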

Notes:

  • The script uses maintenance/importTextFiles.php with --overwrite, so re-running will update existing pages with the latest content.
  • If you see an error about LocalSettings.php missing, finish the installer in step 2 first.
  • Access the wiki at http://localhost:8080

Output Format

  • Filename: sanitized document title (max 200 chars)
  • Extension: .wiki
  • Content: MediaWiki markup with proper hierarchy
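
As a rough illustration of the filename rule, the sketch below sanitizes a title and caps it at 200 characters. The function name sanitize_filename, the set of replaced characters, the use of underscores for whitespace, the "Untitled_Bill" fallback, and applying the limit before the extension is added are all assumptions; the converter's actual rules may differ.

```python
import re

MAX_LEN = 200  # limit stated in this README


def sanitize_filename(title: str, max_len: int = MAX_LEN) -> str:
    """Turn a document title into a .wiki filename (hypothetical helper)."""
    # Assumed set of characters that are unsafe in filenames or wiki titles.
    cleaned = re.sub(r'[\\/:*?"<>|#\[\]{}]', " ", title)
    # Collapse runs of whitespace into single underscores.
    cleaned = re.sub(r"\s+", "_", cleaned.strip())
    # Truncate, then avoid ending on a dangling underscore.
    cleaned = cleaned[:max_len].rstrip("_")
    # Assumption: fall back to a fixed name when nothing usable remains.
    return (cleaned or "Untitled_Bill") + ".wiki"


print(sanitize_filename("An Act: Clean Air / Water"))
```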

Notes on formatting

  • Title consistency: the infobox now uses the same clean title shown in the page banner. If no title is detected, it falls back to "Untitled Bill" (not {{PAGENAME}}).
  • Introduced by: stray lines like Introduced By: Name are captured into the infobox and removed from the article body.
  • Section symbols (§): lines with multiple sections like §1 ... §2 ... are split so each section begins on its own line, improving readability.
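
The "Introduced by" and section-symbol rules above could be implemented along the following lines. This is a minimal sketch under stated assumptions, not the converter's actual code: the function names and regular expressions are hypothetical, and only the first "Introduced By:" line is captured.

```python
import re
from typing import List, Optional, Tuple

# Assumed pattern for a stray "Introduced By: Name" line.
INTRODUCED_RE = re.compile(r"^\s*Introduced\s+By\s*:\s*(.+?)\s*$", re.IGNORECASE)


def extract_introduced_by(text: str) -> Tuple[Optional[str], str]:
    """Return (sponsor, body) with the first 'Introduced By' line removed."""
    sponsor: Optional[str] = None
    kept: List[str] = []
    for line in text.splitlines():
        match = INTRODUCED_RE.match(line)
        if match and sponsor is None:
            sponsor = match.group(1)  # goes to the infobox, not the body
            continue
        kept.append(line)
    return sponsor, "\n".join(kept)


def split_sections(line: str) -> List[str]:
    """Split a line so that each section symbol starts its own line."""
    # Split before each section symbol, but not between the two symbols of "§§".
    parts = re.split(r"(?<!§)(?=§)", line)
    return [part.strip() for part in parts if part.strip()]


sponsor, body = extract_introduced_by("Title\nIntroduced By: Jane Doe\n§1 Body")
print(sponsor)
print(split_sections("§1 Short title. §2 Definitions."))
```

A splitter this simple would also break a line at an inline cross-reference such as "as provided in §2", so a real implementation would likely need a stricter pattern.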
