Converts PDF and DOCX legal documents into MediaWiki markup.
It's recommended to use a virtual environment to avoid system Python restrictions.
python3 -m venv .venv
. .venv/bin/activate
pip install pdfplumber python-docx tqdm
python main.py <input_dir> <output_dir>
Example:
python main.py attachments laws_output
This repository includes a simple dockerized MediaWiki setup and a helper script to bulk import the generated .wiki files as pages.
Steps:
- Start MediaWiki with the database
docker-compose up -d
- Complete the web installer once (first time only)
  - Open http://localhost:8080
  - Database settings to use during install:
    - DB host: db
    - DB name: mediawiki
    - DB user: wiki
    - DB password: change_me
  - When the installer offers LocalSettings.php, place it in the wiki root (the container volume at /var/www/html persists it). If you are prompted to download the file instead, copy it back into the container or a bind mount; with this compose file, it should be saved to the volume automatically.
- Import all .wiki files
  - The folder laws_output/ is mounted read-only inside the MediaWiki container at /import.
  - Run the helper script to import all files. It uses each filename (without .wiki) as the page title, converting underscores to spaces.
chmod +x import_wiki.sh
./import_wiki.sh
Optionally, import into a specific namespace (e.g., Laws):
./import_wiki.sh Laws
Notes:
- The script uses maintenance/importTextFiles.php with --overwrite, so re-running will update existing pages with the latest content.
- If you see an error about LocalSettings.php missing, finish the installer in step 2 first.
- Access the wiki at http://localhost:8080
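The title rule used during import (filename without .wiki, underscores converted to spaces) can be sketched in Python to preview which page each file would create. This is only an illustration, not the contents of import_wiki.sh: the function name page_title is hypothetical, and the "Namespace:Title" form for the optional namespace argument is an assumption about how the script applies it.

```python
from pathlib import Path


def page_title(wiki_file: str, namespace: str = "") -> str:
    """Derive a page title from a .wiki filename (hypothetical helper)."""
    stem = Path(wiki_file).stem        # drop the directory and the .wiki extension
    title = stem.replace("_", " ")     # underscores become spaces
    # Assumption: a namespace is applied as a "Namespace:" prefix on the title.
    return f"{namespace}:{title}" if namespace else title


print(page_title("laws_output/Clean_Air_Act.wiki"))          # Clean Air Act
print(page_title("laws_output/Clean_Air_Act.wiki", "Laws"))  # Laws:Clean Air Act
```

Running something like this over laws_output/ before an import may help spot filenames that would collide on the same page title.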
- Filename: sanitized document title (max 200 chars)
- Extension:
.wiki - Content: MediaWiki markup with proper hierarchy
- Title consistency: the infobox now uses the same clean title shown in the page banner. If no title is detected, it falls back to "Untitled Bill" (not
{{PAGENAME}}). - Introduced by: stray lines like
Introduced By: Nameare captured into the infobox and removed from the article body. - Section symbols (§): lines with multiple sections like
§1 ... §2 ...are split so each section begins on its own line, improving readability.