WaybackPress

Recover WordPress sites from the Internet Archive's Wayback Machine. This tool discovers, validates, and exports WordPress content from archived snapshots into a standard WordPress WXR import file.

Features

Automated URL discovery from Wayback Machine CDX API
Intelligent post validation with content heuristics
Multi-pass media fetching with automatic retries
Clean WXR 1.2 export compatible with WordPress Importer
Resumable operations with progress tracking
Configurable request throttling to respect archive.org
Detailed logging and reporting

Legal and Ethical Use

This tool is for personal archival and legitimate content recovery only.

You are responsible for:

Only recovering content you have legal rights to
Complying with Internet Archive's Terms of Service
Respecting copyright and intellectual property laws
Using conservative rate limiting (default: 5s delay, 2 concurrency)
Not using this for commercial scraping or bulk downloads

The tool has built-in safeguards (rate limiting, user-agent identification) but ultimately you are responsible for how you use it.

Installation

From Source (Recommended)

Python 3.12+ requires a virtual environment due to PEP 668. This is the recommended approach for all Python versions:

git clone https://github.com/stardothosting/shift8-waybackpress.git
cd shift8-waybackpress

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate  # On Linux/macOS
# OR
venv\Scripts\activate     # On Windows

# Install package
pip install -e .

# Verify installation
waybackpress --version

When you're done using the tool:

deactivate

For future use, always activate the virtual environment first:

cd shift8-waybackpress
source venv/bin/activate
waybackpress run example.com

Alternative: System-Wide Installation (Python 3.11 and older)

pip install -r requirements.txt
pip install -e .

Note: This method will fail on Python 3.12+ with an "externally-managed-environment" error.

Requirements

Python 3.8 or higher
Dependencies: beautifulsoup4, lxml, aiohttp, python-dateutil, trafilatura

Quick Start

The simplest way to recover a site:

waybackpress run example.com

To limit recovery to a specific date range (e.g., October 2018 to October 2025):

waybackpress run example.com --from 20181001 --to 20251031

This will run the complete pipeline: discover URLs, validate posts, fetch media, and generate a WordPress import file.

Usage

WaybackPress works in stages, allowing you to control each step of the recovery process.

Stage 1: Discover URLs

Query the Wayback Machine to find all archived URLs for your domain:

waybackpress discover example.com

Single URL Extraction: Extract just one specific post instead of the entire site:

waybackpress discover example.com --url https://example.com/2020/01/post-title/

Date Range Filtering: Limit discovery to specific date range:

waybackpress discover example.com --from 20181001 --to 20251031

This queries only snapshots between October 1, 2018 and October 31, 2025. Useful for:

Recovering content from specific time periods
Avoiding very old or very recent snapshots
Reducing processing time for large sites

Options:

--url URL: Extract a single specific URL instead of entire site
--from DATE: Start date (YYYYMMDD or YYYYMMDDHHMMSS format)
--to DATE: End date (YYYYMMDD or YYYYMMDDHHMMSS format)
--output DIR: Specify output directory (default: wayback-data/example.com)
--delay SECONDS: Delay between requests (default: 5)
--concurrency N: Concurrent requests (default: 2)

Stage 2: Validate Posts

Download and validate discovered URLs to identify actual blog posts:

waybackpress validate --output wayback-data/example.com

This stage:

Downloads HTML for each URL
Extracts metadata (title, date, author, categories, tags)
Identifies valid posts using content heuristics
Filters out archives, category pages, and duplicates
Generates a detailed validation report

Stage 3: Fetch Media

Download images, CSS, and JavaScript referenced in posts:

waybackpress fetch-media --output wayback-data/example.com

Options:

--pass N: Pass number for multi-pass fetching (default: 1)

The media fetcher:

Parses HTML to extract all media URLs
Queries CDX API for available snapshots
Attempts multiple snapshots if initial fetch fails
Tracks successes and failures for additional passes
Saves progress incrementally

Multi-Pass Media Fetching

If the first pass has a low success rate, run additional passes:

waybackpress fetch-media --output wayback-data/example.com --pass 2

Each pass attempts different snapshots, increasing the likelihood of recovery.

Stage 4: Export to WordPress

Generate a WordPress WXR import file:

waybackpress export --output wayback-data/example.com

Options:

--title TEXT: Site title for export (default: domain name)
--url URL: Site URL for export (default: http://domain)
--author-name NAME: Post author name (default: admin)
--author-email EMAIL: Post author email (default: [email protected])

Complete Pipeline

Run all stages at once:

waybackpress run example.com

With date range:

waybackpress run example.com --from 20181001 --to 20251031

Options:

--skip-media: Skip media fetching
--output DIR: Output directory
--delay SECONDS: Request delay
--concurrency N: Concurrent requests
--from DATE: Start date (YYYYMMDD or YYYYMMDDHHMMSS)
--to DATE: End date (YYYYMMDD or YYYYMMDDHHMMSS)
All export options (--title, --url, --author-name, --author-email)

Output Structure

WaybackPress creates the following directory structure:

wayback-data/
└── example.com/
    ├── config.json              # Project configuration
    ├── waybackpress.log         # Detailed logs
    ├── discovered_urls.tsv      # All discovered URLs
    ├── valid_posts.tsv          # Validated post URLs
    ├── validation_report.csv    # Detailed validation results
    ├── media_report.csv         # Media fetch results
    ├── wordpress-export.xml     # Final WXR import file
    ├── html/                    # Downloaded HTML files
    │   └── post-slug.html
    └── media/                   # Downloaded media assets
        └── example.com/
            └── wp-content/
                └── uploads/

Configuration

Each project maintains a config.json file with settings and state:

{
  "domain": "example.com",
  "output_dir": "wayback-data/example.com",
  "delay": 5.0,
  "concurrency": 2,
  "skip_media": false,
  "discovered": true,
  "validated": true,
  "media_fetched": true,
  "exported": true
}

Best Practices

Respecting Archive.org

The Wayback Machine is a free public resource. Be respectful:

Use the default 5-second delay between requests
Keep concurrency at 2 or lower
Run during off-peak hours for large sites
Consider multiple sessions for sites with thousands of posts

Media Recovery

Media fetching has inherent limitations:

Not all media is archived
Some snapshots may be corrupted
Success rates typically range from 30-50%

Strategies to improve recovery:

Run multiple passes (2-3 recommended)
Increase delay and decrease concurrency for better reliability
Review media_report.csv to identify patterns in failures
Consider manual recovery for high-value assets

Validation Heuristics

The validator applies several filters:

Minimum content length (200 characters)
Duplicate detection (content hash)
URL pattern matching (excludes /category/, /tag/, /feed/)
Date validation

Review validation_report.csv to verify results and adjust if needed.

Importing into WordPress

After generating the WXR file:

Log into your WordPress admin panel
Go to Tools → Import → WordPress
Install the WordPress Importer if prompted
Upload wordpress-export.xml
Assign post authors and choose import options
Click "Run Importer"

Media Files

Media files must be uploaded separately:

Connect to your server via SFTP/SSH
Navigate to wp-content/uploads/
Upload the contents of the media/ directory
Preserve the directory structure (domain/wp-content/uploads/)

Alternatively, use WP-CLI:

wp media regenerate --yes

Troubleshooting

Installation Issues

Error: "externally-managed-environment"

You're using Python 3.12+ which requires virtual environments. Follow the recommended installation steps above using python3 -m venv venv.

Error: "Cannot update time stamp of directory 'waybackpress.egg-info'"

The egg-info directory is owned by root. Remove it and reinstall:

sudo rm -rf waybackpress.egg-info
python3 -m venv venv
source venv/bin/activate
pip install -e .

ModuleNotFoundError: No module named 'trafilatura'

The setup.py is missing the trafilatura dependency. This is fixed in the latest version. If you're using an older version:

pip install trafilatura>=2.0.0

No Posts Found

Verify the domain is archived: https://web.archive.org/
Check if posts use non-standard URL patterns
Review discovered_urls.tsv to see what was found
Adjust URL filtering logic in utils.py if needed

Low Media Success Rate

Run additional passes with --pass 2, --pass 3
Reduce concurrency: --concurrency 1
Increase delay: --delay 10
Check media_report.csv for failure patterns

Import Errors

Validate XML: xmllint --noout wordpress-export.xml
Check WordPress error logs
Ensure server has adequate memory (php.ini: memory_limit)
Split large imports into smaller batches

Development

Run tests:

python -m pytest tests/

Format code:

black waybackpress/

Type checking:

mypy waybackpress/

Project Structure

waybackpress/
├── __init__.py       # Package metadata
├── __main__.py       # Entry point for python -m
├── cli.py            # Command-line interface
├── config.py         # Configuration management
├── utils.py          # Shared utilities
├── discover.py       # URL discovery
├── validate.py       # Post validation
├── fetch.py          # Media fetching
└── export.py         # WXR generation

Known Limitations

Only works with WordPress sites (other CMSs not supported)
Requires posts to be archived in Wayback Machine
Media recovery depends on archive availability
Some dynamic content (comments, widgets) may not preserve perfectly
Wayback snapshots may have inconsistent timestamps

Contributing

Contributions are welcome. Please:

Fork the repository
Create a feature branch
Make your changes with tests
Submit a pull request

License

MIT License. See LICENSE file for details.

Credits

Developed by Shift8 Web for the WordPress community.

Built using:

BeautifulSoup4 for HTML parsing
aiohttp for async HTTP requests
python-dateutil for flexible date parsing
lxml for XML processing

Changelog

0.1.0 (Initial Release)

URL discovery from Wayback CDX API
Post validation with content heuristics
Multi-pass media fetching
WXR 1.2 export generation
Resumable operations
Progress tracking and reporting

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
examples		examples
tests		tests
waybackpress		waybackpress
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

License

stardothosting/shift8-waybackpress

Folders and files

Latest commit

History

Repository files navigation

WaybackPress

Features

Legal and Ethical Use

Installation

From Source (Recommended)

Alternative: System-Wide Installation (Python 3.11 and older)

Requirements

Quick Start

Usage

Stage 1: Discover URLs

Stage 2: Validate Posts

Stage 3: Fetch Media

Multi-Pass Media Fetching

Stage 4: Export to WordPress

Complete Pipeline

Output Structure

Configuration

Best Practices

Respecting Archive.org

Media Recovery

Validation Heuristics

Importing into WordPress

Media Files

Troubleshooting

Installation Issues

No Posts Found

Low Media Success Rate

Import Errors

Development

Project Structure

Known Limitations

Contributing

License

Credits

Changelog

0.1.0 (Initial Release)

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages