Recover WordPress sites from the Internet Archive's Wayback Machine. This tool discovers, validates, and exports WordPress content from archived snapshots into a standard WordPress WXR import file.
- Automated URL discovery from Wayback Machine CDX API
- Intelligent post validation with content heuristics
- Multi-pass media fetching with automatic retries
- Clean WXR 1.2 export compatible with WordPress Importer
- Resumable operations with progress tracking
- Configurable request throttling to respect archive.org
- Detailed logging and reporting
This tool is for personal archival and legitimate content recovery only.
You are responsible for:
- Only recovering content you have legal rights to
- Complying with Internet Archive's Terms of Service
- Respecting copyright and intellectual property laws
- Using conservative rate limiting (default: 5s delay, 2 concurrency)
- Not using this for commercial scraping or bulk downloads
The tool has built-in safeguards (rate limiting, user-agent identification) but ultimately you are responsible for how you use it.
Python 3.12+ requires a virtual environment due to PEP 668. This is the recommended approach for all Python versions:
git clone https://github.com/stardothosting/shift8-waybackpress.git
cd shift8-waybackpress
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
source venv/bin/activate # On Linux/macOS
# OR
venv\Scripts\activate # On Windows
# Install package
pip install -e .
# Verify installation
waybackpress --version
When you're done using the tool:
deactivate
For future use, always activate the virtual environment first:
cd shift8-waybackpress
source venv/bin/activate
waybackpress run example.com
Alternatively, you can install the package directly with pip:
pip install -r requirements.txt
pip install -e .
Note: This method will fail on Python 3.12+ with an "externally-managed-environment" error.
- Python 3.8 or higher
- Dependencies: beautifulsoup4, lxml, aiohttp, python-dateutil, trafilatura
The simplest way to recover a site:
waybackpress run example.com
To limit recovery to a specific date range (e.g., October 2018 to October 2025):
waybackpress run example.com --from 20181001 --to 20251031
This runs the complete pipeline: discover URLs, validate posts, fetch media, and generate a WordPress import file.
WaybackPress works in stages, allowing you to control each step of the recovery process.
Query the Wayback Machine to find all archived URLs for your domain:
waybackpress discover example.com
Single URL Extraction: Extract just one specific post instead of the entire site:
waybackpress discover example.com --url https://example.com/2020/01/post-title/
Date Range Filtering: Limit discovery to a specific date range:
waybackpress discover example.com --from 20181001 --to 20251031
This queries only snapshots between October 1, 2018 and October 31, 2025. Useful for:
- Recovering content from specific time periods
- Avoiding very old or very recent snapshots
- Reducing processing time for large sites
Options:
- --url URL: Extract a single specific URL instead of the entire site
- --from DATE: Start date (YYYYMMDD or YYYYMMDDHHMMSS format)
- --to DATE: End date (YYYYMMDD or YYYYMMDDHHMMSS format)
- --output DIR: Specify output directory (default: wayback-data/example.com)
- --delay SECONDS: Delay between requests (default: 5)
- --concurrency N: Concurrent requests (default: 2)
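Under the hood, discovery boils down to a query against the documented Wayback CDX API. The sketch below is illustrative rather than WaybackPress's actual code (which lives in discover.py), but the endpoint and parameters are the real API:

```python
# Minimal sketch of a CDX query like the one the discover stage performs.
import json
import urllib.parse
import urllib.request

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def discover_urls(domain, from_date=None, to_date=None):
    params = {
        "url": f"{domain}/*",        # all captures under the domain
        "output": "json",            # JSON rows; the first row is a header
        "fl": "timestamp,original",  # only the fields we need
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "urlkey",        # one row per unique URL
    }
    if from_date:
        params["from"] = from_date   # YYYYMMDD or YYYYMMDDHHMMSS
    if to_date:
        params["to"] = to_date
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{CDX_ENDPOINT}?{query}") as resp:
        rows = json.load(resp)
    return rows[1:]  # drop the header row

for timestamp, original in discover_urls("example.com", "20181001", "20251031")[:10]:
    print(timestamp, original)
```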
Download and validate discovered URLs to identify actual blog posts:
waybackpress validate --output wayback-data/example.com
This stage:
- Downloads HTML for each URL
- Extracts metadata (title, date, author, categories, tags)
- Identifies valid posts using content heuristics
- Filters out archives, category pages, and duplicates
- Generates a detailed validation report
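As an illustration of the extraction step, the sketch below pulls a title, date, and categories from markup that common WordPress themes emit. The selectors are assumptions about typical theme output; the tool's real heuristics in validate.py are more involved:

```python
# Illustrative metadata extraction from one archived WordPress page.
from bs4 import BeautifulSoup

def extract_metadata(html):
    soup = BeautifulSoup(html, "lxml")
    meta = {}
    # Title: prefer Open Graph, fall back to <title>
    og_title = soup.find("meta", property="og:title")
    meta["title"] = og_title["content"] if og_title else (
        soup.title.get_text(strip=True) if soup.title else None)
    # Date: WordPress themes commonly emit article:published_time or <time>
    published = soup.find("meta", property="article:published_time")
    time_tag = soup.find("time", attrs={"datetime": True})
    meta["date"] = (published["content"] if published
                    else time_tag["datetime"] if time_tag else None)
    # Categories: rel="category tag" links are a classic WordPress marker
    meta["categories"] = sorted({a.get_text(strip=True)
                                 for a in soup.select('a[rel~="category"]')})
    return meta
```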
Download images, CSS, and JavaScript referenced in posts:
waybackpress fetch-media --output wayback-data/example.com
Options:
- --pass N: Pass number for multi-pass fetching (default: 1)
The media fetcher:
- Parses HTML to extract all media URLs
- Queries CDX API for available snapshots
- Attempts multiple snapshots if initial fetch fails
- Tracks successes and failures for additional passes
- Saves progress incrementally
If the first pass has a low success rate, run additional passes:
waybackpress fetch-media --output wayback-data/example.com --pass 2
Each pass attempts different snapshots, increasing the likelihood of recovery.
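The exact snapshot-selection policy is internal to fetch.py; the sketch below shows one plausible reading of the multi-pass idea, where pass N starts further down the candidate list so a rerun reaches captures the first pass never tried. The CDX query and the "id_" raw-content URL form are real; the pass policy and helper names are assumptions:

```python
# Hypothetical sketch of multi-pass media fetching.
import json
import urllib.parse
import urllib.request

def snapshot_candidates(media_url):
    """All successful captures of one asset, newest first."""
    query = urllib.parse.urlencode({
        "url": media_url,
        "output": "json",
        "fl": "timestamp",
        "filter": "statuscode:200",
    })
    with urllib.request.urlopen(f"https://web.archive.org/cdx/search/cdx?{query}") as resp:
        rows = json.load(resp)[1:]  # drop the header row
    timestamps = sorted((row[0] for row in rows), reverse=True)
    # "id_" asks the Wayback Machine for the original bytes, without its toolbar
    return [f"https://web.archive.org/web/{ts}id_/{media_url}" for ts in timestamps]

def fetch_with_passes(media_url, pass_number=1):
    # Pass 1 starts at the newest snapshot, pass 2 one further down, so a
    # rerun with --pass 2 attempts snapshots the first pass skipped.
    for archived_url in snapshot_candidates(media_url)[pass_number - 1:]:
        try:
            with urllib.request.urlopen(archived_url, timeout=30) as resp:
                return resp.read()
        except OSError:
            continue  # broken or missing snapshot: try the next one
    return None  # record as a failure for a later pass
```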
Generate a WordPress WXR import file:
waybackpress export --output wayback-data/example.com
Options:
- --title TEXT: Site title for export (default: domain name)
- --url URL: Site URL for export (default: http://domain)
- --author-name NAME: Post author name (default: admin)
- --author-email EMAIL: Post author email (default: [email protected])
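For context, WXR is RSS 2.0 plus WordPress's export namespace. Here is a minimal sketch of the kind of document the export stage produces, built with lxml (a real export carries many more fields per item; this is not WaybackPress's actual export code):

```python
# Minimal WXR 1.2 skeleton generated with lxml.
from lxml import etree

NSMAP = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",
}

def build_wxr(site_title, site_url, posts):
    rss = etree.Element("rss", version="2.0", nsmap=NSMAP)
    channel = etree.SubElement(rss, "channel")
    etree.SubElement(channel, "title").text = site_title
    etree.SubElement(channel, "link").text = site_url
    etree.SubElement(channel, f"{{{NSMAP['wp']}}}wxr_version").text = "1.2"
    for post in posts:
        item = etree.SubElement(channel, "item")
        etree.SubElement(item, "title").text = post["title"]
        etree.SubElement(item, f"{{{NSMAP['dc']}}}creator").text = post["author"]
        body = etree.SubElement(item, f"{{{NSMAP['content']}}}encoded")
        body.text = etree.CDATA(post["html"])  # post body wrapped in CDATA
        etree.SubElement(item, f"{{{NSMAP['wp']}}}post_type").text = "post"
        etree.SubElement(item, f"{{{NSMAP['wp']}}}status").text = "publish"
        # WordPress expects "YYYY-MM-DD HH:MM:SS" here
        etree.SubElement(item, f"{{{NSMAP['wp']}}}post_date").text = post["date"]
    return etree.tostring(rss, xml_declaration=True, encoding="UTF-8",
                          pretty_print=True)
```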
Run all stages at once:
waybackpress run example.com
With a date range:
waybackpress run example.com --from 20181001 --to 20251031
Options:
- --skip-media: Skip media fetching
- --output DIR: Output directory
- --delay SECONDS: Request delay
- --concurrency N: Concurrent requests
- --from DATE: Start date (YYYYMMDD or YYYYMMDDHHMMSS)
- --to DATE: End date (YYYYMMDD or YYYYMMDDHHMMSS)
- All export options (--title, --url, --author-name, --author-email)
WaybackPress creates the following directory structure:
wayback-data/
└── example.com/
├── config.json # Project configuration
├── waybackpress.log # Detailed logs
├── discovered_urls.tsv # All discovered URLs
├── valid_posts.tsv # Validated post URLs
├── validation_report.csv # Detailed validation results
├── media_report.csv # Media fetch results
├── wordpress-export.xml # Final WXR import file
├── html/ # Downloaded HTML files
│ └── post-slug.html
└── media/ # Downloaded media assets
└── example.com/
└── wp-content/
└── uploads/
Each project maintains a config.json file with settings and state:
{
"domain": "example.com",
"output_dir": "wayback-data/example.com",
"delay": 5.0,
"concurrency": 2,
"skip_media": false,
"discovered": true,
"validated": true,
"media_fetched": true,
"exported": true
}
The Wayback Machine is a free public resource. Be respectful:
- Use the default 5-second delay between requests
- Keep concurrency at 2 or lower
- Run during off-peak hours for large sites
- Consider multiple sessions for sites with thousands of posts
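As a sketch of what this looks like in code (not the tool's internals), throttling can be implemented with an asyncio semaphore plus a fixed delay, mirroring the defaults above (5 s delay, concurrency 2). The User-Agent string here is a made-up example:

```python
# Illustrative polite fetching with aiohttp: capped concurrency + delay.
import asyncio
import aiohttp

async def fetch_all(urls, delay=5.0, concurrency=2):
    semaphore = asyncio.Semaphore(concurrency)  # cap in-flight requests
    headers = {"User-Agent": "waybackpress-example/0.1"}  # identify yourself
    async with aiohttp.ClientSession(headers=headers) as session:
        async def fetch(url):
            async with semaphore:
                async with session.get(url) as resp:
                    body = await resp.read()
                await asyncio.sleep(delay)  # pause before freeing the slot
                return url, resp.status, len(body)
        return await asyncio.gather(*(fetch(u) for u in urls))

# asyncio.run(fetch_all(["https://web.archive.org/web/2020/https://example.com/"]))
```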
Media fetching has inherent limitations:
- Not all media is archived
- Some snapshots may be corrupted
- Success rates typically range from 30% to 50%
Strategies to improve recovery:
- Run multiple passes (2-3 recommended)
- Increase delay and decrease concurrency for better reliability
- Review media_report.csv to identify patterns in failures
- Consider manual recovery for high-value assets
The validator applies several filters:
- Minimum content length (200 characters)
- Duplicate detection (content hash)
- URL pattern matching (excludes /category/, /tag/, /feed/)
- Date validation
Review validation_report.csv to verify results and adjust if needed.
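In code form, those filters amount to something like the sketch below. The thresholds and excluded URL patterns mirror the list above; the function and variable names are illustrative, not WaybackPress internals:

```python
# Sketch of the validation filters: length, URL pattern, dedup, date.
import hashlib
import re

EXCLUDED = re.compile(r"/(category|tag|feed)/")
seen_hashes = set()

def is_valid_post(url, text, date):
    if len(text) < 200:        # minimum content length
        return False
    if EXCLUDED.search(url):   # archive, category, tag, and feed pages
        return False
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:  # duplicate detection via content hash
        return False
    seen_hashes.add(digest)
    return date is not None    # date validation: a date must have parsed
```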
After generating the WXR file:
- Log into your WordPress admin panel
- Go to Tools → Import → WordPress
- Install the WordPress Importer if prompted
- Upload wordpress-export.xml
- Assign post authors and choose import options
- Click "Run Importer"
Media files must be uploaded separately:
- Connect to your server via SFTP/SSH
- Navigate to wp-content/uploads/
- Upload the contents of the media/ directory
- Preserve the directory structure (domain/wp-content/uploads/)
Alternatively, use WP-CLI:
wp media regenerate --yes
Error: "externally-managed-environment"
You're using Python 3.12+, which requires virtual environments. Follow the recommended installation steps above using python3 -m venv venv.
Error: "Cannot update time stamp of directory 'waybackpress.egg-info'"
The egg-info directory is owned by root. Remove it and reinstall:
sudo rm -rf waybackpress.egg-info
python3 -m venv venv
source venv/bin/activate
pip install -e .
ModuleNotFoundError: No module named 'trafilatura'
The setup.py is missing the trafilatura dependency. This is fixed in the latest version. If you're using an older version:
pip install "trafilatura>=2.0.0"
If no posts are discovered:
- Verify the domain is archived: https://web.archive.org/
- Check if posts use non-standard URL patterns
- Review discovered_urls.tsv to see what was found
- Adjust the URL filtering logic in utils.py if needed
If media success rates are low:
- Run additional passes with --pass 2, --pass 3
- Reduce concurrency: --concurrency 1
- Increase delay: --delay 10
- Check media_report.csv for failure patterns
If the WordPress import fails:
- Validate the XML: xmllint --noout wordpress-export.xml
- Check WordPress error logs
- Ensure the server has adequate memory (php.ini: memory_limit)
- Split large imports into smaller batches
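If xmllint isn't installed, lxml (already a WaybackPress dependency) can run the same well-formedness check; a small stand-alone sketch:

```python
# Check the export file for XML well-formedness using lxml.
from lxml import etree

try:
    etree.parse("wayback-data/example.com/wordpress-export.xml")
    print("XML is well-formed")
except etree.XMLSyntaxError as err:
    print(f"XML error: {err}")
```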
Run tests:
python -m pytest tests/
Format code:
black waybackpress/
Type checking:
mypy waybackpress/
Package layout:
waybackpress/
├── __init__.py # Package metadata
├── __main__.py # Entry point for python -m
├── cli.py # Command-line interface
├── config.py # Configuration management
├── utils.py # Shared utilities
├── discover.py # URL discovery
├── validate.py # Post validation
├── fetch.py # Media fetching
└── export.py # WXR generation
- Only works with WordPress sites (other CMSs not supported)
- Requires posts to be archived in Wayback Machine
- Media recovery depends on archive availability
- Some dynamic content (comments, widgets) may not be preserved perfectly
- Wayback snapshots may have inconsistent timestamps
Contributions are welcome. Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a pull request
MIT License. See LICENSE file for details.
Developed by Shift8 Web for the WordPress community.
Built using:
- BeautifulSoup4 for HTML parsing
- aiohttp for async HTTP requests
- python-dateutil for flexible date parsing
- lxml for XML processing