A Python script that scrapes BuiltWith.com pages for the technologies websites use and compiles them into a unique, sorted list.
This project compiles the technologies a website uses by scraping its BuiltWith.com page. You can list multiple websites in an endpoints.txt file (one per line) to process them in batch. The script deduplicates the results so that only unique technologies appear in the final output.
- Batch Processing: Add multiple website endpoints via endpoints.txt.
- Data Extraction: Scrapes technology data from BuiltWith.com pages.
- Deduplication: Outputs only unique technology names.
- Sorted Output: Produces a sorted list of technologies in output.txt.
- Python 3.x
- requests
- BeautifulSoup4
- Clone or Download the Repository
  git clone https://github.com/thejessicafelts/builtwith-scraper.git
  cd builtwith-scraper
- Install Dependencies
  Use pip to install the necessary libraries:
  pip install requests beautifulsoup4
- Prepare Endpoints File
  Create an endpoints.txt file in the project directory. Each line should contain the website endpoint (the part of the URL following http://builtwith.com/). For example:
  example1.com
  example2.com
  example3.com
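To illustrate how an endpoints file maps to BuiltWith URLs, here is a minimal sketch. The function names `build_urls` and `load_urls` are hypothetical and are not taken from the project's script; skipping blank lines and trimming whitespace are assumptions about how such a file would reasonably be handled.

```python
from pathlib import Path

# Base URL as stated in this README.
BASE_URL = "http://builtwith.com/"


def build_urls(lines):
    """Turn raw endpoint lines into full BuiltWith URLs.

    Assumption: blank lines are skipped and surrounding whitespace
    (plus any leading slash) is removed before joining.
    """
    urls = []
    for line in lines:
        endpoint = line.strip().lstrip("/")
        if endpoint:
            urls.append(BASE_URL + endpoint)
    return urls


def load_urls(path="endpoints.txt"):
    """Read the endpoints file and return one URL per non-empty line."""
    text = Path(path).read_text(encoding="utf-8")
    return build_urls(text.splitlines())
```

With the example file above, `load_urls()` would yield one URL per site, such as `http://builtwith.com/example1.com`.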
Run the Python script:
python3 script.py
The script will:
- Read endpoints from endpoints.txt
- Construct full URLs by appending endpoints to the base URL http://builtwith.com/
- Scrape technology data from each page
- Deduplicate and sort the technology names
- Write the unique list to output.txt
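The scrape, deduplicate, sort, and write steps above could be sketched roughly as follows. This is not the project's actual script.py: the function names are hypothetical, and the CSS selector in particular is a placeholder, since this README does not describe BuiltWith's page markup. Inspect a real page and adjust the selector before relying on it. Sorting here uses Python's default (case-sensitive) string order, which is also an assumption.

```python
def fetch_html(url, timeout=10):
    """Download one page and return its HTML text."""
    # Imported here so the pure helpers below work without third-party packages.
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def extract_technologies(html, selector="h2 a"):
    """Return technology names found in the HTML.

    ASSUMPTION: "h2 a" is a placeholder selector, not BuiltWith's
    confirmed markup.
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


def unique_sorted(names):
    """Strip whitespace, drop empty names, deduplicate, and sort."""
    return sorted({name.strip() for name in names if name.strip()})


def scrape_all(urls, output_path="output.txt"):
    """Scrape every URL and write the unique, sorted names, one per line."""
    names = []
    for url in urls:
        names.extend(extract_technologies(fetch_html(url)))
    with open(output_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(unique_sorted(names)) + "\n")
```

Using a set for deduplication keeps the step simple; names that differ only in letter case would still be treated as distinct entries.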
Contributions are welcome! Feel free to open an issue or submit a pull request for any improvements or bug fixes.