This tool is an advanced web crawler that asynchronously traverses directories hosted on a web server, flagging sensitive files (such as `.env` files, configuration files, and private keys) and scanning file contents for potential secrets (API keys, passwords, database URLs, and so on).
It builds upon the concept of directory browsing to discover files and then performs content analysis to flag security concerns.
- Asynchronous Crawling: Uses `aiohttp` and `asyncio` for efficient, concurrent fetching of web resources (see the crawl sketch after this list).
- Depth and Limit Control: Configurable maximum crawl depth and total URL limit to prevent runaway scans.
- Intelligent Link Extraction: Employs `BeautifulSoup` to parse HTML directory listings and extract links, attempting to handle various server listing formats.
- Comprehensive Secret Detection: Searches file contents using a wide range of regular expressions targeting common secret formats (API keys, DB URLs, cloud credentials, etc.); see the detection sketch after this list.
- Sensitive File Identification: Recognizes potentially sensitive filenames based on patterns (e.g., `.env`, `config.json`, `wp-config.php`).
- Private Key Detection: Specifically looks for private key content and filenames.
- Directory Listing Detection: Identifies pages that appear to be server-generated directory listings.
- Configurable Ignoring: Skips common irrelevant directories (like `node_modules`, `.git`) and file types (images, videos, archives).
- Detailed Reporting: Generates a JSON report summarizing findings, including discovered secrets, sensitive files, and a sitemap of visited URLs. Also prints a summary to the console.
- Modular Design: Code is separated into distinct modules for configuration, core logic, data models, detection rules, and utilities for better maintainability.
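The asynchronous crawl boils down to a pool of worker tasks pulling URLs from a shared `asyncio` queue and fetching them through one `aiohttp` session, with depth and URL-count checks gating each fetch. Below is a minimal sketch of that pattern, not the project's actual code: the `worker`/`crawl` names, the timeout, and the hard-coded constants are illustrative (the constants simply mirror the documented CLI defaults).

```python
import asyncio
import aiohttp

# Illustrative constants mirroring the documented CLI defaults.
MAX_DEPTH = 10
MAX_URLS = 5000
CONCURRENCY = 20

async def worker(session: aiohttp.ClientSession, queue: asyncio.Queue, seen: set) -> None:
    """Pull (url, depth) pairs off the queue and fetch them until cancelled."""
    while True:
        url, depth = await queue.get()
        try:
            if depth > MAX_DEPTH or len(seen) >= MAX_URLS or url in seen:
                continue
            seen.add(url)
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
                body = await resp.text(errors="replace")
            # ...scan `body` for secrets here, and enqueue extracted links at depth + 1...
        except aiohttp.ClientError:
            pass  # unreachable hosts and bad responses are simply skipped
        finally:
            queue.task_done()

async def crawl(start_url: str) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    seen: set = set()
    await queue.put((start_url, 0))
    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(worker(session, queue, seen)) for _ in range(CONCURRENCY)]
        await queue.join()        # block until every queued URL has been processed
        for task in workers:
            task.cancel()         # shut the now-idle workers down

# asyncio.run(crawl("http://example.com"))
```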
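Secret detection is regex-driven: each rule pairs a name with a compiled pattern, and every fetched file body is run through all of them, with a separate filename check for sensitive files. The snippet below sketches that idea only; the rule names, the expressions, and the `scan_content` helper are assumptions for illustration, not the shipped detection rules.

```python
import re
from typing import Dict, List

# Illustrative patterns only; the shipped detection rules are broader.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(r"(?i)\b(?:api[_-]?key|secret)\b\s*[:=]\s*['\"]?([A-Za-z0-9_\-]{16,})"),
    "database_url": re.compile(r"\b(?:postgres(?:ql)?|mysql|mongodb)://[^\s'\"]+"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}

# Filenames that are flagged as sensitive regardless of content.
SENSITIVE_FILENAMES = re.compile(r"(?i)(?:^\.env$|^wp-config\.php$|^config\.json$|\.pem$|\.key$)")

def scan_content(url: str, text: str) -> List[Dict[str, str]]:
    """Return one finding per pattern match, tagged with the rule name and source URL."""
    findings = []
    for rule, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({"rule": rule, "url": url, "match": match.group(0)[:80]})
    return findings
```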
Requires Python 3.7+ and the following packages:

- `aiohttp` (for asynchronous HTTP requests)
- `beautifulsoup4` (for HTML parsing)
- Clone or download this repository.
- Install the required Python packages: `pip install -r requirements.txt`
Run the script from the command line, providing the target URL as the main argument.
```
python main.py <target_url> [options]
```

- `<target_url>`: The starting URL of the website to scan (e.g., `http://example.com`).
- `--max-depth DEPTH`: Maximum directory depth to crawl (default: 10).
- `--max-urls COUNT`: Maximum number of URLs to fetch and scan (default: 5000).
- `--concurrency WORKERS`: Number of concurrent worker tasks for fetching (default: 20).
- `--output FILENAME`: Name of the JSON output report file (default: `secrets_report.json`).
- `-v`, `--verbose`: Enable verbose logging (shows DEBUG level messages).
- `-h`, `--help`: Show help message and exit.
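These flags map naturally onto a small `argparse` setup. The sketch below shows one plausible wiring, using only the flags and defaults listed above; the function name and help strings are illustrative and may differ from what `main.py` actually does.

```python
import argparse

def parse_args() -> argparse.Namespace:
    """Build a CLI matching the options documented above (defaults mirror the documented ones)."""
    parser = argparse.ArgumentParser(
        description="Crawl a website's directory listings and scan files for exposed secrets."
    )
    parser.add_argument("target_url", help="Starting URL of the website to scan, e.g. http://example.com")
    parser.add_argument("--max-depth", type=int, default=10, help="Maximum directory depth to crawl")
    parser.add_argument("--max-urls", type=int, default=5000, help="Maximum number of URLs to fetch and scan")
    parser.add_argument("--concurrency", type=int, default=20, help="Number of concurrent worker tasks")
    parser.add_argument("--output", default="secrets_report.json", help="Name of the JSON output report file")
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable DEBUG-level logging")
    return parser.parse_args()
```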
For example, to scan http://test.local with a maximum depth of 5, up to 1000 URLs, using 10 concurrent workers, saving the report as `my_report.json`, with verbose output enabled:

```
python main.py http://test.local --max-depth 5 --max-urls 1000 --concurrency 10 --output my_report.json --verbose
```

- Ignored Paths & Files: The lists `IGNORED_PATHS` and `IGNORED_FILE_PATTERNS` in `config/settings.py` control which directories and file types are skipped during the crawl. You can modify these lists to customize the spider's behavior (a sketch of their shape follows below).
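As a rough idea of what to edit, the two lists might be shaped like this; the entries and pattern syntax shown here are assumptions based on the ignore behavior described above, so check `config/settings.py` for the shipped defaults.

```python
# config/settings.py (excerpt) -- illustrative entries, not the shipped defaults
IGNORED_PATHS = [
    "node_modules",  # dependency folders are never worth crawling
    ".git",          # version-control internals
]

IGNORED_FILE_PATTERNS = [
    r"\.(?:png|jpe?g|gif|svg)$",  # images
    r"\.(?:mp4|avi|mkv|mov)$",    # videos
    r"\.(?:zip|tar|gz|rar|7z)$",  # archives
]
```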
