This tool is an advanced web crawler that asynchronously traverses directories hosted on a web server, flagging sensitive files (such as `.env` files, configuration files, and private keys) and scanning file contents for potential secrets (API keys, passwords, database URLs, and so on).
It builds upon the concept of directory browsing to discover files and then performs content analysis to flag security concerns.
- Asynchronous Crawling: Uses `aiohttp` and `asyncio` for efficient, concurrent fetching of web resources (see the crawl sketch after this list).
- Depth and Limit Control: Configurable maximum crawl depth and total URL limit to prevent runaway scans.
- Intelligent Link Extraction: Employs `BeautifulSoup` to parse HTML directory listings and extract links, attempting to handle various server listing formats.
- Comprehensive Secret Detection: Searches file contents using a wide range of regular expressions targeting common secret formats (API keys, DB URLs, cloud credentials, etc.); see the detection sketch after this list.
- Sensitive File Identification: Recognizes potentially sensitive filenames based on patterns (e.g., `.env`, `config.json`, `wp-config.php`).
- Private Key Detection: Specifically looks for private key content and filenames.
- Directory Listing Detection: Identifies pages that appear to be server-generated directory listings.
- Configurable Ignoring: Skips common irrelevant directories (like `node_modules`, `.git`) and file types (images, videos, archives).
- Detailed Reporting: Generates a JSON report summarizing findings, including discovered secrets, sensitive files, and a sitemap of visited URLs. Also prints a summary to the console.
- Modular Design: Code is separated into distinct modules for configuration, core logic, data models, detection rules, and utilities for better maintainability.
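The asynchronous crawl boils down to a pool of worker tasks pulling URLs from a shared `asyncio` queue and fetching them through one `aiohttp` session, with depth and URL-count checks gating each fetch. Below is a minimal sketch of that pattern, not the project's actual code: the `worker`/`crawl` names, the timeout, and the hard-coded constants are illustrative (the constants simply mirror the documented CLI defaults).

```python
import asyncio
import aiohttp

# Illustrative constants mirroring the documented CLI defaults.
MAX_DEPTH = 10
MAX_URLS = 5000
CONCURRENCY = 20

async def worker(session: aiohttp.ClientSession, queue: asyncio.Queue, seen: set) -> None:
    """Pull (url, depth) pairs off the queue and fetch them until cancelled."""
    while True:
        url, depth = await queue.get()
        try:
            if depth > MAX_DEPTH or len(seen) >= MAX_URLS or url in seen:
                continue
            seen.add(url)
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
                body = await resp.text(errors="replace")
            # ...scan `body` for secrets here, and enqueue extracted links at depth + 1...
        except aiohttp.ClientError:
            pass  # unreachable hosts and bad responses are simply skipped
        finally:
            queue.task_done()

async def crawl(start_url: str) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    seen: set = set()
    await queue.put((start_url, 0))
    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(worker(session, queue, seen)) for _ in range(CONCURRENCY)]
        await queue.join()        # block until every queued URL has been processed
        for task in workers:
            task.cancel()         # shut the now-idle workers down

# asyncio.run(crawl("http://example.com"))
```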
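Secret detection is regex-driven: each rule pairs a name with a compiled pattern, and every fetched file body is run through all of them, with a separate filename check for sensitive files. The snippet below sketches that idea only; the rule names, the expressions, and the `scan_content` helper are assumptions for illustration, not the shipped detection rules.

```python
import re
from typing import Dict, List

# Illustrative patterns only; the shipped detection rules are broader.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(r"(?i)\b(?:api[_-]?key|secret)\b\s*[:=]\s*['\"]?([A-Za-z0-9_\-]{16,})"),
    "database_url": re.compile(r"\b(?:postgres(?:ql)?|mysql|mongodb)://[^\s'\"]+"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}

# Filenames that are flagged as sensitive regardless of content.
SENSITIVE_FILENAMES = re.compile(r"(?i)(?:^\.env$|^wp-config\.php$|^config\.json$|\.pem$|\.key$)")

def scan_content(url: str, text: str) -> List[Dict[str, str]]:
    """Return one finding per pattern match, tagged with the rule name and source URL."""
    findings = []
    for rule, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({"rule": rule, "url": url, "match": match.group(0)[:80]})
    return findings
```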
Requires Python 3.7+ and the following packages:

- `aiohttp` (for asynchronous HTTP requests)
- `beautifulsoup4` (for HTML parsing)
- Clone or download this repository.
- Install the required Python packages: `pip install -r requirements.txt`
Run the script from the command line, providing the target URL as the main argument.
```
python main.py <target_url> [options]
```

- `<target_url>`: The starting URL of the website to scan (e.g., `http://example.com`).
- `--max-depth DEPTH`: Maximum directory depth to crawl (default: 10).
- `--max-urls COUNT`: Maximum number of URLs to fetch and scan (default: 5000).
- `--concurrency WORKERS`: Number of concurrent worker tasks for fetching (default: 20).
- `--output FILENAME`: Name of the JSON output report file (default: `secrets_report.json`).
- `-v`, `--verbose`: Enable verbose logging (shows DEBUG level messages).
- `-h`, `--help`: Show help message and exit.
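These flags map naturally onto a small `argparse` setup. The sketch below shows one plausible wiring, using only the flags and defaults listed above; the function name and help strings are illustrative and may differ from what `main.py` actually does.

```python
import argparse

def parse_args() -> argparse.Namespace:
    """Build a CLI matching the options documented above (defaults mirror the documented ones)."""
    parser = argparse.ArgumentParser(
        description="Crawl a website's directory listings and scan files for exposed secrets."
    )
    parser.add_argument("target_url", help="Starting URL of the website to scan, e.g. http://example.com")
    parser.add_argument("--max-depth", type=int, default=10, help="Maximum directory depth to crawl")
    parser.add_argument("--max-urls", type=int, default=5000, help="Maximum number of URLs to fetch and scan")
    parser.add_argument("--concurrency", type=int, default=20, help="Number of concurrent worker tasks")
    parser.add_argument("--output", default="secrets_report.json", help="Name of the JSON output report file")
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable DEBUG-level logging")
    return parser.parse_args()
```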
For example, to scan http://test.local with a maximum depth of 5, up to 1000 URLs, using 10 concurrent workers, saving the report as `my_report.json`, with verbose output enabled:

```
python main.py http://test.local --max-depth 5 --max-urls 1000 --concurrency 10 --output my_report.json --verbose
```

- Ignored Paths & Files: The lists `IGNORED_PATHS` and `IGNORED_FILE_PATTERNS` in `config/settings.py` control which directories and file types are skipped during the crawl. You can modify these lists to customize the spider's behavior (a sketch of their shape follows below).
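As a rough idea of what to edit, the two lists might be shaped like this; the entries and pattern syntax shown here are assumptions based on the ignore behavior described above, so check `config/settings.py` for the shipped defaults.

```python
# config/settings.py (excerpt) -- illustrative entries, not the shipped defaults
IGNORED_PATHS = [
    "node_modules",  # dependency folders are never worth crawling
    ".git",          # version-control internals
]

IGNORED_FILE_PATTERNS = [
    r"\.(?:png|jpe?g|gif|svg)$",  # images
    r"\.(?:mp4|avi|mkv|mov)$",    # videos
    r"\.(?:zip|tar|gz|rar|7z)$",  # archives
]
```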
