A production-grade system that scans virtual machine (VM) storage, extracts detailed file metadata, computes hashes, stores Parquet backups, loads the metadata into MySQL, and visualizes insights with Apache Superset.
This project is built using:
- PySpark for distributed metadata scanning
- MySQL for centralized storage
- Superset for analytics dashboards
- Bash for automation
Ideal for:
- Storage monitoring
- Duplicate file detection
- Capacity planning
- Infrastructure audits
- Enterprise server monitoring
- DevOps & Linux engineering
```
VM-File-Monitoring-Setup/
│
├── storage_monitor/
│   ├── config.yaml           # Scanner configuration
│   ├── requirements.txt      # Python dependencies
│   ├── run_scan.sh           # Shell wrapper
│   ├── scan_vm_to_mysql.py   # MySQL ingestion pipeline
│   └── scripts/
│       ├── helpers.py        # Hashing utilities
│       └── scan.py           # PySpark metadata scanner
│
├── superset/
│   └── superset_config.py    # Superset backend configuration
│
├── .gitignore
└── README.md
```

- Parallel directory traversal
- Extracts metadata (size, owner, permissions, extension, timestamps, etc.)
- Depth-based directory chunking for faster performance
- Efficient columnar storage
- Can be consumed by Spark, Pandas, Athena, and other analytics engines
- Stores normalized metadata
- Ideal for BI dashboards and analytics
- Scales to millions of file entries
- Visualize storage usage
- Filter by owner, extension, directory, or scan date
- Identify large files and duplicate candidates
- Customize paths, hashing thresholds, log directories, and workers via config.yaml
- Runs on any Linux VM
- Supports automation via cron
- Resilient and scalable architecture
- Python 3.10+
- Apache Spark 4.x
- MySQL server
- Apache Superset
- Linux environment
```bash
git clone https://github.com/Durvesh123/StorageMetaScan.git
cd StorageMetaScan/storage_monitor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Edit config.yaml:

```yaml
root_paths:
  - /path/to/scan
max_workers: 4
output_dir: ./outputs/metadata
```

Run the scan:

```bash
./run_scan.sh
```

This will:
- Scan directories
- Collect metadata
- Hash files
- Save Parquet output
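Since run_scan.sh is a plain shell wrapper, recurring scans can be scheduled with cron. A sketch of a crontab entry (the install path and log file location are illustrative, not part of the project):

```
# Run the scan every night at 02:00 and append output to a log
0 2 * * * cd /opt/StorageMetaScan/storage_monitor && ./run_scan.sh >> /var/log/storage_scan.log 2>&1
```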
Set environment variables before running:

```bash
export MYSQL_USER="root"
export MYSQL_PASS="yourpassword"
```

Run ingestion:

```bash
python3 scan_vm_to_mysql.py
```

Place your Superset configuration at:

```
superset/superset_config.py
```

Start Superset:

```bash
superset run -p 8088 --with-threads --reload --debugger
```

Then:
- Connect to MySQL
- Import the metadata table
- Build dashboards
The scanner:
- Reads directories in parallel
- Performs lightweight metadata extraction
- Avoids heavy operations until necessary
- Writes output to Parquet
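A minimal single-machine sketch of that loop (the real scanner distributes the work with PySpark; the function names and record fields here are illustrative, not the project's actual schema):

```python
import os
import stat
from concurrent.futures import ThreadPoolExecutor

def file_record(path: str) -> dict:
    """Lightweight metadata only -- no hashing in the first pass."""
    st = os.stat(path, follow_symlinks=False)
    return {
        "path": path,
        "size": st.st_size,
        "owner_uid": st.st_uid,
        "mode": stat.filemode(st.st_mode),       # e.g. "-rw-r--r--"
        "extension": os.path.splitext(path)[1].lower(),
        "mtime": st.st_mtime,
    }

def scan(root: str, max_workers: int = 4):
    """Walk the tree, then stat regular files in parallel."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _, files in os.walk(root)
        for name in files
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(file_record, paths))
```

Deferring hashing to a later pass keeps this stage I/O-light: a `stat` call per file is cheap, while reading file contents is not.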
| Step | Action |
|---|---|
| 1 | Compare file size |
| 2 | Compute partial hash |
| 3 | Group by (size, partial hash) |
| 4 | For duplicates → compute full hash |
| 5 | Confirm true duplicates |
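The staged approach above can be sketched in plain Python (`partial_hash` and `full_hash` are illustrative names, not necessarily those used in helpers.py):

```python
import hashlib
import os
from collections import defaultdict

PARTIAL_BYTES = 64 * 1024  # hash only the first 64 KiB in the cheap pass

def partial_hash(path: str) -> str:
    """Hash just the head of the file -- a cheap duplicate-candidate filter."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(PARTIAL_BYTES))
    return h.hexdigest()

def full_hash(path: str) -> str:
    """Hash the entire file in chunks -- run only on surviving candidates."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    # Steps 1-3: group by (size, partial hash)
    groups = defaultdict(list)
    for p in paths:
        groups[(os.path.getsize(p), partial_hash(p))].append(p)
    # Steps 4-5: full hash only within groups holding more than one file
    confirmed = defaultdict(list)
    for candidates in groups.values():
        if len(candidates) > 1:
            for p in candidates:
                confirmed[full_hash(p)].append(p)
    return [ps for ps in confirmed.values() if len(ps) > 1]
```

Most files are eliminated by the size comparison alone, so the expensive full-content hash touches only a small fraction of the tree.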
The script scan_vm_to_mysql.py loads metadata into a MySQL table for analytical queries.
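The kind of query this enables can be sketched portably. The snippet below uses SQLite so it runs anywhere; in the real deployment the same SQL shape would run against the MySQL table (the table and column names here are assumptions, not the script's actual schema):

```python
import sqlite3

# Illustrative schema -- the real table is created by scan_vm_to_mysql.py
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_metadata (path TEXT, owner TEXT, extension TEXT, size_bytes INTEGER)"
)
conn.executemany(
    "INSERT INTO file_metadata VALUES (?, ?, ?, ?)",
    [
        ("/var/log/app.log", "root", ".log", 5_000_000),
        ("/home/dev/data.csv", "dev", ".csv", 120_000_000),
        ("/home/dev/model.bin", "dev", ".bin", 900_000_000),
    ],
)

# Top largest files -- the same query shape can back a Superset chart
rows = conn.execute(
    "SELECT path, size_bytes FROM file_metadata ORDER BY size_bytes DESC LIMIT 2"
).fetchall()
for path, size in rows:
    print(path, size)
```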
Visualizes:
- Top largest files
- File counts by extension
- Storage usage by owner
- Daily scan history
- Duplicate candidates
- Real-time scanning via inotify
- S3 / GCS / Azure Blob support
- REST API using FastAPI
- Docker Compose environment
- Alerting system (Slack, Teams, Email)
- RBAC for multi-user access
- Enterprise storage monitoring
- DevOps infrastructure audits
- Duplicate file cleanup
- Security reviews
- Cloud migration prep
- Capacity forecasting