
🚀 Storage Monitoring & Metadata Pipeline (PySpark + MySQL + Superset)


A production-grade system that scans virtual machine (VM) storage, extracts detailed file metadata, computes file hashes, writes Parquet backups, loads the metadata into MySQL, and visualizes the results in Apache Superset.

This project is built using:

  • PySpark for distributed metadata scanning
  • MySQL for centralized storage
  • Superset for analytics dashboards
  • Bash for automation

Ideal for:

  • Storage monitoring
  • Duplicate file detection
  • Capacity planning
  • Infrastructure audits
  • Enterprise server monitoring
  • DevOps & Linux engineering

📁 Project Structure

VM-File-Monitoring-Setup/
│
├── storage_monitor/
│   ├── config.yaml           # Scanner configuration
│   ├── requirements.txt      # Python dependencies
│   ├── run_scan.sh           # Shell wrapper
│   ├── scan_vm_to_mysql.py   # MySQL ingestion pipeline
│   └── scripts/
│       ├── helpers.py        # Hashing utilities
│       └── scan.py           # PySpark metadata scanner
│
├── superset/
│   └── superset_config.py    # Superset backend configuration
│
├── .gitignore
└── README.md

✨ Features

1. Distributed File Scanning (PySpark)

  • Parallel directory traversal
  • Extracts metadata (size, owner, permissions, extension, timestamps, etc.)
  • Depth-based directory chunking to speed up scans of large trees

2. Parquet Backup Storage

  • Efficient columnar storage
  • Can be consumed by Spark, Pandas, Athena, and other analytics engines
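
For example, the Parquet backup can be inspected directly with pandas. This is a minimal sketch, assuming the default output_dir from config.yaml shown later; the column name size_bytes is illustrative and may differ from the scanner's actual schema:

import pandas as pd

# Read the Parquet backup produced by the scanner
# (path assumes the default output_dir from config.yaml).
df = pd.read_parquet("./outputs/metadata")

# Quick look at the largest files recorded in the backup.
# Column names are illustrative; check the real scanner schema.
print(df.sort_values("size_bytes", ascending=False).head(10))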

3. MySQL Metadata Store

  • Stores normalized metadata
  • Ideal for BI dashboards and analytics
  • Scales to millions of file entries

4. Superset Dashboard

  • Visualize storage usage
  • Filter by owner, extension, directory, or scan date
  • Identify large files and duplicate candidates

5. Configurable & Extensible

  • Customize paths, hashing thresholds, log directories, and workers via config.yaml

6. Production Ready

  • Runs on any Linux VM
  • Supports automation via cron
  • Resilient and scalable architecture
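
For cron-based automation, an entry like the following runs the scanner every night at 02:00. This is only a sketch; the install path and log location are assumptions, so adjust them to wherever the repository is cloned:

0 2 * * * /opt/StorageMetaScan/storage_monitor/run_scan.sh >> /var/log/storage_scan.log 2>&1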

🛠 Prerequisites

  • Python 3.10+
  • Apache Spark 4.x
  • MySQL server
  • Apache Superset
  • Linux environment

🚀 Getting Started

1. Clone the Repository

git clone https://github.com/Durvesh123/StorageMetaScan.git
cd StorageMetaScan/storage_monitor

2. Create Virtual Environment

python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure the Scanner

Edit config.yaml:

root_paths:
  - /path/to/scan
max_workers: 4
output_dir: ./outputs/metadata
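
A minimal sketch of how such a file can be read in Python with PyYAML (the exact keys and defaults consumed by scan.py may differ):

import yaml

# Load scanner settings; keys mirror the example config.yaml above.
with open("config.yaml") as fh:
    cfg = yaml.safe_load(fh)

root_paths = cfg["root_paths"]                          # directories to scan
max_workers = cfg.get("max_workers", 4)                 # parallel workers
output_dir = cfg.get("output_dir", "./outputs/metadata")  # Parquet destination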

5. Run the Metadata Scanner

./run_scan.sh

This will:

  • Scan directories
  • Collect metadata
  • Hash files
  • Save Parquet output

6. Load Metadata Into MySQL

Set environment variables before running:

export MYSQL_USER="root"
export MYSQL_PASS="yourpassword"

Run ingestion:

python3 scan_vm_to_mysql.py
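
Conceptually, the ingestion script reads the scan output and writes it to MySQL using the credentials from the environment. The sketch below uses pandas with SQLAlchemy and the pymysql driver; the database and table names are illustrative, and the real script may use Spark's JDBC writer instead:

import os
import pandas as pd
from sqlalchemy import create_engine

user = os.environ["MYSQL_USER"]
password = os.environ["MYSQL_PASS"]

# Illustrative database/table names; adjust to the actual schema.
engine = create_engine(
    f"mysql+pymysql://{user}:{password}@localhost:3306/storage_metadata"
)

df = pd.read_parquet("./outputs/metadata")
df.to_sql("file_metadata", engine, if_exists="append", index=False)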

7. Visualize in Apache Superset

Place your Superset configuration:

superset/superset_config.py

Start Superset:

superset run -p 8088 --with-threads --reload --debugger

Then:

  • Connect to MySQL
  • Import the metadata table
  • Build dashboards
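
When adding the database connection in Superset, use a SQLAlchemy URI of this form (credentials and database name are illustrative and should match your MySQL setup):

mysql+pymysql://root:yourpassword@localhost:3306/storage_metadata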

🧠 How It Works (Architecture Overview)

1. Scanner Phase (PySpark)

The scanner:

  • Reads directories in parallel
  • Performs lightweight metadata extraction
  • Avoids heavy operations until necessary
  • Writes output to Parquet, as sketched below
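
A simplified sketch of this phase follows. It is not the repository's exact scan.py (which adds depth-based chunking, configuration handling, and error handling); the root path and output directory mirror the config.yaml example:

import os
from pathlib import Path

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("metadata-scan").getOrCreate()

def file_metadata(path: str) -> Row:
    # Lightweight metadata only; hashing happens later and only where needed.
    st = os.stat(path, follow_symlinks=False)
    return Row(
        path=path,
        size_bytes=st.st_size,
        owner_uid=st.st_uid,
        mtime=st.st_mtime,
        extension=Path(path).suffix.lower(),
    )

root = "/path/to/scan"  # mirrors root_paths in config.yaml
top_dirs = [str(p) for p in Path(root).iterdir() if p.is_dir()]

# Distribute top-level directories across Spark workers, walk each in parallel,
# and write the collected metadata as a Parquet dataset.
rdd = (
    spark.sparkContext.parallelize(top_dirs, max(len(top_dirs), 1))
    .flatMap(lambda d: (os.path.join(r, f) for r, _, files in os.walk(d) for f in files))
    .map(file_metadata)
)
spark.createDataFrame(rdd).write.mode("overwrite").parquet("./outputs/metadata")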

2. Duplicate Detection Logic

| Step | Action |
|------|--------|
| 1 | Compare file sizes |
| 2 | Compute a partial hash |
| 3 | Group by (size, partial hash) |
| 4 | For duplicate candidates, compute the full hash |
| 5 | Confirm true duplicates |
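
In plain Python, the same logic looks roughly like this. It is a sketch of the approach rather than the repository's implementation, and the 64 KiB partial-hash length is an assumption:

import hashlib
import os
from collections import defaultdict

PARTIAL_BYTES = 64 * 1024  # hash only the first 64 KiB in the cheap pass

def partial_hash(path: str) -> str:
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read(PARTIAL_BYTES)).hexdigest()

def full_hash(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(paths):
    # Step 1: group by size; a unique size can never be a duplicate.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    # Steps 2-3: within same-size groups, group by partial hash.
    candidates = defaultdict(list)
    for size, same_size in by_size.items():
        if len(same_size) > 1:
            for p in same_size:
                candidates[(size, partial_hash(p))].append(p)
    # Steps 4-5: confirm duplicates with a full hash, only inside candidate groups.
    confirmed = defaultdict(list)
    for group in candidates.values():
        if len(group) > 1:
            for p in group:
                confirmed[full_hash(p)].append(p)
    return [dup for dup in confirmed.values() if len(dup) > 1]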

3. MySQL Ingestion

The script scan_vm_to_mysql.py loads metadata into a MySQL table for analytical queries.

  • Suitable for BI
  • Efficient indexing possible
  • Supports Superset dashboards

4. Superset Dashboard

Visualizes:

  • Top largest files
  • File counts by extension
  • Storage usage by owner
  • Daily scan history
  • Duplicate candidates

Dashboard preview: Superset dashboard screenshot (image not included here).

🧩 Future Enhancements (Roadmap)

  • Real-time scanning via inotify
  • S3 / GCS / Azure Blob support
  • REST API using FastAPI
  • Docker Compose environment
  • Alerting system (Slack, Teams, Email)
  • RBAC for multi-user access

📦 Use Cases

  • Enterprise storage monitoring
  • DevOps infra audits
  • Duplicate file cleanup
  • Security review
  • Cloud migration prep
  • Capacity forecasting
