πŸ” SneakDex

A Modern, Distributed Search Engine Built for Scale

License: MIT · Go · Rust · Python · Next.js · Docker · Redis · Kafka · Qdrant


🌟 Overview

SneakDex is a high-performance, enterprise-grade distributed search engine designed for modern web-scale content discovery and analysis. Built with a microservices architecture, it efficiently crawls, processes, indexes, and serves web content with exceptional speed and reliability.

✨ Key Features

🚀 High Performance - Built with Go, Rust, Python, and Next.js for optimal speed and resource efficiency
🌐 Distributed Architecture - Microservices design for horizontal scalability
🔄 Real-time Processing - Kafka-based streaming for instant content updates
🧠 Semantic Search - Advanced vector embeddings with Sentence Transformers
🔍 Hybrid Search - Combines vector similarity (75%) with full-text search (25%)
🖼️ Text-to-Image Search - Pure vector search for image discovery using semantic embeddings
📊 Enterprise Monitoring - Comprehensive observability with Prometheus & Grafana
🛡️ Production Ready - Battle-tested with robust error handling and security
⚡ Cloud Native - Container-first design with Docker & Kubernetes support


πŸ—οΈ Architecture

Mermaid source:

%%{init: {
  'themeVariables': {
    'fontSize': '15px',
    'primaryColor': '#e6f2ff',
    'secondaryColor': '#f5e6ff',
    'tertiaryColor': '#fffae6'
  },
  'flowchart': {
    'htmlLabels': false,
    'curve': 'basis',
    'defaultRenderer': 'elk'
  }
}}%%

flowchart TB
  %% --- Subgraphs ---
  subgraph APP["`**App/Web Layer**`"]
    WEB["`🎨 Next.js Frontend`"]
    API["`🔧 Next.js Search API`"]
  end

  subgraph DATA["`**Data Layer**`"]
    QDRANT["`Qdrant<br/>(Vector DB)`"]
    SUPABASE["`Supabase/<br/>Postgres`"]
    REDIS_EXPORTER_LOCAL["`Redis<br/>Metrics Exporter`"]
    REDIS_LOCAL["`Redis Cache<br/>Local`"]
    REDIS["`Redis Cache<br/>Hosted`"]
  end

  subgraph CORE["`**SneakDex Core Services**`"]
    PARSER["`📄 Parser<br/>Service`"]
    INDEXER["`🗃️ Indexer<br/>Service`"]
    CRAWLER["`🕷️ Crawler<br/>Service`"]
  end

  subgraph QUEUE["`**Message Queue**`"]
    KAFKA_EXPORTER["`Kafka<br/>Metrics Exporter`"]
    KAFKA["`Apache Kafka`"]
  end

  subgraph PIPELINE["`**ML Pipeline**`"]
    EMBEDDINGS_LOCAL["`🤖 MiniLM-L12-v2<br/>(Local)`"]
    EMBEDDINGS_SERVER["`🤖 MiniLM-L12-v2`"]
    HUGGINGFACE["`🤗 HuggingFace<br/>API`"]
  end

  subgraph MON["`**Monitoring**`"]
    PROM["`Prometheus`"]
    GRAF["`Grafana`"]
  end

  %% --- Flows ---
  WEB -- "API<br/>Request" --> API
  API ==>|"Cache" | REDIS
  API ==>|"Vector Search" | QDRANT
  API ==>|"User Data" | SUPABASE
  API ==>|"Embedding" | EMBEDDINGS_SERVER
  EMBEDDINGS_SERVER -.->|Fallback| HUGGINGFACE

  CRAWLER -- "Job<br/>Schedule" --> REDIS_LOCAL
  CRAWLER ==> KAFKA
  REDIS_LOCAL --> REDIS_EXPORTER_LOCAL
  KAFKA ==> PARSER
  KAFKA --> KAFKA_EXPORTER
  PARSER -.-> KAFKA
  KAFKA ==> INDEXER
  INDEXER ==>|"Vectors" | QDRANT
  INDEXER ==>|"Metadata" | SUPABASE
  INDEXER ==>|"Local Embed" | EMBEDDINGS_LOCAL

  CRAWLER -.-> PROM
  PARSER -.-> PROM
  INDEXER -.-> PROM
  KAFKA_EXPORTER -.-> PROM
  REDIS_EXPORTER_LOCAL -.-> PROM
  PROM --> GRAF

  %% --- Styles for improved visibility ---
  classDef web fill:#d6f5d6,stroke:#333,stroke-width:2px;
  classDef data fill:#ffebcc,stroke:#795548,stroke-width:2px;
  classDef core fill:#f0d9ff,stroke:#9c27b0,stroke-width:2.5px;
  classDef queue fill:#ffe6e6,stroke:#c62828,stroke-width:2px;
  classDef pipe fill:#e6ecff,stroke:#1565c0,stroke-width:2px;
  classDef mon fill:#e1f5fe,stroke:#0277bd,stroke-width:2px;
  classDef api fill:#b3e5fc,stroke:#0097a7,stroke-width:3px;

  class WEB,API web;
  class QDRANT,SUPABASE,REDIS,REDIS_LOCAL,REDIS_EXPORTER_LOCAL data;
  class PARSER,INDEXER,CRAWLER core;
  class KAFKA,KAFKA_EXPORTER queue;
  class EMBEDDINGS_LOCAL,EMBEDDINGS_SERVER,HUGGINGFACE pipe;
  class PROM,GRAF mon;

  %% API is special
  class API api;

🧩 Services

πŸ•·οΈ Crawler Service

Go to Crawler README

High-performance distributed web crawler

  • Technology: Go + Colly framework
  • Queue Management: Redis-based distributed URL queue
  • Content Delivery: Real-time streaming via Kafka
  • Features: Concurrent crawling, URL deduplication, rate limiting, robots.txt compliance
  • Security: IP filtering, domain validation, content size limits
  • Monitoring: Prometheus metrics, structured logging, health checks

Key Metrics:

  • Processes thousands of pages per minute
  • Intelligent URL validation prevents malicious access
  • Graceful error handling with exponential backoff
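The crawler itself is written in Go, but its retry behavior can be sketched in a few lines. The following is an illustrative Python sketch of exponential backoff with jitter, not the crawler's actual implementation:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=0.5):
    """Retry a fetch callable with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt; random jitter spreads out retries
            # so many workers do not hammer a host in lockstep.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

The same pattern applies wherever the services talk to Redis, Kafka, or remote hosts.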

📄 Parser Service

Go to Parser README

High-performance HTML content extraction and processing

  • Technology: Rust for memory safety and blazing speed
  • Content Processing: HTML parsing, text extraction, metadata analysis
  • Language Detection: Automatic language identification using whatlang
  • Text Cleaning: Normalizes whitespace, removes noise, extracts readable content
  • Features: Title/description extraction, heading detection, link analysis, image cataloging
  • Validation: Content size limits, quality filtering, robust error handling

Key Outputs:

  • Structured JSON with cleaned text and metadata
  • Language detection and word count analysis
  • Hierarchical heading extraction (H1-H6)
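To illustrate the hierarchical heading extraction (the parser service does this in Rust; this is a simplified Python sketch using the standard library, not the service's code):

```python
from html.parser import HTMLParser

class HeadingExtractor(HTMLParser):
    """Collect H1-H6 headings with normalized whitespace."""
    def __init__(self):
        super().__init__()
        self.headings = []   # list of (level, text) pairs
        self._level = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level, self._buf = int(tag[1]), []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Collapse runs of whitespace, mirroring the text cleaning step.
            text = " ".join("".join(self._buf).split())
            self.headings.append((self._level, text))
            self._level = None

p = HeadingExtractor()
p.feed("<h1>Title</h1><p>body</p><h2> Sub  heading </h2>")
```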

πŸ—ƒοΈ Indexer Service

Go to Indexer README

Scalable semantic and sparse indexing with vector embeddings

  • Technology: Python + Sentence Transformers for AI-powered semantic understanding
  • Vector Database: Qdrant for high-performance vector similarity search
  • Sparse Indexing: Supabase/PostgreSQL with full-text search capabilities (tsvector)
  • Semantic Processing: Dense vector embeddings for documents and images
  • Batch Processing: Configurable batch sizes for optimal throughput and resource utilization
  • Multi-Modal Support: Processes both text content and associated images with captions

Key Features:

  • Dual Indexing Strategy: Vector embeddings in Qdrant + metadata in PostgreSQL
  • Language-Aware: Stores language metadata for multilingual search optimization
  • Content Snippets: Generates searchable text previews for result display
  • Fault Tolerance: Skips malformed messages, continues processing with comprehensive error logging
  • Real-time Monitoring: Tracks vector count, batch success rates, and processing throughput

Performance Metrics:

  • Processes 50 (configurable) documents per batch
  • Concurrent embedding generation for faster indexing
  • Automatic content size limits prevent resource exhaustion
  • Horizontal scaling support for enterprise workloads
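The batching step can be sketched as follows. This is a minimal illustration of the configurable batch-size idea; the real indexer additionally embeds each batch and upserts vectors to Qdrant and metadata to Postgres:

```python
from typing import Iterable, Iterator, List

def batched(docs: Iterable[dict], batch_size: int = 50) -> Iterator[List[dict]]:
    """Yield fixed-size batches of documents; the last batch may be smaller."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each yielded batch corresponds to one embedding call and one upsert, which is where the throughput/resource trade-off of the batch size shows up.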

🚀 App Service

Go to App README

Full-stack search interface with hybrid search capabilities

  • Technology: Next.js ≥ 15.4.1 with a React frontend and API-route backend
  • Search Engine: Hybrid search combining vector similarity and full-text search
  • Caching Strategy: Redis/Upstash distributed caching with intelligent TTL management
  • ML Integration: MiniLM-L12-v2 embeddings with HuggingFace API fallback
  • Multi-Modal Search: Traditional web search and text-to-image semantic search
  • Performance: Sub-second response times with intelligent result caching

Key Features:

  • Hybrid Search Architecture: Vector search (75% weight) + PostgreSQL full-text search (25% weight), plus a domain-match boost that scales with query and domain length
  • Intelligent Result Fusion: Advanced scoring algorithms merge results from multiple sources
  • Text-to-Image Search: Pure vector search for image discovery using semantic embeddings
  • Robust Fallbacks: Vector → Payload fallback + PostgreSQL chain ensures high availability
  • Smart Caching: Multi-layered caching with in-memory embeddings and Redis persistence
  • Real-time Interface: Responsive Next.js frontend with mobile optimization

Search Capabilities:

  • Semantic Understanding: 384-dimensional vectors with cosine similarity
  • Result Ranking: Sophisticated scoring combining relevance and freshness
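The core of the scoring above can be sketched directly. This simplified Python sketch shows cosine similarity over dense vectors and the 75/25 weighted fusion; the production ranker also folds in the domain-match boost and freshness, which are omitted here:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(vector_score, fulltext_score, w_vector=0.75, w_text=0.25):
    """Weighted fusion of vector similarity and full-text relevance."""
    return w_vector * vector_score + w_text * fulltext_score
```

In production the vectors are 384-dimensional MiniLM embeddings and the cosine computation happens inside Qdrant.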

API Endpoints:

  • POST /api/search: Hybrid web search with configurable parameters
  • POST /api/search-images: Text-to-image semantic search
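A client call might look like the sketch below. The request body shape is an assumption (a JSON object with a `q` field); consult the App README for the actual parameters:

```python
import json
from urllib import request

def build_search_request(base_url: str, query: str) -> request.Request:
    """Build a POST /api/search request; the `q` field name is assumed."""
    body = json.dumps({"q": query}).encode("utf-8")
    return request.Request(
        url=f"{base_url}/api/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# request.urlopen(req) would send it to a running instance.
req = build_search_request("http://localhost:3000", "distributed search")
```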

Performance Metrics:

  • Handles millions of documents with sub-second search times
  • Concurrent user support with horizontal scaling
  • Intelligent caching reduces database load by 80%+
  • 99.9% uptime with comprehensive fallback mechanisms

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose 📦
  • Go ≥ 1.24 (for development)
  • Rust ≥ 1.82 (for development)
  • Python ≥ 3.12 (for development)
  • Next.js ≥ 15.4.1 (for development)
  • Redis ≥ 7.0
  • Apache Kafka ≥ 4.0.0
  • Qdrant ≥ 1.0.0
  • Supabase/PostgreSQL ≥ 2.0.0

🐳 Docker Deployment

# Clone the repository
git clone https://github.com/Sneakyhydra/SneakDex.git
cd SneakDex

# List of commands
make help

# Start all services
make up

# Start all services (PROD)
make up ENV=prod

# Start a service
make up SERVICE=crawler

# Scale a service
make up SERVICE=parser SCALE="parser=3"

# View logs
make logs

βš™οΈ Configuration

All services are configured via environment variables for container-friendly deployment.
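For example, the indexer-style services read configuration at startup roughly like this (the variable names below are hypothetical; the real keys are documented per service):

```python
import os

# Hypothetical configuration keys for illustration only; see each
# service's README for the actual environment variable names.
KAFKA_BROKERS = os.getenv("KAFKA_BROKERS", "localhost:9092")
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "50"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "info")
```

Because every setting has a default and comes from the environment, the same image runs unchanged across dev, Docker Compose, and Kubernetes.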

Go to Crawler Configuration

Go to Parser Configuration

Go to Indexer Configuration

Go to App Configuration

Grafana Dashboard

  • Real-time metrics for all services

πŸ› οΈ Development

Local Development Setup

# Run crawler service
cd services/crawler
go mod download
go run cmd/crawler/main.go

# Run parser service  
cd services/parser
cargo run

# Run indexer service
cd services/indexer
python -m src.main

# Run app service
cd services/app
npm install
npm run dev

# With development config
export GO_ENV=development
export LOG_LEVEL=debug
export NODE_ENV=development

Project Structure

sneakdex/
├── services/
│   ├── crawler/                # Web crawling service (Go)
│   ├── parser/                 # Content parsing service (Rust)
│   ├── indexer/                # Search indexing service (Python)
│   └── app/                    # Search interface service (Next.js)
│       ├── app/                # Next.js pages and API routes
│       │   ├── _components/    # React components
│       │   ├── _contexts/      # Contexts for state management
│       │   ├── _types/         # TypeScript types
│       │   └── api/            # API routes
│       └── public/             # Static assets
│
├── docker-compose.yml
├── monitoring/
├── Architecture.png
├── Makefile
└── README.md

📈 Performance

  • Crawling Speed: 1000+ pages/minute per instance
  • Parsing Throughput: High-speed Rust processing with memory safety
  • Indexing Rate: 50 (configurable) documents per batch with semantic embeddings
  • Search Latency: Sub-second response times with hybrid search
  • Vector Search: Sub-millisecond similarity search via Qdrant
  • Cache Performance: 80%+ hit rate reduces database load significantly
  • Concurrent Processing: Parallel connections per service
  • Memory Efficient: Multi-level caching and batch processing reduce resource usage
  • Horizontal Scaling: Add instances to increase throughput linearly
  • Fault Tolerant: Auto-retry with exponential backoff across all services
  • ML Performance: Local embeddings with HuggingFace fallback for high availability

🔒 Security

Built-in Security Features

  • ✅ Private IP address filtering (RFC 1918)
  • ✅ Domain whitelist/blacklist support
  • ✅ Content size limits (prevents DoS)
  • ✅ Request timeout protection
  • ✅ User-Agent transparency
  • ✅ Container security best practices
  • ✅ Environment-based secrets management
  • ✅ Payload sanitization and validation
  • ✅ API key authentication for external services
  • ✅ Input validation and query sanitization
  • ✅ Rate limiting and abuse protection
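The private-address filtering can be sketched with the standard library. This is an illustrative Python version of the RFC 1918 check (the crawler implements it in Go):

```python
import ipaddress

def is_private_target(host_ip: str) -> bool:
    """Return True for crawl targets resolving to RFC 1918 or other
    non-routable addresses, which the crawler refuses to fetch."""
    addr = ipaddress.ip_address(host_ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

The check runs after DNS resolution, so a public hostname pointing at an internal address is still rejected.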

Network Security

  • Minimal attack surface with health-check-only inbound ports
  • Outbound filtering for HTTP/HTTPS only
  • Internal service mesh for secure communication
  • Encrypted connections to Qdrant and Supabase
  • Secure API endpoints with comprehensive validation

📄 License

MIT License - feel free to use, modify, and contribute to this project.


Built with ❤️ for the open web

⭐ Star this on GitHub
