A Modern, Distributed Search Engine Built for Scale
SneakDex is a high-performance, enterprise-grade distributed search engine designed for modern web-scale content discovery and analysis. Built with a microservices architecture, it efficiently crawls, processes, indexes, and serves web content with exceptional speed and reliability.
- High Performance - Built with Go, Rust, Python, and Next.js for optimal speed and resource efficiency
- Distributed Architecture - Microservices design for horizontal scalability
- Real-time Processing - Kafka-based streaming for instant content updates
- Semantic Search - Advanced vector embeddings with Sentence Transformers
- Hybrid Search - Combines vector similarity (75%) with full-text search (25%)
- Text-to-Image Search - Pure vector search for image discovery using semantic embeddings
- Enterprise Monitoring - Comprehensive observability with Prometheus & Grafana
- Production Ready - Battle-tested with robust error handling and security
- Cloud Native - Container-first design with Docker & Kubernetes support
Architecture diagram (Mermaid):
%%{init: {
'themeVariables': {
'fontSize': '15px',
'primaryColor': '#e6f2ff',
'secondaryColor': '#f5e6ff',
'tertiaryColor': '#fffae6'
},
'flowchart': {
'htmlLabels': false,
'curve': 'basis',
'defaultRenderer': 'elk'
}
}}%%
flowchart TB
%% --- Subgraphs ---
subgraph APP["`**App/Web Layer**`"]
WEB["`Next.js Frontend`"]
API["`Next.js Search API`"]
end
subgraph DATA["`**Data Layer**`"]
QDRANT["`Qdrant<br/>(Vector DB)`"]
SUPABASE["`Supabase/<br/>Postgres`"]
REDIS_EXPORTER_LOCAL["`Redis<br/>Metrics Exporter`"]
REDIS_LOCAL["`Redis Cache<br/>Local`"]
REDIS["`Redis Cache<br/>Hosted`"]
end
subgraph CORE["`**SneakDex Core Services**`"]
PARSER["`Parser<br/>Service`"]
INDEXER["`Indexer<br/>Service`"]
CRAWLER["`Crawler<br/>Service`"]
end
subgraph QUEUE["`**Message Queue**`"]
KAFKA_EXPORTER["`Kafka<br/>Metrics Exporter`"]
KAFKA["`Apache Kafka`"]
end
subgraph PIPELINE["`**ML Pipeline**`"]
EMBEDDINGS_LOCAL["`MiniLM-L12-v2<br/>(Local)`"]
EMBEDDINGS_SERVER["`MiniLM-L12-v2`"]
HUGGINGFACE["`HuggingFace<br/>API`"]
end
subgraph MON["`**Monitoring**`"]
PROM["`Prometheus`"]
GRAF["`Grafana`"]
end
%% --- Flows ---
WEB -- "API<br/>Request" --> API
API ==>|"Cache" | REDIS
API ==>|"Vector Search" | QDRANT
API ==>|"User Data" | SUPABASE
API ==>|"Embedding" | EMBEDDINGS_SERVER
EMBEDDINGS_SERVER -.->|Fallback| HUGGINGFACE
CRAWLER -- "Job<br/>Schedule" --> REDIS_LOCAL
CRAWLER ==> KAFKA
REDIS_LOCAL --> REDIS_EXPORTER_LOCAL
KAFKA ==> PARSER
KAFKA --> KAFKA_EXPORTER
PARSER -.-> KAFKA
KAFKA ==> INDEXER
INDEXER ==>|"Vectors" | QDRANT
INDEXER ==>|"Metadata" | SUPABASE
INDEXER ==>|"Local Embed" | EMBEDDINGS_LOCAL
CRAWLER -.-> PROM
PARSER -.-> PROM
INDEXER -.-> PROM
KAFKA_EXPORTER -.-> PROM
REDIS_EXPORTER_LOCAL -.-> PROM
PROM --> GRAF
%% --- Styles for improved visibility ---
classDef web fill:#d6f5d6,stroke:#333,stroke-width:2px;
classDef data fill:#ffebcc,stroke:#795548,stroke-width:2px;
classDef core fill:#f0d9ff,stroke:#9c27b0,stroke-width:2.5px;
classDef queue fill:#ffe6e6,stroke:#c62828,stroke-width:2px;
classDef pipe fill:#e6ecff,stroke:#1565c0,stroke-width:2px;
classDef mon fill:#e1f5fe,stroke:#0277bd,stroke-width:2px;
classDef api fill:#b3e5fc,stroke:#0097a7,stroke-width:3px;
class WEB,API web;
class QDRANT,SUPABASE,REDIS,REDIS_LOCAL,REDIS_EXPORTER_LOCAL data;
class PARSER,INDEXER,CRAWLER core;
class KAFKA,KAFKA_EXPORTER queue;
class EMBEDDINGS_LOCAL,EMBEDDINGS_SERVER,HUGGINGFACE pipe;
class PROM,GRAF mon;
%% API is special
class API api;
Crawler Service: high-performance distributed web crawler
- Technology: Go + Colly framework
- Queue Management: Redis-based distributed URL queue
- Content Delivery: Real-time streaming via Kafka
- Features: Concurrent crawling, URL deduplication, rate limiting, robots.txt compliance
- Security: IP filtering, domain validation, content size limits
- Monitoring: Prometheus metrics, structured logging, health checks
Key Metrics:
- Processes thousands of pages per minute
- Intelligent URL validation prevents malicious access
- Graceful error handling with exponential backoff
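The retry behavior listed above can be illustrated with a minimal Python sketch. The crawler itself is written in Go, so this is not its code: the function name `backoff_delays`, the base delay, the cap, and the optional jitter are all illustrative assumptions rather than values taken from SneakDex.

```python
import random


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0,
                   jitter: bool = False) -> list[float]:
    """Return the wait time (in seconds) before each retry attempt.

    The delay doubles on every attempt and is clamped to `cap`.
    All defaults are hypothetical, chosen only for illustration.
    """
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            # Full jitter: wait a random time in [0, delay] so that many
            # workers do not retry at the same instant.
            delay = random.uniform(0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(5))  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

Capping the delay keeps a long outage from producing unbounded waits, and jitter spreads retries out when several crawler instances hit the same failing host.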
Parser Service: high-performance HTML content extraction and processing
- Technology: Rust for memory safety and blazing speed
- Content Processing: HTML parsing, text extraction, metadata analysis
- Language Detection: Automatic language identification using whatlang
- Text Cleaning: Normalizes whitespace, removes noise, extracts readable content
- Features: Title/description extraction, heading detection, link analysis, image cataloging
- Validation: Content size limits, quality filtering, robust error handling
Key Outputs:
- Structured JSON with cleaned text and metadata
- Language detection and word count analysis
- Hierarchical heading extraction (H1-H6)
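To make the shape of the parser output more concrete, here is a rough Python sketch using only the standard library. The real parser is written in Rust, so this is an illustration of the idea, not its implementation. The JSON field names (`title`, `headings`, `text`, `word_count`) are assumptions, and language detection is left out because `whatlang` is a Rust crate.

```python
import json
import re
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the title, H1-H6 headings, and visible text of a page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    SKIPPED = {"script", "style"}
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[str] = []          # currently open tags
        self.title = ""
        self.headings: list[dict] = []
        self.text_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return                          # void tags never get closed
        self.stack.append(tag)
        if tag in self.HEADINGS:
            self.headings.append({"level": int(tag[1]), "text": ""})

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in self.SKIPPED for t in self.stack):
            return
        if "title" in self.stack:
            self.title += data
            return
        if any(t in self.HEADINGS for t in self.stack):
            self.headings[-1]["text"] += data
        self.text_parts.append(data)


def parse_page(html: str) -> dict:
    extractor = PageExtractor()
    extractor.feed(html)
    # Normalize whitespace: collapse every run into a single space.
    text = re.sub(r"\s+", " ", " ".join(extractor.text_parts)).strip()
    return {
        "title": extractor.title.strip(),
        "headings": extractor.headings,
        "text": text,
        "word_count": len(text.split()),
    }


sample = ("<html><head><title>Demo</title></head><body>"
          "<h1>Hello</h1><p>Rust  parses\n fast.</p></body></html>")
print(json.dumps(parse_page(sample), indent=2))
```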
Indexer Service: scalable semantic and sparse indexing with vector embeddings
- Technology: Python + Sentence Transformers for AI-powered semantic understanding
- Vector Database: Qdrant for high-performance vector similarity search
- Sparse Indexing: Supabase/PostgreSQL with full-text search capabilities (tsvector)
- Semantic Processing: Dense vector embeddings for documents and images
- Batch Processing: Configurable batch sizes for optimal throughput and resource utilization
- Multi-Modal Support: Processes both text content and associated images with captions
Key Features:
- Dual Indexing Strategy: Vector embeddings in Qdrant + metadata in PostgreSQL
- Language-Aware: Stores language metadata for multilingual search optimization
- Content Snippets: Generates searchable text previews for result display
- Fault Tolerance: Skips malformed messages, continues processing with comprehensive error logging
- Real-time Monitoring: Tracks vector count, batch success rates, and processing throughput
Performance Metrics:
- Processes 50 documents per batch by default (configurable)
- Concurrent embedding generation for faster indexing
- Automatic content size limits prevent resource exhaustion
- Horizontal scaling support for enterprise workloads
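The batching and fault-tolerance behavior described above might look roughly like the sketch below. It is a simplified illustration, not the indexer's actual code: the function name `batched_documents` and the required `url` field are assumptions, and a plain list of bytes stands in for the Kafka consumer.

```python
import json
import logging
from typing import Iterable, Iterator

logger = logging.getLogger("indexer-sketch")


def batched_documents(messages: Iterable[bytes],
                      batch_size: int = 50) -> Iterator[list[dict]]:
    """Group raw queue messages into batches, skipping malformed ones."""
    batch: list[dict] = []
    for raw in messages:
        try:
            doc = json.loads(raw)
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            # Log and move on instead of failing the whole batch.
            logger.warning("skipping malformed message: %s", exc)
            continue
        if not isinstance(doc, dict) or "url" not in doc:
            # The 'url' requirement is a hypothetical validation rule.
            logger.warning("skipping message without a 'url' field")
            continue
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


raw_messages = [
    b'{"url": "https://example.com/a"}',
    b"not json",
    b'{"url": "https://example.com/b"}',
    b'{"url": "https://example.com/c"}',
]
for batch in batched_documents(raw_messages, batch_size=2):
    print(len(batch))
```

With a batch size of 2, the malformed message is dropped and the three valid documents arrive as one full batch followed by one partial batch.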
App Service: full-stack search interface with hybrid search capabilities
- Technology: Next.js ≥ 15.4.1 with React frontend and API routes backend
- Search Engine: Hybrid search combining vector similarity and full-text search
- Caching Strategy: Redis/Upstash distributed caching with intelligent TTL management
- ML Integration: MiniLM-L12-v2 embeddings with HuggingFace API fallback
- Multi-Modal Search: Traditional web search and text-to-image semantic search
- Performance: Sub-second response times with intelligent result caching
Key Features:
- Hybrid Search Architecture: Vector search (75% weight) + PostgreSQL full-text search (25% weight), plus a domain-match boost that depends on query length and domain length
- Intelligent Result Fusion: Advanced scoring algorithms merge results from multiple sources
- Text-to-Image Search: Pure vector search for image discovery using semantic embeddings
- Robust Fallbacks: Vector → Payload fallback + PostgreSQL chain ensures high availability
- Smart Caching: Multi-layered caching with in-memory embeddings and Redis persistence
- Real-time Interface: Responsive Next.js frontend with mobile optimization
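As a rough illustration of the 75% / 25% weighting above, the sketch below fuses two result sets in Python. The search API itself runs in Next.js, so this is only a model of the idea. Normalizing each score list by its maximum is an assumption made here so the two scales are comparable, and the domain-match boost is left out because its formula is not given.

```python
def fuse_scores(vector_hits: dict[str, float],
                text_hits: dict[str, float],
                vector_weight: float = 0.75,
                text_weight: float = 0.25) -> list[tuple[str, float]]:
    """Merge vector and full-text scores into one ranked list."""

    def normalize(hits: dict[str, float]) -> dict[str, float]:
        # Scale scores into [0, 1] relative to the best hit (assumption).
        if not hits:
            return {}
        top = max(hits.values())
        return {doc: (score / top if top > 0 else 0.0)
                for doc, score in hits.items()}

    vec = normalize(vector_hits)
    txt = normalize(text_hits)
    fused = {
        doc: vector_weight * vec.get(doc, 0.0) + text_weight * txt.get(doc, 0.0)
        for doc in vec.keys() | txt.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


ranked = fuse_scores(
    vector_hits={"a": 0.9, "b": 0.45},   # e.g. cosine similarities
    text_hits={"b": 4.0, "c": 2.0},      # e.g. full-text rank values
)
print([doc for doc, _ in ranked])  # ['a', 'b', 'c']
```

A document found by only one source still gets ranked, but it contributes zero for the missing source, so results that appear in both lists tend to rise.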
Search Capabilities:
- Semantic Understanding: 384-dimensional vectors with cosine similarity
- Result Ranking: Sophisticated scoring combining relevance and freshness
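Cosine similarity, mentioned above, compares the direction of two embedding vectors and ignores their length. A dependency-free Python sketch follows. In SneakDex this comparison is delegated to Qdrant, so the function is purely illustrative and the toy vectors are made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Two toy 384-dimensional vectors, matching the embedding size above.
query = [1.0] * 384
doc = [1.0] * 192 + [0.0] * 192
print(round(cosine_similarity(query, doc), 4))  # 0.7071
```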
API Endpoints:
- POST /api/search: Hybrid web search with configurable parameters
- POST /api/search-images: Text-to-image semantic search
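A call to the search endpoint could be built as in the sketch below. The request schema is not documented here, so the `query` field in the JSON body is a guess, and `http://localhost:3000` is simply the default Next.js development address. Check the API route source for the real parameter names before relying on this.

```python
import json
import urllib.request


def build_search_request(query: str,
                         base_url: str = "http://localhost:3000"
                         ) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/search."""
    # The body shape is hypothetical: a single 'query' field.
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request("distributed search engine")
print(req.get_method(), req.full_url)
# To send it against a running instance:
#   with urllib.request.urlopen(req) as resp:
#       results = json.load(resp)
```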
Performance Metrics:
- Handles millions of documents with sub-second search times
- Concurrent user support with horizontal scaling
- Intelligent caching reduces database load by 80%+
- 99.9% uptime with comprehensive fallback mechanisms
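The multi-layered caching idea (a fast in-memory tier in front of Redis) can be sketched as follows. This is a simplified model, not the app's code: a plain dict stands in for Redis, and the class name `TwoTierCache`, the TTL default, and the injectable clock are all assumptions for illustration.

```python
import time
from typing import Callable, Optional


class TwoTierCache:
    """In-memory TTL cache in front of a slower shared store."""

    def __init__(self, backend: dict, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.backend = backend            # stand-in for Redis in this sketch
        self.ttl = ttl_seconds
        self.clock = clock
        # key -> (expiry timestamp, value)
        self.local: dict[str, tuple[float, object]] = {}

    def _set_local(self, key: str, value: object) -> None:
        self.local[key] = (self.clock() + self.ttl, value)

    def get(self, key: str) -> Optional[object]:
        entry = self.local.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]               # fresh hit in the local tier
        self.local.pop(key, None)         # expired or missing locally
        value = self.backend.get(key)
        if value is not None:
            self._set_local(key, value)   # promote to the local tier
        return value

    def set(self, key: str, value: object) -> None:
        self.backend[key] = value
        self._set_local(key, value)


cache = TwoTierCache(backend={}, ttl_seconds=30.0)
cache.set("q:rust parser", ["https://example.com/a"])
print(cache.get("q:rust parser"))
```

Repeated queries are answered from process memory, and only expired or unseen keys reach the shared store.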
- Docker & Docker Compose
- Go ≥ 1.24 (for development)
- Rust ≥ 1.82 (for development)
- Python ≥ 3.12 (for development)
- Next.js ≥ 15.4.1 (for development)
- Redis ≥ 7.0
- Apache Kafka ≥ 4.0.0
- Qdrant ≥ 1.0.0
- Supabase/PostgreSQL ≥ 2.0.0
# Clone the repository
git clone https://github.com/Sneakyhydra/SneakDex.git
cd SneakDex
# List of commands
make help
# Start all services
make up
# Start all services (PROD)
make up ENV=prod
# Start a service
make up SERVICE=crawler
# Scale a service
make up SERVICE=parser SCALE="parser=3"
# View logs
make logs
All services are configured via environment variables for container-friendly deployment.
- Real-time metrics for all services
# Run crawler service
cd services/crawler
go mod download
go run cmd/crawler/main.go
# Run parser service
cd services/parser
cargo run
# Run indexer service
cd services/indexer
python -m src.main
# Run app service
cd services/app
npm install
npm run dev
# With development config
export GO_ENV=development
export LOG_LEVEL=debug
export NODE_ENV=development
sneakdex/
├── services/
│   ├── crawler/              # Web crawling service (Go)
│   ├── parser/               # Content parsing service (Rust)
│   ├── indexer/              # Search indexing service (Python)
│   └── app/                  # Search interface service (Next.js)
│       ├── app/              # Next.js pages and API routes
│       │   ├── _components/  # React components
│       │   ├── _contexts/    # Context for state management
│       │   ├── _types/       # TypeScript types
│       │   └── api/          # API
│       └── public/           # Static assets
│
├── docker-compose.yml
├── monitoring/
├── Architecture.png
├── Makefile
└── README.md
- Crawling Speed: 1000+ pages/minute per instance
- Parsing Throughput: High-speed Rust processing with memory safety
- Indexing Rate: 50 documents per batch by default (configurable), with semantic embeddings
- Search Latency: Sub-second response times with hybrid search
- Vector Search: Sub-millisecond similarity search via Qdrant
- Cache Performance: 80%+ hit rate reduces database load significantly
- Concurrent Processing: Parallel connections per service
- Memory Efficient: Multi-level caching and batch processing reduce resource usage
- Horizontal Scaling: Add instances to increase throughput linearly
- Fault Tolerant: Auto-retry with exponential backoff across all services
- ML Performance: Local embeddings with HuggingFace fallback for high availability
- Private IP address filtering (RFC 1918)
- Domain whitelist/blacklist support
- Content size limits (prevents DoS)
- Request timeout protection
- User-Agent transparency
- Container security best practices
- Environment-based secrets management
- Payload sanitization and validation
- API key authentication for external services
- Input validation and query sanitization
- Rate limiting and abuse protection
- Minimal attack surface with health-check-only inbound ports
- Outbound filtering for HTTP/HTTPS only
- Internal service mesh for secure communication
- Encrypted connections to Qdrant and Supabase
- Secure API endpoints with comprehensive validation
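The RFC 1918 filtering listed among the security features can be illustrated with Python's standard `ipaddress` module. The crawler implements its check in Go, so this sketch only shows the address ranges involved. The function name `is_rfc1918` is hypothetical, and a production filter would typically also cover loopback and link-local ranges and resolve hostnames before checking.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(ip: str) -> bool:
    """Return True if `ip` is a private IPv4 address per RFC 1918.

    Raises ValueError if `ip` is not a valid IP address literal.
    """
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in RFC1918_NETWORKS)


for candidate in ("10.1.2.3", "172.31.255.255", "172.32.0.1", "8.8.8.8"):
    print(candidate, is_rfc1918(candidate))
```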
MIT License - feel free to use, modify, and contribute to this project.
Built with ❤️ for the open web
