A high-performance service that removes duplicate and near-duplicate content from datasets, keeping them diverse and free of repeated tokens. Built on Cloudflare Workers. Used in ELTEX.
🤖 Powered by WALL-E, a GitHub bot that supercharges spec-driven development through automated generation of Cloudflare Workers.
This service demonstrates how to effectively handle exact and semantic duplicates in data collection workflows:
- Cloudflare Workers + Workers AI: Vector embeddings (bge-m3) for semantic comparison
- PostgreSQL + pgvector: Vector storage and cosine similarity search
- Two-stage filtering: Content hash filtering → Vector similarity threshold
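
The two-stage filter can be illustrated with a minimal in-memory sketch. This is not the service's actual code: the names `contentHash`, `cosineSimilarity`, `checkDuplicate`, and `Stored`, the normalization rules, and the `0.9` threshold are all assumptions made for illustration. In the real service the embeddings would come from Workers AI (bge-m3) and the similarity search would run in PostgreSQL with pgvector; here a plain array stands in for both. The `node:crypto` import assumes a Node-compatible runtime.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape: one entry per accepted item.
type Stored = { hash: string; embedding: number[] };
type Verdict = "exact_duplicate" | "near_duplicate" | "unique";

// Stage 1 helper: hash a normalized form of the text so that trivial
// whitespace and case differences still count as exact duplicates.
// (The normalization rules are an assumption, not taken from the service.)
function contentHash(text: string): string {
  const normalized = text.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256").update(normalized).digest("hex");
}

// Stage 2 helper: cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions do not match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Run the cheap hash check first, then the similarity check.
// Items that pass both stages are added to the store.
function checkDuplicate(
  text: string,
  embedding: number[],
  store: Stored[],
  threshold = 0.9, // assumed value; tune per dataset
): Verdict {
  const hash = contentHash(text);
  if (store.some((s) => s.hash === hash)) return "exact_duplicate";
  if (store.some((s) => cosineSimilarity(s.embedding, embedding) >= threshold)) {
    return "near_duplicate";
  }
  store.push({ hash, embedding });
  return "unique";
}
```

Running the hash check first means exact duplicates are rejected without computing or comparing any embeddings, so the more expensive similarity search only runs on content that is at least textually new.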