1712n/dedup-service

Deduplication Service

A high-performance service designed to eliminate duplicate and near-duplicate content, ensuring diversity and token uniqueness in datasets. Built on Cloudflare Workers. Used in ELTEX.

🤖 Powered by WALL-E, a GitHub bot that supercharges spec-driven development through automated generation of Cloudflare Workers.

Implementation approach

This service demonstrates how to effectively handle exact and semantic duplicates in data collection workflows:

  • Cloudflare Workers + Workers AI: Vector embeddings (bge-m3) for semantic comparison
  • PostgreSQL + pgvector: Vector storage and cosine similarity search
  • Two-stage filtering: Exact-match content hash check → Vector similarity threshold for near-duplicates
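
The two-stage flow above can be sketched as follows. This is a minimal in-memory illustration, not the service's actual code: the names `contentHash`, `cosineSimilarity`, `classify`, and `StoredItem`, the FNV-1a hash, the trim/lowercase normalization, and the `0.9` threshold are all assumptions made for the example. In the real service, embeddings would come from Workers AI (bge-m3) and the similarity search would run in PostgreSQL with pgvector rather than over an array.

```typescript
// Hypothetical types and names; the repository's real schema and API may differ.
type Vector = number[];

interface StoredItem {
  hash: string;      // content hash of a previously accepted item
  embedding: Vector; // its embedding vector
}

type Verdict = "exact_duplicate" | "near_duplicate" | "unique";

// Stage 1 helper: FNV-1a (32-bit) over normalized text.
// A stand-in for whatever content hash the service actually uses.
function contentHash(text: string): string {
  const normalized = text.trim().toLowerCase(); // assumed normalization
  let h = 0x811c9dc5;
  for (let i = 0; i < normalized.length; i++) {
    h ^= normalized.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16).padStart(8, "0");
}

// Stage 2 helper: cosine similarity between two equal-length vectors.
function cosineSimilarity(a: Vector, b: Vector): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / Math.sqrt(normA * normB);
}

// Run the cheap hash check first; only compare vectors if no exact match exists.
function classify(
  text: string,
  embedding: Vector,
  store: StoredItem[],
  threshold = 0.9, // assumed value; tune per dataset
): Verdict {
  const hash = contentHash(text);
  if (store.some((item) => item.hash === hash)) {
    return "exact_duplicate";
  }
  for (const item of store) {
    if (cosineSimilarity(embedding, item.embedding) >= threshold) {
      return "near_duplicate";
    }
  }
  return "unique";
}
```

Running the hash check first means exact matches are rejected without comparing any vectors. With pgvector, the linear scan in `classify` would typically be replaced by a query ordered by the cosine distance operator `<=>`, where cosine similarity equals one minus that distance.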
