[Enhancement] Add Support for Semantic Deduplication #104

psriramsnc · 2026-01-12T05:20:39Z

🚀 SyGra now supports Semantic Deduplication

🧾 Summary

This PR introduces Semantic Deduplication 🧠✨ as a graph-level post processor to remove near-duplicate generated outputs using embedding-based cosine similarity.
The goal is to improve output quality 📈 and scalability ⚡ by eliminating semantically similar items at the final graph output stage.
The PR also includes documentation 📚, examples 🧪, and unit tests ✅ to ensure the feature is easy to configure, verify, and maintain.

🛠️ Features Implemented

➕ Added SemanticDedupPostProcessor under sygra/core/graph/graph_postprocessor.py to perform semantic deduplication on the final graph output list by embedding a configurable field and dropping items above a similarity threshold.
⚙️ Added configurable dedup_mode options:
- 🧭 nearest_neighbor (default): Incremental deduplication using a nearest-neighbor approach (via LangChain InMemoryVectorStore) for improved scalability.
- 🔍 all_pairs: Computes a full cosine similarity matrix, giving exact pairwise comparisons at O(n²) time and memory cost.
- 🚨 Added strict validation for dedup_mode (unsupported values raise ValueError).
🧾 Added optional JSON report generation (when metadata["output_file"] is available), including:
- 📊 Dropped vs kept statistics
- 🔁 Sampled duplicate pairs
📚 Added and updated documentation & examples, along with 🧪 unit tests covering:
- Keep strategy behavior
- Dedup modes
- Invalid mode handling
- Report generation
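
For reviewers who want a mental model of the two modes, here is a minimal, dependency-free sketch. It is not the code in sygra/core/graph/graph_postprocessor.py: the function name `semantic_dedup`, its signature, and the choice to drop on strictly-greater-than the threshold are assumptions made for illustration, and the nearest_neighbor branch uses a brute-force scan over kept items instead of LangChain's InMemoryVectorStore.

```python
import math
from typing import List, Sequence

# Mode names taken from the PR description.
SUPPORTED_MODES = ("nearest_neighbor", "all_pairs")


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_dedup(
    embeddings: List[Sequence[float]],
    similarity_threshold: float = 0.9,
    keep: str = "first",
    dedup_mode: str = "nearest_neighbor",
) -> List[int]:
    """Return indices of the items to keep, in original order (hypothetical helper)."""
    if dedup_mode not in SUPPORTED_MODES:
        raise ValueError(f"Unsupported dedup_mode: {dedup_mode!r}")
    if keep not in ("first", "last"):
        raise ValueError(f"Unsupported keep strategy: {keep!r}")

    n = len(embeddings)
    # keep="first" scans forward, keep="last" scans backward, so the
    # preferred member of each duplicate group is seen (and kept) first.
    order = range(n) if keep == "first" else range(n - 1, -1, -1)
    kept: List[int] = []

    if dedup_mode == "all_pairs":
        # Build the full n x n similarity matrix up front: O(n^2) time and memory.
        sim = [[cosine(a, b) for b in embeddings] for a in embeddings]
        for i in order:
            if all(sim[i][j] <= similarity_threshold for j in kept):
                kept.append(i)
    else:
        # Incremental: compare each candidate only against its best match
        # among the items kept so far; no n x n matrix is materialized.
        for i in order:
            best = max(
                (cosine(embeddings[i], embeddings[j]) for j in kept),
                default=-1.0,
            )
            if best <= similarity_threshold:
                kept.append(i)

    return sorted(kept)
```

In this sketch both modes return the same indices because the incremental scan is exact; an index that searches approximately could differ from all_pairs on borderline items.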

⚡ Performance Impact

🚀 dedup_mode: nearest_neighbor significantly reduces memory usage and improves runtime for large output lists by avoiding an O(n²) similarity matrix build.
🐢 dedup_mode: all_pairs remains available for exact behavior but has O(n²) time and memory characteristics.
✅ Overall: Positive performance impact for large outputs when using the default nearest_neighbor mode.
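
To make the O(n²) memory cost concrete, the back-of-the-envelope calculation below estimates the size of a dense similarity matrix. It assumes 4-byte float32 entries and no sparsity; the dtype actually used by all_pairs is not stated in this PR, and `similarity_matrix_bytes` is a hypothetical helper.

```python
def similarity_matrix_bytes(n: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for a dense n x n similarity matrix (float32 assumed)."""
    return n * n * bytes_per_value


for n in (1_000, 10_000, 100_000):
    gib = similarity_matrix_bytes(n) / 2**30
    print(f"n={n:>7,}: {gib:,.3f} GiB")
```

Under these assumptions the matrix is a few megabytes at 1,000 items but roughly 37 GiB at 100,000 items, which is why all_pairs is best reserved for small outputs.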

🧪 How to Test the Feature

Steps for reviewers to verify functionality:

▶️ Run unit tests:

pytest tests/core/graph/test_graph_postprocessor.py -k SemanticDedupPostProcessor

Observe tests passing for:

- Dedup correctness (keep: first / keep: last)
- dedup_mode handling (nearest_neighbor, all_pairs; an invalid mode raises ValueError)
- Report file generation when metadata["output_file"] is provided

Run the example task config:

- Use tasks/examples/semantic_dedup or tasks/examples/semantic_dedup_no_seed
- Execute the task using the standard project task runner flow
- Observe:
  - Output list is deduplicated based on similarity_threshold
  - A report file is written next to the output file (if output metadata is set), named semantic_dedup_report_*.json
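
The report behavior described above can be sketched roughly as follows. This is an illustration only: `write_dedup_report`, the JSON key names, the `max_pairs` cap, and the use of the output file's stem for the `*` part of the file name are all assumptions, not the schema produced by this PR.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple


def write_dedup_report(
    metadata: Dict[str, Any],
    total: int,
    kept: int,
    duplicate_pairs: List[Tuple[int, int, float]],
    max_pairs: int = 20,
) -> Optional[Path]:
    """Write a JSON report next to the output file; return None if no output file is set."""
    output_file = metadata.get("output_file")
    if not output_file:
        # Dedup still ran, but there is nowhere to persist a report.
        return None

    out = Path(output_file)
    # Hypothetical naming: the real suffix behind semantic_dedup_report_*.json may differ.
    report_path = out.parent / f"semantic_dedup_report_{out.stem}.json"
    report = {
        "total": total,
        "kept": kept,
        "dropped": total - kept,
        # Persist only a sample of pairs so the report stays small.
        "sampled_duplicate_pairs": [
            {"dropped_index": d, "kept_index": k, "similarity": s}
            for d, k, s in duplicate_pairs[:max_pairs]
        ],
    }
    report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report_path
```

When verifying manually, an empty or missing metadata["output_file"] should result in no report file at all rather than an error.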

🖼️ Screenshots (if applicable)

N/A (no UI changes).

✅ Checklist

Lint fixes and unit testing done
End to end task testing
Documentation updated

📝 Notes

nearest_neighbor is the recommended/default mode for scale.
all_pairs is intended for smaller datasets due to its O(n²) cost.
Dedup report generation depends on metadata["output_file"] being present at runtime; without it, dedup still runs but no report is persisted.

Added Semantic Dedup Graph Post Processor

22da38e

psriramsnc self-assigned this Jan 12, 2026

psriramsnc added the enhancement New feature or request label Jan 12, 2026

Merge branch 'main' into scratch/feat_semantic_dedup

e772e38

github-code-quality bot found potential problems Jan 12, 2026

tasks/examples/semantic_dedup_no_seed/task_executor.py (fixed)

Added ANN dedup, test cases and documentation

744b918

psriramsnc marked this pull request as ready for review January 12, 2026 12:51

psriramsnc requested a review from a team as a code owner January 12, 2026 12:51

vipul-mittal and others added 2 commits January 13, 2026 11:47

Merge branch 'main' into scratch/feat_semantic_dedup

002cf6a

Added Documentation page to mkdocs

9111173

3 participants
