[Enhancement] Add Support for Semantic Deduplication #104
🚀 SyGra now supports Semantic Deduplication
🧾 Summary
This PR introduces Semantic Deduplication 🧠✨ as a graph-level post processor to remove near-duplicate generated outputs using embedding-based cosine similarity.
The goal is to improve output quality 📈 and scalability ⚡ by eliminating semantically similar items at the final graph output stage.
The PR also includes documentation 📚, examples 🧪, and unit tests ✅ to ensure the feature is easy to configure, verify, and maintain.
🛠️ Features Implemented
➕ Added
- `SemanticDedupPostProcessor` under `sygra/core/graph/graph_postprocessor.py` to perform semantic deduplication on the final graph output list by embedding a configurable field and dropping items above a similarity threshold.
- ⚙️ Added configurable `dedup_mode` options:
  - `nearest_neighbor` (default): Incremental deduplication using a nearest-neighbor approach (via LangChain `InMemoryVectorStore`) for improved scalability.
  - `all_pairs`: Computes a full cosine similarity matrix for exact but O(n²) comparisons.
- `dedup_mode` is validated (unsupported values raise `ValueError`).
- 🧾 Added optional JSON report generation (when `metadata["output_file"]` is available), including:
- 📚 Added and updated documentation & examples, along with 🧪 unit tests covering:
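
For reviewers who want a feel for the two `dedup_mode` strategies described above, here is a minimal standalone sketch in plain Python. It is not the code in this PR: the function names (`cosine`, `dedup_all_pairs`, `dedup_nearest_neighbor`) are hypothetical, the embeddings are hand-written toy vectors, a linear scan over kept items stands in for the LangChain `InMemoryVectorStore` lookup, and the "drop when similarity is strictly above the threshold" rule is an assumption about how the threshold is applied.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedup_all_pairs(vectors: List[Vector], threshold: float) -> List[int]:
    """Build the full n x n similarity matrix, then keep first occurrences.

    Time and memory are both O(n^2) because of the matrix.
    """
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    kept: List[int] = []
    for i in range(n):
        # Keep item i only if no already-kept item is above the threshold.
        if all(sim[i][j] <= threshold for j in kept):
            kept.append(i)
    return kept


def dedup_nearest_neighbor(vectors: List[Vector], threshold: float) -> List[int]:
    """Incremental variant: compare each item with its closest kept item.

    No n x n matrix is materialized; only kept items are consulted.
    (Assumption: a linear scan replaces the vector-store query here.)
    """
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        best = max((cosine(vec, vectors[j]) for j in kept), default=-1.0)
        if best <= threshold:
            kept.append(i)
    return kept


# Toy embeddings: items 0/1 are near-duplicates, as are items 2/3.
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
print(dedup_all_pairs(embeddings, threshold=0.9))         # [0, 2]
print(dedup_nearest_neighbor(embeddings, threshold=0.9))  # [0, 2]
```

In this sketch both functions return the same indices, since each keeps the first item of every near-duplicate group; the difference is that the incremental variant never holds a full similarity matrix in memory. Whether the two modes in the PR always agree is not something this sketch establishes.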
⚡ Performance Impact
- `dedup_mode: nearest_neighbor` significantly reduces memory usage and improves runtime for large output lists by avoiding an O(n²) similarity matrix build.
- `dedup_mode: all_pairs` remains available for exact behavior but has O(n²) time and memory characteristics.
- `nearest_neighbor` mode.
🧪 How to Test the Feature
Steps for reviewers to verify functionality:
Observe:
🖼️ Screenshots (if applicable)
N/A (no UI changes).
✅ Checklist
📝 Notes