@psriramsnc psriramsnc commented Jan 12, 2026

🚀 SyGra now supports Semantic Deduplication

🧾 Summary

This PR introduces Semantic Deduplication 🧠✨ as a graph-level post processor to remove near-duplicate generated outputs using embedding-based cosine similarity.
The goal is to improve output quality 📈 and scalability ⚡ by eliminating semantically similar items at the final graph output stage.
The PR also includes documentation 📚, examples 🧪, and unit tests ✅ to ensure the feature is easy to configure, verify, and maintain.

🛠️ Features Implemented

  • ➕ Added SemanticDedupPostProcessor under sygra/core/graph/graph_postprocessor.py to perform semantic deduplication on the final graph output list by embedding a configurable field and dropping items above a similarity threshold.

  • ⚙️ Added configurable dedup_mode options:

    • 🧭 nearest_neighbor (default): Incremental deduplication using a nearest-neighbor approach (via LangChain InMemoryVectorStore) for improved scalability.
    • 🔍 all_pairs: Computes a full cosine similarity matrix for exact but O(n²) comparisons.
    • 🚨 Added strict validation for dedup_mode (unsupported values raise ValueError).
  • 🧾 Added optional JSON report generation (when metadata["output_file"] is available), including:

    • 📊 Dropped vs kept statistics
    • 🔁 Sampled duplicate pairs
  • 📚 Added and updated documentation & examples, along with 🧪 unit tests covering:

    • Keep strategy behavior
    • Dedup modes
    • Invalid mode handling
    • Report generation
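To make the all_pairs behavior and the keep strategy concrete, here is a minimal, dependency-free sketch. It is not the code from SemanticDedupPostProcessor: the function names (validate_dedup_mode, cosine, dedup_all_pairs), the default threshold of 0.9, and the ">=" comparison are assumptions chosen for illustration. Only the mode names, the keep values, and the ValueError on unsupported modes come from this PR description.

```python
import math
from typing import List, Sequence, Tuple

SUPPORTED_MODES = ("nearest_neighbor", "all_pairs")  # mode names taken from the PR text


def validate_dedup_mode(dedup_mode: str) -> str:
    """Reject unsupported modes with ValueError, as the PR describes."""
    if dedup_mode not in SUPPORTED_MODES:
        raise ValueError(f"Unsupported dedup_mode: {dedup_mode!r}")
    return dedup_mode


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedup_all_pairs(
    embeddings: List[Sequence[float]],
    threshold: float = 0.9,  # hypothetical default
    keep: str = "first",
) -> Tuple[List[int], List[Tuple[int, int, float]]]:
    """Return (kept indices, duplicate pairs as (kept, dropped, similarity))."""
    if keep not in ("first", "last"):
        raise ValueError(f"Unsupported keep strategy: {keep!r}")
    n = len(embeddings)
    # Full n x n similarity matrix: this is the O(n^2) time and memory cost.
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    # keep="last" walks the list backwards so later items win over earlier ones.
    order = range(n) if keep == "first" else range(n - 1, -1, -1)
    kept: List[int] = []
    pairs: List[Tuple[int, int, float]] = []
    for i in order:
        # Assumption: an item is a duplicate if similarity >= threshold.
        match = next((k for k in kept if sim[i][k] >= threshold), None)
        if match is None:
            kept.append(i)
        else:
            pairs.append((match, i, sim[i][match]))
    return sorted(kept), pairs
```

With three toy vectors where the first two point in nearly the same direction, keep="first" retains indices 0 and 2, while keep="last" retains indices 1 and 2. The recorded pairs are the kind of data a duplicate-pair sample in a report could draw from.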

⚡ Performance Impact

  • 🚀 dedup_mode: nearest_neighbor significantly reduces memory usage and improves runtime for large output lists by avoiding an O(n²) similarity matrix build.
  • 🐢 dedup_mode: all_pairs remains available for exact behavior but has O(n²) time and memory characteristics.
  • Overall: Positive performance impact for large outputs when using the default nearest_neighbor mode.
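The memory difference can be illustrated with a rough sketch of incremental deduplication. The PR uses LangChain InMemoryVectorStore for the nearest-neighbor lookup; the KeptIndex class below is a hypothetical plain-Python stand-in (a linear scan over kept vectors), not that API, and the function name dedup_incremental and the ">=" comparison are likewise assumptions. The point it shows is that only kept vectors are stored, so no n x n similarity matrix is ever built.

```python
import math
from typing import List, Optional, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class KeptIndex:
    """Hypothetical stand-in for a vector store; holds only kept vectors."""

    def __init__(self) -> None:
        self._ids: List[int] = []
        self._vectors: List[List[float]] = []

    def add(self, item_id: int, vector: Sequence[float]) -> None:
        self._ids.append(item_id)
        self._vectors.append(normalize(vector))

    def nearest(self, vector: Sequence[float]) -> Optional[Tuple[int, float]]:
        """Return (id, similarity) of the most similar kept vector, if any."""
        query = normalize(vector)
        best: Optional[Tuple[int, float]] = None
        for item_id, stored in zip(self._ids, self._vectors):
            score = sum(q * s for q, s in zip(query, stored))
            if best is None or score > best[1]:
                best = (item_id, score)
        return best


def dedup_incremental(embeddings: List[Sequence[float]], threshold: float = 0.9) -> List[int]:
    """Keep an item only if its nearest already-kept neighbor is below threshold."""
    index = KeptIndex()
    kept: List[int] = []
    for i, vector in enumerate(embeddings):
        hit = index.nearest(vector)
        if hit is not None and hit[1] >= threshold:
            continue  # near-duplicate of an item that was already kept
        index.add(i, vector)
        kept.append(i)
    return kept
```

In this sketch memory grows with the number of kept items rather than with n². The linear scan itself still compares each item against every kept item, so any runtime gain in practice depends on the lookup the real vector store provides.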

🧪 How to Test the Feature

Steps for reviewers to verify functionality:

  1. ▶️ Run unit tests:
    pytest tests/core/graph/test_graph_postprocessor.py -k SemanticDedupPostProcessor

  2. Observe tests passing for:
    • Dedup correctness (keep: first / keep: last)
    • dedup_mode handling (nearest_neighbor, all_pairs, invalid mode raises)
    • Report file generation when metadata["output_file"] is provided
  3. Run the example task config:
    • Use tasks/examples/semantic_dedup or tasks/examples/semantic_dedup_no_seed
    • Execute the task using the standard project task runner flow
  4. Observe:
    • Output list is deduplicated based on similarity_threshold
    • A report file is written next to the output file (if output metadata is set), named semantic_dedup_report_*.json

🖼️ Screenshots (if applicable)

N/A (no UI changes).

✅ Checklist

  • Lint fixes and unit testing done
  • End to end task testing
  • Documentation updated

📝 Notes

  • nearest_neighbor is the recommended/default mode for scale.
  • all_pairs is intended for smaller datasets due to its O(n²) cost.
  • Dedup report generation depends on metadata["output_file"] being present at runtime; without it, dedup still runs but no report is persisted.
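
The dependency on metadata["output_file"] can be sketched as follows. This is an illustrative guess at the shape of the logic, not the PR's implementation: the function name write_dedup_report, the report field names, the suffix parameter (standing in for whatever replaces the * in semantic_dedup_report_*.json), and the max_pairs cap are all hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple


def write_dedup_report(
    metadata: Dict[str, Any],
    total: int,
    kept: int,
    duplicate_pairs: List[Tuple[int, int, float]],
    suffix: str = "example",  # hypothetical; the real suffix is not given in the PR
    max_pairs: int = 20,  # hypothetical cap on sampled pairs
) -> Optional[Path]:
    """Write a JSON report next to the output file, or return None if unset."""
    output_file = metadata.get("output_file")
    if not output_file:
        # Dedup has already run; there is simply nowhere to persist a report.
        return None
    report = {
        "total": total,
        "kept": kept,
        "dropped": total - kept,
        "sampled_duplicate_pairs": [
            {"kept_index": k, "dropped_index": d, "similarity": s}
            for k, d, s in duplicate_pairs[:max_pairs]
        ],
    }
    path = Path(output_file).parent / f"semantic_dedup_report_{suffix}.json"
    path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return path
```

Returning None instead of raising mirrors the note above: a missing output path disables only the report, not the deduplication itself.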

@psriramsnc psriramsnc self-assigned this Jan 12, 2026
@psriramsnc psriramsnc added the enhancement New feature or request label Jan 12, 2026
@psriramsnc psriramsnc marked this pull request as ready for review January 12, 2026 12:51
@psriramsnc psriramsnc requested a review from a team as a code owner January 12, 2026 12:51