@Johnnas12 Johnnas12 commented Dec 29, 2025

This pull request introduces a new "Community Detection & Semantic Summarization" module for the Galaxy Tool Knowledge Graph. It implements a hierarchical clustering engine using the Leiden algorithm to detect communities of tools based on usage patterns, and leverages LLMs (via Hugging Face) to generate semantic titles and summaries for these communities. The changes include new configuration options, documentation, the clustering and summarization scripts, and visual diagrams of the data pipeline and hierarchy.

Key additions and improvements:

1. Community Detection & Clustering Implementation

  • Added build_communities.py, which connects to Neo4j, normalizes the graph, builds weighted tool co-occurrence topology, and applies a two-level Leiden clustering algorithm to assign tools to hierarchical communities. Results are written back to Neo4j with appropriate relationships.
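As a rough illustration of the co-occurrence step, the sketch below counts how many workflows contain each pair of tools. It is not taken from build_communities.py: the function name, the in-memory `workflows` mapping, and the plain "one count per shared workflow" weighting are all assumptions, and the actual script reads its data from Neo4j and may weight edges differently.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(workflows):
    """Count, for every unordered tool pair, how many workflows contain both tools."""
    weights = Counter()
    for tools in workflows.values():
        # Deduplicate and sort so each pair has one canonical (a, b) key with a < b.
        for a, b in combinations(sorted(set(tools)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical workflow -> tools mapping; real data would come from Neo4j.
workflows = {
    "wf1": ["fastqc", "trimmomatic", "bwa"],
    "wf2": ["fastqc", "bwa"],
    "wf3": ["bwa", "samtools"],
}
weights = cooccurrence_weights(workflows)
```

The resulting pair counts could serve as edge weights for a clustering library; sorting each pair keeps the graph undirected.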

2. LLM-based Semantic Summarization

  • Added summarize_communities.py, which queries the Hugging Face API to generate semantic titles and summaries for each detected community at both levels, updating the Neo4j database accordingly. Includes robust error handling and JSON extraction.
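LLM replies often wrap the requested JSON in extra prose, so some extraction step is needed before parsing. The sketch below shows one way to do that with the standard library only. The helper name `extract_json_object` and the sample reply are hypothetical, and the script in this PR may use a different approach.

```python
import json


def extract_json_object(text):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next one.
        start = text.find("{", start + 1)
    return None


# Hypothetical model reply with chatter around the payload.
reply = 'Here is the summary: {"title": "Read Mapping", "summary": "Aligns reads."} Hope that helps.'
parsed = extract_json_object(reply)
```

Returning `None` instead of raising lets a batch job log the failure and move on to the next community.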

3. Documentation & Explanation

  • Added a comprehensive README.md explaining the purpose, algorithmic approach, usage instructions, and the role of this module in powering GraphRAG and agent reasoning.

4. Visual Documentation

  • Added Mermaid diagrams illustrating the data pipeline (pipeline_flow.mmd) and the hierarchical logic and relationships among communities and tools (hierarchy_logic.mmd).

5. Configuration Updates

  • Updated .env.sample to include Hugging Face LLM and Neo4j configuration variables required for the new modules.

This pull request introduces a new "Community Detection & Semantic Summarization" module for the Galaxy Tool Knowledge Graph, enabling hierarchical clustering of tools based on real-world usage and semantic labeling of these clusters using LLMs. The changes span new configuration, code, documentation, and schema updates to support building, storing, and summarizing tool communities, which are foundational for advanced graph-based reasoning and recommendations.

Major new functionality: Community detection and semantic summarization

  • Added the agents/community_detection/build_communities.py script, which constructs a weighted tool-tool co-occurrence graph from workflow data in Neo4j, applies a hierarchical Leiden clustering algorithm, and writes multi-level community assignments back to the database.
  • Added the agents/community_detection/summarize_communities.py script, which uses a Hugging Face-hosted LLM to generate human-readable titles and summaries for each detected community, storing these semantic labels in Neo4j.
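For readers unfamiliar with multi-level clustering, one common way to get from specific to broad communities is to collapse each fine-grained community into a single node and sum the edge weights between communities, then cluster that smaller graph again. The sketch below shows only the collapsing step. Whether build_communities.py works this way is not stated in this PR, and `collapse_to_communities`, the sample weights, and the integer community ids are illustrative assumptions.

```python
from collections import defaultdict


def collapse_to_communities(edge_weights, membership):
    """Sum tool-level edge weights into weights between level-0 communities."""
    collapsed = defaultdict(int)
    for (a, b), w in edge_weights.items():
        ca, cb = membership[a], membership[b]
        if ca == cb:
            # Edges inside one community do not link two communities;
            # this simplified sketch drops them.
            continue
        key = (min(ca, cb), max(ca, cb))
        collapsed[key] += w
    return dict(collapsed)


# Hypothetical inputs: weighted tool pairs and a level-0 assignment.
edge_weights = {
    ("bwa", "fastqc"): 2,
    ("bwa", "samtools"): 3,
    ("fastqc", "multiqc"): 4,
    ("multiqc", "samtools"): 1,
}
membership = {"fastqc": 0, "multiqc": 0, "bwa": 1, "samtools": 1}
community_edges = collapse_to_communities(edge_weights, membership)
```

A second clustering pass over `community_edges` would then produce the broader level, with each tool inheriting the group of its level-0 community.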

Documentation and diagram updates

  • Added a comprehensive README.md to the agents/community_detection directory, detailing the purpose, algorithmic approach, usage, and architecture of the new community detection and summarization pipeline.
  • Added two new Mermaid diagrams: pipeline_flow.mmd (data pipeline from raw workflows to semantic graph) and hierarchy_logic.mmd (visualizing the hierarchical community structure and relationships).

Schema and configuration enhancements

  • Updated schema.node.yml and schema.edge.yml to define new node (Community) and edge types (USED_WITH, IN_COMMUNITY), enabling storage of community assignments and tool-tool relationships in Neo4j.
  • Updated .env.sample with Hugging Face and Neo4j configuration variables required for the new modules.
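To make the new Community node and IN_COMMUNITY edge concrete, here is a minimal sketch of how community assignments might be batched for a write-back. Only the `Community` label and the `IN_COMMUNITY` relationship type come from this PR. The `Tool` label, the `name`, `id`, and `level` properties, and the `L{level}-{id}` naming scheme are assumptions made for illustration.

```python
# Cypher for a batched write. Only Community and IN_COMMUNITY come from the PR;
# the Tool label and all property names are assumed.
WRITE_MEMBERSHIP = """
UNWIND $rows AS row
MATCH (t:Tool {name: row.tool})
MERGE (c:Community {id: row.community_id})
SET c.level = row.level
MERGE (t)-[:IN_COMMUNITY]->(c)
"""


def membership_rows(membership, level):
    """Turn a tool -> community-id mapping into parameter rows for the query."""
    return [
        {"tool": tool, "community_id": f"L{level}-{cid}", "level": level}
        for tool, cid in sorted(membership.items())
    ]


# Hypothetical level-0 assignment for two tools.
rows = membership_rows({"fastqc": 0, "bwa": 1}, level=0)
```

With the official Neo4j Python driver, the rows could be sent in one round trip via `session.run(WRITE_MEMBERSHIP, rows=rows)`. Using MERGE rather than CREATE means a rerun does not duplicate nodes or relationships.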

Other updates

  • Updated the workflow data import path in agents/ingestion/main.py to use a newer dataset, supporting the expanded tool graph.

Macmilan24 and others added 7 commits December 25, 2025 13:02
…summarization

- Add `build_communities.py` to detect Galaxy tool communities using the Leiden algorithm with a 2-level hierarchy (L0 specific, L1 broad).
- Implement graph projection logic to build weighted `USED_WITH` edges based on workflow data flow and co-occurrence.
- Add `summarize_communities.py` to generate semantic titles and descriptions for clusters using the HuggingFace Inference API.
- Implement robust JSON parsing with few-shot prompting and idempotency checks for reliable LLM batch processing.
- Add graph normalization step to materialize missing `USES_TOOL` relationships in Neo4j.
…onfigs

feat(community-detection): implement hierarchical clustering and LLM summarization
Enables conversion of tool metadata from JSON to multiple CSV files
to improve integration with the generic loader pipeline. Expands the
schema to include tools, categories, inputs, and outputs, and defines
corresponding relationships for enhanced workflow analysis.
@Johnnas12 Johnnas12 merged commit 18031fb into main Dec 29, 2025
0 of 2 checks passed

3 participants