Community Detection & Semantic Summarization #42
Merged
This pull request introduces a new "Community Detection & Semantic Summarization" module for the Galaxy Tool Knowledge Graph. It implements a hierarchical clustering engine using the Leiden algorithm to detect communities of tools based on usage patterns, and leverages LLMs (via Hugging Face) to generate semantic titles and summaries for these communities. The changes include new configuration options, documentation, the clustering and summarization scripts, and visual diagrams of the data pipeline and hierarchy.
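As a rough illustration of what "usage patterns" could mean here, the sketch below counts how often two tools appear in the same workflow, giving the kind of weighted edges a clustering algorithm such as Leiden takes as input. The function name and the workflow data are hypothetical; the actual script reads its data from Neo4j and its weighting may differ.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(workflows):
    """Count how many workflows each unordered pair of tools shares."""
    weights = Counter()
    for tools in workflows:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(tools)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical workflows, each a list of tool IDs (not taken from this PR).
workflows = [
    ["fastqc", "trimmomatic", "bwa"],
    ["fastqc", "bwa"],
    ["trimmomatic", "bwa", "samtools"],
]
weights = cooccurrence_weights(workflows)
```

Each key in `weights` is an undirected tool pair and each value is an edge weight; a graph library with a Leiden implementation would take these as its weighted edge list.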
Key additions and improvements:
1. Community Detection & Clustering Implementation
Added build_communities.py, which connects to Neo4j, normalizes the graph, builds a weighted tool co-occurrence topology, and applies a two-level Leiden clustering algorithm to assign tools to hierarchical communities. Results are written back to Neo4j with the corresponding relationships.
2. LLM-based Semantic Summarization
Added summarize_communities.py, which queries the Hugging Face API to generate semantic titles and summaries for each detected community at both levels and updates the Neo4j database accordingly. It includes error handling and JSON extraction.
3. Documentation & Explanation
Added README.md explaining the purpose, algorithmic approach, usage instructions, and the role of this module in powering GraphRAG and agent reasoning.
4. Visual Documentation
Added diagrams of the data pipeline (pipeline_flow.mmd) and of the hierarchical logic and relationships among communities and tools (hierarchy_logic.mmd). [1] [2]
5. Configuration Updates
Updated .env.sample to include the Hugging Face LLM and Neo4j configuration variables required for the new modules.

This pull request introduces a new "Community Detection & Semantic Summarization" module for the Galaxy Tool Knowledge Graph, enabling hierarchical clustering of tools based on real-world usage and semantic labeling of these clusters using LLMs. The changes span new configuration, code, documentation, and schema updates to support building, storing, and summarizing tool communities, which are foundational for advanced graph-based reasoning and recommendations.

Major new functionality: Community detection and semantic summarization
Added the agents/community_detection/build_communities.py script, which constructs a weighted tool-tool co-occurrence graph from workflow data in Neo4j, applies a hierarchical Leiden clustering algorithm, and writes multi-level community assignments back to the database.
Added the agents/community_detection/summarize_communities.py script, which uses a Hugging Face-hosted LLM to generate human-readable titles and summaries for each detected community and stores these semantic labels in Neo4j.

Documentation and diagram updates
Added README.md to the agents/community_detection directory, detailing the purpose, algorithmic approach, usage, and architecture of the new community detection and summarization pipeline.
Added pipeline_flow.mmd (the data pipeline from raw workflows to semantic graph) and hierarchy_logic.mmd (the hierarchical community structure and relationships). [1] [2]

Schema and configuration enhancements
Updated schema.node.yml and schema.edge.yml to define a new node type (Community) and new edge types (USED_WITH, IN_COMMUNITY), enabling storage of community assignments and tool-tool relationships in Neo4j. [1] [2]
Updated .env.sample with the Hugging Face and Neo4j configuration variables required for the new modules.

Other updates
Updated agents/ingestion/main.py to use a newer dataset, supporting the expanded tool graph.
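
The JSON extraction mentioned for summarize_communities.py could look something like the sketch below, which pulls the first parseable JSON object out of a free-form LLM reply. This is an assumption about the general approach, not the script's actual code; the function name, the sample reply, and the "title"/"summary" keys are hypothetical.

```python
import json


def extract_json(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores any trailing text after it.
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


# Hypothetical LLM reply with chatter around the JSON payload.
reply = 'Here is the result: {"title": "Read QC", "summary": "Quality control tools."} Hope that helps.'
parsed = extract_json(reply)
```

Returning None instead of raising lets the caller decide whether to retry the request or skip the community.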