Introduce semantic search capabilities to the Neo4j graph pipeline by adding node embedding and search scripts #44
Open
Johnnas12 wants to merge 13 commits into main from graph-rag-and-community
Streamlines logic for tool selection and metadata extraction by reducing unnecessary line breaks and simplifying code structure. Changes the default tool limit to unlimited for greater flexibility. Removes explicit encoding specification when saving output, relying on defaults. Improves maintainability and readability without altering core functionality.
Introduces scripts to compute and store text embeddings on graph nodes using a sentence transformer model, enabling vector-based semantic search via cosine similarity. Enhances support for semantic retrieval and discoverability of nodes in the Neo4j database.
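The ranking step described above can be sketched in plain Python. This is a minimal illustration of cosine-similarity top-K over stored vectors, not the code from this PR: the function names and the in-memory dictionary of node embeddings are assumptions made for the example, and in practice the vectors would come from the sentence transformer model and the Neo4j node properties.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    node_embeddings: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every node against the query and return the k best matches."""
    scored = [
        (node_id, cosine_similarity(query, vector))
        for node_id, vector in node_embeddings.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A brute-force scan like this is linear in the number of nodes, which is one reason a database-side vector index becomes attractive for larger graphs.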
Documents new functionality for computing node embeddings using open models and performing semantic search in Neo4j. Highlights optional steps to enhance search capabilities and index support. Improves pipeline usability for similarity-based queries.
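For readers unfamiliar with the optional index step, the sketch below builds the two Cypher statements usually involved, using Neo4j 5.x vector index syntax. Everything specific here is an assumption for illustration: the index name, the Tool label, the embedding property name, and the 384 dimensions (the output size of the common all-MiniLM-L6-v2 model; the PR only says "MiniLM"). The scripts in this PR may build these statements differently.

```python
import re

# Only plain identifiers are accepted, because index names, labels, and
# property names cannot be passed as Cypher parameters.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def create_vector_index_cypher(
    index_name: str,
    label: str,
    prop: str = "embedding",   # hypothetical property name
    dimensions: int = 384,     # assumed model output size
    similarity: str = "cosine",
) -> str:
    """Build a CREATE VECTOR INDEX statement (Neo4j 5.x syntax)."""
    for value in (index_name, label, prop):
        if not _IDENTIFIER.match(value):
            raise ValueError(f"unsafe identifier: {value!r}")
    if similarity not in ("cosine", "euclidean"):
        raise ValueError("similarity must be 'cosine' or 'euclidean'")
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop}) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {int(dimensions)}, "
        f"`vector.similarity_function`: '{similarity}'"
        "}}"
    )


# Parameterised top-K lookup against an existing vector index.
QUERY_NODES_CYPHER = (
    "CALL db.index.vector.queryNodes($index_name, $k, $query_vector) "
    "YIELD node, score RETURN node, score"
)
```

The lookup statement takes the index name, K, and the query embedding as parameters, so the same string can be reused for every search.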
Prevents accidental tracking of generated CSV and cypher output files related to workflows and tools. Reduces repository clutter from intermediate and result files.
Introduces Makefile targets to convert latest tool and workflow JSONs to CSV, and to load them into a graph database with configurable credentials. Enables a full data processing pipeline for streamlined automation.
Introduces targets to automate downloading of tool and workflow metadata, and a composite target to run downloads and the existing pipeline in sequence. Improves reproducibility and streamlines setup for new environments.
Updates CSV writing to quote all fields, set a backslash as the escape character, and use a consistent newline terminator. Improves compatibility with downstream CSV consumers and prevents issues with special characters in data.
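The writer settings described in this commit map directly onto options of Python's standard csv module. The following is a small sketch of that configuration with hypothetical helper names; the actual converter in the PR may wire it up differently.

```python
import csv
import io
from typing import Iterable, List


def write_rows(rows: Iterable[List[str]]) -> str:
    """Serialise rows with every field quoted and a fixed line terminator."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        quoting=csv.QUOTE_ALL,   # quote all fields, not only the risky ones
        escapechar="\\",         # backslash as the escape character
        lineterminator="\n",     # same terminator on every platform
    )
    writer.writerows(rows)
    return buffer.getvalue()


def read_rows(text: str) -> List[List[str]]:
    """Parse text produced by write_rows using the matching escape character."""
    return list(csv.reader(io.StringIO(text), escapechar="\\"))
```

Quoting every field costs a few bytes per value but means commas and quotes inside descriptions never shift columns for the consumer.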
Switches from semicolon-based to blank line separation for Cypher statements, avoiding issues with embedded semicolons in strings. Adds progress bars for statement execution and optional pre-load index creation to speed up MERGE operations. Enhances usability and reliability of the batch loader.
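To illustrate why blank-line separation is safer than splitting on semicolons, here is a minimal sketch of such a splitter. The function name is hypothetical and the loader in this PR may differ; note that this simple version would still break on a string literal that itself contains a blank line.

```python
import re
from typing import List


def split_statements(script: str) -> List[str]:
    """Split a Cypher script into statements separated by blank lines."""
    # Normalise Windows line endings so the pattern below stays simple.
    normalised = script.replace("\r\n", "\n")
    # One or more whitespace-only lines mark a statement boundary.
    blocks = re.split(r"\n\s*\n", normalised.strip())
    return [block.strip() for block in blocks if block.strip()]
```

A statement such as `MERGE (t:Tool {id: 'a;b'})` stays intact here, whereas a semicolon-based split would cut it in the middle of the string.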
Replaces in-memory CSV parsing and manual Cypher statement assembly with APOC-based periodic batch loaders using LOAD CSV, improving scalability and leveraging database-side MD5 generation for IDs. Adds support for configurable batch size, parallel execution, custom concurrency, and flexible CSV path resolution. Simplifies code structure by removing row-by-row Python processing in favor of database-driven logic, enhancing performance for large datasets.
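A rough sketch of what an APOC-based batch loader can look like is shown below. It relies on the real `apoc.periodic.iterate` procedure and `apoc.util.md5` function, but the Tool label, the name and version columns, the default values, and the helper name are all assumptions for illustration rather than the statements used in this PR.

```python
from typing import Any, Dict, Tuple


def build_tool_loader(
    csv_url: str,
    batch_size: int = 1000,
    parallel: bool = False,
    concurrency: int = 4,
) -> Tuple[str, Dict[str, Any]]:
    """Return a Cypher statement and its parameters for a batched CSV load."""
    if batch_size < 1 or concurrency < 1:
        raise ValueError("batch_size and concurrency must be positive")
    # The first inner statement streams rows; the second runs once per row,
    # committed in batches. The node ID is an MD5 computed inside the database.
    query = """
CALL apoc.periodic.iterate(
  'LOAD CSV WITH HEADERS FROM $url AS row RETURN row',
  'MERGE (t:Tool {id: apoc.util.md5([row.name, row.version])})
   SET t.name = row.name, t.version = row.version',
  {batchSize: $batchSize, parallel: $parallel,
   concurrency: $concurrency, params: {url: $url}}
)
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
""".strip()
    params = {
        "url": csv_url,
        "batchSize": batch_size,
        "parallel": parallel,
        "concurrency": concurrency,
    }
    return query, params
```

The returned pair can be handed to any Neo4j driver session. Keeping `parallel` off by default is a cautious choice here, since concurrent MERGE operations on overlapping nodes can contend for locks.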
Optimizes node and relationship import by processing CSV data in configurable batches instead of loading all rows at once, reducing memory usage and improving scalability for large files. Adds progress bars for better user feedback and timing logs for performance monitoring. Updates CLI to support batch size configuration.
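The batching idea can be sketched with a small generator that never holds more than one batch of rows in memory. The helper name is hypothetical, and how the PR sends each batch to Neo4j is not shown in this conversation.

```python
import csv
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def iter_batches(
    rows: Iterable[Dict[str, str]], batch_size: int
) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most batch_size rows from any row iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        # islice pulls only the next batch_size rows from the source.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_batches(path: str, batch_size: int) -> int:
    """Example consumer: stream a CSV file and count the batches produced."""
    with open(path, newline="", encoding="utf-8") as handle:
        return sum(1 for _ in iter_batches(csv.DictReader(handle), batch_size))
```

Because `csv.DictReader` is itself lazy, the file is read incrementally and memory use is bounded by the batch size rather than the file size.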
Integrates pydantic models to validate and structure all data extracted for CSV output, improving data consistency and error resilience. Centralizes schema definitions in a dedicated module to enable easier maintenance and future extension.
Uses Pydantic models to ensure consistent structure and data integrity when converting tool JSON to CSV. Improves robustness and maintainability by validating tool data before output.
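As a rough illustration of validating records before they reach the CSV writer, the sketch below defines a small Pydantic model and collects validation failures instead of aborting. The ToolRow model and its fields are hypothetical and are not the schema added in this PR.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

from pydantic import BaseModel, ValidationError


class ToolRow(BaseModel):
    """Hypothetical schema for one row of the tools CSV."""

    tool_id: str
    name: str
    version: Optional[str] = None


def validate_tools(
    raw_tools: Iterable[Dict[str, Any]],
) -> Tuple[List[ToolRow], List[str]]:
    """Split raw records into validated rows and error messages."""
    valid: List[ToolRow] = []
    errors: List[str] = []
    for raw in raw_tools:
        try:
            valid.append(ToolRow(**raw))
        except ValidationError as exc:
            # Keep going so one bad record does not stop the whole export.
            errors.append(str(exc))
    return valid, errors
```

Collecting errors rather than raising on the first one makes it easier to report every malformed record in a single run.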
Eliminates redundant and unused model classes to reduce code complexity and improve maintainability. Focuses the schema on essential properties and avoids unnecessary duplication. No functional impact expected.
Tibex88 requested changes on Jan 13, 2026
    from pydantic import BaseModel  # type: ignore
    from typing import Any


    class WorkflowProperties(BaseModel):
Are these models auto-generated from the yml files? If not, they should be.
Also, there is an error in the test.
This pull request introduces semantic search capabilities to the Neo4j graph pipeline by adding node embedding and search scripts, along with related documentation updates. It also includes minor code cleanups and formatting improvements in the Galaxy tools metadata downloader.
Semantic search and embedding features:
- Adds embed_nodes.py to compute and store text embeddings on graph nodes using a configurable open-source model (default: MiniLM), with optional vector index creation for Neo4j to support semantic search.
- Adds a search.py CLI script to embed a user query and return the top-K similar nodes by label, leveraging stored embeddings for semantic similarity search.
- Updates agents/generic_loader/README.md with instructions and usage examples for embedding nodes and running semantic search, including details on model selection and index creation.
Improvements to Galaxy tools metadata downloader:
- Changes the default tool_limit in tool_downloader.py to None for unlimited downloads and cleans up formatting in several places for better readability.