
Conversation

@Johnnas12 (Collaborator)

This pull request introduces semantic search capabilities to the Neo4j graph pipeline by adding node embedding and search scripts, along with related documentation updates. It also includes minor code cleanups and formatting improvements in the Galaxy tools metadata downloader.

Semantic search and embedding features:

  • Added embed_nodes.py to compute and store text embeddings on graph nodes using a configurable open-source model (default: MiniLM), with optional vector index creation for Neo4j to support semantic search.
  • Added search.py CLI script to embed a user query and return top-K similar nodes by label, leveraging stored embeddings for semantic similarity search.
  • Updated agents/generic_loader/README.md with instructions and usage examples for embedding nodes and running semantic search, including details on model selection and index creation (see the sketch below).
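
For illustration, a minimal sketch of the embedding pass, assuming the MiniLM default via sentence-transformers; the label, text property, and index name (Tool, description, tool_embeddings) are placeholders, not necessarily what embed_nodes.py uses:

from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # 384-dim MiniLM default

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
model = SentenceTransformer(MODEL)

with driver.session() as session:
    # Fetch nodes that still lack an embedding (label/property names assumed).
    rows = session.run(
        "MATCH (n:Tool) WHERE n.embedding IS NULL "
        "RETURN elementId(n) AS eid, coalesce(n.description, n.name) AS text"
    ).data()

    # Encode all node texts in one batch.
    vectors = model.encode([r["text"] for r in rows])

    for row, vec in zip(rows, vectors):
        session.run(
            "MATCH (n) WHERE elementId(n) = $eid SET n.embedding = $vec",
            eid=row["eid"], vec=[float(x) for x in vec],
        )

    # Optional vector index (Neo4j 5.11+) backing cosine-similarity search.
    session.run(
        "CREATE VECTOR INDEX tool_embeddings IF NOT EXISTS "
        "FOR (n:Tool) ON (n.embedding) "
        "OPTIONS {indexConfig: {`vector.dimensions`: 384, "
        "`vector.similarity_function`: 'cosine'}}"
    )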

Improvements to Galaxy tools metadata downloader:

  • Changed the default tool_limit in tool_downloader.py to None for unlimited downloads and cleaned up formatting in several places for better readability.

Streamlines logic for tool selection and metadata extraction by reducing unnecessary line breaks and simplifying code structure. Changes the default tool limit to unlimited for greater flexibility. Removes explicit encoding specification when saving output, relying on defaults. Improves maintainability and readability without altering core functionality.

Introduces scripts to compute and store text embeddings on graph nodes
using a sentence-transformer model, enabling vector-based semantic
search via cosine similarity (see the query sketch below). Enhances
support for semantic retrieval and discoverability of nodes in the
Neo4j database.
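
And roughly the query side that search.py is described as providing, assuming the tool_embeddings vector index from the sketch above (the function name and return shape are illustrative):

def search(session, model, query, k=5):
    # Embed the query with the same model used for the node embeddings.
    qvec = [float(x) for x in model.encode(query)]
    result = session.run(
        "CALL db.index.vector.queryNodes('tool_embeddings', $k, $vec) "
        "YIELD node, score "
        "RETURN node.name AS name, score ORDER BY score DESC",
        k=k, vec=qvec,
    )
    return [(r["name"], r["score"]) for r in result]
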
Documents new functionality for computing node embeddings using open
models and performing semantic search in Neo4j. Highlights optional
steps to enhance search capabilities and index support. Improves
pipeline usability for similarity-based queries.

Prevents accidental tracking of generated CSV and Cypher output files
related to workflows and tools. Reduces repository clutter from
intermediate and result files.
@Johnnas12 requested a review from Tibex88 on December 29, 2025, 17:55
Introduces Makefile targets to convert the latest tool and workflow
JSONs to CSV, and to load them into a graph database with configurable
credentials. Enables a full data-processing pipeline for streamlined
automation.

Introduces targets to automate downloading of tool and workflow
metadata, and a composite target to run the downloads and the existing
pipeline in sequence. Improves reproducibility and streamlines setup
for new environments.

Updates CSV writing to quote all fields, set a backslash as the escape character, and use a consistent newline terminator.
Improves compatibility with downstream CSV consumers and prevents issues with special characters in data.
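In Python's csv module those settings would look something like this (the file and column names here are illustrative):

import csv

with open("tools.csv", "w", newline="") as fh:
    writer = csv.writer(
        fh,
        quoting=csv.QUOTE_ALL,   # quote every field
        escapechar="\\",         # backslash as the escape character
        lineterminator="\n",     # consistent newline terminator
    )
    writer.writerow(["name", "description"])
    writer.writerow(["bwa", 'Handles "special" characters, commas'])
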
Switches from semicolon-based to blank-line separation for Cypher statements, avoiding issues with embedded semicolons in strings. Adds progress bars for statement execution and optional pre-load index creation to speed up MERGE operations. Enhances usability and reliability of the batch loader.
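A minimal sketch of that splitting-and-execution loop, assuming a statements.cypher input file and the tqdm progress-bar library:

from neo4j import GraphDatabase
from tqdm import tqdm

with open("statements.cypher") as fh:
    # Blank-line separation sidesteps semicolons embedded in string literals.
    statements = [s.strip() for s in fh.read().split("\n\n") if s.strip()]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for stmt in tqdm(statements, desc="Executing Cypher"):
        session.run(stmt)
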
Replaces in-memory CSV parsing and manual Cypher statement assembly with APOC-based periodic batch loaders using LOAD CSV, improving scalability and leveraging database-side MD5 generation for IDs.

Adds support for configurable batch size, parallel execution, custom concurrency, and flexible CSV path resolution. Simplifies code structure by removing row-by-row Python processing in favor of database-driven logic, enhancing performance for large datasets.
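The APOC pattern being described is presumably along these lines; the CSV path, label, and key columns are assumptions rather than the PR's actual schema:

from neo4j import GraphDatabase

LOAD_TOOLS = """
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///tools.csv' AS row RETURN row",
  "MERGE (t:Tool {id: apoc.util.md5([row.name, row.version])})
   SET t.name = row.name, t.version = row.version",
  {batchSize: $batch_size, parallel: $parallel, concurrency: $concurrency}
)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # IDs are generated database-side via apoc.util.md5; Python never
    # touches individual rows.
    session.run(LOAD_TOOLS, batch_size=1000, parallel=False, concurrency=4)
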
Optimizes node and relationship import by processing CSV data
in configurable batches instead of loading all rows at once,
reducing memory usage and improving scalability for large files.
Adds progress bars for better user feedback and timing logs for
performance monitoring. Updates CLI to support batch size
configuration.
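For comparison with the APOC approach, batched ingestion driven from Python could look like this (the MERGE pattern and batch size are placeholders):

import csv
from itertools import islice

from neo4j import GraphDatabase
from tqdm import tqdm

MERGE_BATCH = "UNWIND $rows AS row MERGE (t:Tool {id: row.id}) SET t += row"

def batches(reader, size):
    # Yield successive fixed-size chunks so memory use stays bounded.
    while chunk := list(islice(reader, size)):
        yield chunk

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with open("tools.csv", newline="") as fh, driver.session() as session:
    for rows in tqdm(batches(csv.DictReader(fh), 1000), desc="Importing"):
        session.run(MERGE_BATCH, rows=rows)
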
Integrates pydantic models to validate and structure all data
extracted for CSV output, improving data consistency and error
resilience. Centralizes schema definitions in a dedicated module
to enable easier maintenance and future extension.

Uses Pydantic models to ensure consistent structure and data integrity when converting tool JSON to CSV.
Improves robustness and maintainability by validating tool data before output.
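A minimal sketch of that validation layer, assuming Pydantic v2; the field names are illustrative, not the schema module's actual definitions:

import csv
from pydantic import BaseModel, ValidationError

class ToolRow(BaseModel):
    name: str
    version: str
    description: str = ""

def write_tools_csv(raw_tools, path):
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["name", "version", "description"])
        writer.writeheader()
        for raw in raw_tools:
            try:
                tool = ToolRow(**raw)  # validate before anything reaches the CSV
            except ValidationError as err:
                print(f"Skipping malformed tool entry: {err}")
                continue
            writer.writerow(tool.model_dump())  # model_dump is Pydantic v2
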
Eliminates redundant and unused model classes to reduce code complexity
and improve maintainability. Focuses the schema on essential properties
and avoids unnecessary duplication. No functional impact expected.
from pydantic import BaseModel  # type: ignore
from typing import Any

class WorkflowProperties(BaseModel):
Contributor
Are these models auto-generated from the yml files? If not, they should be.

@Tibex88 (Contributor) commented Jan 13, 2026

Also, there is an error in the test.

