
Conversation

@Johnnas12 (Collaborator)

This pull request introduces semantic search capabilities to the Neo4j graph pipeline by adding node embedding and search scripts, along with related documentation updates. It also includes minor code cleanups and formatting improvements in the Galaxy tools metadata downloader.

Semantic search and embedding features:

  • Added embed_nodes.py to compute and store text embeddings on graph nodes using a configurable open-source model (default: MiniLM), with optional vector index creation for Neo4j to support semantic search.
  • Added search.py CLI script to embed a user query and return top-K similar nodes by label, leveraging stored embeddings for semantic similarity search.
  • Updated agents/generic_loader/README.md with instructions and usage examples for embedding nodes and running semantic search, including details on model selection and index creation (see the sketch below).
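
For illustration, a minimal sketch of the embedding pass, assuming the MiniLM default via sentence-transformers; the label, text property, and index name (Tool, description, tool_embeddings) are placeholders, not necessarily what embed_nodes.py uses:

from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # 384-dim MiniLM default

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
model = SentenceTransformer(MODEL)

with driver.session() as session:
    # Fetch nodes that still lack an embedding (label/property names assumed).
    rows = session.run(
        "MATCH (n:Tool) WHERE n.embedding IS NULL "
        "RETURN elementId(n) AS eid, coalesce(n.description, n.name) AS text"
    ).data()

    # Encode all node texts in one batch.
    vectors = model.encode([r["text"] for r in rows])

    for row, vec in zip(rows, vectors):
        session.run(
            "MATCH (n) WHERE elementId(n) = $eid SET n.embedding = $vec",
            eid=row["eid"], vec=[float(x) for x in vec],
        )

    # Optional vector index (Neo4j 5.11+) backing cosine-similarity search.
    session.run(
        "CREATE VECTOR INDEX tool_embeddings IF NOT EXISTS "
        "FOR (n:Tool) ON (n.embedding) "
        "OPTIONS {indexConfig: {`vector.dimensions`: 384, "
        "`vector.similarity_function`: 'cosine'}}"
    )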

Improvements to Galaxy tools metadata downloader:

  • Changed the default tool_limit in tool_downloader.py to None for unlimited downloads and cleaned up formatting in several places for better readability.

Streamlines logic for tool selection and metadata extraction by reducing unnecessary line breaks and simplifying code structure. Changes the default tool limit to unlimited for greater flexibility. Removes explicit encoding specification when saving output, relying on defaults. Improves maintainability and readability without altering core functionality.

Introduces scripts to compute and store text embeddings on graph nodes
using a sentence-transformer model, enabling vector-based semantic
search via cosine similarity (see the query sketch below). Enhances
support for semantic retrieval and discoverability of nodes in the
Neo4j database.
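
And roughly the query side that search.py is described as providing, assuming the tool_embeddings vector index from the sketch above (the function name and return shape are illustrative):

def search(session, model, query, k=5):
    # Embed the query with the same model used for the node embeddings.
    qvec = [float(x) for x in model.encode(query)]
    result = session.run(
        "CALL db.index.vector.queryNodes('tool_embeddings', $k, $vec) "
        "YIELD node, score "
        "RETURN node.name AS name, score ORDER BY score DESC",
        k=k, vec=qvec,
    )
    return [(r["name"], r["score"]) for r in result]
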
Documents new functionality for computing node embeddings using open
models and performing semantic search in Neo4j. Highlights optional
steps to enhance search capabilities and index support. Improves
pipeline usability for similarity-based queries.

Prevents accidental tracking of generated CSV and Cypher output files
related to workflows and tools. Reduces repository clutter from
intermediate and result files.
@Johnnas12 requested a review from Tibex88 on December 29, 2025, 17:55
Introduces Makefile targets to convert the latest tool and workflow
JSONs to CSV, and to load them into a graph database with configurable
credentials. Enables a full data-processing pipeline for streamlined
automation.

Introduces targets to automate downloading of tool and workflow
metadata, and a composite target to run the downloads and the existing
pipeline in sequence. Improves reproducibility and streamlines setup
for new environments.

Updates CSV writing to quote all fields, set a backslash as the escape character, and use a consistent newline terminator.
Improves compatibility with downstream CSV consumers and prevents issues with special characters in data.
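In Python's csv module those settings would look something like this (the file and column names here are illustrative):

import csv

with open("tools.csv", "w", newline="") as fh:
    writer = csv.writer(
        fh,
        quoting=csv.QUOTE_ALL,   # quote every field
        escapechar="\\",         # backslash as the escape character
        lineterminator="\n",     # consistent newline terminator
    )
    writer.writerow(["name", "description"])
    writer.writerow(["bwa", 'Handles "special" characters, commas'])
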
Switches from semicolon-based to blank-line separation for Cypher statements, avoiding issues with embedded semicolons in strings. Adds progress bars for statement execution and optional pre-load index creation to speed up MERGE operations. Enhances usability and reliability of the batch loader.
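A minimal sketch of that splitting-and-execution loop, assuming a statements.cypher input file and the tqdm progress-bar library:

from neo4j import GraphDatabase
from tqdm import tqdm

with open("statements.cypher") as fh:
    # Blank-line separation sidesteps semicolons embedded in string literals.
    statements = [s.strip() for s in fh.read().split("\n\n") if s.strip()]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for stmt in tqdm(statements, desc="Executing Cypher"):
        session.run(stmt)
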
Replaces in-memory CSV parsing and manual Cypher statement assembly with APOC-based periodic batch loaders using LOAD CSV, improving scalability and leveraging database-side MD5 generation for IDs.

Adds support for configurable batch size, parallel execution, custom concurrency, and flexible CSV path resolution. Simplifies code structure by removing row-by-row Python processing in favor of database-driven logic, enhancing performance for large datasets.
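The APOC pattern being described is presumably along these lines; the CSV path, label, and key columns are assumptions rather than the PR's actual schema:

from neo4j import GraphDatabase

LOAD_TOOLS = """
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///tools.csv' AS row RETURN row",
  "MERGE (t:Tool {id: apoc.util.md5([row.name, row.version])})
   SET t.name = row.name, t.version = row.version",
  {batchSize: $batch_size, parallel: $parallel, concurrency: $concurrency}
)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # IDs are generated database-side via apoc.util.md5; Python never
    # touches individual rows.
    session.run(LOAD_TOOLS, batch_size=1000, parallel=False, concurrency=4)
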
Optimizes node and relationship import by processing CSV data
in configurable batches instead of loading all rows at once,
reducing memory usage and improving scalability for large files.
Adds progress bars for better user feedback and timing logs for
performance monitoring. Updates CLI to support batch size
configuration.
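For comparison with the APOC approach, batched ingestion driven from Python could look like this (the MERGE pattern and batch size are placeholders):

import csv
from itertools import islice

from neo4j import GraphDatabase
from tqdm import tqdm

MERGE_BATCH = "UNWIND $rows AS row MERGE (t:Tool {id: row.id}) SET t += row"

def batches(reader, size):
    # Yield successive fixed-size chunks so memory use stays bounded.
    while chunk := list(islice(reader, size)):
        yield chunk

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with open("tools.csv", newline="") as fh, driver.session() as session:
    for rows in tqdm(batches(csv.DictReader(fh), 1000), desc="Importing"):
        session.run(MERGE_BATCH, rows=rows)
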
Integrates pydantic models to validate and structure all data
extracted for CSV output, improving data consistency and error
resilience. Centralizes schema definitions in a dedicated module
to enable easier maintenance and future extension.

Uses Pydantic models to ensure consistent structure and data integrity when converting tool JSON to CSV.
Improves robustness and maintainability by validating tool data before output.
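A minimal sketch of that validation layer, assuming Pydantic v2; the field names are illustrative, not the schema module's actual definitions:

import csv
from pydantic import BaseModel, ValidationError

class ToolRow(BaseModel):
    name: str
    version: str
    description: str = ""

def write_tools_csv(raw_tools, path):
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["name", "version", "description"])
        writer.writeheader()
        for raw in raw_tools:
            try:
                tool = ToolRow(**raw)  # validate before anything reaches the CSV
            except ValidationError as err:
                print(f"Skipping malformed tool entry: {err}")
                continue
            writer.writerow(tool.model_dump())  # model_dump is Pydantic v2
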
Eliminates redundant and unused model classes to reduce code complexity
and improve maintainability. Focuses the schema on essential properties
and avoids unnecessary duplication. No functional impact expected.
from pydantic import BaseModel  # type: ignore
from typing import Any

class WorkflowProperties(BaseModel):
Contributor
Are these models auto-generated from the yml files? If not, they should be.

@Tibex88 (Contributor) commented Jan 13, 2026

Also, there is an error in the test.

