A multi-stage pipeline for extracting Metal-Organic Polyhedra (MOPs) information from scientific papers using MCP-enhanced LLM agents, producing structured knowledge graphs (TTL).
- Python 3.11+
- (Recommended) WSL on Windows for a smoother Linux-like environment
- Docker (only if you use MCP tools that require it; some tools are stdio-only)
```bash
# venv
python -m venv .venv
source .venv/bin/activate  # Windows PowerShell: .venv\Scripts\Activate.ps1

# or conda
conda create -n mcp_layer python=3.11
conda activate mcp_layer
```

Then install dependencies:

```bash
pip install -r requirements.txt
```

This repo git-ignores many runtime folders (caches, logs, generated prompts/scripts).
Some modules (notably models/locations.py) require directories to exist at import time.
Run:
```bash
python scripts/bootstrap_repo.py
```

If you plan to run grounding/lookup agents, also create grounding-cache folders:

```bash
python scripts/bootstrap_repo.py --with-grounding-cache ontospecies
```

Copy the example MCP config:

```bash
cp configs/mcp_configs.json.example configs/mcp_configs.json
```

Then edit configs/mcp_configs.json to reflect your local environment (paths, server commands).
This repo does not ship a committed .env.example. Create .env in the repo root with what your environment expects.
At minimum, many agents expect something like:
```bash
API_KEY=...
BASE_URL=...
```

The exact keys depend on your models/ModelConfig.py / models/LLMCreator.py configuration.
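As a minimal sketch of how such a file can be consumed (this parser is illustrative only — it is not the repo's actual loader, and real projects typically use python-dotenv's load_dotenv()):

```python
import os
from pathlib import Path


def load_dotenv_minimal(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file into os.environ.

    Illustrative sketch: ignores quoting, multi-line values, and
    export prefixes; skips blank lines and # comments.
    """
    values = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
        # Do not clobber variables already set in the environment.
        os.environ.setdefault(key.strip(), values[key.strip()])
    return values
```

With a .env containing API_KEY=... and BASE_URL=..., agents could then read both via os.environ.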
After running python scripts/bootstrap_repo.py, you should have (among others):
- data/ (runtime data, cached results; gitignored)
- data/log/ (required; some modules error if missing)
- data/ontologies/ (place ontology T-Box TTLs here)
- data/grounding_cache/<ontology>/labels (optional; for Script C fuzzy lookup)
- raw_data/ (PDF inputs; gitignored)
- sandbox/ (scratch scripts; gitignored)
- ai_generated_contents*/ (LLM-generated artifacts; gitignored)
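Before importing modules that need these folders at import time (notably models/locations.py), a quick sanity check can help. The helper below is hypothetical, not part of the repo:

```python
from pathlib import Path

# Folders this README expects bootstrap_repo.py to create
# (data/log is required at import time by some modules).
REQUIRED_DIRS = ["data", "data/log", "data/ontologies", "raw_data", "sandbox"]


def missing_dirs(root: str = ".") -> list[str]:
    """Return the required directories that do not yet exist under root."""
    base = Path(root)
    return [d for d in REQUIRED_DIRS if not (base / d).is_dir()]
```

If missing_dirs() returns a non-empty list, re-run python scripts/bootstrap_repo.py.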
There are two “layers”:
- Ontology-specific MCP lookup server (generated for a given ontology)
- Grounding consumer agent that applies mappings to TTLs
This repo includes configs/grounding.json to run the OntoSpecies lookup server via stdio.
The grounding agent lives at src/agents/grounding/grounding_agent.py.
- Single file:
```bash
python -m src.agents.grounding.grounding_agent --ttl path/to/file.ttl --write-grounded-ttl
```

- Batch folder (recursively processes *.ttl, skipping *_grounded.ttl and *link.ttl):

```bash
python -m src.agents.grounding.grounding_agent --batch-dir evaluation/data/merged_tll --write-grounded-ttl
```

Notes:
- Internal merge (deduplicating identical nodes across TTLs) runs by default in batch mode; disable with --no-internal-merge.
- The default grounding materialization mode is replace (replaces source_iri with grounded_iri). You can switch to sameas with --grounding-mode sameas.
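To make the two modes concrete, here is a minimal sketch of the materialization difference on plain (subject, predicate, object) triples. The function name and the owl:sameAs convention are assumptions for illustration, not the agent's actual implementation:

```python
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"


def materialize(triples, mapping, mode="replace"):
    """Apply grounding mappings (source_iri -> grounded_iri) to triples.

    mode='replace': rewrite every occurrence of a source IRI in place.
    mode='sameas':  keep the triples untouched and append owl:sameAs links.
    """
    if mode == "replace":
        return [tuple(mapping.get(t, t) for t in triple) for triple in triples]
    if mode == "sameas":
        links = [(src, OWL_SAMEAS, dst) for src, dst in mapping.items()]
        return list(triples) + links
    raise ValueError(f"unknown grounding mode: {mode}")
```

replace yields a smaller graph with canonical IRIs only; sameas preserves the original extraction IRIs and leaves the identity resolution explicit in the output.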
The main pipeline entrypoint is mop_main.py (see its CLI help):
```bash
python mop_main.py --help
```

Use the following canonical Python entrypoints to generate plans, prompts, and MCP scripts.
```bash
python -m src.agents.scripts_and_prompts_generation.task_division_agent \
  --tbox data/ontologies/ontosynthesis.ttl \
  --output configs/task_division_plan.json \
  --model gpt-5
```

```bash
python -m src.agents.scripts_and_prompts_generation.task_prompt_creation_agent \
  --version 1 \
  --plan configs/task_division_plan.json \
  --tbox data/ontologies/ontosynthesis.ttl \
  --model gpt-4.1 \
  --parallel 3
```

Legacy plan-driven mode (matches the old run_extraction_prompt_creation.sh intent):
```bash
python -m src.agents.scripts_and_prompts_generation.task_extraction_prompt_creation_agent \
  --version 1 \
  --plan configs/task_division_plan.json \
  --tbox data/ontologies/ontosynthesis.ttl \
  --model gpt-5 \
  --parallel 3
```

Iterations-driven mode (uses ontology flags + ai_generated_contents_candidate/iterations/**/iterations.json):
```bash
python -m src.agents.scripts_and_prompts_generation.task_extraction_prompt_creation_agent \
  --ontosynthesis \
  --version 1 \
  --model gpt-5 \
  --parallel 3
```

4) Generate MCP underlying scripts from T-Box (writes into ai_generated_contents_candidate/scripts/…)
All ontologies from ape_generated_contents/meta_task_config.json:
```bash
python -m src.agents.scripts_and_prompts_generation.mcp_underlying_script_creation_agent --all
```

Single ontology (by short name or by TTL path):
```bash
python -m src.agents.scripts_and_prompts_generation.mcp_underlying_script_creation_agent \
  --ontology ontosynthesis \
  --model gpt-5 \
  --split
```