This project explores methods for reliably annotating dataset descriptions and scientific texts related to Molecular Dynamics (MD). Because Large Language Models (LLMs) are inherently non-deterministic, we aim to enforce structured and reproducible outputs using a strict Pydantic schema. Below is a Mermaid diagram that summarizes the schema used to capture detected entities:
```mermaid
classDiagram
    class ListOfEntities {
        entities: list[Molecule | SimulationTime | ForceFieldModel | Temperature | SoftwareName | SoftwareVersion]
    }
    class SoftwareVersion {
        label: str = 'SOFTVERS'
        text: str
    }
    class Temperature {
        label: str = 'TEMP'
        text: str
    }
    class SimulationTime {
        label: str = 'STIME'
        text: str
    }
    class Molecule {
        label: str = 'MOL'
        text: str
    }
    class SoftwareName {
        label: str = 'SOFTNAME'
        text: str
    }
    class ForceFieldModel {
        label: str = 'FFM'
        text: str
    }
    class Entity {
        label: str
        text: str
    }
    ListOfEntities ..> Molecule
    ListOfEntities ..> SoftwareVersion
    ListOfEntities ..> SimulationTime
    ListOfEntities ..> Temperature
    ListOfEntities ..> SoftwareName
    ListOfEntities ..> ForceFieldModel
```
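For concreteness, here is a minimal Pydantic sketch matching the diagram. The field names and labels follow the schema above; the diagram lists an `Entity` base class without explicit inheritance arrows, so the subclassing shown here is an assumption, and the repository's actual models may differ in detail:

```python
from typing import Literal, Union

from pydantic import BaseModel


class Entity(BaseModel):
    label: str
    text: str


class Molecule(Entity):
    label: Literal["MOL"] = "MOL"


class SimulationTime(Entity):
    label: Literal["STIME"] = "STIME"


class ForceFieldModel(Entity):
    label: Literal["FFM"] = "FFM"


class Temperature(Entity):
    label: Literal["TEMP"] = "TEMP"


class SoftwareName(Entity):
    label: Literal["SOFTNAME"] = "SOFTNAME"


class SoftwareVersion(Entity):
    label: Literal["SOFTVERS"] = "SOFTVERS"


class ListOfEntities(BaseModel):
    # The union type constrains every detected entity to one of the six classes.
    entities: list[
        Union[Molecule, SimulationTime, ForceFieldModel,
              Temperature, SoftwareName, SoftwareVersion]
    ]
```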
To assess robustness and accuracy, we benchmark several LLMs (GPT-5, Gemini 3 Pro, etc.) together with structured-extraction frameworks such as Instructor, LlamaIndex, and PydanticAI. Our goal is to identify the best model–framework combinations for accurate, consistent, and schema-compliant Molecular Dynamics Named Entity Recognition (MDNER).
We use uv to manage dependencies and the project environment.
Clone the GitHub repository:

```bash
git clone git@github.com:MDverse/mdner_llm.git
cd mdner_llm
```

Sync dependencies:

```bash
uv sync
```

Create an `.env` file with valid OpenAI and OpenRouter API keys:

```
OPENAI_API_KEY=<your-openai-api-key>
OPENROUTER_API_KEY=<your-openrouter-api-key>
```

Remark: this `.env` file is ignored by git.
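The extraction scripts need these keys at runtime. A minimal sketch of loading them, assuming python-dotenv (the repository may wire this up differently):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # populate the process environment from .env
openai_key = os.environ["OPENAI_API_KEY"]
openrouter_key = os.environ["OPENROUTER_API_KEY"]
```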
To extract structured entities from a single text using a specified LLM and framework, run:

```bash
uv run src/extract_entities.py \
    --path-prompt prompts/json_few_shot.txt \
    --model openai/gpt-4o \
    --path-text annotations/v2/figshare_121241.json \
    --tag-prompt json \
    --framework instructor \
    --output-dir results/llm_annotations \
    --max-retries 3
```

This command extracts entities from `annotations/v2/figshare_121241.json` using the prompt in `prompts/json_few_shot.txt` and the "instructor" validation framework, saving results in `results/llm_annotations` with base filename `figshare_121241_openai_gpt-4o_instructor_YYYYMMDD_HHMMSS`. Two files are generated: a JSON metadata file (`.json`) and a text file with the raw model response (`.txt`). The command retries up to 3 times in case of API errors.
Options:

- `--path-prompt`: Path to a text file containing the extraction prompt.
- `--model`: Name of the language model to use for extraction, as listed on the OpenRouter models page (https://openrouter.ai/models).
- `--path-text`: Path to a JSON file containing the text to annotate. Must include a key `"raw_text"` with the text content (see the sketch after this list).
- `--tag-prompt` (default: `"json"`): Descriptor indicating the format of the expected LLM output. Choices: `"json"` or `"json_with_positions"`.
- `--framework` (default: `None`): Validation framework to apply to model outputs. Choices: `"instructor"`, `"llamaindex"`, `"pydanticai"`.
- `--output-dir` (default: `"results/llm_annotations"`): Directory where the output JSON and text files will be saved.
- `--max-retries` (default: `3`): Maximum number of retries in case of API or validation failure.
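To make the `--framework` option concrete, here is a minimal sketch of schema-enforced extraction with Instructor. It reuses the `ListOfEntities` model sketched above; the input record, model id, and prompt wording are illustrative assumptions, not the exact code of `extract_entities.py`:

```python
import instructor
from openai import OpenAI  # reads OPENAI_API_KEY from the environment

# ListOfEntities: the Pydantic model from the schema sketch above.

# Input record mirroring the expected annotation file: a "raw_text" key.
input_record = {
    "raw_text": "The protein was simulated with GROMACS 2021 at 300 K for 100 ns."
}

client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4o",                 # assumed model id for the plain OpenAI API
    response_model=ListOfEntities,  # Instructor validates against the schema
    max_retries=3,                  # re-ask the model when validation fails
    messages=[
        {"role": "system", "content": "Extract MD entities as structured JSON."},
        {"role": "user", "content": input_record["raw_text"]},
    ],
)
print(result.model_dump_json(indent=2))
```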
To extract structured entities from multiple texts (listed in a text file of selected annotation paths, passed via `--path-texts`) using a specified LLM and framework, run:

```bash
uv run src/extract_entities_all_texts.py \
    --path-prompt prompts/json_few_shot.txt \
    --model openai/gpt-4o \
    --path-texts results/50_selected_files_20260103_002043.txt \
    --tag-prompt json \
    --framework instructor \
    --output-dir results/llm_annotations \
    --max-retries 3
```

This command processes the annotation files listed in `results/50_selected_files_20260103_002043.txt` and saves the corresponding `.json` and `.txt` outputs as `results/llm_annotations/{file_name}_openai_gpt-4o_instructor_YYYYMMDD_HHMMSS`.
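The paths file is plain text, presumably one annotation path per line, for example (the second path is a hypothetical placeholder):

```
annotations/v2/figshare_121241.json
annotations/v2/<another_annotation>.json
```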
To evaluate the quality of JSON entity annotations produced by different LLMs and frameworks, run:

```bash
uv run src/evaluate_json_annotations.py \
    --annotations-dir results/llm_annotations \
    --results-dir results/json_evaluation_stats
```

This command loads all LLM-generated JSON files in `results/llm_annotations`, computes per-annotation metrics against the ground truth, and saves the results in `results/json_evaluation_stats/per_text_metrics_YYYY-MM-DDTHH-MM-SS.parquet`. It then creates an Excel summary for each model and framework in `results/json_evaluation_stats/evaluation_summary_YYYY-MM-DDTHH-MM-SS.xlsx`.
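As an illustration of what a per-annotation metric can look like, here is a hedged sketch that scores predicted entities against the ground truth as (label, text) pairs; the repository's actual metric definitions may differ:

```python
from collections import Counter


def entity_prf(predicted: list[dict], ground_truth: list[dict]) -> dict:
    """Precision/recall/F1 over (label, text) pairs, counted as multisets."""
    pred = Counter((e["label"], e["text"]) for e in predicted)
    gold = Counter((e["label"], e["text"]) for e in ground_truth)
    tp = sum((pred & gold).values())  # multiset intersection = true positives
    precision = tp / max(sum(pred.values()), 1)
    recall = tp / max(sum(gold.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"precision": precision, "recall": recall, "f1": f1}
```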
To reformat old JSON annotations, run:

```bash
uv run src/format_json_annotations.py
```

This command processes all JSON files in `annotations/v1`, reformats the entities with their text and exact positions, and saves the formatted files to `annotations/v2`.
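Computing exact positions amounts to locating each entity string in the raw text. A minimal sketch (the actual script may handle repeated or overlapping matches differently):

```python
def locate(raw_text: str, entity_text: str) -> tuple[int, int] | None:
    """Return the (start, end) character span of the first occurrence, if any."""
    start = raw_text.find(entity_text)
    if start == -1:
        return None
    return start, start + len(entity_text)
```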
To visualize the corrections of JSON annotations, open the notebook `notebooks/correct_and_vizualize_annotations.ipynb`.
To compute statistics on the distribution of annotations per file and class, run:

```bash
uv run src/count_entities.py --annotations-dir annotations/v2
```

This command processes all JSON files in the directory, counts the number of entities per class for each annotation, and outputs a TSV file with the filename, text length, and entity counts per class. It also produces plots of the class distribution for all entities and the entity distribution by class.
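A hedged sketch of the per-file counting step, assuming each annotation stores its text under `"raw_text"` and its entities as a list of labeled records:

```python
import json
from collections import Counter
from pathlib import Path


def count_classes(path: Path) -> tuple[int, Counter]:
    """Return (text length, per-class entity counts) for one annotation file."""
    record = json.loads(path.read_text())
    counts = Counter(e["label"] for e in record.get("entities", []))
    return len(record.get("raw_text", "")), counts
```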
To generate a QC inventory of named entities from annotation files, run:

```bash
uv run src/qc_entity_inventory.py \
    --annot-folder annotations/v2 \
    --out-folder results/qc_annotations
```

This command scans all JSON annotations, aggregates and normalizes entities per class, counts their occurrences, and saves one vocabulary file per class in the output folder.
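A minimal sketch of the kind of normalization such an inventory relies on; the actual rules are documented in `docs/annotation_rules.md`:

```python
from collections import Counter


def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so surface variants aggregate."""
    return " ".join(text.lower().split())


vocabulary = Counter(normalize(t) for t in ["GROMACS", "gromacs ", "Gromacs"])
print(vocabulary)  # Counter({'gromacs': 3})
```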
💡 Running a QC inventory on annotation files ensures that all entities are consistently aggregated and normalized. This is a crucial step for defining annotation rules in molecular dynamics, helping standardize formats, units, and naming conventions. The generated files can be explored in `notebooks/qc_entity_inventory_explorer.ipynb`, and the rules are documented in `docs/annotation_rules.md`.
To select informative annotation JSON files and export their paths to a text file, run:

```bash
uv run src/select_annotation_files.py \
    --annotations-dir annotations/v2 \
    --nb-files 50 \
    --res-path results/50_selected_files_20260103
```

This command selects up to 50 annotation JSON files from `annotations/v2` according to entity coverage and recency, and writes their paths to `results/50_selected_files_20260103.txt`.
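The exact selection criteria live in the script; purely as an illustration, a heuristic ranking by class coverage, with file modification time as an assumed proxy for recency, could look like:

```python
import json
import os
from pathlib import Path


def score(path: Path) -> tuple[int, float]:
    """Rank by number of distinct entity classes, then by modification time."""
    entities = json.loads(path.read_text()).get("entities", [])
    return len({e["label"] for e in entities}), os.path.getmtime(path)


selected = sorted(Path("annotations/v2").glob("*.json"), key=score, reverse=True)[:50]
```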