ANTONI-Alpha

Vision-Language Model for Computational Pathology

Authors

Computational Pathology Group, RadboudUMC

Model Information

ANTONI-Alpha is a vision-language model for computational pathology. It combines Prism vision embeddings (1280-dimensional) with the MedGemma-2B language model through a learned cross-attention projector, enabling natural-language interaction with whole-slide images.

Architecture:

  • Vision encoder: Prism (produces tile-level embeddings)
  • Language model: MedGemma-2B (4-bit quantized, fine-tuned with LoRA)
  • Projector: Cross-attention with 256 learnable query tokens
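
Conceptually, the projector works like a Perceiver-style resampler: a fixed bank of 256 learned queries cross-attends over however many tile embeddings a slide produces, yielding a fixed-length token sequence for the LLM. The code below is a minimal sketch of that idea; the class name, the LLM hidden size (llm_dim), and the head count are illustrative assumptions, not the repository's actual module.

import torch
import torch.nn as nn

class CrossAttentionProjector(nn.Module):
    """Sketch only: 256 learned queries attend over variable-length tile embeddings."""

    def __init__(self, vision_dim=1280, llm_dim=2048, num_queries=256, num_heads=8):
        super().__init__()
        # Learnable query tokens, shared across all slides.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # Project 1280-dim Prism tile embeddings into the LLM's hidden space.
        self.kv_proj = nn.Linear(vision_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, tile_embeddings):
        # tile_embeddings: [batch, num_tiles, 1280]
        kv = self.kv_proj(tile_embeddings)
        q = self.queries.unsqueeze(0).expand(tile_embeddings.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)  # [batch, 256, llm_dim]
        return out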

Training:

  • Stage 1: Projector alignment (frozen LLM)
  • Stage 2: Instruction tuning (LoRA fine-tuning)
  • Dataset: HISTAI-Instruct (multilingual, multimodal)
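
In outline, stage 1 trains only the projector against a frozen language model, and stage 2 attaches LoRA adapters for instruction tuning. Below is a minimal sketch of that freezing scheme using the peft library; the rank, alpha, and target module names are assumptions for illustration, not the project's actual hyperparameters.

from peft import LoraConfig, get_peft_model

def freeze_for_stage1(llm, projector):
    """Stage 1: align the projector while the language model stays frozen."""
    for p in llm.parameters():
        p.requires_grad = False
    for p in projector.parameters():
        p.requires_grad = True

def prepare_for_stage2(llm):
    """Stage 2: wrap the LLM with LoRA adapters for instruction tuning."""
    cfg = LoraConfig(
        r=16,                                  # assumed rank
        lora_alpha=32,                         # assumed scaling
        target_modules=["q_proj", "v_proj"],   # assumed attention projections
        task_type="CAUSAL_LM",
    )
    return get_peft_model(llm, cfg)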

Installation

git clone https://github.com/computationalpathologygroup/ANTONI-Alpha.git
cd ANTONI-Alpha
pip install -e .

Optional: Flash Attention 2

For improved performance on compatible hardware, install Flash Attention 2:

pip install flash-attn==2.8.3 --no-build-isolation

The --no-build-isolation flag lets the build compile against your existing PyTorch installation rather than a temporary, isolated one. Flash Attention 2 requires CUDA-capable hardware and is used automatically once installed.
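
A quick way to verify the installation is to check that the package is importable:

import importlib.util

# Check whether flash_attn is importable; the model falls back to
# standard attention kernels when it is not.
if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn not installed; using default attention")
else:
    print("Flash Attention 2 available")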

How to Use

import torch
from pathlib import Path
from antoni_alpha.models.antoni_pretrained import AntoniAlphaPreTrained

# Load model
model = AntoniAlphaPreTrained.from_pretrained(
    "SaltySander/ANTONI-Alpha",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Load slide features (Prism embeddings: [num_tiles, 1280])
slide_features = torch.load("slide_features.pt")
slide_latents = slide_features.unsqueeze(0)  # Add batch dimension
slide_latents = slide_latents.to(next(model.projection_layer.parameters()).device)

# Run inference
conversation = [{"role": "user", "content": "What tissue is this?"}]

with torch.no_grad():
    output_ids = model.generate(
        slide_latents=slide_latents,
        conversations=[conversation],
        max_new_tokens=200,
        do_sample=False,
    )

response = model.processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(response)

See examples/inference_example.py for a complete multi-turn conversation example.

Input/Output Structure

Input:

  • slide_latents: Tensor of shape [batch_size, num_tiles, 1280] (Prism embeddings)
  • conversations: List of conversations, one per slide, each a list of role/content messages in OpenAI chat format

Output:

  • Generated text response from the language model
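
As an illustration, a hypothetical batch of two slides pairs a [2, num_tiles, 1280] latent tensor with two conversations; the message text below is made up:

# Hypothetical batch: two slides, two conversations (contents are illustrative).
conversations = [
    [{"role": "user", "content": "Describe the tissue."}],
    [
        {"role": "user", "content": "What tissue is this?"},
        {"role": "assistant", "content": "This appears to be lymph node tissue."},
        {"role": "user", "content": "Do you see any abnormalities?"},
    ],
]
# The matching slide_latents tensor must then have shape [2, num_tiles, 1280].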

Training

# Launch training with a YAML config
python train.py --config config/finetune.yaml

Training configurations are available in the config/ directory.
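
If you adapt the pipeline, the entry point presumably parses the YAML along these lines. This is a generic sketch, not train.py's actual argument handling, and the config keys shown are hypothetical; inspect config/finetune.yaml for the real schema.

import argparse
import yaml

# Generic sketch of a config-driven entry point; not the repository's train.py.
parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True, help="Path to a YAML training config")
args = parser.parse_args()

with open(args.config) as f:
    cfg = yaml.safe_load(f)

# Hypothetical keys, for illustration only.
print(cfg.get("learning_rate"), cfg.get("num_epochs"))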

License

This model is released under the Health AI Developer Foundations License.

Citation

@inproceedings{moonemans2025open,
  title={Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modeling},
  author={Sander Moonemans and Sebastiaan Ram and Fr{\'e}d{\'e}rique Meeuwsen and Carlijn Lems and Jeroen van der Laak and Geert Litjens and Francesco Ciompi},
  booktitle={Submitted to Medical Imaging with Deep Learning},
  year={2025},
  url={https://openreview.net/forum?id=aGPowreqPi},
  note={under review}
}
