Fine‑tune Qwen 2.5‑VL (or any Vision‑Language model with the same API) on image grounding tasks using GRPO (Group Relative Policy Optimization) in just a few lines of code.
- Plug‑and‑play trainer – drop in your own JSON dataset of prompts + bounding‑boxes (see the example record below) and start training.
- Image‑aware data collator – automatically loads, preprocesses and batches images.
- Reward‑based optimisation – leverages the `trl` library's GRPO algorithm for RL‑style fine‑tuning.
- Minimal codebase – only three Python files, easy to read and customise.
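For reference, a single record of such a JSON dataset could look something like the sketch below. The field names are illustrative assumptions, not a required schema – adapt them to whatever your collator expects.

```python
# Hypothetical training record – field names are illustrative, not a fixed schema.
example = {
    "image": "images/0001.png",              # path relative to the images root
    "prompt": "Click the 'Submit' button.",  # grounding instruction
    "solution": [412, 230, 508, 268],        # ground-truth bbox: x1, y1, x2, y2
}
```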
- Accepts an `image_processor` and an `images_root` folder.
- Overrides `data_collator` to (sketched below):
  - Load images with Pillow.
  - Batch‑encode them via the Hugging Face `AutoProcessor`.
  - Return a dict containing:
    - `pixel_values` – tensor (C × H × W)
    - `prompt` – instruction string
    - `solution` – ground‑truth bbox or coordinates
    - `scales` – original image size
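A minimal sketch of what such a collator could look like, assuming images are referenced by a relative path in each record; the factory name and field names are illustrative, not the repository's actual API:

```python
import os
from PIL import Image

def make_grounding_collator(image_processor, images_root):
    """Illustrative collator factory – names and fields are assumptions."""

    def data_collator(features):
        # Load each example's image from disk with Pillow.
        images = [
            Image.open(os.path.join(images_root, f["image"])).convert("RGB")
            for f in features
        ]

        # Batch-encode the images via the Hugging Face processor.
        encoded = image_processor(images=images, return_tensors="pt")

        return {
            "pixel_values": encoded["pixel_values"],        # image tensors
            "prompt": [f["prompt"] for f in features],      # instruction strings
            "solution": [f["solution"] for f in features],  # ground-truth bbox / coords
            "scales": [img.size for img in images],         # original (width, height)
        }

    return data_collator
```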
Tiny subclass that forwards all arguments to the real Qwen 2.5‑VL model while gracefully ignoring the extra `logits_to_keep` parameter expected by GRPO.
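In code this can be as small as the following sketch; the subclass name is made up, and it assumes a `transformers` release that exposes `Qwen2_5_VLForConditionalGeneration`:

```python
from transformers import Qwen2_5_VLForConditionalGeneration

class Qwen25VLForGRPO(Qwen2_5_VLForConditionalGeneration):
    """Illustrative wrapper: drop GRPO's extra argument before delegating."""

    def forward(self, *args, logits_to_keep=None, **kwargs):
        # The base forward() does not expect logits_to_keep, so it is
        # silently discarded and everything else is forwarded unchanged.
        return super().forward(*args, **kwargs)
```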
Currently only `accuracy_reward_coord`, which returns 1 if the (x, y) coordinate predicted by the model falls inside the ground‑truth bounding‑box and 0 otherwise.
Feel free to add IoU‑ or distance‑based rewards here.
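For example, an IoU‑based reward could look like the sketch below, assuming both boxes are given as `(x1, y1, x2, y2)`; parsing the model's text output into a box is left to the caller, as in `accuracy_reward_coord`:

```python
def iou_reward(pred_box, gt_box):
    """Illustrative IoU reward in [0, 1] for two (x1, y1, x2, y2) boxes."""
    # Intersection rectangle.
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union = sum of areas minus intersection.
    area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_pred + area_gt - inter

    return inter / union if union > 0 else 0.0
```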
Provides a concrete example wiring everything together.
Customise the constants at the top, or replace them with argparse flags for production use.
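For instance, the constants could be swapped for argparse flags along these lines; the flag names and defaults below are placeholders, not the script's real ones:

```python
import argparse

def parse_args():
    # Placeholder flags – adapt names/defaults to the constants in your script.
    parser = argparse.ArgumentParser(description="GRPO grounding fine-tuning")
    parser.add_argument("--model-name", default="Qwen/Qwen2.5-VL-7B-Instruct")
    parser.add_argument("--dataset-json", default="data/train.json")
    parser.add_argument("--images-root", default="data/images")
    parser.add_argument("--output-dir", default="outputs/grpo-qwen2.5-vl")
    return parser.parse_args()
```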
| Hyper‑parameter | Where to set | Notes |
|---|---|---|
| `per_device_train_batch_size` | `GRPOConfig` | Limited by GPU memory – images are heavy! |
| `num_generations` | `GRPOConfig` | How many action samples to draw per prompt. |
| `reward_funcs` | trainer init | List of callables returning a reward ∈ {0, 1}. |
| `bf16` / `fp16` | `GRPOConfig` | Use bf16 on A100/H100 for speed and memory efficiency. |
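A minimal sketch of how these settings fit together with `trl` (values are placeholders, not tuned recommendations):

```python
from trl import GRPOConfig

# Placeholder values – tune for your hardware. reward_funcs are passed to the
# GRPOTrainer constructor (not the config), e.g. reward_funcs=[accuracy_reward_coord].
config = GRPOConfig(
    output_dir="outputs/grpo-qwen2.5-vl",
    per_device_train_batch_size=2,  # images are heavy – keep this small
    num_generations=2,              # action samples drawn per prompt
    bf16=True,                      # prefer bf16 on A100/H100
)
```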