
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

arXiv: https://arxiv.org/abs/2512.12623 | Project Page

This repository implements DMLR (Dynamic Multimodal Interleaving Latent Reasoning), a method for improving vision-language model reasoning through latent space optimization. The system uses reinforcement learning to optimize "thought tokens" that enhance the model's reasoning capabilities on multimodal tasks.
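At a high level, the optimization perturbs a small set of latent "thought token" embeddings, scores each candidate with a confidence-based reward, and keeps the best-scoring latents while annealing the exploration noise. The snippet below is a minimal, self-contained sketch of that idea only; the toy reward_fn, the random target, and the tiny hidden size are placeholders and do not reproduce the actual logic in DMLR/inference.py or reward.py.

# Minimal sketch of confidence-guided latent optimization (illustrative only).
# reward_fn and the toy dimensions are hypothetical stand-ins, not DMLR's code.
import torch

torch.manual_seed(0)

hidden_size = 64          # toy hidden size (real VLMs use much larger)
num_thought_tokens = 2    # --num_thought_tokens
max_num_steps = 15        # --max_num_steps
sigma, sigma_decay = 25.0, 0.95   # --sigma, --sigma_decay
lr = 0.01                 # --lr

def reward_fn(thought: torch.Tensor) -> float:
    # Stand-in for "model confidence given these thought tokens".
    target = torch.ones_like(thought)
    return -torch.mean((thought - target) ** 2).item()

thought = torch.zeros(num_thought_tokens, hidden_size)
best_thought, best_reward = thought.clone(), reward_fn(thought)

for step in range(max_num_steps):
    # Explore: perturb the latent thought tokens with Gaussian noise.
    noise = torch.randn_like(thought) * sigma
    candidate = thought + lr * noise
    reward = reward_fn(candidate)
    if reward > best_reward:
        # Keep the best-scoring latents and move the search toward them.
        best_reward, best_thought = reward, candidate.clone()
        thought = candidate
    sigma *= sigma_decay  # anneal exploration noise
    print(f"step {step:02d}  reward {reward:.4f}  best {best_reward:.4f}")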

[Figure: DMLR framework overview]

Comparison between DMLR and two reasoning paradigms. (A) Text-only reasoning: relies solely on explicit CoT, often causing visual grounding errors and redundant steps. (B) Think-with-Image reasoning: depends on external perception tools, leading to unstable tool calls and extra overhead. (C) DMLR (ours): refines latent think tokens in the latent space through confidence-guided optimization and dynamically injects visual information, achieving self-improving reasoning without additional training while maintaining high efficiency.

Setup

  1. Clone the repository:

git clone https://github.com/eric-ai-lab/DMLR.git
cd DMLR

  2. Install dependencies (using uv or pip):

# Using uv (recommended)
uv pip install -r requirements.txt

# Or using pip
pip install -r requirements.txt

  3. Set up the Hugging Face token (if needed for private models):

export HUGGING_FACE_TOKEN=your_token_here

  4. Configure the LLM verifier (for answer verification): create a .env file in the project root with the following content:

OPENAI_API_KEY=your_api_key_here
OPENAI_API_BASE_URL=https://api.openai.com/v1
MODEL_TYPE=gpt-4.1-2025-04-14
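
To confirm the verifier settings are picked up, a quick check is to load the .env file and print which variables are set. This assumes the project reads these values from the environment (for example via python-dotenv); adjust if your setup differs.

# Quick sanity check that the verifier credentials in .env are visible.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("OPENAI_API_KEY", "OPENAI_API_BASE_URL", "MODEL_TYPE"):
    value = os.getenv(key)
    print(f"{key}: {'set' if value else 'MISSING'}")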

Quick Start

Basic Usage

Run inference using the provided script:

bash script/run.sh

Custom Configuration

You can modify script/run.sh or run directly with Python:

uv run python main.py \
    --dataset data/scienceqa.json \
    --model_name_or_path Qwen/Qwen2.5-VL-7B-Instruct \
    --output_dir ./output \
    --device cuda \
    --start_data_idx 0 \
    --end_data_idx 100 \
    --max_new_tokens 2048 \
    --max_num_steps 15 \
    --num_thought_tokens 2 \
    --sigma 25.0 \
    --lr 0.01

Configuration

Key Parameters

Dataset & Model

  • --dataset: Path to dataset JSON file or dataset name
  • --model_name_or_path: Hugging Face model identifier (default: Qwen/Qwen2.5-VL-7B-Instruct)
  • --output_dir: Directory to save results
  • --start_data_idx: Starting index for evaluation (default: 0)
  • --end_data_idx: Ending index for evaluation (default: 100)

Optimization

  • --num_thought_tokens: Number of thought tokens to optimize (default: 8)
  • --max_num_steps: Maximum RL optimization steps (default: 20)
  • --lr: Learning rate for optimization (default: 0.005)
  • --sigma: Noise scale for exploration (default: 20.0)
  • --sigma_decay: Decay factor for sigma (default: 0.95)
  • --reward_threshold: Reward threshold to stop early (default: -1)
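
For intuition about how these parameters interact, the snippet below prints a multiplicative noise schedule over a run. The multiplicative form is an assumption based on the description of --sigma_decay; the exact schedule is defined by the implementation.

# Illustrative noise schedule only; the actual schedule may differ.
sigma, sigma_decay, max_num_steps = 20.0, 0.95, 20
for step in range(max_num_steps):
    print(f"step {step:02d}: sigma = {sigma:.2f}")
    sigma *= sigma_decay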

Visual Features

  • --num_selected_patches: Max image patches per thought token
  • --initial_patch_count: Initial number of patches to insert
  • --patch_increment: Additional patches when best reward improves
  • --visual_insert_stride: Insert visual tokens every N thought tokens
  • --visual_injection_start_step: Start visual injection from this RL step
  • --visual_injection_interval: Perform injection every N RL steps
  • --visual_only: Use visual features to initialize thought tokens
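
The toy loop below shows one reading of how the injection-schedule flags combine (start step plus a fixed interval). It is only an illustration of the flag descriptions above; the exact behavior is defined by the implementation.

# Toy illustration of the visual-injection schedule implied by the flags above.
visual_injection_start_step = 3   # --visual_injection_start_step
visual_injection_interval = 2     # --visual_injection_interval
max_num_steps = 15                # --max_num_steps

for step in range(max_num_steps):
    inject = (step >= visual_injection_start_step
              and (step - visual_injection_start_step) % visual_injection_interval == 0)
    print(f"RL step {step:02d}: {'inject visual patches' if inject else 'latent update only'}")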

Inference

  • --max_new_tokens: Maximum tokens to generate (default: 2048)
  • --num_workers: Number of worker processes (default: 1)
  • --worker_device_round_robin: Distribute workers across GPUs
  • --min_pixels: Minimum image pixels (default: 128)
  • --max_pixels: Maximum image pixels (default: 256)

Verification

  • --use_llm_verify: Use LLM to verify solution equivalence
  • --verify_only: Re-verify existing results without re-running inference

Datasets

The project supports multiple vision-language datasets. Place your dataset JSON files in the data/ directory. Expected format:

[
  {
    "prompt": "Question text",
    "solution": "Answer",
    "image_path": "path/to/image.jpg",
    "idx": 0
  }
]
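
To build a compatible dataset file programmatically, something like the following works. Field names come from the format above; the output path and example entry are placeholders.

# Write a dataset file in the expected format shown above.
import json

entries = [
    {
        "prompt": "What is shown in the image?",
        "solution": "A cat",
        "image_path": "data/images/cat.jpg",  # example path
        "idx": 0,
    },
]

with open("data/my_dataset.json", "w") as f:
    json.dump(entries, f, indent=2)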

Output

Results are saved in the specified output_dir:

  • results.json: Complete evaluation results with:
    • Model predictions
    • Ground truth answers
    • Correctness judgments
    • Best reward values and steps
    • Configuration parameters

Results JSON Structure

{
  "model": "Qwen/Qwen2.5-VL-7B-Instruct",
  "dataset": "data/scienceqa.json",
  "accuracy": 0.75,
  "correct": 75,
  "total": 100,
  "config": {...},
  "args": {...},
  "entries": [
    {
      "data_idx": 0,
      "question": "...",
      "model_output": "...",
      "answer": "...",
      "ground_truth": "...",
      "is_correct": true,
      "best_reward": 0.85,
      "best_reward_step": 5,
      "stop_reason": "eos_token"
    }
  ]
}
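
To inspect a finished run, for example to recompute accuracy or list the failures, you can load results.json directly. Field names follow the structure above; the output path is whatever you passed as --output_dir.

# Summarise a results.json produced by a run.
import json

with open("output/results.json") as f:  # adjust to your --output_dir
    results = json.load(f)

entries = results["entries"]
correct = sum(e["is_correct"] for e in entries)
print(f"accuracy: {correct / len(entries):.2%} ({correct}/{len(entries)})")

for e in entries:
    if not e["is_correct"]:
        print(f"idx {e['data_idx']}: predicted {e['answer']!r}, "
              f"expected {e['ground_truth']!r}")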

Project Structure

DMLR/
├── main.py                 # Main entry point
├── DMLR/
│   ├── data.py            # Dataset loading
│   ├── inference.py       # Core inference logic
│   ├── reward.py          # Reward model implementation
│   ├── prompts.py         # Prompt templates
│   ├── verifier.py        # Answer verification
│   ├── utils.py           # Utility functions
│   └── logger.py          # Logging setup
├── data/                  # Dataset files
├── script/
│   └── run.sh            # Example run script
└── Readme.md             # This file

Citation

If you use this code in your research, please cite:

@misc{liu2025reasoningminddynamicmultimodal,
      title={Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space}, 
      author={Chengzhi Liu and Yuzhe Yang and Yue Fan and Qingyue Wei and Sheng Liu and Xin Eric Wang},
      year={2025},
      eprint={2512.12623},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.12623}, 
}
