Replies: 1 comment
Hi, please report this to https://github.com/QwenLM/Qwen3-VL. Thank you!
Description
When using the Qwen3-VL-2B-Instruct model, the thinking time (i.e., the delay before generating a response) is noticeably long. In some cases, the prolonged thinking time also causes excessive GPU memory usage or even out-of-memory errors.
This issue was encountered while running the ERQA benchmark. I would like to know whether this is intended behavior of the model, or whether there are recommended ways to speed up inference.
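One thing I have been experimenting with (not an official recommendation) is capping the number of generated tokens, which bounds both the response latency and the KV-cache growth that drives memory usage. Below is a minimal sketch using the generic transformers image-text-to-text API; the class name `AutoModelForImageTextToText`, the chat-template call, the message/image format, and the `max_new_tokens=512` cap are all assumptions on my side and may need adjusting to match the Qwen3-VL model card and your transformers version.

```python
# Minimal sketch, assuming a recent transformers release with the
# image-text-to-text auto classes. All names below that are not from the
# model card (the image path, the token cap) are illustrative placeholders.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Qwen/Qwen3-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights reduce memory pressure
    device_map="auto",
)

# Assumed chat-message format; the exact image field may differ per model card.
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "example.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image briefly."},
    ]}
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=512,  # hard cap bounds latency and KV-cache memory
        do_sample=False,
    )

# Decode only the newly generated tokens.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```

Whether capping the output actually shortens the thinking phase of this model, or just truncates it, is something I would appreciate the maintainers clarifying.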
Environment Information
OS: Linux
GPU: 1× NVIDIA A800
Python: 3.10.9
PyTorch: 2.9.0
CUDA: 12.6