Replies: 1 comment
Hi, please report this to https://github.com/QwenLM/Qwen3-VL. Thank you!
Description
When using the Qwen3-VL-2B-Instruct model, the thinking time (i.e., the delay before generating a response) is noticeably long. In some cases, the prolonged thinking time also causes excessive GPU memory usage or even out-of-memory errors.
This issue was encountered while running the ERQA benchmark. I would like to know whether this is intended behavior of the model, or whether there are recommended ways to speed up inference.
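One thing I have been experimenting with (not an official recommendation) is capping the number of generated tokens, which bounds both the response latency and the KV-cache growth that drives memory usage. Below is a minimal sketch using the generic transformers image-text-to-text API; the class name `AutoModelForImageTextToText`, the chat-template call, the message/image format, and the `max_new_tokens=512` cap are all assumptions on my side and may need adjusting to match the Qwen3-VL model card and your transformers version.

```python
# Minimal sketch, assuming a recent transformers release with the
# image-text-to-text auto classes. All names below that are not from the
# model card (the image path, the token cap) are illustrative placeholders.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Qwen/Qwen3-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights reduce memory pressure
    device_map="auto",
)

# Assumed chat-message format; the exact image field may differ per model card.
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "example.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image briefly."},
    ]}
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=512,  # hard cap bounds latency and KV-cache memory
        do_sample=False,
    )

# Decode only the newly generated tokens.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```

Whether capping the output actually shortens the thinking phase of this model, or just truncates it, is something I would appreciate the maintainers clarifying.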
Environment Information
OS: Linux
GPU: 1× NVIDIA A800
Python: 3.10.9
PyTorch: 2.9.0
CUDA: 12.6