Parameters and error message:
python -m vllm.entrypoints.openai.api_server --model /root/models/Qwen3-8B --trust-remote-code --tensor-parallel-size 8 --max-model-len 32768 --gpu-memory-utilization 0.85 --enforce-eager --host 0.0.0.0 --port 8000
INFO 11-12 08:15:25 [init.py:216] Automatically detected platform cuda.
(APIServer pid=26730) INFO 11-12 08:15:28 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=26730) INFO 11-12 08:15:28 [utils.py:233] non-default args: {'host': '0.0.0.0', 'model': '/root/models/Qwen3-8B', 'trust_remote_code': True, 'max_model_len': 32768, 'enforce_eager': True, 'tensor_parallel_size': 8, 'gpu_memory_utilization': 0.85}
(APIServer pid=26730) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=26730) INFO 11-12 08:15:38 [model.py:547] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=26730) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=26730) INFO 11-12 08:15:38 [model.py:1510] Using max model len 32768
(APIServer pid=26730) INFO 11-12 08:15:38 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=26730) INFO 11-12 08:15:38 [init.py:381] Cudagraph is disabled under eager mode
INFO 11-12 08:15:43 [init.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=26990) INFO 11-12 08:15:45 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=26990) INFO 11-12 08:15:45 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/root/models/Qwen3-8B', speculative_config=None, tokenizer='/root/models/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/models/Qwen3-8B, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_DP0 pid=26990) WARNING 11-12 08:15:45 [multiproc_executor.py:720] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=26990) INFO 11-12 08:15:45 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3, 4, 5, 6, 7], buffer_handle=(8, 16777216, 10, 'psm_79793927'), local_subscribe_addr='ipc:///tmp/988d4975-8f1c-44c1-ad2f-a224c7d86fbc', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:49 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:49 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:49 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:49 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:50 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:50 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:50 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:50 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_a73af085'), local_subscribe_addr='ipc:///tmp/27c032e2-d31c-47d1-8487-46742964319f', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_4bdf7865'), local_subscribe_addr='ipc:///tmp/fe7e0f55-5b0f-4a65-a0b6-64e57f8a1d0e', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_25153c38'), local_subscribe_addr='ipc:///tmp/f090db20-3231-4b67-ad86-0d1514c3429d', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_23dbf83f'), local_subscribe_addr='ipc:///tmp/1b6e03a0-9a71-4b8f-b046-73691a9ff5dd', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e278312f'), local_subscribe_addr='ipc:///tmp/dea12f8b-6ba0-4d8f-a28c-5c168623beae', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_74356a2e'), local_subscribe_addr='ipc:///tmp/342edb78-f715-427c-bd2d-155ac399ae4c', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_77a7ed28'), local_subscribe_addr='ipc:///tmp/daecdda0-ba89-48ca-a9b0-0c9a53843ae8', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_9ab76b9e'), local_subscribe_addr='ipc:///tmp/42ed43f4-5257-40e4-be55-be787e8c6c6b', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 11-12 08:15:57 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_93d44ff8'), local_subscribe_addr='ipc:///tmp/26436e6b-653a-4337-b1f4-ae622d0c4698', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 1 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 2 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 5 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 5, EP rank 5
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 6 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 6, EP rank 6
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 7 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 7, EP rank 7
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 3 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 4 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 4, EP rank 4
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(Worker_TP2 pid=27126) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(Worker_TP6 pid=27130) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(Worker_TP0 pid=27124) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
(Worker_TP4 pid=27128) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(Worker_TP1 pid=27125) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
(Worker_TP5 pid=27129) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
(Worker_TP7 pid=27131) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
(Worker_TP3 pid=27127) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
(Worker_TP2 pid=27126) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP6 pid=27130) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP4 pid=27128) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP0 pid=27124) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP1 pid=27125) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP5 pid=27129) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP2 pid=27126) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP3 pid=27127) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP6 pid=27130) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP4 pid=27128) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP0 pid=27124) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP1 pid=27125) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP7 pid=27131) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP5 pid=27129) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP3 pid=27127) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
(Worker_TP7 pid=27131) INFO 11-12 08:15:59 [cuda.py:366] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:45<03:02, 45.51s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [01:31<02:16, 45.61s/it]
(Worker_TP2 pid=27126) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP6 pid=27130) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP0 pid=27124) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP5 pid=27129) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP1 pid=27125) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP4 pid=27128) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP7 pid=27131) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
Loading safetensors checkpoint shards: 40% Completed | 2/5 [02:02<03:03, 61.03s/it]
(Worker_TP0 pid=27124)
[rank0]:[W1112 08:18:01.634948802 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 498, in init
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] super().init(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 83, in init
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in init
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] self._init_executor()
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] raise e from None
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=26990) Process EngineCore_DP0:
(EngineCore_DP0 pid=26990) Traceback (most recent call last):
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=26990) self.run()
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=26990) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=26990) raise e
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=26990) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 498, in init
(EngineCore_DP0 pid=26990) super().init(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 83, in init
(EngineCore_DP0 pid=26990) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in init
(EngineCore_DP0 pid=26990) self._init_executor()
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=26990) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=26990) raise e from None
(EngineCore_DP0 pid=26990) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=26730) Traceback (most recent call last):
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(APIServer pid=26730) return _run_code(code, main_globals, None,
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/runpy.py", line 86, in _run_code
(APIServer pid=26730) exec(code, run_globals)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1953, in
(APIServer pid=26730) uvloop.run(run_server(args))
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/uvloop/init.py", line 69, in run
(APIServer pid=26730) return loop.run_until_complete(wrapper())
(APIServer pid=26730) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/uvloop/init.py", line 48, in wrapper
(APIServer pid=26730) return await main
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
(APIServer pid=26730) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
(APIServer pid=26730) async with build_async_engine_client(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/contextlib.py", line 199, in aenter
(APIServer pid=26730) return await anext(self.gen)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
(APIServer pid=26730) async with build_async_engine_client_from_engine_args(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/contextlib.py", line 199, in aenter
(APIServer pid=26730) return await anext(self.gen)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 225, in build_async_engine_client_from_engine_args
(APIServer pid=26730) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/utils/init.py", line 1572, in inner
(APIServer pid=26730) return fn(*args, **kwargs)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=26730) return cls(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 134, in init
(APIServer pid=26730) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=26730) return AsyncMPClient(*client_args)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 769, in init
(APIServer pid=26730) super().init(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 448, in init
(APIServer pid=26730) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/contextlib.py", line 142, in exit
(APIServer pid=26730) next(self.gen)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=26730) wait_for_engine_startup(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=26730) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=26730) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/opt/miniconda3/envs/vllm-fresh/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/opt/miniconda3/envs/vllm-fresh/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 8 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
This was deploying fine just half a week ago, then over the past couple of days it suddenly stopped working for no obvious reason.
I have already tried lowering the GPU memory utilization, running on a single GPU (roughly the command sketched below), re-downloading the model from ModelScope, and setting up a fresh environment with a new vLLM install, but none of it helped... Any help would be appreciated.
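For reference, the single-GPU test I ran was roughly the following (reconstructed from memory, so the exact --gpu-memory-utilization value is a guess; the other flags match the 8-GPU command above minus --tensor-parallel-size). It failed the same way:

python -m vllm.entrypoints.openai.api_server \
  --model /root/models/Qwen3-8B \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.7 \
  --enforce-eager \
  --host 0.0.0.0 --port 8000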