Parameters and error message:
python -m vllm.entrypoints.openai.api_server --model /root/models/Qwen3-8B --trust-remote-code --tensor-parallel-size 8 --max-model-len 32768 --gpu-memory-utilization 0.85 --enforce-eager --host 0.0.0.0 --port 8000
INFO 11-12 08:15:25 [init.py:216] Automatically detected platform cuda.
(APIServer pid=26730) INFO 11-12 08:15:28 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=26730) INFO 11-12 08:15:28 [utils.py:233] non-default args: {'host': '0.0.0.0', 'model': '/root/models/Qwen3-8B', 'trust_remote_code': True, 'max_model_len': 32768, 'enforce_eager': True, 'tensor_parallel_size': 8, 'gpu_memory_utilization': 0.85}
(APIServer pid=26730) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=26730) INFO 11-12 08:15:38 [model.py:547] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=26730) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=26730) INFO 11-12 08:15:38 [model.py:1510] Using max model len 32768
(APIServer pid=26730) INFO 11-12 08:15:38 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=26730) INFO 11-12 08:15:38 [init.py:381] Cudagraph is disabled under eager mode
INFO 11-12 08:15:43 [init.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=26990) INFO 11-12 08:15:45 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=26990) INFO 11-12 08:15:45 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/root/models/Qwen3-8B', speculative_config=None, tokenizer='/root/models/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/models/Qwen3-8B, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_DP0 pid=26990) WARNING 11-12 08:15:45 [multiproc_executor.py:720] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=26990) INFO 11-12 08:15:45 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3, 4, 5, 6, 7], buffer_handle=(8, 16777216, 10, 'psm_79793927'), local_subscribe_addr='ipc:///tmp/988d4975-8f1c-44c1-ad2f-a224c7d86fbc', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:49 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:49 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:49 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:49 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:50 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:50 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:50 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:50 [init.py:216] Automatically detected platform cuda.
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_a73af085'), local_subscribe_addr='ipc:///tmp/27c032e2-d31c-47d1-8487-46742964319f', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_4bdf7865'), local_subscribe_addr='ipc:///tmp/fe7e0f55-5b0f-4a65-a0b6-64e57f8a1d0e', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_25153c38'), local_subscribe_addr='ipc:///tmp/f090db20-3231-4b67-ad86-0d1514c3429d', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_23dbf83f'), local_subscribe_addr='ipc:///tmp/1b6e03a0-9a71-4b8f-b046-73691a9ff5dd', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e278312f'), local_subscribe_addr='ipc:///tmp/dea12f8b-6ba0-4d8f-a28c-5c168623beae', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_74356a2e'), local_subscribe_addr='ipc:///tmp/342edb78-f715-427c-bd2d-155ac399ae4c', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_77a7ed28'), local_subscribe_addr='ipc:///tmp/daecdda0-ba89-48ca-a9b0-0c9a53843ae8', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-12 08:15:55 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_9ab76b9e'), local_subscribe_addr='ipc:///tmp/42ed43f4-5257-40e4-be55-be787e8c6c6b', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-12 08:15:57 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 11-12 08:15:57 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_93d44ff8'), local_subscribe_addr='ipc:///tmp/26436e6b-653a-4337-b1f4-ae622d0c4698', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:57 [init.py:1384] Found nccl from library libnccl.so.2
INFO 11-12 08:15:57 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 1 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 2 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 5 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 5, EP rank 5
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 6 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 6, EP rank 6
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 7 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 7, EP rank 7
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 3 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 4 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 4, EP rank 4
INFO 11-12 08:15:58 [parallel_state.py:1208] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(Worker_TP2 pid=27126) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(Worker_TP6 pid=27130) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(Worker_TP0 pid=27124) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
(Worker_TP4 pid=27128) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
WARNING 11-12 08:15:58 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(Worker_TP1 pid=27125) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
(Worker_TP5 pid=27129) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
(Worker_TP7 pid=27131) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
(Worker_TP3 pid=27127) INFO 11-12 08:15:58 [gpu_model_runner.py:2602] Starting to load model /root/models/Qwen3-8B...
(Worker_TP2 pid=27126) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP6 pid=27130) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP4 pid=27128) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP0 pid=27124) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP1 pid=27125) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP5 pid=27129) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP2 pid=27126) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP3 pid=27127) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP6 pid=27130) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP4 pid=27128) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP0 pid=27124) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP1 pid=27125) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP7 pid=27131) INFO 11-12 08:15:58 [gpu_model_runner.py:2634] Loading model from scratch...
(Worker_TP5 pid=27129) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
(Worker_TP3 pid=27127) INFO 11-12 08:15:58 [cuda.py:366] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
(Worker_TP7 pid=27131) INFO 11-12 08:15:59 [cuda.py:366] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:45<03:02, 45.51s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [01:31<02:16, 45.61s/it]
(Worker_TP2 pid=27126) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP6 pid=27130) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP0 pid=27124) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP5 pid=27129) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP1 pid=27125) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP4 pid=27128) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP7 pid=27131) INFO 11-12 08:17:58 [multiproc_executor.py:558] Parent process exited, terminating worker
Loading safetensors checkpoint shards: 40% Completed | 2/5 [02:02<03:03, 61.03s/it]
(Worker_TP0 pid=27124)
[rank0]:[W1112 08:18:01.634948802 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 498, in init
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] super().init(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 83, in init
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in init
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] self._init_executor()
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] raise e from None
(EngineCore_DP0 pid=26990) ERROR 11-12 08:18:02 [core.py:708] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=26990) Process EngineCore_DP0:
(EngineCore_DP0 pid=26990) Traceback (most recent call last):
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=26990) self.run()
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=26990) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=26990) raise e
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=26990) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 498, in init
(EngineCore_DP0 pid=26990) super().init(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 83, in init
(EngineCore_DP0 pid=26990) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in init
(EngineCore_DP0 pid=26990) self._init_executor()
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=26990) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=26990) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=26990) raise e from None
(EngineCore_DP0 pid=26990) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=26730) Traceback (most recent call last):
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(APIServer pid=26730) return _run_code(code, main_globals, None,
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/runpy.py", line 86, in _run_code
(APIServer pid=26730) exec(code, run_globals)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1953, in
(APIServer pid=26730) uvloop.run(run_server(args))
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/uvloop/init.py", line 69, in run
(APIServer pid=26730) return loop.run_until_complete(wrapper())
(APIServer pid=26730) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/uvloop/init.py", line 48, in wrapper
(APIServer pid=26730) return await main
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
(APIServer pid=26730) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
(APIServer pid=26730) async with build_async_engine_client(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/contextlib.py", line 199, in aenter
(APIServer pid=26730) return await anext(self.gen)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
(APIServer pid=26730) async with build_async_engine_client_from_engine_args(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/contextlib.py", line 199, in aenter
(APIServer pid=26730) return await anext(self.gen)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 225, in build_async_engine_client_from_engine_args
(APIServer pid=26730) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/utils/init.py", line 1572, in inner
(APIServer pid=26730) return fn(*args, **kwargs)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=26730) return cls(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 134, in init
(APIServer pid=26730) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=26730) return AsyncMPClient(*client_args)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 769, in init
(APIServer pid=26730) super().init(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 448, in init
(APIServer pid=26730) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/contextlib.py", line 142, in exit
(APIServer pid=26730) next(self.gen)
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=26730) wait_for_engine_startup(
(APIServer pid=26730) File "/opt/miniconda3/envs/vllm-fresh/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=26730) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=26730) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/opt/miniconda3/envs/vllm-fresh/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/opt/miniconda3/envs/vllm-fresh/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 8 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
This was deploying fine just half a week ago, then over the past couple of days it suddenly stopped working for no obvious reason.
I have already tried lowering the GPU memory utilization, running on a single GPU (roughly the command sketched below), re-downloading the model from ModelScope, and setting up a fresh environment with a new vLLM install, but none of it helped... Any help would be appreciated.
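For reference, the single-GPU test I ran was roughly the following (reconstructed from memory, so the exact --gpu-memory-utilization value is a guess; the other flags match the 8-GPU command above minus --tensor-parallel-size). It failed the same way:

python -m vllm.entrypoints.openai.api_server \
  --model /root/models/Qwen3-8B \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.7 \
  --enforce-eager \
  --host 0.0.0.0 --port 8000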