🚀[Fine-tuning] Qwen3-MoE Megatron Training Implementation and Best Practices👋 #1301
Replies: 13 comments 1 reply
-
Could GRPO work on the 910B?
-
Most likely yes, but we haven't tried it in practice.
-
Nice, I'll give it a try. I previously trained the qwen-moe-a2.7B model (the one open-sourced quite a while ago) with DeepSpeed ZeRO-3, and found it was actually slower to train than a 14B dense model of comparable size, both on a single machine with 8 GPUs. I've always suspected my MoE configuration was the problem.
-
@Jintao-Huang Thanks for the guide on fine-tuning! It would be great if you could also update the ms-swift documentation at https://github.com/QwenLM/Qwen3/blob/main/docs/source/training/ms_swift.rst.
-
Hahaha, yes, that is supported.
-
Alright, thanks ~
-
For the Qwen3-235B-A22B model:
-
When performing SFT on the Qwen3-235B-A22B model, what is the minimum number of GPUs required?
-
Hi @Jintao-Huang, thanks for open-sourcing this! Roughly what learning rate and batch size work best for a model the size of Qwen3-235B-A22B? I tried fine-tuning on 50k samples with a very small lr (1e-6) and a cosine scheduler, but before long the grad norm saturated and the loss dropped to 0. Do you have any experience you could share? Thanks a lot!
-
Hi @Jintao-Huang, may I ask whether Swift supports fine-tuning Qwen3-30B-A3B with the GRPO method?
-
@Jintao-Huang For the Qwen3-235B-A22B model:
-
Hi there, @Jintao-Huang, is MoE training supported on NPUs such as the 910B? Is there any documentation that can be referenced?
-
Can a model trained with swift be served with vLLM?
-
Chinese-language notebook: https://modelscope.cn/notebook/share/ipynb/d4d8765f/qwen3.ipynb
Hello, everyone! We are thrilled about the open-source release of Qwen3 and Qwen3-MoE. The ms-swift large model training framework now provides initial support for CPT/SFT/DPO/GRPO training of Qwen3/Qwen3-MoE. It also supports Megatron-based training (CPT/SFT) for Qwen3/Qwen3-MoE, which on MoE models is 10 times faster than training with the transformers library.
For complete best practices, you can check out the details here: modelscope/ms-swift#4030. Everyone is welcome to try out our framework! 😊
We will showcase a runnable fine-tuning demo and provide the format for custom datasets.
Before starting the fine-tuning process, please ensure that your environment is properly set up.
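If ms-swift is not yet installed, a typical setup looks roughly like the following; the exact version requirements are not listed in this post, so check the ms-swift README for the currently recommended versions:

```shell
# Illustrative environment setup; pin versions per the ms-swift README if needed.
pip install ms-swift -U
pip install transformers -U
```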
Qwen3-8B SFT
The script for training Qwen3-8B is as follows, which can be run on the free A10 computing resources provided by ModelScope: https://modelscope.cn/my/mynotebook
The format for a custom dataset is as follows (the `system` field is optional). Simply specify `--dataset <dataset_path>`. For more information, refer to the custom dataset documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html

```jsonl
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\nxxx\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang? /no_think"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
```

10-Minute Quick Self-Cognition Fine-Tuning Demo (GPU Memory Usage: 22GB)
ref: https://github.com/modelscope/ms-swift/blob/51cafe59325603b2bf0f63cf688c659fbe9abc5d/swift/llm/dataset/dataset/llm.py#L835
```shell
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset 'swift/Qwen3-SFT-Mixin#2000' \
              'swift/self-cognition:qwen3#600' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --use_liger_kernel true \
    --model_author swift \
    --model_name swift-robot
```

Inference and test the fine-tuning results:

```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048
```

Qwen3-8B GRPO
Taking Qwen3-8B as an example, the following uses the ms-swift framework to conduct GRPO training. For more details about GRPO, refer to the GRPO documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html
The AI-MO/NuminaMath-TIR dataset is used, and the accuracy function is employed to compute the model’s response accuracy reward. The following environment needs to be installed to calculate rewards:
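The exact install command is not reproduced in this excerpt; the accuracy reward in the ms-swift GRPO documentation relies on the `math_verify` package, so a plausible setup is the following (treat it as an assumption and verify against the GRPO docs linked above):

```shell
# Assumed dependency for the accuracy reward; verify against the GRPO documentation.
pip install math_verify
```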
The custom dataset format is similar to SFT, where the assistant part is optional. If using the accuracy reward, a `solution` column is required to compute the accuracy.
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]} {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]} {"messages": [{"role": "user", "content": "What is your name?"}]}You can also train with custom reward functions or reward models. Columns in the dataset will be passed into
**kwargsof the reward function. An example of a custom reward function can be found here: swift/examples/train/grpo/plugin/plugin.py--external_plugins examples/train/grpo/plugin/plugin.py \ --reward_funcs external_math_acc external_math_format \ --reward_model AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2During training, we use vLLM to accelerate the sampling process. Setting num_infer_workers=8, we deploy one vLLM engine on each device to speed up the sampling process.
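For orientation, here is a minimal sketch of what such a custom reward function might look like. It follows the `ORM`/`orms` plugin pattern used in ms-swift's example plugin.py; the class name, reward logic, and registration key below are illustrative and not part of ms-swift itself, so check them against the plugin file in your installed version:

```python
# Minimal sketch of a custom GRPO reward function (illustrative, not the official plugin).
from swift.plugin import ORM, orms


class SolutionMatchReward(ORM):
    def __call__(self, completions, solution, **kwargs):
        # `completions` holds the sampled model responses; extra dataset columns
        # (here, `solution`) arrive as same-length lists.
        rewards = []
        for completion, sol in zip(completions, solution):
            rewards.append(1.0 if sol.strip() and sol.strip() in completion else 0.0)
        return rewards


# Register the reward so it can be selected via --reward_funcs solution_match.
orms['solution_match'] = SolutionMatchReward
```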
The training script is as follows:
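The full script is not reproduced in this excerpt; a minimal sketch of a GRPO launch under the settings described above might look like the following. Flag names such as `--rlhf_type grpo`, `--use_vllm`, and `--num_infer_workers` follow the ms-swift GRPO documentation, but the exact values here are illustrative and should be checked against your installed version:

```shell
# Illustrative only: 8 GPUs, one colocated vLLM engine per device (num_infer_workers=8).
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3-8B \
    --dataset 'AI-MO/NuminaMath-TIR' \
    --reward_funcs accuracy \
    --use_vllm true \
    --num_infer_workers 8 \
    --torch_dtype bfloat16 \
    --num_generations 8 \
    --max_completion_length 2048 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-6 \
    --output_dir output
```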
Qwen3-30B-A3B MoE SFT (Megatron-SWIFT)
ms-swift introduces Megatron's parallel technology to accelerate large model training, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports pre-training and fine-tuning of models like Qwen3, Qwen3-MoE, Qwen2.5, Llama3, Deepseek-R1 distillation series, etc.
For environment preparation (image) and the conversion between HF and MCore model weights, please refer to the Megatron-SWIFT training documentation; it is not covered here: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html
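As a rough illustration only (the authoritative commands are in the linked documentation), HF-to-MCore weight conversion is done with `swift export`; the flags below, e.g. `--to_mcore`, are based on the Megatron-SWIFT docs and should be double-checked against them:

```shell
# Convert HF weights to Megatron (MCore) format before training (illustrative sketch).
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift export \
    --model Qwen/Qwen3-30B-A3B \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen3-30B-A3B-mcore
```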
We use DLC to initiate the training command. The training environment consists of 2 machines with 8 * 80GiB A800:
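The full DLC launch command is not reproduced here; a minimal per-node sketch of a `megatron sft` invocation on the converted weights might look like the following. The dataset path, parallelism sizes, and several flags are illustrative assumptions, so consult the Megatron-SWIFT documentation for the exact command:

```shell
# Illustrative only: run on each node; NNODES/NODE_RANK/MASTER_ADDR come from the DLC environment.
NPROC_PER_NODE=8 \
megatron sft \
    --load Qwen3-30B-A3B-mcore \
    --dataset '<your_dataset>' \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --finetune true \
    --lr 1e-5 \
    --train_iters 2000 \
    --save megatron_output/Qwen3-30B-A3B \
    --save_interval 200 \
    --max_length 2048 \
    --num_workers 4
```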
More multi-node launch methods can be found here: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node
Training loss (partial):
Screenshot of the results:
The custom dataset format is the same as for `swift sft`, which can be found above. Specify `--dataset <dataset_path>`.

Below is the comparison of full-parameter training speed/GPU memory usage for the Qwen3-30B-A3B model using `megatron sft` and `swift sft`: