🚀[Fine-tuning] Qwen3-MoE Megatron Training Implementation and Best Practices👋 #1301
Replies: 13 comments 1 reply
-
Could GRPO work on the 910B?
-
Most likely yes, but we haven't tried it in practice.
-
Nice, I'll give it a try. I previously trained the qwen-moe-a2.7B model (the one open-sourced quite a while ago) with DeepSpeed ZeRO-3, and found it was actually slower to train than a 14B dense model of comparable size, both on a single machine with 8 GPUs. I've always suspected my MoE configuration was the problem.
-
@Jintao-Huang Thanks for the guide on fine-tuning! It would be great if you could also update the ms-swift documentation at https://github.com/QwenLM/Qwen3/blob/main/docs/source/training/ms_swift.rst.
-
Hahaha, yes, that is supported.
-
Alright, thanks ~
-
For the Qwen3-235B-A22B model:
-
When performing SFT on the Qwen3-235B-A22B model, what is the minimum number of GPUs required?
-
Hi @Jintao-Huang, thanks for open-sourcing this! Roughly what learning rate and batch size work best for a model the size of Qwen3-235B-A22B? I tried fine-tuning on 50k samples with a very small lr (1e-6) and a cosine scheduler, but before long the grad norm saturated and the loss dropped to 0. Do you have any experience you could share? Thanks a lot!
-
Hi @Jintao-Huang, may I ask whether Swift supports fine-tuning Qwen3-30B-A3B with the GRPO method?
-
@Jintao-Huang For the Qwen3-235B-A22B model:
-
Hi there, @Jintao-Huang, is MoE training supported on NPUs such as the 910B? Is there any documentation that can be referenced?
-
Can a model trained with swift be served with vLLM?
-
Chinese-language notebook: https://modelscope.cn/notebook/share/ipynb/d4d8765f/qwen3.ipynb
Hello, everyone! We are thrilled about the open-source release of Qwen3 and Qwen3-MoE. The ms-swift large model training framework now provides initial support for CPT/SFT/DPO/GRPO training of Qwen3/Qwen3-MoE. It also supports Megatron-based training (CPT/SFT) for Qwen3/Qwen3-MoE, which on MoE models is 10 times faster than training with the transformers library.
For complete best practices, you can check out the details here: modelscope/ms-swift#4030. Everyone is welcome to try out our framework! 😊
We will showcase a runnable fine-tuning demo and provide the format for custom datasets.
Before starting the fine-tuning process, please ensure that your environment is properly set up.
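If ms-swift is not yet installed, a typical setup looks roughly like the following; the exact version requirements are not listed in this post, so check the ms-swift README for the currently recommended versions:

```shell
# Illustrative environment setup; pin versions per the ms-swift README if needed.
pip install ms-swift -U
pip install transformers -U
```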
Qwen3-8B SFT
The script for training Qwen3-8B is as follows, which can be run on the free A10 computing resources provided by ModelScope: https://modelscope.cn/my/mynotebook
The format for a custom dataset is as follows (the `system` field is optional). Simply specify `--dataset <dataset_path>`. For more information, refer to the custom dataset documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html

```jsonl
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\nxxx\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang? /no_think"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
```

10-Minute Quick Self-Cognition Fine-Tuning Demo (GPU Memory Usage: 22GB)
ref: https://github.com/modelscope/ms-swift/blob/51cafe59325603b2bf0f63cf688c659fbe9abc5d/swift/llm/dataset/dataset/llm.py#L835
```shell
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset 'swift/Qwen3-SFT-Mixin#2000' \
              'swift/self-cognition:qwen3#600' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --use_liger_kernel true \
    --model_author swift \
    --model_name swift-robot
```

Inference and test the fine-tuning results:

```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048
```

Qwen3-8B GRPO
Taking Qwen3-8B as an example, the following uses the ms-swift framework to conduct GRPO training. For more details about GRPO, refer to the GRPO documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html
The AI-MO/NuminaMath-TIR dataset is used, and the accuracy function is employed to compute the model’s response accuracy reward. The following environment needs to be installed to calculate rewards:
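The exact install command is not reproduced in this excerpt; the accuracy reward in the ms-swift GRPO documentation relies on the `math_verify` package, so a plausible setup is the following (treat it as an assumption and verify against the GRPO docs linked above):

```shell
# Assumed dependency for the accuracy reward; verify against the GRPO documentation.
pip install math_verify
```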
The custom dataset format is similar to SFT, where the assistant part is optional. If using the accuracy reward, a `solution` column is required to compute the accuracy.
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]} {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]} {"messages": [{"role": "user", "content": "What is your name?"}]}You can also train with custom reward functions or reward models. Columns in the dataset will be passed into
**kwargsof the reward function. An example of a custom reward function can be found here: swift/examples/train/grpo/plugin/plugin.py--external_plugins examples/train/grpo/plugin/plugin.py \ --reward_funcs external_math_acc external_math_format \ --reward_model AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2During training, we use vLLM to accelerate the sampling process. Setting num_infer_workers=8, we deploy one vLLM engine on each device to speed up the sampling process.
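For orientation, here is a minimal sketch of what such a custom reward function might look like. It follows the `ORM`/`orms` plugin pattern used in ms-swift's example plugin.py; the class name, reward logic, and registration key below are illustrative and not part of ms-swift itself, so check them against the plugin file in your installed version:

```python
# Minimal sketch of a custom GRPO reward function (illustrative, not the official plugin).
from swift.plugin import ORM, orms


class SolutionMatchReward(ORM):
    def __call__(self, completions, solution, **kwargs):
        # `completions` holds the sampled model responses; extra dataset columns
        # (here, `solution`) arrive as same-length lists.
        rewards = []
        for completion, sol in zip(completions, solution):
            rewards.append(1.0 if sol.strip() and sol.strip() in completion else 0.0)
        return rewards


# Register the reward so it can be selected via --reward_funcs solution_match.
orms['solution_match'] = SolutionMatchReward
```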
The training script is as follows:
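The full script is not reproduced in this excerpt; a minimal sketch of a GRPO launch under the settings described above might look like the following. Flag names such as `--rlhf_type grpo`, `--use_vllm`, and `--num_infer_workers` follow the ms-swift GRPO documentation, but the exact values here are illustrative and should be checked against your installed version:

```shell
# Illustrative only: 8 GPUs, one colocated vLLM engine per device (num_infer_workers=8).
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3-8B \
    --dataset 'AI-MO/NuminaMath-TIR' \
    --reward_funcs accuracy \
    --use_vllm true \
    --num_infer_workers 8 \
    --torch_dtype bfloat16 \
    --num_generations 8 \
    --max_completion_length 2048 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-6 \
    --output_dir output
```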
Qwen3-30B-A3B MoE SFT (Megatron-SWIFT)
ms-swift introduces Megatron's parallel technology to accelerate large model training, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports pre-training and fine-tuning of models like Qwen3, Qwen3-MoE, Qwen2.5, Llama3, Deepseek-R1 distillation series, etc.
For environment preparation (image) and the conversion between HF and MCore model weights, please refer to the Megatron-SWIFT training documentation; it is not covered here: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html
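As a rough illustration only (the authoritative commands are in the linked documentation), HF-to-MCore weight conversion is done with `swift export`; the flags below, e.g. `--to_mcore`, are based on the Megatron-SWIFT docs and should be double-checked against them:

```shell
# Convert HF weights to Megatron (MCore) format before training (illustrative sketch).
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift export \
    --model Qwen/Qwen3-30B-A3B \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen3-30B-A3B-mcore
```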
We use DLC to initiate the training command. The training environment consists of 2 machines with 8 * 80GiB A800:
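The full DLC launch command is not reproduced here; a minimal per-node sketch of a `megatron sft` invocation on the converted weights might look like the following. The dataset path, parallelism sizes, and several flags are illustrative assumptions, so consult the Megatron-SWIFT documentation for the exact command:

```shell
# Illustrative only: run on each node; NNODES/NODE_RANK/MASTER_ADDR come from the DLC environment.
NPROC_PER_NODE=8 \
megatron sft \
    --load Qwen3-30B-A3B-mcore \
    --dataset '<your_dataset>' \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --finetune true \
    --lr 1e-5 \
    --train_iters 2000 \
    --save megatron_output/Qwen3-30B-A3B \
    --save_interval 200 \
    --max_length 2048 \
    --num_workers 4
```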
More multi-node launch methods can be found here: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node
Training loss (partial):
Screenshot of the results:
The custom dataset format is the same as for `swift sft`, which can be found above. Specify `--dataset <dataset_path>`.

Below is the comparison of full-parameter training speed/GPU memory usage for the Qwen3-30B-A3B model using `megatron sft` and `swift sft`: