[wip][BREAKING][recipe, ckpt] add checkpoint engine for one step off policy #4601
base: main
Conversation
x1314aq seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Code Review
This pull request introduces a new CkptEngineWorker to manage weight synchronization using a ParameterServer for vLLM-based rollout workers, replacing the previous Ray collective group mechanism. The CkptEngineWorker is responsible for initializing its own process group, checking vLLM readiness, setting server addresses, and synchronizing rollout weights by registering and updating checkpoints via the ParameterServer. The DetachActorWorker is also updated to integrate with this new checkpoint engine, including its own ParameterServer initialization and a split_tensors method to prepare actor weights for synchronization.

The PPOTrainer now calculates rank_offset and ps_world_size to configure the ParameterServer instances and explicitly creates a resource pool for the CkptEngine role, allocating specific CPU/NPU resources. The weight synchronization logic in ray_trainer.py is refactored to use the new sync_rollout_weights_by_ckpt_engine methods on both the actor and checkpoint-engine worker groups. Additionally, the RayResourcePool class gains a custom_bundle parameter to allow more flexible resource allocation, particularly for NPU devices, and a new shell script grpo_0.6b_gsm8k_fsdp2_2_6_ckpt_engine.sh demonstrates the new setup.

Review comments highlight critical issues: hardcoded network ports and port ranges in CkptEngineWorker and DetachActorWorker should be made configurable to prevent conflicts, and the check_vllm_ready loop in CkptEngineWorker needs a maximum retry mechanism to prevent indefinite hanging, along with incrementing self.index after each checkpoint update to ensure unique checkpoint names.
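To make the refactor concrete, the trainer-side call described above might look roughly like the following (a sketch inferred from this summary; the worker-group names and call signatures are assumptions, not code from the PR):

```python
# Sketch only: per the summary, both worker groups expose a
# sync_rollout_weights_by_ckpt_engine method; names here are illustrative.
def sync_rollout_weights(actor_wg, ckpt_engine_wg):
    # Actor workers split and publish their weights to the ParameterServer...
    actor_wg.sync_rollout_weights_by_ckpt_engine()
    # ...while checkpoint-engine workers register/update the checkpoint and
    # push the new weights to the vLLM rollout workers over HTTP. In the PR
    # these presumably execute concurrently across the two worker groups.
    ckpt_engine_wg.sync_rollout_weights_by_ckpt_engine()
```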
The first flagged hardcoded-port site:

```python
os.environ["HCCL_NPU_SOCKET_PORT_RANGE"] = "61020-61050"
self.ps.init_process_group(device_index=0, master_port=60010)
```
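The review summary above flags these hardcoded values as a conflict risk. A minimal sketch of reading them from configuration instead, assuming a hypothetical ckpt_engine config section (these keys are not part of the PR):

```python
# Hypothetical config keys on the worker's config object; the defaults
# mirror the hardcoded values above.
port_range = config.ckpt_engine.get("socket_port_range", "61020-61050")
master_port = config.ckpt_engine.get("master_port", 60010)
os.environ["HCCL_NPU_SOCKET_PORT_RANGE"] = port_range
self.ps.init_process_group(device_index=0, master_port=master_port)
```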
The checkpoint-update sequence referenced by the following comment:

```python
self.ps.register_checkpoint(checkpoint_name=checkpoint_name)
self.ps.gather_metas(checkpoint_name)
ranks = list(range(self.ps_rank_offset, self.ps_world_size))
self.ps.update(checkpoint_name, req_func, ranks=ranks)
```
To fix the checkpoint naming issue, you should increment self.index after each update. This ensures that a unique checkpoint name is used for every synchronization.
Suggested change:

```diff
  self.ps.update(checkpoint_name, req_func, ranks=ranks)
+ self.index += 1
```
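Assuming checkpoint_name is derived from the counter (the naming scheme below is illustrative, not from the PR), the increment gives every synchronization a distinct checkpoint:

```python
# Illustrative: a per-sync counter keeps checkpoint names unique.
checkpoint_name = f"sync_ckpt_{self.index}"  # sync_ckpt_0, sync_ckpt_1, ...
```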
The second flagged site, with the same hardcoded master port:

```python
os.environ["HCCL_NPU_SOCKET_PORT_RANGE"] = "61020"
self.ps.init_process_group(device_index=0, master_port=60010)
```
The same update sequence at the second location:

```python
self.ps.gather_metas(checkpoint_name)
ranks = list(range(self.ps_rank_offset, self.ps_world_size))
self.ps.update(checkpoint_name, req_func, ranks=ranks)
```
To fix the checkpoint naming issue, you should increment self.index after each update. This ensures that a unique checkpoint name is used for every synchronization.
Suggested change:

```diff
  self.ps.update(checkpoint_name, req_func, ranks=ranks)
+ self.index += 1
```
The current readiness check:

```python
retry_num = 0
transport = None
if uds is not None:
    transport = httpx.HTTPTransport(uds=uds)
while True:
    try:
        response = httpx.Client(transport=transport).get(f"{self.endpoint}/health", timeout=10)
        response.raise_for_status()
        break
    except (httpx.ConnectError, httpx.HTTPStatusError) as e:
        retry_num += 1
        logger.warning(f"fail to check vllm ready, retry {retry_num} times, error: {e}")
        time.sleep(5)
```
The while True loop for checking vLLM readiness can run indefinitely if the server fails to start, causing the worker to hang. It's much safer to implement a timeout mechanism with a maximum number of retries. This ensures that the worker will eventually fail with a clear error message instead of getting stuck. I've also moved the httpx.Client instantiation out of the loop for efficiency.
Suggested change:

```python
retry_num = 0
max_retries = 60  # e.g., 5 minutes
transport = httpx.HTTPTransport(uds=uds) if uds is not None else None
client = httpx.Client(transport=transport)
while retry_num < max_retries:
    try:
        response = client.get(f"{self.endpoint}/health", timeout=10)
        response.raise_for_status()
        logger.info("vLLM server is ready.")
        return
    except (httpx.ConnectError, httpx.HTTPStatusError) as e:
        retry_num += 1
        logger.warning(f"fail to check vllm ready, retry {retry_num}/{max_retries} times, error: {e}")
        time.sleep(5)
raise RuntimeError(f"vLLM server not ready after {max_retries} retries.")
```
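With max_retries = 60 and a 5-second sleep between attempts, the bounded loop gives the server roughly 60 × 5 s = 5 minutes to come up before the worker fails loudly; reusing one httpx.Client across attempts also allows connection pooling instead of rebuilding a client on every probe.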
What does this PR do?
This PR introduces checkpoint-engine to achieve efficient parameter synchronization between the trainer and the rollouter.
This PR is somewhat similar to #4427, but employs a completely different implementation approach. The main differences are as follows:
- It uses checkpoint-engine as a dependency, rather than re-implementing its core logic.
- checkpoint-engine runs as an independent process, not within the same process as the rollouter, and updates the rollouter's weights via HTTP requests.
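Pieced together from the reviewed snippets, one synchronization round on the checkpoint-engine side goes roughly like this (a sketch, not verbatim PR code; the import path follows the upstream checkpoint-engine package, and the counter/naming scheme and placeholder values are assumptions):

```python
from checkpoint_engine.ps import ParameterServer  # the external dependency

ps = ParameterServer()
ps.init_process_group(device_index=0, master_port=60010)

# req_func is the callback that issues the HTTP weight-update requests to
# the vLLM servers; its real definition lives in the PR's worker code.
def req_func(meta):
    ...

ps_rank_offset, ps_world_size = 2, 4  # illustrative values from PPOTrainer
index = 0                             # incremented after every sync

checkpoint_name = f"sync_ckpt_{index}"  # assumed naming scheme
ps.register_checkpoint(checkpoint_name=checkpoint_name)
ps.gather_metas(checkpoint_name)
ranks = list(range(ps_rank_offset, ps_world_size))  # the rollout ranks
ps.update(checkpoint_name, req_func, ranks=ranks)
index += 1  # keep the next checkpoint name unique
```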
Checklist Before Starting

- Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - If this PR involves a breaking change, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
Testing depends on a pending checkpoint-engine PR and a hixl issue before this can run on an Ascend Atlas A2/A3 server.
Work on Megatron and SGLang is currently in progress and will be completed soon.
API and Usage Example
```python
# Add code snippet or script demonstrating how to use this
```

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)