Skip to content

Conversation

@huerni
Copy link
Collaborator

@huerni huerni commented Jan 16, 2026

Summary by CodeRabbit

  • New Features
    • Added signal messages with flags and scheduled times for tasks and steps.
    • Signals now propagate through task and step payloads for coordinated handling.
    • Automatic scheduling and cleanup of per-step signal timers tied to task time limits.
    • Role-aware handling: batch-only signals are ignored for non-primary steps where appropriate.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 16, 2026

📝 Walkthrough

Walkthrough

Adds a new protobuf message Signal and signals fields to task/ and step-level messages; propagates signals into StepToD; introduces per-step signal timers in TaskManager that are scheduled and canceled based on signal metadata and step role.

Changes

Cohort / File(s) Summary
Protobuf: Signal definition & fields
protos/PublicDefs.proto
Adds message Signal { int32 signal_number; uint32 signal_time; enum SignalFlag { NONE=0; BATCH_ONLY=1; RESERVATION_OVERLAP=2 } signal_flag = ... } and repeated Signal signals to TaskToCtld and StepToCtld.
Signal propagation (Ctld)
src/CraneCtld/CtldPublicDefs.cpp
Copies signals from the job/task into generated StepToD via mutable_signals()->CopyFrom(...) in GetStepToD().
TaskManager API & storage
src/Craned/Supervisor/TaskManager.h
Adds signal_timers vector to StepInstance, [[nodiscard]] bool IsPrimary() const noexcept;, and declarations for AddSignalTimer_() / DelSignalTimers_().
TaskManager implementation & flow changes
src/Craned/Supervisor/TaskManager.cpp
Implements IsPrimary(). Adds DelSignalTimers_() cleanup calls in task finish/time-limit paths. Schedules per-signal timers in EvGrpcExecuteTaskCb_ and EvCleanChangeTaskTimeLimitQueueCb_, skipping BATCH_ONLY signals for non-primary steps and creating timers for eligible signals; adds per-signal timer teardown where termination timers are removed.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Ctld as Ctld (StepToD)
participant Craned as Craned.TaskManager
participant UVW as uvw::timer_handle
participant Task as Task/Process

Note over Ctld,Craned: StepToD includes `signals` propagated from TaskToCtld
Ctld->>Craned: Deliver StepToD (with `signals`)
Craned->>Craned: On execute / time-limit set: evaluate each Signal
Craned->>UVW: AddSignalTimer_(delay = sec - signal_time, signal_number)
UVW-->>Craned: Store timer in StepInstance.signal_timers
UVW->>Task: Timer expires -> send signal / terminate task
Craned->>UVW: DelSignalTimers_() on task finish/termination
UVW-->>Craned: Timers closed and cleared

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • Nativu5
  • L-Xiafeng
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: cbatch signal' accurately describes the main change—adding signal support to the cbatch system through new protobuf messages and timer management.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)

2318-2348: Align signal_time equality handling with initial scheduling.

EvGrpcExecuteTaskCb_ schedules when signal_time == time_limit, but this path skips it (>=). That can drop a signal when the time limit is updated to exactly signal_time. Consider matching the initial behavior (or explicitly firing immediately).

🔧 Suggested fix
-        if (signal.signal_time() >= new_sec) {
+        if (signal.signal_time() > new_sec) {
           CRANE_TRACE(
               "signal {} time of {}s >= time_limit {}s, ignore this signal.",
               signal.signal_number(), signal.signal_time(), new_sec);
           continue;
         }
🤖 Fix all issues with AI agents
In `@src/Craned/Supervisor/TaskManager.cpp`:
- Around line 2381-2398: The CRANE_INFO call inside the loop that logs signal
details has a format-string/argument mismatch in TaskManager (use
AddSignalTimer_ and m_step_ references): the message uses one {} but supplies
two values (signal.signal_time(), signal.signal_number()); update the format
string to include two placeholders (e.g., "Add a signal time of {}s for signal
{}") so the log invocation matches the provided arguments and preserves correct
logging.
🧹 Nitpick comments (1)
protos/PublicDefs.proto (1)

164-173: Clarify SignalFlag semantics (B/R).

The enum values are terse; consider comments or more descriptive names to prevent misuse.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b3162e and 198c353.

📒 Files selected for processing (4)
  • protos/PublicDefs.proto
  • src/CraneCtld/CtldPublicDefs.cpp
  • src/Craned/Supervisor/TaskManager.cpp
  • src/Craned/Supervisor/TaskManager.h
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.

Applied to files:

  • src/Craned/Supervisor/TaskManager.h
  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.

Applied to files:

  • src/Craned/Supervisor/TaskManager.cpp
🔇 Additional comments (10)
src/Craned/Supervisor/TaskManager.h (3)

49-50: LGTM: per-step signal timer tracking is clear.

This makes timer lifecycle management explicit and easier to clean up.


121-124: LGTM: explicit primary-step predicate.

Helps keep step-type checks readable in the caller logic.


518-539: LGTM: signal timer scheduling and cleanup look solid.

Storing handles and explicitly closing them reduces the risk of dangling timers.

src/Craned/Supervisor/TaskManager.cpp (4)

100-102: LGTM: IsPrimary implementation is straightforward.


1886-1890: Good cleanup: signal timers cleared on step completion.


2130-2134: Good cleanup: signal timers cleared on timeout path.


2445-2445: No review needed for namespace close.

protos/PublicDefs.proto (2)

233-236: LGTM: signals are now carried on TaskToCtld.


396-399: LGTM: StepToD now transports signal data.

src/CraneCtld/CtldPublicDefs.cpp (1)

952-955: LGTM: signal propagation into StepToD is clear.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@huerni huerni requested review from L-Xiafeng and Nativu5 January 19, 2026 02:53
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/Craned/Supervisor/TaskManager.cpp`:
- Around line 2337-2347: The signal-time comparison currently uses ">=" which
skips signals whose signal_time equals new_sec and prevents immediate firing; in
the loop iterating m_step_.GetStep().signals() and in the other signal handling
block that logs ">" but still uses ">=", change the condition from
"signal.signal_time() >= new_sec" to "signal.signal_time() > new_sec" so
AddSignalTimer_(new_sec - signal.signal_time(), signal.signal_number()) can be
called with zero delay for immediate signals; ensure the CRANE_TRACE message
matches the new comparison logic.

@huerni huerni force-pushed the dev/cbatch_signal branch from ed22466 to e797046 Compare January 26, 2026 05:47
@huerni huerni requested a review from Nativu5 January 26, 2026 05:49
@huerni huerni force-pushed the dev/cbatch_signal branch from 99e06d2 to 71ed989 Compare January 28, 2026 02:12
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/Craned/Supervisor/TaskManager.cpp`:
- Around line 2540-2546: The loop currently skips non-BATCH_ONLY signals for
primary steps; change the filtering so only BATCH_ONLY signals are restricted to
primary steps. Replace the two-condition logic around
m_step_.GetStep().signals() with a single check: if signal.signal_flag() ==
crane::grpc::Signal_SignalFlag_BATCH_ONLY && !m_step_.IsPrimary() continue;
(remove the check that skips non-BATCH_ONLY when m_step_.IsPrimary()). Apply the
same fix to the second loop that also uses m_step_.GetStep().signals() (the
block around lines ~2591-2597).

@github-actions github-actions bot added the test-passed Build and test success label Jan 28, 2026
@Nativu5 Nativu5 merged commit 1e9cd7d into master Jan 28, 2026
2 checks passed
@Nativu5 Nativu5 deleted the dev/cbatch_signal branch January 28, 2026 15:22
@Nativu5 Nativu5 linked an issue Jan 28, 2026 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test-passed Build and test success

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cbatch增加--signal

4 participants