-
Notifications
You must be signed in to change notification settings - Fork 34
feat: cbatch signal #801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: cbatch signal #801
Conversation
📝 WalkthroughWalkthroughAdds a new protobuf message Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
2318-2348: Alignsignal_timeequality handling with initial scheduling.
EvGrpcExecuteTaskCb_schedules whensignal_time == time_limit, but this path skips it (>=). That can drop a signal when the time limit is updated to exactlysignal_time. Consider matching the initial behavior (or explicitly firing immediately).🔧 Suggested fix
- if (signal.signal_time() >= new_sec) { + if (signal.signal_time() > new_sec) { CRANE_TRACE( "signal {} time of {}s >= time_limit {}s, ignore this signal.", signal.signal_number(), signal.signal_time(), new_sec); continue; }
🤖 Fix all issues with AI agents
In `@src/Craned/Supervisor/TaskManager.cpp`:
- Around line 2381-2398: The CRANE_INFO call inside the loop that logs signal
details has a format-string/argument mismatch in TaskManager (use
AddSignalTimer_ and m_step_ references): the message uses one {} but supplies
two values (signal.signal_time(), signal.signal_number()); update the format
string to include two placeholders (e.g., "Add a signal time of {}s for signal
{}") so the log invocation matches the provided arguments and preserves correct
logging.
🧹 Nitpick comments (1)
protos/PublicDefs.proto (1)
164-173: ClarifySignalFlagsemantics (B/R).The enum values are terse; consider comments or more descriptive names to prevent misuse.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
protos/PublicDefs.protosrc/CraneCtld/CtldPublicDefs.cppsrc/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-12-08T08:11:40.332Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 727
File: src/Craned/Supervisor/TaskManager.cpp:1401-1407
Timestamp: 2025-12-08T08:11:40.332Z
Learning: In src/Craned/Supervisor/TaskManager.cpp, StepInstance::Prepare() is always called before any task's Spawn() method in the execution flow (via EvGrpcExecuteTaskCb_). If Prepare() fails, tasks are immediately finished and Spawn() is never invoked. For Crun tasks, StepInstance::Prepare() guarantees that x11_meta is set (even if x11 is false), so accessing x11_meta.value() in Spawn() when both IsCrun() and x11 are true is safe without additional has_value() checks.
Applied to files:
src/Craned/Supervisor/TaskManager.hsrc/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-25T04:08:03.273Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/Supervisor.cpp:68-69
Timestamp: 2025-05-25T04:08:03.273Z
Learning: In the Crane supervisor component (src/Craned/Supervisor/), the supervisor process communicates with Craned through STDOUT using protobuf messages. The supervisor must not send any information to STDOUT before sending the "ready" message to Craned, as this would interfere with the inter-process communication protocol. Therefore, adding logging statements that might write to STDOUT before the ready message is sent could break the communication.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-26T11:04:56.055Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:72-72
Timestamp: 2025-05-26T11:04:56.055Z
Learning: The CraneSched project prefers using global variables (like `g_supervisor_keeper`) over dependency injection patterns. The team does not follow dependency injection approaches for managing singleton instances.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
🔇 Additional comments (10)
src/Craned/Supervisor/TaskManager.h (3)
49-50: LGTM: per-step signal timer tracking is clear.This makes timer lifecycle management explicit and easier to clean up.
121-124: LGTM: explicit primary-step predicate.Helps keep step-type checks readable in the caller logic.
518-539: LGTM: signal timer scheduling and cleanup look solid.Storing handles and explicitly closing them reduces the risk of dangling timers.
src/Craned/Supervisor/TaskManager.cpp (4)
100-102: LGTM: IsPrimary implementation is straightforward.
1886-1890: Good cleanup: signal timers cleared on step completion.
2130-2134: Good cleanup: signal timers cleared on timeout path.
2445-2445: No review needed for namespace close.protos/PublicDefs.proto (2)
233-236: LGTM: signals are now carried on TaskToCtld.
396-399: LGTM: StepToD now transports signal data.src/CraneCtld/CtldPublicDefs.cpp (1)
952-955: LGTM: signal propagation into StepToD is clear.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/Craned/Supervisor/TaskManager.cpp`:
- Around line 2337-2347: The signal-time comparison currently uses ">=" which
skips signals whose signal_time equals new_sec and prevents immediate firing; in
the loop iterating m_step_.GetStep().signals() and in the other signal handling
block that logs ">" but still uses ">=", change the condition from
"signal.signal_time() >= new_sec" to "signal.signal_time() > new_sec" so
AddSignalTimer_(new_sec - signal.signal_time(), signal.signal_number()) can be
called with zero delay for immediate signals; ensure the CRANE_TRACE message
matches the new comparison logic.
ed22466 to
e797046
Compare
99e06d2 to
71ed989
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/Craned/Supervisor/TaskManager.cpp`:
- Around line 2540-2546: The loop currently skips non-BATCH_ONLY signals for
primary steps; change the filtering so only BATCH_ONLY signals are restricted to
primary steps. Replace the two-condition logic around
m_step_.GetStep().signals() with a single check: if signal.signal_flag() ==
crane::grpc::Signal_SignalFlag_BATCH_ONLY && !m_step_.IsPrimary() continue;
(remove the check that skips non-BATCH_ONLY when m_step_.IsPrimary()). Apply the
same fix to the second loop that also uses m_step_.GetStep().signals() (the
block around lines ~2591-2597).
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.