-
Notifications
You must be signed in to change notification settings - Fork 34
feat: add cbatch --deadline #630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
WalkthroughAdds per-task deadline fields and enums to protobufs and DB persistence, propagates deadline_time through Ctld structures and RPCs, introduces deadline timers and cancellation paths in TaskScheduler and Supervisor, renames time-limit APIs to optional time-constraint variants, and documents CLI --deadline options. Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as User/CLI
participant Ctld as Ctld (TaskScheduler)
participant Deadline as DeadlineTimerThread
participant Craned as Craned (Supervisor/TaskManager)
participant Task as Task Process
CLI->>Ctld: ChangeTaskTimeConstraint(task_id, time_limit?, deadline?)
Ctld->>Ctld: validate deadline > now (if present)
Ctld->>Ctld: update TaskInCtld.deadline_time
Ctld->>Deadline: enqueue CreateDeadlineTimer(job_id, task_id, deadline)
Deadline->>Deadline: start uvw timer for deadline
rect rgb(200,240,220)
Note over Task,Craned: Normal completion
Task->>Craned: exits normally
Craned->>Ctld: report completion
Ctld->>DB: persist normal outcome
end
rect rgb(255,220,220)
Note over Deadline,Task: Deadline expiry path
Deadline->>Ctld: timer fires -> enqueue cancellation
Ctld->>Craned: request termination (deadline)
Craned->>Task: terminate (SIGTERM/SIGKILL)
Craned->>Ctld: report TERMINATED_BY_DEADLINE / status=Deadline
Ctld->>DB: persist deadline outcome (EC_REACHED_DEADLINE)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
protos/PublicDefs.proto(3 hunks)src/CraneCtld/CtldPublicDefs.cpp(2 hunks)src/CraneCtld/CtldPublicDefs.h(1 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(1 hunks)src/CraneCtld/TaskScheduler.cpp(1 hunks)src/CraneCtld/TaskScheduler.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(3 hunks)src/Craned/Supervisor/TaskManager.h(4 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (2)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (2)
EvDeadlineTaskTimerCb_(1772-1787)EvDeadlineTaskTimerCb_(1772-1772)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)src/Craned/Supervisor/TaskManager.h (2)
DelDeadlineTerminationTimer_(353-358)AddDeadlineTerminationTimer_(334-343)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
|
在Supervisor上是不是把deadline转换成一个timelimit比较好? |
是指1.和time_limit共用一个变量还是2.只是将deadline类型转换为和time_limit一样的类型? |
Slurm 如果是 --deadline 原因结束一个正在运行的任务,任务状态应该是 TIMEOUT,和 timelimit 杀掉任务是一样的输出。因此我认为可以复用同一个 uvw timer。 |
我注意到存在修改time_limit的函数,而调研文档里并没有提到提交任务后还能修改--deadline的val,如果复用可能存在 想修改time_limit但不想修改deadline导致deadline被误修改的情况,所以我认为独立开 会是一个比较好的选择 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
1981-1994: Clamp negative deadline intervals before arming the timer.If the deadline has already passed when this code executes (due to network skew, scheduling delays, or clock differences),
deadline_secbecomes negative. Passing this toAddDeadlineTerminationTimer_()will create an invalid timer that never fires, asstd::chrono::seconds(negative_value)passed touvw::timer_handle::start()causes undefined behavior.This issue was flagged in a previous review but remains unresolved.
Apply this diff to handle expired deadlines:
if (m_step_.GetStep().has_deadline_time()) { int64_t deadline_sec = m_step_.GetStep().deadline_time().seconds() - absl::ToUnixSeconds(absl::Now()); - if (deadline_sec <= sec) { + if (deadline_sec <= 0) { + CRANE_TRACE("Task deadline already reached; firing immediately."); + EvDeadlineTaskTimerCb_(); + } else if (deadline_sec <= sec) { AddDeadlineTerminationTimer_(deadline_sec); CRANE_TRACE("Add a deadline timer of {} seconds", deadline_sec); } else { AddTerminationTimer_(sec); CRANE_TRACE("Add a timer of {} seconds", sec); }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/CraneCtld/TaskScheduler.cpp(2 hunks)src/Craned/Supervisor/TaskManager.cpp(6 hunks)src/Craned/Supervisor/TaskManager.h(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/CraneCtld/TaskScheduler.cpp
- src/Craned/Supervisor/TaskManager.h
🧰 Additional context used
🧬 Code graph analysis (1)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)src/Craned/Supervisor/TaskManager.h (3)
DelTerminationTimer_(345-351)AddDeadlineTerminationTimer_(334-343)AddTerminationTimer_(323-332)
🔇 Additional comments (3)
src/Craned/Supervisor/TaskManager.cpp (3)
1670-1673: LGTM!The logic correctly populates the optional
deadline_secfield with the absolute deadline timestamp whendeadline_timeis present.
1772-1787: LGTM!The deadline timer callback correctly mirrors the timeout callback structure and uses the proper
TERMINATION_BY_DEADLINEenum value andExitCodeReachedDeadlineexit code. The past review comment about the invalid field has been addressed.
1837-1842: LGTM!The deadline termination cases are correctly handled for both signaled and non-signaled scenarios, following the same pattern as OOM and timeout handling. The status and exit codes are appropriate.
Also applies to: 1857-1862
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (3)
src/Craned/Supervisor/TaskManager.cpp (2)
1939-1948: Guard against negative deadline intervals.If
deadline_secis negative (deadline already passed due to network latency, clock skew, or late scheduling), passing it toAddTerminationTimer_()can cause undefined behavior or result in a very large unsigned timeout that never fires. Please check ifdeadline_sec <= 0and handle it by triggering immediate termination before callingAddTerminationTimer_().Apply this diff to guard against expired deadlines:
} else if (elem.deadline_sec) { int64_t deadline_sec = elem.deadline_sec.value() - absl::ToUnixSeconds(absl::Now()); int64_t new_time_limit_sec = ToInt64Seconds((new_time_limit - (absl::Now() - start_time))); + if (deadline_sec <= 0) { + // Deadline already passed, terminate immediately + TaskTerminateQueueElem ev_task_terminate{ + .termination_reason = TerminatedBy::TERMINATION_BY_DEADLINE}; + m_task_terminate_queue_.enqueue(ev_task_terminate); + m_terminate_task_async_handle_->send(); + } else if (deadline_sec <= new_time_limit_sec) { - if (deadline_sec <= new_time_limit_sec) { AddTerminationTimer_(deadline_sec, true); } else { AddTerminationTimer_(new_time_limit_sec, false); } + }Based on past review comments.
1970-1984: Guard against negative deadline intervals.If
deadline_secis negative (deadline already in the past), passing it toAddTerminationTimer_()causes undefined behavior. The code computesmin(deadline_sec, sec)but doesn't check ifdeadline_secis negative beforehand. If negative and smaller thansec, it will still be passed to the timer, which can cause the timer to never fire or behave unexpectedly.Apply this diff to guard against expired deadlines:
int64_t sec = m_step_.GetStep().time_limit().seconds(); if (m_step_.GetStep().has_deadline_time()) { int64_t deadline_sec = m_step_.GetStep().deadline_time().seconds() - absl::ToUnixSeconds(absl::Now()); + if (deadline_sec <= 0) { + CRANE_TRACE("Task deadline already reached; firing immediately."); + EvTaskTimerCb_(true); + } else { - bool is_deadline = deadline_sec <= sec ? true : false; - std::string log_timer = deadline_sec <= sec ? "deadline" : "time_limit"; - deadline_sec = deadline_sec <= sec ? deadline_sec : sec; - AddTerminationTimer_(deadline_sec, is_deadline); - CRANE_TRACE("Add a {} timer of {} seconds", log_timer, deadline_sec); + bool is_deadline = deadline_sec <= sec; + std::string log_timer = is_deadline ? "deadline" : "time_limit"; + int64_t timer_secs = is_deadline ? deadline_sec : sec; + AddTerminationTimer_(timer_secs, is_deadline); + CRANE_TRACE("Add a {} timer of {} seconds", log_timer, timer_secs); + } } else { AddTerminationTimer_(sec, false); CRANE_TRACE("Add a time_limit timer of {} seconds", sec); }Based on past review comments.
src/Craned/Supervisor/TaskManager.h (1)
384-385: Remove dead code declaration.
EvDeadlineTaskTimerCb_()is an empty function in the implementation and should be removed along with its declaration. The deadline functionality has been properly merged intoEvTaskTimerCb_(bool is_deadline).Apply this diff:
void EvTaskTimerCb_(bool is_deadline); - void EvDeadlineTaskTimerCb_(); void EvCleanTaskStopQueueCb_();
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
src/CraneCtld/RpcService/CranedKeeper.cpp(0 hunks)src/CraneCtld/RpcService/CranedKeeper.h(0 hunks)src/CraneCtld/TaskScheduler.cpp(3 hunks)src/CraneCtld/TaskScheduler.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(6 hunks)src/Craned/Supervisor/TaskManager.h(3 hunks)
💤 Files with no reviewable changes (2)
- src/CraneCtld/RpcService/CranedKeeper.h
- src/CraneCtld/RpcService/CranedKeeper.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- src/CraneCtld/TaskScheduler.cpp
🧰 Additional context used
🧬 Code graph analysis (2)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (4)
EvTaskTimerCb_(1755-1777)EvTaskTimerCb_(1755-1755)EvDeadlineTaskTimerCb_(1779-1779)EvDeadlineTaskTimerCb_(1779-1779)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)src/Craned/Supervisor/TaskManager.h (2)
DelTerminationTimer_(334-340)AddTerminationTimer_(323-332)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format
🔇 Additional comments (7)
src/CraneCtld/TaskScheduler.h (1)
674-677: LGTM!The addition of the
reasonfield toCancelPendingTaskQueueElemenables tracking of cancellation causes (e.g., user-triggered vs deadline-driven), which aligns well with the broader deadline feature implementation.src/Craned/Supervisor/TaskManager.cpp (3)
1670-1673: LGTM!The code correctly propagates the deadline information into the queue element for later processing.
1755-1777: LGTM!The function correctly differentiates between deadline and timeout terminations, using the appropriate status codes (
TaskStatus::DeadlinevsTaskStatus::ExceedTimeLimit) and termination reasons (TERMINATION_BY_DEADLINEvsTERMINATION_BY_TIMEOUT).
1829-1834: LGTM!The code correctly handles deadline termination for both signal-based and non-signal-based exits, reporting
TaskStatus::Deadlinewith appropriate exit codes and descriptive messages.Also applies to: 1849-1854
src/Craned/Supervisor/TaskManager.h (3)
30-36: LGTM!The addition of
TERMINATION_BY_DEADLINEto theTerminatedByenum is correctly placed and enables proper tracking of deadline-driven terminations.
323-332: LGTM!The new
AddTerminationTimer_overload correctly captures theis_deadlineflag and passes it toEvTaskTimerCb_, enabling proper differentiation between deadline and timeout terminations.
375-379: LGTM!The addition of the
deadline_secfield toChangeTaskTimeLimitQueueElemcorrectly usesstd::optionalsince not all tasks have deadlines, with appropriate default initialization tostd::nullopt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/Craned/Supervisor/TaskManager.h (1)
310-321: Unify duplicated AddTerminationTimer_ variants; fix unit mismatch and typo; prevent double timers
- Two nearly identical implementations; merge into one API to avoid drift.
- Current versions use milliseconds (template) vs seconds (int64) → inconsistent behavior/precision.
- Call DelTerminationTimer_() before creating a new timer to prevent double callbacks/leaks.
- Typo: termination_handel → termination_handle.
- Minor: avoid unused param warning for lambda arg h.
Suggested refactor:
- template <typename Duration> - void AddTerminationTimer_(Duration duration) { - auto termination_handel = m_uvw_loop_->resource<uvw::timer_handle>(); - termination_handel->on<uvw::timer_event>( - [this](const uvw::timer_event&, uvw::timer_handle& h) { - EvTaskTimerCb_(false); - }); - termination_handel->start( - std::chrono::duration_cast<std::chrono::milliseconds>(duration), - std::chrono::seconds(0)); - m_step_.termination_timer = termination_handel; - } - - void AddTerminationTimer_(int64_t secs, bool is_deadline) { - auto termination_handel = m_uvw_loop_->resource<uvw::timer_handle>(); - termination_handel->on<uvw::timer_event>( - [this, is_deadline](const uvw::timer_event&, uvw::timer_handle& h) { - EvTaskTimerCb_(is_deadline); - }); - termination_handel->start(std::chrono::seconds(secs), - std::chrono::seconds(0)); - m_step_.termination_timer = termination_handel; - } + template <typename Duration> + void AddTerminationTimer_(Duration duration, bool is_deadline = false) { + DelTerminationTimer_(); + auto termination_handle = m_uvw_loop_->resource<uvw::timer_handle>(); + termination_handle->on<uvw::timer_event>( + [this, is_deadline](const uvw::timer_event&, uvw::timer_handle& /*h*/) { + EvTaskTimerCb_(is_deadline); + }); + termination_handle->start( + std::chrono::duration_cast<std::chrono::milliseconds>(duration), + std::chrono::seconds(0)); + m_step_.termination_timer = termination_handle; + }Also applies to: 323-332
🧹 Nitpick comments (1)
src/Craned/Supervisor/TaskManager.h (1)
378-379: Unify time types; clarify deadline semantics
- time_limit uses absl::Duration, but deadline uses std::optional<int64_t> seconds. Prefer consistent types and explicit semantics.
- If deadline is absolute, use absl::Time; if relative, use absl::Duration.
Proposal:
- Rename to deadline (not deadline_sec) and change type to std::optionalabsl::Duration (or absl::Time if absolute).
- Document whether it’s absolute epoch or relative.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/CraneCtld/TaskScheduler.cpp(3 hunks)src/Craned/Supervisor/TaskManager.h(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/CraneCtld/TaskScheduler.cpp
🧰 Additional context used
🧬 Code graph analysis (1)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (2)
EvTaskTimerCb_(1755-1777)EvTaskTimerCb_(1755-1755)
🔇 Additional comments (2)
src/Craned/Supervisor/TaskManager.h (2)
34-36: Enum extension looks good; check downstream handlingAdding TERMINATION_BY_DEADLINE is fine. Ensure all switch statements, metrics, and any (de)serialization/IPC using TerminatedBy are updated accordingly.
384-385: Callback unification LGTMEvTaskTimerCb_(bool is_deadline) cleanly merges deadline/timelimit paths. Matches TaskManager.cpp usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
1937-1946: Clamp negative deadline intervals or terminate immediately.If the deadline has already passed (network latency, clock skew, or late scheduling),
deadline_secbecomes negative or zero at line 1938-1939. Passing this toAddTerminationTimer_()at line 1943 can cause the timer to never fire or behave unexpectedly, as negative durations instd::chrono::seconds()are problematic.Apply this diff to guard against expired deadlines:
} else if (elem.deadline_sec) { int64_t deadline_sec = elem.deadline_sec.value() - absl::ToUnixSeconds(absl::Now()); int64_t new_time_limit_sec = ToInt64Seconds((new_time_limit - (absl::Now() - start_time))); + if (deadline_sec <= 0) { + // Deadline already passed, terminate immediately + TaskTerminateQueueElem ev_task_terminate{ + .termination_reason = TerminatedBy::TERMINATION_BY_DEADLINE}; + m_task_terminate_queue_.enqueue(ev_task_terminate); + m_terminate_task_async_handle_->send(); + } else if (deadline_sec <= new_time_limit_sec) { - if (deadline_sec <= new_time_limit_sec) { AddTerminationTimer_(deadline_sec, true); } else { AddTerminationTimer_(new_time_limit_sec, false); }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/CraneCtld/TaskScheduler.cpp(4 hunks)src/CraneCtld/TaskScheduler.h(2 hunks)src/Craned/Supervisor/TaskManager.cpp(6 hunks)src/Craned/Supervisor/TaskManager.h(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/CraneCtld/TaskScheduler.h
- src/CraneCtld/TaskScheduler.cpp
🧰 Additional context used
🧬 Code graph analysis (2)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (2)
EvTaskTimerCb_(1755-1777)EvTaskTimerCb_(1755-1755)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)src/Craned/Supervisor/TaskManager.h (2)
DelTerminationTimer_(334-340)AddTerminationTimer_(323-332)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format
🔇 Additional comments (9)
src/Craned/Supervisor/TaskManager.h (6)
34-35: LGTM! Clean enum extension.The addition of
TERMINATION_BY_DEADLINEis well-placed and consistent with the existing enum structure.
311-321: LGTM! Unified timer handling.The template method correctly unifies deadline and timeout timers by accepting an
is_deadlineflag and propagating it to the callback. This design addresses the previous suggestion to merge timer functions.
323-332: LGTM! Consistent overload.The overload correctly mirrors the template version's behavior and consistently handles the
is_deadlineflag.
384-384: LGTM! Unified callback signature.The callback signature correctly accepts the
is_deadlineflag, enabling a single unified handler for both timeout and deadline events. This addresses the previous suggestion to merge separate callbacks.
384-384: Flag AI summary inconsistency.The AI-generated enriched summary mentions "Added new public method: EvDeadlineTaskTimerCb_() to handle deadline-based timer events," but this method does not exist in the provided code. The actual implementation uses a unified
EvTaskTimerCb_(bool is_deadline)method instead of separate callbacks.
378-378: No issues with deadline_sec handling: processing logic distinguishestime_limitanddeadline_secupdates correctly and applies appropriate timers.src/Craned/Supervisor/TaskManager.cpp (3)
1670-1673: LGTM!The deadline propagation logic correctly extracts and forwards the deadline timestamp when present, enabling downstream timer rescheduling to consider both deadline and time_limit.
1755-1777: LGTM!The timer callback now correctly distinguishes between deadline and timeout expirations via the
is_deadlineparameter. Batch tasks enqueue the appropriate termination reason, while non-batch tasks immediately report the correct status (Deadline vs ExceedTimeLimit) with the proper exit code.
1827-1832: LGTM!The deadline termination handling correctly maps
TERMINATION_BY_DEADLINEtoTaskStatus::Deadlinein both signal and no-signal paths, with appropriate exit codes and descriptive messages. This ensures batch jobs report deadline expiration distinctly from timeouts.Also applies to: 1847-1852
L-Xiafeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
需要新增ccontrol update job deadline功能
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/Craned/Supervisor/SupervisorServer.cpp (1)
63-86: Handle one-of semantics: downstream uses .value() unguarded and will crash if only one field is setSupervisor’s TaskManager (EvCleanChangeTaskTimeConstraintQueueCb_, see src/Craned/Supervisor/TaskManager.cpp lines 1931‑1968 in the provided snippet) dereferences both optionals with .value(). If only time_limit or only deadline is provided, this will throw/UB and kill the supervisor.
Recommended fix (in TaskManager.cpp): treat each optional independently and compute the earliest expiry only from present fields.
// Replace TaskManager::EvCleanChangeTaskTimeConstraintQueueCb_ with guarded logic void TaskManager::EvCleanChangeTaskTimeConstraintQueueCb_() { absl::Time now = absl::Now(); ChangeTaskTimeConstraintQueueElem elem; while (m_task_time_constraint_change_queue_.try_dequeue(elem)) { DelTerminationTimer_(); CRANE_TRACE("Delete the old timer."); absl::Time start_time = absl::FromUnixSeconds(m_step_.GetStep().start_time().seconds()); const auto& tl = elem.time_limit; // optional<absl::Duration> const auto& ddl = elem.deadline_time; // optional<int64_t> (unix sec) bool timed_out = false; if (tl && now - start_time >= *tl) timed_out = true; if (ddl && absl::ToUnixSeconds(now) >= *ddl) timed_out = true; if (timed_out) { TaskTerminateQueueElem ev{ .termination_reason = TerminatedBy::TERMINATION_BY_TIMEOUT }; m_task_terminate_queue_.enqueue(ev); m_terminate_task_async_handle_->send(); } else { int64_t ttl_by_tl = tl ? ToInt64Seconds((*tl) - (now - start_time)) : std::numeric_limits<int64_t>::max(); int64_t ttl_by_ddl = ddl ? (*ddl - absl::ToUnixSeconds(now)) : std::numeric_limits<int64_t>::max(); int64_t secs = std::min(ttl_by_tl, ttl_by_ddl); bool is_deadline = ttl_by_ddl <= ttl_by_tl; AddTerminationTimer_(secs, is_deadline); CRANE_TRACE("Replace the old timer to {} timer of {} seconds", is_deadline ? "deadline" : "time_limit", secs); } elem.ok_prom.set_value(CraneErrCode::SUCCESS); } }Based on relevant snippet
src/Craned/Core/JobManager.cpp (1)
1024-1052: Set promise on all paths to avoid deadlockIf supervisor stub is missing or RPC fails, ok_prom is never set. The caller waits on future.get() and can block indefinitely. Set false in all failure branches.
while (m_task_time_constraint_change_queue_.try_dequeue(elem)) { if (auto job_ptr = m_job_map_.GetValueExclusivePtr(elem.job_id); job_ptr) { auto stub = g_supervisor_keeper->GetStub(elem.job_id); if (!stub) { CRANE_ERROR("Supervisor for task #{} not found", elem.job_id); - continue; + elem.ok_prom.set_value(false); + continue; } auto err = stub->ChangeTaskTimeConstraint(elem.time_limit, elem.deadline_time); if (err != CraneErrCode::SUCCESS) { CRANE_ERROR("Failed to change task time limit or deadline for task #{}", elem.job_id); - continue; + elem.ok_prom.set_value(false); + continue; } elem.ok_prom.set_value(true); } else { CRANE_ERROR( "Try to update the time limit or deadline of a non-existent task " "#{}.", elem.job_id); elem.ok_prom.set_value(false); } }src/Craned/Supervisor/TaskManager.cpp (1)
1932-1969: Critical issues: Unsafe optional access and incorrect termination reason.This function has several critical issues:
Unsafe optional access (lines 1943, 1946): The code calls
.value()on optional fields without checking if they have values. This will crash if either isstd::nullopt. While theChangeTaskTimeConstraintAsyncfunction (lines 1670-1686) always populates both fields, this assumption makes the code fragile.Incorrect termination reason (line 1949): When the task times out or reaches its deadline, the code always enqueues
TERMINATION_BY_TIMEOUTregardless of which constraint was exceeded. If the deadline was exceeded, it should useTERMINATION_BY_DEADLINE.Negative deadline handling (lines 1955-1956): If
elem.deadline_time.value()is in the past (due to network latency, clock skew, or late processing),deadline_secbecomes negative. Passing this tostd::minand then toAddTerminationTimer_will cause the timer to never fire or behave unexpectedly.Apply this diff to fix these issues:
void TaskManager::EvCleanChangeTaskTimeConstraintQueueCb_() { absl::Time now = absl::Now(); ChangeTaskTimeConstraintQueueElem elem; while (m_task_time_constraint_change_queue_.try_dequeue(elem)) { // Delete the old timer. DelTerminationTimer_(); CRANE_TRACE("Delete the old timer."); absl::Time start_time = absl::FromUnixSeconds(m_step_.GetStep().start_time().seconds()); - absl::Duration const& new_time_limit = elem.time_limit.value(); + + if (!elem.time_limit.has_value() || !elem.deadline_time.has_value()) { + CRANE_ERROR("time_limit or deadline_time not set in constraint change"); + elem.ok_prom.set_value(CraneErrCode::ERR_INVALID_PARAM); + continue; + } + + absl::Duration const& new_time_limit = elem.time_limit.value(); + int64_t deadline_time = elem.deadline_time.value(); - if (now - start_time >= new_time_limit || - absl::ToUnixSeconds(now) >= elem.deadline_time.value()) { + bool time_limit_exceeded = now - start_time >= new_time_limit; + bool deadline_exceeded = absl::ToUnixSeconds(now) >= deadline_time; + + if (time_limit_exceeded || deadline_exceeded) { // If the task times out, terminate it. + TerminatedBy reason = deadline_exceeded + ? TerminatedBy::TERMINATION_BY_DEADLINE + : TerminatedBy::TERMINATION_BY_TIMEOUT; TaskTerminateQueueElem ev_task_terminate{ - .termination_reason = TerminatedBy::TERMINATION_BY_TIMEOUT}; + .termination_reason = reason}; m_task_terminate_queue_.enqueue(ev_task_terminate); m_terminate_task_async_handle_->send(); } else { // If the task haven't timed out, set up a new timer. - int64_t deadline_sec = - elem.deadline_time.value() - absl::ToUnixSeconds(absl::Now()); + int64_t deadline_sec = deadline_time - absl::ToUnixSeconds(absl::Now()); int64_t new_time_limit_sec = ToInt64Seconds((new_time_limit - (absl::Now() - start_time))); - bool is_deadline = deadline_sec <= new_time_limit_sec ? true : false; + + if (deadline_sec <= 0) { + // Deadline already passed, terminate immediately + TaskTerminateQueueElem ev_task_terminate{ + .termination_reason = TerminatedBy::TERMINATION_BY_DEADLINE}; + m_task_terminate_queue_.enqueue(ev_task_terminate); + m_terminate_task_async_handle_->send(); + elem.ok_prom.set_value(CraneErrCode::SUCCESS); + continue; + } + + bool is_deadline = deadline_sec <= new_time_limit_sec; std::string log_timer = deadline_sec <= new_time_limit_sec ? "deadline" : "time_limit"; AddTerminationTimer_(std::min(deadline_sec, new_time_limit_sec), is_deadline); CRANE_TRACE("Replace the old timer to {} timer of {} seconds", log_timer, std::min(deadline_sec, new_time_limit_sec)); } elem.ok_prom.set_value(CraneErrCode::SUCCESS); } }
♻️ Duplicate comments (1)
src/CraneCtld/CtldPublicDefs.cpp (1)
209-212: Reset deadline_time when field is absent to avoid stale carry-overIf the incoming proto omits deadline_time, the existing value persists on reused TaskInCtld objects. Reset to InfiniteFuture to represent “no deadline”.
- if (val.has_deadline_time()) { - deadline_time = absl::FromUnixSeconds(val.deadline_time().seconds()); - } + if (val.has_deadline_time()) { + deadline_time = absl::FromUnixSeconds(val.deadline_time().seconds()); + } else { + deadline_time = absl::InfiniteFuture(); + }
🧹 Nitpick comments (12)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
590-611: Preserve nanoseconds when converting Timestamp → absl::TimeCurrent code ignores nanos; keep full precision by adding nanos.
- absl::Time deadline_time = - absl::FromUnixSeconds(request->deadline_time().seconds()); + const auto& ts = request->deadline_time(); + absl::Time deadline_time = + absl::FromUnixSeconds(ts.seconds()) + absl::Nanoseconds(ts.nanos());protos/Crane.proto (2)
146-155: Clarify units and semantics for time constraintsSpecify in comments that:
- time_limit_seconds: wall-clock seconds (>=0).
- deadline_time: Unix epoch seconds (UTC).
This avoids ambiguity across languages and clients. Optionally consider google.protobuf.Timestamp for deadline_time here too for consistency with ModifyTaskRequest.
229-230: Consistent deadline representation across APIsModifyTaskRequest uses Timestamp while ChangeJobTimeConstraintRequest uses int64 seconds. It works, but mixed types increase conversion points. Consider unifying on Timestamp for both or documenting the rationale.
Also applies to: 245-246
src/CraneCtld/CtldPublicDefs.cpp (2)
309-310: Avoid serializing InfiniteFuture as a concrete timestampEmitting a max-ish Unix timestamp can confuse clients. Only set the field when finite; otherwise leave it unset/cleared.
- task_info->mutable_deadline_time()->set_seconds(ToUnixSeconds(deadline_time)); + if (absl::IsFinite(deadline_time)) { + task_info->mutable_deadline_time()->set_seconds( + ToUnixSeconds(deadline_time)); + } else { + task_info->clear_deadline_time(); + }
360-362: Likewise: don’t populate TaskToD.deadline_time for InfiniteFutureSkip setting when no deadline to avoid sending sentinel values.
- task_to_d.mutable_deadline_time()->set_seconds( - ToUnixSeconds(this->deadline_time)); + if (absl::IsFinite(this->deadline_time)) { + task_to_d.mutable_deadline_time()->set_seconds( + ToUnixSeconds(this->deadline_time)); + } else { + task_to_d.clear_deadline_time(); + }src/CraneCtld/DbClient.cpp (2)
1090-1110: Field array comment index is off; also consider centralizing field namesThe comment says “35 - 39” but there are 4 fields (indices 35–38). Cosmetic, but please fix. Also, field name lists are duplicated in two functions—consider a single constexpr array to avoid drift.
Apply this diff to fix the comment:
- // 35 - 39 + // 35 - 38 "mem_alloc", "device_map", "container", "deadline_time"
1179-1199: Duplicate field name list and off-by-one comment (Ctld variant)Same nits as the EmbeddedDb variant: fix the “35 - 39” comment and consider de-duplicating the field list.
Diff:
- // 35 - 39 + // 35 - 38 "mem_alloc", "device_map", "container", "deadline_time"src/Craned/Core/SupervisorKeeper.cpp (1)
84-103: Set a client-side RPC deadline (optional)To avoid hangs under network partitions, set a short deadline on ClientContext (e.g., 5–10s).
Example:
ClientContext context; context.set_deadline(absl::ToChronoTime(absl::Now() + absl::Seconds(10)));protos/Supervisor.proto (1)
104-107: Document field units and semanticsPlease specify units to prevent ambiguity across components.
Diff:
-message ChangeTaskTimeConstraintRequest{ - optional int64 time_limit_seconds = 1; - optional int64 deadline_time = 2; +message ChangeTaskTimeConstraintRequest{ + // Time limit duration in seconds. + optional int64 time_limit_seconds = 1; + // Absolute deadline, Unix epoch seconds (UTC). If unset, no deadline. + optional int64 deadline_time = 2; }src/CraneCtld/TaskScheduler.h (1)
488-491: API change looks good; add brief unit docsClarify units for secs (seconds) and deadline_time (absl::Time absolute).
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
266-269: Unify log message to avoid “Task/Job” mismatchMinor nit: the error log still says ChangeTaskTimeConstraintAsync while this client method is ChangeJobTimeConstraint. Suggest standardizing to “Job”.
- CRANE_ERROR("ChangeTaskTimeConstraintAsync to Craned {} failed: {} ", + CRANE_ERROR("ChangeJobTimeConstraint to Craned {} failed: {} ", m_craned_id_, status.error_message());src/CraneCtld/TaskScheduler.cpp (1)
572-591: Guard against “no deadline” default when pre-scanning pending tasksEnsure tasks without a deadline aren’t cancelled. If TaskInCtld::deadline_time isn’t guaranteed to be absl::InfiniteFuture(), add an explicit guard.
- for (const auto& [task_id, task] : m_pending_task_map_) { - if (task->deadline_time < absl_begin) { + for (const auto& [task_id, task] : m_pending_task_map_) { + if (task->deadline_time != absl::InfiniteFuture() && + task->deadline_time < absl_begin) { CRANE_DEBUG("Task #{} was added to cancel_tasks", task_id); cancel_tasks.push_back(task_id); } }Also consider <= to cancel exactly at the deadline boundary.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
protos/Crane.proto(4 hunks)protos/PublicDefs.proto(4 hunks)protos/Supervisor.proto(2 hunks)src/CraneCtld/CtldPublicDefs.cpp(3 hunks)src/CraneCtld/DbClient.cpp(10 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(1 hunks)src/CraneCtld/RpcService/CranedKeeper.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(2 hunks)src/CraneCtld/TaskScheduler.cpp(8 hunks)src/CraneCtld/TaskScheduler.h(3 hunks)src/Craned/Core/CranedServer.cpp(1 hunks)src/Craned/Core/CranedServer.h(1 hunks)src/Craned/Core/JobManager.cpp(3 hunks)src/Craned/Core/JobManager.h(4 hunks)src/Craned/Core/SupervisorKeeper.cpp(1 hunks)src/Craned/Core/SupervisorKeeper.h(1 hunks)src/Craned/Supervisor/SupervisorServer.cpp(1 hunks)src/Craned/Supervisor/SupervisorServer.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(8 hunks)src/Craned/Supervisor/TaskManager.h(6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- protos/PublicDefs.proto
- src/CraneCtld/RpcService/CranedKeeper.h
🧰 Additional context used
🧬 Code graph analysis (13)
src/Craned/Core/CranedServer.h (2)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
ChangeJobTimeConstraint(241-277)ChangeJobTimeConstraint(241-243)src/Craned/Core/CranedServer.cpp (2)
ChangeJobTimeConstraint(309-335)ChangeJobTimeConstraint(309-312)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
ChangeJobTimeConstraint(241-277)ChangeJobTimeConstraint(241-243)
src/Craned/Core/SupervisorKeeper.h (3)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1021-1136)ChangeTaskTimeConstraint(1021-1023)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(84-103)ChangeTaskTimeConstraint(84-86)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-86)ChangeTaskTimeConstraint(63-66)
src/Craned/Supervisor/SupervisorServer.h (3)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1021-1136)ChangeTaskTimeConstraint(1021-1023)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(84-103)ChangeTaskTimeConstraint(84-86)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-86)ChangeTaskTimeConstraint(63-66)
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
src/Craned/Core/CranedServer.cpp (2)
ChangeJobTimeConstraint(309-335)ChangeJobTimeConstraint(309-312)
src/CraneCtld/TaskScheduler.h (3)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1021-1136)ChangeTaskTimeConstraint(1021-1023)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(84-103)ChangeTaskTimeConstraint(84-86)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-86)ChangeTaskTimeConstraint(63-66)
src/Craned/Core/JobManager.h (2)
src/Craned/Core/JobManager.cpp (4)
ChangeJobTimeConstraintAsync(995-1009)ChangeJobTimeConstraintAsync(995-997)EvCleanChangeTaskTimeConstraintQueueCb_(1024-1052)EvCleanChangeTaskTimeConstraintQueueCb_(1024-1024)src/Craned/Supervisor/TaskManager.cpp (2)
EvCleanChangeTaskTimeConstraintQueueCb_(1932-1969)EvCleanChangeTaskTimeConstraintQueueCb_(1932-1932)
src/Craned/Supervisor/TaskManager.cpp (3)
src/Craned/Core/JobManager.cpp (2)
EvCleanChangeTaskTimeConstraintQueueCb_(1024-1052)EvCleanChangeTaskTimeConstraintQueueCb_(1024-1024)src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)src/Craned/Supervisor/TaskManager.h (2)
DelTerminationTimer_(336-342)AddTerminationTimer_(325-334)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
EvTaskTimerCb_(1769-1791)EvTaskTimerCb_(1769-1769)ChangeTaskTimeConstraintAsync(1662-1691)ChangeTaskTimeConstraintAsync(1662-1664)EvCleanTaskStopQueueCb_(1793-1887)EvCleanTaskStopQueueCb_(1793-1793)EvCleanTerminateTaskQueueCb_(1889-1930)EvCleanTerminateTaskQueueCb_(1889-1889)EvCleanChangeTaskTimeConstraintQueueCb_(1932-1969)EvCleanChangeTaskTimeConstraintQueueCb_(1932-1932)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Supervisor/TaskManager.cpp (2)
EvCleanChangeTaskTimeConstraintQueueCb_(1932-1969)EvCleanChangeTaskTimeConstraintQueueCb_(1932-1932)
src/Craned/Core/SupervisorKeeper.cpp (2)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1021-1136)ChangeTaskTimeConstraint(1021-1023)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-86)ChangeTaskTimeConstraint(63-66)
src/Craned/Supervisor/SupervisorServer.cpp (2)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1021-1136)ChangeTaskTimeConstraint(1021-1023)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(84-103)ChangeTaskTimeConstraint(84-86)
src/CraneCtld/TaskScheduler.cpp (3)
src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(84-103)ChangeTaskTimeConstraint(84-86)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-86)ChangeTaskTimeConstraint(63-66)src/CraneCtld/CtldPublicDefs.h (1)
CheckIfTimeLimitSecIsValid(705-707)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format
🔇 Additional comments (23)
protos/Crane.proto (1)
1014-1015: RPC rename and payloads look consistentChangeJobTimeConstraint RPC and messages align with server/client changes.
Please ensure all dependents regenerated protos (C++, Go, Python) after adding proto3 optional fields.
src/Craned/Core/CranedServer.cpp (1)
319-335: LGTM: correctly handles optional fieldsOptionals are parsed and forwarded to JobManager. No issues spotted.
src/Craned/Supervisor/SupervisorServer.h (1)
53-58: Signature update aligns with SupervisorServer.cppMethod rename and types match the implementation and proto.
src/CraneCtld/DbClient.cpp (2)
1120-1156: Persisting zero deadline_time: confirm semanticsZero seconds will be stored when deadline is unset. Ensure downstream treats 0 as “unset” and does not schedule/timer off epoch.
Would you prefer skipping the field entirely when unset? That would require deviating from the tuple builder; if desired, I can propose a small helper to omit optional fields.
1209-1237: Ctld variant: zero deadline_time semanticsSame as the EmbeddedDb path: confirm that 0 means “unset” across timers/scheduler and reporting. If not, consider omitting the field when unset.
src/CraneCtld/TaskScheduler.h (2)
540-543: Ensure cancel path uses finish_statusConfirm CancelTaskAsyncCb_ and downstream state transitions honor finish_status (e.g., Cancelled vs Deadline).
If not yet wired, I can sketch the minimal changes.
679-680: finish_status addition acknowledgedCarrying finish_status in the queue is a solid change for clarity of final state.
protos/Supervisor.proto (1)
131-131: Verify RPC rename and generated code
No instances ofChangeTaskTimeLimitor the newChangeTaskTimeConstraintwere found in generated code; manually confirm all call sites were updated and ensure your proto3 optional toolchain supportshas_*accessors.src/Craned/Core/CranedServer.h (1)
99-103: RPC rename and signature update look correctHeader matches the implementation in CranedServer.cpp and new proto types.
src/CraneCtld/TaskScheduler.cpp (1)
1124-1131: Good: forwards both constraints to craned with clear error pathForwarding both optionals and logging specific failure is correct.
src/Craned/Core/JobManager.h (3)
106-108: LGTM! Function signature extended for deadline support.The function signature correctly extends the API to support both time limit and deadline constraints using optional parameters.
145-150: LGTM! Data structure updated for constraint-based semantics.The struct is correctly renamed and updated with optional fields to support flexible time constraint modifications.
219-219: LGTM! Consistent renaming of internal members.All internal queue, handle, and callback names are consistently updated to reflect the constraint-based semantics.
Also applies to: 247-250
src/Craned/Supervisor/TaskManager.h (6)
34-35: LGTM! New termination reason added.The
TERMINATION_BY_DEADLINEenum value correctly extends the termination reasons to support deadline-driven task termination.
89-90: LGTM! Accessor method for mutable step data.The
GetMutableStep()method appropriately provides mutable access to the step specification for updating time constraints.
313-323: LGTM! Timer setup extended for deadline awareness.The template method correctly passes the
is_deadlineflag through the timer callback, enabling proper deadline vs. timeout handling.
325-334: LGTM! Overload consistently extended.The
int64_toverload is consistently updated with theis_deadlineparameter, maintaining API consistency.
357-359: LGTM! Public API and internal structures updated consistently.The function signatures, queue element structure, and callback declarations are all consistently updated to support time constraint modifications with optional time limit and deadline.
Also applies to: 379-383, 388-388, 392-392
410-413: LGTM! Internal member variables renamed consistently.The async handle and queue member variables are appropriately renamed to reflect the broader constraint-based semantics.
src/Craned/Supervisor/TaskManager.cpp (4)
1523-1528: LGTM! Async handle setup updated.The async handle is correctly renamed and registered with the updated callback function.
1662-1691: Function implementation is correct, but verify deadline handling in downstream code.The function correctly handles optional parameters and ensures both
time_limitanddeadline_timeare populated in the queue element. However, whendeadline_timeis not provided, it falls back tom_step_.GetStep().deadline_time().seconds(). If the step was initialized without a deadline, this may return 0 or an unset value, which could cause issues in downstream deadline calculations (see concerns in other review comments).
1769-1791: LGTM! Timer callback correctly handles deadline vs. timeout.The callback properly distinguishes between deadline expiration and timeout, setting appropriate termination reasons and status codes for both batch and non-batch tasks.
1841-1846: LGTM! Deadline termination handled in task stop callback.The code correctly handles
TERMINATION_BY_DEADLINEin both the signal and non-signal termination paths, setting the appropriate status and exit code.Also applies to: 1861-1866
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/CraneCtld/DbClient.cpp (1)
382-383: Guard reading deadline_time to avoid exceptions on legacy docsOld records may not have deadline_time; direct get_int64() risks exceptions and aborts the loop. Guard presence and type.
Apply this diff:
- task->mutable_deadline_time()->set_seconds( - view["deadline_time"].get_int64().value); + if (auto e = view["deadline_time"]) { + int64_t secs = 0; + if (e.type() == bsoncxx::type::k_int64) secs = e.get_int64().value; + else if (e.type() == bsoncxx::type::k_int32) secs = e.get_int32().value; + if (secs > 0) task->mutable_deadline_time()->set_seconds(secs); + }
🧹 Nitpick comments (1)
src/Craned/Supervisor/TaskManager.h (1)
379-383: Verify optional field handling in dequeue logic.The struct correctly uses
std::optionalfor both constraints. However, inEvCleanChangeTaskTimeConstraintQueueCb_(TaskManager.cpp:1945), the code callselem.deadline_time.value()without checking, relying onChangeTaskTimeConstraintAsyncalways populating optionals from existing step values. While this works currently, it creates a fragile dependency between the enqueue and dequeue logic.Consider adding defensive checks or assertions in the dequeue handler:
void TaskManager::EvCleanChangeTaskTimeConstraintQueueCb_() { ChangeTaskTimeConstraintQueueElem elem; while (m_task_time_constraint_change_queue_.try_dequeue(elem)) { // Add defensive check CRANE_ASSERT(elem.time_limit.has_value() && elem.deadline_time.has_value(), "Both time_limit and deadline_time must be set in queue elem"); // ... rest of implementation } }Or explicitly document the invariant that both fields are always populated when enqueued.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
protos/Crane.proto(4 hunks)protos/Supervisor.proto(2 hunks)src/CraneCtld/DbClient.cpp(10 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(2 hunks)src/Craned/Supervisor/TaskManager.h(6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/Craned/Supervisor/TaskManager.h (2)
src/Craned/Supervisor/TaskManager.cpp (10)
EvTaskTimerCb_(1769-1791)EvTaskTimerCb_(1769-1769)ChangeTaskTimeConstraintAsync(1662-1691)ChangeTaskTimeConstraintAsync(1662-1664)EvCleanTaskStopQueueCb_(1793-1887)EvCleanTaskStopQueueCb_(1793-1793)EvCleanTerminateTaskQueueCb_(1889-1930)EvCleanTerminateTaskQueueCb_(1889-1889)EvCleanChangeTaskTimeConstraintQueueCb_(1932-1969)EvCleanChangeTaskTimeConstraintQueueCb_(1932-1932)src/Craned/Core/JobManager.cpp (4)
EvCleanTerminateTaskQueueCb_(919-981)EvCleanTerminateTaskQueueCb_(919-919)EvCleanChangeTaskTimeConstraintQueueCb_(1024-1052)EvCleanChangeTaskTimeConstraintQueueCb_(1024-1024)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (15)
src/CraneCtld/DbClient.cpp (3)
314-315: LGTM!Comment accurately reflects the addition of the deadline_time field.
1094-1156: LGTM!Field arrays and value tuples correctly extended to include deadline_time. The implementation properly serializes deadline_time to MongoDB documents for TaskInEmbeddedDb.
1183-1238: LGTM!Field arrays and value tuples correctly extended to include deadline_time. The implementation properly serializes deadline_time to MongoDB documents for TaskInCtld.
src/CraneCtld/RpcService/CtldGrpcServer.cpp (2)
532-534: LGTM!The type mismatch issue has been resolved. The code now correctly constructs std::optional<int64_t> from request->time_limit_seconds().
589-610: LGTM!The new Deadline handling branch is well-implemented. The logic correctly:
- Converts proto Timestamp to absl::Time
- Calls ChangeTaskTimeConstraint with appropriate optional parameters
- Handles all error cases consistently with existing TimeLimit handling
protos/Supervisor.proto (1)
104-111: LGTM!Proto changes correctly rename the RPC and extend the request to support both time_limit_seconds and deadline_time as optional fields. This allows flexible constraint updates.
Also applies to: 131-131
protos/Crane.proto (3)
146-154: LGTM!Message definitions are correctly structured with appropriate field numbering and optional qualifiers for flexible constraint updates.
229-229: LGTM!The Deadline enum value and deadline_time field are correctly added without numbering conflicts.
Also applies to: 245-245
1014-1014: LGTM!RPC signature correctly updated to use the new ChangeJobTimeConstraint message types.
src/Craned/Supervisor/TaskManager.h (6)
34-35: LGTM! Enum extension is consistent.The addition of
TERMINATION_BY_DEADLINEfollows the existing pattern and integrates well with the termination handling logic shown in the implementation.
313-334: LGTM! Timer overloads properly handle deadline distinction.The
is_deadlineparameter successfully addresses the past review comment about distinguishing deadline from time_limit timers. The lambda capture by value is appropriate for the boolean type, and the unified callback approach is cleaner than maintaining separate timer handlers.Based on past review comments.
357-359: LGTM! API rename and optional parameters improve flexibility.The rename from
ChangeTaskTimeLimitAsynctoChangeTaskTimeConstraintAsyncmore accurately reflects the method's purpose. The optional parameters allow callers to update either time_limit, deadline_time, or both independently, which aligns with the PR objective of keeping deadline as an independent field.
388-388: LGTM! Timer callback signature consistent with design.The
is_deadlineparameter addition properly distinguishes the termination reason, enabling correct status and exit code routing in the implementation (TaskManager.cpp:1768-1790).
392-392: LGTM! Method rename maintains consistency.The rename from
EvCleanChangeTaskTimeLimitQueueCb_toEvCleanChangeTaskTimeConstraintQueueCb_maintains naming consistency with the struct and async method renames.
410-413: LGTM! Member variable renames complete the refactoring.The handle and queue renames align with the constraint-based semantics and maintain consistency across the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
1934-1964: Guard deadline/time-limit timers against missing or past deadlines.The consumer side still assumes
elem.deadline_timealways carries a value and immediately dereferences it. Combined with the previous coercion, any job without a real deadline is terminated. Even after fixing the enqueue path, we still underflow timers when the deadline is in the past (negative seconds wrap into huge unsigned delays) and we continue to report deadline expirations as generic timeouts. Please treatdeadline_timeas optional, short‑circuit if the deadline is already exceeded, clamp negative durations to zero, and only mark a timer as deadline-driven when a real deadline exists.- if (now - start_time >= new_time_limit || - absl::ToUnixSeconds(now) >= elem.deadline_time.value()) { + const bool has_deadline = elem.deadline_time.has_value(); + if (now - start_time >= new_time_limit || + (has_deadline && + absl::ToUnixSeconds(now) >= elem.deadline_time.value())) { // If the task times out, terminate it. - TaskTerminateQueueElem ev_task_terminate{ - .termination_reason = TerminatedBy::TERMINATION_BY_TIMEOUT}; + TaskTerminateQueueElem ev_task_terminate{ + .termination_reason = has_deadline + ? TerminatedBy::TERMINATION_BY_DEADLINE + : TerminatedBy::TERMINATION_BY_TIMEOUT}; m_task_terminate_queue_.enqueue(ev_task_terminate); m_terminate_task_async_handle_->send(); } else { - // If the task haven't timed out, set up a new timer. - int64_t deadline_sec = - elem.deadline_time.value() - absl::ToUnixSeconds(absl::Now()); - int64_t new_time_limit_sec = - ToInt64Seconds((new_time_limit - (absl::Now() - start_time))); - bool is_deadline = deadline_sec <= new_time_limit_sec ? true : false; - std::string log_timer = - deadline_sec <= new_time_limit_sec ? "deadline" : "time_limit"; - AddTerminationTimer_(std::min(deadline_sec, new_time_limit_sec), - is_deadline); - CRANE_TRACE("Replace the old timer to {} timer of {} seconds", log_timer, - std::min(deadline_sec, new_time_limit_sec)); + // If the task hasn't timed out, set up a new timer. + int64_t new_time_limit_sec = + ToInt64Seconds(new_time_limit - (absl::Now() - start_time)); + new_time_limit_sec = std::max<int64_t>(new_time_limit_sec, 0); + + int64_t deadline_sec = std::numeric_limits<int64_t>::max(); + if (has_deadline) { + deadline_sec = + elem.deadline_time.value() - absl::ToUnixSeconds(absl::Now()); + if (deadline_sec <= 0) { + TaskTerminateQueueElem ev_task_terminate{ + .termination_reason = TerminatedBy::TERMINATION_BY_DEADLINE}; + m_task_terminate_queue_.enqueue(ev_task_terminate); + m_terminate_task_async_handle_->send(); + elem.ok_prom.set_value(CraneErrCode::SUCCESS); + continue; + } + } + + bool is_deadline = has_deadline && deadline_sec <= new_time_limit_sec; + int64_t timer_sec = is_deadline ? deadline_sec : new_time_limit_sec; + AddTerminationTimer_(timer_sec, is_deadline); + CRANE_TRACE("Replace the old timer to {} timer of {} seconds", + is_deadline ? "deadline" : "time_limit", timer_sec); }
♻️ Duplicate comments (2)
src/Craned/Core/SupervisorKeeper.h (1)
37-39: Add missing#include <optional>for std::optional.This declaration uses
std::optionalbut the header doesn't include<optional>. While it may work due to transitive includes, this is fragile and non-portable.Apply this diff:
#include "CranedPublicDefs.h" // Precompiled header comes first. #include <protos/Supervisor.grpc.pb.h> +#include <optional> #include "JobManager.h"Additionally, consider adding brief parameter documentation:
// time_limit_seconds: optional time limit in seconds // deadline_time: optional deadline as Unix epoch seconds CraneErrCode ChangeTaskTimeConstraint( std::optional<int64_t> time_limit_seconds, std::optional<int64_t> deadline_time);src/Craned/Supervisor/TaskManager.cpp (1)
1983-1993: Clamp negative timer intervals before arming uv timers.When the supervisor starts a task with a deadline already in the past (or no deadline at all),
deadline_secbecomes negative, andAddTerminationTimer_(std::min(deadline_sec, time_limit_sec), …)feeds a negative duration into libuv. That underflows into a huge unsigned timeout, so the deadline callback never runs. Please short‑circuit expired deadlines and only arm the timer with non‑negative intervals, falling back to the time limit when no deadline is set.- int64_t time_limit_sec = m_step_.GetStep().time_limit().seconds(); - int64_t deadline_sec = m_step_.GetStep().deadline_time().seconds() - - absl::ToUnixSeconds(absl::Now()); - - std::string log_timer = - deadline_sec <= time_limit_sec ? "deadline" : "time_limit"; - bool is_deadline = deadline_sec <= time_limit_sec ? true : false; - - AddTerminationTimer_(std::min(deadline_sec, time_limit_sec), is_deadline); - CRANE_TRACE("Add a {} timer of {} seconds", log_timer, - std::min(deadline_sec, time_limit_sec)); + int64_t time_limit_sec = + std::max<int64_t>(m_step_.GetStep().time_limit().seconds(), 0); + + if (m_step_.GetStep().has_deadline_time()) { + int64_t deadline_sec = m_step_.GetStep().deadline_time().seconds() - + absl::ToUnixSeconds(absl::Now()); + if (deadline_sec <= 0) { + CRANE_TRACE("Task deadline already reached; terminating immediately."); + EvTaskTimerCb_(true); + } else { + bool is_deadline = deadline_sec <= time_limit_sec; + int64_t timer_sec = is_deadline ? deadline_sec : time_limit_sec; + AddTerminationTimer_(timer_sec, is_deadline); + CRANE_TRACE("Add a {} timer of {} seconds", + is_deadline ? "deadline" : "time_limit", timer_sec); + } + } else { + AddTerminationTimer_(time_limit_sec, false); + CRANE_TRACE("Add a time_limit timer of {} seconds", time_limit_sec); + }
🧹 Nitpick comments (1)
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
592-615: Consider extracting common error handling logic.The Deadline branch duplicates error handling logic from the TimeLimit branch (lines 538-553). While functionally correct, this creates maintenance burden.
Consider extracting the common response logic into a helper lambda:
auto handle_result = [&](CraneErrCode err, task_id_t task_id, const char* constraint_name) { if (err == CraneErrCode::SUCCESS) { response->add_modified_tasks(task_id); } else if (err == CraneErrCode::ERR_NON_EXISTENT) { response->add_not_modified_tasks(task_id); response->add_not_modified_reasons(fmt::format( "Task #{} was not found in running or pending queue.", task_id)); } else if (err == CraneErrCode::ERR_INVALID_PARAM) { response->add_not_modified_tasks(task_id); response->add_not_modified_reasons(fmt::format("Invalid {} value.", constraint_name)); } else { response->add_not_modified_tasks(task_id); response->add_not_modified_reasons( fmt::format("Failed to change the {} of Task#{}: {}.", constraint_name, task_id, CraneErrStr(err))); } };Then use it in both branches to reduce duplication.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
src/CraneCtld/RpcService/CranedKeeper.cpp(1 hunks)src/CraneCtld/RpcService/CranedKeeper.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(2 hunks)src/CraneCtld/TaskScheduler.cpp(8 hunks)src/CraneCtld/TaskScheduler.h(3 hunks)src/Craned/Core/CranedServer.cpp(1 hunks)src/Craned/Core/JobManager.cpp(3 hunks)src/Craned/Core/JobManager.h(4 hunks)src/Craned/Core/SupervisorKeeper.cpp(1 hunks)src/Craned/Core/SupervisorKeeper.h(1 hunks)src/Craned/Supervisor/SupervisorServer.cpp(1 hunks)src/Craned/Supervisor/TaskManager.cpp(8 hunks)src/Craned/Supervisor/TaskManager.h(6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/Craned/Core/CranedServer.cpp
- src/CraneCtld/RpcService/CranedKeeper.h
🧰 Additional context used
🧬 Code graph analysis (9)
src/CraneCtld/TaskScheduler.h (3)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1023-1150)ChangeTaskTimeConstraint(1023-1025)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(84-103)ChangeTaskTimeConstraint(84-86)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-85)ChangeTaskTimeConstraint(63-66)
src/Craned/Supervisor/SupervisorServer.cpp (2)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1023-1150)ChangeTaskTimeConstraint(1023-1025)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(84-103)ChangeTaskTimeConstraint(84-86)
src/CraneCtld/TaskScheduler.cpp (3)
src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(84-103)ChangeTaskTimeConstraint(84-86)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-85)ChangeTaskTimeConstraint(63-66)src/CraneCtld/CtldPublicDefs.h (1)
CheckIfTimeLimitSecIsValid(705-707)
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
src/Craned/Core/CranedServer.cpp (2)
ChangeJobTimeConstraint(309-334)ChangeJobTimeConstraint(309-312)
src/Craned/Core/SupervisorKeeper.h (3)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1023-1150)ChangeTaskTimeConstraint(1023-1025)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(84-103)ChangeTaskTimeConstraint(84-86)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-85)ChangeTaskTimeConstraint(63-66)
src/Craned/Core/JobManager.h (1)
src/Craned/Core/JobManager.cpp (4)
ChangeJobTimeConstraintAsync(995-1010)ChangeJobTimeConstraintAsync(995-997)EvCleanChangeJobTimeConstraintQueueCb_(1025-1053)EvCleanChangeJobTimeConstraintQueueCb_(1025-1025)
src/Craned/Core/SupervisorKeeper.cpp (2)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1023-1150)ChangeTaskTimeConstraint(1023-1025)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-85)ChangeTaskTimeConstraint(63-66)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
EvTaskTimerCb_(1768-1790)EvTaskTimerCb_(1768-1768)ChangeTaskTimeConstraintAsync(1662-1690)ChangeTaskTimeConstraintAsync(1662-1663)EvCleanTaskStopQueueCb_(1792-1886)EvCleanTaskStopQueueCb_(1792-1792)EvCleanTerminateTaskQueueCb_(1888-1929)EvCleanTerminateTaskQueueCb_(1888-1888)EvCleanChangeTaskTimeConstraintQueueCb_(1931-1968)EvCleanChangeTaskTimeConstraintQueueCb_(1931-1931)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)src/Craned/Supervisor/TaskManager.h (2)
DelTerminationTimer_(336-342)AddTerminationTimer_(325-334)
🔇 Additional comments (9)
src/CraneCtld/TaskScheduler.h (2)
488-490: LGTM!The method signature correctly accepts optional parameters for both time_limit_seconds and deadline_time, allowing callers to modify either constraint independently.
540-542: LGTM!Adding
finish_statustoCancelPendingTaskQueueElemproperly distinguishes between cancellation and deadline expiry, enabling correct task status reporting.Also applies to: 677-679
src/Craned/Core/SupervisorKeeper.cpp (1)
84-103: LGTM!The implementation correctly handles optional parameters by conditionally setting request fields only when values are present. Error handling and RPC semantics are preserved.
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
531-537: LGTM!The TimeLimit handling correctly constructs the optional and calls
ChangeTaskTimeConstraintwith the appropriate nullopt for deadline_time.src/Craned/Supervisor/SupervisorServer.cpp (1)
63-84: LGTM!The RPC handler correctly extracts optional time_limit_seconds and deadline_time from the request and passes them to the async task manager method. Response handling is appropriate.
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
241-277: LGTM!The method correctly handles optional parameters by conditionally setting request fields only when values are present. Error messages have been updated to reflect the new method name, and the RPC semantics are preserved.
src/Craned/Core/JobManager.cpp (3)
99-104: LGTM!The async handle has been appropriately renamed from
m_change_task_time_limit_async_handle_tom_change_job_time_constraint_async_handle_to reflect the updated terminology.
995-1010: LGTM!The method signature correctly accepts optional parameters for both time_limit_seconds and deadline_time, enabling independent modification of either constraint. The queue element structure has been updated accordingly.
1025-1053: LGTM!The callback correctly processes the updated queue element structure, passes optional parameters to the supervisor stub, and provides appropriate error messages. The terminology has been updated to consistently use "job" rather than "task".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/Craned/Supervisor/TaskManager.cpp (1)
1934-1971: Unsafe access of optional values without presence checks.Multiple locations access
.value()onstd::optional<int64_t>fields without verifying they contain a value:
- Line 1945:
elem.time_limit.value()- Line 1948:
elem.deadline_time.value()- Line 1958:
elem.deadline_time.value()If either field is
std::nullopt, this throwsstd::bad_optional_access. The logic must checkhas_value()before dereferencing, and handle the case where no deadline or time limit is set.Apply this diff to safely handle optional values:
DelTerminationTimer_(); absl::Time start_time = absl::FromUnixSeconds(m_step_.GetStep().start_time().seconds()); - absl::Duration const& new_time_limit = - absl::Seconds(elem.time_limit.value()); - - if (now - start_time >= new_time_limit || - absl::ToUnixSeconds(now) >= elem.deadline_time.value()) { + + if (!elem.time_limit.has_value()) { + CRANE_ERROR("time_limit is not set in ChangeTaskTimeConstraint request"); + elem.ok_prom.set_value(CraneErrCode::ERR_INVALID_PARAM); + continue; + } + + absl::Duration const& new_time_limit = + absl::Seconds(elem.time_limit.value()); + + bool timed_out = (now - start_time >= new_time_limit); + if (elem.deadline_time.has_value()) { + int64_t deadline_sec = + elem.deadline_time.value() - absl::ToUnixSeconds(now); + timed_out |= (deadline_sec <= 0); + } + + if (timed_out) { // If the task times out, terminate it. TaskTerminateQueueElem ev_task_terminate{ .termination_reason = TerminatedBy::TERMINATION_BY_TIMEOUT}; m_task_terminate_queue_.enqueue(ev_task_terminate); m_terminate_task_async_handle_->send(); } else { // If the task haven't timed out, set up a new timer. - int64_t deadline_sec = - elem.deadline_time.value() - absl::ToUnixSeconds(absl::Now()); int64_t new_time_limit_sec = ToInt64Seconds((new_time_limit - (absl::Now() - start_time))); - bool is_deadline = deadline_sec <= new_time_limit_sec ? true : false; - std::string log_timer = - deadline_sec <= new_time_limit_sec ? "deadline" : "time_limit"; - AddTerminationTimer_(std::min(deadline_sec, new_time_limit_sec), - is_deadline); - CRANE_TRACE("Replace the old timer to {} timer of {} seconds", log_timer, - std::min(deadline_sec, new_time_limit_sec)); + + int64_t timer_sec = new_time_limit_sec; + bool is_deadline = false; + + if (elem.deadline_time.has_value()) { + int64_t deadline_sec = + elem.deadline_time.value() - absl::ToUnixSeconds(absl::Now()); + if (deadline_sec > 0 && deadline_sec < new_time_limit_sec) { + timer_sec = deadline_sec; + is_deadline = true; + } + } + + AddTerminationTimer_(timer_sec, is_deadline); + CRANE_TRACE("Replace the old timer to {} timer of {} seconds", + is_deadline ? "deadline" : "time_limit", timer_sec); } elem.ok_prom.set_value(CraneErrCode::SUCCESS); }
♻️ Duplicate comments (2)
src/Craned/Supervisor/TaskManager.cpp (2)
1678-1685: Check proto field presence before reading optional values.When the caller omits
deadline_time, line 1684 readsdeadline_time().seconds()which defaults to0if the proto field is unset. This converts a missing deadline into an epoch timestamp (unix time 0), causing the consumer to believe an expired deadline exists. Similarly, line 1675 may read an unsettime_limitas0.Apply this diff to preserve
std::nulloptwhen fields are unset:if (time_limit) { elem.time_limit = time_limit; m_step_.GetMutableStep().mutable_time_limit()->set_seconds( time_limit.value()); - } else { + } else if (m_step_.GetStep().has_time_limit()) { elem.time_limit = std::optional<int64_t>(m_step_.GetStep().time_limit().seconds()); + } else { + elem.time_limit = std::nullopt; } if (deadline_time) { elem.deadline_time = deadline_time; m_step_.GetMutableStep().mutable_deadline_time()->set_seconds( deadline_time.value()); - } else { + } else if (m_step_.GetStep().has_deadline_time()) { elem.deadline_time = std::optional<int64_t>(m_step_.GetStep().deadline_time().seconds()); + } else { + elem.deadline_time = std::nullopt; + m_step_.GetMutableStep().clear_deadline_time(); }
1985-1996: Missing deadline presence check causes incorrect timer logic.Lines 1987-1988 read
deadline_time().seconds()without checkinghas_deadline_time()first. If no deadline is configured,seconds()returns0, makingdeadline_secdeeply negative (current time minus epoch). This causes line 1991 to incorrectly treat the task as having an expired deadline, and line 1994 passes a negative value toAddTerminationTimer_.Apply this diff to check deadline presence and handle missing/expired deadlines:
// Add a timer to limit the execution time of a task. // Note: event_new and event_add in this function is not thread safe, // so we move it outside the multithreading part. int64_t time_limit_sec = m_step_.GetStep().time_limit().seconds(); - int64_t deadline_sec = m_step_.GetStep().deadline_time().seconds() - - absl::ToUnixSeconds(absl::Now()); - - std::string log_timer = - deadline_sec <= time_limit_sec ? "deadline" : "time_limit"; - bool is_deadline = deadline_sec <= time_limit_sec ? true : false; - - AddTerminationTimer_(std::min(deadline_sec, time_limit_sec), is_deadline); - CRANE_TRACE("Add a {} timer of {} seconds", log_timer, - std::min(deadline_sec, time_limit_sec)); + + if (m_step_.GetStep().has_deadline_time()) { + int64_t deadline_sec = m_step_.GetStep().deadline_time().seconds() - + absl::ToUnixSeconds(absl::Now()); + + if (deadline_sec <= 0) { + CRANE_TRACE("Task deadline already reached; terminating immediately."); + EvTaskTimerCb_(true); + elem.ok_prom.set_value(CraneErrCode::SUCCESS); + continue; + } else if (deadline_sec <= time_limit_sec) { + AddTerminationTimer_(deadline_sec, true); + CRANE_TRACE("Add a deadline timer of {} seconds", deadline_sec); + } else { + AddTerminationTimer_(time_limit_sec, false); + CRANE_TRACE("Add a time_limit timer of {} seconds", time_limit_sec); + } + } else { + AddTerminationTimer_(time_limit_sec, false); + CRANE_TRACE("Add a time_limit timer of {} seconds", time_limit_sec); + }
🧹 Nitpick comments (9)
docs/command/cqueue.md (4)
26-26: Default column vs option inconsistency: is Deadline always shown?You’ve added Deadline to the default “主要输出项”. If Deadline is always in default output, the --deadline option becomes redundant. Please confirm intended behavior and make docs consistent (either remove from default list, or redefine the option).
Possible resolution A (keep as default, note redundancy):
- Add a note near --deadline: “默认已显示,无需指定;仅用于与 --format 组合控制列顺序。”
Possible resolution B (opt-in like -S/--start):
- Remove “Deadline” from the default list and keep --deadline as the toggle to include the column.
31-31: Clarify what --deadline does (toggle column vs filter vs range).“显示作业的截止时间” is vague. If it toggles an extra column (like -S/--start), say so and note interaction with -o/--format. If it filters by a time or range, document accepted syntax and examples.
Apply one of these doc tweaks:
- - **--deadline** **string**:显示作业的截止时间 + - **--deadline**:在输出中增加“Deadline”列(等同于在 --format 中添加 %Deadline,可与 -o 组合控制位置)。默认结果中已包含该列则无需指定。or, if it’s a filter:
- - **--deadline** **string**:显示作业的截止时间 + - **--deadline** **string**:按截止时间筛选,支持: + - 精确时间:2025-10-20T12:00:00 + - 时间范围:start~end(ISO 8601),如 2025-10-01T00:00:00~2025-10-31T23:59:59 + - 仅上/下界:~end 或 start~
36-36: Specify format/timezone and empty-display policy for %deadline/%Deadline.Document the output format (e.g., ISO 8601, local timezone vs UTC), and what prints when a job has no deadline (e.g., “-” or empty).
Example addition:
- 输出格式:YYYY-MM-DDTHH:MM:SS(本地时区,24 小时制)
- 无截止时间:显示为 “-”
28-28: Fix heading level to satisfy markdownlint (MD001).Heading jumps from H1 to H4. Use H2 and avoid emphasis for headings.
-#### **主要参数** +## 主要参数Based on static analysis hints.
docs/command/cbatch.md (5)
41-41: Document accepted time formats/timezone for --begin.Specify input formats (e.g., ISO 8601 “YYYY-MM-DDTHH:MM:SS”), timezone (local vs UTC), and behavior if the time is in the past.
Recommended text:
- -b/--begin string:指定作业的最早开始时间,格式支持:
- 绝对时间:2025-10-20T12:00:00(本地时区,24 小时制)
- 相对时间(若支持):now+2h、now+30m
- 过去时间:若早于当前时间,则视为尽快开始
45-45: Clarify semantics and precedence of --deadline vs --time.Document input format/timezone, and how it interacts with --time. Typical expectation: job is terminated at the earlier of (start_time + time_limit) and deadline. Also state whether deadline can be modified after submission.
Proposed wording:
- --deadline string:指定作业的截止时间(绝对时间,ISO 8601,如 2025-10-20T12:00:00,本地时区)。当同时设置 --time 时,系统在以下时间点中较早者终止作业:开始时间+时间限制、截止时间。提交后是否允许修改截止时间:请在文档中明确(建议:不可修改)。
48-52: Add concise examples for new flags (--export, --extra-attr, --get-user-env, --gres).Examples reduce ambiguity on syntax and quoting.
Example block to append under “例”:
- --export 示例:--export VAR1=foo,VAR2=bar 或 --export ALL
- --extra-attr 示例:--extra-attr '{"project":"A","ticket":"T-123"}'
- --get-user-env 示例:--get-user-env
- --gres 示例:--gres gpu:a100:1 或 --gres gpu:1
45-45: Mail and state mapping note for deadline terminations.If deadline-triggered termination maps to TIMEOUT (same as timelimit), mention that it triggers --mail-type=TIMELIMIT and yields一致的作业状态/退出码说明。
Add a short note near --deadline or --mail-type: “达到截止时间的终止将以 TIMEOUT 处理(与 timelimit 一致),会触发 --mail-type=TIMELIMIT 通知。若系统定义了专用退出码,请在文档中列出。”
37-39: Avoid “bold as heading” (MD036).Change bolded “主要参数:” to a proper heading to satisfy markdownlint and keep style consistent.
-**主要参数:** +## 主要参数Based on static analysis hints.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
docs/command/cacct.md(1 hunks)docs/command/cbatch.md(1 hunks)docs/command/ccontrol.md(2 hunks)docs/command/cqueue.md(1 hunks)src/Craned/Supervisor/TaskManager.cpp(9 hunks)
✅ Files skipped from review due to trivial changes (2)
- docs/command/cacct.md
- docs/command/ccontrol.md
🧰 Additional context used
🧬 Code graph analysis (1)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)src/Craned/Supervisor/TaskManager.h (2)
DelTerminationTimer_(336-342)AddTerminationTimer_(325-334)
🪛 markdownlint-cli2 (0.18.1)
docs/command/cbatch.md
45-45: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
docs/command/cqueue.md
28-28: Heading levels should only increment by one level at a time
Expected: h2; Actual: h4
(MD001, heading-increment)
🔇 Additional comments (5)
src/Craned/Supervisor/TaskManager.cpp (5)
1523-1528: LGTM!The renaming from
time_limittotime_constraintaccurately reflects the expanded API that now handles both time limits and deadlines.
1768-1790: LGTM!The timer callback correctly distinguishes deadline expiration from time-limit timeout, mapping each to the appropriate status and exit code for both batch and non-batch tasks.
1840-1845: LGTM!The deadline termination case is correctly integrated into the status change logic, with appropriate status mapping and exit code for signaled tasks.
1860-1865: LGTM!Deadline termination is correctly handled for non-signaled tasks, following the same pattern as OOM detection.
1912-1914: LGTM!The deadline termination reason is correctly propagated to the task instance, consistent with other termination types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/CraneCtld/TaskScheduler.cpp(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/CraneCtld/TaskScheduler.cpp (3)
src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(84-103)ChangeTaskTimeConstraint(84-86)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-85)ChangeTaskTimeConstraint(63-66)src/CraneCtld/CtldPublicDefs.h (1)
CheckIfTimeLimitSecIsValid(705-707)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/Craned/Common/CgroupManager.cpp (1)
569-597: Correct the line numbers and verify the UNDEFINED cgroup version case requires explicit handling.The review identifies a real issue, but line numbers are incorrect. The problematic code is at lines 1579–1594 and 1610–1625 in
DedicatedResourceAllocator::Allocate(both overloads), not 569–597.The concern is valid: When
SetDeviceAccessfails andCgroupManager::GetCgroupVersion()returnsCGROUP_V2::UNDEFINED, the function returnstrueat line 1593 (and 1624), silently ignoring the device access failure. TheUNDEFINEDstate is set when the system runs in hybrid cgroup mode (detected at line 52 during initialization), which is a valid and supported scenario.Fix: Add explicit handling for the
UNDEFINEDcase. Either:
- Add an
else if (CgroupManager::GetCgroupVersion() == CgConstant::CgroupVersion::UNDEFINED) { CRANE_WARN("..."); return false; }before the finalreturn true;- Change the final
return true;toreturn false;as a safe default for unsupported versionsApply to both
DedicatedResourceAllocator::Allocateoverloads at lines 1579–1594 and 1610–1625.src/Utilities/PublicHeader/include/crane/String.h (1)
19-39: Missing standard headers for used typesThis header uses std::array, std::tuple, std::pair, std::vector but doesn’t include their headers. Add to prevent ODR/ABI surprises.
Apply:
#include <charconv> #include <expected> #include <filesystem> #include <fstream> #include <list> +#include <array> +#include <tuple> +#include <utility> +#include <vector> #include <ranges> #include <string>
♻️ Duplicate comments (6)
src/CraneCtld/TaskScheduler.cpp (1)
1094-1105: Fix reservation end_time calculation to use new constraint values.When both
time_limit_secondsandabsl_deadline_timeare provided, the current logic uses a mix of old and new values:
- Lines 1094-1099 use new
time_limit_secondswith oldtask->deadline_time- Lines 1101-1105 use old
task->time_limitwith newabsl_deadline_timeSince task fields are only updated later (lines 1117-1127), the reservation
end_timebookkeeping becomes inconsistent with the actual constraints.Compute the effective new values locally first, then use them for the
std::mincalculations:+ absl::Duration effective_time_limit = + time_limit_seconds ? absl::Seconds(*time_limit_seconds) + : task->time_limit; + absl::Time effective_deadline = + absl_deadline_time ? *absl_deadline_time : task->deadline_time; + + reservation_meta->logical_part.rn_task_res_map[task_id].end_time = + std::min(task->StartTime() + effective_time_limit, + effective_deadline); - if (time_limit_seconds) { - reservation_meta->logical_part.rn_task_res_map[task_id].end_time = - std::min(task->StartTime() + - absl::Seconds(time_limit_seconds.value()), - task->deadline_time); - } - - if (absl_deadline_time) { - reservation_meta->logical_part.rn_task_res_map[task_id].end_time = - std::min(task->StartTime() + task->time_limit, - absl_deadline_time.value()); - }Based on learnings
src/Craned/Supervisor/TaskManager.cpp (4)
1802-1803: Remove dead function EvDeadlineTaskTimerCb_Deadline handling has been merged into EvTaskTimerCb_(bool). Drop this unused stub.
-void TaskManager::EvDeadlineTaskTimerCb_() {}
1949-1992: Fix broken logic: duplicate else, unsafe .value(), wrong termination reason, negative intervals
- There’s a stray extra else causing a syntax error.
- Uses .value() on optional deadline without presence check.
- Immediate-expiry always enqueues TIMEOUT; should enqueue DEADLINE when the deadline passed.
- No clamping for negative intervals.
Replace the block with guarded, unified logic:
- absl::Time start_time = - absl::FromUnixSeconds(m_step_.GetStep().start_time().seconds()); - absl::Duration const& new_time_limit = - absl::Seconds(elem.time_limit.value()); - - if (now - start_time >= new_time_limit || - absl::ToUnixSeconds(now) >= elem.deadline_time.value()) { - // If the task times out, terminate it. - TaskTerminateQueueElem ev_task_terminate{ - .termination_reason = TerminatedBy::TERMINATION_BY_TIMEOUT}; - m_task_terminate_queue_.enqueue(ev_task_terminate); - m_terminate_task_async_handle_->send(); - - } else { - // If the task haven't timed out, set up a new timer. - int64_t deadline_sec = - elem.deadline_time.value() - absl::ToUnixSeconds(absl::Now()); - int64_t new_time_limit_sec = - ToInt64Seconds((new_time_limit - (absl::Now() - start_time))); - if (deadline_sec <= new_time_limit_sec) { - AddTerminationTimer_(deadline_sec, true); - } else { - AddTerminationTimer_(new_time_limit_sec, false); - } - } else { - // If the task haven't timed out, set up a new timer. - int64_t deadline_sec = - elem.deadline_time.value() - absl::ToUnixSeconds(absl::Now()); - int64_t new_time_limit_sec = - ToInt64Seconds((new_time_limit - (absl::Now() - start_time))); - bool is_deadline = deadline_sec <= new_time_limit_sec ? true : false; - std::string log_timer = - deadline_sec <= new_time_limit_sec ? "deadline" : "time_limit"; - AddTerminationTimer_(std::min(deadline_sec, new_time_limit_sec), - is_deadline); - CRANE_TRACE("Replace the old timer to {} timer of {} seconds", log_timer, - std::min(deadline_sec, new_time_limit_sec)); - } + const absl::Time start_time = + absl::FromUnixSeconds(m_step_.GetStep().start_time().seconds()); + const absl::Duration elapsed = now - start_time; + const int64_t cfg_time_limit_sec = + elem.time_limit.has_value() + ? elem.time_limit.value() + : m_step_.GetStep().time_limit().seconds(); + int64_t tl_remaining_sec = + ToInt64Seconds(absl::Seconds(cfg_time_limit_sec) - elapsed); + + std::optional<int64_t> dl_remaining_sec = std::nullopt; + if (elem.deadline_time.has_value()) { + dl_remaining_sec = + elem.deadline_time.value() - absl::ToUnixSeconds(now); + } else if (m_step_.GetStep().has_deadline_time()) { + dl_remaining_sec = m_step_.GetStep().deadline_time().seconds() - + absl::ToUnixSeconds(now); + } + + // Immediate expiry + if (dl_remaining_sec.has_value() && dl_remaining_sec.value() <= 0) { + m_task_terminate_queue_.enqueue(TaskTerminateQueueElem{ + .termination_reason = TerminatedBy::TERMINATION_BY_DEADLINE}); + m_terminate_task_async_handle_->send(); + } else if (tl_remaining_sec <= 0) { + m_task_terminate_queue_.enqueue(TaskTerminateQueueElem{ + .termination_reason = TerminatedBy::TERMINATION_BY_TIMEOUT}); + m_terminate_task_async_handle_->send(); + } else { + bool is_deadline = dl_remaining_sec.has_value() && + dl_remaining_sec.value() <= tl_remaining_sec; + int64_t timer_sec = + is_deadline ? dl_remaining_sec.value() : tl_remaining_sec; + timer_sec = std::max<int64_t>(0, timer_sec); + AddTerminationTimer_(timer_sec, is_deadline); + CRANE_TRACE("Replace the old timer to {} timer of {} seconds", + is_deadline ? "deadline" : "time_limit", timer_sec); + }
1671-1694: Don’t coerce missing deadline to 0; preserve nullopt and clear protoWhen no deadline is set, the else branch forces deadline_time to epoch (0), causing immediate “expired” behavior. Preserve absence and clear the proto field.
Apply:
if (deadline_time) { elem.deadline_time = deadline_time; m_step_.GetMutableStep().mutable_deadline_time()->set_seconds( deadline_time.value()); - } else { - elem.deadline_time = - std::optional<int64_t>(m_step_.GetStep().deadline_time().seconds()); + } else if (m_step_.GetStep().has_deadline_time()) { + elem.deadline_time = std::optional<int64_t>( + m_step_.GetStep().deadline_time().seconds()); + } else { + elem.deadline_time = std::nullopt; + m_step_.GetMutableStep().clear_deadline_time(); }
2010-2033: Timer setup duplicated and broken (undefined ‘sec’, no presence checks, wrong window)
- Two competing timer blocks; one uses undefined sec and will double-arm timers.
- Compares deadline against configured time_limit instead of remaining time.
- No guard for missing/past deadline.
Replace with a single, correct computation:
- int64_t time_limit_sec = m_step_.GetStep().time_limit().seconds(); - int64_t deadline_sec = m_step_.GetStep().deadline_time().seconds() - - absl::ToUnixSeconds(absl::Now()); - - std::string log_timer = - deadline_sec <= time_limit_sec ? "deadline" : "time_limit"; - bool is_deadline = deadline_sec <= time_limit_sec ? true : false; - - AddTerminationTimer_(std::min(deadline_sec, time_limit_sec), is_deadline); - CRANE_TRACE("Add a {} timer of {} seconds", log_timer, - std::min(deadline_sec, time_limit_sec)); - - if (m_step_.GetStep().has_deadline_time()) { - int64_t deadline_sec = m_step_.GetStep().deadline_time().seconds() - - absl::ToUnixSeconds(absl::Now()); - bool is_deadline = deadline_sec <= sec ? true : false; - std::string log_timer = deadline_sec <= sec ? "deadline" : "time_limit"; - deadline_sec = deadline_sec <= sec ? deadline_sec : sec; - AddTerminationTimer_(deadline_sec, is_deadline); - CRANE_TRACE("Add a {} timer of {} seconds", log_timer, deadline_sec); - } else { - AddTerminationTimer_(sec, false); - CRANE_TRACE("Add a time_limit timer of {} seconds", sec); - } + const int64_t now_sec = absl::ToUnixSeconds(absl::Now()); + const int64_t start_sec = m_step_.GetStep().start_time().seconds(); + const int64_t elapsed = now_sec - start_sec; + + const int64_t cfg_time_limit_sec = + m_step_.GetStep().time_limit().seconds(); + int64_t tl_remaining_sec = cfg_time_limit_sec - elapsed; + + std::optional<int64_t> dl_remaining_sec = std::nullopt; + if (m_step_.GetStep().has_deadline_time()) { + dl_remaining_sec = + m_step_.GetStep().deadline_time().seconds() - now_sec; + } + + if (dl_remaining_sec.has_value() && dl_remaining_sec.value() <= 0) { + CRANE_TRACE("Task deadline already reached; firing immediately."); + EvTaskTimerCb_(true); + } else if (tl_remaining_sec <= 0) { + CRANE_TRACE("Task time limit already exceeded; firing immediately."); + EvTaskTimerCb_(false); + } else { + bool is_deadline = dl_remaining_sec.has_value() && + dl_remaining_sec.value() <= tl_remaining_sec; + int64_t timer_sec = + is_deadline ? dl_remaining_sec.value() : tl_remaining_sec; + timer_sec = std::max<int64_t>(0, timer_sec); + AddTerminationTimer_(timer_sec, is_deadline); + CRANE_TRACE("Add a {} timer of {} seconds", + is_deadline ? "deadline" : "time_limit", timer_sec); + }src/Craned/Supervisor/TaskManager.h (1)
335-344: Remove AddDeadlineTerminationTimer_ in favor of unified AddTerminationTimer_Deadline handling now routes through EvTaskTimerCb_(bool). This method (and its callback) is dead code.
- void AddDeadlineTerminationTimer_(int64_t secs) { - auto termination_handel = m_uvw_loop_->resource<uvw::timer_handle>(); - termination_handel->on<uvw::timer_event>( - [this](const uvw::timer_event&, uvw::timer_handle& h) { - EvDeadlineTaskTimerCb_(); - }); - termination_handel->start(std::chrono::seconds(secs), - std::chrono::seconds(0)); - m_step_.termination_timer = termination_handel; - }Also remove the EvDeadlineTaskTimerCb_ declaration if present.
🧹 Nitpick comments (16)
docs/deployment/backend/eBPF.md (2)
79-82: Specify the language for the fenced code block.The code block at line 79 lacks a language identifier. Add
bashto improve syntax highlighting and readability.-``` +```bash -DCRANE_ENABLE_CGROUP_V2=ON -DCRANE_ENABLE_BPF=ON -``` +```
15-135: Inconsistent code block styling throughout the document.Markdownlint indicates that all fenced code blocks (
```) should use indented formatting instead. Consider whether your project has an established markdown style guide. If a consistent style is preferred across the documentation, all code blocks in this file (lines 15–135) should be converted from fenced to indented blocks.src/CraneCtld/CranedMetaContainer.cpp (1)
254-279: LGTM: Consistent refactoring with helpful debug logging.The changes improve the code by:
- Using
GetValueExclusivePtrfor more efficient lookup (single map access vs.Contains+operator[])- Updating error messages to correctly reference
resv_id- Adding a debug log when freeing reservation resources, which aids troubleshooting
Optional: Consider removing the extra blank line at Line 272 for consistency with the rest of the codebase's formatting style.
CRANE_DEBUG("[Job #{}] Freeing resource from reservation {}", task_id, resv_id); - ResourceV2 const& resources = resource_iter->second.resources;src/Craned/Core/CtldClient.cpp (1)
109-125: Error handling improved with explicit state transitions.The restructured error paths now consistently set
m_state_ = State::REQUESTING_CONFIGand callActionRequestConfig_()for both the!request.ok()and!request.has_token()cases. The new else block for token mismatch appropriately logs and waits for the correct reply without retriggering the request.Optional: Consider consolidating duplicate state transitions.
Lines 111-112 and lines 116-117 contain identical code. You could reduce duplication by consolidating these branches:
} else { - if (!request.ok()) { + if (!request.ok() || !request.has_token()) { - CRANE_LOGGER_WARN(m_logger_, "ConfigureCranedRequest failed from Ctld."); + if (!request.ok()) + CRANE_LOGGER_WARN(m_logger_, "ConfigureCranedRequest failed from Ctld."); + else + CRANE_LOGGER_DEBUG(m_logger_, + "No token in ConfigureCranedRequest from Ctld."); m_state_ = State::REQUESTING_CONFIG; ActionRequestConfig_(); - } else if (!request.has_token()) { - CRANE_LOGGER_DEBUG(m_logger_, - "No token in ConfigureCranedRequest from Ctld."); - m_state_ = State::REQUESTING_CONFIG; - ActionRequestConfig_(); - } else {docs/command/ccontrol.md (2)
213-213: Correct Chinese grammar on line 213.The phrase "现在completion自动补全代码命令已经无法使用" contains a grammatical issue. Consider rephrasing to "现在的自动补全(completion)命令已经无法使用" or "completion自动补全功能已经无法使用" for clarity.
-由于删除了cobra框架,现在completion自动补全代码命令已经无法使用 +由于删除了cobra框架,现在completion自动补全功能已经无法使用
215-223: Document the purpose and usage of new backend files.Lines 215–223 introduce new file structure documentation (help.go, ccontrol.go, CmdArgParser.go, parser.go) to reflect the shift away from the cobra framework. While this is a good structural addition, consider enhancing the descriptions to clarify their roles more precisely:
- help.go:由于删除了cobra框架,现在help命令的输出编写在help.go中 + help.go:处理帮助命令输出的文件,由于删除了cobra框架,help命令的输出逻辑统一编写于此 - ccontrol.go:向后端通信的函数实现 + ccontrol.go:实现与后端通信的业务逻辑函数 - CmdArgParser.go:命令解析文件,将使用解析到的命令行数据来调用ccontrol.go的函数 + CmdArgParser.go:命令解析及调度文件,将解析后的命令行参数调用相应的ccontrol.go函数 - parser.go:解析器函数,定义数据结构与解析规则 + parser.go:通用解析器实现,定义命令解析规则和数据结构This improves clarity for maintainers and future contributors.
src/Misc/BPF/cgroup_dev_bpf.c (3)
14-16: Struct field type migration looks good; prefer unsigned for bitfields and avoid packing
- __u64/__u32 use aligns with uapi/context types. Good.
- access/type are bitfields; make them __u16, not signed, to avoid UB on bitwise ops.
- Consider dropping #pragma pack: CO-RE/BTF expects natural alignment; packing can cause subtle layout issues across compilers/kernels. If you need 8‑byte alignment for keys, use attribute((aligned(8))) on the struct instead.
Apply within-range changes for bitfield signedness:
- __s16 access; - __s16 type; + __u16 access; + __u16 type;If you choose to remove packing (outside this range), a safe layout is:
struct BpfKey { __u64 cgroup_id; __u32 major; __u32 minor; } __attribute__((aligned(8))); struct BpfDeviceMeta { __u32 major; __u32 minor; __s32 permission; __u16 access; __u16 type; };Also applies to: 22-26
61-61: Use %llu for 64‑bit cgroup_id in bpf_printkkey.cgroup_id is __u64; prefer %llu for portability across archs.
- if (enable_logging) bpf_printk("ctx cgroup ID : %lu\n", key.cgroup_id); + if (enable_logging) bpf_printk("ctx cgroup ID : %llu\n", key.cgroup_id);
100-105: Nit: incorrect log message for MKNOD caseMessage says “Write access denied” in the MKNOD branch. Change to “Mknod access denied” to avoid confusion.
docs/reference/exit_code.md (1)
35-37: Optional: Consider varying sentence structure for readability.Three successive sentences begin with "Values". While acceptable for technical documentation, you could improve readability slightly:
-- Values **0–255** represent the **exit return value**. -- Values **256–320** represent **program exit signals**. -- Values **above 320** represent **Crane-defined codes**. +- **0–255**: Exit return value +- **256–320**: Program exit signals +- **Above 320**: Crane-defined codessrc/Craned/Core/SupervisorKeeper.cpp (1)
32-33: Log placeholders misleading for ExecuteTask failureMessage says “ok:{}” but prints error_message(). Consider “error: {}” or include ok.ok() separately.
docs/deployment/backend/Ubuntu.md (1)
50-57: SELinux section is not applicable to UbuntuUbuntu ships AppArmor by default; SELinux and setenforce usually aren’t present. Recommend removing or clarifying this as “if SELinux is installed” and otherwise reference AppArmor.
src/Utilities/PublicHeader/include/crane/String.h (2)
157-181: Status-to-string map lacks “Deadline” entryTaskStatus::Deadline was introduced; StepStatusToString will return “Invalid” for it. Update CraneStepStatusStrArr to include “Deadline” at the correct ordinal, or switch StepStatusToString to a switch-case over enum values to avoid index drift.
126-136: FlattenMapView relies on values being rangesFine as a utility; consider documenting requirement (kv.second must be a range), to prevent misuse.
src/CraneCtld/TaskScheduler.cpp (1)
3450-3456: Consider removing redundant deadline check.Tasks with
deadline_time < noware already canceled and removed fromm_pending_task_map_inScheduleThread_(lines 572-592) beforeGetOrderedTaskIdListis called. This check at lines 3450-3453 should never trigger.While this defensive code is harmless, removing it would simplify the logic since deadline-expired tasks are guaranteed not to reach this point.
src/Craned/Supervisor/TaskManager.h (1)
89-90: Consider narrowing mutable accessExposing GetMutableStep() invites accidental unsynchronized mutation. Prefer explicit setters (e.g., SetTimeLimitSeconds/SetDeadlineTime) invoked on the UV loop thread.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (30)
docs/images/architecture-V1.pngis excluded by!**/*.pngdocs/images/cbatch.pngis excluded by!**/*.pngdocs/images/cbatch_script.pngis excluded by!**/*.pngdocs/images/ccancel.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_hold.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_hold1.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_hold2.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_hold3.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_release.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_release1.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_release2.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_show.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_showjob.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_shownode.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_showp.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_update.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_updatejob.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_updatejob1.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_updatejob2.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_updatejob3.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_updatejob4.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_updatenode.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_updatenode1.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_updatenode2.pngis excluded by!**/*.pngdocs/images/ccontrol/ccontrol_updatenode3.pngis excluded by!**/*.pngdocs/images/ccontrol_node.pngis excluded by!**/*.pngdocs/images/ccontrol_partition.pngis excluded by!**/*.pngdocs/images/cqueue.pngis excluded by!**/*.pngdocs/images/logo.pngis excluded by!**/*.pngdocs/images/sacctmgr.pngis excluded by!**/*.png
📒 Files selected for processing (65)
.github/workflows/build.yaml(2 hunks).github/workflows/mkpages.yaml(0 hunks).github/workflows/pages.yaml(1 hunks).readthedocs.yaml(1 hunks)CMakeLists.txt(3 hunks)README.md(1 hunks)README_EN.md(1 hunks)dependencies/cmake/unqlite/CMakeLists.txt(1 hunks)dependencies/cmake/uvw/CMakeLists.txt(1 hunks)dependencies/cmake/uvw/uvw.diff(0 hunks)dependencies/cmake/yaml-cpp/CMakeLists.txt(1 hunks)docs/command/cacctmgr.md(1 hunks)docs/command/ccontrol.md(2 hunks)docs/command/ccontrol_yufa.md(0 hunks)docs/command/cqueue.md(2 hunks)docs/config/Centos7Configuration.md(0 hunks)docs/deployment/backend/CentOS7.md(1 hunks)docs/deployment/backend/Rocky9.md(1 hunks)docs/deployment/backend/Ubuntu.md(1 hunks)docs/deployment/backend/eBPF.md(1 hunks)docs/deployment/backend/packaging.md(1 hunks)docs/deployment/configuration/config.md(1 hunks)docs/deployment/configuration/database.md(1 hunks)docs/deployment/configuration/multi-node.md(1 hunks)docs/deployment/configuration/pam.md(1 hunks)docs/deployment/frontend/frontend.md(1 hunks)docs/deployment/frontend/plugins.md(1 hunks)docs/deployment/index.md(1 hunks)docs/index.md(1 hunks)docs/reference/exit_code.md(1 hunks)docs/requirements.txt(1 hunks)etc/config.yaml(1 hunks)mkdocs.yaml(1 hunks)protos/Crane.proto(8 hunks)protos/PublicDefs.proto(7 hunks)scripts/uninstall_craned.sh(1 hunks)src/CraneCtld/AccountManager.cpp(1 hunks)src/CraneCtld/CraneCtld.cpp(1 hunks)src/CraneCtld/CranedMetaContainer.cpp(2 hunks)src/CraneCtld/CranedMetaContainer.h(1 hunks)src/CraneCtld/CtldPublicDefs.h(2 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(22 hunks)src/CraneCtld/RpcService/CranedKeeper.h(6 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(4 hunks)src/CraneCtld/TaskScheduler.cpp(10 hunks)src/CraneCtld/TaskScheduler.h(4 hunks)src/Craned/Common/CgroupManager.cpp(3 hunks)src/Craned/Common/CgroupManager.h(3 hunks)src/Craned/Core/CranedPublicDefs.h(1 hunks)src/Craned/Core/CtldClient.cpp(1 hunks)src/Craned/Core/JobManager.cpp(5 hunks)src/Craned/Core/SupervisorKeeper.cpp(3 hunks)src/Craned/Supervisor/CforedClient.cpp(6 hunks)src/Craned/Supervisor/CforedClient.h(4 hunks)src/Craned/Supervisor/CranedClient.cpp(3 hunks)src/Craned/Supervisor/CranedClient.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(15 hunks)src/Craned/Supervisor/TaskManager.h(6 hunks)src/Misc/BPF/CMakeLists.txt(1 hunks)src/Misc/BPF/cgroup_dev_bpf.c(3 hunks)src/Utilities/PublicHeader/OS.cpp(2 hunks)src/Utilities/PublicHeader/String.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/OS.h(1 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(2 hunks)src/Utilities/PublicHeader/include/crane/String.h(1 hunks)
💤 Files with no reviewable changes (4)
- .github/workflows/mkpages.yaml
- docs/command/ccontrol_yufa.md
- dependencies/cmake/uvw/uvw.diff
- docs/config/Centos7Configuration.md
✅ Files skipped from review due to trivial changes (2)
- dependencies/cmake/unqlite/CMakeLists.txt
- docs/command/cqueue.md
🚧 Files skipped from review as they are similar to previous changes (2)
- src/Utilities/PublicHeader/include/crane/PublicHeader.h
- src/CraneCtld/CtldPublicDefs.h
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-23T07:53:30.513Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
Applied to files:
src/CraneCtld/CraneCtld.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/CranedClient.cpp
📚 Learning: 2025-04-27T06:47:51.751Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:26.012Z
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CranedKeeper.cpp:601-602
Timestamp: 2025-05-02T07:05:26.012Z
Learning: In the CraneCtld codebase, the variables m_disconnected_ and m_registered_ in CranedStub class are already defined as std::atomic_bool, making them thread-safe for concurrent access without additional synchronization.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-06-07T10:47:59.071Z
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.hsrc/Craned/Supervisor/CforedClient.hsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-02T07:05:49.032Z
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CranedKeeper.cpp:503-506
Timestamp: 2025-05-02T07:05:49.032Z
Learning: In CraneCtld/CranedKeeper.cpp, using m_unavail_craned_set_.at() is intentional as the key is guaranteed to exist by design, and crashing on missing key is preferred to silently handling it (fail-fast approach).
Applied to files:
src/CraneCtld/RpcService/CranedKeeper.cpp
🧬 Code graph analysis (20)
src/CraneCtld/CranedMetaContainer.h (1)
src/CraneCtld/CranedMetaContainer.cpp (2)
FreeResourceFromResv(254-279)FreeResourceFromResv(254-255)
src/Craned/Core/CtldClient.cpp (1)
src/Utilities/PublicHeader/GrpcHelper.cpp (2)
ProtoTimestampToString(48-51)ProtoTimestampToString(48-49)
src/Craned/Common/CgroupManager.h (2)
src/Craned/Common/CgroupManager.cpp (10)
Allocate(1543-1554)Allocate(1543-1544)Allocate(1556-1567)Allocate(1556-1557)Allocate(1569-1597)Allocate(1569-1570)Allocate(1599-1628)Allocate(1599-1601)AllocateAndGetCgroup(468-533)AllocateAndGetCgroup(469-471)src/Utilities/PublicHeader/PublicHeader.cpp (1)
ResourceInNode(484-487)
src/CraneCtld/TaskScheduler.cpp (3)
src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(88-109)ChangeTaskTimeConstraint(88-90)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-85)ChangeTaskTimeConstraint(63-66)src/CraneCtld/CtldPublicDefs.h (1)
CheckIfTimeLimitSecIsValid(705-707)
src/CraneCtld/CraneCtld.cpp (1)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)
src/Craned/Core/CranedPublicDefs.h (1)
src/Craned/Common/CgroupManager.h (2)
CgroupInterface(374-405)CgroupManager(514-621)
src/CraneCtld/AccountManager.cpp (1)
test/Misc/mongodb_test.cpp (1)
new_user(120-120)
src/Utilities/PublicHeader/include/crane/OS.h (1)
src/Utilities/PublicHeader/OS.cpp (2)
CreateFoldersForFileEx(109-148)CreateFoldersForFileEx(109-110)
src/Utilities/PublicHeader/include/crane/String.h (2)
src/Utilities/PublicHeader/String.cpp (8)
StepIdsToString(475-477)StepIdsToString(475-475)StepIdTupleToString(479-481)StepIdTupleToString(479-479)StepIdPairToString(483-485)StepIdPairToString(483-483)StepStatusToString(487-489)StepStatusToString(487-487)src/Utilities/PublicHeader/include/crane/PublicHeader.h (2)
Internal(130-238)string_view(252-254)
src/Craned/Supervisor/CranedClient.cpp (2)
src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)src/Utilities/PublicHeader/String.cpp (2)
StepStatusToString(487-489)StepStatusToString(487-487)
src/Craned/Core/SupervisorKeeper.cpp (2)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1024-1149)ChangeTaskTimeConstraint(1024-1026)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-85)ChangeTaskTimeConstraint(63-66)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Craned/Supervisor/TaskManager.h (3)
IsBatch(83-87)DelTerminationTimer_(346-352)AddTerminationTimer_(324-333)src/Craned/Core/CranedPublicDefs.h (1)
g_config(154-154)
src/CraneCtld/RpcService/CranedKeeper.h (2)
src/CraneCtld/RpcService/CranedKeeper.cpp (9)
CranedStub(27-33)Fini(35-37)Fini(35-35)ChangeJobTimeConstraint(239-275)ChangeJobTimeConstraint(239-241)CranedChannelConnFailNoLock_(783-789)CranedChannelConnFailNoLock_(783-783)ConnectCranedNode_(693-781)ConnectCranedNode_(693-694)src/Craned/Core/CranedServer.cpp (2)
ChangeJobTimeConstraint(309-334)ChangeJobTimeConstraint(309-312)
src/CraneCtld/TaskScheduler.h (3)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1024-1149)ChangeTaskTimeConstraint(1024-1026)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(88-109)ChangeTaskTimeConstraint(88-90)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(63-85)ChangeTaskTimeConstraint(63-66)
src/Craned/Supervisor/CforedClient.h (2)
src/Craned/Supervisor/TaskManager.h (1)
pty(142-152)src/Craned/Supervisor/CforedClient.cpp (16)
InitFwdMetaAndUvStdoutFwdHandler(80-99)InitFwdMetaAndUvStdoutFwdHandler(80-82)InitUvX11FwdHandler(101-108)InitUvX11FwdHandler(101-101)TaskOutputFinish(671-675)TaskOutputFinish(671-671)TaskEnd(683-685)TaskEnd(683-683)SetupX11forwarding_(110-207)SetupX11forwarding_(110-110)WriteStringToFd_(209-223)WriteStringToFd_(209-210)CleanStdoutFwdHandlerQueueCb_(225-352)CleanStdoutFwdHandlerQueueCb_(225-225)CleanX11FwdHandlerQueueCb_(354-360)CleanX11FwdHandlerQueueCb_(354-354)
src/Craned/Core/JobManager.cpp (1)
src/Craned/Common/CgroupManager.cpp (2)
CgroupStrByJobId(279-281)CgroupStrByJobId(279-279)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (19)
ITaskInstance(200-219)StepInstance(43-47)StepInstance(49-56)IsBatch(58-60)IsBatch(58-58)IsCrun(62-65)IsCrun(62-62)IsCalloc(67-70)IsCalloc(67-67)EvTaskTimerCb_(1777-1800)EvTaskTimerCb_(1777-1777)EvDeadlineTaskTimerCb_(1802-1802)EvDeadlineTaskTimerCb_(1802-1802)ChangeTaskTimeConstraintAsync(1671-1699)ChangeTaskTimeConstraintAsync(1671-1672)EvCleanTerminateTaskQueueCb_(1900-1944)EvCleanTerminateTaskQueueCb_(1900-1900)EvCleanChangeTaskTimeConstraintQueueCb_(1946-1994)EvCleanChangeTaskTimeConstraintQueueCb_(1946-1946)
src/Utilities/PublicHeader/OS.cpp (1)
test/CraneCtld/EmbeddedDbClientTest.cpp (1)
exists(33-40)
src/Craned/Supervisor/CforedClient.cpp (2)
src/Utilities/PublicHeader/String.cpp (2)
SetCurrentThreadName(340-348)SetCurrentThreadName(340-340)src/Utilities/PublicHeader/OS.cpp (2)
SetFdNonBlocking(153-160)SetFdNonBlocking(153-153)
src/CraneCtld/RpcService/CranedKeeper.cpp (4)
src/Craned/Core/CranedPublicDefs.h (2)
g_runtime_status(160-160)g_config(154-154)src/Craned/Core/CranedServer.cpp (2)
ChangeJobTimeConstraint(309-334)ChangeJobTimeConstraint(309-312)src/Utilities/PublicHeader/Logger.cpp (6)
AddLogger(73-99)AddLogger(73-75)AddLogger(101-124)AddLogger(101-103)StrToLogLevel(23-40)StrToLogLevel(23-24)src/Utilities/PublicHeader/GrpcHelper.cpp (2)
ProtoTimestampToString(48-51)ProtoTimestampToString(48-49)
🪛 LanguageTool
docs/reference/exit_code.md
[grammar] ~21-~21: There might be a mistake here.
Context: ... Crane-defined codes: - 64: Terminated - 65: Permission Denied - 66: Cgroup Error...
(QB_NEW_EN)
[grammar] ~22-~22: There might be a mistake here.
Context: ...- 64: Terminated - 65: Permission Denied - 66: Cgroup Error - 67: File Not Found - ...
(QB_NEW_EN)
[grammar] ~23-~23: There might be a mistake here.
Context: ...65: Permission Denied - 66: Cgroup Error - 67: File Not Found - 68: Spawn Process F...
(QB_NEW_EN)
[grammar] ~24-~24: There might be a mistake here.
Context: ... - 66: Cgroup Error - 67: File Not Found - 68: Spawn Process Failed - 69: Exceeded ...
(QB_NEW_EN)
[grammar] ~25-~25: There might be a mistake here.
Context: ...ile Not Found - 68: Spawn Process Failed - 69: Exceeded Time Limit - 70: Crane Daem...
(QB_NEW_EN)
[grammar] ~26-~26: There might be a mistake here.
Context: ...Process Failed - 69: Exceeded Time Limit - 70: Crane Daemon Down - 71: Execution Er...
(QB_NEW_EN)
[grammar] ~27-~27: There might be a mistake here.
Context: ...eeded Time Limit - 70: Crane Daemon Down - 71: Execution Error - 72: RPC Failure -...
(QB_NEW_EN)
[grammar] ~28-~28: There might be a mistake here.
Context: ... Crane Daemon Down - 71: Execution Error - 72: RPC Failure --- ## JSON Format Exp...
(QB_NEW_EN)
[style] ~37-~37: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...represent program exit signals. - Values above 320 represent **Crane-defined...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
docs/index.md
[grammar] ~102-~102: There might be a mistake here.
Context: ...ces and provides cgroup-based isolation. - AI jobs: the Container Manager uses Kube...
(QB_NEW_EN)
[grammar] ~109-~109: There might be a mistake here.
Context: ...md), calloc, cinfo - Admin commands: [cacct](command/cacct.md...
(QB_NEW_EN)
[grammar] ~110-~110: There might be a mistake here.
Context: ...ccontrol](command/ccontrol.md), ccancel - Exit codes: [reference](reference/exit_c...
(QB_NEW_EN)
[grammar] ~117-~117: There might be a mistake here.
Context: ... https://hpc.pku.edu.cn/demo/cranesched - Backend: <https://github.com/PKUHPC/Cran...
(QB_NEW_EN)
[grammar] ~118-~118: There might be a mistake here.
Context: ...d: https://github.com/PKUHPC/CraneSched - Frontend: <https://github.com/PKUHPC/Cra...
(QB_NEW_EN)
docs/deployment/index.md
[grammar] ~15-~15: There might be a mistake here.
Context: ...ces (cfored, cplugind) | As needed | | Database | MongoDB for storing job...
(QB_NEW_EN)
[grammar] ~116-~116: There might be a mistake here.
Context: ...ure: - Cluster name and control machine - Node definitions (CPU, memory, GPUs) - P...
(QB_NEW_EN)
[grammar] ~117-~117: There might be a mistake here.
Context: ...e - Node definitions (CPU, memory, GPUs) - Partitions and scheduling priorities - N...
(QB_NEW_EN)
[grammar] ~118-~118: There might be a mistake here.
Context: ...) - Partitions and scheduling priorities - Network settings and advanced options #...
(QB_NEW_EN)
docs/deployment/configuration/pam.md
[style] ~57-~57: Using many exclamation marks might seem excessive (in this case: 6 exclamation marks for a text that’s 1590 characters long)
Context: ...nal pam_reauthorize.so prepare ``` !!! warning The line `session required ...
(EN_EXCESSIVE_EXCLAMATION)
docs/deployment/configuration/database.md
[style] ~145-~145: Using many exclamation marks might seem excessive (in this case: 15 exclamation marks for a text that’s 1808 characters long)
Context: ...SetName: crane_rs DbName: crane_db ``` !!! warning This file should only exist...
(EN_EXCESSIVE_EXCLAMATION)
docs/deployment/configuration/multi-node.md
[grammar] ~137-~137: There might be a mistake here.
Context: ...not found**: Install the pdsh package from EPEL repository. Permission denied...
(QB_NEW_EN)
README_EN.md
[grammar] ~27-~27: There might be a mistake here.
Context: ...>100,000 scheduling decisions per second - Scalability: Scales to million-core cl...
(QB_NEW_EN)
[grammar] ~28-~28: There might be a mistake here.
Context: ...ility**: Scales to million-core clusters - Usability: Simple user and admin comma...
(QB_NEW_EN)
[grammar] ~29-~29: There might be a mistake here.
Context: ...bility**: Simple user and admin commands - Security: RBAC and encrypted communica...
(QB_NEW_EN)
[grammar] ~30-~30: There might be a mistake here.
Context: ...rity**: RBAC and encrypted communication - Resilience: Automatic job recovery, no...
(QB_NEW_EN)
[grammar] ~31-~31: There might be a mistake here.
Context: ...b recovery, no SPOF, fast state recovery - Open Source: All code available # Arc...
(QB_NEW_EN)
[grammar] ~44-~44: There might be a mistake here.
Context: ...resources and isolates jobs via cgroups. - For AI jobs: the **Container Manager...
(QB_NEW_EN)
docs/deployment/backend/Ubuntu.md
[grammar] ~8-~8: There might be a mistake here.
Context: ...reparation ### 1.1 Update Package Lists It is recommended to change the package ...
(QB_NEW_EN)
[grammar] ~65-~65: There might be a mistake here.
Context: ...tional configuration](eBPF.md) for GRES, or you can switch the system to use CGro...
(QB_NEW_EN)
[style] ~171-~171: Using many exclamation marks might seem excessive (in this case: 9 exclamation marks for a text that’s 2585 characters long)
Context: ... . ``` 2. Install the built binaries: !!! tip We recommend deploying CraneSch...
(EN_EXCESSIVE_EXCLAMATION)
docs/deployment/backend/eBPF.md
[style] ~8-~8: Using many exclamation marks might seem excessive (in this case: 3 exclamation marks for a text that’s 1399 characters long)
Context: ...ompute nodes. ## 1. Install Clang 19+ !!! tip If your system already installe...
(EN_EXCESSIVE_EXCLAMATION)
docs/deployment/backend/CentOS7.md
[grammar] ~99-~99: There might be a mistake here.
Context: ...wing toolchain versions: * CMake ≥ 3.24 * libstdc++ ≥ 11 * clang ≥ 19 or g++ ≥ 14 ...
(QB_NEW_EN)
[grammar] ~100-~100: There might be a mistake here.
Context: ...rsions: * CMake ≥ 3.24 * libstdc++ ≥ 11 * clang ≥ 19 or g++ ≥ 14 ### 4.1 CMake `...
(QB_NEW_EN)
[style] ~160-~160: Using many exclamation marks might seem excessive (in this case: 12 exclamation marks for a text that’s 2311 characters long)
Context: ...atabase.md) for detailed instructions. !!! info CentOS 7 supports up to MongoD...
(EN_EXCESSIVE_EXCLAMATION)
docs/command/ccontrol.md
[uncategorized] ~213-~213: 您的意思是“"不"全”?
Context: ... ### 删除的命令 由于删除了cobra框架,现在completion自动补全代码命令已经无法使用 ### 文件结构 help.go:由于删除了cobr...
(BU)
docs/deployment/backend/Rocky9.md
[grammar] ~65-~65: There might be a mistake here.
Context: ...eed additional configuration, or you can switch the system to use CGro...
(QB_NEW_EN)
[style] ~167-~167: Using many exclamation marks might seem excessive (in this case: 9 exclamation marks for a text that’s 2482 characters long)
Context: ... . ``` 2. Install the built binaries: !!! tip We recommend deploying CraneSch...
(EN_EXCESSIVE_EXCLAMATION)
docs/deployment/backend/packaging.md
[grammar] ~16-~16: There might be a mistake here.
Context: ...ntrol daemon package (for control nodes) - craned - Execution daemon package (for...
(QB_NEW_EN)
[grammar] ~21-~21: There might be a mistake here.
Context: ... package includes: - Binary executables - Systemd service files - Configuration fi...
(QB_NEW_EN)
[grammar] ~22-~22: There might be a mistake here.
Context: ...nary executables - Systemd service files - Configuration file templates - PAM secur...
(QB_NEW_EN)
[grammar] ~23-~23: There might be a mistake here.
Context: ...ice files - Configuration file templates - PAM security module (for craned package)...
(QB_NEW_EN)
[grammar] ~63-~63: There might be a mistake here.
Context: ...!!! tip For production deployments, use Release build type: ```bash cma...
(QB_NEW_EN)
[style] ~82-~82: Using many exclamation marks might seem excessive (in this case: 9 exclamation marks for a text that’s 2459 characters long)
Context: ...ate only DEB packages cpack -G DEB ``` !!! info The default configuration (`cp...
(EN_EXCESSIVE_EXCLAMATION)
[grammar] ~83-~83: There might be a mistake here.
Context: ... DEB packages cpack -G DEB ``` !!! info The default configuration (cpack witho...
(QB_NEW_EN)
[grammar] ~171-~171: There might be a mistake here.
Context: ...ssions 3. Creates /etc/crane directory 4. Copies sample configuration files to `/e...
(QB_NEW_EN)
[grammar] ~173-~173: There might be a mistake here.
Context: ...se.yaml` (if not exists, cranectld only) 6. Sets appropriate file ownership and perm...
(QB_NEW_EN)
[grammar] ~174-~174: There might be a mistake here.
Context: ...propriate file ownership and permissions
(QB_NEW_EN)
docs/deployment/configuration/config.md
[grammar] ~87-~87: There might be a mistake here.
Context: ... Hostname or range (e.g., node[01-10]) - cpu: Number of CPU cores - memory:...
(QB_NEW_EN)
[grammar] ~88-~88: There might be a mistake here.
Context: ...[01-10]`) - cpu: Number of CPU cores - memory: Total memory (supports K, M, G...
(QB_NEW_EN)
[grammar] ~89-~89: There might be a mistake here.
Context: ...al memory (supports K, M, G, T suffixes) - gres: Generic resources like GPUs (opt...
(QB_NEW_EN)
[grammar] ~121-~121: There might be a mistake here.
Context: ...ers:** - name: Partition identifier - nodes: Node range belonging to this pa...
(QB_NEW_EN)
[grammar] ~122-~122: There might be a mistake here.
Context: ...: Node range belonging to this partition - priority: Higher values = higher prior...
(QB_NEW_EN)
[grammar] ~123-~123: There might be a mistake here.
Context: ...s = higher priority (affects scheduling) - DefaultMemPerCpu: Default memory per C...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ...per CPU in MB (0 = let scheduler decide) - MaxMemPerCpu: Maximum memory per CPU i...
(QB_NEW_EN)
[style] ~246-~246: Using many exclamation marks might seem excessive (in this case: 6 exclamation marks for a text that’s 2668 characters long)
Context: ... DeniedAccounts: banned_account ``` !!! warning AllowedAccounts and `Deni...
(EN_EXCESSIVE_EXCLAMATION)
[grammar] ~294-~294: There might be a mistake here.
Context: ...oubleshooting Nodes not appearing: Check ControlMachine hostname matches actual ...
(QB_NEW_EN)
[grammar] ~294-~294: There might be a mistake here.
Context: ...earing**: Check ControlMachine hostname matches actual control node hostname. **Config...
(QB_NEW_EN)
docs/deployment/frontend/frontend.md
[style] ~2-~2: Using many exclamation marks might seem excessive (in this case: 3 exclamation marks for a text that’s 1862 characters long)
Context: ...eployment Guide of Frontend Components !!! tip This tutorial has been verified...
(EN_EXCESSIVE_EXCLAMATION)
[grammar] ~24-~24: There might be a mistake here.
Context: ... (Frontend daemon for interactive jobs): - Provides support for interactive jobs (u...
(QB_NEW_EN)
[grammar] ~25-~25: There might be a mistake here.
Context: ...ractive jobs (used by crun, calloc). - Typically runs on login nodes where inte...
(QB_NEW_EN)
[grammar] ~28-~28: There might be a mistake here.
Context: ...service. - cplugind` (Plugin daemon): - Loads and manages plugins (mail, monitor...
(QB_NEW_EN)
[grammar] ~29-~29: There might be a mistake here.
Context: ...lugin services to CraneSched components. - Must be running on nodes that require pl...
(QB_NEW_EN)
docs/deployment/frontend/plugins.md
[grammar] ~14-~14: There might be a mistake here.
Context: ...ry** (.so file): Plugin implementation - Plugin configuration (.yaml file, op...
(QB_NEW_EN)
[style] ~25-~25: To form a complete sentence, be sure to include a subject.
Context: ...or.yaml`) : Individual plugin settings. Can be located anywhere readable, with the ...
(MISSING_IT_THERE)
[grammar] ~31-~31: There might be a mistake here.
Context: ...owing plugins: | Plugin | Description | |--------|-------------| | Mail | Se...
(QB_NEW_EN)
[grammar] ~32-~32: There might be a mistake here.
Context: ...| Description | |--------|-------------| | Mail | Send email notifications on...
(QB_NEW_EN)
[grammar] ~33-~33: There might be a mistake here.
Context: ...ail notifications on job state changes | | Monitor | Collect job resource usa...
(QB_NEW_EN)
[grammar] ~34-~34: There might be a mistake here.
Context: ...fluxDB (Grafana integration supported) | | Energy | Monitor power consumption...
(QB_NEW_EN)
[grammar] ~35-~35: There might be a mistake here.
Context: ...umption for nodes and jobs to InfluxDB | | Event | Record node state changes ...
(QB_NEW_EN)
[grammar] ~70-~70: Ensure spelling is correct
Context: ... CraneCtld/Craned uses the plugin system - PlugindDebugLevel: Log level (trace/debug/info; recomme...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~91-~91: There might be a mistake here.
Context: ...following information: - Username: your_username - Bucket: your_bucket_name - Org: ...
(QB_NEW_EN)
[grammar] ~92-~92: There might be a mistake here.
Context: ...ername**: your_username - Bucket: your_bucket_name - Org: your_organization - Token: ...
(QB_NEW_EN)
[grammar] ~93-~93: There might be a mistake here.
Context: ...Bucket**: your_bucket_name - Org: your_organization - Token: your_token - Measurement:...
(QB_NEW_EN)
[grammar] ~94-~94: There might be a mistake here.
Context: ...Org**: your_organization - Token: your_token - Measurement: ResourceUsage ### Conf...
(QB_NEW_EN)
[grammar] ~180-~180: There might be a mistake here.
Context: ...** in /etc/crane/config.yaml (similar to monitor plugin). ### Using Email Notif...
(QB_NEW_EN)
[grammar] ~214-~214: There might be a mistake here.
Context: ... ### Mail Types | Type | Description | |------|-------------| | BEGIN | Job s...
(QB_NEW_EN)
[grammar] ~215-~215: There might be a mistake here.
Context: ...e | Description | |------|-------------| | BEGIN | Job starts running (Pending ...
(QB_NEW_EN)
[grammar] ~216-~216: There might be a mistake here.
Context: ...Job starts running (Pending → Running) | | FAILED | Job execution fails | | `TI...
(QB_NEW_EN)
[grammar] ~217-~217: There might be a mistake here.
Context: ...ng) | | FAILED | Job execution fails | | TIMELIMIT | Job exceeds time limit |...
(QB_NEW_EN)
[grammar] ~218-~218: There might be a mistake here.
Context: ...| TIMELIMIT | Job exceeds time limit | | END | Job finishes (Running → Comple...
(QB_NEW_EN)
[grammar] ~219-~219: There might be a mistake here.
Context: ...(Running → Completed/Failed/Cancelled) | | ALL | All of the above events | ###...
(QB_NEW_EN)
[grammar] ~231-~231: There might be a mistake here.
Context: ...leshooting - Verify Enabled: true in /etc/crane/config.yaml - Check that cplugind is running: `syste...
(QB_NEW_EN)
[grammar] ~232-~232: There might be a mistake here.
Context: ...ml- Check thatcplugindis running:systemctl status cplugind- Ensure.so` file paths are absolute and...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.18.1)
docs/index.md
125-125: Bare URL used
(MD034, no-bare-urls)
docs/deployment/configuration/database.md
25-25: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
46-46: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
71-71: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
80-80: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
99-99: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
109-109: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
115-115: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
137-137: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
README_EN.md
9-9: Heading style
Expected: atx; Actual: atx_closed
(MD003, heading-style)
25-25: Heading style
Expected: atx; Actual: atx_closed
(MD003, heading-style)
34-34: Heading style
Expected: atx; Actual: atx_closed
(MD003, heading-style)
47-47: Heading style
Expected: atx; Actual: atx_closed
(MD003, heading-style)
docs/deployment/backend/Ubuntu.md
34-34: Code block style
Expected: fenced; Actual: indented
(MD046, code-block-style)
docs/deployment/backend/eBPF.md
15-15: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
25-25: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
62-62: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
79-79: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
79-79: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
89-89: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
94-94: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
99-99: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
107-107: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
117-117: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
124-124: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
129-129: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
134-134: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
README.md
25-25: Heading style
Expected: atx; Actual: atx_closed
(MD003, heading-style)
docs/command/ccontrol.md
1-1: No space inside hashes on closed atx style heading
(MD020, no-missing-space-closed-atx)
28-28: Heading levels should only increment by one level at a time
Expected: h2; Actual: h4
(MD001, heading-increment)
78-78: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
83-83: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
88-88: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
93-93: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
98-98: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
103-103: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
108-108: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
113-113: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
118-118: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
123-123: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
128-128: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
133-133: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
142-142: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
147-147: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
152-152: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
157-157: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
162-162: Code fence style
Expected: tilde; Actual: backtick
(MD048, code-fence-style)
docs/deployment/backend/Rocky9.md
13-13: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
20-20: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
34-34: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
41-41: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
52-52: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
72-72: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
86-86: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
106-106: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
121-121: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
127-127: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
152-152: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
170-170: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
190-190: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
197-197: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
docs/deployment/frontend/frontend.md
21-21: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
22-22: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
25-25: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
26-26: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
29-29: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
30-30: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
33-33: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
53-53: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
62-62: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
69-69: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
77-77: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
95-95: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
fold AddTerminationTimer_ and AddDeadlineTerminationTimer_
todo: check ChangeTimeConstraint feat
3d09c97 to
0ec11b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/CraneCtld/DbClient.cpp (2)
1549-1615: Fix field/value order: deadline_time and has_job_info are swappedCurrent tuple places int64 deadline before bool, but fields array has has_job_info before deadline_time. This writes deadline seconds into has_job_info and true into deadline_time.
Apply:
- std::tuple<..., std::optional<ContainerMetaInTask>, /*35-37*/ - int64_t, bool> /*38-39*/ + std::tuple<..., std::optional<ContainerMetaInTask>, /*35-37*/ + bool, int64_t> /*38-39*/ @@ - allocated_res_view.GetDeviceMap(), container_meta, - task_to_ctld.deadline_time().seconds(), - true /* Mark the document having complete job info */}; + allocated_res_view.GetDeviceMap(), container_meta, + true /* Mark the document having complete job info */, + task_to_ctld.deadline_time().seconds()};
1647-1706: Fix field/value order in TaskInCtldToDocument_ as wellSame swap needed here to match fields "... meta_container, has_job_info, deadline_time".
Apply:
- std::tuple<..., std::optional<ContainerMetaInTask>, /*35-37*/ - int64_t, bool> /*38-39*/ + std::tuple<..., std::optional<ContainerMetaInTask>, /*35-37*/ + bool, int64_t> /*38-39*/ @@ - task->allocated_res_view.GetDeviceMap(), container_meta, - task->TaskToCtld().deadline_time().seconds(), - true /* Mark the document having complete job info */ + task->allocated_res_view.GetDeviceMap(), container_meta, + true /* Mark the document having complete job info */, + task->TaskToCtld().deadline_time().seconds() };
♻️ Duplicate comments (13)
src/Craned/Supervisor/TaskManager.h (3)
103-103: Mutable Step access remains unsafe (data race risk).
GetMutableStep()exposes direct mutable access tom_step_to_supv_without documented thread-safety guarantees. Past reviews flagged this as critical (lines 103-103 and null-null in past comments). The implementation at TaskManager.cpp:1667-1675 shows mutations occur before enqueueing to the async queue, creating a race with the UV loop thread.Replace with explicit, thread-safe setters or make this method private and document that it's only safe within the event loop thread.
Based on learnings
448-449: Missing input validation and semantic documentation.
ChangeTaskTimeConstraintAsyncstill lacks:
- Input validation: No rejection of negative values for
time_limitordeadline_time- Semantic definition: Unclear what
nulloptmeans (no change? clear?) vs0(unlimited? invalid?)- Coexistence rules: Can both be set? Must at least one be provided?
Past review (lines 448-450) flagged these issues as critical. Implementation at TaskManager.cpp:1657-1685 and callback at lines 1970-2022 still call
.value()without null checks, risking crashes.Add validation, document semantics (e.g., "nullopt = retain current, 0 = clear/unlimited, negative = invalid"), and ensure callback handles all cases safely.
469-473: Queue element struct updated for time constraints.The renamed
ChangeTaskTimeConstraintQueueElemwith bothtime_limitanddeadline_timeoptionals appropriately models the dual-constraint API. However, without validation inChangeTaskTimeConstraintAsync, invalid combinations can be enqueued.docs/en/command/cqueue.md (1)
33-33: Fix double emphasis formatting (MD036 violation).Line 33 uses
**--deadline** **string**which triggers MD036. Past review on lines 27-33 and static analysis both flag this.Change to single emphasis for consistency with other options:
-- **--deadline** **string**: Display deadline +- **--deadline string**: Display deadlinedocs/zh/command/cbatch.md (1)
54-60: Fix unordered list indentation (MD007 violation).Lines 55-60 use 4-space indentation but Markdown style requires 2-space. Past review and static analysis both flag this.
Apply this fix:
- **--deadline string**: 作业的截止时间, 格式: - - `now + <count> <units>(无时默认为秒数)` - - `HH:MM[:SS]` - - `MM/DD[/YY]` - - `MMDD[YY]` - - `YYYY-MM-DD[THH:MM[:SS]]` - - `midnight/noon/teatime/elevenses/fika` + - `now + <count> <units>(无时默认为秒数)` + - `HH:MM[:SS]` + - `MM/DD[/YY]` + - `MMDD[YY]` + - `YYYY-MM-DD[THH:MM[:SS]]` + - `midnight/noon/teatime/elevenses/fika`src/CraneCtld/RpcService/CranedKeeper.h (1)
92-94: Still missing validation and semantic documentation.The method signature changed from
ChangeJobTimeLimittoChangeJobTimeConstraintwith two optional parameters, but past critical issues remain unresolved:
- No semantic documentation: What does
nulloptmean? (no change vs clear?) What about0or negative values?- No input validation: Code path from TaskScheduler → CranedKeeper → CranedServer → JobManager has no checks. Negative values pass through unchecked.
- Type ambiguity: Using raw
int64_tinstead ofstd::chronotypes loses compile-time unit safety.The implementation at CranedKeeper.cpp:271-306 and CranedServer.cpp:322-355 confirms no validation occurs.
Add:
- Header comment defining semantics (e.g., "nullopt = no change, 0 = clear/unlimited, negative = invalid")
- Input validation rejecting invalid values at API boundary
- Consider
std::chrono::secondsfor type safetysrc/Craned/Core/SupervisorKeeper.h (1)
24-28: Include for std::optional; add brief units docHeader uses std::optional but doesn’t include , which can break consumers. Also consider documenting parameter units for clarity.
Apply:
#include "JobManager.h" #include "crane/AtomicHashMap.h" +#include <optional>Optionally add:
class SupervisorStub { public: - CraneErrCode ChangeTaskTimeConstraint( - std::optional<int64_t> time_limit_seconds, - std::optional<int64_t> deadline_time); + // time_limit_seconds: seconds; deadline_time: Unix epoch seconds. + CraneErrCode ChangeTaskTimeConstraint( + std::optional<int64_t> time_limit_seconds, + std::optional<int64_t> deadline_time);Also applies to: 38-40
src/CraneCtld/RpcService/CtldGrpcServer.cpp (1)
560-567: Reject TimeLimit change when value is missing; avoid no‑op “success”If has_time_limit_seconds() is false, we still call ChangeTaskTimeConstraint with nullopt and then mark tasks modified. Fail fast and report a reason.
Apply:
- if (request->attribute() == ModifyTaskRequest::TimeLimit) { - std::optional<int64_t> time_limit_seconds = - request->has_time_limit_seconds() - ? std::optional<int64_t>(request->time_limit_seconds()) - : std::nullopt; + if (request->attribute() == ModifyTaskRequest::TimeLimit) { + if (!request->has_time_limit_seconds()) { + for (auto task_id : request->task_ids()) { + response->add_not_modified_tasks(task_id); + response->add_not_modified_reasons("Missing time_limit_seconds."); + } + return grpc::Status::OK; + } + std::optional<int64_t> time_limit_seconds = + std::optional<int64_t>(request->time_limit_seconds());src/Craned/Supervisor/TaskManager.cpp (3)
1899-1923: Logic bug: misspelled flag and ‘==’ used instead of ‘=’is_must_terinate is misspelled and never set true; the guards below never elevate priority. Rename and fix assignment.
- bool is_must_terinate = false; - if (elem.termination_reason == TerminatedBy::TERMINATION_BY_DEADLINE) { - is_must_terinate == true; - } + const bool must_terminate = + (elem.termination_reason == TerminatedBy::TERMINATION_BY_DEADLINE); @@ - if (!is_must_terinate && !elem.mark_as_orphaned && + if (!must_terminate && !elem.mark_as_orphaned && !g_runtime_status.CanStepOperate()) { @@ - if (!is_must_terinate && !elem.mark_as_orphaned && + if (!must_terminate && !elem.mark_as_orphaned && m_step_.AllTaskFinished()) {
1658-1686: Don’t coerce a missing deadline into 0; preserve nullopt and clear proto if neededFor absent inputs, keep std::nullopt so the consumer can distinguish “no change”. Coercing to 0 makes downstream treat it as epoch and kill immediately. Update producer now; consumer fix follows.
- if (time_limit) { - elem.time_limit = time_limit; - m_step_.GetMutableStep().mutable_time_limit()->set_seconds( - time_limit.value()); - } else { - elem.time_limit = - std::optional<int64_t>(m_step_.GetStep().time_limit().seconds()); - } + if (time_limit) { + elem.time_limit = time_limit; + m_step_.GetMutableStep().mutable_time_limit()->set_seconds(*time_limit); + } else { + elem.time_limit = std::nullopt; + } @@ - if (deadline_time) { - elem.deadline_time = deadline_time; - m_step_.GetMutableStep().mutable_deadline_time()->set_seconds( - deadline_time.value()); - } else { - elem.deadline_time = - std::optional<int64_t>(m_step_.GetStep().deadline_time().seconds()); - } + if (deadline_time) { + elem.deadline_time = deadline_time; + m_step_.GetMutableStep().mutable_deadline_time()->set_seconds(*deadline_time); + } else if (m_step_.GetStep().has_deadline_time()) { + elem.deadline_time = std::optional<int64_t>( + m_step_.GetStep().deadline_time().seconds()); + } else { + elem.deadline_time = std::nullopt; + m_step_.GetMutableStep().clear_deadline_time(); + }
2041-2052: Clamp negative/absent deadline and select correct timerIf no deadline is set or the deadline is in the past, this computes a negative value and arms an invalid timer (or wrongly prefers deadline). Check presence, clamp, and branch.
- int64_t time_limit_sec = m_step_.GetStep().time_limit().seconds(); - int64_t deadline_sec = m_step_.GetStep().deadline_time().seconds() - - absl::ToUnixSeconds(absl::Now()); - - std::string log_timer = - deadline_sec <= time_limit_sec ? "deadline" : "time_limit"; - bool is_deadline = deadline_sec <= time_limit_sec ? true : false; - - AddTerminationTimer_(std::min(deadline_sec, time_limit_sec), is_deadline); - CRANE_TRACE("Add a {} timer of {} seconds", log_timer, - std::min(deadline_sec, time_limit_sec)); + const int64_t time_limit_sec = m_step_.GetStep().time_limit().seconds(); + if (m_step_.GetStep().has_deadline_time()) { + const int64_t ddl_sec = m_step_.GetStep().deadline_time().seconds() - + absl::ToUnixSeconds(absl::Now()); + if (ddl_sec <= 0) { + CRANE_TRACE("Task deadline already reached; firing immediately."); + EvTaskTimerCb_(true); + } else { + const bool is_deadline = ddl_sec <= time_limit_sec; + const int64_t secs = is_deadline ? ddl_sec : time_limit_sec; + AddTerminationTimer_(secs, is_deadline); + CRANE_TRACE("Add a {} timer of {} seconds", + is_deadline ? "deadline" : "time_limit", secs); + } + } else { + AddTerminationTimer_(time_limit_sec, false); + CRANE_TRACE("Add a time_limit timer of {} seconds", time_limit_sec); + }src/CraneCtld/TaskScheduler.cpp (2)
1293-1339: Don’t abort loop on first expired deadline; signal cancel async handleUsing return here drops remaining queue items and fails to wake the cancel path. Replace with continue and send m_cancel_task_async_handle_.
- if (deadline_time <= absl::ToUnixSeconds(now)) { + if (deadline_time <= absl::ToUnixSeconds(now)) { CRANE_ERROR("Job #{}'s deadline is earlier than now", job_id); { LockGuard pending_guard(&m_pending_task_map_mtx_); auto pd_it = m_pending_task_map_.find(job_id); if (pd_it != m_pending_task_map_.end()) { auto& task = pd_it->second; m_cancel_task_queue_.enqueue(CancelPendingTaskQueueElem{ .task = std::move(task), .finish_status = crane::grpc::TaskStatus::Deadline}); m_pending_task_map_.erase(pd_it); } } - return; + m_cancel_task_async_handle_->send(); + continue; }
1450-1461: Reservation end_time must use the new effective constraints (min of both)Current code writes twice and compares against stale task->time_limit/deadline_time. Compute effective values first, then set once.
- if (time_limit_seconds) { - g_meta_container->GetResvMetaPtr(task->reservation)->end_time = - std::min(task->StartTime() + - absl::Seconds(time_limit_seconds.value()), - task->deadline_time); - } - - if (absl_deadline_time) { - g_meta_container->GetResvMetaPtr(task->reservation)->end_time = - std::min(task->StartTime() + task->time_limit, - absl_deadline_time.value()); - } + // Use the would-be new values if provided; otherwise fall back to current ones. + absl::Duration effective_time_limit = + time_limit_seconds ? absl::Seconds(*time_limit_seconds) + : task->time_limit; + absl::Time effective_deadline = + absl_deadline_time ? *absl_deadline_time : task->deadline_time; + g_meta_container->GetResvMetaPtr(task->reservation)->end_time = + std::min(task->StartTime() + effective_time_limit, + effective_deadline);
🧹 Nitpick comments (4)
src/CraneCtld/DbClient.cpp (1)
453-455: Optional: choose a clearer sentinel for “no deadline” in queriesUsing INT64_MAX as default may confuse consumers; consider omitting the field or using 0 and documenting semantics. If keeping INT64_MAX, document it in API.
protos/Supervisor.proto (1)
104-112: Document units for new fields; ensure clients regenerateAdd brief comments: time_limit_seconds in seconds; deadline_time as Unix epoch seconds. Regenerate stubs to pick up ‘optional’ presence bits.
Also applies to: 131-131
src/CraneCtld/TaskScheduler.h (1)
825-828: Confirm downstream usage of finish_status in cancel pathNew finish_status in CancelPendingTaskQueueElem requires CancelTaskAsyncCb_ and status propagation to honor Deadline vs Cancelled.
src/Craned/Supervisor/TaskManager.cpp (1)
1785-1801: Map deadline-triggered Calloc termination to Deadline, not TimeLimitThe Calloc branch always reports ExceedTimeLimit. Use is_deadline to emit Deadline for consistency.
- if (m_step_.IsCalloc()) { - // For calloc job, we use job id, send TaskFinish_ directly. - TaskFinish_(m_step_.job_id, crane::grpc::TaskStatus::ExceedTimeLimit, - ExitCode::EC_EXCEED_TIME_LIMIT, std::nullopt); + if (m_step_.IsCalloc()) { + const auto status = is_deadline ? crane::grpc::TaskStatus::Deadline + : crane::grpc::TaskStatus::ExceedTimeLimit; + TaskFinish_(m_step_.job_id, status, ExitCode::EC_EXCEED_TIME_LIMIT, + is_deadline ? std::optional<std::string>("Reached task's deadline") + : std::nullopt);
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (33)
docs/en/command/cacct.md(2 hunks)docs/en/command/calloc.md(1 hunks)docs/en/command/cbatch.md(1 hunks)docs/en/command/ccontrol.md(2 hunks)docs/en/command/cqueue.md(2 hunks)docs/en/command/crun.md(1 hunks)docs/zh/command/cacct.md(2 hunks)docs/zh/command/calloc.md(1 hunks)docs/zh/command/cbatch.md(1 hunks)docs/zh/command/ccontrol.md(2 hunks)docs/zh/command/cqueue.md(2 hunks)docs/zh/command/crun.md(1 hunks)protos/Crane.proto(4 hunks)protos/PublicDefs.proto(7 hunks)protos/Supervisor.proto(2 hunks)src/CraneCtld/CtldPublicDefs.cpp(6 hunks)src/CraneCtld/CtldPublicDefs.h(2 hunks)src/CraneCtld/DbClient.cpp(10 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(1 hunks)src/CraneCtld/RpcService/CranedKeeper.h(1 hunks)src/CraneCtld/RpcService/CtldGrpcServer.cpp(2 hunks)src/CraneCtld/TaskScheduler.cpp(16 hunks)src/CraneCtld/TaskScheduler.h(5 hunks)src/Craned/Core/CranedServer.cpp(1 hunks)src/Craned/Core/CranedServer.h(1 hunks)src/Craned/Core/SupervisorKeeper.cpp(1 hunks)src/Craned/Core/SupervisorKeeper.h(1 hunks)src/Craned/Supervisor/SupervisorServer.cpp(1 hunks)src/Craned/Supervisor/SupervisorServer.h(1 hunks)src/Craned/Supervisor/TaskManager.cpp(10 hunks)src/Craned/Supervisor/TaskManager.h(7 hunks)src/Utilities/PublicHeader/include/crane/PublicHeader.h(1 hunks)src/Utilities/PublicHeader/include/crane/String.h(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
- docs/zh/command/cqueue.md
- docs/en/command/ccontrol.md
- docs/en/command/crun.md
- docs/zh/command/calloc.md
- docs/en/command/cacct.md
- docs/zh/command/ccontrol.md
- src/CraneCtld/CtldPublicDefs.cpp
- docs/en/command/cbatch.md
- src/CraneCtld/CtldPublicDefs.h
- docs/en/command/calloc.md
🧰 Additional context used
🧠 Learnings (14)
📚 Learning: 2025-05-02T07:06:36.103Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/CraneCtld/CtldGrpcServer.cpp:113-118
Timestamp: 2025-05-02T07:06:36.103Z
Learning: In CraneSched, gRPC methods should generally return Status::OK even when handling error conditions, as non-OK statuses cause the connection to terminate. Error information should be communicated within the RPC response payload instead.
Applied to files:
src/Craned/Core/CranedServer.hsrc/Craned/Supervisor/SupervisorServer.hprotos/PublicDefs.protosrc/Craned/Core/CranedServer.cppsrc/Craned/Core/SupervisorKeeper.hsrc/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/RpcService/CranedKeeper.cppsrc/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-26T11:04:30.580Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Applied to files:
src/Craned/Supervisor/SupervisorServer.hsrc/Craned/Core/SupervisorKeeper.cppsrc/Craned/Core/SupervisorKeeper.hsrc/Craned/Supervisor/SupervisorServer.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/Craned/Core/CranedServer.cppsrc/CraneCtld/RpcService/CranedKeeper.hsrc/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.cppsrc/CraneCtld/RpcService/CtldGrpcServer.cppsrc/CraneCtld/RpcService/CranedKeeper.cpp
📚 Learning: 2025-05-09T02:15:30.422Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-26T11:00:54.563Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Core/SupervisorKeeper.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cppsrc/Craned/Core/SupervisorKeeper.hsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-08-12T08:58:39.772Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 577
File: src/Craned/Supervisor/TaskManager.cpp:124-130
Timestamp: 2025-08-12T08:58:39.772Z
Learning: In the CraneSched project using C++23, Class Template Argument Deduction (CTAD) allows std::unique_ptr declarations without explicit template parameters when the type can be deduced from the initializer, such as `std::unique_ptr task = std::move(m_task_map_.at(task_id))` where the template parameter is deduced from the move operation.
Applied to files:
src/CraneCtld/TaskScheduler.hsrc/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-07-01T08:00:05.383Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:1-1809
Timestamp: 2025-07-01T08:00:05.383Z
Learning: In TaskManager.cpp, the security model relies on administrator-configured command templates in ParseOCICmdPattern_ and system-provided usernames in ParseFilePathPattern_, with file permissions using the user's own uid/gid for privilege isolation. The user L-Xiafeng considers this security boundary sufficient and chooses not to fix job name-related path issues at this location.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/CraneCtld/TaskScheduler.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-30T08:43:44.470Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Applied to files:
src/CraneCtld/TaskScheduler.cpp
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/TaskManager.cppsrc/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-06-23T01:34:41.639Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 528
File: src/CraneCtld/DbClient.cpp:0-0
Timestamp: 2025-06-23T01:34:41.639Z
Learning: MemoryBytes() method returns uint64_t type in the CraneSched codebase, and when storing in MongoDB BSON documents, it should be cast to int64_t to maintain consistency with the existing schema and enable numeric database operations.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
📚 Learning: 2025-05-02T07:12:46.896Z
Learnt from: RileyWen
Repo: PKUHPC/CraneSched PR: 475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
Applied to files:
src/Craned/Supervisor/TaskManager.cpp
🧬 Code graph analysis (12)
src/Craned/Core/CranedServer.h (2)
src/Craned/Core/CranedServer.cpp (2)
ChangeJobTimeConstraint(323-356)ChangeJobTimeConstraint(323-326)src/CraneCtld/RpcService/CranedKeeper.cpp (2)
ChangeJobTimeConstraint(272-307)ChangeJobTimeConstraint(272-274)
src/Craned/Supervisor/SupervisorServer.h (3)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1379-1512)ChangeTaskTimeConstraint(1379-1381)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(91-112)ChangeTaskTimeConstraint(91-93)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(67-89)ChangeTaskTimeConstraint(67-70)
src/Craned/Core/CranedServer.cpp (1)
src/CraneCtld/RpcService/CranedKeeper.cpp (2)
ChangeJobTimeConstraint(272-307)ChangeJobTimeConstraint(272-274)
src/CraneCtld/RpcService/CranedKeeper.h (2)
src/Craned/Core/CranedServer.cpp (2)
ChangeJobTimeConstraint(323-356)ChangeJobTimeConstraint(323-326)src/CraneCtld/RpcService/CranedKeeper.cpp (2)
ChangeJobTimeConstraint(272-307)ChangeJobTimeConstraint(272-274)
src/CraneCtld/TaskScheduler.h (3)
src/CraneCtld/TaskScheduler.cpp (10)
ChangeTaskTimeConstraint(1379-1512)ChangeTaskTimeConstraint(1379-1381)DeadlineTimerThread_(731-750)DeadlineTimerThread_(731-731)CancelDeadlineTaskCb_(1350-1365)CancelDeadlineTaskCb_(1350-1350)CreateDeadlineTimerCb_(1293-1339)CreateDeadlineTimerCb_(1293-1293)DelDeadlineTimer_(1341-1348)DelDeadlineTimer_(1341-1341)src/Craned/Core/SupervisorKeeper.cpp (3)
ChangeTaskTimeConstraint(91-112)ChangeTaskTimeConstraint(91-93)pair(132-192)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(67-89)ChangeTaskTimeConstraint(67-70)
src/Craned/Core/SupervisorKeeper.cpp (2)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1379-1512)ChangeTaskTimeConstraint(1379-1381)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(67-89)ChangeTaskTimeConstraint(67-70)
src/CraneCtld/TaskScheduler.cpp (4)
src/Utilities/PublicHeader/String.cpp (2)
SetCurrentThreadName(354-362)SetCurrentThreadName(354-354)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(91-112)ChangeTaskTimeConstraint(91-93)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(67-89)ChangeTaskTimeConstraint(67-70)src/CraneCtld/CtldPublicDefs.h (1)
CheckIfTimeLimitSecIsValid(1062-1064)
src/Craned/Core/SupervisorKeeper.h (3)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1379-1512)ChangeTaskTimeConstraint(1379-1381)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(91-112)ChangeTaskTimeConstraint(91-93)src/Craned/Supervisor/SupervisorServer.cpp (2)
ChangeTaskTimeConstraint(67-89)ChangeTaskTimeConstraint(67-70)
src/Craned/Supervisor/TaskManager.cpp (2)
src/Craned/Core/CranedPublicDefs.h (2)
g_config(156-156)g_runtime_status(162-162)src/Craned/Supervisor/TaskManager.h (2)
DelTerminationTimer_(421-427)AddTerminationTimer_(410-419)
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
src/Craned/Core/CranedServer.cpp (2)
ChangeJobTimeConstraint(323-356)ChangeJobTimeConstraint(323-326)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
EvTaskTimerCb_(1785-1802)EvTaskTimerCb_(1785-1785)ChangeTaskTimeConstraintAsync(1658-1686)ChangeTaskTimeConstraintAsync(1658-1659)EvCleanTaskStopQueueCb_(1804-1897)EvCleanTaskStopQueueCb_(1804-1804)EvCleanTerminateTaskQueueCb_(1899-1969)EvCleanTerminateTaskQueueCb_(1899-1899)EvCleanChangeTaskTimeConstraintQueueCb_(1971-2023)EvCleanChangeTaskTimeConstraintQueueCb_(1971-1971)
src/Craned/Supervisor/SupervisorServer.cpp (2)
src/CraneCtld/TaskScheduler.cpp (2)
ChangeTaskTimeConstraint(1379-1512)ChangeTaskTimeConstraint(1379-1381)src/Craned/Core/SupervisorKeeper.cpp (2)
ChangeTaskTimeConstraint(91-112)ChangeTaskTimeConstraint(91-93)
🪛 markdownlint-cli2 (0.18.1)
docs/zh/command/cbatch.md
55-55: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
56-56: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
57-57: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
58-58: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
59-59: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
60-60: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (21)
src/Utilities/PublicHeader/include/crane/String.h (1)
183-183: LGTM! String mapping correctly updated for Deadline status.The update properly maps TaskStatus value 10 to "Deadline", aligning with the new deadline feature.
src/Craned/Supervisor/TaskManager.h (4)
484-484: Timer callback signature extended with deadline flag.
EvTaskTimerCb_(bool is_deadline)now distinguishes termination reasons. Verify the implementation at TaskManager.cpp:1784-1801 correctly sets status based on this flag.
516-521: Member variables renamed for time constraint clarity.The renaming from
m_change_task_time_limit_*tom_change_task_time_constraint_*improves naming consistency with the new dual-constraint model.
39-40: Code change verified and complete.The new
TERMINATION_BY_DEADLINEvariant is properly handled in the switch statement at TaskManager.cpp line 1852, with appropriate status mapping (crane::grpc::TaskStatus::Deadline) and message. The only switch statement handlingTerminatedByin the codebase now covers all enum variants (explicit cases for specific termination types, plus a default case for remaining values).
398-419: All timer cleanup verified correctly.The
DelTerminationTimer_()call at line 1988 properly precedes theAddTerminationTimer_()call at line 2012 in the time constraint change handler. The secondAddTerminationTimer_()call at line 2049 in the task execution handler is for initial timer setup where no prior deletion is needed. Theis_deadlineparameter is correctly implemented to distinguish deadline from time-limit expirations. No dual-fire scenarios exist.docs/en/command/cqueue.md (2)
27-27: Deadline field documented.The new output field appropriately documents the deadline display feature.
67-67: Format identifier for deadline added.The
%deadlineand%Deadlineformat tokens appropriately support deadline display in custom formats.docs/zh/command/crun.md (1)
14-14: Deadline option documented for crun.The addition appropriately documents the
--deadlineCLI option for interactive tasks.src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)
125-125: Exit code for deadline termination added.The new
EC_REACHED_DEADLINEappropriately extends the exit code enum to signal deadline-based terminations.docs/zh/command/cacct.md (2)
54-54: Format identifier for deadline added.The
%deadline/%Deadlineformat option appropriately documents deadline display in custom output formats.
84-84: Display option for deadline documented.The
--deadline stringoption appropriately documents the deadline display feature.src/Craned/Supervisor/SupervisorServer.h (1)
53-57: LGTM: RPC rename and types align with protosrc/Craned/Core/CranedServer.cpp (1)
323-356: LGTM: forwards optional time_limit_seconds/deadline_time to SupervisorHandler matches proto and system semantics; logs and response handling look consistent.
src/Craned/Core/CranedServer.h (1)
97-101: LGTM: signature matches implementation and protosrc/CraneCtld/TaskScheduler.h (2)
762-765: LGTM: public API widened to optionals; matches callers
964-1070: Verify uvw loop/handles are constructed and used on the same threadEnsure m_job_deadline_timer_thread_ owns uvw_deadline_loop and that async/timer handles are created/used within that loop’s thread to avoid races.
src/Craned/Core/SupervisorKeeper.cpp (1)
91-110: Forwarding optional time constraints looks correctThe stub now mirrors both optional fields cleanly, so Supervisor receives exactly what Ctld intends.
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
272-307: Nice alignment with the new RPC shapeConditional population of time limit vs. deadline keeps the wire format minimal while matching Supervisor expectations.
protos/PublicDefs.proto (1)
121-122: Good to see Deadline enumeratedAdding the explicit
Deadlinestate gives downstream consumers a first-class signal when a task stops because of deadline enforcement.protos/Crane.proto (1)
152-159: API rename and optional deadline field look solidThe new request names and optional
deadline_timekeep the contract clear for both CLI and Supervisor callers.src/Craned/Supervisor/SupervisorServer.cpp (1)
67-88: Server handler matches the new constraint semanticsExtracting the optionals and delegating to
ChangeTaskTimeConstraintAsyncpreserves previous behavior while adding deadline support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/CraneCtld/CtldPublicDefs.cpp (1)
1024-1026: Resetdeadline_timewhen the proto omits it.This issue has already been flagged in a previous review. When
TaskInCtldinstances are reused and the proto doesn't specifydeadline_time, the field retains stale values. Add an else clause to reset toabsl::InfiniteFuture().
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/CraneCtld/CtldPublicDefs.cpp(6 hunks)src/Craned/Supervisor/TaskManager.h(7 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-05-25T04:11:50.268Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-08T07:39:29.207Z
Learnt from: huerni
Repo: PKUHPC/CraneSched PR: 469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-09T02:16:56.723Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/Craned/TaskManager.cpp:1498-1506
Timestamp: 2025-05-09T02:16:56.723Z
Learning: The `QueryRunningTasksAsync()` method in TaskManager.cpp is designed to never be called from inside the event loop thread, so there's no risk of deadlock with the synchronous `res.get()` call.
Applied to files:
src/Craned/Supervisor/TaskManager.h
📚 Learning: 2025-05-23T02:32:43.952Z
Learnt from: 1daidai1
Repo: PKUHPC/CraneSched PR: 458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
Applied to files:
src/CraneCtld/CtldPublicDefs.cpp
📚 Learning: 2025-05-09T01:54:21.256Z
Learnt from: L-Xiafeng
Repo: PKUHPC/CraneSched PR: 496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Applied to files:
src/CraneCtld/CtldPublicDefs.cpp
🧬 Code graph analysis (1)
src/Craned/Supervisor/TaskManager.h (1)
src/Craned/Supervisor/TaskManager.cpp (10)
EvTaskTimerCb_(1785-1802)EvTaskTimerCb_(1785-1785)ChangeTaskTimeConstraintAsync(1658-1686)ChangeTaskTimeConstraintAsync(1658-1659)EvCleanTaskStopQueueCb_(1804-1897)EvCleanTaskStopQueueCb_(1804-1804)EvCleanTerminateTaskQueueCb_(1899-1969)EvCleanTerminateTaskQueueCb_(1899-1899)EvCleanChangeTaskTimeConstraintQueueCb_(1971-2023)EvCleanChangeTaskTimeConstraintQueueCb_(1971-1971)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (9)
src/CraneCtld/CtldPublicDefs.cpp (2)
552-552: LGTM!Correctly copies
deadline_timefrom the job to the primary step during initialization.
618-618: LGTM!Serialization of
deadline_timeinto proto messages (StepToCtld, StepToD, TaskInfo) is correctly implemented with proper time conversions.Also applies to: 668-669, 1141-1141
src/Craned/Supervisor/TaskManager.h (7)
39-40: LGTM: Enum addition is clean.The
TERMINATION_BY_DEADLINEvalue is correctly added to support the new deadline termination path.
398-408: Timer logic correctly captures deadline flag.The lambda properly captures
is_deadlineand forwards it toEvTaskTimerCb_, ensuring the timer callback knows which constraint triggered termination.
410-419: LGTM: Overload mirrors template version correctly.The int64_t overload maintains consistency with the template version and properly forwards the deadline flag.
469-473: LGTM: Struct evolution matches API changes.The renamed struct and optional fields properly support the dual time-constraint model.
484-484: LGTM: Callback signature updated for deadline awareness.The signature change enables the timer callback to distinguish between time limit and deadline expirations.
490-490: LGTM: Callback rename maintains consistency.The rename aligns with the broader time-constraint refactoring.
516-521: LGTM: Member variable renames are consistent.All time-constraint-related members are renamed uniformly, maintaining code clarity.
Summary by CodeRabbit
New Features
Documentation