Skip to content

Conversation

@huerni
Copy link
Collaborator

@huerni huerni commented Jun 18, 2025

Summary by CodeRabbit

  • New Features
    • Added support for partition-level resource and job limits for users, accounts, and QoS, enabling more granular quota management.
    • Expanded resource limit controls to include maximum jobs, submissions, wall time, and TRES (trackable resources) at user, account, QoS, and partition levels.
    • Enhanced querying and modification interfaces to display and manage partition-specific resource limits.
    • Improved error reporting with new error codes for resource and job limit violations.
  • Bug Fixes
    • Improved validation and error handling when modifying or deleting users, accounts, and QoS, especially regarding active tasks and resource constraints.
  • Refactor
    • Streamlined internal resource tracking and concurrency control for better performance and reliability.
  • Documentation
    • Extended error code descriptions and field name mappings for new resource management features.
  • Chores
    • Added new utility functions for parsing resource-related strings and updated operator overloads for resource types.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jun 18, 2025

Walkthrough

This change introduces comprehensive partition-level resource limit management for users, accounts, and QoS entities. It extends protobuf definitions, database schema, and in-memory structures to support hierarchical, partition-aware resource and job limit controls. The control flow and gRPC interfaces are updated to handle new resource fields, with enhanced validation, serialization, and concurrency handling.

Changes

File(s) Change Summary
protos/Crane.proto Added partition string field to ModifyAccountRequest message.
protos/PublicDefs.proto Expanded resource/job limit enums and messages; introduced PartitionResourceLimit; extended AccountInfo, UserInfo, and QosInfo with partition-aware resource limits and new error codes.
src/CraneCtld/AccountManager.cpp
src/CraneCtld/AccountManager.h
Added/modifed methods for managing partition-level resource limits for users/accounts; integrated partition resource limits into workflows; improved permission checks; updated deletion logic; added helper methods for resource map management.
src/CraneCtld/AccountMetaContainer.cpp
src/CraneCtld/AccountMetaContainer.h
Refactored and expanded resource allocation/checking: introduced striped locking, new resource meta maps, fine-grained resource checks, task count tracking, and new methods for meta resource management and deletion.
src/CraneCtld/CtldGrpcServer.cpp Extended gRPC service methods to handle new resource/job limit fields for QoS, accounts, and users; updated control flow for add/modify/query/submit operations; integrated new partition and TRES resource logic.
src/CraneCtld/CtldPublicDefs.h Added/extended structs for partition resource limits, account/user resource maps, and QoS fields; introduced new type aliases and field name arrays; added helper for field name lookup.
src/CraneCtld/DbClient.cpp
src/CraneCtld/DbClient.h
Extended MongoDB serialization/deserialization for new resource limit fields and nested maps; added support for ResourceView, device maps, and partition/account resource limits; updated default QoS initialization.
src/CraneCtld/TaskScheduler.cpp Integrated user existence validation and task count management during recovery/submission; replaced QoS resource calls with meta resource methods; added meta resource checks to scheduling; updated resource freeing logic.
src/CraneCtld/TaskScheduler.h Added UserHasTasks method to check if a user has active tasks.
src/Utilities/PublicHeader/PublicHeader.cpp
src/Utilities/PublicHeader/include/crane/PublicHeader.h
Added operator overloads for DeviceMap and ResourceView (+=, -=); added non-const device map accessor; extended error code string array.
src/Utilities/PublicHeader/String.cpp
src/Utilities/PublicHeader/include/crane/String.h
Added functions for parsing memory strings, device maps, and resource views from strings; declared new utility functions.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant CtldGrpcServer
    participant AccountManager
    participant AccountMetaContainer
    participant DbClient

    Client->>CtldGrpcServer: ModifyUser/Account/Qos (with partition/resource fields)
    CtldGrpcServer->>AccountManager: ModifyUserPartitionResource/ModifyAccountPartitionResource/ModifyQosTres
    AccountManager->>DbClient: Update resource limits in DB
    AccountManager->>AccountMetaContainer: Update in-memory resource maps

    Client->>CtldGrpcServer: SubmitTaskToScheduler
    CtldGrpcServer->>AccountMetaContainer: TryMallocMetaSubmitResource
    AccountMetaContainer->>AccountManager: Validate resource/job limits (partition, user, account, qos)
    AccountMetaContainer-->>CtldGrpcServer: Success/Error
    CtldGrpcServer-->>Client: Submit result
Loading

Suggested labels

enhancement

Suggested reviewers

  • RileyWen
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@huerni
Copy link
Collaborator Author

huerni commented Jun 18, 2025

merged after #528

@huerni
Copy link
Collaborator Author

huerni commented Jun 18, 2025

@coderabbitai full review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jun 18, 2025

✅ Actions performed

Full review triggered.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 23

🔭 Outside diff range comments (1)
protos/PublicDefs.proto (1)

502-528: Do NOT renumber existing enum values – this breaks backward compatibility

ModifyField had stable numeric tags.
By inserting new fields (Description = 3, etc.) every subsequent constant is
shifted, altering on-the-wire numbers and corrupting any persisted data
(in DBs, logs, cached protobufs) produced by earlier binaries.

Add new values only at the tail of the enum and keep historical numbers fixed.
Roll back to the original numbering or introduce a new enum if semantics
changed.

Also fix the typo FLagsFlags.

🧹 Nitpick comments (9)
src/Utilities/PublicHeader/include/crane/String.h (2)

50-51: Nit: add function comment or reference

ParseMemStringAsByte is public but undocumented here. A one-liner describing accepted formats (e.g. “10G”, “512Mi”) will save every future reader a trip to the implementation.


95-98: Document conversion helpers

ConvertStringToDeviceMap / ConvertStringToResourceView are extremely handy but tricky. Brief examples ("gpu:2(A100:1,H100:1)") in the header comment would be valuable.

src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

487-489: Return type exposes mutable internal state – intentional?

DeviceMap& GetDeviceMap() gives callers full write access to the internal map, bypassing invariants (e.g. non-negative counts). If that is required, mention it in the comment; otherwise expose a const view or dedicated mutators.

src/Utilities/PublicHeader/String.cpp (1)

503-553: Whitespace sensitivity may reject perfectly valid input

absl::StrSplit(s, ",") keeps leading/trailing spaces, therefore a commonly typed string like
"cpu=8, mem=32G, gres/gpu:2" fails because " mem=32G" does not satisfy item.starts_with("gres/") and later yields kv[0] == " mem".

Trim each item before further processing:

-  for (const auto& item : items) {
+  for (auto item : items) {
+    absl::StripAsciiWhitespace(&item);

This tiny change makes the parser much more forgiving without affecting strict inputs.

src/CraneCtld/AccountMetaContainer.h (1)

28-38: Synchronise stripe counts – avoid redundant locking

kNumStripes = 128, yet each parallel_flat_hash_map is already internally
striped (4 shards via the 6-th template arg).
Adding an additional 128-element std::array<std::mutex> per container
duplicates contention control, wastes memory, and complicates lock ordering.

Consider:

  1. Drop the manual m_*_stripes_ arrays and rely on phmap’s striped mutexes; or
  2. Reduce kNumStripes to the same value used in the map template and document
    the guaranteed lock ordering.

Either way, keep a single, consistent stratagem to minimise dead-lock risk and
runtime overhead.

Also applies to: 113-127

protos/PublicDefs.proto (1)

530-537: Nested map indirection looks unnecessary

UserInfo.PartitionToResourceLimit just wraps a single map. Unless a second
level of metadata is planned, replace:

map<string, PartitionToResourceLimit> account_to_partition_limit

with a direct map<string, PartitionResourceLimit> to keep the schema simple.

src/CraneCtld/DbClient.cpp (1)

657-658: Consider numeric types for memory serialization

Serializing memory values as strings may cause precision issues for very large values and increases storage overhead.

Consider using numeric BSON types:

-allocDoc.append(kvp("mem", std::to_string(value.MemoryBytes())));
-allocDoc.append(kvp("mem_sw", std::to_string(value.MemoryBytes())));
+allocDoc.append(kvp("mem", static_cast<int64_t>(value.MemoryBytes())));
+allocDoc.append(kvp("mem_sw", static_cast<int64_t>(value.MemoryBytes())));
src/CraneCtld/CtldPublicDefs.h (1)

948-959: Document memory implications of nested map structure

The MetaResourceStat structure uses multiple nested maps which could consume significant memory with many users, partitions, and QoS combinations.

Consider:

  1. Adding a comment documenting the memory complexity: O(users × partitions × QoS)
  2. Implementing lazy initialization for rarely used combinations
  3. Adding metrics to monitor memory usage of these structures
src/CraneCtld/AccountManager.cpp (1)

2593-2609: Consider documenting the CPU limit calculation rationale.

The implementation correctly initializes default resource limits. The CPU count calculation std::numeric_limits<int32_t>::max() / 256 appears to be a safeguard against overflow, but consider adding a comment explaining this choice.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7bc056 and fb21dad.

⛔ Files ignored due to path filters (1)
  • src/Misc/BPF/cgroup_dev_bpf.o is excluded by !**/*.o
📒 Files selected for processing (16)
  • protos/Crane.proto (1 hunks)
  • protos/PublicDefs.proto (6 hunks)
  • src/CraneCtld/AccountManager.cpp (26 hunks)
  • src/CraneCtld/AccountManager.h (4 hunks)
  • src/CraneCtld/AccountMetaContainer.cpp (2 hunks)
  • src/CraneCtld/AccountMetaContainer.h (1 hunks)
  • src/CraneCtld/CtldGrpcServer.cpp (10 hunks)
  • src/CraneCtld/CtldPublicDefs.h (7 hunks)
  • src/CraneCtld/DbClient.cpp (8 hunks)
  • src/CraneCtld/DbClient.h (2 hunks)
  • src/CraneCtld/TaskScheduler.cpp (10 hunks)
  • src/CraneCtld/TaskScheduler.h (1 hunks)
  • src/Utilities/PublicHeader/PublicHeader.cpp (2 hunks)
  • src/Utilities/PublicHeader/String.cpp (2 hunks)
  • src/Utilities/PublicHeader/include/crane/PublicHeader.h (4 hunks)
  • src/Utilities/PublicHeader/include/crane/String.h (2 hunks)
🔇 Additional comments (13)
protos/Crane.proto (1)

313-320: Confirm wire-compatibility of the new partition field

Field number 7 appears to be unused so adding string partition = 7; is wire-compatible.
If 7 was ever reserved or used by a removed field, older binaries will mis-deserialize. Please verify historical .proto history before merging.

src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

469-472: ResourceView::operator-= may produce negative resources

Unlike the node-level overload, the view-level overload does not guard against rhs.allocatable_res > lhs.allocatable_res, leading to underflowed uint64_t counts.
Consider mirroring the ABSL_ASSERT(rhs <= this->allocatable_res) check used elsewhere.

src/Utilities/PublicHeader/PublicHeader.cpp (1)

724-734: Safety check missing in ResourceView -= ResourceView

This overload blindly subtracts both maps; if rhs.device_map contains keys not in lhs.device_map, the underlying DeviceMap -= will ASSERT and, in release, mis-behave. Replicate the containment test or convert the assertion to a hard check.

src/Utilities/PublicHeader/String.cpp (1)

64-85: Possible silent overflow & loss of precision

sz is parsed into a double and then multiplied by large powers of two before the cast to uint64_t.
Values above ~9 PiB can no longer be represented exactly in an IEEE-754 double, and the multiplication may silently overflow uint64_t without any diagnostics.

Consider:

-  double sz = 0.0;
+  long double sz = 0.0L;          // preserve precision
...
-  uint64_t bytes = 0;
+  __uint128_t bytes = 0;          // compute in wider type
...
-  return bytes;
+  if (bytes > std::numeric_limits<uint64_t>::max())
+    return std::nullopt;
+  return static_cast<uint64_t>(bytes);
src/CraneCtld/CtldGrpcServer.cpp (1)

1748-1752: Possible task-count underflow

TryMallocMetaSubmitResource fails → UserReduceTask() is called, but
UserAddTask() is only invoked inside TryMallocMetaSubmitResource on
success.
If the function returns early without incrementing the counter,
UserReduceTask will decrement a zero count → underflow (wrap to UINT_MAX).

Double-check the implementation or guard the decrement with a prior increment
flag.

src/CraneCtld/AccountMetaContainer.cpp (1)

44-50: Verify the semantics of max_submit_jobs == 0

The current logic treats max_submit_jobs == 0 as "no jobs allowed". In many resource management systems, 0 typically means "unlimited". Please verify if this is the intended behavior.

If 0 should mean unlimited, consider updating the checks:

-if (qos->max_submit_jobs_per_user == 0)
+if (qos->max_submit_jobs_per_user == 0)
+  qos->max_submit_jobs_per_user = std::numeric_limits<uint32_t>::max();

Otherwise, document this behavior clearly in the code comments.

src/CraneCtld/DbClient.cpp (1)

79-84: Clarify the CPU count initialization logic

The initialization uses INT32_MAX / 256 for max_tres.GetAllocatableRes().cpu_count. This division by 256 seems arbitrary and differs from other fields that use the full numeric_limits::max().

#!/bin/bash
# Check if this pattern is used elsewhere in the codebase
rg "INT32_MAX\s*/\s*256" -A 2 -B 2

Consider using consistent max values or documenting why this specific value is used.

src/CraneCtld/AccountManager.cpp (6)

22-22: LGTM!

The TaskScheduler include is necessary for the new UserHasTask check.


190-192: Good addition for data integrity.

Checking if a user has active tasks before deletion prevents data inconsistency and orphaned tasks.


1101-1101: Good refactoring to use centralized helper.

Using CraneModifyFieldStr improves code maintainability and reduces duplication.


1280-1288: Correct implementation of account chain population.

The account hierarchy chain is properly built by traversing from the task's account up to the root, which is essential for hierarchical resource tracking.


2133-2133: Consistent method name updates.

The changes from Delete*AllocationInfo to Delete*Meta methods are consistent with the AccountMetaContainer refactoring.

Also applies to: 2172-2172, 2184-2184


3238-3241: Comprehensive cleanup of partition resources.

The implementation properly removes partition resource limits from both the database and in-memory structures when partitions are deleted.

Also applies to: 3245-3247, 3279-3283

Comment on lines +508 to +509
bool UserHasTasks(const std::string& username);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

UserHasTasks should be const and accept string_view

The method is a pure query, so marking it const documents thread-safety expectations and allows callers to use it on const instances.
Using absl::string_view avoids an unnecessary allocation.

-bool UserHasTasks(const std::string& username);
+bool UserHasTasks(absl::string_view username) const;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
bool UserHasTasks(const std::string& username);
bool UserHasTasks(absl::string_view username) const;
🤖 Prompt for AI Agents
In src/CraneCtld/TaskScheduler.h at lines 508-509, the method UserHasTasks
should be updated to be a const member function and accept an absl::string_view
parameter instead of a const std::string&. This change documents that the method
does not modify the object and improves efficiency by avoiding string copying.
Modify the method signature accordingly to reflect these improvements.

Comment on lines +367 to +369
void operator+=(DeviceMap& lhs, const DeviceMap& rhs);
void operator-=(DeviceMap& lhs, const DeviceMap& rhs);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

DeviceMap compound operators declared here – beware of unsigned underflow

operator-= relies on run-time ABSL_ASSERTs in the implementation.
When assertions are stripped (e.g. Release builds) the code dereferences lhs_it unchecked → UB if the key is absent.
Either keep the early‐return checks or switch to CHECK/CRANE_ASSERT that stay in release.

🤖 Prompt for AI Agents
In src/Utilities/PublicHeader/include/crane/PublicHeader.h around lines 367 to
369, the operator-= for DeviceMap relies on ABSL_ASSERTs which are removed in
release builds, causing potential undefined behavior due to unchecked
dereferencing. To fix this, replace ABSL_ASSERT with a CHECK or CRANE_ASSERT
that remains active in release builds, or reintroduce explicit early-return
checks to ensure the key exists before dereferencing.

Comment on lines +204 to +217
// 65 - 69
"The current running job exceeds the QoS limit (MaxJobPerUser)",
"User has insufficient privilege"
"User has insufficient privilege",
"The current account is not in the allowed account list for the partition",
"The current account has been explicitly added to the deny list for the partition",
"ERR_EBPF",

// 70 - 74
"ERR_SUPERVISOR",
"The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)"
"ERR_USER_HAS_TASK",
"The current submitted job exceeds the QoS limit (MaxJobsPerQos)",
"ERR_CONVERT_TO_RESOURCE_VIEW",

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Missing comma concatenates two error strings and shifts the array indexes

"The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)" is not followed by a comma, so the next literal is merged by the compiler, producing
"...MaxSubmitJobsPerAccount)ERR_USER_HAS_TASK".
Every subsequent entry is now off-by-one w.r.t the ErrCode enum, breaking CraneErrStr().

-        "The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)"
+        "The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// 65 - 69
"The current running job exceeds the QoS limit (MaxJobPerUser)",
"User has insufficient privilege"
"User has insufficient privilege",
"The current account is not in the allowed account list for the partition",
"The current account has been explicitly added to the deny list for the partition",
"ERR_EBPF",
// 70 - 74
"ERR_SUPERVISOR",
"The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)"
"ERR_USER_HAS_TASK",
"The current submitted job exceeds the QoS limit (MaxJobsPerQos)",
"ERR_CONVERT_TO_RESOURCE_VIEW",
// 65 - 69
"The current running job exceeds the QoS limit (MaxJobPerUser)",
"User has insufficient privilege",
"The current account is not in the allowed account list for the partition",
"The current account has been explicitly added to the deny list for the partition",
"ERR_EBPF",
// 70 - 74
"ERR_SUPERVISOR",
"The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)",
"ERR_USER_HAS_TASK",
"The current submitted job exceeds the QoS limit (MaxJobsPerQos)",
"ERR_CONVERT_TO_RESOURCE_VIEW",
🤖 Prompt for AI Agents
In src/Utilities/PublicHeader/include/crane/PublicHeader.h between lines 204 and
217, there is a missing comma after the string "The current submitted job
exceeds the QoS limit (MaxSubmitJobsPerAccount)". Add a comma immediately after
this string to separate it from the next string literal "ERR_USER_HAS_TASK".
This will prevent the compiler from concatenating the two strings and ensure the
array indexes align correctly with the ErrCode enum, fixing the CraneErrStr()
mapping.

Comment on lines +361 to +387
void operator+=(DeviceMap& lhs, const DeviceMap& rhs) {
for (const auto& [dev_name, rhs_val] : rhs) {
const auto& [rhs_untyped, rhs_types] = rhs_val;
auto& lhs_pair = lhs[dev_name];
lhs_pair.first += rhs_untyped;
for (const auto& [type_name, type_count] : rhs_types) {
lhs_pair.second[type_name] += type_count;
}
}
}

void operator-=(DeviceMap& lhs, const DeviceMap& rhs) {
for (const auto& [dev_name, rhs_val] : rhs) {
const auto& [rhs_untyped, rhs_types] = rhs_val;
auto lhs_it = lhs.find(dev_name);
ABSL_ASSERT(lhs_it != lhs.end());
auto& lhs_pair = lhs_it->second;
ABSL_ASSERT(lhs_pair.first >= rhs_untyped);
lhs_pair.first -= rhs_untyped;
for (const auto& [type_name, type_count] : rhs_types) {
auto type_it = lhs_pair.second.find(type_name);
ABSL_ASSERT(type_it != lhs_pair.second.end());
ABSL_ASSERT(type_it->second >= type_count);
type_it->second -= type_count;
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

DeviceMap::operator-= undefined behaviour in Release builds

When lhs lacks dev_name, lhs_it == lhs.end() and the subsequent dereference triggers UB once ABSL_ASSERT is removed. Same for missing type_name.

-    auto lhs_it = lhs.find(dev_name);
-    ABSL_ASSERT(lhs_it != lhs.end());
-    auto& lhs_pair = lhs_it->second;
+    auto lhs_it = lhs.find(dev_name);
+    if (lhs_it == lhs.end()) return;  // nothing to subtract / or handle error
+    auto& lhs_pair = lhs_it->second;

Apply similar guarding to the inner type_it block.
Fail-fast or gracefully handle the mismatch; silent UB is deadly.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
void operator+=(DeviceMap& lhs, const DeviceMap& rhs) {
for (const auto& [dev_name, rhs_val] : rhs) {
const auto& [rhs_untyped, rhs_types] = rhs_val;
auto& lhs_pair = lhs[dev_name];
lhs_pair.first += rhs_untyped;
for (const auto& [type_name, type_count] : rhs_types) {
lhs_pair.second[type_name] += type_count;
}
}
}
void operator-=(DeviceMap& lhs, const DeviceMap& rhs) {
for (const auto& [dev_name, rhs_val] : rhs) {
const auto& [rhs_untyped, rhs_types] = rhs_val;
auto lhs_it = lhs.find(dev_name);
ABSL_ASSERT(lhs_it != lhs.end());
auto& lhs_pair = lhs_it->second;
ABSL_ASSERT(lhs_pair.first >= rhs_untyped);
lhs_pair.first -= rhs_untyped;
for (const auto& [type_name, type_count] : rhs_types) {
auto type_it = lhs_pair.second.find(type_name);
ABSL_ASSERT(type_it != lhs_pair.second.end());
ABSL_ASSERT(type_it->second >= type_count);
type_it->second -= type_count;
}
}
}
void operator-=(DeviceMap& lhs, const DeviceMap& rhs) {
for (const auto& [dev_name, rhs_val] : rhs) {
const auto& [rhs_untyped, rhs_types] = rhs_val;
auto lhs_it = lhs.find(dev_name);
if (lhs_it == lhs.end()) return; // nothing to subtract / or handle error
auto& lhs_pair = lhs_it->second;
ABSL_ASSERT(lhs_pair.first >= rhs_untyped);
lhs_pair.first -= rhs_untyped;
for (const auto& [type_name, type_count] : rhs_types) {
auto type_it = lhs_pair.second.find(type_name);
ABSL_ASSERT(type_it != lhs_pair.second.end());
ABSL_ASSERT(type_it->second >= type_count);
type_it->second -= type_count;
}
}
}
🤖 Prompt for AI Agents
In src/Utilities/PublicHeader/PublicHeader.cpp around lines 361 to 387, the
operator-= function for DeviceMap uses ABSL_ASSERT to check for the existence of
keys in lhs, but these asserts are removed in Release builds causing undefined
behavior when keys are missing. To fix this, replace ABSL_ASSERT checks with
explicit if conditions that either fail-fast by throwing exceptions or handle
the missing keys gracefully, ensuring no dereference occurs on invalid iterators
for both dev_name and type_name lookups.

Comment on lines +48 to +87
std::optional<uint64_t> ParseMemStringAsByte(const std::string& mem) {
static const LazyRE2 mem_regex = { R"(^([0-9]+(?:\.?[0-9]*))([MmGgKkB]?)$)" } ;
std::string num_str, unit;
if (!RE2::FullMatch(mem, *mem_regex, &num_str, &unit)) {
CRANE_ERROR("invalid memory format: {}", mem);
return std::nullopt;
}

double sz = 0.0;
try {
sz = std::stod(num_str);
} catch (const std::exception& e) {
CRANE_ERROR("invalid memory format: {}", mem);
return std::nullopt;
}

uint64_t bytes = 0;
switch (unit[0]) {
case 'K':
case 'k':
bytes = sz * 1024;
break;
case 'M':
case 'm':
bytes = sz * 1024 * 1024;
break;
case 'G':
case 'g':
bytes = sz * 1024 * 1024 * 1024;
break;
case 'B':
case 'b':
bytes = sz;
break;
default:
bytes = sz * 1024 * 1024;
}

return bytes;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Undefined-behaviour when unit is omitted

unit can be an empty string when the caller passes a plain number such as "1024".
Accessing unit[0] in that case is undefined behaviour and will read past the string’s bounds, potentially crashing the ctl-daemon.

A minimal, defensive fix:

-  switch (unit[0]) {
+  char u = unit.empty() ? '\0' : unit[0];
+  switch (u) {

You’ll also want to decide on a deterministic default (bytes is the typical choice) instead of silently falling through to the default case that currently assumes megabytes.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
std::optional<uint64_t> ParseMemStringAsByte(const std::string& mem) {
static const LazyRE2 mem_regex = { R"(^([0-9]+(?:\.?[0-9]*))([MmGgKkB]?)$)" } ;
std::string num_str, unit;
if (!RE2::FullMatch(mem, *mem_regex, &num_str, &unit)) {
CRANE_ERROR("invalid memory format: {}", mem);
return std::nullopt;
}
double sz = 0.0;
try {
sz = std::stod(num_str);
} catch (const std::exception& e) {
CRANE_ERROR("invalid memory format: {}", mem);
return std::nullopt;
}
uint64_t bytes = 0;
switch (unit[0]) {
case 'K':
case 'k':
bytes = sz * 1024;
break;
case 'M':
case 'm':
bytes = sz * 1024 * 1024;
break;
case 'G':
case 'g':
bytes = sz * 1024 * 1024 * 1024;
break;
case 'B':
case 'b':
bytes = sz;
break;
default:
bytes = sz * 1024 * 1024;
}
return bytes;
}
std::optional<uint64_t> ParseMemStringAsByte(const std::string& mem) {
static const LazyRE2 mem_regex = { R"(^([0-9]+(?:\.?[0-9]*))([MmGgKkB]?)$)" } ;
std::string num_str, unit;
if (!RE2::FullMatch(mem, *mem_regex, &num_str, &unit)) {
CRANE_ERROR("invalid memory format: {}", mem);
return std::nullopt;
}
double sz = 0.0;
try {
sz = std::stod(num_str);
} catch (const std::exception& e) {
CRANE_ERROR("invalid memory format: {}", mem);
return std::nullopt;
}
uint64_t bytes = 0;
+ char u = unit.empty() ? '\0' : unit[0];
- switch (unit[0]) {
+ switch (u) {
case 'K':
case 'k':
bytes = sz * 1024;
break;
case 'M':
case 'm':
bytes = sz * 1024 * 1024;
break;
case 'G':
case 'g':
bytes = sz * 1024 * 1024 * 1024;
break;
case 'B':
case 'b':
bytes = sz;
break;
default:
bytes = sz * 1024 * 1024;
}
return bytes;
}
🤖 Prompt for AI Agents
In src/Utilities/PublicHeader/String.cpp lines 48 to 87, the code accesses
unit[0] without checking if unit is empty, causing undefined behavior when the
unit is omitted. Fix this by first checking if unit is empty before accessing
unit[0]. If empty, assign a default unit such as 'B' for bytes to ensure
deterministic behavior instead of relying on the default case that assumes
megabytes. Update the switch statement accordingly to handle this default unit
safely.

Comment on lines +1179 to +1180
resource->GetAllocatableRes().memory_bytes = std::stoull(std::string(allocatable_res["mem"].get_string().value));
resource->GetAllocatableRes().memory_sw_bytes = std::stoull(std::string(allocatable_res["mem_sw"].get_string().value));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add error handling for memory value parsing

The std::stoull calls could throw exceptions if the string values are invalid.

Add try-catch blocks:

-resource->GetAllocatableRes().memory_bytes = std::stoull(std::string(allocatable_res["mem"].get_string().value));
-resource->GetAllocatableRes().memory_sw_bytes = std::stoull(std::string(allocatable_res["mem_sw"].get_string().value));
+try {
+  resource->GetAllocatableRes().memory_bytes = std::stoull(std::string(allocatable_res["mem"].get_string().value));
+  resource->GetAllocatableRes().memory_sw_bytes = std::stoull(std::string(allocatable_res["mem_sw"].get_string().value));
+} catch (const std::exception& e) {
+  CRANE_ERROR("Failed to parse memory values: {}", e.what());
+  resource->GetAllocatableRes().memory_bytes = 0;
+  resource->GetAllocatableRes().memory_sw_bytes = 0;
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
resource->GetAllocatableRes().memory_bytes = std::stoull(std::string(allocatable_res["mem"].get_string().value));
resource->GetAllocatableRes().memory_sw_bytes = std::stoull(std::string(allocatable_res["mem_sw"].get_string().value));
try {
resource->GetAllocatableRes().memory_bytes =
std::stoull(std::string(allocatable_res["mem"].get_string().value));
resource->GetAllocatableRes().memory_sw_bytes =
std::stoull(std::string(allocatable_res["mem_sw"].get_string().value));
} catch (const std::exception& e) {
CRANE_ERROR("Failed to parse memory values: {}", e.what());
resource->GetAllocatableRes().memory_bytes = 0;
resource->GetAllocatableRes().memory_sw_bytes = 0;
}
🤖 Prompt for AI Agents
In src/CraneCtld/DbClient.cpp around lines 1179 to 1180, the calls to
std::stoull for parsing memory values can throw exceptions if the input strings
are invalid. Wrap these calls in try-catch blocks to catch std::invalid_argument
and std::out_of_range exceptions, and handle the errors appropriately, such as
logging the error and setting default values or returning an error status.

Comment on lines +2917 to +2919
void AccountManager::EmpalceAccountPartitionResrouce_(
Account& account, const std::string& partition) {
if (account.partition_to_resource_limit_map.contains(partition)) return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix typo in method name.

The method name has a typo: EmpalceAccountPartitionResrouce_ should be EmplaceAccountPartitionResource_.

🤖 Prompt for AI Agents
In src/CraneCtld/AccountManager.cpp around lines 2917 to 2919, the method name
EmpalceAccountPartitionResrouce_ contains typos. Rename the method to
EmplaceAccountPartitionResource_ to correct the spelling errors in both
"Empalce" and "Resrouce".

Comment on lines +1021 to +1024
CraneExpected<void> AccountManager::ModifyAccountPartitioinResource(
uint32_t uid, crane::grpc::ModifyField modify_field,
const std::string& account_name, const std::string& partition_name,
const std::string& value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix typo in method name.

The method name has a typo: ModifyAccountPartitioinResource should be ModifyAccountPartitionResource.

🤖 Prompt for AI Agents
In src/CraneCtld/AccountManager.cpp around lines 1021 to 1024, the method name
ModifyAccountPartitioinResource contains a typo. Rename the method to
ModifyAccountPartitionResource by correcting the misspelled word "Partitioin" to
"Partition" in both the method declaration and any corresponding definitions or
calls.

Comment on lines +820 to +823
if ((modify_field == crane::grpc::ModifyField::MaxWall ||
modify_field == crane::grpc::MaxWallDurationPerJob) &&
!CheckIfTimeLimitSecIsValid(value_number))
return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix incorrect conditional logic for time limit validation.

The time limit validation should only apply to MaxWall and MaxWallDurationPerJob fields, but the current condition is inside a block that only executes for MaxJobs or MaxSubmitJobs.

Apply this diff to fix the logic:

-  if (modify_field == crane::grpc::ModifyField::MaxJobs ||
-      modify_field == crane::grpc::ModifyField::MaxSubmitJobs) {
-    bool ok = util::ConvertStringToInt64(value, &value_number);
-    if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER);
-
-    if ((modify_field == crane::grpc::ModifyField::MaxWall ||
-         modify_field == crane::grpc::MaxWallDurationPerJob) &&
-        !CheckIfTimeLimitSecIsValid(value_number))
-      return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);
-  }
+  bool ok = util::ConvertStringToInt64(value, &value_number);
+  if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER);
+
+  if ((modify_field == crane::grpc::ModifyField::MaxWall ||
+       modify_field == crane::grpc::ModifyField::MaxWallDurationPerJob) &&
+      !CheckIfTimeLimitSecIsValid(value_number))
+    return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/CraneCtld/AccountManager.cpp around lines 820 to 823, the time limit
validation condition is incorrectly nested inside a block that only runs for
MaxJobs or MaxSubmitJobs, causing it to not apply to MaxWall or
MaxWallDurationPerJob as intended. To fix this, move the conditional check for
MaxWall and MaxWallDurationPerJob fields with the time limit validation outside
or separate from the MaxJobs/MaxSubmitJobs block so that it executes correctly
when those fields are modified.

Comment on lines +1043 to +1052
if (modify_field == crane::grpc::ModifyField::MaxJobs ||
modify_field == crane::grpc::ModifyField::MaxSubmitJobs) {
bool ok = util::ConvertStringToInt64(value, &value_number);
if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER);

if ((modify_field == crane::grpc::ModifyField::MaxWall ||
modify_field == crane::grpc::MaxWallDurationPerJob) &&
!CheckIfTimeLimitSecIsValid(value_number))
return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix incorrect conditional logic for time limit validation.

Same issue as in ModifyUserPartitionResource - the time limit check is incorrectly nested.

Apply this diff to fix the logic:

-  if (modify_field == crane::grpc::ModifyField::MaxJobs ||
-      modify_field == crane::grpc::ModifyField::MaxSubmitJobs) {
-    bool ok = util::ConvertStringToInt64(value, &value_number);
-    if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER);
-
-    if ((modify_field == crane::grpc::ModifyField::MaxWall ||
-         modify_field == crane::grpc::MaxWallDurationPerJob) &&
-        !CheckIfTimeLimitSecIsValid(value_number))
-      return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);
-  }
+  bool ok = util::ConvertStringToInt64(value, &value_number);
+  if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER);
+
+  if ((modify_field == crane::grpc::ModifyField::MaxWall ||
+       modify_field == crane::grpc::ModifyField::MaxWallDurationPerJob) &&
+      !CheckIfTimeLimitSecIsValid(value_number))
+    return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (modify_field == crane::grpc::ModifyField::MaxJobs ||
modify_field == crane::grpc::ModifyField::MaxSubmitJobs) {
bool ok = util::ConvertStringToInt64(value, &value_number);
if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER);
if ((modify_field == crane::grpc::ModifyField::MaxWall ||
modify_field == crane::grpc::MaxWallDurationPerJob) &&
!CheckIfTimeLimitSecIsValid(value_number))
return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);
}
bool ok = util::ConvertStringToInt64(value, &value_number);
if (!ok)
return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER);
if ((modify_field == crane::grpc::ModifyField::MaxWall ||
modify_field == crane::grpc::ModifyField::MaxWallDurationPerJob) &&
!CheckIfTimeLimitSecIsValid(value_number))
return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);
🤖 Prompt for AI Agents
In src/CraneCtld/AccountManager.cpp around lines 1043 to 1052, the conditional
logic incorrectly nests the time limit validation inside the check for MaxJobs
and MaxSubmitJobs. To fix this, separate the conditions so that the time limit
validation for MaxWall and MaxWallDurationPerJob is checked independently after
converting the value to an integer. Adjust the if statements to correctly
validate the time limit only when modify_field matches the relevant time limit
fields.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

需要限制每个账号下的每个分区的资源限制,包含最大提交作业数、做大使用cpu资源数等

2 participants