Feat: Add [account/user-partition] resource limit #529

huerni · 2025-06-18T06:55:55Z

Summary by CodeRabbit

New Features
- Added support for partition-level resource and job limits for users, accounts, and QoS, enabling more granular quota management.
- Expanded resource limit controls to include maximum jobs, submissions, wall time, and TRES (trackable resources) at user, account, QoS, and partition levels.
- Enhanced querying and modification interfaces to display and manage partition-specific resource limits.
- Improved error reporting with new error codes for resource and job limit violations.
Bug Fixes
- Improved validation and error handling when modifying or deleting users, accounts, and QoS, especially regarding active tasks and resource constraints.
Refactor
- Streamlined internal resource tracking and concurrency control for better performance and reliability.
Documentation
- Extended error code descriptions and field name mappings for new resource management features.
Chores
- Added new utility functions for parsing resource-related strings and updated operator overloads for resource types.

… interface

coderabbitai · 2025-06-18T06:56:00Z

Walkthrough

This change introduces comprehensive partition-level resource limit management for users, accounts, and QoS entities. It extends protobuf definitions, database schema, and in-memory structures to support hierarchical, partition-aware resource and job limit controls. The control flow and gRPC interfaces are updated to handle new resource fields, with enhanced validation, serialization, and concurrency handling.

Changes

File(s)	Change Summary
protos/Crane.proto	Added `partition` string field to `ModifyAccountRequest` message.
protos/PublicDefs.proto	Expanded resource/job limit enums and messages; introduced `PartitionResourceLimit`; extended `AccountInfo`, `UserInfo`, and `QosInfo` with partition-aware resource limits and new error codes.
src/CraneCtld/AccountManager.cpp src/CraneCtld/AccountManager.h	Added/modifed methods for managing partition-level resource limits for users/accounts; integrated partition resource limits into workflows; improved permission checks; updated deletion logic; added helper methods for resource map management.
src/CraneCtld/AccountMetaContainer.cpp src/CraneCtld/AccountMetaContainer.h	Refactored and expanded resource allocation/checking: introduced striped locking, new resource meta maps, fine-grained resource checks, task count tracking, and new methods for meta resource management and deletion.
src/CraneCtld/CtldGrpcServer.cpp	Extended gRPC service methods to handle new resource/job limit fields for QoS, accounts, and users; updated control flow for add/modify/query/submit operations; integrated new partition and TRES resource logic.
src/CraneCtld/CtldPublicDefs.h	Added/extended structs for partition resource limits, account/user resource maps, and QoS fields; introduced new type aliases and field name arrays; added helper for field name lookup.
src/CraneCtld/DbClient.cpp src/CraneCtld/DbClient.h	Extended MongoDB serialization/deserialization for new resource limit fields and nested maps; added support for `ResourceView`, device maps, and partition/account resource limits; updated default QoS initialization.
src/CraneCtld/TaskScheduler.cpp	Integrated user existence validation and task count management during recovery/submission; replaced QoS resource calls with meta resource methods; added meta resource checks to scheduling; updated resource freeing logic.
src/CraneCtld/TaskScheduler.h	Added `UserHasTasks` method to check if a user has active tasks.
src/Utilities/PublicHeader/PublicHeader.cpp src/Utilities/PublicHeader/include/crane/PublicHeader.h	Added operator overloads for `DeviceMap` and `ResourceView` (`+=`, `-=`); added non-const device map accessor; extended error code string array.
src/Utilities/PublicHeader/String.cpp src/Utilities/PublicHeader/include/crane/String.h	Added functions for parsing memory strings, device maps, and resource views from strings; declared new utility functions.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant CtldGrpcServer
    participant AccountManager
    participant AccountMetaContainer
    participant DbClient

    Client->>CtldGrpcServer: ModifyUser/Account/Qos (with partition/resource fields)
    CtldGrpcServer->>AccountManager: ModifyUserPartitionResource/ModifyAccountPartitionResource/ModifyQosTres
    AccountManager->>DbClient: Update resource limits in DB
    AccountManager->>AccountMetaContainer: Update in-memory resource maps

    Client->>CtldGrpcServer: SubmitTaskToScheduler
    CtldGrpcServer->>AccountMetaContainer: TryMallocMetaSubmitResource
    AccountMetaContainer->>AccountManager: Validate resource/job limits (partition, user, account, qos)
    AccountMetaContainer-->>CtldGrpcServer: Success/Error
    CtldGrpcServer-->>Client: Submit result

Suggested labels

enhancement

Suggested reviewers

RileyWen

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

huerni · 2025-06-18T07:03:13Z

merged after #528

huerni · 2025-06-18T07:03:57Z

@coderabbitai full review

coderabbitai · 2025-06-18T07:04:04Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 23

🔭 Outside diff range comments (1)

protos/PublicDefs.proto (1)

502-528: Do NOT renumber existing enum values – this breaks backward compatibility

ModifyField had stable numeric tags.
By inserting new fields (Description = 3, etc.) every subsequent constant is
shifted, altering on-the-wire numbers and corrupting any persisted data
(in DBs, logs, cached protobufs) produced by earlier binaries.

Add new values only at the tail of the enum and keep historical numbers fixed.
Roll back to the original numbering or introduce a new enum if semantics
changed.

Also fix the typo FLags → Flags.

🧹 Nitpick comments (9)

src/Utilities/PublicHeader/include/crane/String.h (2)

50-51: Nit: add function comment or reference

ParseMemStringAsByte is public but undocumented here. A one-liner describing accepted formats (e.g. “10G”, “512Mi”) will save every future reader a trip to the implementation.

95-98: Document conversion helpers

ConvertStringToDeviceMap / ConvertStringToResourceView are extremely handy but tricky. Brief examples ("gpu:2(A100:1,H100:1)") in the header comment would be valuable.

src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

487-489: Return type exposes mutable internal state – intentional?

DeviceMap& GetDeviceMap() gives callers full write access to the internal map, bypassing invariants (e.g. non-negative counts). If that is required, mention it in the comment; otherwise expose a const view or dedicated mutators.
src/Utilities/PublicHeader/String.cpp (1)
503-553: Whitespace sensitivity may reject perfectly valid input

absl::StrSplit(s, ",") keeps leading/trailing spaces, therefore a commonly typed string like
"cpu=8, mem=32G, gres/gpu:2" fails because " mem=32G" does not satisfy item.starts_with("gres/") and later yields kv[0] == " mem".

Trim each item before further processing:
-  for (const auto& item : items) {
+  for (auto item : items) {
+    absl::StripAsciiWhitespace(&item);
This tiny change makes the parser much more forgiving without affecting strict inputs.
src/CraneCtld/AccountMetaContainer.h (1)

28-38: Synchronise stripe counts – avoid redundant locking

kNumStripes = 128, yet each parallel_flat_hash_map is already internally
striped (4 shards via the 6-th template arg).
Adding an additional 128-element std::array<std::mutex> per container
duplicates contention control, wastes memory, and complicates lock ordering.

Consider:

Drop the manual m_*_stripes_ arrays and rely on phmap’s striped mutexes; or

Reduce kNumStripes to the same value used in the map template and document
the guaranteed lock ordering.

Either way, keep a single, consistent stratagem to minimise dead-lock risk and
runtime overhead.

Also applies to: 113-127
protos/PublicDefs.proto (1)
530-537: Nested map indirection looks unnecessary

UserInfo.PartitionToResourceLimit just wraps a single map. Unless a second
level of metadata is planned, replace:
map<string, PartitionToResourceLimit> account_to_partition_limit
with a direct map<string, PartitionResourceLimit> to keep the schema simple.
src/CraneCtld/DbClient.cpp (1)
657-658: Consider numeric types for memory serialization

Serializing memory values as strings may cause precision issues for very large values and increases storage overhead.

Consider using numeric BSON types:
-allocDoc.append(kvp("mem", std::to_string(value.MemoryBytes())));
-allocDoc.append(kvp("mem_sw", std::to_string(value.MemoryBytes())));
+allocDoc.append(kvp("mem", static_cast<int64_t>(value.MemoryBytes())));
+allocDoc.append(kvp("mem_sw", static_cast<int64_t>(value.MemoryBytes())));
src/CraneCtld/CtldPublicDefs.h (1)

948-959: Document memory implications of nested map structure

The MetaResourceStat structure uses multiple nested maps which could consume significant memory with many users, partitions, and QoS combinations.

Consider:

Adding a comment documenting the memory complexity: O(users × partitions × QoS)

Implementing lazy initialization for rarely used combinations

Adding metrics to monitor memory usage of these structures

src/CraneCtld/AccountManager.cpp (1)

2593-2609: Consider documenting the CPU limit calculation rationale.

The implementation correctly initializes default resource limits. The CPU count calculation std::numeric_limits<int32_t>::max() / 256 appears to be a safeguard against overflow, but consider adding a comment explaining this choice.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7bc056 and fb21dad.

⛔ Files ignored due to path filters (1)

src/Misc/BPF/cgroup_dev_bpf.o is excluded by !**/*.o

📒 Files selected for processing (16)

protos/Crane.proto (1 hunks)
protos/PublicDefs.proto (6 hunks)
src/CraneCtld/AccountManager.cpp (26 hunks)
src/CraneCtld/AccountManager.h (4 hunks)
src/CraneCtld/AccountMetaContainer.cpp (2 hunks)
src/CraneCtld/AccountMetaContainer.h (1 hunks)
src/CraneCtld/CtldGrpcServer.cpp (10 hunks)
src/CraneCtld/CtldPublicDefs.h (7 hunks)
src/CraneCtld/DbClient.cpp (8 hunks)
src/CraneCtld/DbClient.h (2 hunks)
src/CraneCtld/TaskScheduler.cpp (10 hunks)
src/CraneCtld/TaskScheduler.h (1 hunks)
src/Utilities/PublicHeader/PublicHeader.cpp (2 hunks)
src/Utilities/PublicHeader/String.cpp (2 hunks)
src/Utilities/PublicHeader/include/crane/PublicHeader.h (4 hunks)
src/Utilities/PublicHeader/include/crane/String.h (2 hunks)

🔇 Additional comments (13)

protos/Crane.proto (1)

313-320: Confirm wire-compatibility of the new partition field

Field number 7 appears to be unused so adding string partition = 7; is wire-compatible.
If 7 was ever reserved or used by a removed field, older binaries will mis-deserialize. Please verify historical .proto history before merging.

src/Utilities/PublicHeader/include/crane/PublicHeader.h (1)

469-472: ResourceView::operator-= may produce negative resources

Unlike the node-level overload, the view-level overload does not guard against rhs.allocatable_res > lhs.allocatable_res, leading to underflowed uint64_t counts.
Consider mirroring the ABSL_ASSERT(rhs <= this->allocatable_res) check used elsewhere.

src/Utilities/PublicHeader/PublicHeader.cpp (1)

724-734: Safety check missing in ResourceView -= ResourceView

This overload blindly subtracts both maps; if rhs.device_map contains keys not in lhs.device_map, the underlying DeviceMap -= will ASSERT and, in release, mis-behave. Replicate the containment test or convert the assertion to a hard check.
src/Utilities/PublicHeader/String.cpp (1)

64-85: Possible silent overflow & loss of precision

sz is parsed into a double and then multiplied by large powers of two before the cast to uint64_t.
Values above ~9 PiB can no longer be represented exactly in an IEEE-754 double, and the multiplication may silently overflow uint64_t without any diagnostics.

Consider:
-  double sz = 0.0;
+  long double sz = 0.0L;          // preserve precision
...
-  uint64_t bytes = 0;
+  __uint128_t bytes = 0;          // compute in wider type
...
-  return bytes;
+  if (bytes > std::numeric_limits<uint64_t>::max())
+    return std::nullopt;
+  return static_cast<uint64_t>(bytes);
src/CraneCtld/CtldGrpcServer.cpp (1)

1748-1752: Possible task-count underflow

TryMallocMetaSubmitResource fails → UserReduceTask() is called, but
UserAddTask() is only invoked inside TryMallocMetaSubmitResource on
success.
If the function returns early without incrementing the counter,
UserReduceTask will decrement a zero count → underflow (wrap to UINT_MAX).

Double-check the implementation or guard the decrement with a prior increment
flag.
src/CraneCtld/AccountMetaContainer.cpp (1)

44-50: Verify the semantics of max_submit_jobs == 0

The current logic treats max_submit_jobs == 0 as "no jobs allowed". In many resource management systems, 0 typically means "unlimited". Please verify if this is the intended behavior.

If 0 should mean unlimited, consider updating the checks:
-if (qos->max_submit_jobs_per_user == 0)
+if (qos->max_submit_jobs_per_user == 0)
+  qos->max_submit_jobs_per_user = std::numeric_limits<uint32_t>::max();
Otherwise, document this behavior clearly in the code comments.
src/CraneCtld/DbClient.cpp (1)

79-84: Clarify the CPU count initialization logic

The initialization uses INT32_MAX / 256 for max_tres.GetAllocatableRes().cpu_count. This division by 256 seems arbitrary and differs from other fields that use the full numeric_limits::max().
#!/bin/bash
# Check if this pattern is used elsewhere in the codebase
rg "INT32_MAX\s*/\s*256" -A 2 -B 2
Consider using consistent max values or documenting why this specific value is used.
src/CraneCtld/AccountManager.cpp (6)

22-22: LGTM!

The TaskScheduler include is necessary for the new UserHasTask check.

190-192: Good addition for data integrity.

Checking if a user has active tasks before deletion prevents data inconsistency and orphaned tasks.

1101-1101: Good refactoring to use centralized helper.

Using CraneModifyFieldStr improves code maintainability and reduces duplication.

1280-1288: Correct implementation of account chain population.

The account hierarchy chain is properly built by traversing from the task's account up to the root, which is essential for hierarchical resource tracking.

2133-2133: Consistent method name updates.

The changes from Delete*AllocationInfo to Delete*Meta methods are consistent with the AccountMetaContainer refactoring.

Also applies to: 2172-2172, 2184-2184

3238-3241: Comprehensive cleanup of partition resources.

The implementation properly removes partition resource limits from both the database and in-memory structures when partitions are deleted.

Also applies to: 3245-3247, 3279-3283

coderabbitai · 2025-06-18T07:09:38Z

src/CraneCtld/TaskScheduler.h

+  bool UserHasTasks(const std::string& username);
+


🛠️ Refactor suggestion

UserHasTasks should be const and accept string_view

The method is a pure query, so marking it const documents thread-safety expectations and allows callers to use it on const instances.
Using absl::string_view avoids an unnecessary allocation.

-bool UserHasTasks(const std::string& username); +bool UserHasTasks(absl::string_view username) const;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

bool UserHasTasks(const std::string& username);

bool UserHasTasks(absl::string_view username) const;

🤖 Prompt for AI Agents

In src/CraneCtld/TaskScheduler.h at lines 508-509, the method UserHasTasks should be updated to be a const member function and accept an absl::string_view parameter instead of a const std::string&. This change documents that the method does not modify the object and improves efficiency by avoiding string copying. Modify the method signature accordingly to reflect these improvements.

coderabbitai · 2025-06-18T07:09:38Z

src/Utilities/PublicHeader/include/crane/PublicHeader.h

+void operator+=(DeviceMap& lhs, const DeviceMap& rhs);
+void operator-=(DeviceMap& lhs, const DeviceMap& rhs);
+


🛠️ Refactor suggestion

DeviceMap compound operators declared here – beware of unsigned underflow

operator-= relies on run-time ABSL_ASSERTs in the implementation.
When assertions are stripped (e.g. Release builds) the code dereferences lhs_it unchecked → UB if the key is absent.
Either keep the early‐return checks or switch to CHECK/CRANE_ASSERT that stay in release.

🤖 Prompt for AI Agents

In src/Utilities/PublicHeader/include/crane/PublicHeader.h around lines 367 to 369, the operator-= for DeviceMap relies on ABSL_ASSERTs which are removed in release builds, causing potential undefined behavior due to unchecked dereferencing. To fix this, replace ABSL_ASSERT with a CHECK or CRANE_ASSERT that remains active in release builds, or reintroduce explicit early-return checks to ensure the key exists before dereferencing.

coderabbitai · 2025-06-18T07:09:38Z

src/Utilities/PublicHeader/include/crane/PublicHeader.h

+        // 65 - 69
        "The current running job exceeds the QoS limit (MaxJobPerUser)",
-        "User has insufficient privilege"
+        "User has insufficient privilege",
+        "The current account is not in the allowed account list for the partition",
+        "The current account has been explicitly added to the deny list for the partition",
+        "ERR_EBPF",
+
+        // 70 - 74
+        "ERR_SUPERVISOR",
+        "The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)"
+        "ERR_USER_HAS_TASK",
+        "The current submitted job exceeds the QoS limit (MaxJobsPerQos)",
+        "ERR_CONVERT_TO_RESOURCE_VIEW",
+


⚠️ Potential issue

Missing comma concatenates two error strings and shifts the array indexes

"The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)" is not followed by a comma, so the next literal is merged by the compiler, producing
"...MaxSubmitJobsPerAccount)ERR_USER_HAS_TASK".
Every subsequent entry is now off-by-one w.r.t the ErrCode enum, breaking CraneErrStr().

- "The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)" + "The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)",

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// 65 - 69

"The current running job exceeds the QoS limit (MaxJobPerUser)",

"User has insufficient privilege"

"User has insufficient privilege",

"The current account is not in the allowed account list for the partition",

"The current account has been explicitly added to the deny list for the partition",

"ERR_EBPF",

// 70 - 74

"ERR_SUPERVISOR",

"The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)"

"ERR_USER_HAS_TASK",

"The current submitted job exceeds the QoS limit (MaxJobsPerQos)",

"ERR_CONVERT_TO_RESOURCE_VIEW",

// 65 - 69

"The current running job exceeds the QoS limit (MaxJobPerUser)",

"User has insufficient privilege",

"The current account is not in the allowed account list for the partition",

"The current account has been explicitly added to the deny list for the partition",

"ERR_EBPF",

// 70 - 74

"ERR_SUPERVISOR",

"The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)",

"ERR_USER_HAS_TASK",

"The current submitted job exceeds the QoS limit (MaxJobsPerQos)",

"ERR_CONVERT_TO_RESOURCE_VIEW",

🤖 Prompt for AI Agents

In src/Utilities/PublicHeader/include/crane/PublicHeader.h between lines 204 and 217, there is a missing comma after the string "The current submitted job exceeds the QoS limit (MaxSubmitJobsPerAccount)". Add a comma immediately after this string to separate it from the next string literal "ERR_USER_HAS_TASK". This will prevent the compiler from concatenating the two strings and ensure the array indexes align correctly with the ErrCode enum, fixing the CraneErrStr() mapping.

coderabbitai · 2025-06-18T07:09:38Z

src/Utilities/PublicHeader/PublicHeader.cpp

+void operator+=(DeviceMap& lhs, const DeviceMap& rhs) {
+  for (const auto& [dev_name, rhs_val] : rhs) {
+    const auto& [rhs_untyped, rhs_types] = rhs_val;
+    auto& lhs_pair = lhs[dev_name];
+    lhs_pair.first += rhs_untyped;
+    for (const auto& [type_name, type_count] : rhs_types) {
+      lhs_pair.second[type_name] += type_count;
+    }
+  }
+}
+
+void operator-=(DeviceMap& lhs, const DeviceMap& rhs) {
+  for (const auto& [dev_name, rhs_val] : rhs) {
+    const auto& [rhs_untyped, rhs_types] = rhs_val;
+    auto lhs_it = lhs.find(dev_name);
+    ABSL_ASSERT(lhs_it != lhs.end());
+    auto& lhs_pair = lhs_it->second;
+    ABSL_ASSERT(lhs_pair.first >= rhs_untyped);
+    lhs_pair.first -= rhs_untyped;
+    for (const auto& [type_name, type_count] : rhs_types) {
+      auto type_it = lhs_pair.second.find(type_name);
+      ABSL_ASSERT(type_it != lhs_pair.second.end());
+      ABSL_ASSERT(type_it->second >= type_count);
+      type_it->second -= type_count;
+    }
+  }
+}


⚠️ Potential issue

DeviceMap::operator-= undefined behaviour in Release builds

When lhs lacks dev_name, lhs_it == lhs.end() and the subsequent dereference triggers UB once ABSL_ASSERT is removed. Same for missing type_name.

- auto lhs_it = lhs.find(dev_name); - ABSL_ASSERT(lhs_it != lhs.end()); - auto& lhs_pair = lhs_it->second; + auto lhs_it = lhs.find(dev_name); + if (lhs_it == lhs.end()) return; // nothing to subtract / or handle error + auto& lhs_pair = lhs_it->second;

Apply similar guarding to the inner type_it block.
Fail-fast or gracefully handle the mismatch; silent UB is deadly.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

void operator+=(DeviceMap& lhs, const DeviceMap& rhs) {

for (const auto& [dev_name, rhs_val] : rhs) {

const auto& [rhs_untyped, rhs_types] = rhs_val;

auto& lhs_pair = lhs[dev_name];

lhs_pair.first += rhs_untyped;

for (const auto& [type_name, type_count] : rhs_types) {

lhs_pair.second[type_name] += type_count;

}

}

}

void operator-=(DeviceMap& lhs, const DeviceMap& rhs) {

for (const auto& [dev_name, rhs_val] : rhs) {

const auto& [rhs_untyped, rhs_types] = rhs_val;

auto lhs_it = lhs.find(dev_name);

ABSL_ASSERT(lhs_it != lhs.end());

auto& lhs_pair = lhs_it->second;

ABSL_ASSERT(lhs_pair.first >= rhs_untyped);

lhs_pair.first -= rhs_untyped;

for (const auto& [type_name, type_count] : rhs_types) {

auto type_it = lhs_pair.second.find(type_name);

ABSL_ASSERT(type_it != lhs_pair.second.end());

ABSL_ASSERT(type_it->second >= type_count);

type_it->second -= type_count;

}

}

}

void operator-=(DeviceMap& lhs, const DeviceMap& rhs) {

for (const auto& [dev_name, rhs_val] : rhs) {

const auto& [rhs_untyped, rhs_types] = rhs_val;

auto lhs_it = lhs.find(dev_name);

if (lhs_it == lhs.end()) return; // nothing to subtract / or handle error

auto& lhs_pair = lhs_it->second;

ABSL_ASSERT(lhs_pair.first >= rhs_untyped);

lhs_pair.first -= rhs_untyped;

for (const auto& [type_name, type_count] : rhs_types) {

auto type_it = lhs_pair.second.find(type_name);

ABSL_ASSERT(type_it != lhs_pair.second.end());

ABSL_ASSERT(type_it->second >= type_count);

type_it->second -= type_count;

}

}

}

🤖 Prompt for AI Agents

In src/Utilities/PublicHeader/PublicHeader.cpp around lines 361 to 387, the operator-= function for DeviceMap uses ABSL_ASSERT to check for the existence of keys in lhs, but these asserts are removed in Release builds causing undefined behavior when keys are missing. To fix this, replace ABSL_ASSERT checks with explicit if conditions that either fail-fast by throwing exceptions or handle the missing keys gracefully, ensuring no dereference occurs on invalid iterators for both dev_name and type_name lookups.

coderabbitai · 2025-06-18T07:09:39Z

src/Utilities/PublicHeader/String.cpp

+std::optional<uint64_t> ParseMemStringAsByte(const std::string& mem) {
+  static const LazyRE2 mem_regex = { R"(^([0-9]+(?:\.?[0-9]*))([MmGgKkB]?)$)" } ;
+  std::string num_str, unit;
+  if (!RE2::FullMatch(mem, *mem_regex, &num_str, &unit)) {
+    CRANE_ERROR("invalid memory format: {}", mem);
+    return std::nullopt;
+  }
+
+  double sz = 0.0;
+  try {
+    sz = std::stod(num_str);
+  } catch (const std::exception& e) {
+    CRANE_ERROR("invalid memory format: {}", mem);
+    return std::nullopt;
+  }
+
+  uint64_t bytes = 0;
+  switch (unit[0]) {
+    case 'K':
+    case 'k':
+      bytes = sz * 1024;
+      break;
+    case 'M':
+    case 'm':
+      bytes = sz * 1024 * 1024;
+      break;
+    case 'G':
+    case 'g':
+      bytes = sz * 1024 * 1024 * 1024;
+      break;
+    case 'B':
+    case 'b':
+      bytes = sz;
+      break;
+    default:
+      bytes = sz * 1024 * 1024;
+  }
+
+  return bytes;
+}


🛠️ Refactor suggestion

Undefined-behaviour when unit is omitted

unit can be an empty string when the caller passes a plain number such as "1024".
Accessing unit[0] in that case is undefined behaviour and will read past the string’s bounds, potentially crashing the ctl-daemon.

A minimal, defensive fix:

- switch (unit[0]) { + char u = unit.empty() ? '\0' : unit[0]; + switch (u) {

You’ll also want to decide on a deterministic default (bytes is the typical choice) instead of silently falling through to the default case that currently assumes megabytes.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

std::optional<uint64_t> ParseMemStringAsByte(const std::string& mem) {

static const LazyRE2 mem_regex = { R"(^([0-9]+(?:\.?[0-9]*))([MmGgKkB]?)$)" } ;

std::string num_str, unit;

if (!RE2::FullMatch(mem, *mem_regex, &num_str, &unit)) {

CRANE_ERROR("invalid memory format: {}", mem);

return std::nullopt;

}

double sz = 0.0;

try {

sz = std::stod(num_str);

} catch (const std::exception& e) {

CRANE_ERROR("invalid memory format: {}", mem);

return std::nullopt;

}

uint64_t bytes = 0;

switch (unit[0]) {

case 'K':

case 'k':

bytes = sz * 1024;

break;

case 'M':

case 'm':

bytes = sz * 1024 * 1024;

break;

case 'G':

case 'g':

bytes = sz * 1024 * 1024 * 1024;

break;

case 'B':

case 'b':

bytes = sz;

break;

default:

bytes = sz * 1024 * 1024;

}

return bytes;

}

std::optional<uint64_t> ParseMemStringAsByte(const std::string& mem) {

static const LazyRE2 mem_regex = { R"(^([0-9]+(?:\.?[0-9]*))([MmGgKkB]?)$)" } ;

std::string num_str, unit;

if (!RE2::FullMatch(mem, *mem_regex, &num_str, &unit)) {

CRANE_ERROR("invalid memory format: {}", mem);

return std::nullopt;

}

double sz = 0.0;

try {

sz = std::stod(num_str);

} catch (const std::exception& e) {

CRANE_ERROR("invalid memory format: {}", mem);

return std::nullopt;

}

uint64_t bytes = 0;

+ char u = unit.empty() ? '\0' : unit[0];

- switch (unit[0]) {

+ switch (u) {

case 'K':

case 'k':

bytes = sz * 1024;

break;

case 'M':

case 'm':

bytes = sz * 1024 * 1024;

break;

case 'G':

case 'g':

bytes = sz * 1024 * 1024 * 1024;

break;

case 'B':

case 'b':

bytes = sz;

break;

default:

bytes = sz * 1024 * 1024;

}

return bytes;

}

🤖 Prompt for AI Agents

In src/Utilities/PublicHeader/String.cpp lines 48 to 87, the code accesses unit[0] without checking if unit is empty, causing undefined behavior when the unit is omitted. Fix this by first checking if unit is empty before accessing unit[0]. If empty, assign a default unit such as 'B' for bytes to ensure deterministic behavior instead of relying on the default case that assumes megabytes. Update the switch statement accordingly to handle this default unit safely.

coderabbitai · 2025-06-18T07:09:40Z

src/CraneCtld/DbClient.cpp

+  resource->GetAllocatableRes().memory_bytes = std::stoull(std::string(allocatable_res["mem"].get_string().value));
+  resource->GetAllocatableRes().memory_sw_bytes = std::stoull(std::string(allocatable_res["mem_sw"].get_string().value));


⚠️ Potential issue

Add error handling for memory value parsing

The std::stoull calls could throw exceptions if the string values are invalid.

Add try-catch blocks:

-resource->GetAllocatableRes().memory_bytes = std::stoull(std::string(allocatable_res["mem"].get_string().value)); -resource->GetAllocatableRes().memory_sw_bytes = std::stoull(std::string(allocatable_res["mem_sw"].get_string().value)); +try { + resource->GetAllocatableRes().memory_bytes = std::stoull(std::string(allocatable_res["mem"].get_string().value)); + resource->GetAllocatableRes().memory_sw_bytes = std::stoull(std::string(allocatable_res["mem_sw"].get_string().value)); +} catch (const std::exception& e) { + CRANE_ERROR("Failed to parse memory values: {}", e.what()); + resource->GetAllocatableRes().memory_bytes = 0; + resource->GetAllocatableRes().memory_sw_bytes = 0; +}

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

resource->GetAllocatableRes().memory_bytes = std::stoull(std::string(allocatable_res["mem"].get_string().value));

resource->GetAllocatableRes().memory_sw_bytes = std::stoull(std::string(allocatable_res["mem_sw"].get_string().value));

try {

resource->GetAllocatableRes().memory_bytes =

std::stoull(std::string(allocatable_res["mem"].get_string().value));

resource->GetAllocatableRes().memory_sw_bytes =

std::stoull(std::string(allocatable_res["mem_sw"].get_string().value));

} catch (const std::exception& e) {

CRANE_ERROR("Failed to parse memory values: {}", e.what());

resource->GetAllocatableRes().memory_bytes = 0;

resource->GetAllocatableRes().memory_sw_bytes = 0;

}

🤖 Prompt for AI Agents

In src/CraneCtld/DbClient.cpp around lines 1179 to 1180, the calls to std::stoull for parsing memory values can throw exceptions if the input strings are invalid. Wrap these calls in try-catch blocks to catch std::invalid_argument and std::out_of_range exceptions, and handle the errors appropriately, such as logging the error and setting default values or returning an error status.

coderabbitai · 2025-06-18T07:09:40Z

src/CraneCtld/AccountManager.cpp

+void AccountManager::EmpalceAccountPartitionResrouce_(
+    Account& account, const std::string& partition) {
+  if (account.partition_to_resource_limit_map.contains(partition)) return;


⚠️ Potential issue

Fix typo in method name.

The method name has a typo: EmpalceAccountPartitionResrouce_ should be EmplaceAccountPartitionResource_.

🤖 Prompt for AI Agents

In src/CraneCtld/AccountManager.cpp around lines 2917 to 2919, the method name EmpalceAccountPartitionResrouce_ contains typos. Rename the method to EmplaceAccountPartitionResource_ to correct the spelling errors in both "Empalce" and "Resrouce".

coderabbitai · 2025-06-18T07:09:40Z

src/CraneCtld/AccountManager.cpp

+CraneExpected<void> AccountManager::ModifyAccountPartitioinResource(
+    uint32_t uid, crane::grpc::ModifyField modify_field,
+    const std::string& account_name, const std::string& partition_name,
+    const std::string& value) {


⚠️ Potential issue

Fix typo in method name.

The method name has a typo: ModifyAccountPartitioinResource should be ModifyAccountPartitionResource.

🤖 Prompt for AI Agents

In src/CraneCtld/AccountManager.cpp around lines 1021 to 1024, the method name ModifyAccountPartitioinResource contains a typo. Rename the method to ModifyAccountPartitionResource by correcting the misspelled word "Partitioin" to "Partition" in both the method declaration and any corresponding definitions or calls.

coderabbitai · 2025-06-18T07:09:40Z

src/CraneCtld/AccountManager.cpp

+    if ((modify_field == crane::grpc::ModifyField::MaxWall ||
+         modify_field == crane::grpc::MaxWallDurationPerJob) &&
+        !CheckIfTimeLimitSecIsValid(value_number))
+      return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);


⚠️ Potential issue

Fix incorrect conditional logic for time limit validation.

The time limit validation should only apply to MaxWall and MaxWallDurationPerJob fields, but the current condition is inside a block that only executes for MaxJobs or MaxSubmitJobs.

Apply this diff to fix the logic:

- if (modify_field == crane::grpc::ModifyField::MaxJobs || - modify_field == crane::grpc::ModifyField::MaxSubmitJobs) { - bool ok = util::ConvertStringToInt64(value, &value_number); - if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER); - - if ((modify_field == crane::grpc::ModifyField::MaxWall || - modify_field == crane::grpc::MaxWallDurationPerJob) && - !CheckIfTimeLimitSecIsValid(value_number)) - return std::unexpected(CraneErrCode::ERR_TIME_LIMIT); - } + bool ok = util::ConvertStringToInt64(value, &value_number); + if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER); + + if ((modify_field == crane::grpc::ModifyField::MaxWall || + modify_field == crane::grpc::ModifyField::MaxWallDurationPerJob) && + !CheckIfTimeLimitSecIsValid(value_number)) + return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In src/CraneCtld/AccountManager.cpp around lines 820 to 823, the time limit validation condition is incorrectly nested inside a block that only runs for MaxJobs or MaxSubmitJobs, causing it to not apply to MaxWall or MaxWallDurationPerJob as intended. To fix this, move the conditional check for MaxWall and MaxWallDurationPerJob fields with the time limit validation outside or separate from the MaxJobs/MaxSubmitJobs block so that it executes correctly when those fields are modified.

coderabbitai · 2025-06-18T07:09:40Z

src/CraneCtld/AccountManager.cpp

+  if (modify_field == crane::grpc::ModifyField::MaxJobs ||
+      modify_field == crane::grpc::ModifyField::MaxSubmitJobs) {
+    bool ok = util::ConvertStringToInt64(value, &value_number);
+    if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER);
+
+    if ((modify_field == crane::grpc::ModifyField::MaxWall ||
+         modify_field == crane::grpc::MaxWallDurationPerJob) &&
+        !CheckIfTimeLimitSecIsValid(value_number))
+      return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);
+  }


⚠️ Potential issue

Fix incorrect conditional logic for time limit validation.

Same issue as in ModifyUserPartitionResource - the time limit check is incorrectly nested.

Apply this diff to fix the logic:

- if (modify_field == crane::grpc::ModifyField::MaxJobs || - modify_field == crane::grpc::ModifyField::MaxSubmitJobs) { - bool ok = util::ConvertStringToInt64(value, &value_number); - if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER); - - if ((modify_field == crane::grpc::ModifyField::MaxWall || - modify_field == crane::grpc::MaxWallDurationPerJob) && - !CheckIfTimeLimitSecIsValid(value_number)) - return std::unexpected(CraneErrCode::ERR_TIME_LIMIT); - } + bool ok = util::ConvertStringToInt64(value, &value_number); + if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER); + + if ((modify_field == crane::grpc::ModifyField::MaxWall || + modify_field == crane::grpc::ModifyField::MaxWallDurationPerJob) && + !CheckIfTimeLimitSecIsValid(value_number)) + return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (modify_field == crane::grpc::ModifyField::MaxJobs ||

modify_field == crane::grpc::ModifyField::MaxSubmitJobs) {

bool ok = util::ConvertStringToInt64(value, &value_number);

if (!ok) return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER);

if ((modify_field == crane::grpc::ModifyField::MaxWall ||

modify_field == crane::grpc::MaxWallDurationPerJob) &&

!CheckIfTimeLimitSecIsValid(value_number))

return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);

}

bool ok = util::ConvertStringToInt64(value, &value_number);

if (!ok)

return std::unexpected(CraneErrCode::ERR_CONVERT_TO_INTEGER);

if ((modify_field == crane::grpc::ModifyField::MaxWall ||

modify_field == crane::grpc::ModifyField::MaxWallDurationPerJob) &&

!CheckIfTimeLimitSecIsValid(value_number))

return std::unexpected(CraneErrCode::ERR_TIME_LIMIT);

🤖 Prompt for AI Agents

In src/CraneCtld/AccountManager.cpp around lines 1043 to 1052, the conditional logic incorrectly nests the time limit validation inside the check for MaxJobs and MaxSubmitJobs. To fix this, separate the conditions so that the time limit validation for MaxWall and MaxWallDurationPerJob is checked independently after converting the value to an integer. Adjust the if statements to correctly validate the time limit only when modify_field matches the relevant time limit fields.

huerni added 29 commits May 30, 2025 11:16

feat: PartitionResource

c0620ee

merge dev/qos_tres_limit

0356094

feat: ModifyUserTresPartitionResource and ModifyUserPartitionResource…

15c0cdd

… interface

feat: dbclient add partition_resource

6cf3c65

feat: ResourceView From db

67ed44b

feat: Account PartitionResource to db and From db

b79623f

feat

22e67d0

fix: account db update partition_resource

d1bf74f

fix: account/user modify partition_resource

5e6fe9a

refactor: fix partition_to_resource_map name

9efc07f

refactor: AccountMeta add partition resource

bc7a95a

merge dev/qos_tres_limit

e3e3939

refactor: modify PartitionResourceLimit name

b5023e5

feat

8ad992c

feat: add partition resource limit

084ce86

refactor: func name

5faea4e

refactor

114a6e5

feat: add max_wall limit

2c2b61f

feat: add max_wall_per_job and max_tres_per_job limit

5ee0f8d

refactor: log

2111fdf

fix: partition max wall

3312d6e

refactor: task reason

d6927bd

fix: user account_to_partition_limit_map

ecd4de0

feat

8eff817

feat: user partition limit add account

2705433

fix: not use actual_account

1a64a63

feat: show account/user-partition limit

64a5f1b

fix: partition_to_limit_map database read

567a83e

fix: cpu account max should be int32 / 256

fb21dad

coderabbitai bot reviewed Jun 18, 2025

View reviewed changes

huerni linked an issue Jun 27, 2025 that may be closed by this pull request

需要限制每个账号下的每个分区的资源限制，包含最大提交作业数、做大使用cpu资源数等 #482

Open

	bool UserHasTasks(const std::string& username);
	bool UserHasTasks(absl::string_view username) const;

		void operator+=(DeviceMap& lhs, const DeviceMap& rhs);
		void operator-=(DeviceMap& lhs, const DeviceMap& rhs);

		resource->GetAllocatableRes().memory_bytes = std::stoull(std::string(allocatable_res["mem"].get_string().value));
		resource->GetAllocatableRes().memory_sw_bytes = std::stoull(std::string(allocatable_res["mem_sw"].get_string().value));

Feat: Add [account/user-partition] resource limit #529

Are you sure you want to change the base?

Feat: Add [account/user-partition] resource limit #529

Uh oh!

Conversation

huerni commented Jun 18, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jun 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Suggested labels

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

Uh oh!

huerni commented Jun 18, 2025

Uh oh!

huerni commented Jun 18, 2025

Uh oh!

coderabbitai bot commented Jun 18, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 18, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

huerni commented Jun 18, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jun 18, 2025 •

edited

Loading