Skip to content

Conversation

@huerni
Copy link
Collaborator

@huerni huerni commented Jul 17, 2025

Summary by CodeRabbit

  • New Features

    • Added automatic detection and reporting of CPU topology (sockets, cores per socket, threads per core) using the hwloc library.
    • Hardware topology information is now included in node registration and task metadata.
    • Task scheduling now considers CPU topology, ensuring tasks are allocated to nodes with sufficient cores per socket.
  • Bug Fixes

    • Improved handling and propagation of hardware topology details in task and node metadata.
    • Added validation and warnings for configuration mismatches related to cores per socket.
  • Chores

    • Integrated hwloc as a required dependency for system builds.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jul 17, 2025

Walkthrough

This change introduces hardware topology awareness to the project by integrating the hwloc library. It adds new CMake configuration for hwloc, updates protobuf and internal data structures to carry CPU topology information, implements topology detection using hwloc, and propagates this information through node registration, task scheduling, and persistence logic.

Changes

File(s) Change Summary
CMakeLists.txt, CMakeModule/FindHWLOC.cmake, src/Utilities/PublicHeader/CMakeLists.txt Added hwloc as a required dependency, created a CMake module to find hwloc, and linked it to Utility_PublicHeader.
protos/PublicDefs.proto Extended protobuf messages to include cores_per_socket and a new TopologyInfo message for hardware topology.
src/Utilities/PublicHeader/include/crane/OS.h, src/Utilities/PublicHeader/OS.cpp Introduced TopologyInfo struct and implemented GetCpuTopologyInfo using hwloc to retrieve CPU topology.
src/Craned/Craned.cpp, src/Craned/CranedPublicDefs.h, src/Craned/CtldClient.cpp Gathered and stored CPU topology info during config parsing and included it in registration requests.
src/CraneCtld/CtldPublicDefs.h Added topology info fields to node and task metadata structures.
src/CraneCtld/DbClient.cpp Included cores_per_socket in MongoDB document conversion for tasks.
src/CraneCtld/RpcService/CranedKeeper.cpp Propagated cores_per_socket in task execution requests.
src/CraneCtld/TaskScheduler.cpp Added node filtering in scheduler based on cores_per_socket compatibility.
src/CraneCtld/CraneCtld.cpp Added parsing of CoresPerSocket from node config to topology info.
src/CraneCtld/CranedMetaContainer.cpp Initialized and validated cores_per_socket in static and remote node metadata, with warnings on mismatch.
etc/config.yaml Added CPU topology parameters (Sockets, CoresPerSocket, ThreadsPerCore) for specific nodes.

Sequence Diagram(s)

sequenceDiagram
    participant Craned
    participant OSUtil
    participant CtldClient
    participant CraneCtld
    participant Scheduler

    Craned->>OSUtil: GetCpuTopologyInfo()
    OSUtil-->>Craned: TopologyInfo struct
    Craned->>CtldClient: Prepare registration (includes TopologyInfo)
    CtldClient->>CraneCtld: Register node (with TopologyInfo)
    CraneCtld->>Scheduler: Update node meta (TopologyInfo available)
    Scheduler->>Scheduler: Filter nodes by cores_per_socket for tasks
Loading

Suggested reviewers

  • RileyWen
  • L-Xiafeng

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf25d79 and 775fa44.

📒 Files selected for processing (4)
  • etc/config.yaml (1 hunks)
  • src/CraneCtld/CraneCtld.cpp (1 hunks)
  • src/CraneCtld/CranedMetaContainer.cpp (3 hunks)
  • src/Utilities/PublicHeader/OS.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • etc/config.yaml
  • src/CraneCtld/CraneCtld.cpp
  • src/Utilities/PublicHeader/OS.cpp
  • src/CraneCtld/CranedMetaContainer.cpp
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@huerni huerni linked an issue Jul 17, 2025 that may be closed by this pull request
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
src/CraneCtld/TaskScheduler.cpp (2)

563-570: LGTM! CPU topology filtering logic is correct.

The topology filtering correctly excludes nodes that cannot satisfy the task's cores-per-socket requirement. The debug logging is helpful for troubleshooting scheduling decisions.

Consider adding validation to ensure topology_info.cores_per_socket is properly initialized (non-zero) to handle edge cases where topology information might be missing or invalid.


644-651: Consider refactoring to eliminate code duplication.

This topology filtering logic is identical to the one at lines 563-570. While the logic is correct, having the same validation in multiple places within the same method increases maintenance overhead.

Consider extracting this logic into a helper function like:

bool IsNodeTopologyCompatible(const TaskInCtld* task, const CranedMeta* craned_meta) {
  if (task->cores_per_socket > craned_meta->remote_meta.topology_info.cores_per_socket) {
    return false;
  }
  CRANE_DEBUG("The number of cores per CPU socket on this Craned {} is sufficient for the task {}.",
              craned_meta->static_meta.hostname, task->TaskId());
  return true;
}

This would improve maintainability and ensure consistent behavior across both code paths.

CMakeModule/FindHWLOC.cmake (1)

14-14: Consider using CRANE_TRACE or making this message conditional.

The debug message will always be printed during CMake configuration, which might be too verbose for normal builds. Consider using a conditional message based on a debug flag or switching to a trace-level message.

-message(STATUS "PkgConfig: HWLOC INCLUDE_DIR:${PC_HWLOC_INCLUDE_DIRS}, LIB_DIR:${PC_HWLOC_LIBRARY_DIRS}, Ver: ${PC_HWLOC_VERSION}")
+if(CMAKE_VERBOSE_MAKEFILE OR HWLOC_DEBUG)
+  message(STATUS "PkgConfig: HWLOC INCLUDE_DIR:${PC_HWLOC_INCLUDE_DIRS}, LIB_DIR:${PC_HWLOC_LIBRARY_DIRS}, Ver: ${PC_HWLOC_VERSION}")
+endif()
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df6b06d and 092c008.

📒 Files selected for processing (13)
  • CMakeLists.txt (1 hunks)
  • CMakeModule/FindHWLOC.cmake (1 hunks)
  • protos/PublicDefs.proto (4 hunks)
  • src/CraneCtld/CtldPublicDefs.h (4 hunks)
  • src/CraneCtld/DbClient.cpp (8 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.cpp (2 hunks)
  • src/Craned/Craned.cpp (1 hunks)
  • src/Craned/CranedPublicDefs.h (1 hunks)
  • src/Craned/CtldClient.cpp (1 hunks)
  • src/Utilities/PublicHeader/CMakeLists.txt (1 hunks)
  • src/Utilities/PublicHeader/OS.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/OS.h (3 hunks)
🧰 Additional context used
🧠 Learnings (9)
src/Utilities/PublicHeader/CMakeLists.txt (1)
Learnt from: RileyWen
PR: PKUHPC/CraneSched#456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.
src/Craned/Craned.cpp (5)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/include/crane/CpuFrequency.h:57-67
Timestamp: 2025-06-30T08:57:12.915Z
Learning: In CraneSched's CpuFrequency class, the static member functions (ParseCpuList_, CpuFreqGetScalingFreq_, CpuFreqSetGov_, CpuFreqSetScalingFreq_, SetCpuOwnerLock_, TestCpuOwnerLock_, FdLockRetry_) are utility functions that only interact with the system via file I/O and locking without accessing instance members like m_cpu_freq_data_. These functions are appropriately declared as static.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/CpuFrequency.cpp:56-77
Timestamp: 2025-06-30T08:51:58.728Z
Learning: In CraneSched's CpuFrequency implementation, the scaling_available_governors file contains only one governor per CPU (not multiple space-separated governors as typically seen in standard Linux systems), so the current if-else parsing logic in CpuFrequency::Init() is correct for reading a single governor string.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
src/CraneCtld/RpcService/CranedKeeper.cpp (10)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
PR: PKUHPC/CraneSched#469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Learnt from: huerni
PR: PKUHPC/CraneSched#469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
Learnt from: 1daidai1
PR: PKUHPC/CraneSched#458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is handled/updated separately before calling SetAllocatedRes, so it does not need to be updated again within the method itself.
src/Craned/CtldClient.cpp (3)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
src/Utilities/PublicHeader/include/crane/OS.h (7)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/include/crane/CpuFrequency.h:57-67
Timestamp: 2025-06-30T08:57:12.915Z
Learning: In CraneSched's CpuFrequency class, the static member functions (ParseCpuList_, CpuFreqGetScalingFreq_, CpuFreqSetGov_, CpuFreqSetScalingFreq_, SetCpuOwnerLock_, TestCpuOwnerLock_, FdLockRetry_) are utility functions that only interact with the system via file I/O and locking without accessing instance members like m_cpu_freq_data_. These functions are appropriately declared as static.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.
Learnt from: RileyWen
PR: PKUHPC/CraneSched#456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Utilities/PublicHeader/Logger.cpp:21-21
Timestamp: 2025-06-23T07:47:44.485Z
Learning: In the CraneSched codebase, static logger sinks like default_sinks in src/Utilities/PublicHeader/Logger.cpp are initialized during startup in a single-threaded context and then become read-only during normal operation, eliminating the need for synchronization protection.
Learnt from: RileyWen
PR: PKUHPC/CraneSched#456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: The `phmap::parallel_flat_hash_set` used in CraneSched's `VaultClientWrapper` is from the Parallel Hashmap library, which provides thread-safe containers for concurrent access. It uses internal partitioning to reduce contention, making it safe for use in multithreaded environments without external synchronization.
src/Craned/CranedPublicDefs.h (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
src/CraneCtld/TaskScheduler.cpp (14)

undefined

<retrieved_learning>
Learnt from: huerni
PR: #537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
</retrieved_learning>

<retrieved_learning>
Learnt from: huerni
PR: #537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the execution_node field in JobToD is intentionally set to the first element of executing_craned_ids vector without guards, as it represents the main execution node for a job. This is by design and assumes executing_craned_ids is never empty when GetJobOfNode is called.
</retrieved_learning>

<retrieved_learning>
Learnt from: huerni
PR: #469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
</retrieved_learning>

<retrieved_learning>
Learnt from: huerni
PR: #469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.
</retrieved_learning>

<retrieved_learning>
Learnt from: 1daidai1
PR: #458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
</retrieved_learning>

<retrieved_learning>
Learnt from: 1daidai1
PR: #458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
</retrieved_learning>

src/CraneCtld/CtldPublicDefs.h (2)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
src/CraneCtld/DbClient.cpp (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#528
File: src/CraneCtld/DbClient.cpp:0-0
Timestamp: 2025-06-23T01:34:41.639Z
Learning: MemoryBytes() method returns uint64_t type in the CraneSched codebase, and when storing in MongoDB BSON documents, it should be cast to int64_t to maintain consistency with the existing schema and enable numeric database operations.
🧬 Code Graph Analysis (3)
src/Craned/Craned.cpp (1)
src/Utilities/PublicHeader/OS.cpp (2)
  • GetCpuTopologyInfo (208-225)
  • GetCpuTopologyInfo (208-208)
src/Craned/CtldClient.cpp (1)
src/Craned/CranedPublicDefs.h (1)
  • g_config (108-108)
src/Utilities/PublicHeader/include/crane/OS.h (1)
src/Utilities/PublicHeader/OS.cpp (2)
  • GetCpuTopologyInfo (208-225)
  • GetCpuTopologyInfo (208-208)
🔇 Additional comments (20)
protos/PublicDefs.proto (4)

584-588: LGTM: Well-structured topology information message.

The TopologyInfo message properly encapsulates CPU topology data with appropriate field types and numbering.


168-168: LGTM: Consistent field addition for task filtering.

The cores_per_socket field addition enables CPU topology-aware task scheduling. Field number 38 is properly assigned.


243-243: LGTM: Consistent field addition across task messages.

The cores_per_socket field is consistently added to both TaskToCtld and TaskToD messages for proper task execution context.


599-599: LGTM: Proper integration of topology info in node metadata.

The topology_info field addition to CranedRemoteMeta enables nodes to report their CPU topology to the control daemon.

CMakeLists.txt (1)

342-344: LGTM: Proper integration of mandatory HWLOC dependency.

The HWLOC package is correctly added as a required dependency to support the new CPU topology detection functionality. The placement after LibAIO and before conditional BPF check is appropriate.

src/Craned/CranedPublicDefs.h (1)

96-96: LGTM: Proper extension of node metadata structure.

The TopologyInfoOfThisNode member addition appropriately extends the CranedMeta struct to include CPU topology information, maintaining consistency with other metadata fields.

src/Craned/Craned.cpp (1)

579-584: LGTM: Well-implemented CPU topology detection with proper error handling.

The topology detection code is properly integrated into the configuration parsing process with appropriate error logging and diagnostic output. The placement after system release info gathering is logical.

src/Utilities/PublicHeader/CMakeLists.txt (1)

19-19: LGTM: Proper library linkage for CPU topology functionality.

The HWLOC::hwloc library is correctly linked to enable the new GetCpuTopologyInfo function implementation in the utilities layer.

src/CraneCtld/RpcService/CranedKeeper.cpp (1)

300-300: LGTM - Correctly propagates CPU topology information.

The addition of cores_per_socket to the task execution request properly extends the RPC interface to include CPU topology data. This enables compute nodes to receive and utilize topology information for task scheduling and execution.

src/Craned/CtldClient.cpp (1)

401-405: LGTM - Properly integrates CPU topology into node registration.

The addition of topology information to the node registration request correctly enables the controller to be aware of each compute node's CPU characteristics. The implementation follows the established pattern for populating gRPC message fields and is well-positioned within the registration sequence.

src/Utilities/PublicHeader/OS.cpp (1)

208-225: Excellent implementation with proper resource management.

The GetCpuTopologyInfo function demonstrates several good practices:

  • Proper hwloc resource management with cleanup on all code paths
  • Safe arithmetic operations with zero-division protection
  • Clear error handling with early returns on initialization failures
  • Appropriate use of hwloc object types for accurate topology detection

The calculation logic correctly derives cores per socket and threads per core from the hardware topology data.

src/Utilities/PublicHeader/include/crane/OS.h (2)

38-42: Clean and well-designed data structure.

The TopologyInfo struct provides a clear interface for CPU topology data with appropriately typed fields. The use of uint32_t is suitable for hardware count values, and the field names are descriptive and unambiguous.


27-27: Appropriate header additions for topology functionality.

The hwloc.h include is necessary for the topology detection implementation, and the GetCpuTopologyInfo function declaration provides a clean interface with clear input/output semantics using a pointer parameter and boolean return value.

Also applies to: 77-77

CMakeModule/FindHWLOC.cmake (3)

16-27: LGTM! Well-structured library discovery.

The find_path and find_library calls follow CMake best practices with proper variable naming and effective use of pkg-config hints.


28-35: LGTM! Proper use of standard CMake error handling.

The find_package_handle_standard_args call correctly validates required variables, handles version checking, and provides a clear error message.


36-42: LGTM! Proper imported target creation.

The conditional target creation and interface property setup follow CMake best practices, ensuring no duplicate targets are created while properly exposing include directories and libraries.

src/CraneCtld/DbClient.cpp (2)

942-1007: LGTM! Consistent addition of cores_per_socket field.

The changes systematically add the cores_per_socket field to the MongoDB document:

  • Array size correctly increased from 37 to 38
  • Field added at the correct position
  • Proper int64_t casting for MongoDB storage (consistent with learned practices)
  • Comments updated accurately
  • Value extracted from the correct protobuf field

1033-1090: LGTM! Consistent implementation across both document conversion functions.

The TaskInCtldToDocument_ function changes mirror those in TaskInEmbeddedDbToDocument_, maintaining consistency:

  • Identical field positioning and array size updates
  • Same int64_t casting approach
  • Consistent value extraction from task->cores_per_socket
src/CraneCtld/CtldPublicDefs.h (2)

206-228: LGTM! Proper integration of topology information into node metadata.

The changes correctly:

  • Add TopologyInfo topology_info field to the struct
  • Update the constructor to copy all topology fields from the gRPC message
  • Maintain consistency with the protobuf TopologyInfo message structure

404-624: LGTM! Well-placed addition of cores_per_socket to task metadata.

The changes appropriately:

  • Add cores_per_socket as uint32_t in the submission-time fields section
  • Update SetFieldsByTaskToCtld to properly extract the value from the protobuf message
  • Maintain consistency with the field placement and initialization patterns

@huerni huerni added enhancement New feature or request build Build system change labels Jul 17, 2025
@huerni
Copy link
Collaborator Author

huerni commented Jul 17, 2025

@coderabbitai full review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jul 17, 2025

✅ Actions performed

Full review triggered.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🔭 Outside diff range comments (1)
src/Utilities/PublicHeader/CMakeLists.txt (1)

15-20: Link hwloc privately to avoid leaking the dependency

OS.cpp is the only TU that actually calls into hwloc; consumers of Utility_PublicHeader only see plain POD types in the headers.
Use PRIVATE instead of PUBLIC to keep downstream targets from inheriting the hwloc link requirement.

-target_link_libraries(Utility_PublicHeader PUBLIC
+target_link_libraries(Utility_PublicHeader
+        PUBLIC
         spdlog::spdlog
         crane_proto_lib
         fpm
-        HWLOC::hwloc
+        PRIVATE HWLOC::hwloc
 )
🧹 Nitpick comments (3)
CMakeLists.txt (1)

342-344: Make the hwloc dependency optional or version-gated

find_package(HWLOC REQUIRED) hard-fails the build on any environment that lacks the library (e.g. Mac CI runners, minimal containers).
Either (a) gate this behind an OPTION(CRANE_ENABLE_HWLOC … ON) or (b) specify a minimal version and fall back gracefully:

option(CRANE_ENABLE_HWLOC "Enable hardware-topology support via hwloc" ON)

if (CRANE_ENABLE_HWLOC)
  find_package(HWLOC 2.0 REQUIRED)
else()
  message(WARNING "Building without hwloc – CPU-topology features disabled")
endif()

This keeps the build portable while still enforcing the dependency when the feature is requested.

src/Craned/CranedPublicDefs.h (1)

96-97: Naming nuance: drop the “OfThisNode” suffix

Config::CranedMeta already represents this node. TopologyInfoOfThisNode is therefore verbose and inconsistent with SysInfo.
Recommend:

-    TopologyInfo TopologyInfoOfThisNode;
+    TopologyInfo TopologyInfo;

Keeps the API concise and self-explanatory.

src/CraneCtld/TaskScheduler.cpp (1)

2568-2572: Logging level & message are misleading

The debug line is reached only when the node passes the check; however the text reads “is sufficient for the task” which is more informational than debug and will spam logs for every candidate node.
Consider either:

  1. Downgrading to CRANE_TRACE or
  2. Printing only when the node was rejected, including the requested / available cps values, to help users diagnose why a job is pending.

No diff provided – up to maintainer’s preference.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df6b06d and 29a5cd1.

📒 Files selected for processing (13)
  • CMakeLists.txt (1 hunks)
  • CMakeModule/FindHWLOC.cmake (1 hunks)
  • protos/PublicDefs.proto (4 hunks)
  • src/CraneCtld/CtldPublicDefs.h (4 hunks)
  • src/CraneCtld/DbClient.cpp (8 hunks)
  • src/CraneCtld/RpcService/CranedKeeper.cpp (1 hunks)
  • src/CraneCtld/TaskScheduler.cpp (2 hunks)
  • src/Craned/Craned.cpp (1 hunks)
  • src/Craned/CranedPublicDefs.h (1 hunks)
  • src/Craned/CtldClient.cpp (1 hunks)
  • src/Utilities/PublicHeader/CMakeLists.txt (1 hunks)
  • src/Utilities/PublicHeader/OS.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/OS.h (3 hunks)
🧰 Additional context used
🧠 Learnings (9)
src/Craned/Craned.cpp (5)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/include/crane/CpuFrequency.h:57-67
Timestamp: 2025-06-30T08:57:12.915Z
Learning: In CraneSched's CpuFrequency class, the static member functions (ParseCpuList_, CpuFreqGetScalingFreq_, CpuFreqSetGov_, CpuFreqSetScalingFreq_, SetCpuOwnerLock_, TestCpuOwnerLock_, FdLockRetry_) are utility functions that only interact with the system via file I/O and locking without accessing instance members like m_cpu_freq_data_. These functions are appropriately declared as static.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/CpuFrequency.cpp:56-77
Timestamp: 2025-06-30T08:51:58.728Z
Learning: In CraneSched's CpuFrequency implementation, the scaling_available_governors file contains only one governor per CPU (not multiple space-separated governors as typically seen in standard Linux systems), so the current if-else parsing logic in CpuFrequency::Init() is correct for reading a single governor string.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
src/Utilities/PublicHeader/CMakeLists.txt (2)
Learnt from: RileyWen
PR: PKUHPC/CraneSched#456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.
Learnt from: huerni
PR: PKUHPC/CraneSched#469
File: src/Craned/TaskManager.cpp:0-0
Timestamp: 2025-05-08T07:38:42.362Z
Learning: In CraneSched's PMIx integration, the `g_pmix_server->SetupFork()` function must be called in the child process after fork() and before exec() to properly set up the PMIx environment variables.
src/Craned/CranedPublicDefs.h (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
src/Craned/CtldClient.cpp (3)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
src/CraneCtld/RpcService/CranedKeeper.cpp (10)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
PR: PKUHPC/CraneSched#469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Learnt from: huerni
PR: PKUHPC/CraneSched#469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
Learnt from: 1daidai1
PR: PKUHPC/CraneSched#458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
src/Utilities/PublicHeader/include/crane/OS.h (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/include/crane/CpuFrequency.h:57-67
Timestamp: 2025-06-30T08:57:12.915Z
Learning: In CraneSched's CpuFrequency class, the static member functions (ParseCpuList_, CpuFreqGetScalingFreq_, CpuFreqSetGov_, CpuFreqSetScalingFreq_, SetCpuOwnerLock_, TestCpuOwnerLock_, FdLockRetry_) are utility functions that only interact with the system via file I/O and locking without accessing instance members like m_cpu_freq_data_. These functions are appropriately declared as static.
src/CraneCtld/DbClient.cpp (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#528
File: src/CraneCtld/DbClient.cpp:0-0
Timestamp: 2025-06-23T01:34:41.639Z
Learning: MemoryBytes() method returns uint64_t type in the CraneSched codebase, and when storing in MongoDB BSON documents, it should be cast to int64_t to maintain consistency with the existing schema and enable numeric database operations.
src/CraneCtld/TaskScheduler.cpp (15)

undefined

<retrieved_learning>
Learnt from: huerni
PR: #537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
</retrieved_learning>

<retrieved_learning>
Learnt from: huerni
PR: #537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
</retrieved_learning>

<retrieved_learning>
Learnt from: huerni
PR: #469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
</retrieved_learning>

<retrieved_learning>
Learnt from: huerni
PR: #469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched contains assertions in its overloaded operators (like -=) that prevent negative values from occurring when resources are allocated or deallocated.
</retrieved_learning>

<retrieved_learning>
Learnt from: 1daidai1
PR: #458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
</retrieved_learning>

<retrieved_learning>
Learnt from: 1daidai1
PR: #458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the execution_node field in JobToD is intentionally set to the first element of executing_craned_ids vector without guards, as it represents the main execution node for a job. This is by design and assumes executing_craned_ids is never empty when GetJobOfNode is called.
</retrieved_learning>

<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
</retrieved_learning>

src/CraneCtld/CtldPublicDefs.h (2)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
🧬 Code Graph Analysis (3)
src/Craned/Craned.cpp (1)
src/Utilities/PublicHeader/OS.cpp (2)
  • GetCpuTopologyInfo (208-226)
  • GetCpuTopologyInfo (208-208)
src/Craned/CtldClient.cpp (1)
src/Craned/CranedPublicDefs.h (1)
  • g_config (108-108)
src/Utilities/PublicHeader/include/crane/OS.h (1)
src/Utilities/PublicHeader/OS.cpp (2)
  • GetCpuTopologyInfo (208-226)
  • GetCpuTopologyInfo (208-208)
🔇 Additional comments (25)
src/CraneCtld/RpcService/CranedKeeper.cpp (1)

298-301: Zero-initialised cores_per_socket may mislead craned

TaskToCtld().cores_per_socket() is 0 unless explicitly set during submission; propagating a 0 can be interpreted as “no constraint”.
Consider sending the value only when > 0, or defaulting to the target node’s physical value.

if (task->TaskToCtld().cores_per_socket() > 0)
  mutable_task->set_cores_per_socket(task->TaskToCtld().cores_per_socket());

Prevents ambiguity between “unset” and “zero”.

src/Craned/CtldClient.cpp (1)

401-405: LGTM! Topology information integration is well-implemented.

The CPU topology information is properly extracted from the global configuration and populated into the gRPC registration request metadata. The implementation follows the established pattern used for system release information and maintains consistency with the protobuf schema changes.

src/Utilities/PublicHeader/OS.cpp (1)

208-226: LGTM! Robust hwloc integration with proper error handling.

The implementation correctly uses the hwloc library to gather CPU topology information. Key strengths:

  • Proper hwloc initialization and loading with error checks
  • Division by zero protection in calculations (lines 220-222)
  • Consistent cleanup of hwloc topology object in both success and failure paths
  • Appropriate return value indicating success/failure

The function follows hwloc best practices and provides reliable topology detection.

protos/PublicDefs.proto (4)

584-588: Well-designed TopologyInfo message structure.

The new TopologyInfo message appropriately groups related CPU topology fields with logical field numbering and suitable uint32 types for hardware counts.


168-168: Appropriate integration of cores_per_socket in TaskToCtld.

The cores_per_socket field is properly added to the task submission metadata with a unique field number (38) and appropriate uint32 type.


243-243: Consistent cores_per_socket field in TaskToD.

The cores_per_socket field maintains consistency between task submission and execution messages with proper field numbering (25).


599-599: Logical integration of topology_info in CranedRemoteMeta.

The TopologyInfo field is appropriately included in the node metadata message, enabling topology information propagation during node registration.

src/Utilities/PublicHeader/include/crane/OS.h (3)

22-22: Appropriate hwloc header inclusion.

The hwloc.h include is necessary for the hwloc types used in the CPU topology detection implementation.


38-42: Well-structured TopologyInfo definition.

The TopologyInfo struct appropriately groups CPU topology fields with uint32_t types, maintaining consistency with the protobuf schema and implementation requirements.


77-77: Consistent function declaration.

The GetCpuTopologyInfo function declaration matches the implementation in OS.cpp and follows the established pattern for utility functions in this namespace.

CMakeModule/FindHWLOC.cmake (1)

1-42: Excellent CMake find module following modern practices.

This find module demonstrates excellent CMake practices:

  • Uses pkg-config for initial discovery, appropriate for hwloc
  • Includes proper fallback search mechanisms
  • Follows standard CMake conventions with find_package_handle_standard_args
  • Creates a modern imported target HWLOC::hwloc with appropriate interface properties
  • Provides helpful status messages for debugging

The implementation is robust and should reliably locate hwloc across different systems.

src/CraneCtld/CtldPublicDefs.h (4)

206-206: LGTM: Topology info field addition is well-integrated.

The addition of the topology_info field to CranedRemoteMeta follows the existing pattern and integrates well with the struct's design.


225-229: LGTM: Constructor correctly populates topology information.

The constructor properly extracts and assigns the topology fields from the gRPC message. The field assignments are consistent with the protobuf schema and follow the existing initialization pattern.


406-406: LGTM: Field addition follows consistent naming and type conventions.

The cores_per_socket field uses the appropriate uint32_t type and follows the existing naming convention in the struct.


626-626: LGTM: Field assignment is correctly integrated.

The assignment of cores_per_socket from the protobuf message is properly placed and follows the existing pattern in SetFieldsByTaskToCtld.

src/CraneCtld/DbClient.cpp (10)

942-942: LGTM: Comment updated to reflect new field count.

The comment correctly reflects the addition of the cores_per_socket field at index 37.


945-945: LGTM: Field array size correctly updated.

The array size is properly increased from 37 to 38 to accommodate the new cores_per_socket field.


961-961: LGTM: Field name added in correct position.

The "cores_per_socket" field is properly added at the end of the fields array, maintaining consistency with the indexing pattern.


972-972: LGTM: Tuple type correctly extended.

The tuple type signature is properly updated to include int64_t for the new field, maintaining type consistency.


1006-1007: LGTM: Value extraction and casting follows established pattern.

The cores_per_socket value is correctly extracted from the protobuf message and cast to int64_t for MongoDB storage, following the learned pattern for numeric fields in BSON documents.


1033-1033: LGTM: Comment correctly updated for TaskInCtldToDocument_.

The comment properly reflects the new field structure for the second document conversion function.


1036-1036: LGTM: Field array size consistently updated.

The array size is correctly maintained at 38 fields in both document conversion functions.


1052-1052: LGTM: Field name consistently added.

The "cores_per_socket" field is properly positioned at the end of the fields array in both functions.


1063-1063: LGTM: Tuple type signature consistently updated.

The tuple type is properly extended with int64_t in both document conversion functions.


1089-1089: LGTM: Task field access and casting is correct.

The cores_per_socket field is correctly accessed from the TaskInCtld object and cast to int64_t for MongoDB storage, maintaining consistency with the first conversion function.

@huerni huerni marked this pull request as draft July 17, 2025 06:13
@huerni huerni force-pushed the dev/cores-per-socket branch from 29a5cd1 to 4f26b62 Compare July 17, 2025 08:24
@huerni huerni marked this pull request as ready for review July 17, 2025 09:10
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29a5cd1 and cf25d79.

📒 Files selected for processing (9)
  • etc/config.yaml (1 hunks)
  • src/CraneCtld/CraneCtld.cpp (1 hunks)
  • src/CraneCtld/CranedMetaContainer.cpp (2 hunks)
  • src/CraneCtld/CtldPublicDefs.h (6 hunks)
  • src/CraneCtld/DbClient.cpp (8 hunks)
  • src/CraneCtld/TaskScheduler.cpp (3 hunks)
  • src/Craned/Craned.cpp (1 hunks)
  • src/Utilities/PublicHeader/OS.cpp (1 hunks)
  • src/Utilities/PublicHeader/include/crane/OS.h (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • etc/config.yaml
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/Craned/Craned.cpp
  • src/Utilities/PublicHeader/include/crane/OS.h
  • src/CraneCtld/TaskScheduler.cpp
  • src/Utilities/PublicHeader/OS.cpp
  • src/CraneCtld/CtldPublicDefs.h
  • src/CraneCtld/DbClient.cpp
🧰 Additional context used
🧠 Learnings (2)
src/CraneCtld/CraneCtld.cpp (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/CpuFrequency.cpp:56-77
Timestamp: 2025-06-30T08:51:58.728Z
Learning: In CraneSched's CpuFrequency implementation, the scaling_available_governors file contains only one governor per CPU (not multiple space-separated governors as typically seen in standard Linux systems), so the current if-else parsing logic in CpuFrequency::Init() is correct for reading a single governor string.
src/CraneCtld/CranedMetaContainer.cpp (8)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/include/crane/CpuFrequency.h:57-67
Timestamp: 2025-06-30T08:57:12.915Z
Learning: In CraneSched's CpuFrequency class, the static member functions (ParseCpuList_, CpuFreqGetScalingFreq_, CpuFreqSetGov_, CpuFreqSetScalingFreq_, SetCpuOwnerLock_, TestCpuOwnerLock_, FdLockRetry_) are utility functions that only interact with the system via file I/O and locking without accessing instance members like m_cpu_freq_data_. These functions are appropriately declared as static.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
🔇 Additional comments (1)
src/CraneCtld/CranedMetaContainer.cpp (1)

313-314: LGTM! Consistent configuration initialization.

The initialization of cores_per_socket from the node configuration follows the established pattern used for other topology fields in this method.

@L-Xiafeng L-Xiafeng requested a review from Copilot July 17, 2025 09:34
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds CPU topology detection using hwloc and integrates this information into node registration, metadata, and scheduling.

  • Introduces a new TopologyInfo struct and GetCpuTopologyInfo API to gather sockets, cores per socket, and threads per core.
  • Propagates topology info through gRPC definitions, configuration, database schema, and the CtldClient registration flow.
  • Updates the scheduler (TaskScheduler) and metadata containers (CranedMetaContainer) to enforce and warn on cores-per-socket constraints.

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/Utilities/PublicHeader/include/crane/OS.h Added TopologyInfo and GetCpuTopologyInfo declaration
src/Utilities/PublicHeader/OS.cpp Implemented CPU topology detection with hwloc
src/Craned/CtldClient.cpp Included topology info in CranedRegister_ gRPC metadata
src/CraneCtld/TaskScheduler.cpp Enforced cores-per-socket constraint in scheduling loops
src/CraneCtld/CranedMetaContainer.cpp Initialized and warned about configured vs. actual cores/socket
protos/PublicDefs.proto Added TopologyInfo message and new field in task messages
Comments suppressed due to low confidence (1)

src/Utilities/PublicHeader/OS.cpp:208

  • Consider adding unit tests for GetCpuTopologyInfo to verify correct detection of socket_count, cores_per_socket, and threads_per_core under various hardware topologies.
bool GetCpuTopologyInfo(TopologyInfo* info) {

Comment on lines +209 to +214
hwloc_topology_t topology;
if (hwloc_topology_init(&topology) != 0) return false;
if (hwloc_topology_load(topology) != 0) {
hwloc_topology_destroy(topology);
return false;
}
Copy link

Copilot AI Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initializing and loading the hwloc topology on each call can be expensive. Consider caching the topology or using a singleton to avoid repeated initialization for multiple calls.

Suggested change
hwloc_topology_t topology;
if (hwloc_topology_init(&topology) != 0) return false;
if (hwloc_topology_load(topology) != 0) {
hwloc_topology_destroy(topology);
return false;
}
static hwloc_topology_t topology = []() {
hwloc_topology_t t;
if (hwloc_topology_init(&t) != 0) {
CRANE_ERROR("Failed to initialize hwloc topology.");
std::terminate();
}
if (hwloc_topology_load(t) != 0) {
CRANE_ERROR("Failed to load hwloc topology.");
hwloc_topology_destroy(t);
std::terminate();
}
return t;
}();

Copilot uses AI. Check for mistakes.
@L-Xiafeng L-Xiafeng marked this pull request as draft December 10, 2025 03:36
@L-Xiafeng
Copy link
Collaborator

为了CorePerSocket加这么多功能目前好像不是特别急需?目前不处理吧

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build Build system change enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

新增 cbatch --cores-per-socket=<cores> 功能

3 participants