-
Notifications
You must be signed in to change notification settings - Fork 34
Feat: cbatch --cores-per-socket=<cores> 功能 #551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
WalkthroughThis change introduces hardware topology awareness to the project by integrating the hwloc library. It adds new CMake configuration for hwloc, updates protobuf and internal data structures to carry CPU topology information, implements topology detection using hwloc, and propagates this information through node registration, task scheduling, and persistence logic. Changes
Sequence Diagram(s)sequenceDiagram
participant Craned
participant OSUtil
participant CtldClient
participant CraneCtld
participant Scheduler
Craned->>OSUtil: GetCpuTopologyInfo()
OSUtil-->>Craned: TopologyInfo struct
Craned->>CtldClient: Prepare registration (includes TopologyInfo)
CtldClient->>CraneCtld: Register node (with TopologyInfo)
CraneCtld->>Scheduler: Update node meta (TopologyInfo available)
Scheduler->>Scheduler: Filter nodes by cores_per_socket for tasks
Suggested reviewers
📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (4)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/CraneCtld/TaskScheduler.cpp (2)
563-570: LGTM! CPU topology filtering logic is correct.The topology filtering correctly excludes nodes that cannot satisfy the task's cores-per-socket requirement. The debug logging is helpful for troubleshooting scheduling decisions.
Consider adding validation to ensure
topology_info.cores_per_socketis properly initialized (non-zero) to handle edge cases where topology information might be missing or invalid.
644-651: Consider refactoring to eliminate code duplication.This topology filtering logic is identical to the one at lines 563-570. While the logic is correct, having the same validation in multiple places within the same method increases maintenance overhead.
Consider extracting this logic into a helper function like:
bool IsNodeTopologyCompatible(const TaskInCtld* task, const CranedMeta* craned_meta) { if (task->cores_per_socket > craned_meta->remote_meta.topology_info.cores_per_socket) { return false; } CRANE_DEBUG("The number of cores per CPU socket on this Craned {} is sufficient for the task {}.", craned_meta->static_meta.hostname, task->TaskId()); return true; }This would improve maintainability and ensure consistent behavior across both code paths.
CMakeModule/FindHWLOC.cmake (1)
14-14: Consider using CRANE_TRACE or making this message conditional.The debug message will always be printed during CMake configuration, which might be too verbose for normal builds. Consider using a conditional message based on a debug flag or switching to a trace-level message.
-message(STATUS "PkgConfig: HWLOC INCLUDE_DIR:${PC_HWLOC_INCLUDE_DIRS}, LIB_DIR:${PC_HWLOC_LIBRARY_DIRS}, Ver: ${PC_HWLOC_VERSION}") +if(CMAKE_VERBOSE_MAKEFILE OR HWLOC_DEBUG) + message(STATUS "PkgConfig: HWLOC INCLUDE_DIR:${PC_HWLOC_INCLUDE_DIRS}, LIB_DIR:${PC_HWLOC_LIBRARY_DIRS}, Ver: ${PC_HWLOC_VERSION}") +endif()
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
CMakeLists.txt(1 hunks)CMakeModule/FindHWLOC.cmake(1 hunks)protos/PublicDefs.proto(4 hunks)src/CraneCtld/CtldPublicDefs.h(4 hunks)src/CraneCtld/DbClient.cpp(8 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(1 hunks)src/CraneCtld/TaskScheduler.cpp(2 hunks)src/Craned/Craned.cpp(1 hunks)src/Craned/CranedPublicDefs.h(1 hunks)src/Craned/CtldClient.cpp(1 hunks)src/Utilities/PublicHeader/CMakeLists.txt(1 hunks)src/Utilities/PublicHeader/OS.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/OS.h(3 hunks)
🧰 Additional context used
🧠 Learnings (9)
src/Utilities/PublicHeader/CMakeLists.txt (1)
Learnt from: RileyWen
PR: PKUHPC/CraneSched#456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.
src/Craned/Craned.cpp (5)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/include/crane/CpuFrequency.h:57-67
Timestamp: 2025-06-30T08:57:12.915Z
Learning: In CraneSched's CpuFrequency class, the static member functions (ParseCpuList_, CpuFreqGetScalingFreq_, CpuFreqSetGov_, CpuFreqSetScalingFreq_, SetCpuOwnerLock_, TestCpuOwnerLock_, FdLockRetry_) are utility functions that only interact with the system via file I/O and locking without accessing instance members like m_cpu_freq_data_. These functions are appropriately declared as static.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/CpuFrequency.cpp:56-77
Timestamp: 2025-06-30T08:51:58.728Z
Learning: In CraneSched's CpuFrequency implementation, the scaling_available_governors file contains only one governor per CPU (not multiple space-separated governors as typically seen in standard Linux systems), so the current if-else parsing logic in CpuFrequency::Init() is correct for reading a single governor string.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
src/CraneCtld/RpcService/CranedKeeper.cpp (10)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
PR: PKUHPC/CraneSched#469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Learnt from: huerni
PR: PKUHPC/CraneSched#469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
Learnt from: 1daidai1
PR: PKUHPC/CraneSched#458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is handled/updated separately before calling SetAllocatedRes, so it does not need to be updated again within the method itself.
src/Craned/CtldClient.cpp (3)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
src/Utilities/PublicHeader/include/crane/OS.h (7)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/include/crane/CpuFrequency.h:57-67
Timestamp: 2025-06-30T08:57:12.915Z
Learning: In CraneSched's CpuFrequency class, the static member functions (ParseCpuList_, CpuFreqGetScalingFreq_, CpuFreqSetGov_, CpuFreqSetScalingFreq_, SetCpuOwnerLock_, TestCpuOwnerLock_, FdLockRetry_) are utility functions that only interact with the system via file I/O and locking without accessing instance members like m_cpu_freq_data_. These functions are appropriately declared as static.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/Craned/TaskManager.cpp:1509-1517
Timestamp: 2025-05-09T01:54:39.465Z
Learning: The CraneSched project uses C++23, which supports Class Template Argument Deduction (CTAD) for standard containers like std::set and includes ranges support, making std::ranges::views::keys valid without additional headers.
Learnt from: RileyWen
PR: PKUHPC/CraneSched#456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Utilities/PublicHeader/Logger.cpp:21-21
Timestamp: 2025-06-23T07:47:44.485Z
Learning: In the CraneSched codebase, static logger sinks like default_sinks in src/Utilities/PublicHeader/Logger.cpp are initialized during startup in a single-threaded context and then become read-only during normal operation, eliminating the need for synchronization protection.
Learnt from: RileyWen
PR: PKUHPC/CraneSched#456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: The `phmap::parallel_flat_hash_set` used in CraneSched's `VaultClientWrapper` is from the Parallel Hashmap library, which provides thread-safe containers for concurrent access. It uses internal partitioning to reduce contention, making it safe for use in multithreaded environments without external synchronization.
src/Craned/CranedPublicDefs.h (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
src/CraneCtld/TaskScheduler.cpp (14)
undefined
<retrieved_learning>
Learnt from: huerni
PR: #537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
</retrieved_learning>
<retrieved_learning>
Learnt from: huerni
PR: #537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the execution_node field in JobToD is intentionally set to the first element of executing_craned_ids vector without guards, as it represents the main execution node for a job. This is by design and assumes executing_craned_ids is never empty when GetJobOfNode is called.
</retrieved_learning>
<retrieved_learning>
Learnt from: huerni
PR: #469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
</retrieved_learning>
<retrieved_learning>
Learnt from: huerni
PR: #469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.
</retrieved_learning>
<retrieved_learning>
Learnt from: 1daidai1
PR: #458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
</retrieved_learning>
<retrieved_learning>
Learnt from: 1daidai1
PR: #458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
</retrieved_learning>
src/CraneCtld/CtldPublicDefs.h (2)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
src/CraneCtld/DbClient.cpp (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#528
File: src/CraneCtld/DbClient.cpp:0-0
Timestamp: 2025-06-23T01:34:41.639Z
Learning: MemoryBytes() method returns uint64_t type in the CraneSched codebase, and when storing in MongoDB BSON documents, it should be cast to int64_t to maintain consistency with the existing schema and enable numeric database operations.
🧬 Code Graph Analysis (3)
src/Craned/Craned.cpp (1)
src/Utilities/PublicHeader/OS.cpp (2)
GetCpuTopologyInfo(208-225)GetCpuTopologyInfo(208-208)
src/Craned/CtldClient.cpp (1)
src/Craned/CranedPublicDefs.h (1)
g_config(108-108)
src/Utilities/PublicHeader/include/crane/OS.h (1)
src/Utilities/PublicHeader/OS.cpp (2)
GetCpuTopologyInfo(208-225)GetCpuTopologyInfo(208-208)
🔇 Additional comments (20)
protos/PublicDefs.proto (4)
584-588: LGTM: Well-structured topology information message.The
TopologyInfomessage properly encapsulates CPU topology data with appropriate field types and numbering.
168-168: LGTM: Consistent field addition for task filtering.The
cores_per_socketfield addition enables CPU topology-aware task scheduling. Field number 38 is properly assigned.
243-243: LGTM: Consistent field addition across task messages.The
cores_per_socketfield is consistently added to bothTaskToCtldandTaskToDmessages for proper task execution context.
599-599: LGTM: Proper integration of topology info in node metadata.The
topology_infofield addition toCranedRemoteMetaenables nodes to report their CPU topology to the control daemon.CMakeLists.txt (1)
342-344: LGTM: Proper integration of mandatory HWLOC dependency.The HWLOC package is correctly added as a required dependency to support the new CPU topology detection functionality. The placement after LibAIO and before conditional BPF check is appropriate.
src/Craned/CranedPublicDefs.h (1)
96-96: LGTM: Proper extension of node metadata structure.The
TopologyInfoOfThisNodemember addition appropriately extends theCranedMetastruct to include CPU topology information, maintaining consistency with other metadata fields.src/Craned/Craned.cpp (1)
579-584: LGTM: Well-implemented CPU topology detection with proper error handling.The topology detection code is properly integrated into the configuration parsing process with appropriate error logging and diagnostic output. The placement after system release info gathering is logical.
src/Utilities/PublicHeader/CMakeLists.txt (1)
19-19: LGTM: Proper library linkage for CPU topology functionality.The
HWLOC::hwloclibrary is correctly linked to enable the newGetCpuTopologyInfofunction implementation in the utilities layer.src/CraneCtld/RpcService/CranedKeeper.cpp (1)
300-300: LGTM - Correctly propagates CPU topology information.The addition of
cores_per_socketto the task execution request properly extends the RPC interface to include CPU topology data. This enables compute nodes to receive and utilize topology information for task scheduling and execution.src/Craned/CtldClient.cpp (1)
401-405: LGTM - Properly integrates CPU topology into node registration.The addition of topology information to the node registration request correctly enables the controller to be aware of each compute node's CPU characteristics. The implementation follows the established pattern for populating gRPC message fields and is well-positioned within the registration sequence.
src/Utilities/PublicHeader/OS.cpp (1)
208-225: Excellent implementation with proper resource management.The
GetCpuTopologyInfofunction demonstrates several good practices:
- Proper hwloc resource management with cleanup on all code paths
- Safe arithmetic operations with zero-division protection
- Clear error handling with early returns on initialization failures
- Appropriate use of hwloc object types for accurate topology detection
The calculation logic correctly derives cores per socket and threads per core from the hardware topology data.
src/Utilities/PublicHeader/include/crane/OS.h (2)
38-42: Clean and well-designed data structure.The
TopologyInfostruct provides a clear interface for CPU topology data with appropriately typed fields. The use ofuint32_tis suitable for hardware count values, and the field names are descriptive and unambiguous.
27-27: Appropriate header additions for topology functionality.The hwloc.h include is necessary for the topology detection implementation, and the
GetCpuTopologyInfofunction declaration provides a clean interface with clear input/output semantics using a pointer parameter and boolean return value.Also applies to: 77-77
CMakeModule/FindHWLOC.cmake (3)
16-27: LGTM! Well-structured library discovery.The find_path and find_library calls follow CMake best practices with proper variable naming and effective use of pkg-config hints.
28-35: LGTM! Proper use of standard CMake error handling.The find_package_handle_standard_args call correctly validates required variables, handles version checking, and provides a clear error message.
36-42: LGTM! Proper imported target creation.The conditional target creation and interface property setup follow CMake best practices, ensuring no duplicate targets are created while properly exposing include directories and libraries.
src/CraneCtld/DbClient.cpp (2)
942-1007: LGTM! Consistent addition of cores_per_socket field.The changes systematically add the
cores_per_socketfield to the MongoDB document:
- Array size correctly increased from 37 to 38
- Field added at the correct position
- Proper
int64_tcasting for MongoDB storage (consistent with learned practices)- Comments updated accurately
- Value extracted from the correct protobuf field
1033-1090: LGTM! Consistent implementation across both document conversion functions.The
TaskInCtldToDocument_function changes mirror those inTaskInEmbeddedDbToDocument_, maintaining consistency:
- Identical field positioning and array size updates
- Same
int64_tcasting approach- Consistent value extraction from
task->cores_per_socketsrc/CraneCtld/CtldPublicDefs.h (2)
206-228: LGTM! Proper integration of topology information into node metadata.The changes correctly:
- Add
TopologyInfo topology_infofield to the struct- Update the constructor to copy all topology fields from the gRPC message
- Maintain consistency with the protobuf
TopologyInfomessage structure
404-624: LGTM! Well-placed addition of cores_per_socket to task metadata.The changes appropriately:
- Add
cores_per_socketasuint32_tin the submission-time fields section- Update
SetFieldsByTaskToCtldto properly extract the value from the protobuf message- Maintain consistency with the field placement and initialization patterns
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🔭 Outside diff range comments (1)
src/Utilities/PublicHeader/CMakeLists.txt (1)
15-20: Link hwloc privately to avoid leaking the dependency
OS.cppis the only TU that actually calls into hwloc; consumers ofUtility_PublicHeaderonly see plain POD types in the headers.
UsePRIVATEinstead ofPUBLICto keep downstream targets from inheriting the hwloc link requirement.-target_link_libraries(Utility_PublicHeader PUBLIC +target_link_libraries(Utility_PublicHeader + PUBLIC spdlog::spdlog crane_proto_lib fpm - HWLOC::hwloc + PRIVATE HWLOC::hwloc )
🧹 Nitpick comments (3)
CMakeLists.txt (1)
342-344: Make the hwloc dependency optional or version-gated
find_package(HWLOC REQUIRED)hard-fails the build on any environment that lacks the library (e.g. Mac CI runners, minimal containers).
Either (a) gate this behind anOPTION(CRANE_ENABLE_HWLOC … ON)or (b) specify a minimal version and fall back gracefully:option(CRANE_ENABLE_HWLOC "Enable hardware-topology support via hwloc" ON) if (CRANE_ENABLE_HWLOC) find_package(HWLOC 2.0 REQUIRED) else() message(WARNING "Building without hwloc – CPU-topology features disabled") endif()This keeps the build portable while still enforcing the dependency when the feature is requested.
src/Craned/CranedPublicDefs.h (1)
96-97: Naming nuance: drop the “OfThisNode” suffix
Config::CranedMetaalready represents this node.TopologyInfoOfThisNodeis therefore verbose and inconsistent withSysInfo.
Recommend:- TopologyInfo TopologyInfoOfThisNode; + TopologyInfo TopologyInfo;Keeps the API concise and self-explanatory.
src/CraneCtld/TaskScheduler.cpp (1)
2568-2572: Logging level & message are misleadingThe debug line is reached only when the node passes the check; however the text reads “is sufficient for the task” which is more informational than debug and will spam logs for every candidate node.
Consider either:
- Downgrading to
CRANE_TRACEor- Printing only when the node was rejected, including the requested / available cps values, to help users diagnose why a job is pending.
No diff provided – up to maintainer’s preference.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
CMakeLists.txt(1 hunks)CMakeModule/FindHWLOC.cmake(1 hunks)protos/PublicDefs.proto(4 hunks)src/CraneCtld/CtldPublicDefs.h(4 hunks)src/CraneCtld/DbClient.cpp(8 hunks)src/CraneCtld/RpcService/CranedKeeper.cpp(1 hunks)src/CraneCtld/TaskScheduler.cpp(2 hunks)src/Craned/Craned.cpp(1 hunks)src/Craned/CranedPublicDefs.h(1 hunks)src/Craned/CtldClient.cpp(1 hunks)src/Utilities/PublicHeader/CMakeLists.txt(1 hunks)src/Utilities/PublicHeader/OS.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/OS.h(3 hunks)
🧰 Additional context used
🧠 Learnings (9)
src/Craned/Craned.cpp (5)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/include/crane/CpuFrequency.h:57-67
Timestamp: 2025-06-30T08:57:12.915Z
Learning: In CraneSched's CpuFrequency class, the static member functions (ParseCpuList_, CpuFreqGetScalingFreq_, CpuFreqSetGov_, CpuFreqSetScalingFreq_, SetCpuOwnerLock_, TestCpuOwnerLock_, FdLockRetry_) are utility functions that only interact with the system via file I/O and locking without accessing instance members like m_cpu_freq_data_. These functions are appropriately declared as static.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/CpuFrequency.cpp:56-77
Timestamp: 2025-06-30T08:51:58.728Z
Learning: In CraneSched's CpuFrequency implementation, the scaling_available_governors file contains only one governor per CPU (not multiple space-separated governors as typically seen in standard Linux systems), so the current if-else parsing logic in CpuFrequency::Init() is correct for reading a single governor string.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
src/Utilities/PublicHeader/CMakeLists.txt (2)
Learnt from: RileyWen
PR: PKUHPC/CraneSched#456
File: src/CraneCtld/Tls/VaultClientWrapper.cpp:116-137
Timestamp: 2025-03-31T09:29:40.388Z
Learning: In CraneSched, `phmap::parallel_flat_hash_set` from the Parallel Hashmap library (`parallel_hashmap/phmap.h`) is used for thread-safe containers. This container implements internal sharding with separate locks for different parts of the hash table, making it inherently thread-safe for concurrent operations without requiring external synchronization.
Learnt from: huerni
PR: PKUHPC/CraneSched#469
File: src/Craned/TaskManager.cpp:0-0
Timestamp: 2025-05-08T07:38:42.362Z
Learning: In CraneSched's PMIx integration, the `g_pmix_server->SetupFork()` function must be called in the child process after fork() and before exec() to properly set up the PMIx environment variables.
src/Craned/CranedPublicDefs.h (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
src/Craned/CtldClient.cpp (3)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#475
File: src/Craned/CtldClient.cpp:0-0
Timestamp: 2025-04-27T06:47:51.751Z
Learning: In CtldClient, when a CranedRegister RPC fails, the state is reset to NOTIFY_SENDING. This ensures that token will be refreshed by NotifyCranedConnected_() before any subsequent CranedRegister_() call, preventing null token access issues.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/Craned/CranedServer.cpp:50-51
Timestamp: 2025-06-23T07:53:30.513Z
Learning: In the CraneSched codebase, `g_ctld_client` is a guaranteed global variable that is always initialized before any gRPC service methods are called, so null pointer checks are not necessary when calling methods on it.
src/CraneCtld/RpcService/CranedKeeper.cpp (10)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
PR: PKUHPC/CraneSched#469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Learnt from: huerni
PR: PKUHPC/CraneSched#469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#520
File: src/CraneCtld/CranedKeeper.cpp:416-417
Timestamp: 2025-06-07T10:47:59.071Z
Learning: In src/CraneCtld/CranedKeeper.h, the m_shutting_down_ member in CranedStub class is declared as std::atomic_bool, making it thread-safe for concurrent access without additional synchronization.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
Learnt from: 1daidai1
PR: PKUHPC/CraneSched#458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
src/Utilities/PublicHeader/include/crane/OS.h (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/include/crane/CpuFrequency.h:57-67
Timestamp: 2025-06-30T08:57:12.915Z
Learning: In CraneSched's CpuFrequency class, the static member functions (ParseCpuList_, CpuFreqGetScalingFreq_, CpuFreqSetGov_, CpuFreqSetScalingFreq_, SetCpuOwnerLock_, TestCpuOwnerLock_, FdLockRetry_) are utility functions that only interact with the system via file I/O and locking without accessing instance members like m_cpu_freq_data_. These functions are appropriately declared as static.
src/CraneCtld/DbClient.cpp (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#528
File: src/CraneCtld/DbClient.cpp:0-0
Timestamp: 2025-06-23T01:34:41.639Z
Learning: MemoryBytes() method returns uint64_t type in the CraneSched codebase, and when storing in MongoDB BSON documents, it should be cast to int64_t to maintain consistency with the existing schema and enable numeric database operations.
src/CraneCtld/TaskScheduler.cpp (15)
undefined
<retrieved_learning>
Learnt from: huerni
PR: #537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
</retrieved_learning>
<retrieved_learning>
Learnt from: huerni
PR: #537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/TaskScheduler.cpp:1713-1721
Timestamp: 2025-05-09T02:15:30.422Z
Learning: TaskScheduler中的资源回收设计:当Craned节点离线时,通过TaskStatusChangeAsync异步处理任务状态变更和资源回收。尽管可能存在短暂的资源更新延迟,但这是设计上可接受的权衡。
</retrieved_learning>
<retrieved_learning>
Learnt from: huerni
PR: #469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method after PMIx registration, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup resources. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly() before returning from the function.
</retrieved_learning>
<retrieved_learning>
Learnt from: huerni
PR: #469
File: src/Craned/TaskManager.cpp:1162-1175
Timestamp: 2025-05-08T07:39:29.207Z
Learning: In CraneSched, when a task fails in the LaunchTaskInstanceMt_ method, ActivateTaskStatusChangeAsync_ does not automatically release PMIx resources or cgroup. Explicit cleanup is required by calling g_pmix_server->DeregisterTask() and g_cg_mgr->ReleaseCgroupByTaskIdOnly().
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched includes assertions in its operator overloads (particularly in operator-=) that verify resources being subtracted are less than or equal to available resources, ensuring no negative values can occur during resource allocation or deallocation operations.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #351
File: src/CraneCtld/CranedMetaContainer.cpp:248-264
Timestamp: 2025-04-18T02:26:16.113Z
Learning: The resource class in CraneSched contains assertions in its overloaded operators (like -=) that prevent negative values from occurring when resources are allocated or deallocated.
</retrieved_learning>
<retrieved_learning>
Learnt from: 1daidai1
PR: #458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
</retrieved_learning>
<retrieved_learning>
Learnt from: 1daidai1
PR: #458
File: src/CraneCtld/CtldPublicDefs.h:0-0
Timestamp: 2025-05-23T02:32:43.952Z
Learning: In the CraneSched project, allocated_res_view is updated before calling SetAllocatedRes, so it doesn't need to be updated again within the method itself.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Supervisor/TaskManager.cpp:396-396
Timestamp: 2025-05-25T04:11:27.740Z
Learning: In TaskManager.cpp, GetCrunMeta() calls don't need null checks because they're only called in contexts where the task is guaranteed to be a CRUN task (e.g., SetupChildProcessCrunX11_ is only called when step->IsCrun() && x11() conditions are met), ensuring the metadata will always be valid.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Craned/SupervisorKeeper.h:32-32
Timestamp: 2025-05-26T11:04:30.580Z
Learning: The Supervisor component in the Crane system is designed to manage only one task per instance. The task specification is provided to the Supervisor during startup by Craned (likely through InitSupervisorRequest), so the ExecuteTask() method doesn't need to accept task parameters since the Supervisor already knows which task to execute.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the execution_node field in JobToD is intentionally set to the first element of executing_craned_ids vector without guards, as it represents the main execution node for a job. This is by design and assumes executing_craned_ids is never empty when GetJobOfNode is called.
</retrieved_learning>
<retrieved_learning>
Learnt from: L-Xiafeng
PR: #399
File: src/Craned/Craned/SupervisorKeeper.cpp:174-178
Timestamp: 2025-05-26T11:00:54.563Z
Learning: The CraneSched project uses C++23 standard, allowing the use of modern C++ features like std::ranges::to and other C++23 language features and library components.
</retrieved_learning>
src/CraneCtld/CtldPublicDefs.h (2)
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
🧬 Code Graph Analysis (3)
src/Craned/Craned.cpp (1)
src/Utilities/PublicHeader/OS.cpp (2)
GetCpuTopologyInfo(208-226)GetCpuTopologyInfo(208-208)
src/Craned/CtldClient.cpp (1)
src/Craned/CranedPublicDefs.h (1)
g_config(108-108)
src/Utilities/PublicHeader/include/crane/OS.h (1)
src/Utilities/PublicHeader/OS.cpp (2)
GetCpuTopologyInfo(208-226)GetCpuTopologyInfo(208-208)
🔇 Additional comments (25)
src/CraneCtld/RpcService/CranedKeeper.cpp (1)
298-301: Zero-initialisedcores_per_socketmay mislead craned
TaskToCtld().cores_per_socket()is 0 unless explicitly set during submission; propagating a 0 can be interpreted as “no constraint”.
Consider sending the value only when > 0, or defaulting to the target node’s physical value.if (task->TaskToCtld().cores_per_socket() > 0) mutable_task->set_cores_per_socket(task->TaskToCtld().cores_per_socket());Prevents ambiguity between “unset” and “zero”.
src/Craned/CtldClient.cpp (1)
401-405: LGTM! Topology information integration is well-implemented.The CPU topology information is properly extracted from the global configuration and populated into the gRPC registration request metadata. The implementation follows the established pattern used for system release information and maintains consistency with the protobuf schema changes.
src/Utilities/PublicHeader/OS.cpp (1)
208-226: LGTM! Robust hwloc integration with proper error handling.The implementation correctly uses the hwloc library to gather CPU topology information. Key strengths:
- Proper hwloc initialization and loading with error checks
- Division by zero protection in calculations (lines 220-222)
- Consistent cleanup of hwloc topology object in both success and failure paths
- Appropriate return value indicating success/failure
The function follows hwloc best practices and provides reliable topology detection.
protos/PublicDefs.proto (4)
584-588: Well-designed TopologyInfo message structure.The new
TopologyInfomessage appropriately groups related CPU topology fields with logical field numbering and suitableuint32types for hardware counts.
168-168: Appropriate integration of cores_per_socket in TaskToCtld.The
cores_per_socketfield is properly added to the task submission metadata with a unique field number (38) and appropriateuint32type.
243-243: Consistent cores_per_socket field in TaskToD.The
cores_per_socketfield maintains consistency between task submission and execution messages with proper field numbering (25).
599-599: Logical integration of topology_info in CranedRemoteMeta.The
TopologyInfofield is appropriately included in the node metadata message, enabling topology information propagation during node registration.src/Utilities/PublicHeader/include/crane/OS.h (3)
22-22: Appropriate hwloc header inclusion.The
hwloc.hinclude is necessary for the hwloc types used in the CPU topology detection implementation.
38-42: Well-structured TopologyInfo definition.The
TopologyInfostruct appropriately groups CPU topology fields withuint32_ttypes, maintaining consistency with the protobuf schema and implementation requirements.
77-77: Consistent function declaration.The
GetCpuTopologyInfofunction declaration matches the implementation inOS.cppand follows the established pattern for utility functions in this namespace.CMakeModule/FindHWLOC.cmake (1)
1-42: Excellent CMake find module following modern practices.This find module demonstrates excellent CMake practices:
- Uses
pkg-configfor initial discovery, appropriate for hwloc- Includes proper fallback search mechanisms
- Follows standard CMake conventions with
find_package_handle_standard_args- Creates a modern imported target
HWLOC::hwlocwith appropriate interface properties- Provides helpful status messages for debugging
The implementation is robust and should reliably locate hwloc across different systems.
src/CraneCtld/CtldPublicDefs.h (4)
206-206: LGTM: Topology info field addition is well-integrated.The addition of the
topology_infofield toCranedRemoteMetafollows the existing pattern and integrates well with the struct's design.
225-229: LGTM: Constructor correctly populates topology information.The constructor properly extracts and assigns the topology fields from the gRPC message. The field assignments are consistent with the protobuf schema and follow the existing initialization pattern.
406-406: LGTM: Field addition follows consistent naming and type conventions.The
cores_per_socketfield uses the appropriateuint32_ttype and follows the existing naming convention in the struct.
626-626: LGTM: Field assignment is correctly integrated.The assignment of
cores_per_socketfrom the protobuf message is properly placed and follows the existing pattern inSetFieldsByTaskToCtld.src/CraneCtld/DbClient.cpp (10)
942-942: LGTM: Comment updated to reflect new field count.The comment correctly reflects the addition of the
cores_per_socketfield at index 37.
945-945: LGTM: Field array size correctly updated.The array size is properly increased from 37 to 38 to accommodate the new
cores_per_socketfield.
961-961: LGTM: Field name added in correct position.The
"cores_per_socket"field is properly added at the end of the fields array, maintaining consistency with the indexing pattern.
972-972: LGTM: Tuple type correctly extended.The tuple type signature is properly updated to include
int64_tfor the new field, maintaining type consistency.
1006-1007: LGTM: Value extraction and casting follows established pattern.The
cores_per_socketvalue is correctly extracted from the protobuf message and cast toint64_tfor MongoDB storage, following the learned pattern for numeric fields in BSON documents.
1033-1033: LGTM: Comment correctly updated for TaskInCtldToDocument_.The comment properly reflects the new field structure for the second document conversion function.
1036-1036: LGTM: Field array size consistently updated.The array size is correctly maintained at 38 fields in both document conversion functions.
1052-1052: LGTM: Field name consistently added.The
"cores_per_socket"field is properly positioned at the end of the fields array in both functions.
1063-1063: LGTM: Tuple type signature consistently updated.The tuple type is properly extended with
int64_tin both document conversion functions.
1089-1089: LGTM: Task field access and casting is correct.The
cores_per_socketfield is correctly accessed from theTaskInCtldobject and cast toint64_tfor MongoDB storage, maintaining consistency with the first conversion function.
29a5cd1 to
4f26b62
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
etc/config.yaml(1 hunks)src/CraneCtld/CraneCtld.cpp(1 hunks)src/CraneCtld/CranedMetaContainer.cpp(2 hunks)src/CraneCtld/CtldPublicDefs.h(6 hunks)src/CraneCtld/DbClient.cpp(8 hunks)src/CraneCtld/TaskScheduler.cpp(3 hunks)src/Craned/Craned.cpp(1 hunks)src/Utilities/PublicHeader/OS.cpp(1 hunks)src/Utilities/PublicHeader/include/crane/OS.h(3 hunks)
✅ Files skipped from review due to trivial changes (1)
- etc/config.yaml
🚧 Files skipped from review as they are similar to previous changes (6)
- src/Craned/Craned.cpp
- src/Utilities/PublicHeader/include/crane/OS.h
- src/CraneCtld/TaskScheduler.cpp
- src/Utilities/PublicHeader/OS.cpp
- src/CraneCtld/CtldPublicDefs.h
- src/CraneCtld/DbClient.cpp
🧰 Additional context used
🧠 Learnings (2)
src/CraneCtld/CraneCtld.cpp (1)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/CpuFrequency.cpp:56-77
Timestamp: 2025-06-30T08:51:58.728Z
Learning: In CraneSched's CpuFrequency implementation, the scaling_available_governors file contains only one governor per CPU (not multiple space-separated governors as typically seen in standard Linux systems), so the current if-else parsing logic in CpuFrequency::Init() is correct for reading a single governor string.
src/CraneCtld/CranedMetaContainer.cpp (8)
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Craned/TaskManager.cpp:145-147
Timestamp: 2025-06-30T08:43:44.470Z
Learning: In the CraneSched codebase, src/Craned/Craned.cpp guarantees that g_config.CranedRes.contains(g_config.CranedIdOfThisNode) through explicit validation during startup. The code checks if g_config.CranedRes.contains(g_config.Hostname) and exits the process if not found, then sets g_config.CranedIdOfThisNode = g_config.Hostname. TaskManager constructor is called after this validation, so g_config.CranedRes[g_config.CranedIdOfThisNode] is guaranteed to be valid.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/CraneCtld/CtldGrpcServer.cpp:1662-1669
Timestamp: 2025-06-30T08:38:18.748Z
Learning: In CraneSched's CPU frequency feature, validation is split between controller and compute nodes: CtldGrpcServer only validates governor against configuration (g_config.CpuFreqGovernors), while detailed validation of low/high frequency values is deferred to TaskManager.cpp via CpuFreqValidateAndSet() where actual hardware capabilities can be checked.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CranedKeeper.cpp:51-53
Timestamp: 2025-05-09T01:54:21.256Z
Learning: The ConfigureCraned function in src/CraneCtld/CranedKeeper.cpp is called from a thread pool, so there's no need to worry about it blocking the gRPC completion queue thread.
Learnt from: huerni
PR: PKUHPC/CraneSched#537
File: src/Utilities/PublicHeader/include/crane/CpuFrequency.h:57-67
Timestamp: 2025-06-30T08:57:12.915Z
Learning: In CraneSched's CpuFrequency class, the static member functions (ParseCpuList_, CpuFreqGetScalingFreq_, CpuFreqSetGov_, CpuFreqSetScalingFreq_, SetCpuOwnerLock_, TestCpuOwnerLock_, FdLockRetry_) are utility functions that only interact with the system via file I/O and locking without accessing instance members like m_cpu_freq_data_. These functions are appropriately declared as static.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#399
File: src/Craned/Supervisor/TaskManager.cpp:220-220
Timestamp: 2025-05-25T04:11:50.268Z
Learning: In TaskManager.cpp, when step->IsCrun() is checked before dynamic_cast to CrunMetaInExecution*, the cast is guaranteed to succeed due to the program logic ensuring the correct meta type is used for Crun tasks. Null checks for these dynamic_cast operations are unnecessary in this context.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#475
File: src/CraneCtld/CranedKeeper.cpp:40-62
Timestamp: 2025-04-27T11:52:31.017Z
Learning: In the CraneSched system, retry of configuration RPC is architecturally driven by the Craned's notification system rather than explicit retry code within the ConfigureCraned method. When Configure RPC fails, Craned returns to a notification state and sends new Notify messages which trigger new configuration attempts.
Learnt from: L-Xiafeng
PR: PKUHPC/CraneSched#496
File: src/CraneCtld/CtldPublicDefs.h:756-763
Timestamp: 2025-05-09T01:56:53.142Z
Learning: In the CraneSched codebase, the `execution_node` field in JobToD is intentionally set to the first element of `executing_craned_ids` vector without guards, as it represents the main execution node for a job. This is by design and assumes `executing_craned_ids` is never empty when `GetJobOfNode` is called.
Learnt from: RileyWen
PR: PKUHPC/CraneSched#475
File: src/Craned/CranedServer.h:0-0
Timestamp: 2025-05-02T07:12:46.896Z
Learning: The initialization of `m_supervisor_recovered_` to `true` in the CranedServer class is intentional despite the comment saying "When supervisor ready, init with false". This is temporary until the supervisor functionality is fully implemented.
🔇 Additional comments (1)
src/CraneCtld/CranedMetaContainer.cpp (1)
313-314: LGTM! Consistent configuration initialization.The initialization of
cores_per_socketfrom the node configuration follows the established pattern used for other topology fields in this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds CPU topology detection using hwloc and integrates this information into node registration, metadata, and scheduling.
- Introduces a new
TopologyInfostruct andGetCpuTopologyInfoAPI to gather sockets, cores per socket, and threads per core. - Propagates topology info through gRPC definitions, configuration, database schema, and the
CtldClientregistration flow. - Updates the scheduler (
TaskScheduler) and metadata containers (CranedMetaContainer) to enforce and warn on cores-per-socket constraints.
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/Utilities/PublicHeader/include/crane/OS.h | Added TopologyInfo and GetCpuTopologyInfo declaration |
| src/Utilities/PublicHeader/OS.cpp | Implemented CPU topology detection with hwloc |
| src/Craned/CtldClient.cpp | Included topology info in CranedRegister_ gRPC metadata |
| src/CraneCtld/TaskScheduler.cpp | Enforced cores-per-socket constraint in scheduling loops |
| src/CraneCtld/CranedMetaContainer.cpp | Initialized and warned about configured vs. actual cores/socket |
| protos/PublicDefs.proto | Added TopologyInfo message and new field in task messages |
Comments suppressed due to low confidence (1)
src/Utilities/PublicHeader/OS.cpp:208
- Consider adding unit tests for
GetCpuTopologyInfoto verify correct detection of socket_count, cores_per_socket, and threads_per_core under various hardware topologies.
bool GetCpuTopologyInfo(TopologyInfo* info) {
| hwloc_topology_t topology; | ||
| if (hwloc_topology_init(&topology) != 0) return false; | ||
| if (hwloc_topology_load(topology) != 0) { | ||
| hwloc_topology_destroy(topology); | ||
| return false; | ||
| } |
Copilot
AI
Jul 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initializing and loading the hwloc topology on each call can be expensive. Consider caching the topology or using a singleton to avoid repeated initialization for multiple calls.
| hwloc_topology_t topology; | |
| if (hwloc_topology_init(&topology) != 0) return false; | |
| if (hwloc_topology_load(topology) != 0) { | |
| hwloc_topology_destroy(topology); | |
| return false; | |
| } | |
| static hwloc_topology_t topology = []() { | |
| hwloc_topology_t t; | |
| if (hwloc_topology_init(&t) != 0) { | |
| CRANE_ERROR("Failed to initialize hwloc topology."); | |
| std::terminate(); | |
| } | |
| if (hwloc_topology_load(t) != 0) { | |
| CRANE_ERROR("Failed to load hwloc topology."); | |
| hwloc_topology_destroy(t); | |
| std::terminate(); | |
| } | |
| return t; | |
| }(); |
|
为了CorePerSocket加这么多功能目前好像不是特别急需?目前不处理吧 |
Summary by CodeRabbit
New Features
Bug Fixes
Chores