@shenxiaochen shenxiaochen commented Dec 30, 2025

Fix SNC state detection on non-SNC-capable platforms

Description

On platforms that do not support Sub-NUMA Clustering (SNC) monitoring (e.g., AMD and Hygon), the SNC state detection logic in hw_cap_mon_snc_state() attempts to read the SNC configuration MSR PQOS_MSR_SNC_CFG (0xCA0) whenever the CPU topology reports a different number of NUMA nodes than sockets. That MSR is not available on these platforms, so the command "pqos --iface=msr" fails with the error:

ERROR: RDMSR failed for reg[0xca0] on lcore 0
ERROR: Error reading SNC information!
ERROR: Error encounter in monitoring discovery!
ERROR: discover_capabilities() error 1
Error initializing PQoS library!

Fix the issue by ensuring hw_cap_mon_snc_state() returns early on non-SNC-capable platforms.
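
The early-return idea can be sketched in C as follows. This is only an illustration, not the code from this PR: every `sketch_*` / `SKETCH_*` name is hypothetical, the MSR read is a stand-in that simply fails on non-Intel vendors, the register decoding is simplified, and using the vendor as the SNC-capability test is an assumption.

```c
/* Hypothetical vendor IDs for this sketch; the library's real enum may differ. */
enum sketch_vendor {
        SKETCH_VENDOR_INTEL,
        SKETCH_VENDOR_AMD,
        SKETCH_VENDOR_HYGON
};

enum sketch_snc_state { SKETCH_SNC_DISABLED, SKETCH_SNC_ENABLED };

#define SKETCH_MSR_SNC_CFG 0xCA0u

/* Counts MSR reads so that the early return can be observed. */
static unsigned sketch_msr_reads;

/* Stand-in for an MSR read: fails on every vendor except Intel. */
static int
sketch_msr_read(enum sketch_vendor vendor, unsigned reg,
                unsigned long long *value)
{
        (void)reg;
        sketch_msr_reads++;
        if (vendor != SKETCH_VENDOR_INTEL)
                return -1;
        *value = 0;
        return 0;
}

/* Returns 0 on success and -1 if the MSR read fails. */
static int
sketch_snc_state(enum sketch_vendor vendor, unsigned num_numa,
                 unsigned num_sockets, enum sketch_snc_state *state)
{
        unsigned long long cfg = 0;

        *state = SKETCH_SNC_DISABLED;

        /* Early return (assumption: only Intel is SNC-capable here). */
        if (vendor != SKETCH_VENDOR_INTEL)
                return 0;

        /* Equal node and socket counts: nothing to probe. */
        if (num_numa == num_sockets)
                return 0;

        if (sketch_msr_read(vendor, SKETCH_MSR_SNC_CFG, &cfg) != 0)
                return -1;

        /* Register decoding is deliberately simplified in this sketch. */
        (void)cfg;
        *state = SKETCH_SNC_ENABLED;
        return 0;
}
```

With this shape, a call such as `sketch_snc_state(SKETCH_VENDOR_HYGON, 8, 2, &state)` succeeds and never touches the MSR, even though the node and socket counts differ.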

Affected parts

  • library
  • pqos utility
  • rdtset utility
  • other: (please specify)

Motivation and Context

The command "pqos --iface=msr" fails on non-SNC-capable platforms during SNC state detection with the error:

ERROR: RDMSR failed for reg[0xca0] on lcore 0
ERROR: Error reading SNC information!
ERROR: Error encounter in monitoring discovery!
ERROR: discover_capabilities() error 1
Error initializing PQoS library!

This change makes hw_cap_mon_snc_state() return early on non-SNC-capable platforms, so the unavailable MSR is never read.

How Has This Been Tested?

(1) Ran "pqos --iface=msr" on non-SNC-capable platforms (e.g., AMD or Hygon); the error described above no longer occurs.
(2) Passed all tests in intel-cmt-cat/unit-test.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

From the Intel RDT spec[1] and the AMD Platform QoS spec[2]:
If the CPU platform supports CPUID.0FH.01H:EAX, CPUID.0FH.01H:EAX[7:0]
returns the MBM counter length (width) as an offset from 24 bits.

However, hw_cap_mon_discover() calculates the MBM counter length with an
incorrect 7-bit bitmask (0x7f).

Fix the issue by using an 8-bit bitmask (0xff) for the MBM counter length.
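
The effect of the mask can be illustrated with a small C sketch. It is not the library code: `sketch_mbm_width` and `SKETCH_MBM_BASE_WIDTH` are hypothetical names, and the EAX value is passed in as a plain integer so that no CPUID access is needed. The two masks agree for offsets below 128 and differ only when bit 7 of EAX is set.

```c
#include <stdint.h>

/* Base MBM counter width in bits; EAX[7:0] is an offset from this value. */
#define SKETCH_MBM_BASE_WIDTH 24u

/*
 * eax:  value of CPUID.0FH.01H:EAX, supplied by the caller.
 * mask: bitmask applied to extract the offset (0x7f before, 0xff after).
 */
static unsigned
sketch_mbm_width(uint32_t eax, uint32_t mask)
{
        return SKETCH_MBM_BASE_WIDTH + (unsigned)(eax & mask);
}
```

For example, an offset of 8 yields a 32-bit counter with either mask, whereas a made-up EAX value of 0x88 is truncated to an offset of 8 by the 7-bit mask but kept as 136 by the 8-bit mask.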

[1] Intel Architectures SDM, Vol.3B, 19.18 Intel RDT Monitoring:
https://cdrdv2.intel.com/v1/dl/getContent/671200

[2] AMD Platform QoS Extensions, Rev 1.03:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/other/56375_1_03_PUB.pdf

Fixes: 050f8c6 ("lib: detect MBM counter length")
Signed-off-by: Xiaochen Shen <[email protected]>

Hygon CPUs support Platform QoS features (PQoS Version V1.0) described
in the AMD Platform QoS specification[1].

The following Platform QoS sub-features are available on Hygon CPUs if
the underlying hardware supports them:
 - L3 Cache Occupancy Monitoring (CMT)
 - L3 External Memory Bandwidth Monitoring (MBM)
 - L3 Cache Allocation Enforcement (CAT)
 - Code and Data Prioritization (CDP)
 - Memory Bandwidth Enforcement/Allocation (MBA)

[1] AMD Platform QoS Extensions, Rev 1.03:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/other/56375_1_03_PUB.pdf

Signed-off-by: Xiaochen Shen <[email protected]>

Add PQOS_VENDOR id Python interface for Hygon Platform QoS features.

Signed-off-by: Xiaochen Shen <[email protected]>

The default base MBM counter length (width) is 24 bits. Hygon CPUs do
not currently support CPUID 0xF.[ECX=1]:EAX for adjusting the counter
length, but they do provide a wider counter with a fixed width of
32 bits.

Set the default MBM counter length to 32 bits for Hygon by setting the
offset to 8 bits.

Future Hygon products will implement CPUID 0xF.[ECX=1]:EAX.
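
The arithmetic behind the fixed width can be sketched in C as follows. This is an illustration only, not the library code: the `mbm_sketch_*` and `MBM_SKETCH_*` names are hypothetical, and the sketch covers just the default case in which the counter length is not read from CPUID.

```c
/* Hypothetical vendor IDs for this sketch; the library's real enum may differ. */
enum mbm_sketch_vendor {
        MBM_SKETCH_VENDOR_INTEL,
        MBM_SKETCH_VENDOR_AMD,
        MBM_SKETCH_VENDOR_HYGON
};

/* Default base MBM counter width in bits. */
#define MBM_SKETCH_BASE_WIDTH 24u

/* Fixed offset assumed for Hygon: 24 + 8 = 32 bits. */
#define MBM_SKETCH_HYGON_OFFSET 8u

/* Default counter width when CPUID does not report an offset. */
static unsigned
mbm_sketch_default_width(enum mbm_sketch_vendor vendor)
{
        unsigned offset = 0;

        if (vendor == MBM_SKETCH_VENDOR_HYGON)
                offset = MBM_SKETCH_HYGON_OFFSET;

        return MBM_SKETCH_BASE_WIDTH + offset;
}
```

Expressing the Hygon width as an offset from the 24-bit base keeps it in the same form as a width derived from CPUID 0xF.[ECX=1]:EAX.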

Signed-off-by: Xiaochen Shen <[email protected]>

On platforms that do not support Sub-NUMA Clustering (SNC) monitoring
(e.g., AMD and Hygon), the SNC state detection logic in
hw_cap_mon_snc_state() attempts to read the SNC configuration MSR
PQOS_MSR_SNC_CFG (0xCA0) whenever the CPU topology reports a different
number of NUMA nodes than sockets. That MSR is not available on these
platforms, so the command "pqos --iface=msr" fails with the error:

  "ERROR: RDMSR failed for reg[0xca0] on lcore 0"
  "ERROR: Error reading SNC information!"
  "ERROR: Error encounter in monitoring discovery!"
  "ERROR: discover_capabilities() error 1"
  "Error initializing PQoS library!"

Fix the issue by ensuring hw_cap_mon_snc_state() returns early on
non-SNC-capable platforms.

Fixes: bfc7c70 ("SNC is added")
Signed-off-by: Xiaochen Shen <[email protected]>

Note: This PR is based on top of #300 and #299.

[This PR - #304]
e07514a lib: Fix SNC state detection on non-SNC-capable platforms

[PR #300]
b981133 lib: Set fixed MBM counter length for Hygon
90c603e lib/python: Add support for Hygon Platform QoS features
c5cc545 lib: Add support for Hygon Platform QoS features

[PR #299]
6a4d764 lib: Fix incorrect bitmask for MBM counter length

Best regards,
Xiaochen
