[pull] main from NVIDIA:main by pull[bot] · Pull Request #545 · phu0ngng/TransformerEngine

pull · 2026-04-03T22:32:04Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)


#2820)

* Compute swizzle_idx once per thread and pass into ComputeKernel.
  Signed-off-by: Cael Ling <caell@nvidia.com>
* one __syncthreads per stage in GroupHadamardAmaxTmaKernel
  Signed-off-by: Cael Ling <caell@nvidia.com>
* streamline group Hadamard ComputeKernel loads
  Signed-off-by: Cael Ling <caell@nvidia.com>
* streamline group Hadamard ComputeKernel loads
  Signed-off-by: Cael Ling <caell@nvidia.com>
* streamline group Hadamard ComputeKernel loads
  Signed-off-by: Cael Ling <caell@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* one __syncthreads per stage in GroupHadamardAmaxTmaKernel
  Signed-off-by: Cael Ling <caell@nvidia.com>
  Made-with: Cursor
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Compute swizzle_idx once per thread and pass into ComputeKernel.
  Signed-off-by: Cael Ling <caell@nvidia.com>
* Fix kReturnIdentityAmax path
  Signed-off-by: Cael Ling <caell@nvidia.com>
* Fix kReturnIdentityAmax path
  Signed-off-by: Cael Ling <caell@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Apply the change to other variants
  Signed-off-by: Cael Ling <caell@nvidia.com>
* Refactor the change to other variants
  Signed-off-by: Cael Ling <caell@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Refactor the change to other variants
  Signed-off-by: Cael Ling <caell@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Refactor the ldmatrix logics
  Signed-off-by: Cael Ling <caell@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------
Signed-off-by: Cael Ling <caell@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…ter CI error re… (#2802)

* Capture subprocess stderr in distributed tests for better CI error reporting

  Distributed tests launch subprocesses via torch.distributed.launch/torchrun. When these fail, pytest only captures the CalledProcessError from the parent process, not the actual worker traceback. This makes CI JUnit XML reports show "exit code 1" with no useful error detail.

  Add run_distributed() utility to tests/pytorch/utils.py that captures stderr while letting stdout stream to the terminal. On failure, the worker's stderr (containing the actual Python traceback) is included in the AssertionError, which pytest writes into the JUnit XML report.

  Behavior:
  - Interactive use: stdout streams in real time (unchanged), stderr shown on failure
  - CI/JUnit XML: failure reports now include the actual worker traceback

  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Add JUnit XML output to ctest in L0_cppunittest

  Add --output-junit flag so ctest writes JUnit XML to /logs/, matching the pattern used by pytest tests. The XML is written before ctest exits, so it's captured even on test failure.

  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

---------
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
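The stderr-capturing behavior described in this commit can be sketched as follows. This is a minimal illustration, not the code added to tests/pytorch/utils.py: only the name `run_distributed` comes from the commit message, while the signature, the keyword pass-through, and the error message format are assumptions.

```python
import subprocess
import sys


def run_distributed(cmd, **kwargs):
    """Run a launcher command; stream stdout, capture stderr.

    Hypothetical signature: the real helper may take different arguments.
    """
    result = subprocess.run(
        cmd,
        stdout=None,             # inherit the parent's stdout so output streams live
        stderr=subprocess.PIPE,  # capture stderr so the worker traceback is available
        text=True,
        **kwargs,
    )
    if result.returncode != 0:
        # Raising AssertionError puts the worker's stderr into the pytest
        # failure message, and therefore into the JUnit XML report.
        raise AssertionError(
            f"Command exited with code {result.returncode}: {cmd}\n"
            f"--- worker stderr ---\n{result.stderr}"
        )
    return result


if __name__ == "__main__":
    # Stand-in for a failing torchrun worker (assumption: any command works here).
    failing = [sys.executable, "-c", "raise RuntimeError('worker failed')"]
    try:
        run_distributed(failing)
    except AssertionError as err:
        print(err)
```

Capturing only stderr keeps interactive runs unchanged, since stdout still streams in real time, while a failure report carries the traceback instead of a bare exit code.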

* Add unswizzling functions for scaling factors in swizzle module

  - Introduced `nvte_unswizzle_scaling_factors` to convert swizzled scaling factors back to row-major format.
  - Implemented `regs_unshuffle_with_bit_shifts` and `regs_unshuffle` for unshuffling operations in CUDA kernels.
  - Added `unswizzle_row_scaling_kernel_impl` and `unswizzle_col_scaling_kernel_impl` for handling unswizzling in row and column scaling respectively.

  These changes enhance the functionality of the swizzle module, enabling better handling of scaling factors in tensor operations.

  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* Add swizzle/unswizzle roundtrip test for scaling factors

  These tests cover the changes introduced for unswizzling.

  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* Added another unswizzling functionality test for scaling factors

  - Introduced `compute_ref_unswizzle` to handle the conversion of swizzled scaling factors back to their original format.
  - Added `performTestUnswizzle1D` to validate the unswizzling process with various scaling modes.
  - Created `UnswizzleTestSuite` for comprehensive testing of unswizzling operations.

  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* Moved swizzle_row_scaling_kernel implementation to its original place

  - Moved the definition of `swizzle_row_scaling_kernel` to a new location for better organization.
  - Ensured the kernel implementation is now properly defined and accessible for scaling operations in the swizzle module.

  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* Add multi-tensor unswizzling functions for scaling factors

  - Introduced `multi_tensor_unswizzle_scaling_factors` to convert swizzled scaling factors back to their original row-major format.
  - Implemented CUDA kernels for unswizzling in both row and column scaling, enhancing the swizzle module's functionality.
  - Updated the launch function to handle multiple tensor unswizzling operations efficiently.

  These changes improve the handling of scaling factors in tensor operations, ensuring better performance and organization within the swizzle module.

  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

* Added greptile suggestions
  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

* Removed unused check from tests and reading input directly as const rather than casting
  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

* Refactor unswizzling functions and update test cases for scaling factors

  - Updated unswizzling kernel implementations to remove original_M and original_K parameters, simplifying the function signatures.
  - Enhanced test suite to utilize new unswizzling data shapes, ensuring comprehensive coverage of aligned and padded cases.

  These changes improve the clarity and efficiency of the unswizzling process in the swizzle module.

  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

* Refactor unswizzling scaling factors to use a launch function
  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* Change unswizzling to use output as gt.
  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* Refactor unswizzling scaling factors to improve input validation and streamline processing.

  Need to check whether rowwise and columnwise can both be true. If so, the if/else needs to account for that.

  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* Fix multi_tensor_unswizzle_scaling_factors to correctly reference output tensors for scaling mode and data validation.

  Updated checks for input and output tensor shapes to ensure proper handling of row-wise and column-wise scaling factors.

  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* Enhance swizzle tests and unswizzling validation
  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* Fix typos and update validation checks in swizzle.cu
  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* Update validation checks in multi_tensor_unswizzle_scaling_factors to use input numel
  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* Typo
  Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------
Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemek Tredak <ptredak@nvidia.com>
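The swizzle/unswizzle roundtrip property that these tests check can be illustrated with a small host-side sketch. Everything here is an assumption made for illustration: the 128x4 tile shape, the offset formula inside a tile, and the function names `swizzle` and `unswizzle` are not taken from swizzle.cu or from `nvte_unswizzle_scaling_factors`. The point is only that unswizzling is the inverse permutation of swizzling, so a roundtrip recovers the row-major input.

```python
# Assumed tile geometry for this sketch (not read from the TransformerEngine source).
TILE_ROWS, TILE_COLS = 128, 4
TILE_SIZE = TILE_ROWS * TILE_COLS


def _swizzled_offset(row, col, cols):
    """Map a row-major (row, col) to its position in the assumed swizzled layout."""
    tiles_per_row = cols // TILE_COLS
    tile = (row // TILE_ROWS) * tiles_per_row + (col // TILE_COLS)
    r, c = row % TILE_ROWS, col % TILE_COLS
    # Within a tile, rows are interleaved in groups of 32 (assumed pattern).
    return tile * TILE_SIZE + (r % 32) * 16 + (r // 32) * 4 + c


def _check_shape(values, rows, cols):
    assert rows % TILE_ROWS == 0 and cols % TILE_COLS == 0, "shape must be tile-aligned"
    assert len(values) == rows * cols, "buffer size must match rows * cols"


def swizzle(flat, rows, cols):
    """Permute a row-major list of scaling factors into the swizzled layout."""
    _check_shape(flat, rows, cols)
    out = [None] * len(flat)
    for row in range(rows):
        for col in range(cols):
            out[_swizzled_offset(row, col, cols)] = flat[row * cols + col]
    return out


def unswizzle(swizzled, rows, cols):
    """Invert swizzle(): gather each element back to its row-major position."""
    _check_shape(swizzled, rows, cols)
    out = [None] * len(swizzled)
    for row in range(rows):
        for col in range(cols):
            out[row * cols + col] = swizzled[_swizzled_offset(row, col, cols)]
    return out


if __name__ == "__main__":
    rows, cols = 256, 8
    data = list(range(rows * cols))
    assert unswizzle(swizzle(data, rows, cols), rows, cols) == data
    print("roundtrip ok")
```

Both directions share one offset function, so the inverse cannot drift out of sync with the forward mapping. Padded (non-aligned) shapes, which the commit's tests also cover, are left out of this sketch.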

Signed-off-by: Gaetan Lepage <gaetan@glepage.com>

cael-ling and others added 4 commits April 3, 2026 11:32

Fix nvshmem build (#2815)

e83c097

Signed-off-by: Gaetan Lepage <gaetan@glepage.com>

pull bot locked and limited conversation to collaborators Apr 3, 2026

pull bot added the ⤵️ pull label Apr 3, 2026

pull bot merged commit e83c097 into phu0ngng:main Apr 3, 2026

pull bot had a problem deploying to github-pages April 3, 2026 22:33 Failure

pull[bot] merged 4 commits into phu0ngng:main from NVIDIA:main


4 participants
