Single __syncthreads per stage in GroupHadamardAmaxTmaKernel #2809
cael-ling wants to merge 3 commits into NVIDIA:main from
Conversation
Greptile Summary: This PR tightens the synchronization in three TMA-based Hadamard kernel variants.
Confidence Score: 5/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant TMA as TMA Engine
    participant SM as Shared Memory (ping-pong)
    participant TC as Thread Block (Tensor Core)
    Note over TMA,TC: Outer loop: stage N
    TMA->>SM: copy_2d_to_shared(in_shs[next%2]) — prefetch stage N+1
    TC->>SM: mbarrier_wait_parity(mbar[N]) — wait for stage N data
    loop compute_stage_y × compute_stage_x (all read-only)
        TC->>SM: ldmatrix from in_shs[N%2] (READ ONLY)
        TC->>TC: MMA + amax accumulation into registers
    end
    Note over TC: __syncthreads() — single barrier per stage (NEW)
    Note over TC: fence_proxy_async_shared_cta() — generic SM visible before next TMA
    Note over TMA,TC: Outer loop: stage N+1
    TMA->>SM: copy_2d_to_shared(in_shs[(N+2)%2]) — prefetch stage N+2<br/>(safe: buffer was released by __syncthreads above)
    TC->>SM: mbarrier_wait_parity(mbar[N+1])
```
Reviews (3): Last reviewed commit: "Apply the change to other variants"
Signed-off-by: Cael Ling <caell@nvidia.com> Made-with: Cursor
1657a91 to 9822e07
LGTM, can you introduce this change to other similar variants of the kernel? https://github.com/NVIDIA/TransformerEngine/tree/main/transformer_engine/common/hadamard_transform
Signed-off-by: Cael Ling <caell@nvidia.com>
The change has been applied to the other variants: group_hadamard_transform.cu, hadamard_transform.cu, and graph_safe_group_hadamard_transform.cu.
Description
In the TMA + compute pipeline of GroupHadamardAmaxTmaKernel, __syncthreads() was previously invoked inside the inner compute_stage_y loop. Within a stage, however, all threads only read shared memory on the Tensor Core path, and warps access disjoint tiled regions. The only block-wide ordering actually required is that every thread has finished all generic shared-memory reads of the stage's ping-pong buffer before a later iteration issues a TMA copy that reuses that buffer.
The barrier is therefore hoisted out of the inner nest: after all ComputeKernel work for the current stage (the full compute_stage_y × compute_stage_x nest), the kernel runs one __syncthreads() followed by ptx::fence_proxy_async_shared_cta(), establishing generic shared-memory visibility before the next async TMA can overwrite the buffer. This reduces the barrier count per stage, cuts barrier overhead, and preserves the intended "no reader left behind" guarantee before buffer reuse.
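The new barrier placement can be sketched as follows. This is a hypothetical outline, not the actual kernel: identifiers such as in_shs, mbar, compute_stage_y/x, and ComputeKernel follow the names used in this PR, while the loop scaffolding and arguments are illustrative assumptions.

```cuda
// Sketch of one pipeline iteration after this PR (assumed structure).
for (int stage = 0; stage < num_stages; ++stage) {
  // Prefetch the next stage into the other ping-pong buffer; in the real
  // kernel this TMA copy is issued asynchronously.
  // copy_2d_to_shared(in_shs[(stage + 1) % 2], ...);

  // Wait until this stage's data has landed in shared memory.
  ptx::mbarrier_wait_parity(&mbar[stage % 2], parity);

  for (int y = 0; y < compute_stage_y; ++y) {
    for (int x = 0; x < compute_stage_x; ++x) {
      // ldmatrix + MMA + amax accumulation: only READS in_shs[stage % 2],
      // and warps touch disjoint tiles. Before this PR a __syncthreads()
      // sat inside this nest; no ordering is needed between iterations.
      ComputeKernel(in_shs[stage % 2], y, x);
    }
  }

  __syncthreads();                      // single barrier: every reader is done
  ptx::fence_proxy_async_shared_cta();  // make generic-proxy reads visible to
                                        // the async proxy before the next TMA
                                        // overwrites this ping-pong buffer
}
```

The key point is that the __syncthreads()/fence pair only needs to sit between the last generic read of a buffer and the next TMA write to it, so once per stage is sufficient.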