
Single __syncthreads per stage in GroupHadamardAmaxTmaKernel#2809

Open
cael-ling wants to merge 3 commits into NVIDIA:main from cael-ling:refactor/grp_hadamard_syncthreads

Conversation

@cael-ling

@cael-ling cael-ling commented Mar 29, 2026

Description

In the TMA + compute pipeline of GroupHadamardAmaxTmaKernel, __syncthreads() was previously invoked inside the inner compute_stage_y loop. From a dependency standpoint, within a stage all threads only read shared memory on the Tensor Core path, and warps access disjoint tiled regions; the only block-wide ordering actually required is that every thread has finished all generic shared-memory reads of this stage’s ping-pong buffer before a later iteration may issue a TMA copy that reuses that buffer.

The barrier is therefore hoisted: after all ComputeKernel work for the current stage (the full compute_stage_y × compute_stage_x nest), a single __syncthreads() runs, followed by ptx::fence_proxy_async_shared_cta(), so generic shared-memory visibility is established before the next async TMA can overwrite the buffer. This cuts the barrier count from one per compute_stage_y iteration to one per stage, reducing synchronization overhead while preserving the intended “no reader left” semantics before buffer reuse.
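The restructuring can be summarized with a simplified sketch. This is illustrative only: identifiers such as `num_stages`, `compute_stage_y_num`, `wait_for_tma`, and `compute_tile` are placeholders for the kernel's actual structure, not the verbatim code.

```cuda
// Before: one barrier per compute_stage_y iteration.
for (int stage = 0; stage < num_stages; ++stage) {
  wait_for_tma(stage);                      // mbarrier wait for this stage's data
  for (int y = 0; y < compute_stage_y_num; ++y) {
    for (int x = 0; x < compute_stage_x_num; ++x)
      compute_tile(y, x);                   // ldmatrix reads + MMA; results stay in registers
    __syncthreads();                        // redundant: no shared-memory writes between iterations
  }
  ptx::fence_proxy_async_shared_cta();
}

// After: a single barrier once the whole compute nest has drained its reads.
for (int stage = 0; stage < num_stages; ++stage) {
  wait_for_tma(stage);
  for (int y = 0; y < compute_stage_y_num; ++y)
    for (int x = 0; x < compute_stage_x_num; ++x)
      compute_tile(y, x);                   // read-only on shared memory
  __syncthreads();                          // every reader is done with this ping-pong buffer
  ptx::fence_proxy_async_shared_cta();      // make generic reads visible to the async proxy
}
```

The __syncthreads() + fence pair must stay in that order: the barrier establishes that all threads' generic-proxy reads have completed, and the fence then orders those generic accesses against subsequent async-proxy (TMA) writes to the same shared-memory buffer.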

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@greptile-apps
Contributor

greptile-apps bot commented Mar 29, 2026

Greptile Summary

This PR tightens the synchronization in three TMA-based Hadamard kernel variants (GroupHadamardAmaxTmaKernel, HadamardAmaxTmaKernel, and GraphSafeGroupHadamardAmaxTmaKernel) by moving __syncthreads() from inside the inner compute_stage_y loop to after the full compute_stage_y × compute_stage_x compute nest. The result is one __syncthreads() per outer TMA stage rather than one per compute_stage_y iteration.

  • Correctness is preserved. ComputeKernel only reads shared memory via ldmatrix into registers and accumulates amax values in register variables — it never writes back to shared memory. Because there are no inter-iteration write-then-read hazards within the compute nest, the single post-nest sync is sufficient to guarantee that "no reader is left" in the ping-pong buffer before TMA may reuse it.
  • The fence ordering is unchanged. __syncthreads() followed by ptx::fence_proxy_async_shared_cta() still runs once per outer stage, which is exactly what is required to make generic shared-memory accesses visible to the async TMA proxy before the next async write.
  • The change is applied consistently across all three kernel files, with identical structure and comments in each.
  • The reduction in barrier count is proportional to compute_stage_y_num, which can be > 1, giving a real runtime improvement for larger tile configurations.
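The "read-only compute" argument above can be made concrete with a hedged sketch of a per-tile compute step. The helper names (`ldmatrix_x4`, `mma_and_unpack`) and the fragment sizes are hypothetical stand-ins for the real ComputeKernel's PTX wrappers; the point is only that shared memory is read, never written, and amax accumulation lives entirely in registers.

```cuda
// Illustrative only — not the actual ComputeKernel implementation.
__device__ void compute_tile(const __half* smem_tile, float& amax) {
  uint32_t frag[4];
  ldmatrix_x4(frag, smem_tile);            // READ shared memory into registers
  float vals[8];
  mma_and_unpack(frag, vals);              // Tensor Core MMA; outputs land in registers
  #pragma unroll
  for (int i = 0; i < 8; ++i)
    amax = fmaxf(amax, fabsf(vals[i]));    // register-only accumulation, no smem write-back
}
```

Because no iteration of the compute nest writes shared memory, there is no write-then-read hazard between iterations, and the only ordering the block needs is the single post-nest barrier before TMA reuses the buffer.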

Confidence Score: 5/5

  • Safe to merge — the optimization is correct, consistently applied, and reduces unnecessary barrier overhead without affecting correctness.
  • ComputeKernel is provably read-only on shared memory (all output goes to registers), so moving __syncthreads() outside the inner loop introduces no data hazard. The fence/barrier sequence per TMA stage is preserved. All three kernel files receive identical, consistent changes. No P0 or P1 issues found.
  • No files require special attention.

Important Files Changed

Filename Overview
transformer_engine/common/hadamard_transform/group_hadamard_transform.cu Moves __syncthreads() from inside compute_stage_y loop to after the full compute_stage_y × compute_stage_x nest, reducing barrier invocations from N to 1 per TMA stage. Safe because ComputeKernel only reads shared memory into registers.
transformer_engine/common/hadamard_transform/hadamard_transform.cu Same optimization as group_hadamard_transform.cu — single __syncthreads() per outer stage instead of one per compute_stage_y iteration.
transformer_engine/common/hadamard_transform/graph_safe_group_hadamard_transform.cu Same optimization consistently applied to the graph-safe variant of the kernel.

Sequence Diagram

sequenceDiagram
    participant TMA as TMA Engine
    participant SM as Shared Memory (ping-pong)
    participant TC as Thread Block (Tensor Core)

    Note over TMA,TC: Outer loop: stage N

    TMA->>SM: copy_2d_to_shared(in_shs[next%2]) — prefetch stage N+1
    TC->>SM: mbarrier_wait_parity(mbar[N]) — wait for stage N data

    loop compute_stage_y × compute_stage_x (all read-only)
        TC->>SM: ldmatrix from in_shs[N%2] (READ ONLY)
        TC->>TC: MMA + amax accumulation into registers
    end

    Note over TC: __syncthreads() — single barrier per stage (NEW)
    Note over TC: fence_proxy_async_shared_cta() — generic SM visible before next TMA

    Note over TMA,TC: Outer loop: stage N+1

    TMA->>SM: copy_2d_to_shared(in_shs[(N+2)%2]) — prefetch stage N+2<br/>(safe: buffer was released by __syncthreads above)
    TC->>SM: mbarrier_wait_parity(mbar[N+1])

Reviews (3): Last reviewed commit: "Apply the change to other variants"

cael-ling and others added 2 commits March 29, 2026 03:13
Signed-off-by: Cael Ling <caell@nvidia.com>
Made-with: Cursor
@cael-ling cael-ling force-pushed the refactor/grp_hadamard_syncthreads branch from 1657a91 to 9822e07 Compare March 29, 2026 10:16
@cael-ling cael-ling changed the title perf(hadamard): single __syncthreads per stage in GroupHadamardAmaxTmaKernel Single __syncthreads per stage in GroupHadamardAmaxTmaKernel Mar 29, 2026
@zhongbozhu
Collaborator

LGTM, can you introduce this change to other similar variants of the kernel? https://github.com/NVIDIA/TransformerEngine/tree/main/transformer_engine/common/hadamard_transform

Signed-off-by: Cael Ling <caell@nvidia.com>
@cael-ling
Author

The change has been applied to the other variants: group_hadamard_transform.cu, hadamard_transform.cu, and graph_safe_group_hadamard_transform.cu.

2 participants