Single __syncthreads per stage in GroupHadamardAmaxTmaKernel #2809
cael-ling wants to merge 3 commits into NVIDIA:main from
Conversation
Greptile Summary: This PR tightens the synchronization in three TMA-based Hadamard kernel variants.
Confidence Score: 5/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant TMA as TMA Engine
    participant SM as Shared Memory (ping-pong)
    participant TC as Thread Block (Tensor Core)
    Note over TMA,TC: Outer loop: stage N
    TMA->>SM: copy_2d_to_shared(in_shs[next%2]) — prefetch stage N+1
    TC->>SM: mbarrier_wait_parity(mbar[N]) — wait for stage N data
    loop compute_stage_y × compute_stage_x (all read-only)
        TC->>SM: ldmatrix from in_shs[N%2] (READ ONLY)
        TC->>TC: MMA + amax accumulation into registers
    end
    Note over TC: __syncthreads() — single barrier per stage (NEW)
    Note over TC: fence_proxy_async_shared_cta() — generic SM visible before next TMA
    Note over TMA,TC: Outer loop: stage N+1
    TMA->>SM: copy_2d_to_shared(in_shs[(N+2)%2]) — prefetch stage N+2<br/>(safe: buffer was released by __syncthreads above)
    TC->>SM: mbarrier_wait_parity(mbar[N+1])
```
Reviews (3): Last reviewed commit: "Apply the change to other variants"
Signed-off-by: Cael Ling <caell@nvidia.com> Made-with: Cursor
1657a91 to 9822e07
LGTM, can you introduce this change to other similar variants of the kernel? https://github.com/NVIDIA/TransformerEngine/tree/main/transformer_engine/common/hadamard_transform
Signed-off-by: Cael Ling <caell@nvidia.com>
The change has been applied to the other variants: group_hadamard_transform.cu, hadamard_transform.cu, and graph_safe_group_hadamard_transform.cu.
Description
In the TMA + compute pipeline of GroupHadamardAmaxTmaKernel, __syncthreads() was previously invoked inside the inner compute_stage_y loop. Within a stage, however, all threads only read shared memory on the Tensor Core path, and warps access disjoint tiled regions. The only block-wide ordering actually required is that every thread has finished all generic shared-memory reads of the stage's ping-pong buffer before a later iteration issues a TMA copy that reuses that buffer.
The barrier is therefore hoisted out of the inner nest: after all ComputeKernel work for the current stage (the full compute_stage_y × compute_stage_x nest), the kernel runs one __syncthreads() followed by ptx::fence_proxy_async_shared_cta(), establishing generic shared-memory visibility before the next async TMA can overwrite the buffer. This reduces the barrier count per stage, cuts barrier overhead, and preserves the intended "no reader left behind" guarantee before buffer reuse.
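The new barrier placement can be sketched as follows. This is a hypothetical outline, not the actual kernel: identifiers such as in_shs, mbar, compute_stage_y/x, and ComputeKernel follow the names used in this PR, while the loop scaffolding and arguments are illustrative assumptions.

```cuda
// Sketch of one pipeline iteration after this PR (assumed structure).
for (int stage = 0; stage < num_stages; ++stage) {
  // Prefetch the next stage into the other ping-pong buffer; in the real
  // kernel this TMA copy is issued asynchronously.
  // copy_2d_to_shared(in_shs[(stage + 1) % 2], ...);

  // Wait until this stage's data has landed in shared memory.
  ptx::mbarrier_wait_parity(&mbar[stage % 2], parity);

  for (int y = 0; y < compute_stage_y; ++y) {
    for (int x = 0; x < compute_stage_x; ++x) {
      // ldmatrix + MMA + amax accumulation: only READS in_shs[stage % 2],
      // and warps touch disjoint tiles. Before this PR a __syncthreads()
      // sat inside this nest; no ordering is needed between iterations.
      ComputeKernel(in_shs[stage % 2], y, x);
    }
  }

  __syncthreads();                      // single barrier: every reader is done
  ptx::fence_proxy_async_shared_cta();  // make generic-proxy reads visible to
                                        // the async proxy before the next TMA
                                        // overwrites this ping-pong buffer
}
```

The key point is that the __syncthreads()/fence pair only needs to sit between the last generic read of a buffer and the next TMA write to it, so once per stage is sufficient.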