[pull] main from NVIDIA:main #542

Merged: pull[bot] merged 5 commits into phu0ngng:main from NVIDIA:main on Apr 3, 2026

pull[bot] commented on Apr 2, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

tdophung and others added 4 commits April 2, 2026 10:17
* Pass input_output_alias to TritonAutotunedKernelCall

Signed-off-by: JAX Toolbox <jax@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add jax version guard for the input_output_aliasing fix

Signed-off-by: tdophung <tdophung@nvidia.com>

---------

Signed-off-by: JAX Toolbox <jax@nvidia.com>
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: JAX Toolbox <jax@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* done

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* one review comment from greptile

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* the "instead" part of the comment is not needed

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/tensor/float8_blockwise_tensor.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* No need to set it to None

Remove unnecessary columnwise data and scale inv assignments.

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

---------

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* cuDNN now always returns Stats, and returns Max only with `return_max_logit=true`

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix a typo that caused a bug

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* update doc strings

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix more docs

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fixes from the feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* update cudnn-frontend to v1.19.1

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* update the cudnn frontend

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix a wrong omission

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bugfix: mask out padding tokens when using the THD format

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes from greptile feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor nit

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fixes from feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
pull[bot] locked and limited conversation to collaborators on Apr 2, 2026
pull[bot] added the ⤵️ pull label on Apr 2, 2026
* Enabled persistency with WorkID Query feature

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added a struct with tunable parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added persistency with static scheduling

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for benchmarking

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed out-of-boundary error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Tuned kernel parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring 2

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring 3

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed the dynamic (WorkID Query) persistency

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for PR

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Ready for benchmark

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for benchmark - Regular kernel

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the source code to the profiler

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added constructors to Job and Block descriptors

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed the prefetch overlapping between jobs

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Cache tensor ID

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* ShapeRepresentation is not a template parameter

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed redundant fence_proxy

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used mixed precision FMA

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added Quantize parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the fast math branch

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the fast math to cpp test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Align tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Use STS instead of generic ST

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Add zero-tensor cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used LDS instead of generic LD in colwise path

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used LDS instead of generic LD in rowwise

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for merge

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Uncommented test cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added FP16 Fast math path to rowwise processing

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed lint

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Modifications per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Assert the buffer size

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added fast math RCP for bf16

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fast math for BF16 is now default

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed compilation error when compiling on previous archs

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Boundary condition fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed compilation error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring. Moved helpers to core-common

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactoring per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Addressed the PR review comments

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed the compilation error when PTX was compiled for CUDA 13.0

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed pytorch extensions

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
pull[bot] merged commit 42267ec into phu0ngng:main on Apr 3, 2026
pull[bot] had a problem deploying to github-pages on April 3, 2026 at 04:33 (status: Failure)

5 participants