Merged
* Pass input_output_alias to TritonAutotunedKernelCall Signed-off-by: JAX Toolbox <jax@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Add jax version guard for the input_output_aliasing fix Signed-off-by: tdophung <tdophung@nvidia.com>
---------
Signed-off-by: JAX Toolbox <jax@nvidia.com>
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: JAX Toolbox <jax@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* done Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* one review comment from greptile Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* instead part of the comment not needed Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* address review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* Update transformer_engine/pytorch/tensor/float8_blockwise_tensor.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* No need to set it to None. Remove unnecessary columnwise data and scale inv assignments. Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
---------
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* cudnn now returns Stats always and Max only with `return_max_logit=true` Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* fix a typo that caused a bug Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* update doc strings Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix more docs Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* fixes from the feedback Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* update cudnn-frontend to v1.19.1 Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* update the cudnn frontend Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* fix a wrong omission Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* bugfix: mask out padding tokens when THD Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fixes from greptile feedback Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* minor nit Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* fixes from feedback Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
---------
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Enabled persistency with WorkID Query feature Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Added a struct with tunable parameters Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Added persistency with static scheduling Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fixed test cases Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Ready for benchmarking Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fixed out-of-boundary error Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Tuned kernel parameters Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Refactoring Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Refactoring 2 Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Refactoring 3 Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Removed the dynamic (WorkID Query) persistency Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Ready for PR Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fixes per the review Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Ready for benchmark Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Ready for benchmark - Regular kernel Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Added the source code to the profiler Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Added constructors to Job and Block descriptors Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Removed the prefetch overlapping between jobs Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Cache tensor ID Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* ShapeRepresentation is not a template parameter Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Removed redundant fence_proxy Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Refactoring Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Used mixed precision FMA Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Added Quantize parameters Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Added the fast math branch Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Added the fast math to cpp test suite Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Align tests Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Use STS instead of generic ST Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Add zero-tensor cases Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Used LDS instead of generic LD in colwise path Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Used LDS instead of generic LD in rowwise Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Ready for merge Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Uncommented test cases Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Added FP16 Fast math path to rowwise processing Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Refactoring Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fixed lint Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fix Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fixes Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fix Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fixed test suite Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fixed test suite Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fixes per the review Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Modifications per the review Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Assert the buffer size Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Added fast math RCP for bf16 Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fast math for BF16 is now default Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fixed compilation error when compiling on previous archs Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Boundary condition fix Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fixed compilation error Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Refactoring. Moved helpers to core-common Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Refactoring Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Refactoring per the review Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Addressed the PR review comments Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fixed the compilation error when PTX was compiled for CUDA 13.0 Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fixed pytorch extensions Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)