
Conversation

@Maxon081102

What does this PR do?
Fixes runtime errors caused by pointer size mismatches in Triton kernels during very large matrix computations.
This PR introduces a new Triton kernel that uses int64 pointers and adds automatic kernel selection between the existing int32 kernel and the new int64 version.
The change ensures stable execution on long‑context LLM training tasks, preventing RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered when matrix dimensions exceed int32 limits.
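For illustration, the automatic selection could look roughly like the sketch below. The kernel and helper names are hypothetical (not the identifiers used in this PR), and the check shown here is a simplified numel-based one; the actual check in the PR is stride-based, as discussed in the review below.

    # Hypothetical dispatch sketch: _matmul_kernel_int32 stands in for the existing
    # kernel, _matmul_kernel_int64 for the new int64-indexed variant.
    INT32_MAX = (1 << 31) - 1

    def _needs_int64(*tensors):
        # The largest element offset a pointer can reach inside the kernel is
        # numel - 1 for a contiguous tensor; if that does not fit in int32, the
        # default kernel's 32-bit index arithmetic can overflow.
        return any(t.numel() - 1 > INT32_MAX for t in tensors)

    def launch_matmul(a, b, c, grid, **meta):
        if _needs_int64(a, b, c):
            _matmul_kernel_int64[grid](a, b, c, **meta)  # int64-safe offsets
        else:
            _matmul_kernel_int32[grid](a, b, c, **meta)  # faster 32-bit indexing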

@meta-cla

meta-cla bot commented Oct 15, 2025

Hi @Maxon081102!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 15, 2025
@meta-cla

meta-cla bot commented Oct 15, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@lw lw (Contributor) left a comment


Thanks for the contribution. The change is very large, and this code is in best-effort maintenance mode, so we can only accept changes that don't add too much surface area.

Comment on lines +431 to +432
pid = tl.program_id(0).to(tl.int64)
pid_k = tl.program_id(1).to(tl.int64)

I suspect these are the only two lines that differ wrt the original version, is that so?

What is the downside of always casting to int64 in the original kernel? Did you observe some performance regression?
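(A sketch of how the regression question could be checked empirically; matmul_int32 and matmul_int64 are placeholders for launching the existing kernel and the int64 variant on identical inputs, not real functions in this codebase.)

    import torch
    import triton.testing

    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

    # triton.testing.do_bench times a callable on the GPU and reports milliseconds.
    ms_i32 = triton.testing.do_bench(lambda: matmul_int32(a, b))
    ms_i64 = triton.testing.do_bench(lambda: matmul_int64(a, b))
    print(f"int32 indexing: {ms_i32:.3f} ms | int64 indexing: {ms_i64:.3f} ms")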

Comment on lines +619 to +650
# Decide whether 32-bit address arithmetic can overflow; if so, use int64-safe kernel
INT32_MAX = (1 << 31) - 1

def _dim_or_zero(xs, i):
    return xs[i] if len(xs) > i else 0

use_int64 = False
for i in range(3):
    Mi = max(0, _dim_or_zero(ms, i))
    Ni = max(0, _dim_or_zero(ns, i))
    Ki = max(0, _dim_or_zero(ks, i))

    # A offsets
    a_row_term = max(0, Mi - 1) * int(strides_am[i])
    a_col_term = max(0, Ki - 1) * int(strides_ak[i])
    # B offsets
    b_row_term = max(0, Ki - 1) * int(strides_bk[i])
    b_col_term = max(0, Ni - 1) * int(strides_bn[i])
    # C offsets
    c_row_term = max(0, Mi - 1) * int(strides_cm[i])
    c_col_term = max(0, Ni - 1) * int(strides_cn[i])

    # Check per-term and per-address sums
    if (
        a_row_term > INT32_MAX or a_col_term > INT32_MAX or
        b_row_term > INT32_MAX or b_col_term > INT32_MAX or
        c_row_term > INT32_MAX or c_col_term > INT32_MAX or
        (a_row_term + a_col_term) > INT32_MAX or
        (b_row_term + b_col_term) > INT32_MAX or
        (c_row_term + c_col_term) > INT32_MAX
    ):
        use_int64 = True
        break

This code can be simplified
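For reference, one compact form of the same check (a sketch only, assuming non-negative dims and strides and equal-length ms/ns/ks lists; not necessarily the simplification being suggested here):

    INT32_MAX = (1 << 31) - 1

    def _max_offset(rows, cols, stride_r, stride_c):
        # Largest element offset reachable in a rows x cols view with these strides.
        return max(0, rows - 1) * int(stride_r) + max(0, cols - 1) * int(stride_c)

    use_int64 = any(
        _max_offset(ms[i], ks[i], strides_am[i], strides_ak[i]) > INT32_MAX
        or _max_offset(ks[i], ns[i], strides_bk[i], strides_bn[i]) > INT32_MAX
        or _max_offset(ms[i], ns[i], strides_cm[i], strides_cn[i]) > INT32_MAX
        for i in range(len(ms))
    )

With non-negative terms, the per-term comparisons in the loop above are implied by the sum comparisons, which is what makes the shorter form equivalent.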

