Replies: 2 comments
This is an interesting observation, thanks very much for reporting it (and I appreciate that you went into the code to isolate what the problem could be)! I am thinking we should document this in a note on the README. Symmetric contraction was not originally in the scope of our project, since it is mainly used in MACE; we are keeping it in beta status and are not yet exposing it at the package top level. We introduced this grouped_gemm-based implementation because the indexing overhead of the baseline code becomes very high when the species count grows large (work is performed for every zero entry), and because we thought it would be useful for accelerating MACE on AMD GPUs. We are still working on the ROCm integration, although that should be done shortly. Papers like this one https://arxiv.org/pdf/2504.10700 detail much better approaches to accelerating symmetric contraction.

I don't think the issue is that we are calling grouped_gemm multiple times (and kernel fusion would probably not help, although I could be wrong). A more likely culprit is that the combination of transposes / strides / leading dimensions for one of the calls in the backward pass differs from the forward pass, and cuBLAS may exhibit poorer performance for that combination: https://github.com/PASSIONLab/OpenEquivariance/blob/9602816fa4388b62559b9b3f8d7eea470e3c4214/openequivariance/extension/group_mm_cuda.hpp#L77C2-L92C36

Our action items to resolve:

A) Document this issue on the README

A will probably happen in a few commits, B will probably happen, and C may or may not happen, since we are prioritizing improvements to the convolution kernel (e.g. stable summation and accuracy improvements) first. We'll close this issue (and perhaps move it to a discussion if C doesn't get resolved) when we decide on one of these courses of action.
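To make the transpose point above concrete, here is a minimal NumPy sketch (a stand-in for the actual grouped_gemm / cuBLAS kernels, with made-up group sizes): the backward pass of Y = X @ W necessarily issues GEMMs with different transpose patterns (NT and TN) than the forward NN call, and BLAS libraries can hit slower code paths for some transpose / leading-dimension combinations.

```python
import numpy as np

# Illustrative only: per group g, forward computes Y_g = X_g @ W_g (NN layout).
# Backward requires two more grouped GEMMs with DIFFERENT transpose patterns:
#   dX_g = dY_g @ W_g.T   (NT layout)
#   dW_g = X_g.T @ dY_g   (TN layout)
rng = np.random.default_rng(0)
groups = [(4, 3, 5), (2, 6, 4)]  # hypothetical (m, k, n) per group

for m, k, n in groups:
    X = rng.standard_normal((m, k))
    W = rng.standard_normal((k, n))
    dY = rng.standard_normal((m, n))  # upstream gradient

    Y = X @ W        # forward: no transposes
    dX = dY @ W.T    # backward w.r.t. input: second operand transposed
    dW = X.T @ dY    # backward w.r.t. weights: first operand transposed

    # Sanity check for dW: since f(W) = sum(dY * (X @ W)) is linear in W,
    # a directional difference recovers (X.T @ dY)[0, 0] exactly.
    eps = 1e-6
    E = np.zeros_like(W); E[0, 0] = eps
    fd = (np.sum(dY * (X @ (W + E))) - np.sum(dY * (X @ W))) / eps
    assert abs(fd - dW[0, 0]) < 1e-3
```

On a GPU the three layouts map to different cuBLAS dispatch paths, which is why the backward calls can be much slower even though the FLOP count is comparable to the forward pass.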
Sorry to return to this so late. I've added a small note to the README indicating the slowdown for the symmetric contraction backward pass, and I've added action items B & C to our project tracker. I'm going to convert this to a discussion so others are aware, but I have no immediate plans on my end to speed up this kernel much further, as it is rather specific to MACE.
I found that the backward pass of the grouped GEMM implementation of symmetric tensor contraction performs poorly.
In a MACE-based test, cueq's symmetric_tensor_contraction takes 2.4 ms on average, while grouped GEMM takes 24 ms on average, a 10x performance gap. In the forward pass, however, grouped GEMM is even faster than cueq.
I suspect the main performance difference in the backward pass comes from the multiple cuBLAS calls in grouped GEMM. Are related optimizations planned, such as kernel fusion?
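For reference, the kind of forward-vs-backward comparison described above can be reproduced with a minimal harness like the following. This is a hedged sketch: plain NumPy matmuls stand in for the actual cueq / grouped_gemm GPU kernels, and the group shapes are arbitrary, so absolute timings will not match the 2.4 ms / 24 ms figures from the real MACE-based test.

```python
import time
import numpy as np

# Hypothetical micro-harness: compare the cost of a grouped-GEMM forward pass
# against its backward pass (which issues two GEMMs per group, with
# transposed operand layouts).
rng = np.random.default_rng(0)
groups = [(rng.standard_normal((256, 64)), rng.standard_normal((64, 128)))
          for _ in range(8)]
grads = [rng.standard_normal((256, 128)) for _ in range(8)]

def forward():
    return [X @ W for X, W in groups]

def backward():
    # dX = dY @ W.T and dW = X.T @ dY for every group
    return [(dY @ W.T, X.T @ dY) for (X, W), dY in zip(groups, grads)]

def bench(fn, reps=20):
    fn()  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

fwd_ms = bench(forward) * 1e3
bwd_ms = bench(backward) * 1e3
print(f"forward: {fwd_ms:.3f} ms, backward: {bwd_ms:.3f} ms")
```

On a GPU, the same structure applies but each loop body becomes a single grouped cuBLAS call, and CUDA events (or `torch.cuda.synchronize` around the timers) are needed for accurate measurement.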