Replies: 2 comments
This is an interesting observation, thanks very much for reporting it (and I appreciate that you went into the code to isolate what the problem could be)! I am thinking we should document this in a note on the README. Symmetric contraction was not originally in the scope of our project, since it is mainly used in MACE; we are keeping it in beta status and are not yet exposing it at the package top level. We introduced this grouped_gemm-based implementation because the indexing overhead of the baseline code becomes very high when the species count grows large (work is performed for every zero entry), and because we thought it would be useful for accelerating MACE on AMD GPUs. We are still working on the ROCm integration, although that should be done shortly. Papers like this one https://arxiv.org/pdf/2504.10700 detail much better approaches to accelerating symmetric contraction.

I don't think the issue is that we are calling grouped_gemm multiple times (and kernel fusion would probably not help, although I could be wrong). A more likely culprit is that the combination of transposes / strides / leading dimensions for one of the calls in the backward pass differs from the forward pass, and cuBLAS may exhibit poorer performance for that combination: https://github.com/PASSIONLab/OpenEquivariance/blob/9602816fa4388b62559b9b3f8d7eea470e3c4214/openequivariance/extension/group_mm_cuda.hpp#L77C2-L92C36

Our action items to resolve:

A) Document this issue on the README

A will probably happen in a few commits, B will probably happen, and C may or may not happen, since we are prioritizing improvements to the convolution kernel (e.g. stable summation and accuracy improvements) first. We'll close this issue (and perhaps move it to a discussion if C doesn't get resolved) when we decide on one of these courses of action.
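To make the transpose point above concrete, here is a minimal NumPy sketch (a stand-in for the actual grouped_gemm / cuBLAS kernels, with made-up group sizes): the backward pass of Y = X @ W necessarily issues GEMMs with different transpose patterns (NT and TN) than the forward NN call, and BLAS libraries can hit slower code paths for some transpose / leading-dimension combinations.

```python
import numpy as np

# Illustrative only: per group g, forward computes Y_g = X_g @ W_g (NN layout).
# Backward requires two more grouped GEMMs with DIFFERENT transpose patterns:
#   dX_g = dY_g @ W_g.T   (NT layout)
#   dW_g = X_g.T @ dY_g   (TN layout)
rng = np.random.default_rng(0)
groups = [(4, 3, 5), (2, 6, 4)]  # hypothetical (m, k, n) per group

for m, k, n in groups:
    X = rng.standard_normal((m, k))
    W = rng.standard_normal((k, n))
    dY = rng.standard_normal((m, n))  # upstream gradient

    Y = X @ W        # forward: no transposes
    dX = dY @ W.T    # backward w.r.t. input: second operand transposed
    dW = X.T @ dY    # backward w.r.t. weights: first operand transposed

    # Sanity check for dW: since f(W) = sum(dY * (X @ W)) is linear in W,
    # a directional difference recovers (X.T @ dY)[0, 0] exactly.
    eps = 1e-6
    E = np.zeros_like(W); E[0, 0] = eps
    fd = (np.sum(dY * (X @ (W + E))) - np.sum(dY * (X @ W))) / eps
    assert abs(fd - dW[0, 0]) < 1e-3
```

On a GPU the three layouts map to different cuBLAS dispatch paths, which is why the backward calls can be much slower even though the FLOP count is comparable to the forward pass.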
Sorry to return to this so late. I've added a small note to the README indicating the slowdown for the symmetric contraction backward pass, and I've added action items B & C to our project tracker. I'm going to convert this to a discussion so others are aware, but I have no immediate plans on my end to speed up this kernel much further, as it is rather specific to MACE.
I found that the backward pass of the grouped GEMM implementation of symmetric tensor contraction performs poorly.
In a MACE-based test, cueq's symmetric_tensor_contraction takes 2.4 ms on average, while grouped GEMM takes 24 ms on average, a 10x performance gap. In the forward pass, however, grouped GEMM is even faster than cueq.
I suspect the main performance difference in the backward pass comes from the multiple cuBLAS calls in grouped GEMM. Are related optimizations planned, such as kernel fusion?
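For reference, the kind of forward-vs-backward comparison described above can be reproduced with a minimal harness like the following. This is a hedged sketch: plain NumPy matmuls stand in for the actual cueq / grouped_gemm GPU kernels, and the group shapes are arbitrary, so absolute timings will not match the 2.4 ms / 24 ms figures from the real MACE-based test.

```python
import time
import numpy as np

# Hypothetical micro-harness: compare the cost of a grouped-GEMM forward pass
# against its backward pass (which issues two GEMMs per group, with
# transposed operand layouts).
rng = np.random.default_rng(0)
groups = [(rng.standard_normal((256, 64)), rng.standard_normal((64, 128)))
          for _ in range(8)]
grads = [rng.standard_normal((256, 128)) for _ in range(8)]

def forward():
    return [X @ W for X, W in groups]

def backward():
    # dX = dY @ W.T and dW = X.T @ dY for every group
    return [(dY @ W.T, X.T @ dY) for (X, W), dY in zip(groups, grads)]

def bench(fn, reps=20):
    fn()  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

fwd_ms = bench(forward) * 1e3
bwd_ms = bench(backward) * 1e3
print(f"forward: {fwd_ms:.3f} ms, backward: {bwd_ms:.3f} ms")
```

On a GPU, the same structure applies but each loop body becomes a single grouped cuBLAS call, and CUDA events (or `torch.cuda.synchronize` around the timers) are needed for accurate measurement.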