I found that the backward pass of the grouped GEMM implementation of symmetric tensor contraction performs poorly.
In a MACE-based test, cueq's symmetric_tensor_contraction backward takes 2.4 ms on average, while the grouped GEMM version takes 24 ms, a roughly 10x gap. In the forward pass, however, grouped GEMM is actually faster than cueq.
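For reference, this is a minimal sketch of how I timed the backward passes, assuming both implementations are exposed as callable modules (`cueq_module` and `group_gemm_module` below are placeholders for the actual objects in my MACE test, not real API names):

```python
import torch

def time_backward(module, x, iters=100):
    """Average backward time in milliseconds, measured with CUDA events."""
    # Warm-up so cuBLAS heuristics / autotuning do not skew the numbers.
    for _ in range(10):
        module(x).sum().backward()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()

    total = 0.0
    for _ in range(iters):
        loss = module(x).sum()  # fresh graph each iteration
        start.record()
        loss.backward()         # only the backward pass is timed
        end.record()
        torch.cuda.synchronize()
        total += start.elapsed_time(end)
    return total / iters

# Placeholder usage; shapes depend on the actual contraction being tested.
# x = torch.randn(..., device="cuda", requires_grad=True)
# print(time_backward(cueq_module, x), time_backward(group_gemm_module, x))
```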
My guess is that the main source of the backward gap is the many separate cuBLAS calls issued by grouped GEMM. Are there plans for related optimizations, such as kernel fusion?
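To illustrate what I mean by multiple cuBLAS calls: conceptually, a backward that loops over groups has to issue two GEMMs per group (one for the input gradient, one for the weight gradient), so with G groups it launches on the order of 2*G kernels, whereas a fused kernel could cover all groups in far fewer launches. The sketch below is only a conceptual illustration of that pattern, not the actual implementation:

```python
import torch

def grouped_gemm_backward_naive(xs, ws, grad_ys):
    """Illustrative backward for per-group products y_g = x_g @ w_g.

    Each group triggers two separate matmuls (hence two cuBLAS calls),
    so G groups produce roughly 2*G kernel launches in the backward pass.
    """
    grad_xs, grad_ws = [], []
    for x, w, gy in zip(xs, ws, grad_ys):
        grad_xs.append(gy @ w.t())   # dL/dx_g = dL/dy_g @ w_g^T
        grad_ws.append(x.t() @ gy)   # dL/dw_g = x_g^T @ dL/dy_g
    return grad_xs, grad_ws
```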