Poor performance of group gemm backward implementation #100

@malixian

The backward pass of the group GEMM implementation of symmetric tensor contraction performs poorly.
In a MACE-based test, cueq's symmetric_tensor_contraction takes 2.4 ms on average in the backward pass, while group GEMM takes 24 ms on average, a 10x gap. In the forward pass, however, group GEMM is even faster than cueq.
I suspect the main cause of the backward gap is the multiple cuBLAS calls issued by group GEMM. Are related optimizations, such as kernel fusion, planned?
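
To make that guess concrete, here is a minimal sketch of why a grouped GEMM backward pass tends to issue more matrix products than the forward pass. It is a pure-Python stand-in, not the actual group GEMM or cueq code: `CountingGemm`, `grouped_forward`, and `grouped_backward` are hypothetical names, and treating each product as one separate library call is an assumption about the implementation.

```python
# Illustrative only: a pure-Python stand-in for a grouped GEMM.
# Each call to CountingGemm is assumed to correspond to one library GEMM launch.

def matmul(a, b):
    """Naive dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

class CountingGemm:
    """Wraps matmul and counts how many times it is invoked."""
    def __init__(self):
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return matmul(a, b)

def grouped_forward(gemm, As, Bs):
    # One product per group: C_g = A_g @ B_g
    return [gemm(a, b) for a, b in zip(As, Bs)]

def grouped_backward(gemm, As, Bs, dCs):
    # Two products per group: dA_g = dC_g @ B_g^T and dB_g = A_g^T @ dC_g
    dAs = [gemm(dc, transpose(b)) for dc, b in zip(dCs, Bs)]
    dBs = [gemm(transpose(a), dc) for a, dc in zip(As, dCs)]
    return dAs, dBs

groups = 4
As = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(groups)]
Bs = [[[0.5, 0.0], [0.0, 0.5]] for _ in range(groups)]
dCs = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(groups)]

fwd, bwd = CountingGemm(), CountingGemm()
Cs = grouped_forward(fwd, As, Bs)
dAs, dBs = grouped_backward(bwd, As, Bs, dCs)
print(fwd.calls, bwd.calls)  # 4 8
```

Under these assumptions the backward pass needs twice as many products as the forward pass. That alone would not account for a 10x gap, so per-call launch overhead or the transposed operand layouts may also contribute; profiling would be needed to confirm.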

Labels: enhancement
