I found that the backward pass of the grouped GEMM implementation of symmetric tensor contraction performs poorly.
In a MACE-based test, cueq's symmetric_tensor_contraction backward takes 2.4 ms on average, while the grouped GEMM version takes 24 ms, a roughly 10x gap. In the forward pass, however, grouped GEMM is actually faster than cueq.
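For reference, this is a minimal sketch of how I timed the backward passes, assuming both implementations are exposed as callable modules (`cueq_module` and `group_gemm_module` below are placeholders for the actual objects in my MACE test, not real API names):

```python
import torch

def time_backward(module, x, iters=100):
    """Average backward time in milliseconds, measured with CUDA events."""
    # Warm-up so cuBLAS heuristics / autotuning do not skew the numbers.
    for _ in range(10):
        module(x).sum().backward()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()

    total = 0.0
    for _ in range(iters):
        loss = module(x).sum()  # fresh graph each iteration
        start.record()
        loss.backward()         # only the backward pass is timed
        end.record()
        torch.cuda.synchronize()
        total += start.elapsed_time(end)
    return total / iters

# Placeholder usage; shapes depend on the actual contraction being tested.
# x = torch.randn(..., device="cuda", requires_grad=True)
# print(time_backward(cueq_module, x), time_backward(group_gemm_module, x))
```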
My guess is that the main source of the backward gap is the many separate cuBLAS calls issued by grouped GEMM. Are there plans for related optimizations, such as kernel fusion?
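To illustrate what I mean by multiple cuBLAS calls: conceptually, a backward that loops over groups has to issue two GEMMs per group (one for the input gradient, one for the weight gradient), so with G groups it launches on the order of 2*G kernels, whereas a fused kernel could cover all groups in far fewer launches. The sketch below is only a conceptual illustration of that pattern, not the actual implementation:

```python
import torch

def grouped_gemm_backward_naive(xs, ws, grad_ys):
    """Illustrative backward for per-group products y_g = x_g @ w_g.

    Each group triggers two separate matmuls (hence two cuBLAS calls),
    so G groups produce roughly 2*G kernel launches in the backward pass.
    """
    grad_xs, grad_ws = [], []
    for x, w, gy in zip(xs, ws, grad_ys):
        grad_xs.append(gy @ w.t())   # dL/dx_g = dL/dy_g @ w_g^T
        grad_ws.append(x.t() @ gy)   # dL/dw_g = x_g^T @ dL/dy_g
    return grad_xs, grad_ws
```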