GPU memory optimization with larger angular momentum #161
As is well known, the main headaches with CGTP (the Clebsch-Gordan tensor product) are computational inefficiency and GPU memory overhead. OpenEquivariance achieves excellent acceleration over e3nn by optimizing GPU utilization, as reported in the paper. However, algorithmic methods like the SO(2) convolution and the Gaunt tensor product can reduce the complexity from O(L^6) to between O(L^3) and O(L^2 log L), which makes them very fast while remaining friendly to GPU memory. So, has OpenEquivariance been tested with respect to (1) GPU memory optimization, especially when the angular momentum L is very large, and (2) acceleration in scenarios with larger angular momentum L? Thanks.
How large do you want to make L? We've tested up to L=7,7,7 interactions with high GPU utilization; see line 87 of tests/benchmark.py. You can benchmark it yourself by modifying the irreps in the example at the top of the README to use high L values.

We haven't benchmarked against those methods (but you are welcome to). In terms of memory for our implementation: the nonzeros of the TP are coded into the instruction stream, so as L increases we don't really run into memory constraints (though the I-cache will eventually spill, resulting in a slowdown). I'd say we are still pretty memory- and compute-efficient overall :)

The SO(2) convolution is very clever, but you have to rotate the irreps by a unique transform on each edge of the atomic graph, so it's much more complicated from an HPC perspective. See page 38, bottom-most paragraph of my dissertation for an analysis of the tradeoffs and a full explanation of their method. Against a simple PyTorch implementation of the SO(2) convolution, I think we'd be very competitive (if not outright better).

The Gaunt TP is potentially very fast, but it's a distinct operation (see the analysis here) that may sacrifice some expressive capability. And again, if you wish to implement optimizations like kernel fusion with the graph convolution, as we do, a simple PyTorch implementation probably won't cut it: you would need to fuse the FFTs / multiplications at the kernel level with the node / edge aggregation, which is a lot of engineering. But happy to see numbers to the contrary on these; these are back-of-the-envelope guesses.
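As a rough illustration of the kind of high-L benchmark suggested above, here is a minimal timing sketch using e3nn's `o3.FullyConnectedTensorProduct` as a reference point. This is not the project's own benchmark script: the irreps strings, batch size, and timing loop are illustrative assumptions, and the OpenEquivariance README example can be modified analogously with the same high-L irreps for a side-by-side comparison.

```python
# Minimal sketch (assumed setup, not the library's own benchmark):
# time one fully connected tensor product at high angular momentum in e3nn.
import time
import torch
from e3nn import o3

# Illustrative high-L irreps; raise the L values here to stress the TP.
irreps_in1 = o3.Irreps("32x7e")
irreps_in2 = o3.Irreps("1x7e")
irreps_out = o3.Irreps("32x7e")

device = "cuda" if torch.cuda.is_available() else "cpu"
tp = o3.FullyConnectedTensorProduct(irreps_in1, irreps_in2, irreps_out).to(device)

# Random batched inputs with the correct flattened irreps dimension.
batch = 10_000
x = irreps_in1.randn(batch, -1).to(device)
y = irreps_in2.randn(batch, -1).to(device)

# Warm up, then time a single batched tensor product.
for _ in range(3):
    tp(x, y)
if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
out = tp(x, y)
if device == "cuda":
    torch.cuda.synchronize()
print(f"output shape {tuple(out.shape)}, elapsed {time.perf_counter() - start:.4f} s")
```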