Hi, thanks for your great work!
I am trying to explore how the TP kernel is implemented. I noticed that a CUDA file named subkernel_per_interaction_multirep.cuh (which is attached here) is generated during execution; I believe that is where the GPU kernel is implemented. Regarding this file, I have two questions about its forward process:

1. Why does the code use #pragma unroll to unroll the loop?
2. L3_local_vec from the attached code seems to be written into the shared memory smem_gemm_L3 and then multiplied by the weight matrix, which seems contrary to the paper's description.

The source code is provided below. Any help would be greatly appreciated! Thank you again for the great work!
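Roughly, the pattern I am asking about looks like this (my own simplified sketch with placeholder dimensions, not the actual generated file):

```cuda
// Simplified sketch of the pattern in question, not the actual generated
// subkernel_per_interaction_multirep.cuh; all sizes here are placeholders.
#define L3_LEN 16  // hypothetical per-thread output segment length
#define N_ITER  8  // hypothetical number of interaction terms
#define W_COLS 16  // hypothetical weight-matrix column count
#define BLOCK  64  // threads per block

__global__ void tp_forward_sketch(const float* in, const float* W, float* out) {
    __shared__ float smem_gemm_L3[BLOCK * L3_LEN];
    int t = blockIdx.x * BLOCK + threadIdx.x;

    // Per-thread accumulator kept in registers.
    float L3_local_vec[L3_LEN] = {0.0f};

    // Question 1: the interaction loop is unrolled with #pragma unroll.
    #pragma unroll
    for (int i = 0; i < N_ITER; i++)
        for (int j = 0; j < L3_LEN; j++)
            L3_local_vec[j] += in[t * N_ITER + i];

    // Question 2: the register accumulator is staged into shared memory
    // first, and only then multiplied against the weight matrix.
    for (int j = 0; j < L3_LEN; j++)
        smem_gemm_L3[threadIdx.x * L3_LEN + j] = L3_local_vec[j];
    __syncthreads();

    for (int c = 0; c < W_COLS; c++) {
        float acc = 0.0f;
        for (int j = 0; j < L3_LEN; j++)
            acc += smem_gemm_L3[threadIdx.x * L3_LEN + j] * W[j * W_COLS + c];
        out[t * W_COLS + c] = acc;
    }
}
```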
Replies: 9 comments

-
Hi @Glinttsd, this implementation is deprecated and no longer in our repo (it was abandoned early in development). I'll give a short response here, and Austin will follow up. You'll want to take a look at loop_unroll_tp.cuh in the templates folder for the low-level subkernels, and at loop_unroll_batch.cuh and loop_unroll_conv.cuh for the wrappers that orchestrate the subkernels according to the computation schedule. That said, the approach is still very similar to what you have pasted above.
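Schematically, the division of labor looks something like the hypothetical sketch below (an illustration of the subkernel/wrapper split, not the actual contents of those template files; all names, offsets, and sizes are placeholders):

```cuda
// Hypothetical illustration of the subkernel/wrapper split, not the actual
// template code. Segment sizes and the contraction itself are placeholders.

// Low-level subkernel: one interaction, fully unrolled, specialized at
// compile time on the segment lengths.
template <int L1_LEN, int L2_LEN, int L3_LEN>
__device__ void subkernel_tp(const float* L1, const float* L2, float* L3) {
    #pragma unroll
    for (int i = 0; i < L1_LEN; i++) {
        #pragma unroll
        for (int j = 0; j < L2_LEN; j++) {
            L3[(i + j) % L3_LEN] += L1[i] * L2[j];  // stand-in for the real contraction
        }
    }
}

// Wrapper: one thread per batch element, dispatching one subkernel
// instantiation per interaction in the computation schedule.
__global__ void batch_tp_wrapper(const float* L1, const float* L2,
                                 float* L3, int batch_size) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= batch_size) return;
    // A schedule with two interactions (offsets and sizes are placeholders).
    subkernel_tp<4, 4, 4>(L1 + 8 * b,     L2 + 8 * b,     L3 + 8 * b);
    subkernel_tp<4, 4, 4>(L1 + 8 * b + 4, L2 + 8 * b + 4, L3 + 8 * b + 4);
}
```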
-
Out of curiosity, what is your end goal in "exploring how the TP kernel is implemented"? Is there a particular application you are trying to accelerate that we don't currently support? Or a similar problem?
-
Hi, thank you for your quick response! My team and I are currently focusing on FPGA acceleration of the tensor product computation; that's why I am exploring and learning the TP kernel's implementation. I see that I missed some statements in the paper. Thank you for pointing this out. I will also try the methods that @vbharadwaj-bk mentioned.
-
Very cool! The code will be much more readable once the templates are instantiated. I'll try to post an example of a simple kernel (instantiated, not a template) in a bit.
-
That will be very helpful. Thanks!
-
Linked is the example from our readme, fully instantiated from the templates! It uses the 'uvu' connection mode, so the weight multiplication happens thread-locally. https://gist.github.com/asglover/d6cc3ace0a09a6e338c6e03e9653b17a I also included the code that saves the kernel, so you can play around with different "tensor product problem" specifications and see how they change the kernels.
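To illustrate what thread-local weight multiplication means here, a hypothetical sketch of a uvu-style inner loop (placeholder names, sizes, and layout, not the gist's generated code): each output channel u reads only input channel u, so a thread can keep its weight in a register instead of staging a shared-memory GEMM.

```cuda
// Hypothetical uvu-mode sketch, not the generated kernel from the gist.
// Channel count, irrep length, and data layout are all placeholders.
#define U_CHANNELS 32  // placeholder number of channels
#define IR_LEN      3  // placeholder irrep length (e.g. l = 1)

__global__ void uvu_sketch(const float* x1, const float* x2,
                           const float* w, float* out) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per channel u
    if (u >= U_CHANNELS) return;
    float wu = w[u];  // thread-local weight: no shared-memory GEMM needed
    for (int m = 0; m < IR_LEN; m++)
        out[u * IR_LEN + m] = wu * x1[u * IR_LEN + m] * x2[m];  // stand-in contraction
}
```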
-
Glad to see the conversation here! Since there are no actionable items, shall we convert this to a discussion? You can continue to chat about the specifics of the kernel there as your FPGA work progresses.
-
Yes, please. Thanks for that!
