Hi, thanks for your great work!
I am trying to explore how the TP kernel is implemented. I noticed that a CUDA file named subkernel_per_interaction_multirep.cuh (which is attached here) is generated during execution; I believe that is where the GPU kernel is implemented. Regarding this file, I have two questions about its forward process:

1. Why does the code use #pragma unroll to unroll the loop?
2. L3_local_vec from the attached code seems to be written into the shared memory smem_gemm_L3 and then multiplied by the weight matrix, which seems contrary to the paper's description.

The source code is provided below. Any help would be greatly appreciated! Thank you again for the great work!
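Roughly, the pattern I am asking about looks like this (my own simplified sketch with placeholder dimensions, not the actual generated file):

```cuda
// Simplified sketch of the pattern in question, not the actual generated
// subkernel_per_interaction_multirep.cuh; all sizes here are placeholders.
#define L3_LEN 16  // hypothetical per-thread output segment length
#define N_ITER  8  // hypothetical number of interaction terms
#define W_COLS 16  // hypothetical weight-matrix column count
#define BLOCK  64  // threads per block

__global__ void tp_forward_sketch(const float* in, const float* W, float* out) {
    __shared__ float smem_gemm_L3[BLOCK * L3_LEN];
    int t = blockIdx.x * BLOCK + threadIdx.x;

    // Per-thread accumulator kept in registers.
    float L3_local_vec[L3_LEN] = {0.0f};

    // Question 1: the interaction loop is unrolled with #pragma unroll.
    #pragma unroll
    for (int i = 0; i < N_ITER; i++)
        for (int j = 0; j < L3_LEN; j++)
            L3_local_vec[j] += in[t * N_ITER + i];

    // Question 2: the register accumulator is staged into shared memory
    // first, and only then multiplied against the weight matrix.
    for (int j = 0; j < L3_LEN; j++)
        smem_gemm_L3[threadIdx.x * L3_LEN + j] = L3_local_vec[j];
    __syncthreads();

    for (int c = 0; c < W_COLS; c++) {
        float acc = 0.0f;
        for (int j = 0; j < L3_LEN; j++)
            acc += smem_gemm_L3[threadIdx.x * L3_LEN + j] * W[j * W_COLS + c];
        out[t * W_COLS + c] = acc;
    }
}
```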
Replies: 9 comments

-
Hi @Glinttsd, this implementation is deprecated and no longer in our repo (it was abandoned early in development). I'll give a short response here, and Austin will follow up. You'll want to take a look at loop_unroll_tp.cuh in the templates folder for the low-level subkernels, and at loop_unroll_batch.cuh and loop_unroll_conv.cuh for the wrappers that orchestrate the subkernels according to the computation schedule. That said, the approach is still very similar to what you have pasted above.
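Schematically, the division of labor looks something like the hypothetical sketch below (an illustration of the subkernel/wrapper split, not the actual contents of those template files; all names, offsets, and sizes are placeholders):

```cuda
// Hypothetical illustration of the subkernel/wrapper split, not the actual
// template code. Segment sizes and the contraction itself are placeholders.

// Low-level subkernel: one interaction, fully unrolled, specialized at
// compile time on the segment lengths.
template <int L1_LEN, int L2_LEN, int L3_LEN>
__device__ void subkernel_tp(const float* L1, const float* L2, float* L3) {
    #pragma unroll
    for (int i = 0; i < L1_LEN; i++) {
        #pragma unroll
        for (int j = 0; j < L2_LEN; j++) {
            L3[(i + j) % L3_LEN] += L1[i] * L2[j];  // stand-in for the real contraction
        }
    }
}

// Wrapper: one thread per batch element, dispatching one subkernel
// instantiation per interaction in the computation schedule.
__global__ void batch_tp_wrapper(const float* L1, const float* L2,
                                 float* L3, int batch_size) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= batch_size) return;
    // A schedule with two interactions (offsets and sizes are placeholders).
    subkernel_tp<4, 4, 4>(L1 + 8 * b,     L2 + 8 * b,     L3 + 8 * b);
    subkernel_tp<4, 4, 4>(L1 + 8 * b + 4, L2 + 8 * b + 4, L3 + 8 * b + 4);
}
```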
-
Out of curiosity, what is your end goal in "exploring how the TP kernel is implemented"? Is there a particular application you are trying to accelerate that we don't currently support? Or a similar problem?
-
Hi, thank you for your quick response! My team and I are currently focusing on FPGA acceleration of the tensor product computation; that's why I am exploring and learning the TP kernel's implementation. I see that I missed some statements in the paper. Thank you for pointing this out. I will also try the methods that @vbharadwaj-bk mentioned.
-
Very cool! The code will be much more readable once the templates are instantiated. I'll try to post an example of a simple kernel (instantiated, not a template) in a bit.
-
That will be very helpful. Thanks!
-
Linked is the example from our readme, fully instantiated from the templates! It uses the 'uvu' connection mode, so the weight multiplication happens thread-locally. https://gist.github.com/asglover/d6cc3ace0a09a6e338c6e03e9653b17a I also included the code that saves the kernel, so you can play around with different "tensor product problem" specifications and see how they change the kernels.
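To illustrate what thread-local weight multiplication means here, a hypothetical sketch of a uvu-style inner loop (placeholder names, sizes, and layout, not the gist's generated code): each output channel u reads only input channel u, so a thread can keep its weight in a register instead of staging a shared-memory GEMM.

```cuda
// Hypothetical uvu-mode sketch, not the generated kernel from the gist.
// Channel count, irrep length, and data layout are all placeholders.
#define U_CHANNELS 32  // placeholder number of channels
#define IR_LEN      3  // placeholder irrep length (e.g. l = 1)

__global__ void uvu_sketch(const float* x1, const float* x2,
                           const float* w, float* out) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per channel u
    if (u >= U_CHANNELS) return;
    float wu = w[u];  // thread-local weight: no shared-memory GEMM needed
    for (int m = 0; m < IR_LEN; m++)
        out[u * IR_LEN + m] = wu * x1[u * IR_LEN + m] * x2[m];  // stand-in contraction
}
```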
-
Glad to see the conversation here! Since there are no actionable items, shall we convert this to a discussion? You can continue to chat about the specifics of the kernel there as your FPGA work progresses.
-
Yes, please. Thanks for that!
