[Web] Pre-allocate TypedArray views for pod args in WebGPU dispatch#18961
gnguralnick wants to merge 1 commit into apache:main
Conversation
Hoist Int32Array/Uint32Array/Float32Array allocation out of the per-dispatch submitShader closure into the per-shader scope. Since podArgIndices.length is fixed for each shader, the views can be safely reused: every slot (0..podArgIndices.length) is written on each dispatch before writeBuffer copies the data, so no stale values can leak between invocations. This avoids 3 heap allocations + 1 ArrayBuffer per GPU kernel dispatch, which adds up in workloads with many small dispatches (e.g. LLM token generation).
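The hoisting described above can be sketched as follows. This is an illustrative reconstruction, not the actual runtime code: names like `podArgIndices`, `submitShader`, and the packed grid-dim slot mirror the PR description, while the surrounding WebGPU machinery (`device.queue.writeBuffer`) is only indicated in a comment.

```typescript
// Hypothetical per-shader setup: podArgIndices is fixed for the shader's
// lifetime, so the backing buffer and its three views are allocated once.
const podArgIndices = [0, 1, 2]; // illustrative; fixed per shader

const maxPodArgs = podArgIndices.length + 1; // +1 for the packed grid dim
const podArgsArrayBuffer = new ArrayBuffer(maxPodArgs * 4);
const i32View = new Int32Array(podArgsArrayBuffer);
const u32View = new Uint32Array(podArgsArrayBuffer);
const f32View = new Float32Array(podArgsArrayBuffer);

// Per-dispatch closure: every slot 0..podArgIndices.length is overwritten
// before the data is copied out, so no stale values survive between calls.
function submitShader(args: number[], gridDimX: number): ArrayBuffer {
  for (let i = 0; i < podArgIndices.length; ++i) {
    i32View[i] = args[i]; // real code picks the view by dtype
  }
  u32View[podArgIndices.length] = gridDimX;
  // In the real runtime this is roughly:
  //   device.queue.writeBuffer(uniformBuf, 0, podArgsArrayBuffer, 0, byteLen);
  return podArgsArrayBuffer.slice(0); // snapshot, for illustration only
}
```

Because all three views alias one `ArrayBuffer`, a dispatch costs zero heap allocations instead of three typed arrays plus one buffer.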
Code Review
This pull request optimizes WebGPU shader dispatches by pre-allocating and reusing typed array views for POD arguments, effectively reducing per-dispatch memory allocation overhead. The review feedback suggests further refining this by using Int32Array.BYTES_PER_ELEMENT instead of magic numbers and pre-calculating argument types to avoid string comparison overhead within the hot dispatch loop.
const maxPodArgs = podArgIndices.length + 1; // +1 for packGridDimX
const podArgsArrayBuffer = new ArrayBuffer(maxPodArgs * 4);
const i32ViewCached = new Int32Array(podArgsArrayBuffer);
const u32ViewCached = new Uint32Array(podArgsArrayBuffer);
const f32ViewCached = new Float32Array(podArgsArrayBuffer);
To further reduce per-dispatch overhead and avoid magic numbers, consider hoisting the total byte size calculation and using Int32Array.BYTES_PER_ELEMENT. This constant can then be reused in submitShader for the uniform pool request and the bind group entry size. Also, update the comment to match the variable name packDimX used in the implementation.
Suggested change:
- const maxPodArgs = podArgIndices.length + 1; // +1 for packGridDimX
- const podArgsArrayBuffer = new ArrayBuffer(maxPodArgs * 4);
- const i32ViewCached = new Int32Array(podArgsArrayBuffer);
- const u32ViewCached = new Uint32Array(podArgsArrayBuffer);
- const f32ViewCached = new Float32Array(podArgsArrayBuffer);
+ const maxPodArgs = podArgIndices.length + 1; // +1 for packDimX
+ const podArgBytes = maxPodArgs * Int32Array.BYTES_PER_ELEMENT;
+ const podArgsArrayBuffer = new ArrayBuffer(podArgBytes);
+ const i32ViewCached = new Int32Array(podArgsArrayBuffer);
+ const u32ViewCached = new Uint32Array(podArgsArrayBuffer);
+ const f32ViewCached = new Float32Array(podArgsArrayBuffer);
  i32ViewCached[i] = value;
} else if (dtype.startsWith("uint")) {
- u32View[i] = value;
+ u32ViewCached[i] = value;
} else if (dtype.startsWith("float")) {
- f32View[i] = value;
+ f32ViewCached[i] = value;
The dtype.startsWith string operations are executed for every POD argument on every dispatch. Since the argument types are fixed for each shader, consider pre-calculating an array of type indicators (e.g., an enum or numeric constants) in the createShadeInternal scope. This would allow replacing the string operations with a faster numeric check in the submitShader loop, which is beneficial for workloads with many small dispatches.
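The reviewer's suggestion could look roughly like the sketch below. All names here (`TAG_I32`, `tagOf`, `writePodArgs`, the sample `dtypes` array) are illustrative assumptions, not identifiers from the TVM runtime; the point is only that the `startsWith` string checks run once per shader, and the per-dispatch loop branches on a cached numeric tag.

```typescript
// Numeric type tags, computed once per shader instead of per dispatch.
const TAG_I32 = 0;
const TAG_U32 = 1;
const TAG_F32 = 2;

// String inspection happens only here, in the per-shader setup path.
function tagOf(dtype: string): number {
  if (dtype.startsWith("uint")) return TAG_U32;
  if (dtype.startsWith("float")) return TAG_F32;
  return TAG_I32; // "int*" and anything signed-integer-like
}

// Illustrative per-shader state: fixed dtypes and pre-allocated views
// over one shared backing buffer.
const dtypes = ["int32", "uint32", "float32"];
const argTags = dtypes.map(tagOf);

const buf = new ArrayBuffer(dtypes.length * Int32Array.BYTES_PER_ELEMENT);
const i32ViewCached = new Int32Array(buf);
const u32ViewCached = new Uint32Array(buf);
const f32ViewCached = new Float32Array(buf);

// Hot per-dispatch loop: a numeric switch replaces three string checks
// per argument.
function writePodArgs(values: number[]): void {
  for (let i = 0; i < values.length; ++i) {
    switch (argTags[i]) {
      case TAG_I32: i32ViewCached[i] = values[i]; break;
      case TAG_U32: u32ViewCached[i] = values[i]; break;
      case TAG_F32: f32ViewCached[i] = values[i]; break;
    }
  }
}
```

Since the tag array is computed alongside the cached views, it adds no per-dispatch allocations of its own.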
Summary
Hoists Int32Array/Uint32Array/Float32Array allocation out of the per-dispatch submitShader closure into the per-shader createShaderFunc scope, eliminating 3 typed array allocations + 1 ArrayBuffer per GPU kernel dispatch. podArgIndices.length is fixed per shader, so the cached views have the correct size for every invocation. Every slot 0..podArgIndices.length is unconditionally written before writeBuffer copies the data out, so no stale values can leak between dispatches.
Motivation
In workloads with many small dispatches (e.g. LLM token generation), the per-dispatch typed array allocations become a measurable source of GC pressure. Pre-allocating and reusing the views avoids this overhead.
Test plan
npm run lint passes in web/