[Web] Pre-allocate TypedArray views for pod args in WebGPU dispatch #18961

Open
gnguralnick wants to merge 1 commit into apache:main from gnguralnick:webgpu-pod-args-prealloc
@gnguralnick

Summary

  • Hoists Int32Array/Uint32Array/Float32Array allocation out of the per-dispatch submitShader closure into the per-shader createShaderFunc scope, eliminating 3 typed array allocations + 1 ArrayBuffer per GPU kernel dispatch.
  • podArgIndices.length is fixed per shader, so the cached views have the correct size for every invocation. Every slot 0..podArgIndices.length is unconditionally written before writeBuffer copies the data out, so no stale values can leak between dispatches.
  • Builds on the batched dispatch architecture from "Batched GPU dispatch and object caching for WebGPU runtime" (#18871): the uniform buffer pool already gives each dispatch its own GPU-side buffer, so reusing the CPU-side staging array is safe.
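
The hoisting pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `makePodArgPacker`, `PodArg`, and the returned `pack` closure are hypothetical names, and the real `submitShader` also binds buffers and calls `writeBuffer`, which is omitted here.

```typescript
// Hypothetical sketch: one ArrayBuffer and three aliased views are created
// once per shader and then reused by every dispatch.
type PodArg = { dtype: string; value: number };

function makePodArgPacker(podArgCount: number) {
  // +1 slot for the packed grid dimension, as in the PR.
  const slots = podArgCount + 1;
  const buffer = new ArrayBuffer(slots * Int32Array.BYTES_PER_ELEMENT);
  const i32View = new Int32Array(buffer);
  const u32View = new Uint32Array(buffer);
  const f32View = new Float32Array(buffer);

  // Per-dispatch closure: allocates nothing and overwrites every slot.
  return function pack(args: PodArg[], packDimX: number): Uint32Array {
    if (args.length !== podArgCount) {
      throw new Error(`expected ${podArgCount} pod args, got ${args.length}`);
    }
    for (let i = 0; i < podArgCount; ++i) {
      const { dtype, value } = args[i];
      if (dtype.startsWith("int")) {
        i32View[i] = value;
      } else if (dtype.startsWith("uint")) {
        u32View[i] = value;
      } else if (dtype.startsWith("float")) {
        f32View[i] = value;
      } else {
        throw new Error(`unsupported pod arg dtype: ${dtype}`);
      }
    }
    u32View[podArgCount] = packDimX;
    return u32View; // the same view object on every call
  };
}
```

Because `pack` hands back the same view each time, the caller has to copy the contents out (in the PR, via `writeBuffer` into a per-dispatch uniform buffer) before the next call overwrites them.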

Motivation

In workloads with many small dispatches (e.g. LLM token generation), the per-dispatch typed array allocations become a measurable source of GC pressure. Pre-allocating and reusing the views avoids this overhead.

Test plan

  • Verify npm run lint passes in web/
  • Run WebGPU model inference (e.g. via MLC-LLM web demo) and confirm correct output
  • Profile dispatch-heavy workload to confirm reduced allocation rate

Hoist Int32Array/Uint32Array/Float32Array allocation out of the
per-dispatch submitShader closure into the per-shader scope. Since
podArgIndices.length is fixed for each shader, the views can be
safely reused: every slot (0..podArgIndices.length) is written on
each dispatch before writeBuffer copies the data, so no stale
values can leak between invocations.

This avoids 3 heap allocations + 1 ArrayBuffer per GPU kernel
dispatch, which adds up in workloads with many small dispatches
(e.g. LLM token generation).

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request optimizes WebGPU shader dispatches by pre-allocating and reusing typed array views for POD arguments, effectively reducing per-dispatch memory allocation overhead. The review feedback suggests further refining this by using Int32Array.BYTES_PER_ELEMENT instead of magic numbers and pre-calculating argument types to avoid string comparison overhead within the hot dispatch loop.

Comment on lines +701 to +705
const maxPodArgs = podArgIndices.length + 1; // +1 for packGridDimX
const podArgsArrayBuffer = new ArrayBuffer(maxPodArgs * 4);
const i32ViewCached = new Int32Array(podArgsArrayBuffer);
const u32ViewCached = new Uint32Array(podArgsArrayBuffer);
const f32ViewCached = new Float32Array(podArgsArrayBuffer);

medium

To further reduce per-dispatch overhead and avoid magic numbers, consider hoisting the total byte size calculation and using Int32Array.BYTES_PER_ELEMENT. This constant can then be reused in submitShader for the uniform pool request and the bind group entry size. The suggestion also updates the comment to match the variable name packDimX used in the implementation.

Suggested change
- const maxPodArgs = podArgIndices.length + 1; // +1 for packGridDimX
- const podArgsArrayBuffer = new ArrayBuffer(maxPodArgs * 4);
- const i32ViewCached = new Int32Array(podArgsArrayBuffer);
- const u32ViewCached = new Uint32Array(podArgsArrayBuffer);
- const f32ViewCached = new Float32Array(podArgsArrayBuffer);
+ const maxPodArgs = podArgIndices.length + 1; // +1 for packDimX
+ const podArgBytes = maxPodArgs * Int32Array.BYTES_PER_ELEMENT;
+ const podArgsArrayBuffer = new ArrayBuffer(podArgBytes);
+ const i32ViewCached = new Int32Array(podArgsArrayBuffer);
+ const u32ViewCached = new Uint32Array(podArgsArrayBuffer);
+ const f32ViewCached = new Float32Array(podArgsArrayBuffer);
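
As a rough sketch of the idea, the byte size can be computed once next to the views and carried alongside them, so later code never repeats the `* 4`. The helper name `podArgLayout` and its return shape are assumptions made for illustration and do not appear in the PR.

```typescript
// Hypothetical helper: compute the pod-arg layout once per shader.
function podArgLayout(podArgCount: number) {
  const maxPodArgs = podArgCount + 1; // +1 for packDimX
  const podArgBytes = maxPodArgs * Int32Array.BYTES_PER_ELEMENT;
  const buffer = new ArrayBuffer(podArgBytes);
  return {
    maxPodArgs,
    podArgBytes, // reusable wherever a byte size is needed
    i32View: new Int32Array(buffer),
    u32View: new Uint32Array(buffer),
    f32View: new Float32Array(buffer),
  };
}

// Int32Array, Uint32Array, and Float32Array all use 4-byte elements,
// so a single constant covers the three views.
const layout = podArgLayout(3);
console.log(layout.podArgBytes, layout.u32View.byteLength); // prints "16 16"
```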

Comment on lines +774 to +778
+ i32ViewCached[i] = value;
  } else if (dtype.startsWith("uint")) {
- u32View[i] = value;
+ u32ViewCached[i] = value;
  } else if (dtype.startsWith("float")) {
- f32View[i] = value;
+ f32ViewCached[i] = value;

medium

The dtype.startsWith string operations are executed for every POD argument on every dispatch. Since the argument types are fixed for each shader, consider pre-calculating an array of type indicators (e.g., an enum or numeric constants) in the createShadeInternal scope. This would allow replacing the string operations with a faster numeric check in the submitShader loop, which is beneficial for workloads with many small dispatches.
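
One way this suggestion could look is sketched below. The constants `POD_INT`, `POD_UINT`, `POD_FLOAT` and the functions `classifyPodArgs` and `writePodArgs` are hypothetical names, not part of the TVM web runtime, and whether the change is measurably faster would still need profiling.

```typescript
// Hypothetical numeric type codes, resolved once per shader.
const POD_INT = 0;
const POD_UINT = 1;
const POD_FLOAT = 2;

// Runs once per shader: map each dtype string to a numeric code.
function classifyPodArgs(dtypes: string[]): Uint8Array {
  const kinds = new Uint8Array(dtypes.length);
  dtypes.forEach((dtype, i) => {
    if (dtype.startsWith("int")) kinds[i] = POD_INT;
    else if (dtype.startsWith("uint")) kinds[i] = POD_UINT;
    else if (dtype.startsWith("float")) kinds[i] = POD_FLOAT;
    else throw new Error(`unsupported pod arg dtype: ${dtype}`);
  });
  return kinds;
}

// Runs once per dispatch: numeric comparisons only, no string work.
function writePodArgs(
  kinds: Uint8Array,
  values: number[],
  i32: Int32Array,
  u32: Uint32Array,
  f32: Float32Array,
): void {
  for (let i = 0; i < kinds.length; ++i) {
    switch (kinds[i]) {
      case POD_INT:
        i32[i] = values[i];
        break;
      case POD_UINT:
        u32[i] = values[i];
        break;
      default: // POD_FLOAT
        f32[i] = values[i];
        break;
    }
  }
}
```

The three views passed to `writePodArgs` are expected to alias the same `ArrayBuffer`, as in the cached views above, so each slot is written through whichever view matches its type.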
