[Web] Pre-allocate TypedArray views for pod args in WebGPU dispatch#18961
gnguralnick wants to merge 1 commit into apache:main
Conversation
Hoist Int32Array/Uint32Array/Float32Array allocation out of the per-dispatch submitShader closure into the per-shader scope. Since podArgIndices.length is fixed for each shader, the views can be safely reused: every slot (0..podArgIndices.length) is written on each dispatch before writeBuffer copies the data, so no stale values can leak between invocations. This avoids 3 heap allocations + 1 ArrayBuffer per GPU kernel dispatch, which adds up in workloads with many small dispatches (e.g. LLM token generation).
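The hoisting described above can be sketched as follows. This is an illustrative reconstruction, not the actual runtime code: names like `podArgIndices`, `submitShader`, and the packed grid-dim slot mirror the PR description, while the surrounding WebGPU machinery (`device.queue.writeBuffer`) is only indicated in a comment.

```typescript
// Hypothetical per-shader setup: podArgIndices is fixed for the shader's
// lifetime, so the backing buffer and its three views are allocated once.
const podArgIndices = [0, 1, 2]; // illustrative; fixed per shader

const maxPodArgs = podArgIndices.length + 1; // +1 for the packed grid dim
const podArgsArrayBuffer = new ArrayBuffer(maxPodArgs * 4);
const i32View = new Int32Array(podArgsArrayBuffer);
const u32View = new Uint32Array(podArgsArrayBuffer);
const f32View = new Float32Array(podArgsArrayBuffer);

// Per-dispatch closure: every slot 0..podArgIndices.length is overwritten
// before the data is copied out, so no stale values survive between calls.
function submitShader(args: number[], gridDimX: number): ArrayBuffer {
  for (let i = 0; i < podArgIndices.length; ++i) {
    i32View[i] = args[i]; // real code picks the view by dtype
  }
  u32View[podArgIndices.length] = gridDimX;
  // In the real runtime this is roughly:
  //   device.queue.writeBuffer(uniformBuf, 0, podArgsArrayBuffer, 0, byteLen);
  return podArgsArrayBuffer.slice(0); // snapshot, for illustration only
}
```

Because all three views alias one `ArrayBuffer`, a dispatch costs zero heap allocations instead of three typed arrays plus one buffer.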
Code Review
This pull request optimizes WebGPU shader dispatches by pre-allocating and reusing typed array views for POD arguments, effectively reducing per-dispatch memory allocation overhead. The review feedback suggests further refining this by using Int32Array.BYTES_PER_ELEMENT instead of magic numbers and pre-calculating argument types to avoid string comparison overhead within the hot dispatch loop.
const maxPodArgs = podArgIndices.length + 1; // +1 for packGridDimX
const podArgsArrayBuffer = new ArrayBuffer(maxPodArgs * 4);
const i32ViewCached = new Int32Array(podArgsArrayBuffer);
const u32ViewCached = new Uint32Array(podArgsArrayBuffer);
const f32ViewCached = new Float32Array(podArgsArrayBuffer);
To further reduce per-dispatch overhead and avoid magic numbers, consider hoisting the total byte size calculation and using Int32Array.BYTES_PER_ELEMENT. This constant can then be reused in submitShader for the uniform pool request and the bind group entry size. Also, update the comment to match the variable name packDimX used in the implementation.
Suggested change:
- const maxPodArgs = podArgIndices.length + 1; // +1 for packGridDimX
- const podArgsArrayBuffer = new ArrayBuffer(maxPodArgs * 4);
- const i32ViewCached = new Int32Array(podArgsArrayBuffer);
- const u32ViewCached = new Uint32Array(podArgsArrayBuffer);
- const f32ViewCached = new Float32Array(podArgsArrayBuffer);
+ const maxPodArgs = podArgIndices.length + 1; // +1 for packDimX
+ const podArgBytes = maxPodArgs * Int32Array.BYTES_PER_ELEMENT;
+ const podArgsArrayBuffer = new ArrayBuffer(podArgBytes);
+ const i32ViewCached = new Int32Array(podArgsArrayBuffer);
+ const u32ViewCached = new Uint32Array(podArgsArrayBuffer);
+ const f32ViewCached = new Float32Array(podArgsArrayBuffer);
  i32ViewCached[i] = value;
} else if (dtype.startsWith("uint")) {
- u32View[i] = value;
+ u32ViewCached[i] = value;
} else if (dtype.startsWith("float")) {
- f32View[i] = value;
+ f32ViewCached[i] = value;
The dtype.startsWith string operations are executed for every POD argument on every dispatch. Since the argument types are fixed for each shader, consider pre-calculating an array of type indicators (e.g., an enum or numeric constants) in the createShadeInternal scope. This would allow replacing the string operations with a faster numeric check in the submitShader loop, which is beneficial for workloads with many small dispatches.
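The reviewer's suggestion could look roughly like the sketch below. All names here (`TAG_I32`, `tagOf`, `writePodArgs`, the sample `dtypes` array) are illustrative assumptions, not identifiers from the TVM runtime; the point is only that the `startsWith` string checks run once per shader, and the per-dispatch loop branches on a cached numeric tag.

```typescript
// Numeric type tags, computed once per shader instead of per dispatch.
const TAG_I32 = 0;
const TAG_U32 = 1;
const TAG_F32 = 2;

// String inspection happens only here, in the per-shader setup path.
function tagOf(dtype: string): number {
  if (dtype.startsWith("uint")) return TAG_U32;
  if (dtype.startsWith("float")) return TAG_F32;
  return TAG_I32; // "int*" and anything signed-integer-like
}

// Illustrative per-shader state: fixed dtypes and pre-allocated views
// over one shared backing buffer.
const dtypes = ["int32", "uint32", "float32"];
const argTags = dtypes.map(tagOf);

const buf = new ArrayBuffer(dtypes.length * Int32Array.BYTES_PER_ELEMENT);
const i32ViewCached = new Int32Array(buf);
const u32ViewCached = new Uint32Array(buf);
const f32ViewCached = new Float32Array(buf);

// Hot per-dispatch loop: a numeric switch replaces three string checks
// per argument.
function writePodArgs(values: number[]): void {
  for (let i = 0; i < values.length; ++i) {
    switch (argTags[i]) {
      case TAG_I32: i32ViewCached[i] = values[i]; break;
      case TAG_U32: u32ViewCached[i] = values[i]; break;
      case TAG_F32: f32ViewCached[i] = values[i]; break;
    }
  }
}
```

Since the tag array is computed alongside the cached views, it adds no per-dispatch allocations of its own.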
Summary
Hoists Int32Array/Uint32Array/Float32Array allocation out of the per-dispatch submitShader closure into the per-shader createShaderFunc scope, eliminating 3 typed array allocations + 1 ArrayBuffer per GPU kernel dispatch. podArgIndices.length is fixed per shader, so the cached views have the correct size for every invocation. Every slot 0..podArgIndices.length is unconditionally written before writeBuffer copies the data out, so no stale values can leak between dispatches.
Motivation
In workloads with many small dispatches (e.g. LLM token generation), the per-dispatch typed array allocations become a measurable source of GC pressure. Pre-allocating and reusing the views avoids this overhead.
Test plan
npm run lint passes in web/