Profiling render_kernel on an RTX 4060 Laptop GPU (sm_89) with Nsight Compute reveals two independent performance bottlenecks. Both have been confirmed at the hardware instruction level via cuobjdump. This issue tracks the investigation and fixes.
Environment
- GPU: NVIDIA GeForce RTX 4060 Laptop GPU (sm_89)
- CUDA: 12.4
- Profiling tool: Nsight Compute 2024.1.1
- Kernel: render_kernel — path tracer using Cook-Torrance BRDF, BVH traversal, NEE
Bottleneck 1 — Scalar Global Memory Loads (Priority)
Evidence
cuobjdump -sass on libpbr_cuda.a shows every global memory load in the kernel is a 32-bit scalar LDG.E:
```
LDG.E R6, [R4.64]
LDG.E R8, [R4.64+0x4]
LDG.E R9, [R4.64+0x8]
```
Reading a single Vec3 (e.g. MaterialData::baseColor) therefore generates three separate global memory instructions. With a 16-byte-aligned float4 layout, the compiler can instead emit a single LDG.E.128.
Nsight Compute confirms the impact:
| Metric | Value |
| --- | --- |
| L1TEX Global Load Access Pattern | 4.0 of 32 bytes utilized per sector |
| Uncoalesced Global Accesses | 651,005,289 excessive sectors (36% of total) |
| Est. Speedup | 29.21% |
Root Cause
Both traceRayBVH and traceShadowRayBVH declare int stack[64] — 256 bytes per thread. When the compiler inlines both functions into pathTracer_CookTorrance, both stacks coexist in the register file simultaneously (512 bytes per thread). The hardware cannot fit enough warps per SM, destroying the GPU's ability to hide memory latency by switching warps. Excess stack spills to local memory with strided, uncoalesced access patterns.
The actual required stack depth for median-split BVH with MAX_LEAF_SIZE = 4:
depth = ceil(log2(N / 4))
1,000 triangles -> depth 8
10,000 triangles -> depth 12
100,000 triangles -> depth 15
stack[64] is never needed.
Fix
core_renderer.hpp — reduce stack size and prevent inlining:
```cuda
// Prevent the compiler from merging both stacks into one activation frame
fgt_device_gpu __noinline__ bool traceRayBVH(...) {
    int stack[32]; // was 64
    ...
}

fgt_device_gpu __noinline__ bool traceShadowRayBVH(...) {
    int stack[32]; // was 64
    ...
}
```
The __noinline__ attribute is the more important change — it ensures the two stacks never coexist in registers simultaneously regardless of stack size.
To determine the exact required depth, instrument BVHBuilder::buildRecursive:
```cpp
int buildRecursive(..., int depth = 0) {
    m_maxDepth = std::max(m_maxDepth, depth);
    ...
    buildRecursive(..., depth + 1);
}
```
Print m_maxDepth after build and set GPU stack to that value plus a small buffer (e.g. +4).
Additional Fixes (Low Effort)
F_Schlick — replace powf with manual multiply
```cuda
// Before — full math library call
float pow5 = powf(1.0f - VoH, 5.0f);

// After — three multiplications
float x = 1.0f - VoH;
float x2 = x * x;
float pow5 = x2 * x2 * x;
```
sampleHemisphere — eliminate acosf
```cuda
// Before
float theta = acosf(sqrtf(1.0f - u));
float xs = sinf(theta) * cosf(phi);
float ys = sinf(theta) * sinf(phi);
float zs = cosf(theta);

// After — identical distribution, no acosf
float zs = sqrtf(1.0f - u);
float r = sqrtf(u);
float xs = r * cosf(phi);
float ys = r * sinf(phi);
```
AABB::center() — arithmetic bug
```cuda
// Before — divides by 0.5, which multiplies by 2
mid = mid / 0.5;

// After
mid = mid * 0.5f;
```
SAH partition — uncomment
BVHBuilder::partition has a complete SAH implementation commented out, replaced by median split. SAH produces a shallower, more balanced tree which directly reduces required BVH stack depth and reduces traversal divergence. Uncomment it.