Profiling render_kernel on an RTX 4060 Laptop GPU (sm_89) with Nsight Compute reveals two independent performance bottlenecks. Both have been confirmed at the hardware instruction level via cuobjdump. This issue tracks the investigation and fixes.
Environment
- GPU: NVIDIA GeForce RTX 4060 Laptop GPU (sm_89)
- CUDA: 12.4
- Profiling tool: Nsight Compute 2024.1.1
- Kernel: render_kernel — path tracer using Cook-Torrance BRDF, BVH traversal, NEE
Bottleneck 1 — Scalar Global Memory Loads (Priority)
Evidence
cuobjdump -sass on libpbr_cuda.a shows every global memory load in the kernel is a 32-bit scalar LDG.E:
```
LDG.E R6, [R4.64]
LDG.E R8, [R4.64+0x4]
LDG.E R9, [R4.64+0x8]
```
Reading a single Vec3 (e.g. MaterialData::baseColor) therefore generates three separate global memory instructions. With a 16-byte-aligned float4 layout, the compiler can instead emit a single LDG.E.128.
Nsight Compute confirms the impact:
| Metric | Value |
| --- | --- |
| L1TEX Global Load Access Pattern | 4.0 of 32 bytes utilized per sector |
| Uncoalesced Global Accesses | 651,005,289 excessive sectors (36% of total) |
| Est. Speedup | 29.21% |
Root Cause
Both traceRayBVH and traceShadowRayBVH declare int stack[64] — 256 bytes per thread. When the compiler inlines both functions into pathTracer_CookTorrance, both stacks coexist in the register file simultaneously (512 bytes per thread). The hardware cannot fit enough warps per SM, destroying the GPU's ability to hide memory latency by switching warps. Excess stack spills to local memory with strided, uncoalesced access patterns.
The actual required stack depth for median-split BVH with MAX_LEAF_SIZE = 4:
depth = ceil(log2(N / 4))
1,000 triangles -> depth 8
10,000 triangles -> depth 12
100,000 triangles -> depth 15
stack[64] is never needed.
Fix
core_renderer.hpp — reduce stack size and prevent inlining:
```cuda
// Prevent the compiler from merging both stacks into one activation frame
fgt_device_gpu __noinline__ bool traceRayBVH(...) {
    int stack[32]; // was 64
    ...
}

fgt_device_gpu __noinline__ bool traceShadowRayBVH(...) {
    int stack[32]; // was 64
    ...
}
```
The __noinline__ attribute is the more important change — it ensures the two stacks never coexist in registers simultaneously regardless of stack size.
To determine the exact required depth, instrument BVHBuilder::buildRecursive:
```cpp
int buildRecursive(..., int depth = 0) {
    m_maxDepth = std::max(m_maxDepth, depth);
    ...
    buildRecursive(..., depth + 1);
}
```
Print m_maxDepth after build and set GPU stack to that value plus a small buffer (e.g. +4).
Additional Fixes (Low Effort)
F_Schlick — replace powf with manual multiply
```cuda
// Before — full math library call
float pow5 = powf(1.0f - VoH, 5.0f);

// After — three multiplications
float x = 1.0f - VoH;
float x2 = x * x;
float pow5 = x2 * x2 * x;
```
sampleHemisphere — eliminate acosf
```cuda
// Before
float theta = acosf(sqrtf(1.0f - u));
float xs = sinf(theta) * cosf(phi);
float ys = sinf(theta) * sinf(phi);
float zs = cosf(theta);

// After — identical distribution, no acosf
float zs = sqrtf(1.0f - u);
float r = sqrtf(u);
float xs = r * cosf(phi);
float ys = r * sinf(phi);
```
AABB::center() — arithmetic bug
```cuda
// Before — divides by 0.5, which multiplies by 2
mid = mid / 0.5;

// After
mid = mid * 0.5f;
```
SAH partition — uncomment
BVHBuilder::partition has a complete SAH implementation commented out, replaced by median split. SAH produces a shallower, more balanced tree which directly reduces required BVH stack depth and reduces traversal divergence. Uncomment it.