@@ -103,18 +103,29 @@ <h2>Introduction</h2>
  <td class="spec-value">Vulkan</td>
  <td class="spec-value">RTX acceleration</td>
  <td class="spec-value">Procedural sphere tracing + triangle modes</td>
- <td class="spec-value fps-highlight">~33ms</td>
+ <td class="spec-value fps-highlight">~33 ms</td>
  <td class="spec-value fps-highlight">~30 FPS</td>
- <td class="spec-value">Cornell Box, Lucy, etc. - complex scenes</td>
+ <td class="spec-value">
+ <ul>
+ <li>No acceleration structure compaction</li>
+ <li>Using procedural AABBs per sphere</li>
+ <li>Using ray tracing pipeline (no inline ray tracing)</li>
+
+ </ul>
+ </td>
  </tr>
  <tr>
  <td class="spec-value">Mine</td>
  <td class="spec-value">CUDA</td>
  <td class="spec-value">No hardware RT cores</td>
  <td class="spec-value">Procedural spheres only</td>
- <td class="spec-value fps-highlight">~8ms</td>
+ <td class="spec-value fps-highlight">~8 ms</td>
  <td class="spec-value fps-highlight">105 FPS</td>
- <td class="spec-value">same resolution and settings</td>
+ <td class="spec-value">
+ <ul>
+ <li>Same resolution and settings</li>
+ </ul>
+ </td>
  </tr>
  </tbody>
  </table>
@@ -907,9 +918,9 @@ <h2>Optimization #6 — Structure of Arrays (SoA)</h2>
  <tbody>
  <tr>
  <td>Frame Time</td>
- <td>140ms</td>
- <td>65ms</td>
- <td class="improvement">-75ms (-53.6%)</td>
+ <td>140 ms</td>
+ <td>65 ms</td>
+ <td class="improvement">-75 ms (-53.6%)</td>
  </tr>
  <tr>
  <td>L1 Cache hit rates</td>
@@ -1236,9 +1247,9 @@ <h3>Global Memory Performance</h3>
  </tr>
  <tr>
  <td>Frame Time</td>
- <td>~10ms</td>
- <td>~8ms</td>
- <td class="improvement">~ -2ms ~(-20%)</td>
+ <td>~10 ms</td>
+ <td>~8 ms</td>
+ <td class="improvement">~ -2 ms ~(-20%)</td>
  </tr>
  </tbody>
  </table>
@@ -1497,7 +1508,8 @@ <h3>Case Study: Ray-AABB Intersection</h3>
  done per
  frame. Switching from the generic <code>std::fma</code> and <code>std::max</code> to the intrinsic
  float versions
- led to a frame time drop from <strong>12ms</strong> to <strong>9ms</strong>, and reduced instruction
+ led to a frame time drop from <strong>12 ms</strong> to <strong>9 ms</strong>, and reduced
+ instruction
  count.
  </p>

@@ -1550,8 +1562,8 @@ <h3>Performance Breakdown</h3>
  </tr>
  <tr>
  <td>Performance (in hot path)</td>
- <td><strong>9ms</strong> total frame time</td>
- <td><strong>12ms</strong> total frame time</td>
+ <td><strong>9 ms</strong> total frame time</td>
+ <td><strong>12 ms</strong> total frame time</td>
  </tr>
  </tbody>
  </table>
@@ -1592,7 +1604,7 @@ <h3>Best Practices</h3>
  intrinsics is
  not a micro-optimization—it's a major win in performance-critical kernels. In our case, it shaved
  off
- <strong>3ms per frame</strong> and greatly simplified the PTX output.
+ <strong>3 ms per frame</strong> and greatly simplified the PTX output.
  </p>
  </section>

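For readers of the Ray-AABB case study touched by this diff, here is a minimal sketch of what switching from the generic `std::fma` / `std::max` to single-precision calls might look like. It is a host-side C++ illustration under stated assumptions, not the author's kernel: `Vec3`, `hitAabb`, and the precomputed `invDir` are hypothetical names, and `fmaf` / `fmaxf` / `fminf` are assumed stand-ins for the "intrinsic float versions" the page mentions.

```cpp
#include <array>
#include <math.h>  // fmaf, fmaxf, fminf: single-precision, global namespace

using Vec3 = std::array<float, 3>;  // hypothetical helper type, not from the source

// Slab test for a ray against an axis-aligned box [lo, hi].
// invDir holds 1/direction per axis; every component is assumed finite and non-zero.
// Returns true when the ray overlaps the box for some t in [tMin, tMax].
inline bool hitAabb(const Vec3& origin, const Vec3& invDir,
                    const Vec3& lo, const Vec3& hi,
                    float tMin, float tMax) {
    for (int axis = 0; axis < 3; ++axis) {
        // (p - origin) * invDir rewritten as p * invDir + bias,
        // so each plane distance is a single fused multiply-add.
        const float bias = -origin[axis] * invDir[axis];
        const float t0 = fmaf(lo[axis], invDir[axis], bias);
        const float t1 = fmaf(hi[axis], invDir[axis], bias);
        // Float-only min/max avoids any promotion to double.
        tMin = fmaxf(tMin, fminf(t0, t1));
        tMax = fminf(tMax, fmaxf(t0, t1));
    }
    return tMin <= tMax;
}
```

The CUDA math API provides single-precision functions with the same names for device code, so the loop body should carry over largely unchanged. Whether it reproduces the 12 ms to 9 ms drop reported above depends on the actual kernel and hardware.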