
Conversation

@Shnatsel
Collaborator

No description provided.

@codecov-commenter

codecov-commenter commented Dec 12, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.88%. Comparing base (0f47ea1) to head (b8ebaa5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #58      +/-   ##
==========================================
+ Coverage   99.82%   99.88%   +0.06%     
==========================================
  Files          13       12       -1     
  Lines        2261     2711     +450     
==========================================
+ Hits         2257     2708     +451     
+ Misses          4        3       -1     


@Shnatsel Shnatsel mentioned this pull request Jan 21, 2026
@Shnatsel
Collaborator Author

On Zen 4, this gives up to a 7% penalty due to not utilizing AVX-512, but otherwise looks normal. It seems we don't need an explicit mul_neg_add on x86; this is lowered into the correct instruction automatically.
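
As a scalar illustration of what I mean (a minimal sketch, not the fearless_simd API; assumes FMA is enabled, e.g. via RUSTFLAGS="-C target-cpu=native" on Zen 4):

// Sketch only: with FMA available, LLVM folds the negation into a single
// vfnmadd* instruction on x86, so no explicit mul_neg_add helper is needed.
pub fn neg_mul_add(a: f32, b: f32, c: f32) -> f32 {
    (-a).mul_add(b, c) // computes -(a * b) + c as one fused negate-multiply-add
}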

On Apple M4 this is a large regression. The hottest instructions are loads/stores to/from the stack for f32x16, so it might be due to register pressure or some such (LLVM isn't great at dealing with that). I'll need to investigate how wide lowers this kind of thing to NEON; its approach is apparently better than fearless_simd's. Or we could rewrite the function to operate on native vectors, but then we might give up some ILP?

@valadaptive

On Apple M4 this is a large regression. The hottest instructions are loads/stores to/from the stack for f32x16, so it might be due to register pressure or some such

This is a wild guess (I don't have Apple Silicon hardware, so I can't benchmark any of this), but the way you're loading from a slice looks a bit convoluted. Instead of e.g.

let in0_re = f32x4::simd_from(simd, <[f32; 4]>::try_from(&reals_s0[0..4]).unwrap());

have you tried simply:

let in0_re = f32x4::from_slice(simd, &reals_s0[0..4]);

Also just to confirm, you ran this with the latest fearless_simd from Git, correct? linebender/fearless_simd#159 aimed to improve codegen around SIMD loads, and linebender/fearless_simd#181 just landed a couple days ago and adds (potentially) faster methods for SIMD stores.

@Shnatsel
Collaborator Author

Shnatsel commented Jan 21, 2026

Yep, this is on the latest fearless_simd from git. I'll see if from_slice does anything; it's certainly more readable.

I've also tried swapping the vector repr from arrays to structs to mimic wide's internal representation, but it didn't make a difference.
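
Roughly this kind of change, sketched from memory (the type names here are illustrative, not the actual code):

// Illustrative only: an array-of-lanes representation vs. a struct of four
// native NEON vectors, the latter mimicking wide's internal layout.
#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::float32x4_t;

#[derive(Clone, Copy)]
#[repr(transparent)]
struct F32x16Lanes([f32; 16]);

#[cfg(target_arch = "aarch64")]
#[derive(Clone, Copy)]
#[repr(C)]
struct F32x16Native(float32x4_t, float32x4_t, float32x4_t, float32x4_t);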

@Shnatsel
Collaborator Author

CI is broken in a really interesting way: it complains about mul_neg_add, which doesn't appear anywhere in the code on the latest commit. It's either running on an old commit or on a different branch; either way, that could be exploitable if it can be reproduced.

@Shnatsel
Collaborator Author

Nope, no difference in performance from changing loads/stores. Looks like a readability win to me though.

@Shnatsel Shnatsel mentioned this pull request Jan 22, 2026
@Shnatsel
Collaborator Author

Shnatsel commented Jan 22, 2026

I had Claude analyze the generated assembly from fft_dit_chunk_32_simd_f32, and it seems that the regression is due to over-eager inlining: https://claude.ai/share/920afa1e-8e39-48a1-b00f-1a55d6c25d1c

Load/store instructions are hot in the profiler (samply), so this makes sense as a regression due to stack spills. This also explains why all the kernels are gone from the cargo asm view: they've all been aggressively inlined.

This is likely an artifact of fearless_simd's #[inline(always)]. We go through vectorize() to add a function call that isn't force-inlined, but apparently that's not enough.
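
For reference, the kind of inlining barrier I have in mind is just something like this (hypothetical kernel, not the actual PhastFT code):

// Hypothetical sketch: an #[inline(never)] barrier keeps LLVM from rolling
// every DIT kernel into one giant caller that spills f32x16 values to the stack.
// Assumes re and im have the same (even) length.
#[inline(never)]
fn radix2_butterflies(re: &mut [f32], im: &mut [f32]) {
    let half = re.len() / 2;
    let (re_lo, re_hi) = re.split_at_mut(half);
    let (im_lo, im_hi) = im.split_at_mut(half);
    for i in 0..half {
        let (a, b) = (re_lo[i], re_hi[i]);
        re_lo[i] = a + b;
        re_hi[i] = a - b;
        let (c, d) = (im_lo[i], im_hi[i]);
        im_lo[i] = c + d;
        im_hi[i] = c - d;
    }
}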

@valadaptive

This is likely an artifact of fearless_simd's #[inline(always)]. We go through vectorize() to add a function call that isn't force-inlined, but apparently that's not enough.

I wonder why it's making a different inlining decision. Was the function marked #[inline(never)] before?

I think we should probably just make vectorize not be inline over in fearless_simd.

…entirety of DiT kernels ends up rolled up into one function on ARM and collapses under register pressure
@Shnatsel
Collaborator Author

Nope, preventing inlining didn't help performance. It seems functions that operate on 512-bit vectors like f32x16 are still much slower than with wide.

@Shnatsel
Collaborator Author

Now that neither is inlined, I can directly compare the generated assembly. This is almost the first time I've looked at ARM assembly, but what jumps out at me is that wide doesn't use the fmls instruction: it always does an fmla followed by an fneg, which is two instructions to do the same thing, and it's still much faster somehow.

Assembly of an affected function fft_dit_chunk_64_simd_f32 lifted from samply assembly view:

wide: wide_asm.txt

fearless_simd: fearless_asm.txt

Profiling data

Profile with wide (on main, exact commit): https://share.firefox.dev/3LNEZto

Profile with fearless_simd (#58, exact commit): https://share.firefox.dev/3LVrB6x

Both recorded with samply record -r 5000 cargo bench --profile=profiling --bench=bench 'Forward f32/PhastFT DIT/64'

@Shnatsel
Collaborator Author

I can't really read AArch64 assembly, but Claude can, and Claude has an idea of what's going on: https://claude.ai/share/4fa3af4f-3c34-4d57-b11d-3611385f4f1c

That explains a lot if it's true

@valadaptive

Try linebender/fearless_simd#184.

@Shnatsel
Collaborator Author

linebender/fearless_simd#184 is a +20% boost on Apple M4! Thanks a lot!

This still isn't as fast as wide but that's a big improvement!

@Shnatsel
Collaborator Author

New profile for fearless_simd: https://share.firefox.dev/3LOGBDc

New assembly for fft_dit_chunk_64_simd_f32: asm_64_f32_fixed_loads_asm.txt

@Shnatsel
Collaborator Author

Shnatsel commented Jan 22, 2026

It looks like the load-store-load pattern for load_array is present even when loading from actual arrays that are just constants: https://claude.ai/share/4fa3af4f-3c34-4d57-b11d-3611385f4f1c

@valadaptive you'll probably want to drop load_array entirely and always use the implementation of load_array_ref on AArch64, even for on-stack arrays, because rustc doesn't align stack arrays with this load in mind

@Shnatsel
Collaborator Author

I was wrong about wide being less efficient at computing FMA. The math instructions are equally efficient: there is no single-instruction version of mul_sub() in NEON, so you need either vfmaq_f32(vnegq_f32(a), b, c) or vnegq_f32(vfmsq_f32(a, b, c)).
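
Spelled out with the raw NEON intrinsics (the helper names here are mine, just to show the two equivalent sequences; both compute b*c - a in two instructions):

#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::{float32x4_t, vfmaq_f32, vfmsq_f32, vnegq_f32};

// fneg + fmla: (-a) + b*c
#[cfg(target_arch = "aarch64")]
unsafe fn mul_sub_v1(a: float32x4_t, b: float32x4_t, c: float32x4_t) -> float32x4_t {
    vfmaq_f32(vnegq_f32(a), b, c)
}

// fmls + fneg: -(a - b*c)
#[cfg(target_arch = "aarch64")]
unsafe fn mul_sub_v2(a: float32x4_t, b: float32x4_t, c: float32x4_t) -> float32x4_t {
    vnegq_f32(vfmsq_f32(a, b, c))
}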

@valadaptive

It looks like the load-store-load pattern for load_array is present even when loading from actual arrays that are just constants: https://claude.ai/share/4fa3af4f-3c34-4d57-b11d-3611385f4f1c

@valadaptive you'll probably want to drop load_array entirely and always use the implementation of load_array_ref on AArch64, even for on-stack arrays, because rustc doesn't align stack arrays with this load in mind

Can you share the LLVM assembly? I think if you build with RUSTFLAGS="--emit=llvm-ir", it should show up somewhere in target/release/deps.

The constant array loads are inlined, and I find it very strange that LLVM is incapable of eliminating a redundant load/store of a simple constant.

It's hard to find much information about this online, but I cannot find any reference to the ld1 instruction requiring an aligned address, and what I can find indicates that NEON handles unaligned loads and stores just fine. I suspect Claude may be making this up.

I did notice that the f32x16 array load function maps to vld1q_f32_x4, which Rust lowers to the LLVM intrinsic llvm.aarch64.neon.ld1x4.v4f32.p0. The issue may be that LLVM lowers that particular intrinsic poorly, or does so very late in the pipeline. I'll have to check; it may be better to just use four 128-bit loads.
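
For comparison, a rough sketch of the two strategies (the function names are mine, not the fearless_simd API): one goes through the ld1x4 intrinsic, the other is a plain memory read that LLVM sees as ordinary load instructions much earlier in the pipeline.

#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::{float32x4x4_t, vld1q_f32_x4};

// Lowers to the llvm.aarch64.neon.ld1x4.v4f32.p0 intrinsic.
#[cfg(target_arch = "aarch64")]
unsafe fn load_via_intrinsic(src: &[f32; 16]) -> float32x4x4_t {
    vld1q_f32_x4(src.as_ptr())
}

// Plain (unaligned) memory read; LLVM treats this as an ordinary 64-byte load.
#[cfg(target_arch = "aarch64")]
unsafe fn load_via_memory(src: &[f32; 16]) -> float32x4x4_t {
    core::ptr::read_unaligned(src.as_ptr() as *const float32x4x4_t)
}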

@valadaptive

Try it with linebender/fearless_simd#185.

@Shnatsel
Collaborator Author

Shnatsel commented Jan 23, 2026

After switching over to linebender/fearless_simd#185 I had to double-check my results, re-measure the baseline and run tests, because the benchmarks look too good to be true!

fearless_simd is now 10% to 40% faster than wide!

It's hard to find any information about this online, but I cannot find any reference to the ld1 instruction requiring the address to be aligned, and the information I can find seems to indicate that NEON can do unaligned loads and stores just fine. I suspect Claude may be making this up.

Yep, I agree that's probably not the cause of the spill. Looking at the code that implements it, I suspect it's Aligned512 that's to blame: crate::support::Aligned512(vld1q_f32_x4(val.as_ptr() as *const _))

I'll prepare a branch without the aligned wrappers and measure that for comparison.

@Shnatsel
Collaborator Author

I'll prepare a branch without the aligned wrappers and measure that for comparison.

Branch up: https://github.com/Shnatsel/fearless_simd/tree/no-align-wrapper-in-loads

But nope, no improvement from that. It seems transmute_copy really is just faster than the intrinsics.

@Shnatsel
Collaborator Author

Shnatsel commented Jan 23, 2026

Here's the LLVM IR you asked for:
phastft-fearless-main-commit-744661d.ll.gz
phastft-transmute-copy-loads.ll.gz

@Shnatsel
Collaborator Author

FWIW, I showed this to an LLVM developer, and he said:

that 'save to stack and immediately load back to registers' thing could be an escape hatch for when the compiler needs to bitcast from one type to another but doesn't know how

@Shnatsel Shnatsel mentioned this pull request Jan 24, 2026
github-merge-queue bot pushed a commit to linebender/fearless_simd that referenced this pull request Jan 24, 2026
This is an interesting one! The remaining performance gap in
QuState/PhastFT#58 seems to come from subpar
performance when loading constants.

I noticed that in Rust's `stdarch`, which defines all the SIMD
intrinsics, the x86 load/store intrinsics lower to raw memory operations
(`ptr::copy_nonoverlapping`). The AArch64 load/store intrinsics, on the
other hand, *do* map to corresponding LLVM intrinsics!

My hypothesis is that the LLVM intrinsics are not lowered until much
later in the compilation pipeline, resulting in far fewer optimization
opportunities and much worse codegen. If this is the case, we should
just use memory operations directly. This also simplifies the code that
we generate by quite a bit.
@Shnatsel Shnatsel marked this pull request as ready for review January 24, 2026 20:50
@Shnatsel
Collaborator Author

The performance regression is resolved upstream in linebender/fearless_simd#184 and linebender/fearless_simd#185, so this should be ready to merge.

@Shnatsel Shnatsel merged commit 6875e8c into main Jan 24, 2026
10 checks passed
@Shnatsel Shnatsel deleted the fearless-simd branch January 24, 2026 20:53