Port to fearless_simd #58
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Coverage diff:

|          | main   | #58    | +/-    |
|----------|--------|--------|--------|
| Coverage | 99.82% | 99.88% | +0.06% |
| Files    | 13     | 12     | -1     |
| Lines    | 2261   | 2711   | +450   |
| Hits     | 2257   | 2708   | +451   |
| Misses   | 4      | 3      | -1     |

View the full report in Codecov by Sentry.
On Zen4 this gives up to a 7% penalty due to not utilizing AVX-512, but otherwise looks normal. We don't seem to need an explicit `mul_neg_add` on x86; it is lowered into the correct instruction automatically. On Apple M4 this is a large regression. The hottest instructions are loads/stores to/from the stack for `f32x16`, so it might be due to register pressure or some such (LLVM isn't great at dealing with that). I'll need to investigate how […]
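For reference, a minimal scalar sketch of the `mul_neg_add` point; the -(a * b) + c semantics and the standalone function are assumptions for illustration, not PhastFT's actual code:

```rust
// Illustrative only: `mul_neg_add(a, b, c)` is assumed to mean -(a * b) + c.
// Writing the negation and the fused multiply-add separately is enough on x86:
// with FMA enabled (e.g. `-C target-cpu=znver4`), LLVM folds the negation into
// instruction selection and should emit a single vfnmadd* instruction.
#[inline]
fn neg_mul_add(a: f32, b: f32, c: f32) -> f32 {
    (-a).mul_add(b, c)
}
```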
This is a wild guess (I don't have Apple Silicon hardware, so I can't benchmark any of this), but the way you're loading from a slice looks a bit convoluted. Instead of e.g. `let in0_re = f32x4::simd_from(simd, <[f32; 4]>::try_from(&reals_s0[0..4]).unwrap());` have you tried simply `let in0_re = f32x4::from_slice(simd, &reals_s0[0..4]);`? Also, just to confirm, you ran this with the latest fearless_simd from Git, correct? linebender/fearless_simd#159 aimed to improve codegen around SIMD loads, and linebender/fearless_simd#181 just landed a couple of days ago and adds (potentially) faster methods for SIMD stores.
Yep, this is on the latest fearless_simd from Git. I'll see if […]. I've also tried swapping the vector repr from arrays to structs to mimic […]
CI is broken in a really interesting way: it complains about `mul_neg_add`, which doesn't appear anywhere in the code on the latest commit. It's either running on an old commit or on a different branch; either way, that could be exploitable if it can be reproduced.
Nope, no difference in performance from changing loads/stores. Looks like a readability win to me though.
I had Claude analyze the generated assembly from […]. Load/store instructions are hot in the profiler (samply), so this makes sense as a regression due to stack spills. This also explains why all the kernels are gone from the profile: this is likely an artifact of inlining.
I wonder why it's making a different inlining decision. Was the function marked […]? I think we should probably just make […]
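Assuming the attributes in question are Rust's `#[inline(always)]` and `#[inline]`, here is a minimal sketch of the trade-off being discussed; the butterfly body is a stand-in, not PhastFT's kernel:

```rust
// Sketch only; assumes the discussion is about `#[inline(always)]` vs `#[inline]`.

// Forcing inlining merges every kernel into its caller. On AArch64 a 512-bit
// vector spans four 128-bit NEON registers, so piling many live kernels into
// one frame can exhaust the register file and spill values to the stack.
#[inline(always)]
fn butterfly_forced(re: &mut [f32; 4], im: &mut [f32; 4]) {
    for k in 0..4 {
        let (a, b) = (re[k], im[k]);
        re[k] = a + b;
        im[k] = a - b;
    }
}

// A plain `#[inline]` hint leaves the decision to LLVM, which can keep a kernel
// as a separate function when inlining would only add register pressure.
#[inline]
fn butterfly_hinted(re: &mut [f32; 4], im: &mut [f32; 4]) {
    butterfly_forced(re, im);
}
```

As the follow-up below notes, preventing inlining alone did not recover the performance here, so the spills have another cause.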
This reverts commit 979e357.
…entirety of DiT kernels ends up rolled up into one function on ARM and collapses under register pressure
Nope, preventing inlining didn't help performance. It seems functions that operate on 512-bit vectors like `f32x16` […]
Now that both aren't inlined, I can directly compare the generated assembly. This is almost the first time I've looked at ARM assembly, but what jumps out at me is that […]

Assembly of an affected function:
- wide: wide_asm.txt
- fearless_simd: fearless_asm.txt

Profiling data:
- Profile with […]
- Profile with […]

Both recorded with […]
I can't really read AArch64 assembly, but Claude can, and Claude has an idea of what's going on: https://claude.ai/share/4fa3af4f-3c34-4d57-b11d-3611385f4f1c That explains a lot, if it's true.
linebender/fearless_simd#184 is a +20% boost on Apple M4! Thanks a lot! This still isn't as fast as […]
New profile for fearless_simd: https://share.firefox.dev/3LOGBDc. New assembly for […]
It looks like the load-store-load pattern for […]. @valadaptive, you'll probably want to drop […]
I was wrong about […]
Can you share the LLVM assembly? I think if you include […]. The constant array loads are inlined, and I find it very strange that LLVM is incapable of eliminating a redundant load/store of a simple constant. It's hard to find any information about this online, but I cannot find any reference to the […]. I did notice that the `f32x16` array load function maps to […]
Try it with linebender/fearless_simd#185.
After switching over to linebender/fearless_simd#185, I had to double-check my results, re-measure the baseline, and run the tests, because the benchmarks look too good to be true!
Yep, I agree that's probably not the cause of the spill. Looking at the code that implements it, I suspect it's […]. I'll prepare a branch without the aligned wrappers and measure that for comparison.
Branch up: https://github.com/Shnatsel/fearless_simd/tree/no-align-wrapper-in-loads. But nope, no improvement from that. It seems […]
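For context, a guess at what an "aligned wrapper" on the load path could look like; the `Align64` type and both functions below are invented for illustration and are not fearless_simd's actual code:

```rust
// Hypothetical illustration only; not fearless_simd's implementation.
#[repr(C, align(64))]
struct Align64<T>(T);

// Routing the load through an over-aligned temporary: the slice is copied into
// a 64-byte-aligned stack slot first, an extra store+load the optimizer has to
// prove away.
fn load_f32x16_via_aligned_wrapper(src: &[f32]) -> [f32; 16] {
    let mut tmp = Align64([0.0f32; 16]);
    tmp.0.copy_from_slice(&src[..16]);
    tmp.0
}

// The direct version, with no alignment detour.
fn load_f32x16_direct(src: &[f32]) -> [f32; 16] {
    let mut out = [0.0f32; 16];
    out.copy_from_slice(&src[..16]);
    out
}
```

As measured above, removing the wrapper made no difference here, which points away from alignment as the cause of the spills.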
Here's the LLVM IR you asked for: |
FWIW, I showed this to an LLVM developer, and he said […]
This is an interesting one! The remaining performance gap in QuState/PhastFT#58 seems to come from subpar performance when loading constants. I noticed that in Rust's `stdarch`, which defines all the SIMD intrinsics, the x86 load/store intrinsics lower to raw memory operations (`ptr::copy_nonoverlapping`). The AArch64 load/store intrinsics, on the other hand, *do* map to corresponding LLVM intrinsics! My hypothesis is that the LLVM intrinsics are not lowered until much later in the compilation pipeline, leaving far fewer optimization opportunities and producing much worse codegen. If this is the case, we should just use memory operations directly. This also simplifies the code that we generate by quite a bit.
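As a rough sketch of the two lowering strategies described above; both helpers are illustrative stand-ins, not stdarch's or fearless_simd's actual code:

```rust
// Illustrative only: two ways a 16-lane f32 load could be expressed.

// "Raw memory operation" style (as stdarch does on x86): a plain copy that
// LLVM sees as an ordinary load early on and can fold with surrounding code.
unsafe fn load_f32x16_memcpy(src: *const f32) -> [f32; 16] {
    let mut out = [0.0f32; 16];
    unsafe { core::ptr::copy_nonoverlapping(src, out.as_mut_ptr(), 16) };
    out
}

// "Intrinsic" style (as stdarch does on AArch64): each 128-bit chunk goes
// through a NEON intrinsic, which, per the hypothesis above, survives as an
// LLVM intrinsic until much later in the pipeline and sees fewer optimizations.
#[cfg(target_arch = "aarch64")]
unsafe fn load_f32x16_intrinsics(
    src: *const f32,
) -> [core::arch::aarch64::float32x4_t; 4] {
    use core::arch::aarch64::vld1q_f32;
    unsafe {
        [
            vld1q_f32(src),
            vld1q_f32(src.add(4)),
            vld1q_f32(src.add(8)),
            vld1q_f32(src.add(12)),
        ]
    }
}
```

The memcpy form also keeps the generated code simpler, which matches the "simplifies the code that we generate" point above.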
The performance regression is resolved upstream in linebender/fearless_simd#184 and linebender/fearless_simd#185, so this should be ready to merge.