Don't use load/store intrinsics #185
Conversation
This is a massive improvement on Apple M4! This takes
Here's what I get with vello_cpu main from December vs. vello_cpu with this PR and #171; it seems to yield improvements in some cases but unfortunately still regressions in others. :( But if it alleviates problems in other benchmarks, we can still merge it.
So you mean basically reverting #184? Seems to be about the same, unfortunately.
So just for the record, before (using fearless_simd 0.3):

```rust
impl<S: Simd> Iterator for GradientPainter<'_, S> {
    type Item = u8x64<S>;

    #[inline(always)]
    fn next(&mut self) -> Option<Self::Item> {
        let extend = self.gradient.extend;
        let pos = f32x16::from_slice(self.simd, self.t_vals.next()?);
        let t_vals = apply_extend(pos, extend);
        let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();
        let mut vals = [0_u8; 64];

        for (val, idx) in vals.chunks_exact_mut(4).zip(*indices) {
            val.copy_from_slice(&self.lut[idx as usize]);
        }

        Some(u8x64::from_slice(self.simd, &vals))
    }
}

impl<S: Simd> crate::fine::Painter for GradientPainter<'_, S> {
    #[inline(never)]
    fn paint_u8(&mut self, buf: &mut [u8]) {
        self.simd.vectorize(
            #[inline(always)]
            || {
                for chunk in buf.chunks_exact_mut(64) {
                    chunk.copy_from_slice(self.next().unwrap().as_slice());
                }
            },
        );
    }

    fn paint_f32(&mut self, _: &mut [f32]) {
        unimplemented!()
    }
}
```

After (using fearless_simd main + this PR + #171):

```rust
impl<S: Simd> crate::fine::Painter for GradientPainter<'_, S> {
    #[inline(never)]
    fn paint_u8(&mut self, buf: &mut [u8]) {
        self.simd.vectorize(
            #[inline(always)]
            || {
                let src: &[u32] = bytemuck::cast_slice(&self.lut);
                let dest: &mut [u32] = bytemuck::cast_slice_mut(buf);

                for chunk in dest.chunks_exact_mut(16) {
                    let extend = self.gradient.extend;
                    let pos = f32x16::from_slice(self.simd, self.t_vals.next().unwrap());
                    let t_vals = apply_extend(pos, extend);
                    let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();
                    indices.gather_into(src, chunk);
                }
            },
        );
    }

    fn paint_f32(&mut self, _: &mut [f32]) {
        unimplemented!()
    }
}
```

Here are the assemblies: https://gist.github.com/LaurenzV/a5ed17df074e7de3eed2b96c41f121d2 For
Is this before/after this PR, or are there some other changes that are also included?
If I change it back to not use the new one:

```rust
#[inline(never)]
fn paint_u8(&mut self, buf: &mut [u8]) {
    self.simd.vectorize(
        #[inline(always)]
        || {
            self.simd.vectorize(
                #[inline(always)]
                || {
                    for chunk in buf.chunks_exact_mut(64) {
                        let extend = self.gradient.extend;
                        let pos = f32x16::from_slice(self.simd, self.t_vals.next().unwrap());
                        let t_vals = apply_extend(pos, extend);
                        let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();

                        for (val, idx) in chunk.chunks_exact_mut(4).zip(*indices) {
                            val.copy_from_slice(&self.lut[idx as usize]);
                        }
                    }
                },
            );
        },
    );
}
```
Before is using fearless_simd 0.3, After is using fearless_simd main + this PR + #171 |
Could you share the assembly for this case? |
Here: |
Ah, yeah, it's the bounds-checks-on-gather case again. The compiler just so happens to structure the load loop differently, which seems to be incidental to the use of intrinsics. I am not very good at reading AArch64 assembly, but Gemini has a convincing explanation of what's happening.

This PR fixes awful codegen on main for loading contiguous data into vector types. So on balance I think this is well worth merging: it fixes some really awful codegen for the most common case, and the regression for the fill case seems entirely incidental and will likely change from one LLVM version to another anyway.
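For illustration, the per-lane gather that triggers those bounds checks boils down to something like this sketch (hypothetical standalone names; the 4-byte LUT entry size is assumed from the snippets above, so this is a reconstruction rather than the actual vello_cpu code):

```rust
// Sketch of the scalar gather loop. Every `lut[idx as usize]` is a checked
// index, so the compiler emits a bounds check (a potential panic branch) per
// lane unless it can prove all indices are in range, and how it structures
// those checks shapes the surrounding load loop.
fn gather_scalar(lut: &[[u8; 4]], indices: &[u32; 16], out: &mut [u8; 64]) {
    for (dst, &idx) in out.chunks_exact_mut(4).zip(indices.iter()) {
        dst.copy_from_slice(&lut[idx as usize]);
    }
}
```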
Sure, do the changes with
I've looked at the documentation. I don't really follow what's happening in the stores, but all the PhastFT tests pass on this branch, so it's certainly not grossly wrong.
@valadaptive if I use

```
fine/gradient/linear/opaque_u8_neon
                        time:   [473.02 ns 476.11 ns 482.30 ns]
                        change: [-2.6771% -1.3268% +0.4153%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high severe
```

Still a bit worse than current main, but I think at this point I can live with the difference!
At the risk of getting into micro-optimization, is there any benefit if you split it up? If it turns out to be beneficial, I could implement it. I might take a look at LLVM at some point, but it's not really an area of it that I've looked into before.
With benchmarks all on par or improved and safety assured, is this good to go? I'm excited to see this merged because it's the last blocker for merging the migration of https://github.com/QuState/PHastFT/ from
I’ll look at it tomorrow |
fearless_simd_gen/src/generic.rs
Outdated
```rust
// There are architecture-specific "load" intrinsics, but they can actually be *worse* for performance. If they
// lower to LLVM intrinsics, they will likely not be optimized until much later in the pipeline (if at all),
// resulting in substantially worse codegen.
```
Let's also link to this PR for context.
fearless_simd_gen/src/generic.rs
Outdated
```rust
let store_expr = quote! {
    unsafe {
        // Copies `count` scalars from the backing type, which has the same layout as the destination array (see
        // `generic_as_array`). We know that the source and destination are aligned to at least the alignment of the
```
Because the slices themselves are guaranteed to have "correct" alignment, right?
Yes, the existence of a `&[f64]` that doesn't point to aligned `f64`s is UB; see https://doc.rust-lang.org/stable/std/slice/fn.from_raw_parts.html
Would still be good to clarify in the comment, I think.
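To make the alignment reasoning concrete, a copy-based store boils down to something like the following sketch (hypothetical names and a fixed 4-lane width; this is an assumed shape, not the literal output of fearless_simd_gen):

```rust
// Minimal sketch of a copy-based store. A `&mut [f32]` is only valid if it
// points to properly aligned `f32`s, so both the vector's backing array and
// `dest` already satisfy element alignment; the element-wise copy needs no
// extra alignment handling.
#[inline(always)]
fn store_f32x4(val: [f32; 4], dest: &mut [f32]) {
    assert!(dest.len() >= 4);
    unsafe {
        // Copies 4 scalars from the backing array into the destination slice.
        core::ptr::copy_nonoverlapping(val.as_ptr(), dest.as_mut_ptr(), 4);
    }
}
```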
Now that this is squared away, I've merged the PR to convert https://github.com/QuState/PHastFT/ to
@folkertdev has kindly helped me look into the assembly and the LLVM IR, and managed to come up with a minimized example showing the missed optimization: https://rust.godbolt.org/z/v9hqY3csr

The intermediate load onto the stack was caused by the Rust code containing an intermediate array that rustc failed to optimize out. This is very likely a bug in stdarch or LLVM. You can also see that when working correctly, the intrinsic produces a
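To illustrate the shape of the problem, here is a reconstruction under assumptions (not the exact code from the godbolt link): the value takes a detour through a temporary array before reaching the intrinsic, and in the bad case that temporary is materialized on the stack instead of being folded into a single vector load.

```rust
#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::{float32x4_t, vld1q_f32};

// Reconstructed pattern (an assumption based on the discussion above, not the
// minimized example itself): the load goes through an intermediate array. In
// the problematic case that temporary is spilled to the stack and reloaded
// rather than optimized away; whether a given compiler version reproduces
// this on exactly this snippet is not guaranteed.
#[cfg(target_arch = "aarch64")]
unsafe fn load_via_temp(src: &[f32; 4]) -> float32x4_t {
    let tmp: [f32; 4] = *src; // intermediate array that should be optimized out
    vld1q_f32(tmp.as_ptr())
}

// Loading straight from the source avoids the intermediate array entirely.
#[cfg(target_arch = "aarch64")]
unsafe fn load_direct(src: &[f32; 4]) -> float32x4_t {
    vld1q_f32(src.as_ptr())
}
```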
I guess we should open an issue in the Rust repo for that? 😄
I'm a co-maintainer of
PR for stdarch is up: rust-lang/stdarch#2004 |
This is an interesting one! The remaining performance gap in QuState/PhastFT#58 seems to come from subpar performance when loading constants.
I noticed that in Rust's stdarch, which defines all the SIMD intrinsics, the x86 load/store intrinsics lower to raw memory operations (`ptr::copy_nonoverlapping`). The AArch64 load/store intrinsics, on the other hand, do map to the corresponding LLVM intrinsics! My hypothesis is that the LLVM intrinsics are not lowered until much later in the compilation pipeline, resulting in far fewer optimization opportunities and much worse codegen. If this is the case, we should just use memory operations directly. This also simplifies the code that we generate by quite a bit.
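As a rough sketch of the two lowering strategies being contrasted here (hypothetical helper names and a fixed 4-lane width; this is not the code fearless_simd_gen actually emits):

```rust
#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::{float32x4_t, vld1q_f32};

// Strategy A: go through the architecture-specific load intrinsic, which on
// AArch64 maps to an LLVM intrinsic that (per the hypothesis above) is only
// handled late in the pipeline.
#[cfg(target_arch = "aarch64")]
#[inline(always)]
unsafe fn load_f32x4_intrinsic(src: &[f32]) -> float32x4_t {
    debug_assert!(src.len() >= 4);
    vld1q_f32(src.as_ptr())
}

// Strategy B: a plain memory copy. LLVM sees an ordinary memcpy-like operation
// early on and can fold it into a single vector load.
#[inline(always)]
fn load_f32x4_copy(src: &[f32]) -> [f32; 4] {
    assert!(src.len() >= 4);
    let mut out = [0.0f32; 4];
    unsafe {
        core::ptr::copy_nonoverlapping(src.as_ptr(), out.as_mut_ptr(), 4);
    }
    out
}
```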