Don't use load/store intrinsics #185
Conversation
This is a massive improvement on Apple M4! This takes
Here's what I get with vello_cpu main from December vs. vello_cpu with this PR and #171; it seems to yield improvements in some cases but unfortunately still regressions in others. :( But if it alleviates problems in other benchmarks, we can still merge it.
So you mean basically reverting #184? Seems to be about the same, unfortunately.
So just for the record, before (using fearless_simd 0.3):

```rust
impl<S: Simd> Iterator for GradientPainter<'_, S> {
    type Item = u8x64<S>;

    #[inline(always)]
    fn next(&mut self) -> Option<Self::Item> {
        let extend = self.gradient.extend;
        let pos = f32x16::from_slice(self.simd, self.t_vals.next()?);
        let t_vals = apply_extend(pos, extend);
        let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();
        let mut vals = [0_u8; 64];

        for (val, idx) in vals.chunks_exact_mut(4).zip(*indices) {
            val.copy_from_slice(&self.lut[idx as usize]);
        }

        Some(u8x64::from_slice(self.simd, &vals))
    }
}

impl<S: Simd> crate::fine::Painter for GradientPainter<'_, S> {
    #[inline(never)]
    fn paint_u8(&mut self, buf: &mut [u8]) {
        self.simd.vectorize(
            #[inline(always)]
            || {
                for chunk in buf.chunks_exact_mut(64) {
                    chunk.copy_from_slice(self.next().unwrap().as_slice());
                }
            },
        );
    }

    fn paint_f32(&mut self, _: &mut [f32]) {
        unimplemented!()
    }
}
```

After (using fearless_simd main + this PR + #171):

```rust
impl<S: Simd> crate::fine::Painter for GradientPainter<'_, S> {
    #[inline(never)]
    fn paint_u8(&mut self, buf: &mut [u8]) {
        self.simd.vectorize(
            #[inline(always)]
            || {
                let src: &[u32] = bytemuck::cast_slice(&self.lut);
                let dest: &mut [u32] = bytemuck::cast_slice_mut(buf);

                for chunk in dest.chunks_exact_mut(16) {
                    let extend = self.gradient.extend;
                    let pos = f32x16::from_slice(self.simd, self.t_vals.next().unwrap());
                    let t_vals = apply_extend(pos, extend);
                    let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();
                    indices.gather_into(src, chunk);
                }
            },
        );
    }

    fn paint_f32(&mut self, _: &mut [f32]) {
        unimplemented!()
    }
}
```

Here are the assemblies: https://gist.github.com/LaurenzV/a5ed17df074e7de3eed2b96c41f121d2 For
Is this before/after this PR, or are there some other changes that are also included?
If I change it back to not use the new one:

```rust
#[inline(never)]
fn paint_u8(&mut self, buf: &mut [u8]) {
    self.simd.vectorize(
        #[inline(always)]
        || {
            self.simd.vectorize(
                #[inline(always)]
                || {
                    for chunk in buf.chunks_exact_mut(64) {
                        let extend = self.gradient.extend;
                        let pos = f32x16::from_slice(self.simd, self.t_vals.next().unwrap());
                        let t_vals = apply_extend(pos, extend);
                        let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();

                        for (val, idx) in chunk.chunks_exact_mut(4).zip(*indices) {
                            val.copy_from_slice(&self.lut[idx as usize]);
                        }
                    }
                },
            );
        },
    );
}
```
Before is using fearless_simd 0.3, After is using fearless_simd main + this PR + #171 |
Could you share the assembly for this case? |
Here: |
Ah, yeah, it's the bounds-checks-on-gather case again. The compiler just so happens to structure the load loop differently, which seems to be incidental to the use of intrinsics. I am not very good at reading AArch64 assembly, but Gemini has a convincing explanation of what's happening.

This PR fixes awful codegen on main for loading contiguous data into vector types. So on balance I think this is well worth merging: it fixes some really awful codegen for the most common case, and the regression for the fill case seems entirely incidental and will likely change from one LLVM version to another anyway.
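For illustration, the per-lane gather that triggers those bounds checks boils down to something like this sketch (hypothetical standalone names; the 4-byte LUT entry size is assumed from the snippets above, so this is a reconstruction rather than the actual vello_cpu code):

```rust
// Sketch of the scalar gather loop. Every `lut[idx as usize]` is a checked
// index, so the compiler emits a bounds check (a potential panic branch) per
// lane unless it can prove all indices are in range, and how it structures
// those checks shapes the surrounding load loop.
fn gather_scalar(lut: &[[u8; 4]], indices: &[u32; 16], out: &mut [u8; 64]) {
    for (dst, &idx) in out.chunks_exact_mut(4).zip(indices.iter()) {
        dst.copy_from_slice(&lut[idx as usize]);
    }
}
```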
Sure, do the changes with
I've looked at the documentation. I don't really follow what's happening in the stores, but all the PhastFT tests pass on this branch, so it's certainly not grossly wrong.
@valadaptive if I use

```
fine/gradient/linear/opaque_u8_neon
                        time:   [473.02 ns 476.11 ns 482.30 ns]
                        change: [-2.6771% -1.3268% +0.4153%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high severe
```

Still a bit worse than current main, but I think at this point I can live with the difference!
At the risk of getting into micro-optimization, is there any benefit if you split it up? If it turns out to be beneficial, I could implement it. I might take a look at LLVM at some point, but it's not really an area of it that I've looked into before.
With benchmarks all on par or improved and safety assured, is this good to go? I'm excited to see this merged because it's the last blocker for merging the migration of https://github.com/QuState/PHastFT/ from
I’ll look at it tomorrow |
fearless_simd_gen/src/generic.rs
Outdated
```rust
// There are architecture-specific "load" intrinsics, but they can actually be *worse* for performance. If they
// lower to LLVM intrinsics, they will likely not be optimized until much later in the pipeline (if at all),
// resulting in substantially worse codegen.
```
Let's also link to this PR for context.
fearless_simd_gen/src/generic.rs
Outdated
```rust
let store_expr = quote! {
    unsafe {
        // Copies `count` scalars from the backing type, which has the same layout as the destination array (see
        // `generic_as_array`). We know that the source and destination are aligned to at least the alignment of the
```
Because the slices themselves are guaranteed to have "correct" alignment, right?
Yes, the existence of a `&[f64]` that doesn't point to aligned `f64`s is UB; see https://doc.rust-lang.org/stable/std/slice/fn.from_raw_parts.html
Would still be good to clarify in the comment, I think.
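To make the alignment reasoning concrete, a copy-based store boils down to something like the following sketch (hypothetical names and a fixed 4-lane width; this is an assumed shape, not the literal output of fearless_simd_gen):

```rust
// Minimal sketch of a copy-based store. A `&mut [f32]` is only valid if it
// points to properly aligned `f32`s, so both the vector's backing array and
// `dest` already satisfy element alignment; the element-wise copy needs no
// extra alignment handling.
#[inline(always)]
fn store_f32x4(val: [f32; 4], dest: &mut [f32]) {
    assert!(dest.len() >= 4);
    unsafe {
        // Copies 4 scalars from the backing array into the destination slice.
        core::ptr::copy_nonoverlapping(val.as_ptr(), dest.as_mut_ptr(), 4);
    }
}
```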
Now that this is squared away, I've merged the PR to convert https://github.com/QuState/PHastFT/ to
@folkertdev has kindly helped me look into the assembly and the LLVM IR, and managed to come up with a minimized example showing the missed optimization: https://rust.godbolt.org/z/v9hqY3csr

The intermediate load onto the stack was caused by the Rust code containing an intermediate array that rustc failed to optimize out. This is very likely a bug in stdarch or LLVM. You can also see that when working correctly, the intrinsic produces a
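To illustrate the shape of the problem, here is a reconstruction under assumptions (not the exact code from the godbolt link): the value takes a detour through a temporary array before reaching the intrinsic, and in the bad case that temporary is materialized on the stack instead of being folded into a single vector load.

```rust
#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::{float32x4_t, vld1q_f32};

// Reconstructed pattern (an assumption based on the discussion above, not the
// minimized example itself): the load goes through an intermediate array. In
// the problematic case that temporary is spilled to the stack and reloaded
// rather than optimized away; whether a given compiler version reproduces
// this on exactly this snippet is not guaranteed.
#[cfg(target_arch = "aarch64")]
unsafe fn load_via_temp(src: &[f32; 4]) -> float32x4_t {
    let tmp: [f32; 4] = *src; // intermediate array that should be optimized out
    vld1q_f32(tmp.as_ptr())
}

// Loading straight from the source avoids the intermediate array entirely.
#[cfg(target_arch = "aarch64")]
unsafe fn load_direct(src: &[f32; 4]) -> float32x4_t {
    vld1q_f32(src.as_ptr())
}
```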
I guess we should open an issue in the Rust repo for that? 😄
I'm a co-maintainer of
PR for stdarch is up: rust-lang/stdarch#2004 |
This is an interesting one! The remaining performance gap in QuState/PhastFT#58 seems to come from subpar performance when loading constants.
I noticed that in Rust's stdarch, which defines all the SIMD intrinsics, the x86 load/store intrinsics lower to raw memory operations (`ptr::copy_nonoverlapping`). The AArch64 load/store intrinsics, on the other hand, do map to the corresponding LLVM intrinsics! My hypothesis is that the LLVM intrinsics are not lowered until much later in the compilation pipeline, resulting in far fewer optimization opportunities and much worse codegen. If this is the case, we should just use memory operations directly. This also simplifies the code that we generate by quite a bit.
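As a rough sketch of the two lowering strategies being contrasted here (hypothetical helper names and a fixed 4-lane width; this is not the code fearless_simd_gen actually emits):

```rust
#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::{float32x4_t, vld1q_f32};

// Strategy A: go through the architecture-specific load intrinsic, which on
// AArch64 maps to an LLVM intrinsic that (per the hypothesis above) is only
// handled late in the pipeline.
#[cfg(target_arch = "aarch64")]
#[inline(always)]
unsafe fn load_f32x4_intrinsic(src: &[f32]) -> float32x4_t {
    debug_assert!(src.len() >= 4);
    vld1q_f32(src.as_ptr())
}

// Strategy B: a plain memory copy. LLVM sees an ordinary memcpy-like operation
// early on and can fold it into a single vector load.
#[inline(always)]
fn load_f32x4_copy(src: &[f32]) -> [f32; 4] {
    assert!(src.len() >= 4);
    let mut out = [0.0f32; 4];
    unsafe {
        core::ptr::copy_nonoverlapping(src.as_ptr(), out.as_mut_ptr(), 4);
    }
    out
}
```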