Conversation

@neilconway
Contributor

The previous implementation had a fast path for ASCII-only inputs, but it was still relatively slow. Switch to using memchr::memchr() to find the first matching byte and then check the rest of the bytes by hand. This improves performance for ASCII inputs by 2x-4x on the built-in strpos benchmarks.
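
As a rough illustration of the approach described above (not the code in this PR), the sketch below finds the first byte of the needle and then compares the remaining bytes by hand. The function name is hypothetical, `iter().position()` is a std-only stand-in for `memchr::memchr(first, haystack)` (which has the same `Option<usize>` contract), and returning 1 for an empty needle is an assumed convention.

```rust
/// Sketch only: returns the 1-based position of `needle` in `haystack`,
/// or 0 if it is absent. Both inputs are assumed to be ASCII-only, so
/// byte offsets and character offsets coincide.
fn find_ascii_substring_sketch(haystack: &[u8], needle: &[u8]) -> usize {
    // Assumed convention: an empty needle matches at position 1.
    let Some((&first, rest)) = needle.split_first() else {
        return 1;
    };
    let mut start = 0;
    while start + needle.len() <= haystack.len() {
        // Only search offsets where the whole needle still fits.
        let window = &haystack[start..=haystack.len() - needle.len()];
        // Stand-in for `memchr::memchr(first, window)`.
        let Some(offset) = window.iter().position(|&b| b == first) else {
            return 0;
        };
        let candidate = start + offset;
        // First byte matched; check the remaining bytes by hand.
        if &haystack[candidate + 1..candidate + needle.len()] == rest {
            return candidate + 1;
        }
        start = candidate + 1;
    }
    0
}
```

With these assumptions, `find_ascii_substring_sketch(b"hello world", b"world")` yields 7 and a missing needle yields 0.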

Which issue does this PR close?

Are these changes tested?

Yes, passes unit tests and SLT.

Are there any user-facing changes?

No.

@neilconway
Contributor Author

Benchmark results:

$ cargo bench --bench strpos -- --baseline strpos-vanilla
   Compiling datafusion-functions v52.1.0 (/Users/neilconway/datafusion/datafusion/functions)
    Finished `bench` profile [optimized] target(s) in 49.54s
     Running benches/strpos.rs (target/release/deps/strpos-276a7f6d948782b8)
Gnuplot not found, using plotters backend
strpos_StringArray_ascii_str_len_8
                        time:   [70.568 µs 70.979 µs 71.399 µs]
                        change: [−42.408% −42.154% −41.895%] (p = 0.00 < 0.05)
                        Performance has improved.

strpos_StringArray_utf8_str_len_8
                        time:   [139.70 µs 139.98 µs 140.24 µs]
                        change: [−2.8251% −2.5080% −2.2091%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild

strpos_StringViewArray_ascii_str_len_8
                        time:   [84.823 µs 85.501 µs 86.164 µs]
                        change: [−36.379% −35.942% −35.475%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

strpos_StringViewArray_utf8_str_len_8
                        time:   [149.49 µs 149.70 µs 149.91 µs]
                        change: [−1.3145% −1.0960% −0.8604%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

strpos_StringArray_ascii_str_len_32
                        time:   [88.618 µs 88.681 µs 88.746 µs]
                        change: [−59.156% −59.095% −59.039%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild

strpos_StringArray_utf8_str_len_32
                        time:   [288.50 µs 288.98 µs 289.65 µs]
                        change: [−0.7910% −0.6439% −0.4836%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

strpos_StringViewArray_ascii_str_len_32
                        time:   [103.70 µs 103.83 µs 103.98 µs]
                        change: [−55.373% −55.209% −55.040%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  1 (1.00%) low severe
  11 (11.00%) high mild
  7 (7.00%) high severe

strpos_StringViewArray_utf8_str_len_32
                        time:   [311.17 µs 311.76 µs 312.27 µs]
                        change: [+0.9177% +1.1383% +1.3431%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

strpos_StringArray_ascii_str_len_128
                        time:   [135.59 µs 136.00 µs 136.40 µs]
                        change: [−79.902% −79.847% −79.794%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking strpos_StringArray_utf8_str_len_128: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.3s, enable flat sampling, or reduce sample count to 60.
strpos_StringArray_utf8_str_len_128
                        time:   [1.2347 ms 1.2360 ms 1.2373 ms]
                        change: [−1.2792% −1.1145% −0.9587%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low mild

strpos_StringViewArray_ascii_str_len_128
                        time:   [173.34 µs 177.51 µs 181.56 µs]
                        change: [−74.843% −74.464% −74.023%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking strpos_StringViewArray_utf8_str_len_128: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.3s, enable flat sampling, or reduce sample count to 60.
strpos_StringViewArray_utf8_str_len_128
                        time:   [1.2400 ms 1.2414 ms 1.2428 ms]
                        change: [−1.4076% −1.2513% −1.0985%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

strpos_StringArray_ascii_str_len_4096
                        time:   [4.4126 ms 4.4207 ms 4.4292 ms]
                        change: [−76.979% −76.930% −76.887%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

strpos_StringArray_utf8_str_len_4096
                        time:   [36.033 ms 36.097 ms 36.179 ms]
                        change: [−1.2339% −1.0242% −0.7534%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

strpos_StringViewArray_ascii_str_len_4096
                        time:   [4.6480 ms 4.6559 ms 4.6643 ms]
                        change: [−75.980% −75.938% −75.899%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe

strpos_StringViewArray_utf8_str_len_4096
                        time:   [36.095 ms 36.134 ms 36.173 ms]
                        change: [−1.0341% −0.9052% −0.7789%] (p = 0.00 < 0.05)
                        Change within noise threshold.

@github-actions bot added the "functions" (Changes to functions implementation) label on Feb 11, 2026
@neilconway
Contributor Author

I added some notes on the approach and some possible future improvements to #20294.

@kumarUjjawal
Contributor

It would be great if you could use the PR template with relevant details, it maintains consistency.

@neilconway
Contributor Author

> It would be great if you could use the PR template with relevant details, it maintains consistency.

Sure. I've been removing sections that would otherwise have been left empty, but I can leave the full template in if you'd prefer.

@neilconway force-pushed the neilc/optimize-strpos branch from 1cb12b8 to 7e1ef9f on February 11, 2026 19:00
Contributor

@2010YOUY01 left a comment

Nice optimization!

I left an idea for a potential simplification; if it turns out to be slower, we can proceed with the current implementation.

/// `memchr` does not, and strpos is often invoked many times on short inputs.
/// Returns a 1-based position, or 0 if not found.
/// Both inputs must be ASCII-only.
fn find_ascii_substring(haystack: &[u8], needle: &[u8]) -> usize {
Contributor

Is it possible to use memchr::memmem::find() directly? Based on the Complexity section of its documentation, it seems to implement the same algorithm.
https://docs.rs/memchr/latest/memchr/memmem/fn.find.html

Contributor Author

@neilconway Feb 12, 2026

Thanks for the suggestion! When I tried using memmem::find(), it was substantially slower -- presumably because it incurs some per-call overhead (I'd imagine setting up lookup tables etc.) that memchr does not.

I'd like to explore optimizing the (common) case where strpos() is invoked with a constant substring; in that case we could construct a memmem::Finder once and reuse it for the entire input batch. But this PR is already a significant win, so my thought was to defer that to a subsequent PR.
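
A minimal sketch of the "build once, reuse for the whole batch" idea mentioned above, under stated assumptions: `PreparedNeedle` and `strpos_constant_needle` are hypothetical names, and the Horspool skip table is a std-only stand-in for whatever per-needle setup `memchr::memmem::Finder::new()` performs. With the memchr crate, the same shape would presumably be `Finder::new(needle)` hoisted out of the per-row loop, followed by `finder.find(haystack)` per row. Inputs are assumed to be ASCII, so byte offsets equal character positions.

```rust
/// Hypothetical stand-in for `memchr::memmem::Finder`: pays a per-needle
/// setup cost once (here, a Horspool skip table) so each search is cheap.
struct PreparedNeedle<'a> {
    needle: &'a [u8],
    skip: [usize; 256],
}

impl<'a> PreparedNeedle<'a> {
    fn new(needle: &'a [u8]) -> Self {
        // Default shift: the whole needle length.
        let mut skip = [needle.len(); 256];
        // For every needle byte except the last, record how far the window
        // can move when that byte is the final byte of the current window.
        let last = needle.len().saturating_sub(1);
        for (i, &b) in needle.iter().enumerate().take(last) {
            skip[b as usize] = needle.len() - 1 - i;
        }
        Self { needle, skip }
    }

    /// Returns the 0-based byte offset of the first match, if any.
    fn find(&self, haystack: &[u8]) -> Option<usize> {
        let n = self.needle.len();
        if n == 0 {
            return Some(0);
        }
        let mut pos = 0;
        while pos + n <= haystack.len() {
            if &haystack[pos..pos + n] == self.needle {
                return Some(pos);
            }
            // Shift by the skip value of the window's final byte.
            pos += self.skip[haystack[pos + n - 1] as usize];
        }
        None
    }
}

/// Applies one prepared needle to a whole batch, returning 1-based
/// positions, with 0 meaning "not found".
fn strpos_constant_needle(batch: &[&str], needle: &str) -> Vec<usize> {
    // Built once per batch rather than once per row.
    let finder = PreparedNeedle::new(needle.as_bytes());
    batch
        .iter()
        .map(|s| finder.find(s.as_bytes()).map_or(0, |i| i + 1))
        .collect()
}
```

The design point is only that the setup cost moves from every row to once per batch; whether that beats the memchr-based path on short inputs would need to be measured with the strpos benchmarks.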

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize strpos() for ASCII inputs

3 participants