perf: Implement spark_translate function to improve translate performance
#2993
Conversation
Thank you for the PR @shuch3ng. Any reason why we can't improve upstream (DataFusion)'s translate instead?
Hi @coderfender. That's a good question. I implemented a separate function because, upon checking DataFusion's translate, I found its behaviour differs from Spark's in a couple of ways. I'm not very familiar with the DataFusion project so I could be wrong, but pushing the changes upstream could affect other use cases in DataFusion.
See my comment on #2976; this PR can be closed if that is confirmed to be a non-issue. W.r.t. the two differences mentioned above, they are out of scope.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #2993 +/- ##
============================================
+ Coverage 56.12% 59.58% +3.46%
- Complexity 976 1377 +401
============================================
Files 119 167 +48
Lines 11743 15493 +3750
Branches 2251 2569 +318
============================================
+ Hits 6591 9232 +2641
- Misses 4012 4962 +950
- Partials 1140 1299 +159
☔ View full report in Codecov by Sentry.
Apologies for not reviewing this PR sooner. I agree with @shuch3ng that the Comet requirements are not compatible with the DataFusion implementation. It makes sense to implement in this repo, or to eventually upstream to the
andygrove left a comment
Thanks for the contribution @shuch3ng.
This looks good overall, but my main concern is around argument pattern coverage. The implementation only handles the case where from and to are ScalarValue::Utf8(Some(...)), but CometScalarFunction("translate") sends all
argument patterns to the native side without filtering, and there's no runtime fallback to Spark. The existing SQL tests in string_translate.sql include queries like SELECT translate(s, from_str, to_str) FROM test_translate where all three arguments are columns — these would hit the error branch since the match only accepts [Array, Scalar, Scalar]. CI is still pending, but I'd expect those tests to fail.
A few ways this could be addressed: fall back to DataFusion's built-in translate for unmatched patterns (registering it from the registry for the general case), add a custom serde handler (like CometSubstring does) that only routes to the custom implementation when from/to are foldable, or handle all argument patterns in the Rust code.
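To make that concrete, here is a rough Rust sketch of the third option (handling all argument patterns in the native code). This is not the PR's implementation: the names spark_translate_fn, as_string, build_map, and apply_map are made up for illustration, and helpers such as ColumnarValue::values_to_arrays assume a reasonably recent DataFusion API.

```rust
// Illustrative sketch only; not the code from this PR.
use std::collections::HashMap;
use std::sync::Arc;

use datafusion::arrow::array::{Array, ArrayRef, StringArray};
use datafusion::common::ScalarValue;
use datafusion::error::{DataFusionError, Result};
use datafusion::logical_expr::ColumnarValue;

/// Hypothetical dispatch that accepts any mix of column/literal arguments
/// instead of erroring on everything except [Array, Scalar, Scalar].
fn spark_translate_fn(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    match args {
        // Fast path: string column plus two string literals, the common Spark plan shape.
        [ColumnarValue::Array(input), ColumnarValue::Scalar(ScalarValue::Utf8(Some(from))), ColumnarValue::Scalar(ScalarValue::Utf8(Some(to)))] =>
        {
            let input = as_string(input)?;
            let map = build_map(from, to); // built once for the whole batch
            let out: StringArray = input
                .iter()
                .map(|v| v.map(|s| apply_map(s, &map)))
                .collect();
            Ok(ColumnarValue::Array(Arc::new(out) as ArrayRef))
        }
        // General path: expand everything to arrays so queries like
        // `SELECT translate(s, from_str, to_str) FROM test_translate` (all columns) still work.
        [_, _, _] => {
            let arrays = ColumnarValue::values_to_arrays(args)?;
            let (input, from, to) =
                (as_string(&arrays[0])?, as_string(&arrays[1])?, as_string(&arrays[2])?);
            let out: StringArray = (0..input.len())
                .map(|i| {
                    if input.is_null(i) || from.is_null(i) || to.is_null(i) {
                        None
                    } else {
                        let map = build_map(from.value(i), to.value(i));
                        Some(apply_map(input.value(i), &map))
                    }
                })
                .collect();
            Ok(ColumnarValue::Array(Arc::new(out) as ArrayRef))
        }
        _ => Err(DataFusionError::Execution(
            "translate expects exactly three arguments".to_string(),
        )),
    }
}

fn as_string(array: &ArrayRef) -> Result<&StringArray> {
    array
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| DataFusionError::Execution("expected Utf8 input".to_string()))
}

/// Map each `from` char to the `to` char at the same index; `from` chars past
/// the end of `to` map to None (delete). This sketch keeps the first mapping
/// when `from` repeats a character.
fn build_map(from: &str, to: &str) -> HashMap<char, Option<char>> {
    let mut map = HashMap::new();
    let mut to_chars = to.chars();
    for f in from.chars() {
        let t = to_chars.next();
        map.entry(f).or_insert(t);
    }
    map
}

fn apply_map(s: &str, map: &HashMap<char, Option<char>>) -> String {
    s.chars()
        .filter_map(|c| match map.get(&c) {
            Some(Some(t)) => Some(*t), // replaced
            Some(None) => None,        // in `from` but no counterpart in `to`: dropped
            None => Some(c),           // untouched
        })
        .collect()
}
```

In this sketch the literal fast path builds the character map once per batch, which is where the speed-up over a generic per-row implementation would come from, while the general arm rebuilds it per row but keeps the all-column queries from string_translate.sql working. A serde-side guard (as with CometSubstring) would be an alternative way to reach the same coverage.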
Which issue does this PR close?
Closes #2976.
Rationale for this change
What changes are included in this PR?
This PR implements a custom spark_translate function optimised for Spark's translate (a rough illustrative sketch of the semantics follows at the end of this description).
How are these changes tested?
Unit tests are included.
Benchmark:
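For context, here is a minimal std-only sketch of the Spark translate semantics this function has to reproduce; the spark_translate helper below is illustrative and is not the code added by this PR.

```rust
// Illustrative only: Spark-style translate semantics in plain Rust.
// Each character of `from` maps to the character of `to` at the same position;
// characters of `from` beyond the length of `to` are removed from the output,
// and characters not in `from` pass through unchanged.
use std::collections::HashMap;

fn spark_translate(input: &str, from: &str, to: &str) -> String {
    let mut map: HashMap<char, Option<char>> = HashMap::new();
    let mut to_chars = to.chars();
    for f in from.chars() {
        let t = to_chars.next();
        map.entry(f).or_insert(t);
    }
    input
        .chars()
        .filter_map(|c| match map.get(&c) {
            Some(Some(t)) => Some(*t), // replaced
            Some(None) => None,        // no counterpart in `to`: dropped
            None => Some(c),           // untouched
        })
        .collect()
}

fn main() {
    // Matches the example in the Spark SQL docs: translate('AaBbCc', 'abc', '123') = 'A1B2C3'.
    assert_eq!(spark_translate("AaBbCc", "abc", "123"), "A1B2C3");
    // `to` shorter than `from`: 'l' has no counterpart and is removed.
    assert_eq!(spark_translate("hello", "el", "i"), "hio");
    println!("ok");
}
```

The deletion case, where `from` is longer than `to`, is the behaviour a naive one-to-one character replacement would miss.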