perf: Implement spark_translate function to improve translate performance
#2993
Conversation
Thank you for the PR @shuch3ng. Any reason why we can't improve upstream (DataFusion)'s translate instead?
Hi @coderfender. That's a good question. I implemented a separate function because, upon checking DataFusion's translate, I found its behaviour differs from Spark's in a couple of ways. I'm not very familiar with the DataFusion project so I could be wrong, but pushing the changes upstream could affect other use cases in DataFusion.
See my comment on #2976; this PR can be closed if that is confirmed to be a non-issue. W.r.t. the two differences mentioned above, they are out of scope.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #2993 +/- ##
============================================
+ Coverage 56.12% 59.58% +3.46%
- Complexity 976 1377 +401
============================================
Files 119 167 +48
Lines 11743 15493 +3750
Branches 2251 2569 +318
============================================
+ Hits 6591 9232 +2641
- Misses 4012 4962 +950
- Partials 1140 1299 +159
☔ View full report in Codecov by Sentry.
Apologies for not reviewing this PR sooner. I agree with @shuch3ng that the Comet requirements are not compatible with the DataFusion implementation. It makes sense to implement in this repo, or to eventually upstream to the
andygrove left a comment
Thanks for the contribution @shuch3ng.
This looks good overall, but my main concern is around argument pattern coverage. The implementation only handles the case where from and to are ScalarValue::Utf8(Some(...)), but CometScalarFunction("translate") sends all
argument patterns to the native side without filtering, and there's no runtime fallback to Spark. The existing SQL tests in string_translate.sql include queries like SELECT translate(s, from_str, to_str) FROM test_translate where all three arguments are columns — these would hit the error branch since the match only accepts [Array, Scalar, Scalar]. CI is still pending, but I'd expect those tests to fail.
A few ways this could be addressed: fall back to DataFusion's built-in translate for unmatched patterns (registering it from the registry for the general case), add a custom serde handler (like CometSubstring does) that only routes to the custom implementation when from/to are foldable, or handle all argument patterns in the Rust code.
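To make that concrete, here is a rough Rust sketch of the third option (handling all argument patterns in the native code). This is not the PR's implementation: the names spark_translate_fn, as_string, build_map, and apply_map are made up for illustration, and helpers such as ColumnarValue::values_to_arrays assume a reasonably recent DataFusion API.

```rust
// Illustrative sketch only; not the code from this PR.
use std::collections::HashMap;
use std::sync::Arc;

use datafusion::arrow::array::{Array, ArrayRef, StringArray};
use datafusion::common::ScalarValue;
use datafusion::error::{DataFusionError, Result};
use datafusion::logical_expr::ColumnarValue;

/// Hypothetical dispatch that accepts any mix of column/literal arguments
/// instead of erroring on everything except [Array, Scalar, Scalar].
fn spark_translate_fn(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    match args {
        // Fast path: string column plus two string literals, the common Spark plan shape.
        [ColumnarValue::Array(input), ColumnarValue::Scalar(ScalarValue::Utf8(Some(from))), ColumnarValue::Scalar(ScalarValue::Utf8(Some(to)))] =>
        {
            let input = as_string(input)?;
            let map = build_map(from, to); // built once for the whole batch
            let out: StringArray = input
                .iter()
                .map(|v| v.map(|s| apply_map(s, &map)))
                .collect();
            Ok(ColumnarValue::Array(Arc::new(out) as ArrayRef))
        }
        // General path: expand everything to arrays so queries like
        // `SELECT translate(s, from_str, to_str) FROM test_translate` (all columns) still work.
        [_, _, _] => {
            let arrays = ColumnarValue::values_to_arrays(args)?;
            let (input, from, to) =
                (as_string(&arrays[0])?, as_string(&arrays[1])?, as_string(&arrays[2])?);
            let out: StringArray = (0..input.len())
                .map(|i| {
                    if input.is_null(i) || from.is_null(i) || to.is_null(i) {
                        None
                    } else {
                        let map = build_map(from.value(i), to.value(i));
                        Some(apply_map(input.value(i), &map))
                    }
                })
                .collect();
            Ok(ColumnarValue::Array(Arc::new(out) as ArrayRef))
        }
        _ => Err(DataFusionError::Execution(
            "translate expects exactly three arguments".to_string(),
        )),
    }
}

fn as_string(array: &ArrayRef) -> Result<&StringArray> {
    array
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| DataFusionError::Execution("expected Utf8 input".to_string()))
}

/// Map each `from` char to the `to` char at the same index; `from` chars past
/// the end of `to` map to None (delete). This sketch keeps the first mapping
/// when `from` repeats a character.
fn build_map(from: &str, to: &str) -> HashMap<char, Option<char>> {
    let mut map = HashMap::new();
    let mut to_chars = to.chars();
    for f in from.chars() {
        let t = to_chars.next();
        map.entry(f).or_insert(t);
    }
    map
}

fn apply_map(s: &str, map: &HashMap<char, Option<char>>) -> String {
    s.chars()
        .filter_map(|c| match map.get(&c) {
            Some(Some(t)) => Some(*t), // replaced
            Some(None) => None,        // in `from` but no counterpart in `to`: dropped
            None => Some(c),           // untouched
        })
        .collect()
}
```

In this sketch the literal fast path builds the character map once per batch, which is where the speed-up over a generic per-row implementation would come from, while the general arm rebuilds it per row but keeps the all-column queries from string_translate.sql working. A serde-side guard (as with CometSubstring) would be an alternative way to reach the same coverage.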
Which issue does this PR close?
Closes #2976.
Rationale for this change
What changes are included in this PR?
This PR implements a custom spark_translate function optimised for Spark's translate (a rough illustrative sketch of the semantics follows at the end of this description).
How are these changes tested?
Unit tests are included.
Benchmark:
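For context, here is a minimal std-only sketch of the Spark translate semantics this function has to reproduce; the spark_translate helper below is illustrative and is not the code added by this PR.

```rust
// Illustrative only: Spark-style translate semantics in plain Rust.
// Each character of `from` maps to the character of `to` at the same position;
// characters of `from` beyond the length of `to` are removed from the output,
// and characters not in `from` pass through unchanged.
use std::collections::HashMap;

fn spark_translate(input: &str, from: &str, to: &str) -> String {
    let mut map: HashMap<char, Option<char>> = HashMap::new();
    let mut to_chars = to.chars();
    for f in from.chars() {
        let t = to_chars.next();
        map.entry(f).or_insert(t);
    }
    input
        .chars()
        .filter_map(|c| match map.get(&c) {
            Some(Some(t)) => Some(*t), // replaced
            Some(None) => None,        // no counterpart in `to`: dropped
            None => Some(c),           // untouched
        })
        .collect()
}

fn main() {
    // Matches the example in the Spark SQL docs: translate('AaBbCc', 'abc', '123') = 'A1B2C3'.
    assert_eq!(spark_translate("AaBbCc", "abc", "123"), "A1B2C3");
    // `to` shorter than `from`: 'l' has no counterpart and is removed.
    assert_eq!(spark_translate("hello", "el", "i"), "hio");
    println!("ok");
}
```

The deletion case, where `from` is longer than `to`, is the behaviour a naive one-to-one character replacement would miss.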