Labels: enhancement (New feature or request)
What is the problem the feature request solves?
RFC: Cast Subsystem Modularization
Author: Bhargava Vadlamani (@coderfender)
Summary
Refactor the monolithic cast.rs (~3,700 LOC) into separate, focused modules organized by data type category, using a two-level static dispatch flow.
Problems
- Merge Conflicts - Multiple contributors editing the same large file causes frequent conflicts
- No Enforced Contract - No trait defining what a cast implementation must provide
- Poor Discoverability - Hard to find specific cast logic in 3,700 lines
- Scattered Changes - Adding a new cast type requires edits across distant parts of the file
Proposal
Trait Definition
```rust
/// Core trait for Spark-compatible cast operations.
///
/// Nice to have: implementations should branch on `EvalMode` once per array
/// (payload level), not per row, to reduce branching overhead.
pub trait SparkCast: Debug {
    fn can_cast(&self, from: &DataType, to: &DataType) -> bool;

    fn cast(
        &self,
        array: &ArrayRef,
        from: &DataType,
        to: &DataType,
        opts: &SparkCastOptions,
    ) -> SparkResult<ArrayRef>;
}
```
Module Structure
(Currently cast.rs is a single file containing 100+ lines of match statements and cast functions with differing signatures, which makes it difficult to maintain.)
```
native/spark-expr/src/cast/
├── mod.rs            # Public API, dispatcher
├── traits.rs         # SparkCast trait
├── string.rs         # String ↔ other types
├── numeric.rs        # Integer, Float, Decimal
├── temporal.rs       # Date, Timestamp
├── boolean.rs        # Boolean conversions
├── complex_types.rs  # Struct, Array, Map
├── binary.rs         # Binary conversions
└── utils.rs          # Shared helpers
```
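To make the proposed contract concrete, here is a sketch of how a module such as boolean.rs might implement the SparkCast trait. Stand-in types replace the Arrow/Comet ones (`DataType`, `ArrayRef`, `SparkCastOptions`) so the snippet compiles on its own; `BooleanCast` and the error message are illustrative, not the project's actual code.

```rust
use std::fmt::Debug;

// Stand-ins for the Arrow/Comet types referenced in this RFC, so the sketch
// is self-contained; real code would use arrow's DataType and ArrayRef.
#[derive(Debug, PartialEq, Clone)]
pub enum DataType { Boolean, Utf8, Int8 }
#[derive(Debug)]
pub struct SparkCastOptions;
pub type ArrayRef = Vec<Option<bool>>; // placeholder for Arc<dyn Array>
pub type SparkResult<T> = Result<T, String>;

pub trait SparkCast: Debug {
    fn can_cast(&self, from: &DataType, to: &DataType) -> bool;
    fn cast(
        &self,
        array: &ArrayRef,
        from: &DataType,
        to: &DataType,
        opts: &SparkCastOptions,
    ) -> SparkResult<ArrayRef>;
}

/// Hypothetical cast implementation that would live in boolean.rs.
#[derive(Debug)]
pub struct BooleanCast;

impl SparkCast for BooleanCast {
    fn can_cast(&self, from: &DataType, to: &DataType) -> bool {
        *from == DataType::Boolean && matches!(to, DataType::Utf8 | DataType::Int8)
    }

    fn cast(
        &self,
        array: &ArrayRef,
        from: &DataType,
        to: &DataType,
        _opts: &SparkCastOptions,
    ) -> SparkResult<ArrayRef> {
        if !self.can_cast(from, to) {
            return Err(format!("unsupported cast {from:?} -> {to:?}"));
        }
        // A real implementation would build a new typed output array here.
        Ok(array.clone())
    }
}
```

One benefit of the trait is that `can_cast` makes supported conversions queryable in one place, instead of being implied by scattered match arms.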
Two-Level Static Dispatch
Level 1 - mod.rs (routes by from type category):
```rust
match from {
    _ if is_string(from) => string::cast_from_string(array, to, opts),
    _ if is_boolean(from) => boolean::cast_from_boolean(array, to, opts),
    _ if is_numeric(from) => numeric::cast_from_numeric(array, from, to, opts),
    _ if is_temporal(from) => temporal::cast_from_temporal(array, from, to, opts),
    _ if is_complex(from) => complex_types::cast_from_complex(array, from, to, opts),
    _ if is_binary(from) => binary::cast_from_binary(array, to, opts),
    // DataFusion-compatible casts can be handled here or in the sub-module itself
}
```
utils.rs (type category helpers):
```rust
pub fn is_numeric(dt: &DataType) -> bool {
    matches!(dt, Int8 | Int16 | Int32 | Int64 | Float32 | Float64 | Decimal128(_, _))
}

pub fn is_temporal(dt: &DataType) -> bool {
    matches!(dt, Date32 | Date64 | Timestamp(_, _))
}

pub fn is_string(dt: &DataType) -> bool {
    matches!(dt, Utf8 | LargeUtf8)
}

// ...
```
Level 2 - string.rs (routes by to type):
```rust
pub fn cast_from_string(array: &ArrayRef, to: &DataType, opts: &SparkCastOptions) -> SparkResult<ArrayRef> {
    match to {
        Boolean => utf8_to_boolean(array, opts),
        Int8 => utf8_to_int8(array, opts),
        Int16 => utf8_to_int16(array, opts),
        // ...
    }
}
```
Level 3 - actual function (branches by eval_mode at the array level; this is adaptable, and not all casts will need a separate implementation per EvalMode):
```rust
fn utf8_to_int8(array: &ArrayRef, opts: &SparkCastOptions) -> SparkResult<ArrayRef> {
    match opts.eval_mode {
        EvalMode::Legacy => utf8_to_int8_legacy(array),
        EvalMode::Ansi => utf8_to_int8_ansi(array),
        EvalMode::Try => utf8_to_int8_try(array),
    }
}
```
Migration Strategy
Phase 1: Setup (1 PR)
- Create the `cast/` directory
- Add `mod.rs`, `traits.rs`, and `utils.rs` with type category helpers
- Start with the smallest module: `boolean.rs`
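During Phase 1, the crate's public surface can stay stable by re-exporting from the new sub-modules. Below is a self-contained sketch of this pattern: inline modules stand in for the separate `traits.rs` / `boolean.rs` files, and every name except `SparkCast` is illustrative.

```rust
// Sketch of cast/mod.rs during Phase 1. Inline modules stand in for the
// separate files so this snippet compiles on its own.
mod traits {
    /// Minimal stand-in for the proposed SparkCast trait.
    pub trait SparkCast: std::fmt::Debug {
        fn name(&self) -> &'static str;
    }
}

mod boolean {
    use super::traits::SparkCast;

    #[derive(Debug)]
    pub struct BooleanCast;

    impl SparkCast for BooleanCast {
        fn name(&self) -> &'static str {
            "boolean"
        }
    }
}

// Re-exports keep the crate's public surface unchanged while logic moves
// out of the legacy cast.rs, so existing imports keep compiling.
pub use boolean::BooleanCast;
pub use traits::SparkCast;
```

Anything not yet migrated can remain in the legacy file and be re-exported from `mod.rs` the same way until Phase 3 removes it.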
**Phase 2: Migrate data types to individual sub-modules**
- Refactor the remaining cast operations into their sub-modules, grouped by data type
Phase 3: Cleanup (1 PR)
- Remove the old `cast.rs`
- Refactor imports across the cast subsystem
Testing
- Existing tests remain valid (or need only minimal changes to adhere to the standard cast function signature)
- Each module gets its own `#[cfg(test)]` section (easier to add further tests)
- No new tests needed initially; existing coverage carries over
- Future: add module-specific unit tests as needed
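The per-module test layout can be sketched as follows. `boolean_to_string` is a hypothetical helper used only for illustration; the real tests would exercise the migrated cast functions.

```rust
// Sketch of a module (e.g. boolean.rs) carrying its own #[cfg(test)] section.
pub fn boolean_to_string(v: Option<bool>) -> Option<String> {
    // Spark renders booleans as "true"/"false"; nulls stay null.
    v.map(|b| b.to_string())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn casts_non_null_values() {
        assert_eq!(boolean_to_string(Some(true)), Some("true".to_string()));
        assert_eq!(boolean_to_string(Some(false)), Some("false".to_string()));
    }

    #[test]
    fn preserves_nulls() {
        assert_eq!(boolean_to_string(None), None);
    }
}
```

Keeping tests next to the code they cover also makes it obvious which module a failing case belongs to.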
Benchmarking
- Existing benchmarks will move into the corresponding data type module
- All code changes will be benchmarked to ensure modularization does not compromise latency
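The latency concern can also be sanity-checked with a std-only micro-benchmark comparing the proposed static (match-based) dispatch against the trait-object alternative discussed below. This is an illustrative sketch, not the project's actual bench suite, and `cast_static` / `Mul3` are made-up stand-ins for real cast functions.

```rust
use std::time::{Duration, Instant};

// Static dispatch: a plain function the compiler can inline.
fn cast_static(x: i64) -> i64 {
    x.wrapping_mul(3)
}

// Dynamic dispatch: the same operation behind a trait object.
trait Cast {
    fn cast(&self, x: i64) -> i64;
}

struct Mul3;
impl Cast for Mul3 {
    fn cast(&self, x: i64) -> i64 {
        x.wrapping_mul(3)
    }
}

/// Times `n` calls through each dispatch style; returns (static, dynamic).
fn time_dispatch(n: i64) -> (Duration, Duration) {
    let dynamic: Box<dyn Cast> = Box::new(Mul3);

    let t = Instant::now();
    let mut acc = 0i64;
    for i in 0..n {
        acc = acc.wrapping_add(cast_static(i));
    }
    let static_time = t.elapsed();
    std::hint::black_box(acc); // keep the loop from being optimized away

    let t = Instant::now();
    let mut acc = 0i64;
    for i in 0..n {
        acc = acc.wrapping_add(dynamic.cast(i));
    }
    let dynamic_time = t.elapsed();
    std::hint::black_box(acc);

    (static_time, dynamic_time)
}
```

Note that the optimizer may devirtualize a monomorphic call site like this one, so any real measurement should be done on the actual cast kernels.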
Alternatives Considered
- Keep a single file but reorganize for readability - does not solve merge conflicts
- Trait objects with dynamic dispatch (a table of cast impls routing to the right module) - virtual call overhead could add performance cost
- Macros - less readable, harder to debug
Open Questions
- Should we handle Dictionary types in a separate module or within each category?
- The naming convention could be improved
- Generate dynamic cast-support documentation (similar to the Scala side), and possibly detect unreachable paths at compile time?
Future enhancements
- Once this is live, I plan to apply the same approach to other Comet native functions
Update after review from @comphead and @andygrove
- Definitely a +1 to modularizing the code and moving cast ops into sub-modules per cast
- Trait usage might be unnecessary