
RFC: Cast Subsystem Modularization #3459


Description


What is the problem the feature request solves?

RFC: Cast Subsystem Modularization

Author: Bhargava Vadlamani (@coderfender)

Summary

Refactor the monolithic cast.rs (3,700 LOC) into separate, focused modules organized by data-type category, using a two-level static dispatch flow.

Problems

  1. Merge Conflicts - Multiple contributors editing the same large file causes frequent conflicts
  2. No Enforced Contract - No trait defining what a cast implementation must provide
  3. Poor Discoverability - Hard to find specific cast logic in 3,700 lines
  4. Scattered Changes - Adding a new cast type requires edits across distant parts of the file

Proposal

Trait Definition

use std::fmt::Debug;

use arrow::array::ArrayRef;
use arrow::datatypes::DataType;

/// Core trait for Spark-compatible cast operations.
/// Nice to have: implementations should branch on EvalMode at the array
/// (payload) level, not per row, to reduce branching overhead.
pub trait SparkCast: Debug {
    /// Returns true if this implementation supports casting `from` to `to`.
    fn can_cast(&self, from: &DataType, to: &DataType) -> bool;

    /// Casts `array` from type `from` to type `to` under the given options.
    fn cast(
        &self,
        array: &ArrayRef,
        from: &DataType,
        to: &DataType,
        opts: &SparkCastOptions,
    ) -> SparkResult<ArrayRef>;
}
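As an illustration only, a per-module implementor of this trait might look like the following (a minimal sketch, not the final implementation; BooleanCast and cast_from_boolean are hypothetical names, and the supported type pairs shown are just examples):

use std::fmt::Debug;

use arrow::array::ArrayRef;
use arrow::datatypes::DataType;

/// Hypothetical implementor that would live in boolean.rs.
#[derive(Debug)]
pub struct BooleanCast;

impl SparkCast for BooleanCast {
    // Advertises only the type pairs this module knows how to handle.
    fn can_cast(&self, from: &DataType, to: &DataType) -> bool {
        matches!(from, DataType::Boolean)
            && matches!(to, DataType::Utf8 | DataType::Int8 | DataType::Int16 | DataType::Int32)
    }

    // Delegates to the module's Level-2 dispatcher (shown later).
    fn cast(
        &self,
        array: &ArrayRef,
        _from: &DataType,
        to: &DataType,
        opts: &SparkCastOptions,
    ) -> SparkResult<ArrayRef> {
        cast_from_boolean(array, to, opts)
    }
}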

Module Structure

(We currently have only cast.rs as a single file, with 100+ lines of match statements and cast functions with differing signatures, making it difficult to maintain.)

native/spark-expr/src/cast/
├── mod.rs              # Public API, dispatcher
├── traits.rs           # SparkCast trait
├── string.rs           # String ↔ other types
├── numeric.rs          # Integer, Float, Decimal
├── temporal.rs         # Date, Timestamp
├── boolean.rs          # Boolean conversions
├── complex_types.rs    # Struct, Array, Map
├── binary.rs           # Binary conversions
└── utils.rs            # Shared helpers
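For context, mod.rs would mostly be wiring: declare the sub-modules and expose one public entry point (a sketch; cast_array is a hypothetical name for the dispatcher whose body is the Level-1 match below):

// mod.rs: wires the sub-modules together and re-exports the public surface.
mod binary;
mod boolean;
mod complex_types;
mod numeric;
mod string;
mod temporal;
mod traits;
mod utils;

pub use traits::SparkCast;

use arrow::array::ArrayRef;
use arrow::datatypes::DataType;

pub fn cast_array(
    array: &ArrayRef,
    from: &DataType,
    to: &DataType,
    opts: &SparkCastOptions,
) -> SparkResult<ArrayRef> {
    // Body is the Level-1 match shown in the next snippet.
    unimplemented!()
}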

Two-Level Static Dispatch

Level 1 - mod.rs (routes by from type category):

match from {
    _ if is_string(from) => string::cast_from_string(array, to, opts),
    _ if is_boolean(from) => boolean::cast_from_boolean(array, to, opts),
    _ if is_numeric(from) => numeric::cast_from_numeric(array, from, to, opts),
    _ if is_temporal(from) => temporal::cast_from_temporal(array, from, to, opts),
    _ if is_complex(from) => complex_types::cast_from_complex(array, from, to, opts),
    _ if is_binary(from) => binary::cast_from_binary(array, to, opts),
    // Catch-all: pursue DataFusion-compatible casts here or in the sub-module
    // itself (hypothetical fallback helper).
    _ => datafusion_compatible_cast(array, to, opts),
}

utils.rs (type category helpers):

use arrow::datatypes::DataType::{self, *};

pub fn is_numeric(dt: &DataType) -> bool {
    matches!(dt, Int8 | Int16 | Int32 | Int64 | Float32 | Float64 | Decimal128(_, _))
}

pub fn is_temporal(dt: &DataType) -> bool {
    matches!(dt, Date32 | Date64 | Timestamp(_, _))
}

pub fn is_string(dt: &DataType) -> bool {
    matches!(dt, Utf8 | LargeUtf8)
}

// ...
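These helpers keep the Level-1 guards readable, e.g.:

assert!(is_numeric(&DataType::Int32));
assert!(is_temporal(&DataType::Date32));
assert!(!is_string(&DataType::Boolean));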

Level 2 - string.rs (routes by to type):

use arrow::array::ArrayRef;
use arrow::datatypes::DataType::{self, *};

pub fn cast_from_string(array: &ArrayRef, to: &DataType, opts: &SparkCastOptions) -> SparkResult<ArrayRef> {
    match to {
        Boolean => utf8_to_boolean(array, opts),
        Int8 => utf8_to_int8(array, opts),
        Int16 => utf8_to_int16(array, opts),
        // ...
        _ => Err(unsupported_cast_error(to)), // hypothetical error helper
    }
}

Level 3 - actual function (branches on eval_mode at the array level). This is adaptable: not every cast will need a separate implementation per EvalMode.

fn utf8_to_int8(array: &ArrayRef, opts: &SparkCastOptions) -> SparkResult<ArrayRef> {
    match opts.eval_mode {
        EvalMode::Legacy => utf8_to_int8_legacy(array),
        EvalMode::Ansi => utf8_to_int8_ansi(array),
        EvalMode::Try => utf8_to_int8_try(array),
    }
}
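To make the EvalMode split concrete, the ANSI and TRY variants of a leaf function would differ mainly in how they treat unparsable input (a simplified sketch: cast_invalid_value_error is a hypothetical error constructor, and Spark's full string-to-int parsing rules are elided):

use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int8Array, StringArray};

// ANSI mode: fail on the first unparsable value, matching Spark's ANSI cast.
fn utf8_to_int8_ansi(array: &ArrayRef) -> SparkResult<ArrayRef> {
    let input = array.as_any().downcast_ref::<StringArray>().unwrap();
    let mut builder = Int8Array::builder(input.len());
    for i in 0..input.len() {
        if input.is_null(i) {
            builder.append_null();
        } else {
            match input.value(i).trim().parse::<i8>() {
                Ok(v) => builder.append_value(v),
                // Hypothetical error constructor; real code would surface
                // Spark's invalid-input cast error.
                Err(_) => return Err(cast_invalid_value_error(input.value(i))),
            }
        }
    }
    Ok(Arc::new(builder.finish()))
}

// TRY mode: same parse, but unparsable values become null instead of errors.
fn utf8_to_int8_try(array: &ArrayRef) -> SparkResult<ArrayRef> {
    let input = array.as_any().downcast_ref::<StringArray>().unwrap();
    let values: Int8Array = input
        .iter()
        .map(|v| v.and_then(|s| s.trim().parse::<i8>().ok()))
        .collect();
    Ok(Arc::new(values))
}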

Migration Strategy

Phase 1: Setup (1 PR)

  • Create cast/ directory
  • Add mod.rs, traits.rs, utils.rs with type category helpers
  • Start with the smallest module, boolean.rs (see the sketch below)
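A minimal starting point for boolean.rs might look like this (a sketch; bool_to_string is a hypothetical leaf function, and the remaining target types would be ported incrementally):

use std::sync::Arc;

use arrow::array::{ArrayRef, BooleanArray, StringArray};
use arrow::datatypes::DataType;

/// Level-2 dispatcher for casts whose source type is Boolean.
pub fn cast_from_boolean(array: &ArrayRef, to: &DataType, _opts: &SparkCastOptions) -> SparkResult<ArrayRef> {
    match to {
        DataType::Utf8 => bool_to_string(array),
        // ... remaining target types migrate here from cast.rs
        _ => todo!("port remaining boolean casts"),
    }
}

// Spark renders booleans as "true"/"false" when cast to string.
fn bool_to_string(array: &ArrayRef) -> SparkResult<ArrayRef> {
    let input = array.as_any().downcast_ref::<BooleanArray>().unwrap();
    let values: StringArray = input
        .iter()
        .map(|v| v.map(|b| if b { "true" } else { "false" }))
        .collect();
    Ok(Arc::new(values))
}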

Phase 2: Migrate data types to individual sub-modules

  • Migrate the remaining cast operations into their data-type sub-modules

Phase 3: Cleanup (1 PR)

  • Remove the old cast.rs
  • Update imports across the cast subsystem

Testing

  • Existing tests remain valid (or need only minimal changes to adhere to the standard cast function signature)
  • Each module gets its own #[cfg(test)] section, making it easier to add further tests (see the example below)
  • No new tests needed initially - existing coverage carries over
  • Future: add module-specific unit tests as needed
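As an illustration of the per-module test layout (a sketch, assuming the hypothetical bool_to_string from the boolean.rs sketch above):

#[cfg(test)]
mod tests {
    use super::*;
    use std::sync::Arc;

    use arrow::array::{Array, ArrayRef, BooleanArray, StringArray};

    #[test]
    fn bool_to_string_matches_spark() {
        let input: ArrayRef = Arc::new(BooleanArray::from(vec![Some(true), None, Some(false)]));
        let result = bool_to_string(&input).unwrap();
        let strings = result.as_any().downcast_ref::<StringArray>().unwrap();
        assert_eq!(strings.value(0), "true");
        assert!(strings.is_null(1));
        assert_eq!(strings.value(2), "false");
    }
}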

Benchmarking

  1. Existing benchmarks will move into the corresponding data-type module (see the sketch below)
  2. All code changes will be benchmarked to make sure modularization doesn't compromise latency
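For example, each module could carry a small Criterion benchmark so before/after numbers are easy to compare (a sketch, assuming Criterion as the harness and the hypothetical cast_from_boolean above; the SparkCastOptions constructor shown is an assumption and should be adjusted to the real API):

use std::sync::Arc;

use arrow::array::{ArrayRef, BooleanArray};
use arrow::datatypes::DataType;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_bool_to_string(c: &mut Criterion) {
    let input: ArrayRef = Arc::new(BooleanArray::from(vec![true; 8192]));
    // Assumed constructor signature; replace with the crate's actual API.
    let opts = SparkCastOptions::new(EvalMode::Legacy, "UTC", false);
    c.bench_function("cast boolean -> utf8", |b| {
        b.iter(|| cast_from_boolean(&input, &DataType::Utf8, &opts).unwrap())
    });
}

criterion_group!(benches, bench_bool_to_string);
criterion_main!(benches);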

Alternatives Considered

  1. Keep a single file but reorganize it to improve readability - Doesn't solve merge conflicts
  2. Trait objects with dynamic dispatch - Virtual-call overhead; a lookup table of cast impls to route to the right module could add extra performance cost
  3. Macros - Less readable, harder to debug

Open Questions

  1. Should we handle Dictionary types in a separate module or within each category?
  2. The naming conventions proposed here could likely be improved
  3. Generate cast-support documentation dynamically (similar to the Scala side), and possibly surface unreachable cast paths at compile time?

Future enhancements

  1. Once this lands, I plan to apply the same approach to other Comet native functions

Update after review from @comphead, @andygrove

  1. Definitely a +1 to modularizing the code and moving cast ops into sub-modules
  2. Trait usage might be unnecessary
