
RFC: Cast Subsystem Modularization #3459


Description


What is the problem the feature request solves?

RFC: Cast Subsystem Modularization

Author: Bhargava Vadlamani (@coderfender)

Summary

Refactor the monolithic cast.rs (3,700 LOC) into separate, focused modules organized by data-type category, using a two-level static dispatch flow.

Problems

  1. Merge Conflicts - Multiple contributors editing the same large file causes frequent conflicts
  2. No Enforced Contract - No trait defining what a cast implementation must provide
  3. Poor Discoverability - Hard to find specific cast logic in 3,700 lines
  4. Scattered Changes - Adding a new cast type requires edits across distant parts of the file

Proposal

Trait Definition

use std::fmt::Debug;

use arrow::array::ArrayRef;
use arrow::datatypes::DataType;

/// Core trait for Spark-compatible cast operations.
/// Nice to have: implementations should branch on EvalMode at the array
/// (payload) level, not per row, to reduce branching overhead.
pub trait SparkCast: Debug {
    /// Returns true if this implementation supports casting `from` to `to`.
    fn can_cast(&self, from: &DataType, to: &DataType) -> bool;

    /// Casts `array` from type `from` to type `to` under the given options.
    fn cast(
        &self,
        array: &ArrayRef,
        from: &DataType,
        to: &DataType,
        opts: &SparkCastOptions,
    ) -> SparkResult<ArrayRef>;
}
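As an illustration only, a per-module implementor of this trait might look like the following (a minimal sketch, not the final implementation; BooleanCast and cast_from_boolean are hypothetical names, and the supported type pairs shown are just examples):

use std::fmt::Debug;

use arrow::array::ArrayRef;
use arrow::datatypes::DataType;

/// Hypothetical implementor that would live in boolean.rs.
#[derive(Debug)]
pub struct BooleanCast;

impl SparkCast for BooleanCast {
    // Advertises only the type pairs this module knows how to handle.
    fn can_cast(&self, from: &DataType, to: &DataType) -> bool {
        matches!(from, DataType::Boolean)
            && matches!(to, DataType::Utf8 | DataType::Int8 | DataType::Int16 | DataType::Int32)
    }

    // Delegates to the module's Level-2 dispatcher (shown later).
    fn cast(
        &self,
        array: &ArrayRef,
        _from: &DataType,
        to: &DataType,
        opts: &SparkCastOptions,
    ) -> SparkResult<ArrayRef> {
        cast_from_boolean(array, to, opts)
    }
}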

Module Structure

(We currently have only cast.rs as a single file, with 100+ lines of match statements and cast functions with differing signatures, making it difficult to maintain.)

native/spark-expr/src/cast/
├── mod.rs              # Public API, dispatcher
├── traits.rs           # SparkCast trait
├── string.rs           # String ↔ other types
├── numeric.rs          # Integer, Float, Decimal
├── temporal.rs         # Date, Timestamp
├── boolean.rs          # Boolean conversions
├── complex_types.rs    # Struct, Array, Map
├── binary.rs           # Binary conversions
└── utils.rs            # Shared helpers
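For context, mod.rs would mostly be wiring: declare the sub-modules and expose one public entry point (a sketch; cast_array is a hypothetical name for the dispatcher whose body is the Level-1 match below):

// mod.rs: wires the sub-modules together and re-exports the public surface.
mod binary;
mod boolean;
mod complex_types;
mod numeric;
mod string;
mod temporal;
mod traits;
mod utils;

pub use traits::SparkCast;

use arrow::array::ArrayRef;
use arrow::datatypes::DataType;

pub fn cast_array(
    array: &ArrayRef,
    from: &DataType,
    to: &DataType,
    opts: &SparkCastOptions,
) -> SparkResult<ArrayRef> {
    // Body is the Level-1 match shown in the next snippet.
    unimplemented!()
}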

Two-Level Static Dispatch

Level 1 - mod.rs (routes by from type category):

match from {
    _ if is_string(from) => string::cast_from_string(array, to, opts),
    _ if is_boolean(from) => boolean::cast_from_boolean(array, to, opts),
    _ if is_numeric(from) => numeric::cast_from_numeric(array, from, to, opts),
    _ if is_temporal(from) => temporal::cast_from_temporal(array, from, to, opts),
    _ if is_complex(from) => complex_types::cast_from_complex(array, from, to, opts),
    _ if is_binary(from) => binary::cast_from_binary(array, to, opts),
    // Catch-all: pursue DataFusion-compatible casts here or in the sub-module
    // itself (hypothetical fallback helper).
    _ => datafusion_compatible_cast(array, to, opts),
}

utils.rs (type category helpers):

use arrow::datatypes::DataType::{self, *};

pub fn is_numeric(dt: &DataType) -> bool {
    matches!(dt, Int8 | Int16 | Int32 | Int64 | Float32 | Float64 | Decimal128(_, _))
}

pub fn is_temporal(dt: &DataType) -> bool {
    matches!(dt, Date32 | Date64 | Timestamp(_, _))
}

pub fn is_string(dt: &DataType) -> bool {
    matches!(dt, Utf8 | LargeUtf8)
}

// ...
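These helpers keep the Level-1 guards readable, e.g.:

assert!(is_numeric(&DataType::Int32));
assert!(is_temporal(&DataType::Date32));
assert!(!is_string(&DataType::Boolean));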

Level 2 - string.rs (routes by to type):

use arrow::array::ArrayRef;
use arrow::datatypes::DataType::{self, *};

pub fn cast_from_string(array: &ArrayRef, to: &DataType, opts: &SparkCastOptions) -> SparkResult<ArrayRef> {
    match to {
        Boolean => utf8_to_boolean(array, opts),
        Int8 => utf8_to_int8(array, opts),
        Int16 => utf8_to_int16(array, opts),
        // ...
        _ => Err(unsupported_cast_error(to)), // hypothetical error helper
    }
}

Level 3 - actual function (branches on eval_mode at the array level). This is adaptable: not every cast will need a separate implementation per EvalMode.

fn utf8_to_int8(array: &ArrayRef, opts: &SparkCastOptions) -> SparkResult<ArrayRef> {
    match opts.eval_mode {
        EvalMode::Legacy => utf8_to_int8_legacy(array),
        EvalMode::Ansi => utf8_to_int8_ansi(array),
        EvalMode::Try => utf8_to_int8_try(array),
    }
}
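To make the EvalMode split concrete, the ANSI and TRY variants of a leaf function would differ mainly in how they treat unparsable input (a simplified sketch: cast_invalid_value_error is a hypothetical error constructor, and Spark's full string-to-int parsing rules are elided):

use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int8Array, StringArray};

// ANSI mode: fail on the first unparsable value, matching Spark's ANSI cast.
fn utf8_to_int8_ansi(array: &ArrayRef) -> SparkResult<ArrayRef> {
    let input = array.as_any().downcast_ref::<StringArray>().unwrap();
    let mut builder = Int8Array::builder(input.len());
    for i in 0..input.len() {
        if input.is_null(i) {
            builder.append_null();
        } else {
            match input.value(i).trim().parse::<i8>() {
                Ok(v) => builder.append_value(v),
                // Hypothetical error constructor; real code would surface
                // Spark's invalid-input cast error.
                Err(_) => return Err(cast_invalid_value_error(input.value(i))),
            }
        }
    }
    Ok(Arc::new(builder.finish()))
}

// TRY mode: same parse, but unparsable values become null instead of errors.
fn utf8_to_int8_try(array: &ArrayRef) -> SparkResult<ArrayRef> {
    let input = array.as_any().downcast_ref::<StringArray>().unwrap();
    let values: Int8Array = input
        .iter()
        .map(|v| v.and_then(|s| s.trim().parse::<i8>().ok()))
        .collect();
    Ok(Arc::new(values))
}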

Migration Strategy

Phase 1: Setup (1 PR)

  • Create cast/ directory
  • Add mod.rs, traits.rs, utils.rs with type category helpers
  • Start with the smallest module, boolean.rs (see the sketch below)
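A minimal starting point for boolean.rs might look like this (a sketch; bool_to_string is a hypothetical leaf function, and the remaining target types would be ported incrementally):

use std::sync::Arc;

use arrow::array::{ArrayRef, BooleanArray, StringArray};
use arrow::datatypes::DataType;

/// Level-2 dispatcher for casts whose source type is Boolean.
pub fn cast_from_boolean(array: &ArrayRef, to: &DataType, _opts: &SparkCastOptions) -> SparkResult<ArrayRef> {
    match to {
        DataType::Utf8 => bool_to_string(array),
        // ... remaining target types migrate here from cast.rs
        _ => todo!("port remaining boolean casts"),
    }
}

// Spark renders booleans as "true"/"false" when cast to string.
fn bool_to_string(array: &ArrayRef) -> SparkResult<ArrayRef> {
    let input = array.as_any().downcast_ref::<BooleanArray>().unwrap();
    let values: StringArray = input
        .iter()
        .map(|v| v.map(|b| if b { "true" } else { "false" }))
        .collect();
    Ok(Arc::new(values))
}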

Phase 2: Migrate data types to individual sub-modules

  • Migrate the remaining cast operations into their data-type sub-modules

Phase 3: Cleanup (1 PR)

  • Remove the old cast.rs
  • Update imports across the cast subsystem

Testing

  • Existing tests remain valid (or need only minimal changes to adhere to the standard cast function signature)
  • Each module gets its own #[cfg(test)] section, making it easier to add further tests (see the example below)
  • No new tests needed initially - existing coverage carries over
  • Future: add module-specific unit tests as needed
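As an illustration of the per-module test layout (a sketch, assuming the hypothetical bool_to_string from the boolean.rs sketch above):

#[cfg(test)]
mod tests {
    use super::*;
    use std::sync::Arc;

    use arrow::array::{Array, ArrayRef, BooleanArray, StringArray};

    #[test]
    fn bool_to_string_matches_spark() {
        let input: ArrayRef = Arc::new(BooleanArray::from(vec![Some(true), None, Some(false)]));
        let result = bool_to_string(&input).unwrap();
        let strings = result.as_any().downcast_ref::<StringArray>().unwrap();
        assert_eq!(strings.value(0), "true");
        assert!(strings.is_null(1));
        assert_eq!(strings.value(2), "false");
    }
}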

Benchmarking

  1. Existing benchmarks will move into the corresponding data-type module (see the sketch below)
  2. All code changes will be benchmarked to make sure modularization doesn't compromise latency
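For example, each module could carry a small Criterion benchmark so before/after numbers are easy to compare (a sketch, assuming Criterion as the harness and the hypothetical cast_from_boolean above; the SparkCastOptions constructor shown is an assumption and should be adjusted to the real API):

use std::sync::Arc;

use arrow::array::{ArrayRef, BooleanArray};
use arrow::datatypes::DataType;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_bool_to_string(c: &mut Criterion) {
    let input: ArrayRef = Arc::new(BooleanArray::from(vec![true; 8192]));
    // Assumed constructor signature; replace with the crate's actual API.
    let opts = SparkCastOptions::new(EvalMode::Legacy, "UTC", false);
    c.bench_function("cast boolean -> utf8", |b| {
        b.iter(|| cast_from_boolean(&input, &DataType::Utf8, &opts).unwrap())
    });
}

criterion_group!(benches, bench_bool_to_string);
criterion_main!(benches);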

Alternatives Considered

  1. Keep a single file but reorganize it to improve readability - Doesn't solve merge conflicts
  2. Trait objects with dynamic dispatch - Virtual-call overhead; a lookup table of cast impls to route to the right module could add extra performance cost
  3. Macros - Less readable, harder to debug

Open Questions

  1. Should we handle Dictionary types in a separate module or within each category?
  2. The naming conventions proposed here could likely be improved
  3. Generate cast-support documentation dynamically (similar to the Scala side), and possibly surface unreachable cast paths at compile time?

Future enhancements

  1. Once this lands, I plan to apply the same approach to other Comet native functions

Update after review from @comphead, @andygrove

  1. Definitely a +1 to modularizing the code and moving cast ops into sub-modules
  2. Trait usage might be unnecessary
