Skip to content

Conversation

@JoshElkind
Copy link

Which issue does this PR close?

Rationale for this change

Physical plans that read Arrow files (.arrow / IPC) could not be serialized or deserialized via the proto layer. PhysicalPlanNode already had scan nodes for Parquet, CSV, JSON, Avro, and in-memory sources, but not for Arrow, so a DataSourceExec using ArrowSource was not round-trippable. That blocked use cases like distributing plans that scan Arrow files (e.g. Ballista). This change adds Arrow scan to the proto layer so those plans can be serialized and deserialized like the other file formats.

What changes are included in this PR?

Proto: Added ArrowScanExecNode (with FileScanExecConf base_conf) and arrow_scan = 38 to the PhysicalPlanNode oneof in datafusion.proto.

Generated code: Updated prost.rs and pbjson.rs to include ArrowScanExecNode and the ArrowScan variant (manual edits; protoc was not run).

To-proto: In try_from_data_source_exec, when the data source is a FileScanConfig whose file source is ArrowSource, it is now serialized as ArrowScanExecNode.

From-proto: Implemented try_into_arrow_scan_physical_plan to deserialize ArrowScanExecNode into DataSourceExec with ArrowSource; missing base_conf returns an explicit error (no .unwrap()).

Test: Added roundtrip_arrow_scan in roundtrip_physical_plan.rs to assert Arrow scan plans round-trip correctly.

Are these changes tested?

Yes. A new test roundtrip_arrow_scan builds a physical plan that scans Arrow files, serializes it to bytes and deserializes it back, and asserts the round-tripped plan matches the original. The full cargo test -p datafusion-proto suite (150 tests: unit, integration, and doc tests) passes, including all existing roundtrip and serialization tests.

Are there any user-facing changes?

No. This only extends the existing physical-plan proto support to Arrow scan. Callers that already serialize/deserialize physical plans (e.g. for distributed execution) can now round-trip plans that read Arrow files in addition to Parquet, CSV, JSON, and Avro, with no API or behavioral changes for existing usage.

- Add ArrowScanExecNode message and arrow_scan to PhysicalPlanNode oneof in proto
- Regenerate prost/pbjson (manual: ArrowScanExecNode + ArrowScan variant)
- To-proto: serialize DataSourceExec with ArrowSource as ArrowScanExecNode
- From-proto: deserialize ArrowScanExecNode to DataSourceExec with ArrowSource
- Add roundtrip_arrow_scan test in roundtrip_physical_plan

Closes apache#20280
@github-actions github-actions bot added the proto Related to proto crate label Feb 11, 2026
Copy link
Contributor

@milenkovicm milenkovicm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @JoshElkind i've just glanced it it looks ok on the first pass,
will try to review it asap

Copy link
Contributor

@milenkovicm milenkovicm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good,
thanks @JoshElkind

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add protoc support for ArrowScanExecNode

2 participants