Add protoc support for ArrowScanExecNode (#20280) #20284
+187
−3
Which issue does this PR close?
protoc support for ArrowScanExecNode #20280.
Rationale for this change
Physical plans that read Arrow files (.arrow / IPC) could not be serialized or deserialized via the proto layer. PhysicalPlanNode already had scan nodes for Parquet, CSV, JSON, Avro, and in-memory sources, but not for Arrow, so a DataSourceExec using ArrowSource was not round-trippable. That blocked use cases like distributing plans that scan Arrow files (e.g. Ballista). This change adds Arrow scan to the proto layer so those plans can be serialized and deserialized like the other file formats.
What changes are included in this PR?
Proto: Added ArrowScanExecNode (with FileScanExecConf base_conf) and arrow_scan = 38 to the PhysicalPlanNode oneof in datafusion.proto.
Generated code: Updated prost.rs and pbjson.rs to include ArrowScanExecNode and the ArrowScan variant (manual edits; protoc was not run).
To-proto: In try_from_data_source_exec, when the data source is a FileScanConfig whose file source is ArrowSource, it is now serialized as ArrowScanExecNode.
From-proto: Implemented try_into_arrow_scan_physical_plan to deserialize ArrowScanExecNode into DataSourceExec with ArrowSource; missing base_conf returns an explicit error (no .unwrap()).
Test: Added roundtrip_arrow_scan in roundtrip_physical_plan.rs to assert Arrow scan plans round-trip correctly.
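The to-proto and from-proto steps above can be sketched roughly as follows. This is a minimal illustration only: every type here is a simplified, hypothetical stand-in defined in the snippet (not the real DataFusion or prost-generated definitions), and the function names `to_arrow_scan_node` and `try_into_arrow_scan` are assumptions chosen for the example. It shows the two ideas from the list: only Arrow-backed scans map to the Arrow node, and a missing `base_conf` is surfaced as an error rather than a panic.

```rust
// Simplified stand-ins for the generated proto types and the plan types.
// None of these are the real DataFusion definitions.

/// Stand-in for the proto `FileScanExecConf` message.
#[derive(Debug, Clone, PartialEq)]
pub struct FileScanExecConf {
    pub file_paths: Vec<String>,
}

/// Stand-in for the new proto message. prost represents a singular
/// message-typed field as an `Option`, so `base_conf` may be absent.
#[derive(Debug, Clone, PartialEq)]
pub struct ArrowScanExecNode {
    pub base_conf: Option<FileScanExecConf>,
}

/// Stand-in for the kinds of file source a scan config can carry.
#[derive(Debug, Clone, PartialEq)]
pub enum FileSource {
    Arrow,
    Parquet,
}

/// Stand-in for a `DataSourceExec` backed by a file scan config.
#[derive(Debug, Clone, PartialEq)]
pub struct DataSourceExec {
    pub source: FileSource,
    pub file_paths: Vec<String>,
}

/// To-proto (hypothetical helper): only Arrow-backed scans become an
/// `ArrowScanExecNode`; other sources are left to their own branches.
pub fn to_arrow_scan_node(exec: &DataSourceExec) -> Option<ArrowScanExecNode> {
    match exec.source {
        FileSource::Arrow => Some(ArrowScanExecNode {
            base_conf: Some(FileScanExecConf {
                file_paths: exec.file_paths.clone(),
            }),
        }),
        _ => None,
    }
}

/// From-proto (hypothetical helper): a missing `base_conf` yields an
/// explicit error instead of calling `.unwrap()`.
pub fn try_into_arrow_scan(node: &ArrowScanExecNode) -> Result<DataSourceExec, String> {
    let conf = node
        .base_conf
        .as_ref()
        .ok_or_else(|| "ArrowScanExecNode is missing base_conf".to_string())?;
    Ok(DataSourceExec {
        source: FileSource::Arrow,
        file_paths: conf.file_paths.clone(),
    })
}
```

Returning a `Result` here means a malformed or partially populated message coming from another process produces a recoverable error on the deserializing side.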
Are these changes tested?
Yes. A new test roundtrip_arrow_scan builds a physical plan that scans Arrow files, serializes it to bytes and deserializes it back, and asserts the round-tripped plan matches the original. The full cargo test -p datafusion-proto suite (150 tests: unit, integration, and doc tests) passes, including all existing roundtrip and serialization tests.
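The shape of such a round-trip check can be sketched as below. This is not the code of `roundtrip_arrow_scan`: the `ArrowScanPlan` type, the `encode`/`decode`/`round_trip` helpers, and the length-prefixed byte layout are all hypothetical and deliberately simple (the layout is not the protobuf wire format). The sketch only illustrates the pattern of serializing to bytes, deserializing, and comparing against the original.

```rust
/// Hypothetical plan: just the list of Arrow files to scan.
#[derive(Debug, Clone, PartialEq)]
pub struct ArrowScanPlan {
    pub files: Vec<String>,
}

/// Toy encoding (NOT the protobuf wire format): a little-endian u32 file
/// count, then each path as a little-endian u32 length plus UTF-8 bytes.
pub fn encode(plan: &ArrowScanPlan) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(plan.files.len() as u32).to_le_bytes());
    for f in &plan.files {
        out.extend_from_slice(&(f.len() as u32).to_le_bytes());
        out.extend_from_slice(f.as_bytes());
    }
    out
}

/// Reads one little-endian u32 at `pos` and advances `pos` past it.
fn read_u32(bytes: &[u8], pos: &mut usize) -> Result<u32, String> {
    let end = pos.checked_add(4).ok_or("offset overflow")?;
    let raw = bytes.get(*pos..end).ok_or("truncated length")?;
    *pos = end;
    Ok(u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]))
}

/// Decodes the toy format, returning an error on truncated or invalid input.
pub fn decode(bytes: &[u8]) -> Result<ArrowScanPlan, String> {
    let mut pos = 0usize;
    let count = read_u32(bytes, &mut pos)?;
    let mut files = Vec::new();
    for _ in 0..count {
        let len = read_u32(bytes, &mut pos)? as usize;
        let end = pos.checked_add(len).ok_or("length overflow")?;
        let raw = bytes.get(pos..end).ok_or("truncated path")?;
        files.push(String::from_utf8(raw.to_vec()).map_err(|e| e.to_string())?);
        pos = end;
    }
    Ok(ArrowScanPlan { files })
}

/// Serializes the plan to bytes and deserializes it back.
pub fn round_trip(plan: &ArrowScanPlan) -> Result<ArrowScanPlan, String> {
    decode(&encode(plan))
}
```

A test built on this would construct a plan, call `round_trip`, and assert that the result equals the original, which mirrors the comparison described above.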
Are there any user-facing changes?
No. This only extends the existing physical-plan proto support to Arrow scan. Callers that already serialize/deserialize physical plans (e.g. for distributed execution) can now round-trip plans that read Arrow files in addition to Parquet, CSV, JSON, and Avro, with no API or behavioral changes for existing usage.