
Conversation

@mbutrovich (Contributor) commented on Feb 8, 2026

Which issue does this PR close?

Closes #3442 (Add DPP support to native_datafusion/CometNativeScan)

Rationale for this change

Dynamic Partition Pruning (DPP) is essential for optimizing star schema queries. Previously, native_datafusion scans would fall back to Spark when DPP was present. This PR enables full DPP support by deferring partition serialization until execution time, after DPP subqueries are resolved.

Additionally, this approach reduces serialization overhead when scanning large Parquet datasets, as partition metadata is no longer replicated across all Spark partitions.

What changes are included in this PR?

Architecture:

  • CometNativeScanExec now uses lazy serializedPartitionData to defer serialization to execution time
  • CometNativeScan.convert() creates a placeholder with only a scan_id at planning time
  • serializePartitions() resolves DPP subqueries and serializes filtered partitions at execution
  • Uses originalPlan.partitionFilters instead of partitionFilters because AQE's PlanDynamicPruningFilters converts subqueries to literals via makeCopy
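
The deferred-serialization idea above can be sketched in plain Scala (no Spark dependencies; `NativeScanNode` and `serializationCount` are hypothetical stand-ins for `CometNativeScanExec` internals): the serialization work lives in a `lazy val`, so it runs on first access at execution time, after DPP subqueries are resolved, rather than at plan construction.

```scala
object LazyPartitionDataDemo {
  var serializationCount = 0

  // Hypothetical stand-in for CometNativeScanExec and its lazy
  // serializedPartitionData field.
  class NativeScanNode(partitions: Seq[String]) {
    lazy val serializedPartitionData: Array[Byte] = {
      serializationCount += 1 // body evaluated once, on first access
      partitions.mkString(",").getBytes("UTF-8")
    }
  }

  def main(args: Array[String]): Unit = {
    val node = new NativeScanNode(Seq("p=1", "p=2"))
    assert(serializationCount == 0) // planning time: nothing serialized yet
    node.serializedPartitionData
    node.serializedPartitionData
    assert(serializationCount == 1) // execution time: serialized exactly once
    println("serialized once, lazily")
  }
}
```

A side effect of this pattern, noted in the rationale, is that the planning-time node stays small (just a `scan_id` placeholder), so partition metadata is not replicated into every serialized copy of the plan.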

SubqueryBroadcast Transformation:

  • Transform SubqueryBroadcastExec children to use CometBroadcastExchangeExec wrapped in CometColumnarToRowExec for row-based output
  • This enables broadcast exchange reuse via canonicalization: the same CometBroadcastExchangeExec used in the join can be reused by the DPP subquery filter
  • Uses transformUp instead of transformUpWithSubqueries to preserve ReusedSubqueryExec object identity for scalar subqueries
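
Why `transformUp` preserves `ReusedSubqueryExec` identity can be illustrated with a toy tree (plain Scala, not Spark's `TreeNode` API; all names here are illustrative): a bottom-up transform that returns the same object when neither the rule nor any child rewrite fires leaves untouched subtrees referentially intact, whereas a variant that also descends into and rebuilds subquery expressions would allocate new nodes and break `eq`-based reuse checks.

```scala
object TransformUpIdentityDemo {
  sealed trait Node {
    // Bottom-up transform that returns the SAME object when nothing changes.
    def transformUp(rule: PartialFunction[Node, Node]): Node
  }
  case class Broadcast(child: Node) extends Node {
    def transformUp(rule: PartialFunction[Node, Node]): Node = {
      val newChild = child.transformUp(rule)
      // Only rebuild this node if the child actually changed.
      val self = if (newChild eq child) this else Broadcast(newChild)
      rule.applyOrElse(self, identity[Node])
    }
  }
  case class ReusedSubquery(id: Int) extends Node {
    def transformUp(rule: PartialFunction[Node, Node]): Node =
      rule.applyOrElse(this, identity[Node])
  }

  def main(args: Array[String]): Unit = {
    val plan = Broadcast(ReusedSubquery(1))
    // The rule matches nothing in this plan, so every node keeps its identity.
    val out = plan.transformUp {
      case Broadcast(ReusedSubquery(99)) => ReusedSubquery(0)
    }
    assert(out eq plan) // object identity preserved end to end
    println("identity preserved")
  }
}
```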

Configuration:

  • New flag spark.comet.scan.dpp.enabled (default: true) replaces spark.comet.dppFallback.enabled
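
As a hedged usage illustration (the flag name comes from this PR; an active SparkSession named `spark` is assumed):

```scala
// The flag defaults to true, so setting it is only needed to opt out of
// native DPP support (e.g. to force the previous fallback behavior):
spark.conf.set("spark.comet.scan.dpp.enabled", "false")
```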

Shims:

  • Added getDppFilteredFilePartitions() and getDppFilteredBucketedFilePartitions() to ShimCometScanExec for Spark 3.4/3.5/4.0
  • Added resolveSubqueryAdaptiveBroadcast() to ShimSubqueryBroadcast

Spark Diff Updates:

  • Updated DynamicPartitionPruningSuite.checkPartitionPruningPredicate to recognize CometColumnarToRowExec → CometBroadcastExchangeExec as a valid SubqueryBroadcast child structure

Other:

  • Removed custom equals/hashCode from CometNativeScanExec to prevent incorrect AQE exchange reuse between scans with different projections
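
The risk the removed `equals`/`hashCode` posed can be sketched in plain Scala (the class and field names are hypothetical, not Comet's actual implementation): if a custom `equals` compares only the underlying relation and ignores the projected output, two scans with different projections compare equal, and AQE's exchange-reuse machinery could substitute one for the other.

```scala
object EqualsReuseDemo {
  // Hypothetical scan node whose custom equals ignores the projection.
  class ScanWithCustomEquals(val table: String, val output: Seq[String]) {
    override def equals(other: Any): Boolean = other match {
      case s: ScanWithCustomEquals => s.table == table // projection ignored
      case _                       => false
    }
    override def hashCode(): Int = table.hashCode
  }

  def main(args: Array[String]): Unit = {
    val a = new ScanWithCustomEquals("fact", Seq("x"))
    val b = new ScanWithCustomEquals("fact", Seq("x", "y"))
    // Different projections, yet "equal": reuse rules comparing plans this
    // way could pick the wrong exchange. Falling back to default (reference)
    // equality avoids the conflation.
    assert(a == b)
    println("custom equals conflates scans with different projections")
  }
}
```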

How are these changes tested?

  • New CometExecSuite tests for DPP with native_datafusion scans, multiple partition columns, non-broadcast subqueries (SubqueryExec), and subquery reuse (ReusedSubqueryExec)
  • New test for DPP broadcast exchange reuse validation
  • New test for SubqueryBroadcast transformation preserving ReusedSubqueryExec references
  • New CometIcebergNativeSuite test for Iceberg DPP with non-broadcast joins
  • Updated Spark 3.5.8 diff so DPP validation handles both Comet's SubqueryBroadcast structure and CometNativeScan

@mbutrovich mbutrovich added this to the 0.14.0 milestone Feb 8, 2026
@mbutrovich mbutrovich marked this pull request as draft February 11, 2026 11:55
…Row -> CometNativeExec into SubqueryBroadcast -> CometColumnarToRow -> CometBroadcastExchange -> CometNativeExec

- This allows CometBroadcastExchange to be reused by both the SubqueryBroadcast path and the join path
- CometColumnarToRowExec is still needed because SubqueryBroadcastExec expects HashedRelation from doExecuteBroadcast()
@andygrove (Member) left a comment:

Awesome work @mbutrovich! LGTM pending CI. Let's get this merged and keep iterating/testing.

