@adriangb
Contributor

Summary

  • Sort files within each file group by min/max statistics during sort pushdown to better align with the requested ordering
  • When files are non-overlapping and within-file ordering is guaranteed (Parquet returns Exact), the SortExec is completely eliminated
  • When files overlap, best-effort statistics-based reordering is applied with SortExec retained for correctness (Inexact)
  • ParquetSource::try_pushdown_sort now returns Exact when the file's natural ordering already satisfies the request, enabling sort elimination for the same-direction case
  • Add SLT integration tests covering non-overlapping sort elimination, overlapping files, reverse scan with mixed file naming, and multi-group merging
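The reordering and overlap check described in the bullets above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `FileRange` and `sort_and_check_non_overlapping` are hypothetical names, the sort key is assumed to be a single ascending `i64` column, and the strict `<` boundary comparison is a conservative assumption.

```rust
/// Hypothetical stand-in for one file's min/max statistics on the sort column.
/// The real change works on DataFusion's file statistics, not on this struct.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: String,
    min: i64,
    max: i64,
}

/// Sorts the files of one group by (min, max) ascending, then reports whether
/// every file ends strictly before the next one begins.
///
/// Returning `true` models the "non-overlapping" case from the summary;
/// `false` models the overlapping case, where the reordering is best-effort.
fn sort_and_check_non_overlapping(files: &mut [FileRange]) -> bool {
    files.sort_by_key(|f| (f.min, f.max));
    // Compare each file with its successor in the sorted order.
    files.windows(2).all(|pair| pair[0].max < pair[1].min)
}
```

Under these assumptions, files named out of order (for example `c`, `a`, `b`) end up ordered by their value ranges rather than by name, and a `true` result is only one of the two conditions the summary lists for dropping the SortExec.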

Related to #19724

Test plan

  • cargo test -p datafusion-sqllogictest --test sqllogictests -- sort_pushdown passes
  • cargo test -p datafusion-sqllogictest --test sqllogictests -- parquet_sorted_statistics passes
  • New SLT tests verify EXPLAIN plans show correct optimizer behavior (sort elimination, SortExec retention, reverse_row_groups, file ordering)
  • New SLT tests verify query result correctness for all scenarios

🤖 Generated with Claude Code

Add statistics-based file reordering within file groups during sort
pushdown. When pushing sort requirements down to DataSourceExec, files
within each group are now sorted by their min/max statistics to better
align with the requested ordering. Non-overlapping files with guaranteed
within-file ordering enable complete sort elimination (Exact), while
overlapping files get best-effort reordering (Inexact) with SortExec
retained for correctness.

Key changes:
- FileScanConfig::rebuild_with_source sorts files by statistics
- FileScanConfig::sort_files_within_groups_by_statistics detects
  non-overlapping file ranges for sort elimination
- ParquetSource::try_pushdown_sort returns Exact when natural ordering
  matches the request, enabling sort elimination
- Add SLT integration tests (Tests 4-7) covering non-overlapping,
  overlapping, reverse scan, and multi-group scenarios

Related to apache#19724

Co-Authored-By: Claude Opus 4.6 <[email protected]>
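How the two conditions in the commit message above combine into an Exact or Inexact outcome can be sketched as a small decision function. `PushdownOutcome`, `classify`, and `keeps_sort_exec` are hypothetical names for illustration only; the actual DataFusion types and any additional outcomes are not shown here. Treating "non-overlapping but not ordered within each file" as Inexact is an assumption, since the description does not spell that case out.

```rust
/// Hypothetical outcome type mirroring the Exact / Inexact wording in the
/// commit message. Not the real DataFusion type.
#[derive(Debug, PartialEq)]
enum PushdownOutcome {
    /// The scan output is guaranteed to satisfy the requested ordering.
    Exact,
    /// The scan output is only best-effort ordered.
    Inexact,
}

/// Exact requires both guarantees: each file is internally ordered and the
/// files' ranges do not overlap. Anything else is treated as Inexact
/// (an assumption for the cases the description does not cover).
fn classify(within_file_ordered: bool, files_non_overlapping: bool) -> PushdownOutcome {
    if within_file_ordered && files_non_overlapping {
        PushdownOutcome::Exact
    } else {
        PushdownOutcome::Inexact
    }
}

/// The sort operator is retained for correctness unless the outcome is Exact.
fn keeps_sort_exec(outcome: &PushdownOutcome) -> bool {
    matches!(outcome, PushdownOutcome::Inexact)
}
```

The point of the split is that statistics-based reordering is always safe to apply, while removing the sort is only safe when the ordering is guaranteed rather than merely likely.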
@github-actions bot added the optimizer (Optimizer rules), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), and datasource (Changes to the datasource crate) labels on Feb 12, 2026
@adriangb marked this pull request as draft on February 12, 2026 at 01:16
Replace old-style format arguments with inlined variable syntax
to fix clippy warnings (uninlined_format_args).

Co-Authored-By: Claude Haiku 4.5 <[email protected]>
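For readers unfamiliar with the `uninlined_format_args` lint, the kind of rewrite this commit describes looks roughly like the following. The function names and message text are made up for illustration and do not come from the PR's diff.

```rust
/// Old style: positional arguments passed after the format string.
/// Clippy's `uninlined_format_args` lint flags this form.
fn describe_old(group: usize, files: usize) -> String {
    format!("group {} has {} files", group, files)
}

/// New style: variables are captured directly inside the braces.
/// The produced string is identical to the old style.
fn describe_new(group: usize, files: usize) -> String {
    format!("group {group} has {files} files")
}
```

Only plain identifiers can be inlined this way; expressions such as field accesses or method calls still have to be passed as separate arguments.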