
Conversation

@gabotechs (Contributor) commented Feb 11, 2026

Which issue does this PR close?

TODO: create an epic for stats estimation improvements

  • Closes #.

Rationale for this change

The statistics propagation system lacks accuracy overall. There is likely plenty of low-hanging fruit in statistics propagation that, if addressed, would improve the quality of physical plans as a result of better estimations.

The idea behind this PR is to add tests that quantify further improvements to the statistics propagation system, so that people can improve it incrementally over time.

What changes are included in this PR?

Adds new integration tests that compare what the statistics estimated against what actually happened during query execution, using the TPC-DS dataset (just that one for now):

  1. It calls ExecutionPlan::partition_statistics() on all the nodes, collecting their estimates (see the sketch after this list).
  2. It executes the plan against actual parquet files, collecting the relevant metrics.
  3. It computes an accuracy factor from what was estimated vs what actually happened.
  4. It commits the average accuracy into insta snapshots in the tests, so that further contributions can show improvements in the overall accuracy number.
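A minimal sketch of steps 1 and 2, assuming DataFusion's ExecutionPlan trait, where partition_statistics(None) returns the statistics aggregated across all partitions and metrics() is only populated after the plan has run. collect_row_estimates is a hypothetical helper name, not code from this PR:

use std::sync::Arc;
use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;

/// Hypothetical helper: walks an executed plan and pairs each node's
/// estimated row count (from statistics) with the observed one (from metrics).
fn collect_row_estimates(
    plan: &Arc<dyn ExecutionPlan>,
    out: &mut Vec<(usize, usize)>,
) -> Result<()> {
    // Statistics aggregated across all partitions of this node.
    let estimated = plan.partition_statistics(None)?.num_rows;
    // Runtime metrics exist only after the plan has actually executed.
    let observed = plan.metrics().and_then(|m| m.output_rows()).unwrap_or(0);
    if let Some(est) = estimated.get_value() {
        out.push((*est, observed));
    }
    for child in plan.children() {
        collect_row_estimates(child, out)?;
    }
    Ok(())
}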

It also adds a Debug implementation that visualizes the comparison in a fine-grained way, so contributors can quickly identify places where stats are inaccurate:

SortPreservingMergeExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
  SortExec(TopK): output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
    HashJoinExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
      HashJoinExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
        CoalescePartitionsExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
          HashJoinExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
            CoalescePartitionsExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Inexact(36) vs 0 (0%)
              ProjectionExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Inexact(36) vs 0 (0%)
                FilterExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
                  RepartitionExec: output_rows=Inexact(12) vs 0 (0%) output_bytes=Absent vs 0 (0%)
                    DataSourceExec: output_rows=Inexact(12) vs 0 (0%) output_bytes=Absent vs 0 (0%)
            ProjectionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(256) vs 128 (50%)
              AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 128 (0%)
                RepartitionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                  AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 128 (0%)
                    RepartitionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                      HashJoinExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                        CoalescePartitionsExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(96) vs 32832 (1%)
                          ProjectionExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(96) vs 32832 (1%)
                            FilterExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(59) vs 32768 (1%)
                              RepartitionExec: output_rows=Inexact(1000) vs 63 (7%) output_bytes=Inexact(8000) vs 65536 (13%)
                                DataSourceExec: output_rows=Inexact(1000) vs 63 (7%) output_bytes=Inexact(8000) vs 504 (7%)
                        DataSourceExec: output_rows=Inexact(1000) vs 1000 (100%) output_bytes=Inexact(40000) vs 63360 (64%)
        DataSourceExec: output_rows=Inexact(1000) vs 1000 (100%) output_bytes=Absent vs 48160 (0%)
      ProjectionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(192) vs 1088 (18%)
        AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(192) vs 1088 (18%)
          RepartitionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(192) vs 262144 (1%)
            AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(192) vs 1120 (18%)
              ProjectionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(192) vs 96 (50%)
                AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 128 (0%)
                  RepartitionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                    AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 128 (0%)
                      RepartitionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                        HashJoinExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                          CoalescePartitionsExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(96) vs 32832 (1%)
                            ProjectionExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(96) vs 32832 (1%)
                              FilterExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(59) vs 32768 (1%)
                                RepartitionExec: output_rows=Inexact(1000) vs 63 (7%) output_bytes=Inexact(8000) vs 65536 (13%)
                                  DataSourceExec: output_rows=Inexact(1000) vs 63 (7%) output_bytes=Inexact(8000) vs 504 (7%)
                          DataSourceExec: output_rows=Inexact(1000) vs 1000 (100%) output_bytes=Inexact(40000) vs 63360 (64%)
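The percentages above are consistent with an accuracy factor defined as the smaller of the two values divided by the larger, e.g. Inexact(8) vs 1 displays as 13% (1/8, rounded up). Here is a sketch of that formula under this assumption; accuracy_factor is a hypothetical name, and the PR's exact rounding and edge-case handling may differ:

/// Hypothetical accuracy factor: ratio of the smaller value to the larger,
/// so a perfect estimate scores 1.0 and over- or under-estimation is
/// penalized symmetrically.
fn accuracy_factor(estimated: u64, observed: u64) -> f64 {
    match (estimated, observed) {
        // Assumed convention: agreeing that nothing was produced is a perfect score.
        (0, 0) => 1.0,
        (e, o) => e.min(o) as f64 / e.max(o) as f64,
    }
}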

Are these changes tested?

These changes consist exclusively of new integration tests.

Are there any user-facing changes?

No

@gabotechs gabotechs marked this pull request as draft February 11, 2026 15:44
@github-actions github-actions bot added the core Core DataFusion crate label Feb 11, 2026
@gabotechs gabotechs force-pushed the add-statistics-tests branch from 4cdcdba to 2ee489c on February 11, 2026 15:47
@gabotechs (Contributor, Author) commented:

I'm not sure whether these file sizes are acceptable for this project. The new parquet files are the original TPC-DS benchmark files, sampled down to at most 1000 rows each.

The files total 1.8 MB. I'd usually consider that acceptable to commit directly to a git repository, but let me know whether this falls within the project's policy.

Comment on lines +34 to +38
assert_snapshot!(display, @r"
row_estimation_accuracy=20%
byte_estimation_accuracy=8%
");
Ok(())
@gabotechs (Contributor, Author) commented:

The idea is that further contributions should see these numbers increase.
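For example, after improving an estimator, re-running these tests and reviewing the changed snapshots with insta's usual workflow (cargo insta test, then cargo insta review) would surface the new, hopefully higher, accuracy percentages for acceptance.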

