
Conversation

@gabotechs (Contributor) commented Feb 11, 2026

Which issue does this PR close?

TODO: create an epic for stats estimation improvements

  • Closes #.

Rationale for this change

The statistics propagation system lacks accuracy overall. There is likely plenty of low-hanging fruit in statistics propagation that, if addressed, would improve the quality of physical plans as a result of better estimations.

The idea behind this PR is to add tests that quantify further improvements to the statistics propagation system, so that people can improve it incrementally over time.

What changes are included in this PR?

Adds new integration tests that compare what the statistics estimated against what actually happened during query execution, using the TPC-DS dataset (just that one for now):

  1. It calls ExecutionPlan::partition_statistics() on all the nodes, collecting their estimates (see the sketch after this list).
  2. It executes the plan against actual parquet files, collecting the relevant metrics.
  3. It computes an accuracy factor from what was estimated vs what actually happened.
  4. It commits the average accuracy into insta snapshots in the tests, so that further contributions can show improvements in the overall accuracy number.
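A minimal sketch of steps 1 and 2, assuming DataFusion's ExecutionPlan trait, where partition_statistics(None) returns the statistics aggregated across all partitions and metrics() is only populated after the plan has run. collect_row_estimates is a hypothetical helper name, not code from this PR:

use std::sync::Arc;
use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;

/// Hypothetical helper: walks an executed plan and pairs each node's
/// estimated row count (from statistics) with the observed one (from metrics).
fn collect_row_estimates(
    plan: &Arc<dyn ExecutionPlan>,
    out: &mut Vec<(usize, usize)>,
) -> Result<()> {
    // Statistics aggregated across all partitions of this node.
    let estimated = plan.partition_statistics(None)?.num_rows;
    // Runtime metrics exist only after the plan has actually executed.
    let observed = plan.metrics().and_then(|m| m.output_rows()).unwrap_or(0);
    if let Some(est) = estimated.get_value() {
        out.push((*est, observed));
    }
    for child in plan.children() {
        collect_row_estimates(child, out)?;
    }
    Ok(())
}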

It also adds a Debug implementation that visualizes the comparison in a fine-grained way, so contributors can quickly identify places where stats are inaccurate:

SortPreservingMergeExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
  SortExec(TopK): output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
    HashJoinExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
      HashJoinExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
        CoalescePartitionsExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
          HashJoinExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
            CoalescePartitionsExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Inexact(36) vs 0 (0%)
              ProjectionExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Inexact(36) vs 0 (0%)
                FilterExec: output_rows=Inexact(3) vs 0 (0%) output_bytes=Absent vs 0 (0%)
                  RepartitionExec: output_rows=Inexact(12) vs 0 (0%) output_bytes=Absent vs 0 (0%)
                    DataSourceExec: output_rows=Inexact(12) vs 0 (0%) output_bytes=Absent vs 0 (0%)
            ProjectionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(256) vs 128 (50%)
              AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 128 (0%)
                RepartitionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                  AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 128 (0%)
                    RepartitionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                      HashJoinExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                        CoalescePartitionsExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(96) vs 32832 (1%)
                          ProjectionExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(96) vs 32832 (1%)
                            FilterExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(59) vs 32768 (1%)
                              RepartitionExec: output_rows=Inexact(1000) vs 63 (7%) output_bytes=Inexact(8000) vs 65536 (13%)
                                DataSourceExec: output_rows=Inexact(1000) vs 63 (7%) output_bytes=Inexact(8000) vs 504 (7%)
                        DataSourceExec: output_rows=Inexact(1000) vs 1000 (100%) output_bytes=Inexact(40000) vs 63360 (64%)
        DataSourceExec: output_rows=Inexact(1000) vs 1000 (100%) output_bytes=Absent vs 48160 (0%)
      ProjectionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(192) vs 1088 (18%)
        AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(192) vs 1088 (18%)
          RepartitionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(192) vs 262144 (1%)
            AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(192) vs 1120 (18%)
              ProjectionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Inexact(192) vs 96 (50%)
                AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 128 (0%)
                  RepartitionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                    AggregateExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 128 (0%)
                      RepartitionExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                        HashJoinExec: output_rows=Inexact(8) vs 1 (13%) output_bytes=Absent vs 262144 (0%)
                          CoalescePartitionsExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(96) vs 32832 (1%)
                            ProjectionExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(96) vs 32832 (1%)
                              FilterExec: output_rows=Inexact(8) vs 4 (50%) output_bytes=Inexact(59) vs 32768 (1%)
                                RepartitionExec: output_rows=Inexact(1000) vs 63 (7%) output_bytes=Inexact(8000) vs 65536 (13%)
                                  DataSourceExec: output_rows=Inexact(1000) vs 63 (7%) output_bytes=Inexact(8000) vs 504 (7%)
                          DataSourceExec: output_rows=Inexact(1000) vs 1000 (100%) output_bytes=Inexact(40000) vs 63360 (64%)
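The percentages above are consistent with an accuracy factor defined as the smaller of the two values divided by the larger, e.g. Inexact(8) vs 1 displays as 13% (1/8, rounded up). Here is a sketch of that formula under this assumption; accuracy_factor is a hypothetical name, and the PR's exact rounding and edge-case handling may differ:

/// Hypothetical accuracy factor: ratio of the smaller value to the larger,
/// so a perfect estimate scores 1.0 and over- or under-estimation is
/// penalized symmetrically.
fn accuracy_factor(estimated: u64, observed: u64) -> f64 {
    match (estimated, observed) {
        // Assumed convention: agreeing that nothing was produced is a perfect score.
        (0, 0) => 1.0,
        (e, o) => e.min(o) as f64 / e.max(o) as f64,
    }
}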

Are these changes tested?

These changes consist exclusively of new integration tests.

Are there any user-facing changes?

No

@gabotechs gabotechs marked this pull request as draft February 11, 2026 15:44
@github-actions github-actions bot added the core Core DataFusion crate label Feb 11, 2026
@gabotechs gabotechs force-pushed the add-statistics-tests branch from 4cdcdba to 2ee489c on February 11, 2026 15:47
@gabotechs (Contributor, Author) commented:

I'm not sure whether these file sizes are acceptable for this project. The new parquet files are the original TPC-DS benchmark files, sampled down to at most 1000 rows each.

The files total 1.8 MB. I'd usually consider that acceptable to commit directly to a git repository, but let me know whether this falls within the project's policy.

Comment on lines +34 to +38
assert_snapshot!(display, @r"
row_estimation_accuracy=20%
byte_estimation_accuracy=8%
");
Ok(())
@gabotechs (Contributor, Author) commented:

The idea is that further contributions should see these numbers increase.
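For example, after improving an estimator, re-running these tests and reviewing the changed snapshots with insta's usual workflow (cargo insta test, then cargo insta review) would surface the new, hopefully higher, accuracy percentages for acceptance.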

