
fix: enable Corr #3892

Open

kazuyukitanimura wants to merge 12 commits into apache:main from kazuyukitanimura:fix-2646

Conversation

@kazuyukitanimura
Contributor

@kazuyukitanimura commented Apr 3, 2026

Which issue does this PR close?

Closes #2646

Rationale for this change

This is a fix for the behavior for #2646 (comment)

What changes are included in this PR?

When both inputs to Corr are NaN, return Null
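The intended semantics can be sketched with a small self-contained example. This is a hypothetical illustration of the behavior described above, not the actual Comet/DataFusion code; the `corr` function and the skip-NaN strategy are assumptions made for the sketch:

```rust
/// Hypothetical sketch (not Comet's implementation): Pearson correlation
/// where pairs containing NaN are ignored, so a group whose only pair is
/// (NaN, NaN) yields None (SQL NULL) rather than NaN.
fn corr(xs: &[f64], ys: &[f64]) -> Option<f64> {
    // Keep only pairs where neither value is NaN.
    let pairs: Vec<(f64, f64)> = xs
        .iter()
        .zip(ys.iter())
        .filter(|(x, y)| !x.is_nan() && !y.is_nan())
        .map(|(&x, &y)| (x, y))
        .collect();
    let n = pairs.len() as f64;
    if n < 1.0 {
        return None; // every pair was NaN (or input empty) -> NULL
    }
    let (sx, sy) = pairs.iter().fold((0.0, 0.0), |(a, b), (x, y)| (a + x, b + y));
    let (mx, my) = (sx / n, sy / n);
    let (mut cov, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for (x, y) in &pairs {
        cov += (x - mx) * (y - my);
        vx += (x - mx).powi(2);
        vy += (y - my).powi(2);
    }
    let denom = (vx * vy).sqrt();
    if denom == 0.0 {
        return None; // null-on-divide-by-zero behavior
    }
    Some(cov / denom)
}
```

Under this sketch, `corr(&[f64::NAN], &[f64::NAN])` returns `None`, matching the NULL expected for the `both_nan` group.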

How are these changes tested?

Added tests

@kazuyukitanimura kazuyukitanimura marked this pull request as ready for review April 3, 2026 18:02
@comphead
Contributor

What exactly is the query that failed in Spark? I checked, and DF corr and PGSQL corr work the same.

```
> CREATE TABLE test_corr_nan(x double, y double, grp string);
0 row(s) fetched.
Elapsed 0.025 seconds.

> INSERT INTO test_corr_nan VALUES (cast('NaN' as double), cast('NaN' as double), 'both_nan'), (cast('NaN' as double), 1.0, 'nan_val'), (1.0, cast('NaN' as double), 'val_nan'), (NULL, cast('NaN' as double), 'null_nan'), (cast('NaN' as double), NULL, 'nan_null'), (NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null');
+-------+
| count |
+-------+
| 8     |
+-------+
1 row(s) fetched.
Elapsed 0.016 seconds.

> SELECT grp, corr(x, y) FROM test_corr_nan GROUP BY grp ORDER BY grp;
+-----------+---------------------------------------+
| grp       | corr(test_corr_nan.x,test_corr_nan.y) |
+-----------+---------------------------------------+
| both_nan  | NaN                                   |
| both_null | NULL                                  |
| nan_null  | NULL                                  |
| nan_val   | NULL                                  |
| null_nan  | NULL                                  |
| null_val  | NULL                                  |
| val_nan   | NULL                                  |
| val_null  | NULL                                  |
+-----------+---------------------------------------+
8 row(s) fetched.
Elapsed 0.036 seconds.
```

PGSQL:

```
CREATE TABLE test_corr_nan(x float, y float, grp varchar);

INSERT INTO test_corr_nan VALUES (
cast('NaN' as float), cast('NaN' as float), 'both_nan'), (
cast('NaN' as float), 1.0, 'nan_val'),
(1.0, cast('NaN' as float), 'val_nan'),
(NULL, cast('NaN' as float), 'null_nan'),
(cast('NaN' as float), NULL, 'nan_null'),
(NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null');


SELECT grp, corr(x, y) FROM test_corr_nan GROUP BY grp ORDER BY grp;

    grp    | corr
-----------+------
 both_nan  |  NaN
 both_null |
 nan_null  |
 nan_val   |
 null_nan  |
 null_val  |
 val_nan   |
 val_null  |
(8 rows)
```

@kazuyukitanimura
Contributor Author

> What exactly is the query that failed in Spark?

Thanks @comphead. Just to double-check, you haven't enabled Comet for Spark?
The original issue is
#2646 (comment)

@parthchandra changed the title from "chor: enable Corr" to "chore: enable Corr" on Apr 11, 2026
@comphead
Contributor

I have a feeling Comet is not using the DF-based corr and uses its own implementation:

`impl AggregateUDFImpl for Correlation`

The more correct behavior is to delegate this to DF, like for count:

`AggregateExprBuilder::new(count_udaf(), children)`

Contributor

@comphead left a comment


Thanks @kazuyukitanimura. I think we need to try to delegate corr to DF's `correlation::corr_udaf()` and remove Comet's old implementation.

Contributor

@parthchandra left a comment


lgtm. Suggestions are non-blocking

```
CREATE TABLE test_corr_nan(x double, y double, grp string) USING parquet

statement
INSERT INTO test_corr_nan VALUES (cast('NaN' as double), cast('NaN' as double), 'both_nan'), (cast('NaN' as double), 1.0, 'nan_val'), (1.0, cast('NaN' as double), 'val_nan'), (NULL, cast('NaN' as double), 'null_nan'), (cast('NaN' as double), NULL, 'nan_null'), (NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null')
```
Contributor


Maybe add a group with mixed NaN and valid rows (e.g. [(NaN, NaN), (1.0, 2.0), (3.0, 4.0)]).

Member


I also was going to suggest adding a group with multiple NaN rows (e.g. 2-3 rows of (NaN, NaN, 'multi_nan')) to make sure the wrapping approach works when n > 1

Contributor Author


updated


```scala
object CometCorr extends CometAggregateExpressionSerde[Corr] {

override def getSupportLevel(expr: Corr): SupportLevel =
```
Contributor


Claude flagged some edge cases we can document:

1. Legacy mode: When `spark.sql.legacy.statisticalAggregate=true`, nullOnDivideByZero is false and Spark returns NaN for the n=1 case. With this workaround, Comet would return null instead (because the NaN row gets skipped, so n=0). Should we add a getSupportLevel guard that returns Incompatible when `corr.nullOnDivideByZero` is false? Or at least document this?
2. Mixed groups: For a group containing (NaN, NaN) alongside valid pairs like (1.0, 2.0), Spark returns NaN (NaN contaminates the accumulator), while this workaround would skip the NaN row and compute a valid correlation over the remaining rows. Is that a known limitation we're OK with?
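The mixed-group difference can be demonstrated with a small self-contained sketch. This is hypothetical code for illustration only, not Comet's actual accumulator; both function names are made up:

```rust
// Spark-style semantics: NaN flows through the running sums, so a single
// (NaN, NaN) pair contaminates the whole group and the result is NaN.
fn corr_propagate_nan(pairs: &[(f64, f64)]) -> f64 {
    let n = pairs.len() as f64;
    let (sx, sy) = pairs.iter().fold((0.0, 0.0), |(a, b), (x, y)| (a + x, b + y));
    let (mx, my) = (sx / n, sy / n);
    let (mut ck, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for (x, y) in pairs {
        ck += (x - mx) * (y - my);
        vx += (x - mx) * (x - mx);
        vy += (y - my) * (y - my);
    }
    ck / (vx.sqrt() * vy.sqrt()) // any NaN input makes this NaN
}

// Workaround-style semantics: NaN pairs are dropped before aggregation, so
// the same group yields a valid correlation over the remaining rows.
fn corr_skip_nan(pairs: &[(f64, f64)]) -> f64 {
    let kept: Vec<(f64, f64)> = pairs
        .iter()
        .copied()
        .filter(|(x, y)| !x.is_nan() && !y.is_nan())
        .collect();
    corr_propagate_nan(&kept)
}
```

For `[(NaN, NaN), (1.0, 2.0), (3.0, 4.0)]` the first function yields NaN (Spark's answer), while the second computes the correlation of the two remaining pairs, which is the divergence described in point 2.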

Member


Same.

Worth double-checking: the original incompatibility note said "returns null instead of NaN in some edge cases." That also describes the behavior in correlation.rs:evaluate() when stddev is zero (constant values produce stddev=0, Spark returns NaN from 0/0, Comet returns null from the null_on_divide_by_zero guard). Is that case also resolved, or should Incompatible remain for that scenario?
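The stddev-zero case comes down to IEEE-754 division: constant inputs give zero variance, the Pearson denominator becomes 0, and 0.0 / 0.0 evaluates to NaN. A tiny hypothetical sketch of the guard described above (illustrative names, not the actual correlation.rs code):

```rust
// Hypothetical sketch of the divide-by-zero guard: when both variances are
// zero (constant columns), Spark legacy mode surfaces the 0/0 NaN, while
// null_on_divide_by_zero returns None (SQL NULL) instead.
fn corr_from_moments(var_x: f64, var_y: f64, cov: f64, null_on_divide_by_zero: bool) -> Option<f64> {
    let denom = (var_x * var_y).sqrt();
    if denom == 0.0 && null_on_divide_by_zero {
        return None; // guard: NULL instead of NaN
    }
    Some(cov / denom) // 0.0 / 0.0 == NaN when both variances are zero
}
```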

Contributor Author


Ended up fixing this properly for the legacy mode as well. Added tests too.

@kazuyukitanimura changed the title from "chore: enable Corr" to "fix: enable Corr" on Apr 14, 2026
@kazuyukitanimura
Contributor Author

Thanks @comphead @parthchandra @andygrove
I think I properly fixed this rather than working around it.

@comphead
Contributor

> Thanks @comphead @parthchandra @andygrove I think I properly fixed this rather than working around it.

I would still try to delegate corr to DF instead of Comet's proprietary code; it would make our codebase more lightweight.

@kazuyukitanimura
Contributor Author

> Thanks @comphead @parthchandra @andygrove I think I properly fixed this rather than working around it.

> I would still try to delegate corr to DF instead of Comet's proprietary code; it would make our codebase more lightweight.

There are some behavior differences, so in order to use DF's correlation we would need to create a Spark expr in DF. Perhaps that will be next time.



Development

Successfully merging this pull request may close these issues.

fuzz test failure: corr null vs Nan

4 participants