
fix: enable Corr #3892

Open

kazuyukitanimura wants to merge 12 commits into apache:main from kazuyukitanimura:fix-2646

Conversation

@kazuyukitanimura
Contributor

@kazuyukitanimura commented Apr 3, 2026

Which issue does this PR close?

Closes #2646

Rationale for this change

This is a fix for the behavior for #2646 (comment)

What changes are included in this PR?

When both inputs to Corr are NaN, return Null
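The intended semantics can be sketched with a small self-contained example. This is a hypothetical illustration of the behavior described above, not the actual Comet/DataFusion code; the `corr` function and the skip-NaN strategy are assumptions made for the sketch:

```rust
/// Hypothetical sketch (not Comet's implementation): Pearson correlation
/// where pairs containing NaN are ignored, so a group whose only pair is
/// (NaN, NaN) yields None (SQL NULL) rather than NaN.
fn corr(xs: &[f64], ys: &[f64]) -> Option<f64> {
    // Keep only pairs where neither value is NaN.
    let pairs: Vec<(f64, f64)> = xs
        .iter()
        .zip(ys.iter())
        .filter(|(x, y)| !x.is_nan() && !y.is_nan())
        .map(|(&x, &y)| (x, y))
        .collect();
    let n = pairs.len() as f64;
    if n < 1.0 {
        return None; // every pair was NaN (or input empty) -> NULL
    }
    let (sx, sy) = pairs.iter().fold((0.0, 0.0), |(a, b), (x, y)| (a + x, b + y));
    let (mx, my) = (sx / n, sy / n);
    let (mut cov, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for (x, y) in &pairs {
        cov += (x - mx) * (y - my);
        vx += (x - mx).powi(2);
        vy += (y - my).powi(2);
    }
    let denom = (vx * vy).sqrt();
    if denom == 0.0 {
        return None; // null-on-divide-by-zero behavior
    }
    Some(cov / denom)
}
```

Under this sketch, `corr(&[f64::NAN], &[f64::NAN])` returns `None`, matching the NULL expected for the `both_nan` group.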

How are these changes tested?

Added tests

@kazuyukitanimura kazuyukitanimura marked this pull request as ready for review April 3, 2026 18:02
@comphead
Contributor

What exactly is the query that failed in Spark? I checked, and DF corr and PGSQL corr work the same.

```
> CREATE TABLE test_corr_nan(x double, y double, grp string);
0 row(s) fetched.
Elapsed 0.025 seconds.

> INSERT INTO test_corr_nan VALUES (cast('NaN' as double), cast('NaN' as double), 'both_nan'), (cast('NaN' as double), 1.0, 'nan_val'), (1.0, cast('NaN' as double), 'val_nan'), (NULL, cast('NaN' as double), 'null_nan'), (cast('NaN' as double), NULL, 'nan_null'), (NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null');
+-------+
| count |
+-------+
| 8     |
+-------+
1 row(s) fetched.
Elapsed 0.016 seconds.

> SELECT grp, corr(x, y) FROM test_corr_nan GROUP BY grp ORDER BY grp;
+-----------+---------------------------------------+
| grp       | corr(test_corr_nan.x,test_corr_nan.y) |
+-----------+---------------------------------------+
| both_nan  | NaN                                   |
| both_null | NULL                                  |
| nan_null  | NULL                                  |
| nan_val   | NULL                                  |
| null_nan  | NULL                                  |
| null_val  | NULL                                  |
| val_nan   | NULL                                  |
| val_null  | NULL                                  |
+-----------+---------------------------------------+
8 row(s) fetched.
Elapsed 0.036 seconds.
```

PGSQL:

```
CREATE TABLE test_corr_nan(x float, y float, grp varchar);

INSERT INTO test_corr_nan VALUES (
cast('NaN' as float), cast('NaN' as float), 'both_nan'), (
cast('NaN' as float), 1.0, 'nan_val'),
(1.0, cast('NaN' as float), 'val_nan'),
(NULL, cast('NaN' as float), 'null_nan'),
(cast('NaN' as float), NULL, 'nan_null'),
(NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null');


SELECT grp, corr(x, y) FROM test_corr_nan GROUP BY grp ORDER BY grp;

    grp    | corr
-----------+------
 both_nan  |  NaN
 both_null |
 nan_null  |
 nan_val   |
 null_nan  |
 null_val  |
 val_nan   |
 val_null  |
(8 rows)
```

@kazuyukitanimura
Contributor Author

> What exactly is the query that failed in Spark?

Thanks @comphead. Just to double-check, you haven't enabled Comet for Spark?
The original issue is
#2646 (comment)

@parthchandra changed the title from "chor: enable Corr" to "chore: enable Corr" on Apr 11, 2026
@comphead
Contributor

I have a feeling Comet is not using the DF-based corr and uses its own implementation:

`impl AggregateUDFImpl for Correlation`

The more correct behavior is to delegate this to DF, like for count:

`AggregateExprBuilder::new(count_udaf(), children)`

Contributor

@comphead left a comment


Thanks @kazuyukitanimura. I think we need to try to delegate corr to DF's `correlation::corr_udaf()` and remove Comet's old implementation.

Contributor

@parthchandra left a comment


lgtm. Suggestions are non-blocking

```
CREATE TABLE test_corr_nan(x double, y double, grp string) USING parquet

statement
INSERT INTO test_corr_nan VALUES (cast('NaN' as double), cast('NaN' as double), 'both_nan'), (cast('NaN' as double), 1.0, 'nan_val'), (1.0, cast('NaN' as double), 'val_nan'), (NULL, cast('NaN' as double), 'null_nan'), (cast('NaN' as double), NULL, 'nan_null'), (NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null')
```
Contributor


Maybe add a group with mixed NaN and valid rows (e.g. [(NaN, NaN), (1.0, 2.0), (3.0, 4.0)]).

Member


I also was going to suggest adding a group with multiple NaN rows (e.g. 2-3 rows of (NaN, NaN, 'multi_nan')) to make sure the wrapping approach works when n > 1

Contributor Author


updated


```scala
object CometCorr extends CometAggregateExpressionSerde[Corr] {

override def getSupportLevel(expr: Corr): SupportLevel =
```
Contributor


Claude flagged some edge cases we can document:

1. Legacy mode: When `spark.sql.legacy.statisticalAggregate=true`, nullOnDivideByZero is false and Spark returns NaN for the n=1 case. With this workaround, Comet would return null instead (because the NaN row gets skipped, so n=0). Should we add a getSupportLevel guard that returns Incompatible when `corr.nullOnDivideByZero` is false? Or at least document this?
2. Mixed groups: For a group containing (NaN, NaN) alongside valid pairs like (1.0, 2.0), Spark returns NaN (NaN contaminates the accumulator), while this workaround would skip the NaN row and compute a valid correlation over the remaining rows. Is that a known limitation we're OK with?
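The mixed-group difference can be demonstrated with a small self-contained sketch. This is hypothetical code for illustration only, not Comet's actual accumulator; both function names are made up:

```rust
// Spark-style semantics: NaN flows through the running sums, so a single
// (NaN, NaN) pair contaminates the whole group and the result is NaN.
fn corr_propagate_nan(pairs: &[(f64, f64)]) -> f64 {
    let n = pairs.len() as f64;
    let (sx, sy) = pairs.iter().fold((0.0, 0.0), |(a, b), (x, y)| (a + x, b + y));
    let (mx, my) = (sx / n, sy / n);
    let (mut ck, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for (x, y) in pairs {
        ck += (x - mx) * (y - my);
        vx += (x - mx) * (x - mx);
        vy += (y - my) * (y - my);
    }
    ck / (vx.sqrt() * vy.sqrt()) // any NaN input makes this NaN
}

// Workaround-style semantics: NaN pairs are dropped before aggregation, so
// the same group yields a valid correlation over the remaining rows.
fn corr_skip_nan(pairs: &[(f64, f64)]) -> f64 {
    let kept: Vec<(f64, f64)> = pairs
        .iter()
        .copied()
        .filter(|(x, y)| !x.is_nan() && !y.is_nan())
        .collect();
    corr_propagate_nan(&kept)
}
```

For `[(NaN, NaN), (1.0, 2.0), (3.0, 4.0)]` the first function yields NaN (Spark's answer), while the second computes the correlation of the two remaining pairs, which is the divergence described in point 2.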

Member


Same.

Worth double-checking: the original incompatibility note said "returns null instead of NaN in some edge cases." That also describes the behavior in correlation.rs:evaluate() when stddev is zero (constant values produce stddev=0, Spark returns NaN from 0/0, Comet returns null from the null_on_divide_by_zero guard). Is that case also resolved, or should Incompatible remain for that scenario?
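The stddev-zero case comes down to IEEE-754 division: constant inputs give zero variance, the Pearson denominator becomes 0, and 0.0 / 0.0 evaluates to NaN. A tiny hypothetical sketch of the guard described above (illustrative names, not the actual correlation.rs code):

```rust
// Hypothetical sketch of the divide-by-zero guard: when both variances are
// zero (constant columns), Spark legacy mode surfaces the 0/0 NaN, while
// null_on_divide_by_zero returns None (SQL NULL) instead.
fn corr_from_moments(var_x: f64, var_y: f64, cov: f64, null_on_divide_by_zero: bool) -> Option<f64> {
    let denom = (var_x * var_y).sqrt();
    if denom == 0.0 && null_on_divide_by_zero {
        return None; // guard: NULL instead of NaN
    }
    Some(cov / denom) // 0.0 / 0.0 == NaN when both variances are zero
}
```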

Contributor Author


Ended up fixing this properly for the legacy mode as well. Added tests too.

@kazuyukitanimura changed the title from "chore: enable Corr" to "fix: enable Corr" on Apr 14, 2026
@kazuyukitanimura
Contributor Author

Thanks @comphead @parthchandra @andygrove
I think I properly fixed this rather than working around it.

@comphead
Contributor

> Thanks @comphead @parthchandra @andygrove I think I properly fixed this rather than working around it.

I would still try to delegate corr to DF instead of Comet's proprietary code; it would make our codebase more lightweight.

@kazuyukitanimura
Contributor Author

> Thanks @comphead @parthchandra @andygrove I think I properly fixed this rather than working around it.

> I would still try to delegate corr to DF instead of Comet's proprietary code; it would make our codebase more lightweight.

There are some behavior differences, so in order to use DF's correlation we would need to create a Spark expr in DF. Perhaps that will be next time.



Development

Successfully merging this pull request may close these issues.

fuzz test failure: corr null vs Nan

4 participants