[SPARK-55854][SQL] Tag pass-through duplicate attributes in Expand output to prevent AMBIGUOUS_REFERENCE #54641

Open
mihailotim-db wants to merge 1 commit into apache:master from mihailotim-db:mihailo-timotic_data/expand_qualify

@mihailotim-db
Contributor

What changes were proposed in this pull request?

When Expand is created for ROLLUP/CUBE/GROUPING SETS, its output contains duplicate-named attributes: the original pass-through child attribute (e.g., a#0) and a new grouping instance created via newInstance() (e.g., a#5). Both share the same name, which causes AMBIGUOUS_REFERENCE errors when any operator performs name-based resolution against the Expand output.

This PR tags pass-through child attributes with __is_duplicate metadata in Expand.apply(), so that AttributeSeq.getCandidatesForResolution deprioritizes them when multiple candidates match by name. This is the same mechanism already used by DeduplicateUnionChildOutput for Union operators.

Only attributes whose ExprId matches a simple Attribute child of a groupByAlias are tagged. Complex grouping expressions (e.g., c1 + 1) produce aliases whose names differ from every child column, so no name conflict arises. ExprId-based resolution (used for already-resolved expressions like aggregate functions) is unaffected.
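
To make the tagging rule and the deprioritization concrete, here is a minimal, self-contained sketch in plain Python. It is a toy model, not Spark code: `Attr`, `GroupByAlias`, `tag_pass_through`, and `resolve` are hypothetical names chosen for illustration, and the real logic lives in `Expand.apply()` and `AttributeSeq.getCandidatesForResolution`. Only the `__is_duplicate` key is taken from the description above.

```python
from dataclasses import dataclass, replace

IS_DUPLICATE = "__is_duplicate"  # metadata key named in the PR description


@dataclass(frozen=True)
class Attr:
    """Toy stand-in for a Catalyst attribute: name, ExprId, metadata keys."""
    name: str
    expr_id: int
    metadata: frozenset = frozenset()


@dataclass(frozen=True)
class GroupByAlias:
    """Toy grouping alias; `child` is an Attr or an expression string."""
    name: str
    child: object


def tag_pass_through(child_output, group_by_aliases):
    """Tag child attributes whose ExprId matches a simple-Attribute grouping child."""
    grouped_ids = {
        g.child.expr_id for g in group_by_aliases if isinstance(g.child, Attr)
    }
    return [
        replace(a, metadata=a.metadata | {IS_DUPLICATE})
        if a.expr_id in grouped_ids
        else a
        for a in child_output
    ]


def resolve(name, output):
    """Name-based resolution that prefers untagged candidates when several match."""
    candidates = [a for a in output if a.name == name]
    if len(candidates) > 1:
        untagged = [a for a in candidates if IS_DUPLICATE not in a.metadata]
        candidates = untagged or candidates
    if len(candidates) != 1:
        raise LookupError(f"AMBIGUOUS_REFERENCE or unresolved name: {name}")
    return candidates[0]


# ROLLUP(a): a#0 is the pass-through child attribute, a#5 the new grouping instance.
a0, b1, c2 = Attr("a", 0), Attr("b", 1), Attr("c", 2)
a5, gid = Attr("a", 5), Attr("gid", 3)
expand_output = tag_pass_through([a0, b1, c2], [GroupByAlias("a", a0)]) + [a5, gid]

# ROLLUP(a + 1): the grouping child is not a simple attribute, so nothing is tagged.
untouched = tag_pass_through([a0, b1, c2], [GroupByAlias("(a + 1)", "a + 1")])
```

In this model, `resolve("a", expand_output)` returns the grouping instance `a#5`, while the tagged pass-through attribute keeps its name and ExprId, so a lookup by ExprId would still find it.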

The fix is guarded behind a new internal config spark.sql.analyzer.expandTagPassthroughDuplicates (default true).

Why are the changes needed?

The Expand operator for ROLLUP/CUBE/GROUPING SETS produces an output like [a#0, b#1, c#2, a#5, gid#3] where a#0 is the pass-through child attribute and a#5 is the new grouping attribute. Both have the name "a". When any operator above the Expand resolves the reference "a" by name (e.g., a Filter or Project inserted by a custom analysis rule, or a correlated subquery whose outer reference resolves against the Expand's output), getCandidatesForResolution returns two candidates, and resolve() throws an AMBIGUOUS_REFERENCE error.
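
The failure mode can be reproduced with a small stand-alone model. The sketch below is plain Python, not Spark: `Attr`, `AmbiguousReference`, `resolve_by_name`, and `resolve_by_expr_id` are hypothetical stand-ins, and the output list mirrors the `[a#0, b#1, c#2, a#5, gid#3]` example above.

```python
from collections import namedtuple

# Toy attribute: just a name and an ExprId.
Attr = namedtuple("Attr", ["name", "expr_id"])


class AmbiguousReference(Exception):
    """Toy counterpart of the AMBIGUOUS_REFERENCE analysis error."""


def resolve_by_name(name, output):
    """Name-based lookup with no deprioritization: two matches is an error."""
    candidates = [a for a in output if a.name == name]
    if len(candidates) > 1:
        ids = ", ".join(f"{a.name}#{a.expr_id}" for a in candidates)
        raise AmbiguousReference(f"'{name}' matches {ids}")
    if not candidates:
        raise KeyError(name)
    return candidates[0]


def resolve_by_expr_id(expr_id, output):
    """ExprId-based lookup: unaffected by duplicate names."""
    return next(a for a in output if a.expr_id == expr_id)


# Expand output for a grouping on column a, as in the example above.
expand_output = [
    Attr("a", 0),    # pass-through child attribute
    Attr("b", 1),
    Attr("c", 2),
    Attr("a", 5),    # new grouping attribute
    Attr("gid", 3),
]
```

Here `resolve_by_name("a", expand_output)` raises because both `a#0` and `a#5` match, whereas `resolve_by_name("b", expand_output)` and `resolve_by_expr_id(5, expand_output)` succeed. This is why only operators that resolve by name against the Expand output are exposed to the error.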

Does this PR introduce any user-facing change?

No. The fix prevents a latent AMBIGUOUS_REFERENCE error in name-based resolution against Expand output. Standard SQL queries are not affected because the Aggregate above the Expand already shields upper operators from seeing the duplicate names. The fix is defensive and makes the Expand output safe for any future feature or custom rule that may resolve names against it.

How was this patch tested?

7 new unit tests in ResolveGroupingAnalyticsSuite:

  • Tagging behavior (flag enabled, default):

    • Tags pass-through attribute for simple single-column grouping (ROLLUP(a))
    • Does not tag for complex grouping expressions (ROLLUP(a + 1))
    • Tags multiple pass-through attributes for multi-column grouping (ROLLUP(a, b))
    • Preserves ExprId and name on tagged attributes
    • Demonstrates that resolve("a") succeeds with tagging and throws AMBIGUOUS_REFERENCE without tagging
  • Flag disabled behavior:

    • No tagging for single-column grouping; resolve("a") throws AMBIGUOUS_REFERENCE
    • No tagging for multi-column grouping; resolve("a") and resolve("b") both throw AMBIGUOUS_REFERENCE

All 9 pre-existing tests in ResolveGroupingAnalyticsSuite continue to pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude claude-4.6-opus-high-thinking (Cursor)

mihailotim-db changed the title from "fix" to "[SPARK-55854][SQL] Tag pass-through duplicate attributes in Expand output to prevent AMBIGUOUS_REFERENCE" on Mar 5, 2026