[SPARK-55928][SQL] New linter for config effectiveness in views, UDFs and procedures #54653
Open
mihailotim-db wants to merge 1 commit into apache:master from
What changes were proposed in this pull request?
This PR introduces a `ConfigBindingPolicy` framework that requires all newly added Spark configurations to explicitly declare how their values are bound when used within SQL views, UDFs, or procedures.

Background: Conf + views mechanics
There are 3 ways Spark configs can interact with views:
The confusion arises for configurations that are not captured on view/UDF/procedure creation, but still need to be used when querying them. The common assumption is that if a conf is not preserved upon creation, its value inside the view/UDF/procedure will be whatever the value is in the currently active session. This is not true.
If a conf is not preserved on creation, its value when querying the view/UDF/procedure will be the Spark default, unless the conf is present in a hardcoded allowlist of session-propagated flags (`Analyzer.RETAINED_ANALYSIS_FLAGS`). This allowlist is extremely non-obvious and easy to forget about. This has caused regressions in the past where new configs affecting query semantics were not added to the allowlist, causing views and UDFs to silently use Spark defaults instead of session values.
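The resolution rule above can be sketched as a small, self-contained Scala model. This is illustrative only, not Spark source: the object and method names are made up, and the single retained flag is just a stand-in for the real allowlist.

```scala
object ViewConfResolutionSketch {
  // Stand-in for Analyzer.RETAINED_ANALYSIS_FLAGS: confs whose session
  // value propagates into view/UDF resolution even when not persisted.
  val retainedFlags: Set[String] = Set("spark.sql.planChangeLog.level")

  // Effective value of a conf while resolving a view:
  //   1) the value persisted at view creation time, if any;
  //   2) else the current session value, but only for allowlisted confs;
  //   3) else the Spark default -- NOT the session value.
  def effectiveValue(
      key: String,
      default: String,
      sessionConfs: Map[String, String],
      persistedConfs: Map[String, String]): String = {
    persistedConfs.get(key)
      .orElse(if (retainedFlags.contains(key)) sessionConfs.get(key) else None)
      .getOrElse(default)
  }

  def main(args: Array[String]): Unit = {
    val session = Map(
      "spark.sql.someNewFlag" -> "true",
      "spark.sql.planChangeLog.level" -> "warn")
    // Not persisted and not allowlisted: the session value is silently ignored.
    assert(effectiveValue("spark.sql.someNewFlag", "false", session, Map.empty) == "false")
    // Allowlisted: the session value propagates into view resolution.
    assert(effectiveValue("spark.sql.planChangeLog.level", "trace", session, Map.empty) == "warn")
    println("ok")
  }
}
```

The surprising branch is the fall-through to `default` in step 3: a conf that is neither persisted nor allowlisted behaves as if the session never set it.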
Changes
`ConfigBindingPolicy` enum (common/utils) with three values:
- `SESSION`: The config value propagates from the active session to views/UDFs/procedures.
- `PERSISTED`: The config uses the value saved at view/UDF/procedure creation time, or the Spark default if none was saved.
- `NOT_APPLICABLE`: The config does not interact with view/UDF/procedure resolution. If accessed at runtime, it behaves the same as `SESSION`.

`ConfigBuilder.withBindingPolicy()`: A new builder method to declare the binding policy when defining a config.

`ConfigEntry.bindingPolicy`: A new field on all config entries to store the declared policy.

Analyzer: Replaces the hardcoded `RETAINED_ANALYSIS_FLAGS` list with a dynamic lookup that retains all configs with a `SESSION` or `NOT_APPLICABLE` binding policy when resolving views and SQL UDFs.

`withBindingPolicy(SESSION)` added to configs that were previously in the hardcoded `RETAINED_ANALYSIS_FLAGS` list:
- `PLAN_CHANGE_LOG_LEVEL`, `EXPRESSION_TREE_CHANGE_LOG_LEVEL`, `VIEW_SCHEMA_EVOLUTION_PRESERVE_USER_COMMENTS` (in `SQLConf`)
- `CONVERT_METASTORE_PARQUET`, `CONVERT_METASTORE_ORC`, `CONVERT_INSERTING_PARTITIONED_TABLE`, `CONVERT_METASTORE_CTAS` (in `HiveUtils`)

`SparkConfigBindingPolicySuite`: A new test suite that fails if any newly added config does not declare a `bindingPolicy`, unless it is in an explicit exceptions allowlist. Existing configs without a binding policy have been grandfathered into the allowlist.

When to use which policy
- `SESSION` is the most common policy. Use it for feature flags or bugfix kill-switches where uniform behavior across the entire query is desired. Examples: enabling the single-pass analyzer (`spark.sql.analyzer.singlePassResolver.enabledTentatively`), plan change logging (`spark.sql.planChangeLog.level`), bugfixes (`spark.sql.analyzer.preferColumnOverLcaInArrayIndex`). Think about it this way: if you make a behavior change and roll it out enabled by default, then discover a bug and need to revert it -- unless the policy is `SESSION`, existing views will still have the old behavior baked in.
- `PERSISTED` should be used for configs that carry view semantic meaning and should stay consistent regardless of session changes. A good example is ANSI mode -- views created with ANSI off should always run with ANSI off, regardless of the session value. Examples: ANSI mode, session timezone.
- `NOT_APPLICABLE` should be used for configs that don't interact with view/UDF/procedure resolution at all. Only choose this if you are confident the config doesn't interact with view/UDF/procedure analysis. Examples: UI confs, server confs.

Why are all confs affected by the linter?
Even within analysis, Spark can trigger a Spark job recursively which would potentially reference any conf (for example, this is needed for schema inference). The linter is active for all newly added confs regardless of whether they directly interact with view analysis.
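Under this scheme, every new conf declares its policy at definition time. The following self-contained sketch shows roughly what such a declaration could look like; the builder and entry shapes are assumptions, heavily simplified from Spark's real `ConfigBuilder`/`ConfigEntry`.

```scala
// Minimal model of the declaration API described above (assumed names).
object ConfigBindingPolicy extends Enumeration {
  val SESSION, PERSISTED, NOT_APPLICABLE = Value
}

final case class ConfigEntry[T](
    key: String,
    defaultValue: T,
    bindingPolicy: Option[ConfigBindingPolicy.Value])

final class ConfigBuilder(key: String) {
  private var policy: Option[ConfigBindingPolicy.Value] = None

  // New in this PR: declare how the conf binds inside views/UDFs/procedures.
  def withBindingPolicy(p: ConfigBindingPolicy.Value): ConfigBuilder = {
    policy = Some(p)
    this
  }

  def createWithDefault[T](default: T): ConfigEntry[T] =
    ConfigEntry(key, default, policy)
}

object Demo {
  def main(args: Array[String]): Unit = {
    // A hypothetical new feature flag: SESSION so a rollback via session
    // conf also takes effect inside existing views.
    val entry = new ConfigBuilder("spark.sql.example.newFeature.enabled")
      .withBindingPolicy(ConfigBindingPolicy.SESSION)
      .createWithDefault(false)
    assert(entry.bindingPolicy.contains(ConfigBindingPolicy.SESSION))
    println("ok")
  }
}
```

A conf built without `withBindingPolicy` ends up with `bindingPolicy = None`, which is exactly the state the linter flags for new additions.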
Why not fix all existing confs?
Currently, there are over a thousand distinct configs in Spark. Fixing every single conf would introduce behavior changes. The linter only enforces the policy on new additions. Existing confs have been added to an exceptions allowlist. The long-term goal is to have all configs declare a binding policy and remove the exceptions allowlist entirely.
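The enforcement described above can be sketched as follows. This is an illustrative model, not the actual `SparkConfigBindingPolicySuite` code: every conf must either declare a `bindingPolicy` or appear in the exceptions file, allowlist entries must not redundantly carry a policy, and the exceptions file must stay sorted.

```scala
object BindingPolicyLinterSketch {
  // Simplified stand-in for a registered config entry.
  final case class Entry(key: String, bindingPolicy: Option[String])

  def lint(entries: Seq[Entry], exceptionLines: Seq[String]): Seq[String] = {
    val exceptions = exceptionLines.toSet
    // New confs without a declared policy that are not grandfathered.
    val missing = entries.collect {
      case e if e.bindingPolicy.isEmpty && !exceptions.contains(e.key) =>
        s"${e.key}: new config must declare a binding policy"
    }
    // Confs that declared a policy but were left in the allowlist.
    val redundant = entries.collect {
      case e if e.bindingPolicy.isDefined && exceptions.contains(e.key) =>
        s"${e.key}: declares a policy; remove it from the exceptions allowlist"
    }
    // The allowlist file must be kept alphabetically sorted.
    val unsorted =
      if (exceptionLines == exceptionLines.sorted) Nil
      else Seq("exceptions file should be sorted alphabetically")
    missing ++ redundant ++ unsorted
  }

  def main(args: Array[String]): Unit = {
    val entries = Seq(
      Entry("spark.a", Some("SESSION")), // new conf, declares a policy: ok
      Entry("spark.b", None),            // grandfathered via the allowlist: ok
      Entry("spark.c", None))            // new conf, no policy: flagged
    val errors = lint(entries, Seq("spark.b"))
    assert(errors == Seq("spark.c: new config must declare a binding policy"))
    println("ok")
  }
}
```

Shrinking the allowlist over time then monotonically increases coverage: removing a key from the exceptions file makes its missing-policy error reappear until a policy is declared.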
Why are the changes needed?
The `Analyzer.RETAINED_ANALYSIS_FLAGS` list was a manually maintained hardcoded allowlist of configs that should propagate from the active session when resolving views and SQL UDFs. This approach is error-prone: developers adding new configs that affect query semantics could easily forget to add them to this list, causing subtle bugs where views and UDFs silently use Spark defaults instead of session values.

By requiring an explicit `ConfigBindingPolicy` declaration on every new config, developers are forced to think about how their config interacts with views, UDFs, and procedures at definition time. The enforcement test catches any new config that lacks this declaration, preventing regressions.

Does this PR introduce any user-facing change?
No.
How was this patch tested?
New test suite `SparkConfigBindingPolicySuite` with three tests:
- Test adding bindingPolicy to config: Verifies a config with `SESSION` policy has the correct binding policy set.
- Config enforcement for bindingPolicy: Ensures all registered configs either have a `bindingPolicy` declared or are in the exceptions allowlist, and that configs with `bindingPolicy` set are not redundantly in the allowlist.
- configs-without-binding-policy-exceptions file should be sorted alphabetically: Validates the exceptions file ordering.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor with Claude claude-4.6-opus-high-thinking