[SPARK-55928][SQL] New linter for config effectiveness in views, UDFs and procedures#54653

Open

mihailotim-db wants to merge 1 commit into apache:master from mihailotim-db:mihailo-timotic_data/config_binding_policy

Conversation


@mihailotim-db mihailotim-db commented Mar 6, 2026

What changes were proposed in this pull request?

This PR introduces a ConfigBindingPolicy framework that enforces all newly added Spark configurations to explicitly declare how their values are bound when used within SQL views, UDFs, or procedures.

Background: Conf + views mechanics

There are three ways Spark configs can interact with views:

  1. The conf value is stored with a view/UDF/procedure on creation and applied on read; the session value is deprioritized. Examples: the ANSI conf, timezone.
  2. The conf is not stored with a view, but its value propagates through the view from the active session. Examples: kill-switches, feature flags.
  3. The conf is neither stored with a view nor propagated through it. This is the historical default in Spark.

The confusion arises for configurations that are not captured on view/UDF/procedure creation but still need to be read when querying them. The common assumption is that if a conf is not preserved at creation, its value inside the view/UDF/procedure will be whatever the value is in the currently active session. This is not true.

If a conf is not preserved on creation, its value when querying the view/UDF/procedure will be:

  • The value from the currently active session, but only if the conf is in a hardcoded allowlist (RETAINED_ANALYSIS_FLAGS).
  • The Spark default otherwise.

This allowlist is extremely non-obvious and easy to forget about. It has caused regressions in the past where new configs affecting query semantics were not added to the allowlist, causing views and UDFs to silently use Spark defaults instead of session values.
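The fallback chain described above can be sketched as a small Scala model. This is a toy illustration, not the actual Analyzer code; the object, method, and map names are invented for clarity:

```scala
// Toy model of conf resolution inside a view. `retainedAnalysisFlags`
// mimics the hardcoded RETAINED_ANALYSIS_FLAGS allowlist.
object ViewConfModel {
  val retainedAnalysisFlags: Set[String] = Set("spark.sql.planChangeLog.level")

  // Value of `key` as observed when querying a view:
  // 1. value captured at view creation, if any
  // 2. otherwise the active session value, but ONLY if allowlisted
  // 3. otherwise the Spark default
  def valueInsideView(
      key: String,
      captured: Map[String, String],
      session: Map[String, String],
      defaults: Map[String, String]): String = {
    captured.get(key)
      .orElse(if (retainedAnalysisFlags.contains(key)) session.get(key) else None)
      .getOrElse(defaults(key))
  }
}
```

With this model, a new conf set in the session but missing from the allowlist silently resolves to the Spark default inside the view, which is exactly the regression class described above.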

Changes

  1. New ConfigBindingPolicy enum (common/utils) with three values:
    • SESSION: The config value propagates from the active session to views/UDFs/procedures.
    • PERSISTED: The config uses the value saved at view/UDF/procedure creation time, or the Spark default if none was saved.
    • NOT_APPLICABLE: The config does not interact with view/UDF/procedure resolution. If accessed at runtime, it behaves the same as SESSION.
  2. ConfigBuilder.withBindingPolicy(): A new builder method to declare the binding policy when defining a config.
  3. ConfigEntry.bindingPolicy: A new field on all config entries to store the declared policy.
  4. Dynamic retained config resolution in Analyzer: Replaces the hardcoded RETAINED_ANALYSIS_FLAGS list with a dynamic lookup that retains all configs with SESSION or NOT_APPLICABLE binding policy when resolving views and SQL UDFs.
  5. Binding policy annotations: Added withBindingPolicy(SESSION) to configs that were previously in the hardcoded RETAINED_ANALYSIS_FLAGS list:
    • PLAN_CHANGE_LOG_LEVEL, EXPRESSION_TREE_CHANGE_LOG_LEVEL, VIEW_SCHEMA_EVOLUTION_PRESERVE_USER_COMMENTS (in SQLConf)
    • CONVERT_METASTORE_PARQUET, CONVERT_METASTORE_ORC, CONVERT_INSERTING_PARTITIONED_TABLE, CONVERT_METASTORE_CTAS (in HiveUtils)
  6. Enforcement test (SparkConfigBindingPolicySuite): A new test suite that fails if any newly added config does not declare a bindingPolicy, unless it is in an explicit exceptions allowlist. Existing configs without a binding policy have been grandfathered into the allowlist.
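Taken together, the new pieces might look roughly like the following Scala sketch. This is a simplified illustration of the shapes described above, not the actual Spark source; the `build()` method and `retainedConfigs` helper are invented for the example:

```scala
// Sketch of the ConfigBindingPolicy enum and builder plumbing.
sealed trait ConfigBindingPolicy
case object SESSION extends ConfigBindingPolicy
case object PERSISTED extends ConfigBindingPolicy
case object NOT_APPLICABLE extends ConfigBindingPolicy

final case class ConfigEntry(key: String, bindingPolicy: Option[ConfigBindingPolicy])

final class ConfigBuilder(key: String) {
  private var policy: Option[ConfigBindingPolicy] = None
  def withBindingPolicy(p: ConfigBindingPolicy): ConfigBuilder = { policy = Some(p); this }
  def build(): ConfigEntry = ConfigEntry(key, policy)
}

// Dynamic replacement for the hardcoded RETAINED_ANALYSIS_FLAGS list:
// retain every conf whose declared policy is SESSION or NOT_APPLICABLE.
def retainedConfigs(registry: Seq[ConfigEntry]): Set[String] =
  registry.collect {
    case ConfigEntry(k, Some(SESSION | NOT_APPLICABLE)) => k
  }.toSet
```

The key design point is that the retained set is derived from declarations on the config entries themselves, so adding a new SESSION-policy conf automatically makes it propagate through views with no separate list to update.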

When to use which policy

  • SESSION is the most common policy. Use it for feature flags or bugfix kill-switches where uniform behavior across the entire query is desired. Examples: enabling the single-pass analyzer (spark.sql.analyzer.singlePassResolver.enabledTentatively), plan change logging (spark.sql.planChangeLog.level), bugfixes (spark.sql.analyzer.preferColumnOverLcaInArrayIndex). Think about it this way: if you make a behavior change, roll it out enabled by default, then discover a bug and need to revert it, existing views will still have the old behavior baked in unless the policy is SESSION.
  • PERSISTED should be used for configs that carry semantic meaning for a view and should stay consistent regardless of session changes. A good example is ANSI mode: views created with ANSI off should always run with ANSI off, regardless of the session value. Examples: ANSI mode, session timezone.
  • NOT_APPLICABLE should be used for configs that don't interact with view/UDF/procedure resolution at all. Only choose this if you are confident the config doesn't interact with view/UDF/procedure analysis. Examples: UI confs, server confs.
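As a hedged illustration of this guidance (using a toy `Conf` type invented here, not Spark's config classes), the three cases might be declared as:

```scala
// Toy types for illustration only.
sealed trait Policy
case object SESSION extends Policy
case object PERSISTED extends Policy
case object NOT_APPLICABLE extends Policy
final case class Conf(key: String, policy: Policy)

// Bugfix kill-switch: the session value must win inside views,
// otherwise reverting the fix leaves old behavior baked into views.
val lcaFix = Conf("spark.sql.analyzer.preferColumnOverLcaInArrayIndex", SESSION)

// Semantic conf: the value captured at view creation must win.
val ansi = Conf("spark.sql.ansi.enabled", PERSISTED)

// UI conf: never touches view/UDF/procedure analysis.
val uiPort = Conf("spark.ui.port", NOT_APPLICABLE)
```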

Why are all confs affected by the linter?

Even during analysis, Spark can recursively trigger a Spark job, and that job can potentially read any conf (for example, this happens during schema inference). The linter is therefore active for all newly added confs, regardless of whether they directly interact with view analysis.

Why not fix all existing confs?

Currently, there are over a thousand distinct configs in Spark. Fixing every single conf would introduce behavior changes. The linter only enforces the policy on new additions. Existing confs have been added to an exceptions allowlist. The long-term goal is to have all configs declare a binding policy and remove the exceptions allowlist entirely.

Why are the changes needed?

The Analyzer.RETAINED_ANALYSIS_FLAGS list was a manually maintained hardcoded allowlist of configs that should propagate from the active session when resolving views and SQL UDFs. This approach is error-prone: developers adding new configs that affect query semantics could easily forget to add them to this list, causing subtle bugs where views and UDFs silently use Spark defaults instead of session values.
By requiring an explicit ConfigBindingPolicy declaration on every new config, developers are forced to think about how their config interacts with views, UDFs, and procedures at definition time. The enforcement test catches any new config that lacks this declaration, preventing regressions.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New test suite SparkConfigBindingPolicySuite with three tests:

  • Test adding bindingPolicy to config: Verifies a config with SESSION policy has the correct binding policy set.
  • Config enforcement for bindingPolicy: Ensures all registered configs either have a bindingPolicy declared or are in the exceptions allowlist, and that configs with bindingPolicy set are not redundantly in the allowlist.
  • configs-without-binding-policy-exceptions file should be sorted alphabetically: Validates the exceptions file ordering.
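The second and third checks can be modeled with a small Scala sketch. This is an illustrative model of the enforcement logic described above, not the suite's actual code; `Entry` and `violations` are invented names:

```scala
// Toy model of the enforcement check: every registered config must either
// declare a binding policy or appear in the exceptions allowlist, but not both.
final case class Entry(key: String, bindingPolicy: Option[String])

def violations(registered: Seq[Entry], exceptions: Set[String]): Seq[String] = {
  // New conf with no policy and no allowlist entry: the suite must fail.
  val missing = registered.collect {
    case e if e.bindingPolicy.isEmpty && !exceptions.contains(e.key) =>
      s"${e.key}: declare a bindingPolicy or add it to the exceptions file"
  }
  // Conf that declares a policy but is still allowlisted: redundant entry.
  val redundant = registered.collect {
    case e if e.bindingPolicy.isDefined && exceptions.contains(e.key) =>
      s"${e.key}: declares a policy, remove it from the exceptions file"
  }
  missing ++ redundant
}

// The third check reduces to requiring the exceptions file to be sorted.
def isSorted(exceptionsFile: Seq[String]): Boolean =
  exceptionsFile == exceptionsFile.sorted
```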

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor with Claude claude-4.6-opus-high-thinking

@mihailotim-db mihailotim-db force-pushed the mihailo-timotic_data/config_binding_policy branch from 5dd2812 to fb4a948 on March 6, 2026 12:47
@dongjoon-hyun dongjoon-hyun marked this pull request as draft on March 7, 2026 18:32
@mihailotim-db mihailotim-db force-pushed the mihailo-timotic_data/config_binding_policy branch 5 times, most recently from 12105f0 to fca91ec on March 10, 2026 07:45
@mihailotim-db mihailotim-db force-pushed the mihailo-timotic_data/config_binding_policy branch from fca91ec to 2581497 on March 10, 2026 07:57
@mihailotim-db mihailotim-db marked this pull request as ready for review on March 10, 2026 08:07
@mihailotim-db mihailotim-db changed the title "Config binding policy" to "[SPARK-55928][SQL] New linter for config effectiveness in views, UDFs and procedures" on March 10, 2026