
[FEATURE]: ML-based Anomaly Detection for row-level (has_no_anomalies) #957

@vb-dbrks

Description


Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

When monitoring data quality at scale, users often need to detect "unusual" records that don't fit expected patterns, such as sensor malfunctions or data entry errors that are difficult to catch with simple threshold rules. Typical failure modes include:

  • Data drift - Distribution of values gradually shifts in ways that indicate something broke in the pipeline
  • Data entry errors - Someone enters values in wrong units (meters vs millimeters), transposes digits, or copies garbage
  • Integration mismatches - Data from System A looks completely different from usual after a silent schema change upstream
  • Multivariate anomalies - Individual columns look normal, but the combination is unusual (e.g., age=25, years_experience=40)

Currently, users must manually set static thresholds for each column. Issue #359 adds has_no_outliers() for single-column statistical outlier detection (MAD, Z-score, IQR), but this doesn't catch:

  1. Anomalies that only appear when looking at multiple columns together
  2. Distribution drift over time
  3. Patterns that are hard to define with simple statistical thresholds
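The multivariate case can be illustrated with a small sketch. This uses Mahalanobis distance as a simple stand-in for the proposed ML model, reusing the age/years_experience example from above (NumPy assumed; not part of the proposal itself):

```python
import numpy as np

# Jointly distributed "normal" data: experience roughly tracks age.
rng = np.random.default_rng(0)
age = rng.normal(40, 8, 500)
experience = age - 22 + rng.normal(0, 2, 500)
X = np.column_stack([age, experience])

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(row):
    """Distance of a record from the joint distribution of the data."""
    d = row - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Each value is plausible on its own, but the combination is not:
# age=25 with years_experience=40 sits far from the joint distribution,
# while age=40 with years_experience=18 is typical.
typical = mahalanobis(np.array([40.0, 18.0]))
anomalous = mahalanobis(np.array([25.0, 40.0]))
```

Per-column thresholds would pass both records; only the joint view exposes the second one.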

Databricks native anomaly detection monitors table freshness and row counts, but doesn't catch unusual values within individual records.

Proposed Solution

ML-based Anomaly Detection with simple APIs and hidden complexity

1. Training an Anomaly Model

We can add a new "anomaly" module, since we need a way to train a model on historical "normal" data. MLflow/Unity Catalog registration is handled internally by DQX.

from databricks.labs.dqx import anomaly

anomaly.train(
    df=historical_orders_df,
    columns=["amount", "quantity", "shipping_cost", "discount"],
    name="orders_anomaly_model",
    algorithm="isolation_forest"    # extensible to other algorithms later
)
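Internally, train() might look roughly like the following. This is a hypothetical sketch using scikit-learn's IsolationForest; the function name train_sketch is invented here, and the MLflow/Unity Catalog registration step is elided:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def train_sketch(df, columns, contamination="auto", random_state=42):
    """Hypothetical stand-in for anomaly.train(): fit an IsolationForest
    on the selected columns. The actual proposal would also register the
    fitted model in Unity Catalog via MLflow (omitted here)."""
    X = df[columns].to_numpy(dtype=float)
    model = IsolationForest(contamination=contamination,
                            random_state=random_state)
    model.fit(X)
    return model

# Demo on toy "historical normal" order data.
historical = pd.DataFrame({
    "amount":   [10.0, 11, 9, 10, 12, 10, 11, 9, 10, 11],
    "quantity": [1, 2, 1, 1, 2, 1, 2, 1, 1, 2],
})
model = train_sketch(historical, ["amount", "quantity"])
```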

2. Row-level Anomaly Detection Check

The check function will work like other DQX checks.

has_no_anomalies(
    columns=["amount", "quantity", "shipping_cost", "discount"],
    model="orders_anomaly_model",    # reference by name
    score_threshold=0.7              # flag records with anomaly score > 0.7
)
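At evaluation time the check would score each row against the trained model and compare against the threshold. A minimal sketch, assuming an IsolationForest model as above; mapping score_samples into a [0, 1] anomaly score is an assumption of this sketch, not DQX's actual scoring scheme:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def has_no_anomalies_sketch(model, rows, score_threshold=0.7):
    """Hypothetical per-row check. IsolationForest.score_samples is near
    0 for inliers and more negative for outliers; we flip and clip it so
    that higher values mean more anomalous."""
    raw = model.score_samples(np.asarray(rows, dtype=float))
    scores = np.clip(-raw, 0.0, 1.0)           # higher = more anomalous
    return scores, scores > score_threshold    # per-row score and flag

# Demo: fit on "normal" rows, then score a typical row and an extreme one.
model = IsolationForest(random_state=0).fit(
    [[10.0, 1], [11, 2], [9, 1], [10, 1], [12, 2],
     [10, 1], [11, 2], [9, 1], [10, 1], [11, 2]])
scores, flags = has_no_anomalies_sketch(model, [[10, 1], [500, 50]])
```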

YAML support:

- criticality: warn
  check:
    function: has_no_anomalies
    arguments:
      columns: [amount, quantity, shipping_cost, discount]
      model: orders_anomaly_model
      score_threshold: 0.7

Key Design Principles

  1. Simple APIs, powerful internals - Users call anomaly.train() and reference models by name. They don't need to know MLflow, Unity Catalog, or model versioning.

  2. Managed storage - Models saved to Unity Catalog Model Registry via MLflow. All handled internally by DQX.

  3. Works with existing DQX - has_no_anomalies is a check function like any other. Combine with is_not_null, has_no_outliers, etc.

  4. Row-level results - Each record gets flagged with an anomaly score, integrates with DQX quarantine/reporting.
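Principle 4 can be sketched as a simple split on the per-row score. This is a hypothetical illustration of the quarantine flow using pandas; quarantine_by_score is an invented helper, not a DQX API:

```python
import pandas as pd

def quarantine_by_score(df, scores, score_threshold=0.7):
    """Hypothetical quarantine integration: annotate each row with its
    anomaly score, then split rows above the threshold into quarantine."""
    annotated = df.assign(anomaly_score=list(scores))
    flagged = annotated["anomaly_score"] > score_threshold
    return annotated[~flagged], annotated[flagged]

orders = pd.DataFrame({"amount": [10.0, 12.0, 5000.0],
                       "quantity": [1, 2, 1]})
good, quarantined = quarantine_by_score(orders, [0.1, 0.2, 0.95])
```

The passing rows flow on unchanged, while quarantined rows keep their score for reporting.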

Additional Context

This complements #359's single-column statistical methods with multivariate ML-based detection. The key differentiator from Databricks native anomaly detection is row-level value anomalies vs table-level freshness monitoring.

Many popular DQ tools offer "Anomaly Detection" as a premium feature. Adding this to DQX would make it competitive with commercial offerings while maintaining the open-source, simple-API philosophy.
