Description
Is there an existing issue for this?
- I have searched the existing issues
Problem statement
When monitoring data quality at scale, users often need to detect "unusual" records that don't fit expected patterns, such as sensor malfunctions or data entry errors that are difficult to catch with simple threshold rules. Common scenarios include:
- Data drift - Distribution of values gradually shifts in ways that indicate something broke in the pipeline
- Data entry errors - Someone enters values in wrong units (meters vs millimeters), transposes digits, or copies garbage
- Integration mismatches - Data from System A looks completely different than usual after a silent schema change upstream
- Multivariate anomalies - Individual columns look normal, but the combination is unusual (e.g., age=25, years_experience=40)
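To make the multivariate case concrete, here is a small scikit-learn sketch (an illustration of the kind of model that could back this feature, not DQX code): both `age` and `years_experience` pass per-column range checks, but the combination is flagged because it never occurs in the training data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Train on synthetic "normal" records where experience tracks age.
rng = np.random.default_rng(0)
age = rng.integers(22, 65, size=1000)
experience = np.clip(age - 22 - rng.integers(0, 5, size=1000), 0, None)
X_train = np.column_stack([age, experience])

model = IsolationForest(random_state=0).fit(X_train)

# age=25 and years_experience=40 are each individually plausible,
# but impossible together; age=60 with years_experience=35 is typical.
scores = model.decision_function([[25, 40], [60, 35]])
# Lower scores mean more anomalous, so the first record scores lower.
```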
Currently, users must manually set static thresholds for each column. Issue #359 adds `has_no_outliers()` for single-column statistical outlier detection (MAD, Z-score, IQR), but this doesn't catch:
- Anomalies that only appear when looking at multiple columns together
- Distribution drift over time
- Patterns that are hard to define with simple statistical thresholds
Databricks native anomaly detection monitors table freshness and row counts, but doesn't catch unusual values within individual records.
Proposed Solution
ML-based Anomaly Detection with simple APIs and hidden complexity
1. Training an Anomaly Model
We can create a new `anomaly` module, since we need a way to "train" a model on historical "normal" data. MLflow/Unity Catalog registration is handled internally by DQX.
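To make "handled internally" concrete, here is a hedged sketch of what such a `train()` helper could do, using pandas/scikit-learn stand-ins; the function body and names are assumptions for illustration, not DQX's actual implementation:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def train(df: pd.DataFrame, columns: list, name: str,
          algorithm: str = "isolation_forest") -> IsolationForest:
    """Fit an anomaly model on historical 'normal' data (sketch only)."""
    if algorithm != "isolation_forest":
        raise ValueError(f"unsupported algorithm: {algorithm}")
    model = IsolationForest(random_state=0).fit(df[columns])
    # In DQX, this is where the fitted model would be logged with MLflow
    # and registered under `name` in the Unity Catalog Model Registry.
    return model
```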
```python
from databricks.labs.dqx import anomaly

anomaly.train(
    df=historical_orders_df,
    columns=["amount", "quantity", "shipping_cost", "discount"],
    name="orders_anomaly_model",
    algorithm="isolation_forest",  # extensible to other algorithms later
)
```
2. Row-level Anomaly Detection Check
The check function will work like other DQX checks.
```python
has_no_anomalies(
    columns=["amount", "quantity", "shipping_cost", "discount"],
    model="orders_anomaly_model",  # reference by name
    score_threshold=0.7,           # flag records with anomaly score > 0.7
)
```
YAML support:
```yaml
- criticality: warn
  check:
    function: has_no_anomalies
    arguments:
      columns: [amount, quantity, shipping_cost, discount]
      model: orders_anomaly_model
      score_threshold: 0.7
```
Key Design Principles
- Simple APIs, powerful internals - Users call `anomaly.train()` and reference models by name. They don't need to know MLflow, Unity Catalog, or model versioning.
- Managed storage - Models are saved to the Unity Catalog Model Registry via MLflow, all handled internally by DQX.
- Works with existing DQX - `has_no_anomalies` is a check function like any other. Combine it with `is_not_null`, `has_no_outliers`, etc.
- Row-level results - Each record gets flagged with an anomaly score, integrating with DQX quarantine/reporting.
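The row-level flow could be sketched as below, again with pandas/scikit-learn stand-ins; the helper name and the 0-to-1 score normalization are assumptions for illustration (DQX's actual scoring may differ):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def score_rows(df: pd.DataFrame, columns, model,
               score_threshold=0.7) -> pd.DataFrame:
    """Attach an anomaly score and flag to each record (sketch only)."""
    # score_samples() returns values roughly in [-1, 0]; negate and clip
    # to get a 0..1 score where higher means more anomalous (assumed mapping).
    raw = model.score_samples(df[columns])
    anomaly_score = np.clip(-raw, 0.0, 1.0)
    return df.assign(
        anomaly_score=anomaly_score,
        _anomaly_flag=anomaly_score > score_threshold,  # feeds quarantine/reporting
    )
```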
Additional Context
This complements #359's single-column statistical methods with multivariate ML-based detection. The key differentiator from Databricks native anomaly detection is row-level value anomalies vs table-level freshness monitoring.
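Since the checks compose, a single checks file could mix #359's statistical check with the proposed ML check; the `has_no_outliers` argument names below are hypothetical (its exact signature is defined in #359):

```yaml
- criticality: error
  check:
    function: has_no_outliers        # single-column statistical check from #359
    arguments:
      column: amount                 # hypothetical argument names
      method: iqr
- criticality: warn
  check:
    function: has_no_anomalies       # multivariate ML check proposed here
    arguments:
      columns: [amount, quantity, shipping_cost, discount]
      model: orders_anomaly_model
      score_threshold: 0.7
```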
Many popular DQ tools offer anomaly detection as a premium feature. Adding it to DQX would make DQX competitive with commercial offerings while maintaining its open-source, simple-API philosophy.