Description
Is there an existing issue for this?
- I have searched the existing issues
Problem statement
When monitoring data quality at scale, users often need to detect "unusual" records that don't fit expected patterns, such as sensor malfunctions or data entry errors that are difficult to catch with simple threshold rules. Common scenarios include:
- Data drift - Distribution of values gradually shifts in ways that indicate something broke in the pipeline
- Data entry errors - Someone enters values in wrong units (meters vs millimeters), transposes digits, or copies garbage
- Integration mismatches - Data from System A looks completely different than usual after a silent schema change upstream
- Multivariate anomalies - Individual columns look normal, but the combination is unusual (e.g., age=25, years_experience=40)
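To make the multivariate case concrete, here is a small scikit-learn sketch (an illustration of the kind of model that could back this feature, not DQX code): both `age` and `years_experience` pass per-column range checks, but the combination is flagged because it never occurs in the training data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Train on synthetic "normal" records where experience tracks age.
rng = np.random.default_rng(0)
age = rng.integers(22, 65, size=1000)
experience = np.clip(age - 22 - rng.integers(0, 5, size=1000), 0, None)
X_train = np.column_stack([age, experience])

model = IsolationForest(random_state=0).fit(X_train)

# age=25 and years_experience=40 are each individually plausible,
# but impossible together; age=60 with years_experience=35 is typical.
scores = model.decision_function([[25, 40], [60, 35]])
# Lower scores mean more anomalous, so the first record scores lower.
```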
Currently, users must manually set static thresholds for each column. Issue #359 adds `has_no_outliers()` for single-column statistical outlier detection (MAD, Z-score, IQR), but this doesn't catch:
- Anomalies that only appear when looking at multiple columns together
- Distribution drift over time
- Patterns that are hard to define with simple statistical thresholds
Databricks native anomaly detection monitors table freshness and row counts, but doesn't catch unusual values within individual records.
Proposed Solution
ML-based Anomaly Detection with simple APIs and hidden complexity
1. Training an Anomaly Model
We can create a new `anomaly` module, since we need a way to "train" a model on historical "normal" data. MLflow/Unity Catalog registration is handled internally by DQX.
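To make "handled internally" concrete, here is a hedged sketch of what such a `train()` helper could do, using pandas/scikit-learn stand-ins; the function body and names are assumptions for illustration, not DQX's actual implementation:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def train(df: pd.DataFrame, columns: list, name: str,
          algorithm: str = "isolation_forest") -> IsolationForest:
    """Fit an anomaly model on historical 'normal' data (sketch only)."""
    if algorithm != "isolation_forest":
        raise ValueError(f"unsupported algorithm: {algorithm}")
    model = IsolationForest(random_state=0).fit(df[columns])
    # In DQX, this is where the fitted model would be logged with MLflow
    # and registered under `name` in the Unity Catalog Model Registry.
    return model
```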
```python
from databricks.labs.dqx import anomaly

anomaly.train(
    df=historical_orders_df,
    columns=["amount", "quantity", "shipping_cost", "discount"],
    name="orders_anomaly_model",
    algorithm="isolation_forest",  # extensible to other algorithms later
)
```
2. Row-level Anomaly Detection Check
The check function will work like other DQX checks.
```python
has_no_anomalies(
    columns=["amount", "quantity", "shipping_cost", "discount"],
    model="orders_anomaly_model",  # reference by name
    score_threshold=0.7,           # flag records with anomaly score > 0.7
)
```
YAML support:
```yaml
- criticality: warn
  check:
    function: has_no_anomalies
    arguments:
      columns: [amount, quantity, shipping_cost, discount]
      model: orders_anomaly_model
      score_threshold: 0.7
```
Key Design Principles
- Simple APIs, powerful internals - Users call `anomaly.train()` and reference models by name. They don't need to know MLflow, Unity Catalog, or model versioning.
- Managed storage - Models are saved to the Unity Catalog Model Registry via MLflow, all handled internally by DQX.
- Works with existing DQX - `has_no_anomalies` is a check function like any other. Combine it with `is_not_null`, `has_no_outliers`, etc.
- Row-level results - Each record gets flagged with an anomaly score, integrating with DQX quarantine/reporting.
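The row-level flow could be sketched as below, again with pandas/scikit-learn stand-ins; the helper name and the 0-to-1 score normalization are assumptions for illustration (DQX's actual scoring may differ):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def score_rows(df: pd.DataFrame, columns, model,
               score_threshold=0.7) -> pd.DataFrame:
    """Attach an anomaly score and flag to each record (sketch only)."""
    # score_samples() returns values roughly in [-1, 0]; negate and clip
    # to get a 0..1 score where higher means more anomalous (assumed mapping).
    raw = model.score_samples(df[columns])
    anomaly_score = np.clip(-raw, 0.0, 1.0)
    return df.assign(
        anomaly_score=anomaly_score,
        _anomaly_flag=anomaly_score > score_threshold,  # feeds quarantine/reporting
    )
```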
Additional Context
This complements #359's single-column statistical methods with multivariate ML-based detection. The key differentiator from Databricks native anomaly detection is row-level value anomalies vs table-level freshness monitoring.
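Since the checks compose, a single checks file could mix #359's statistical check with the proposed ML check; the `has_no_outliers` argument names below are hypothetical (its exact signature is defined in #359):

```yaml
- criticality: error
  check:
    function: has_no_outliers        # single-column statistical check from #359
    arguments:
      column: amount                 # hypothetical argument names
      method: iqr
- criticality: warn
  check:
    function: has_no_anomalies       # multivariate ML check proposed here
    arguments:
      columns: [amount, quantity, shipping_cost, discount]
      model: orders_anomaly_model
      score_threshold: 0.7
```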
Many popular DQ tools offer anomaly detection as a premium feature. Adding it to DQX would make DQX competitive with commercial offerings while maintaining its open-source, simple-API philosophy.