Aqueductus

A powerful Python framework for validating data quality across different data sources through SQL queries and customizable assertions. Perfect for data engineers and analysts who need to ensure data consistency and quality.

✨ Key Features

🔌 Multiple Data Sources
- Amazon Athena
- MySQL
- PostgresSQL
- SQLite
- Extensible architecture for adding more providers
🔍 Rich Test Types
- Row presence validation
- Negative testing (absence of rows)
- Column completeness checks
- Value distribution analysis
- Row count validation
- Column existence verification
- Pattern matching with regex support
- Comparative operators (>, <, =)
🛠️ Advanced Configuration
- Environment and placeholder variables support
- Multiple data source configurations
- CSV file integration
- Cross-provider testing
📊 Flexible Reporting
- Console output
- JSON export
- JUnit XML (CI/CD friendly)
- Markdown reports

🚀 Quick Start

Installation

To install Aqueductus with SQLite support (default):

pip install aqueductus

To use it with MySQL, PostgreSQL, or Athena, install the corresponding extras:

pip install aqueductus[mysql,postgresql,athena]

Quick Start

Create a configuration file (e.g., config.yaml):

providers:
  - name: my_athena
    type: athena
    config:
      region: ${AWS_REGION}
      work_group: ${AWS_WORKGROUP}
      aws_access_key_id: ${AWS_ACCESS_KEY_ID}
      aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}

tests:
  - name: user_data_validation
    provider: my_athena
    query: >
      SELECT user_id, status
      FROM <<table>>
      WHERE created_date = <<expected_date>>

    # Verify specific rows exist
    contains_rows:
      source: inline
      rows:
        - column1: "value1"
          column2: "value2"
      ignore_columns:
        - timestamp

    # Verify column completeness
    column_ratio:
      - column: status
        value: "active"
        min_ratio: 0.95

Run the tests:

aqueductus config.yaml

🛠️ Using Placeholders

Placeholders allow you to dynamically replace certain parts of your queries or configuration files with values defined in your environment.

How It Works

Define Placeholders in Your YAML File: In your YAML file (or any file you want to use), you can use placeholders in the format <<placeholder>>. For example:
```
query: "SELECT * FROM <<table_name>> WHERE <<column_name>> = 'some_value';"
```
Define the Corresponding Values in environment.py: In the environment.py file, define a dictionary called PLACEHOLDER where you map each placeholder to its corresponding value. For example:
```
PLACEHOLDER = {
    "table_name": "users",
    "column_name": "id"
}
```
Aqueductus Automatically Replaces Placeholders: When Aqueductus processes the YAML file, it will automatically replace any <<placeholder>> with the value defined in the PLACEHOLDER dictionary from environment.py. In this case, the query will be replaced like so:
```
query: "SELECT * FROM users WHERE id = 'some_value';"
```

This allows you to easily reuse queries or configurations with different values based on your environment or specific use case.

📚 Test Types

1. Contains Rows

Verifies that specific rows exist in the query results:

contains_rows:
  source: inline
  rows:
    - column1: "value1"
      column2: "value2"
  ignore_columns:
    - timestamp

2. Not Contains Rows

Ensures specific rows do not exist:

not_contains_rows:
  rows:
    - column1: "invalid"
      column2: "invalid"

3. Column Ratio

Validates the ratio of values in a column:

column_ratio:
  - column: status
    value: "active"
    min_ratio: 0.95
    max_ratio: 1.0

4. Row Count

Verifies the exact number of rows:

row_count: 100

5. Columns Exist

Ensures required columns are present:

columns_exists:
  - column1
  - column2

🔄 Data Sources

CSV Integration

Load test data from CSV files:

contains_rows:
  source: csv
  path: tests/expected_data.csv

Cross-Provider Testing

Compare data across different providers:

contains_rows:
  source: provider
  provider: other_athena
  query: SELECT * FROM reference_table
  map: # Optional column mapping
    source_col: target_col

📝 Output Formats

The framework supports multiple output formats:

# Single format
aqueductus config.yaml --format json

# Multiple formats
aqueductus config.yaml --format console,json,junit

Available formats:

console: Human-readable console output
json: JSON file output
junit: JUnit XML for CI/CD integration
markdown: Markdown report

🛠️ Development

By default, Aqueductus searches for providers.py, testers.py, and reporters.py files in the root directory. Each file will automatically detect and load subclasses for the corresponding provider, test type, or reporter, making it easy to add new components without additional configuration.

Adding a New Provider

Reference to provider implementation:

class DataProvider(ABC):
    @abstractmethod
    def __init__(self, config: dict[str, Any]) -> None:
        pass

    @abstractmethod
    def execute_query(self, query: str) -> list[dict[str, Any]]:
        pass

Adding a New Test Type

Reference to test type implementation:

class DataTest(ABC):
    def __init__(
        self,
        query_results: list[dict[str, Any]],
        config: Any,
        providers: dict[str, DataProvider],
    ):
        self.query_results = query_results
        self.config = config
        self.providers = providers

    @abstractmethod
    def run(self) -> dict[str, Any]:
        pass

Adding a New Reporter

Reference to reporter implementation:

class Reporter(ABC):

    @abstractmethod
    def generate_report(self, test_results: list[dict[str, Any]]) -> None:
        pass

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.github/workflows		.github/workflows
aqueductus		aqueductus
tests		tests
.flake8		.flake8
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Aqueductus

✨ Key Features

🚀 Quick Start

Installation

Quick Start

🛠️ Using Placeholders

How It Works

📚 Test Types

1. Contains Rows

2. Not Contains Rows

3. Column Ratio

4. Row Count

5. Columns Exist

🔄 Data Sources

CSV Integration

Cross-Provider Testing

📝 Output Formats

🛠️ Development

Adding a New Provider

Adding a New Test Type

Adding a New Reporter

About

Uh oh!

Releases 2

Packages

Uh oh!

Languages

License

RemoYukoff/aqueductus

Folders and files

Latest commit

History

Repository files navigation

Aqueductus

✨ Key Features

🚀 Quick Start

Installation

Quick Start

🛠️ Using Placeholders

How It Works

📚 Test Types

1. Contains Rows

2. Not Contains Rows

3. Column Ratio

4. Row Count

5. Columns Exist

🔄 Data Sources

CSV Integration

Cross-Provider Testing

📝 Output Formats

🛠️ Development

Adding a New Provider

Adding a New Test Type

Adding a New Reporter

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Languages

Packages