This repository contains deployment configurations and automation scripts for the MODERATE Data Integrity and Validation Architecture. The deployment includes Apache Kafka for data streaming, Keycloak for identity and access management, Apache NiFi for data processing workflows, and supporting infrastructure components. The deployment is orchestrated using Docker Compose and Ansible, with configuration templates for environment-specific parameters.
Note
Upstream Git repositories that were previously cloned during deployment are now vendored into this repository (e.g., ansible-configurator and NiFi processors) to ensure reproducible deployments. No cloning occurs at deploy time; see "Updating Vendored Repositories" below for how to refresh them.
- Docker
- Docker Compose
- Ansible
- Python 3
- OpenSSL
- keytool
- envsubst
- Taskfile
The `validate-config` task validates the configuration before deployment: it checks that the `.env` file exists, that SSH key paths are valid, and that mandatory variables such as `MACHINE_URL` are set correctly.
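For reference, `.env` uses plain `KEY=value` dotenv syntax. An illustrative fragment with placeholder values (the authoritative variable list is `.env.default`; only names mentioned in this guide are shown):

```shell
# Illustrative .env fragment - replace every value with your own.
MACHINE_URL=example.tailscale.net
GENERIC_PSW=change-me
KEYCLOAK_USER=admin
KEYCLOAK_PASSWORD=change-me
CADDY_BASIC_AUTH_USER=reporter
CADDY_BASIC_AUTH_PASSWORD=change-me
```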
```shell
cp .env.default .env
# Edit .env - update MACHINE_URL, passwords, and environment values
# Set Caddy proxy basic auth: CADDY_BASIC_AUTH_USER and
# CADDY_BASIC_AUTH_PASSWORD (hash computed automatically by the
# start-caddy task for Caddy basic auth on kafka-rest.<MACHINE_URL> and reporter.<MACHINE_URL>)
task validate-config
```

All code and Docker images required for deployment are already available (vendored code in this repository, public Docker images from Docker Hub and Quay.io). Optionally, verify that the required tools are installed:
```shell
task check-dependencies
```

This section handles SSL certificate management for secure communication across services. Caddy automatically obtains and renews Let's Encrypt certificates for the root domain and the subdomains defined in `caddy/Caddyfile` (e.g., `keycloak.$MACHINE_URL`, `grafana.$MACHINE_URL`, `reporter.$MACHINE_URL`, `nifi.$MACHINE_URL`, `kafka.$MACHINE_URL`). The certificates must then be copied and converted to Java keystore formats for Kafka, NiFi, and other Java-based services.
For the initial deployment, run these tasks to set up certificates:
```shell
task copy-caddy-certificates
task setup-letsencrypt-truststore
task convert-letsencrypt-to-java-stores
```

This will:
- Start Caddy server and obtain Let's Encrypt certificates
- Copy certificates from Caddy's data directory to expected locations
- Download Let's Encrypt root certificates and create a Java truststore
- Convert PEM certificates to Java keystore formats (JKS and PKCS12) - see the sketch below
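For orientation, the conversion step boils down to standard OpenSSL/keytool invocations along these lines (a minimal sketch only; the real `convert-letsencrypt-to-java-stores` task derives the actual paths, aliases, and passwords from your environment):

```shell
# Illustrative commands; file names, the alias, and the "changeit" password are placeholders.
# Bundle the Let's Encrypt certificate chain and private key into a PKCS12 keystore.
openssl pkcs12 -export \
  -in fullchain.pem -inkey privkey.pem \
  -name server -out keystore.p12 -passout pass:changeit

# Convert the PKCS12 keystore to JKS for Java services that expect that format.
keytool -importkeystore \
  -srckeystore keystore.p12 -srcstoretype PKCS12 -srcstorepass changeit \
  -destkeystore keystore.jks -deststoretype JKS -deststorepass changeit

# Build a truststore containing the Let's Encrypt root certificate (ISRG Root X1).
keytool -importcert -noprompt -alias isrgrootx1 \
  -file isrg-root-x1.pem -keystore truststore.jks -storepass changeit
```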
While Caddy automatically renews Let's Encrypt certificates (typically every 60-90 days), the copied certificates and Java keystores must be updated when renewals occur. The update-certificates task runs both copy-caddy-certificates and convert-letsencrypt-to-java-stores, which are idempotent and will only update if Caddy's certificates are newer:
```shell
task update-certificates
```

Deploy and configure Keycloak before the rest of the stack so that OAuth client credentials are available to the other services.
```shell
task start-keycloak
```

Access Keycloak at https://keycloak.<MACHINE_URL> (Caddy terminates TLS and reverse-proxies to Keycloak). Log in with `KEYCLOAK_USER` / `KEYCLOAK_PASSWORD` from `.env`, create your realm, and configure OAuth clients for Kafka, NiFi, and Grafana.
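As a quick sanity check (assuming a recent Keycloak release; older versions prefix the path with `/auth`), the realm's OpenID Connect discovery document should be reachable once the realm exists:

```shell
# <realm> is whatever realm name you created above.
curl -s https://keycloak.<MACHINE_URL>/realms/<realm>/.well-known/openid-configuration
```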
| Setting | NiFi | Kafka |
|---|---|---|
| Root URL | https://nifi.<MACHINE_URL> | https://kafka.<MACHINE_URL> |
| Home URL | https://nifi.<MACHINE_URL> | https://kafka.<MACHINE_URL> |
| Valid redirect URIs | https://nifi.<MACHINE_URL>* | https://kafka.<MACHINE_URL>* and http://kafka.<MACHINE_URL>* |
| Valid post-logout redirect URIs | + | + |
| Web origins | + | + |
Note
Grafana uses pre-configured usernames and passwords and does not seem to integrate with Keycloak for authentication.
Copy the client secrets to `.env`:

- `KAFKA_KEYCLOAK_SECRET`
- `NIFI_KEYCLOAK_SECRET`
- `GRAFANA_KEYCLOAK_SECRET`
Important
Make sure the client IDs in Keycloak match the values defined in `KAFKA_KEYCLOAK_ID`, `NIFI_KEYCLOAK_ID`, and `GRAFANA_KEYCLOAK_ID`.
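Illustratively, the resulting `.env` entries might look like the following (client IDs and secret values are placeholders; use the IDs you actually created in Keycloak):

```shell
# Placeholder values for illustration only.
KAFKA_KEYCLOAK_ID=kafka
KAFKA_KEYCLOAK_SECRET=<secret-from-keycloak>
NIFI_KEYCLOAK_ID=nifi
NIFI_KEYCLOAK_SECRET=<secret-from-keycloak>
GRAFANA_KEYCLOAK_ID=grafana
GRAFANA_KEYCLOAK_SECRET=<secret-from-keycloak>
```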
You can stop Keycloak later with:
```shell
task stop-keycloak
```

The `diva` task orchestrates the deployment of Kafka, NiFi, the Quality Reporter, and supporting components using Ansible playbooks. It verifies SSL certificates, processes configuration templates, and runs the playbooks. Keycloak is deployed separately and should be configured beforehand (see above).
```shell
task diva
```

Custom NiFi processors are vendored under `ansible-configurator/NiFi_Processors/vendored/` to keep deployments reproducible (no external clones at deploy time). They cover schema normalization, rule generation, data quality checks, and schema-registry validation so that datasets stay consistent as they flow through NiFi and Kafka.
- Suggested flow: Encapsulate → Build rules → Validate quality → Validate schema (see below for details and when to enable each step).
- Purpose: Wraps incoming payloads into a consistent envelope with metadata fields such as `sourceType`, `sourceID`, `infoType`, `dataType`, `dataItemID`, and `metricTypeID`.
- Properties: The six metadata fields (required); set literal values or use NiFi Expression Language to read FlowFile attributes.
- Inputs: FlowFile content is passed through untouched; attributes are optionally used to fill properties.
- Outputs: Same content, with the metadata injected into the JSON envelope.
- Configure: Set each property to an attribute expression (e.g., `${source.id}`) or a static string so every record leaves with a full envelope before validation (see the sketch below).
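As a rough illustration only (the exact envelope layout is defined by the vendored processor; the field values and the `metricValue` payload below are invented), an enveloped record could look something like:

```json
{
  "sourceType": "building-sensor",
  "sourceID": "site-42",
  "infoType": "measurement",
  "dataType": "energy",
  "dataItemID": "meter-7",
  "metricTypeID": "kwh-15min",
  "metricValue": { "timestamp": "2024-01-01T00:00:00Z", "value": 12.5 }
}
```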
- Purpose: Samples CSV/JSON content (auto-detected or forced) to infer lightweight DQA rules (exists, datatype, numeric domain, categorical values, optional regex).
- Properties: `Sample Size`, `Max Categories`, `Regex Derivation` (boolean), `Dataset ID Attribute`, `Fingerprint Attribute`, `Format` (AUTO/CSV/JSON).
- Inputs: FlowFile content (CSV/JSON) and attributes holding the dataset ID/fingerprint (if present).
- Outputs: Attributes `dqa.rules` (YAML), `dqa.version` (fingerprint), dataset ID, fingerprint, `dqa.format`; relationships `success`/`failure`.
- Configure (dataset ID): Defaults to the attribute `dataset.id`; if it is missing, the processor uses `"default-dataset"`. Set this attribute upstream (e.g., with UpdateAttribute) so caching and rule grouping are stable.
- Configure (fingerprint): Defaults to the attribute `dataset.fingerprint`; if it is missing, the processor computes one from the content: JSON → hashes the top-level keys (first element for arrays); CSV → hashes the header line; otherwise it hashes a prefix of the content. Reusing a provided fingerprint keeps cache hits consistent across JVMs.
- Configure (general): Leave `Format` as `AUTO` unless you want to force CSV/JSON, tune `Sample Size`/`Max Categories`/`Regex Derivation` to control the inferred rules, and route `failure` to a DLQ or alerting (an illustrative rules document follows below).
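Purely to convey the kinds of rules that get inferred (the actual key names and structure of `dqa.rules` are defined by the vendored Rule Builder and may differ), a generated rules document could look roughly like:

```yaml
# Hypothetical structure - shown only to illustrate the rule categories
# (exists, datatype, numeric domain, categorical values, optional regex).
rules:
  - feature: temperature        # JMESPath to the field being checked
    exists: true
    datatype: number
    domain: { min: -30, max: 60 }
  - feature: status
    exists: true
    datatype: string
    categories: [OK, WARN, FAIL]
    regex: "^[A-Z]+$"
```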
- Purpose: Applies YAML-defined validation rules (domain, datatype, categorical, string length, missing, regex) using JMESPath feature paths.
- Properties: `Validator ID`, `Validation Rules` (YAML string; typically `${dqa.rules}`).
- Inputs: FlowFile content as a JSON sample; the `Validation Rules` property or attribute.
- Outputs: FlowFile content replaced with the validation result JSON; relationships `valid`/`invalid`/`failure`.
- Configure: Set `Validation Rules` to `${dqa.rules}` (the Rule Builder output) or a static YAML document; set `Validator ID` for traceability; route `invalid` separately from `failure`.
- Purpose: Checks incoming JSON against schemas from a Kafka Schema Registry; can learn new schemas (unless `Strict Check` is true) once they have been seen a minimum number of times.
- Properties: `Validator ID`, `Kafka URI`, `Kafka_topic`, optional `Kafka schema ids`, `Minimum Threshold`, `Strict Check`, `Messages History`.
- Inputs: FlowFile content JSON with a `metricValue` payload; properties may use Expression Language to pull the topic/IDs from attributes.
- Outputs: FlowFile content replaced with the validation result JSON; relationships `valid`/`invalid`/`failure`; may register schemas when allowed.
- Configure: Set `Kafka URI` to the registry endpoint (e.g., `http://kafka.${MACHINE_URL}:8081`), pre-seed known schemas via `Kafka schema ids`, enable `Strict Check=true` to block unknown schemas, and tune `Minimum Threshold`/`Messages History` to balance learning speed against noise.
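Assuming the registry exposes a standard Confluent-compatible REST API at the address used for `Kafka URI` (an assumption; adjust the host and port to your deployment), you can inspect what it already knows with:

```shell
# List registered subjects and fetch a schema by ID.
curl -s http://kafka.example.tailscale.net:8081/subjects
curl -s http://kafka.example.tailscale.net:8081/schemas/ids/1
```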
This deployment uses Ansible with Jinja2 templates to generate environment-specific configuration files. Templates contain placeholders that are substituted with actual values during deployment, ensuring consistent and reproducible configurations across all services.
The .env.default and .env files serve as the single source of truth for all configuration values. These environment variables flow through the system in the following sequence:
1. Environment Variables (`.env` file)
   - Copy `.env.default` to `.env` and customize the values for your environment
   - Contains all passwords, URLs, client IDs, and deployment-specific settings
   - Variables include `MACHINE_URL`, `GENERIC_PSW`, `KAFKA_USER`, `NIFI_KEYCLOAK_SECRET`, etc.
2. Taskfile Reads Environment (`Taskfile.yml`)
   - Taskfile automatically loads `.env` and `.env.default` using the `dotenv` directive
   - All Task commands have access to these environment variables
   - Tasks like `process-configuration-templates` use these variables
3. Generate Ansible Parameters (`Taskfile.yml`)
   - The `process-configuration-templates` task uses `envsubst` to substitute `${VARIABLE}` placeholders
   - Reads template files from `config/*.params.yml.tpl`
   - Generates `params.yml` files for each component in `ansible-configurator/`
   - Example: `${MACHINE_URL}` in a template becomes `example.tailscale.net` in the generated file (see the sketch below)
4. Ansible Loads Parameters (`ansible-configurator/*/ansible-plb.yml`)
   - Each Ansible playbook loads its `params.yml` using `vars_files`
   - Variables become available as `{{ general_vars.machine_url }}`, `{{ kafka_cred.kafka_user }}`, etc.
5. Template Substitution (Ansible `template` module)
   - Ansible processes Jinja2 templates (`.j2` files) with the loaded variables
   - Generates the final configuration files (Docker Compose, properties files, etc.)
   - Docker Compose uses these generated files to start containers
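For example, a template fragment and the file that `envsubst` generates from it might look roughly like the following (illustrative only; the real templates under `config/` define the actual keys, the file name shown is an assumption, and the generated values come from your `.env`):

```yaml
# config/kafka.params.yml.tpl (illustrative fragment)
general_vars:
  machine_url: "${MACHINE_URL}"
kafka_cred:
  kafka_user: "${KAFKA_USER}"

# ansible-configurator/Kafka/params.yml (generated, with MACHINE_URL=example.tailscale.net
# and a hypothetical KAFKA_USER value)
general_vars:
  machine_url: "example.tailscale.net"
kafka_cred:
  kafka_user: "kafka-admin"
```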
Visual Flow:
```mermaid
flowchart TD
A[".env file<br/>(source of truth)"] --> B["Taskfile.yml<br/>(dotenv loader)"]
B --> C["envsubst<br/>(substitutes ${VARS})"]
C --> D["config/*.params.yml.tpl"]
D --> E["ansible-configurator/*/params.yml<br/>(generated)"]
E --> F["Ansible playbooks<br/>(load params.yml via vars_files)"]
F --> G["Jinja2 templates<br/>(*.j2 files with {{ vars }})"]
G --> H["Generated configs<br/>(docker-compose.yml, nifi.properties, etc.)"]
H --> I["Docker Compose<br/>(deployment)"]
style A fill:#e1f5ff
style E fill:#fff4e1
style H fill:#e8f5e9
    style I fill:#f3e5f5
```
Configuration templates are organized by component:
```
moderate-diva-deployment/
├── kafka/templates/
│ ├── docker-compose.yml.j2 # Kafka Docker Compose configuration
│ ├── client.config.j2 # Kafka client configuration
│ └── kafka-ui-config.yml.j2 # Kafka UI configuration
├── nifi/templates/
│ ├── docker-compose.yml.j2 # NiFi Docker Compose configuration
│ └── nifi.properties.j2 # NiFi properties file
└── quality_reporter/templates/
    └── docker-compose.yml.j2 # Quality Reporter Docker Compose configuration
```
During deployment, Ansible playbooks use the ansible.builtin.template module to process Jinja2 templates:
- Load Variables: Ansible reads configuration from `params.yml` files (generated from `.env` values)
- Process Templates: The template module reads `.j2` files and substitutes all `{{ variable_name }}` placeholders (sketched below)
- Generate Configs: Processed files are written to the component directories (without the `.j2` extension)
- Deploy Services: Docker Compose uses the generated configuration files to start containers
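A minimal sketch of what such a playbook step can look like (the play structure and paths here are illustrative; the real playbooks are the `ansible-plb.yml` files under `ansible-configurator/`):

```yaml
# Illustrative only - renders one Jinja2 template using variables from params.yml.
- hosts: localhost
  vars_files:
    - params.yml
  tasks:
    - name: Render the Kafka Docker Compose file from its Jinja2 template
      ansible.builtin.template:
        src: ../../kafka/templates/docker-compose.yml.j2   # assumed relative path
        dest: ../../kafka/docker-compose.yml
```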
Variables are organized hierarchically in params.yml files:
- General Variables (`ansible-configurator/params.yml`): Shared across all components (machine URL, project name, Keycloak URL, passwords)
- Component Variables (`ansible-configurator/{Kafka,NiFi,Quality_Reporter}/params.yml`): Component-specific settings (usernames, client IDs, secrets)
These params.yml files are generated from your .env file during deployment and are not version-controlled (only the .tpl templates are tracked in Git).