A comprehensive ETL pipeline for converting PCORnet/PICORI CDM data to OMOP CDM v5.4.2 format, compatible with OHDSI tools and analyses.
This project provides a complete solution for transforming PCORnet/PICORI Common Data Model (CDM) datasets stored as Parquet files into a fully standards-compliant OMOP CDM v5.4.2 instance. The ETL process is designed to be compatible with OHDSI tools including Achilles, Data Quality Dashboard (DQD), Patient-Level Prediction, and CohortMethod.
- Complete ETL Pipeline: Transforms all major PCORnet tables to OMOP format
- Standards Compliant: Follows OMOP CDM v5.4.2 specifications exactly
- OHDSI Compatible: Works with all major OHDSI tools and packages
- Comprehensive Validation: Includes DQD, Achilles, and custom validation checks
- Scalable Processing: Uses PySpark for efficient large-scale data processing
- Quality Assurance: Built-in data quality checks and validation framework
- Documentation: Complete mapping specifications and data dictionary
- ETL Engine: PySpark for distributed data processing
- Target Database: PostgreSQL with OMOP CDM v5.4.2 schema
- Vocabularies: Athena standardized vocabularies (SNOMED, RxNorm, LOINC, etc.)
- Validation: OHDSI Data Quality Dashboard and Achilles
- Configuration: YAML-based configuration with environment variable support
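At its core, each loader reads Parquet with PySpark and writes into the PostgreSQL schemas over JDBC. The sketch below is illustrative only: the paths, staging table name, and driver version are placeholders, not the project's actual loader code.

```python
import os
from pyspark.sql import SparkSession

# Illustrative sketch: read a PCORnet Parquet table and append it to a staging table over JDBC.
spark = (
    SparkSession.builder
    .appName("picori2omop-sketch")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")  # assumed driver version
    .getOrCreate()
)

encounters = spark.read.parquet("/home/asadr/datasets/stroke_data/ENCOUNTER")  # hypothetical subdirectory

(
    encounters.write.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/omop")
    .option("dbtable", "staging.encounter_raw")              # hypothetical staging table
    .option("user", "postgres")
    .option("password", os.environ["OMOP_DB_PASSWORD"])
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save()
)
```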
- Python 3.10+
- Java 8 or 11
- Apache Spark 3.4+
- PostgreSQL 14+
- R 4.3+ (for validation tools)
- Clone the repository:

```bash
git clone <repository-url>
cd PICORI2OMOP
```

- Install Python dependencies:

```bash
pip install -r requirements.txt
```

- Set up environment variables (a sketch of how the ID salt can be used appears after this list):

```bash
export OMOP_DB_PASSWORD="your_postgres_password"
export OMOP_ID_SALT="your_strong_random_salt"
```

- Create the PostgreSQL database:

```bash
createdb omop
```

- Run the bootstrap script:

```bash
./etl/scripts/bootstrap.sh
```

- Download and load the OMOP CDM DDLs from OHDSI CommonDataModel
- Download and load the Athena vocabularies from OHDSI Athena
- Place your PCORnet Parquet files in `~/datasets/stroke_data/`
- Run the complete ETL process:

```bash
./etl/scripts/run_etl.sh
```

- Run validation checks:

```bash
./etl/scripts/run_validation.sh
```
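The `OMOP_ID_SALT` value is used when deriving stable surrogate identifiers. The project's actual scheme lives in `etl/spark/common/ids.py`; the snippet below is only a minimal illustration of salted, deterministic ID hashing, with an assumed function name and an assumed truncation to the positive BIGINT range.

```python
import hashlib
import os

def deterministic_id(source_value: str, salt: str | None = None) -> int:
    """Illustrative only: derive a stable 63-bit integer ID from a source key and a salt."""
    salt = salt or os.environ["OMOP_ID_SALT"]
    digest = hashlib.sha256(f"{salt}:{source_value}".encode("utf-8")).digest()
    # Keep the result positive and within BIGINT range for OMOP integer ID columns.
    return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF

# Example: map a PCORnet PATID to a person_id candidate.
person_id = deterministic_id("PATID|000123")
```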
The repository is laid out as follows:

```
PICORI2OMOP/
├── plan.md                       # Comprehensive ETL plan
├── README.md                     # This file
├── requirements.txt              # Python dependencies
├── etl/
│   ├── config/
│   │   ├── etl_config.yml        # ETL configuration
│   │   └── secrets.example.yml   # Example secrets
│   ├── spark/
│   │   ├── common/               # Common utilities
│   │   │   ├── io_utils.py
│   │   │   ├── mapping_utils.py
│   │   │   ├── ids.py
│   │   │   └── validation.py
│   │   ├── load_person.py        # Person data loader
│   │   ├── load_visits.py        # Visit data loader
│   │   ├── load_condition.py     # Condition data loader
│   │   └── ...                   # Other domain loaders
│   ├── mappings/                 # Mapping files
│   │   ├── encounter_type.csv
│   │   ├── dx_type.csv
│   │   ├── units.csv
│   │   └── drug_type.csv
│   ├── sql/
│   │   ├── create_schemas.sql    # Schema creation
│   │   ├── vocab_load.sql        # Vocabulary loading
│   │   ├── eras/                 # Era building scripts
│   │   └── checks/               # Validation scripts
│   └── scripts/
│       ├── bootstrap.sh          # Database setup
│       ├── run_etl.sh            # ETL orchestration
│       └── run_validation.sh     # Validation orchestration
└── docs/
    ├── decisions_log.md          # ETL decisions log
    └── data_dictionary.md        # Data dictionary
```
The ETL process is configured via `etl/config/etl_config.yml`:

```yaml
source:
  parquet_root: "/home/asadr/datasets/stroke_data"

target:
  jdbc_url: "jdbc:postgresql://localhost:5432/omop"
  db_user: "postgres"
  db_password_env: "OMOP_DB_PASSWORD"
  cdm_schema: "cdm"
  staging_schema: "staging"
  results_schema: "results"

vocabulary:
  snapshot_date: "2025-09-30"
  enforce_standard_only: true

etl:
  spark_master: "local[*]"
  partitions: 8
  batch_size_rows: 50000
  write_mode: "append"
  timezone: "UTC"
```
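A loader can resolve this configuration at runtime. The sketch below assumes PyYAML and uses a hypothetical helper name; the project's actual logic presumably lives in `etl/spark/common/io_utils.py`.

```python
import os
import yaml

def load_config(path: str = "etl/config/etl_config.yml") -> dict:
    """Illustrative only: read the ETL config and resolve the DB password from its env var."""
    with open(path) as fh:
        config = yaml.safe_load(fh)
    password_env = config["target"]["db_password_env"]   # e.g. "OMOP_DB_PASSWORD"
    config["target"]["db_password"] = os.environ[password_env]
    return config

config = load_config()
jdbc_url = config["target"]["jdbc_url"]   # "jdbc:postgresql://localhost:5432/omop"
```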
The ETL process maps PCORnet tables to OMOP tables:

| PCORnet table | OMOP table |
|---------------|------------|
| DEMOGRAPHIC | person + observation_period |
| ENCOUNTER | visit_occurrence |
| DIAGNOSIS | condition_occurrence |
| PROCEDURES | procedure_occurrence |
| PRESCRIBING / DISPENSING | drug_exposure |
| LAB_RESULT_CM | measurement |
| VITAL | measurement |
| OBS_CLIN / OBS_GEN | observation |
| IMMUNIZATION | drug_exposure |
| DEATH | death |
| ENROLLMENT | observation_period |
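As an illustration of the kind of transform a domain loader such as `load_person.py` performs, the sketch below assumes standard PCORnet DEMOGRAPHIC column names (PATID, BIRTH_DATE, SEX) and hard-codes a small gender-concept lookup; it is not the project's actual loader.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("load_person_sketch").getOrCreate()

# Hypothetical input path; the real root comes from etl_config.yml.
demographic = spark.read.parquet("/home/asadr/datasets/stroke_data/DEMOGRAPHIC")

# Assumed PCORnet SEX codes mapped to OMOP standard gender concepts.
gender_concept = (
    F.when(F.col("SEX") == "F", F.lit(8532))   # FEMALE
     .when(F.col("SEX") == "M", F.lit(8507))   # MALE
     .otherwise(F.lit(0))                      # unmapped
)

person = demographic.select(
    F.col("PATID").alias("person_source_value"),
    gender_concept.alias("gender_concept_id"),
    F.year("BIRTH_DATE").alias("year_of_birth"),
    F.month("BIRTH_DATE").alias("month_of_birth"),
    F.col("SEX").alias("gender_source_value"),
)
```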
The ETL process includes comprehensive validation:
- Integrity Checks: Primary key uniqueness, foreign key integrity
- Constraint Validation: Not-null constraints, data type validation
- Data Quality Dashboard: OHDSI DQD for comprehensive quality assessment
- Achilles: Data characterization and profiling
- Custom Validation: Row count reconciliation, concept mapping quality
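The project's integrity checks run as SQL under `etl/sql/checks/`. As an illustration of the kind of check involved, the hypothetical PySpark primary-key uniqueness check below reads a loaded table back over JDBC; it is not one of the project's actual scripts.

```python
import os
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pk_check_sketch").getOrCreate()

# Illustrative only: read back cdm.person and flag duplicate primary keys.
person = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/omop")
    .option("dbtable", "cdm.person")
    .option("user", "postgres")
    .option("password", os.environ["OMOP_DB_PASSWORD"])
    .load()
)

duplicate_ids = person.groupBy("person_id").count().filter(F.col("count") > 1)
if duplicate_ids.limit(1).count() > 0:
    raise ValueError("cdm.person.person_id is not unique")
```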
- Standards Compliance: Follows OMOP CDM v5.4.2 specifications exactly
- OHDSI Compatibility: Tested with OHDSI tools and packages
- Data Quality: Comprehensive validation and quality checks
- Documentation: Complete mapping specifications and decisions log
- Error Handling: Robust error handling and logging
- Database Connection: Ensure PostgreSQL is running and credentials are correct
- Memory Issues: Adjust Spark memory settings for large datasets
- Vocabulary Loading: Ensure Athena vocabularies are properly loaded
- Permission Issues: Check file permissions for Parquet data and scripts
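For the memory issues above, the relevant Spark settings can usually be raised when the session is built. The values below are placeholders to adapt to the host, not project defaults.

```python
from pyspark.sql import SparkSession

# Illustrative only: give the driver and executors more memory and tune shuffle parallelism.
spark = (
    SparkSession.builder
    .appName("picori2omop")
    .master("local[*]")
    .config("spark.driver.memory", "8g")
    .config("spark.executor.memory", "8g")
    .config("spark.sql.shuffle.partitions", "64")
    .getOrCreate()
)
```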
- ETL logs are written to the console
- Validation results are stored in the `results` schema
- Error logs include detailed error messages and stack traces
- Follow the existing code structure and patterns
- Update documentation for any changes
- Add tests for new functionality
- Update the decisions log for significant changes
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the documentation in the `docs/` directory
- Review the decisions log for known issues
- Check the validation results for data quality issues
- Create an issue in the repository