bhklab/hdd-data-pipeline

HDD_v1 Data Pipeline

Authors: James Bannon, Michael Tran, Matthew Boccalon, Sisira Kadambat Nair

Contact: [email protected]

Description: Pipeline to build the Harmonized Drug Dataset Version 1 (HDD_v1) as a MultiAssayExperiment that harmonizes drug measurements, annotations, and fingerprints across multiple sources.


What the pipeline produces

  • data/results/HDD_v1.RDS: the Harmonized Drug Dataset Version 1 as a MultiAssayExperiment.
  • data/results/HDD_v1_csvs/: MAE-derived CSV exports for parity with the RDS output.
  • data/procdata/colData.csv: compound metadata assembled from AnnotationDB, LINCS, JUMP-CP, and DeepChem.
  • data/procdata/experiments/: assay matrices for BindingDB, bioassays, DeepChem tasks, and fingerprint features.
  • qc/hdd_quality_control.html: quality control report (rendered from qc/hdd_quality_control.Rmd).
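
As a quick sanity check on the CSV exports, the following is a minimal sketch that reports the number of data rows and columns in each file under data/results/HDD_v1_csvs/. The function name summarize_csv_exports is hypothetical and not part of the pipeline, and the sketch assumes each CSV has a single header row.

```python
import csv
from pathlib import Path


def summarize_csv_exports(csv_dir):
    """Return {file name: (data rows, columns)} for every CSV in csv_dir.

    Hypothetical helper, not part of the pipeline. Assumes the first row
    of each file is a header.
    """
    summary = {}
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])        # empty file -> zero columns
            n_rows = sum(1 for _ in reader)  # remaining rows are data
        summary[path.name] = (n_rows, len(header))
    return summary
```

Called from the repository root after a successful run, summarize_csv_exports("data/results/HDD_v1_csvs") should return one entry per exported CSV.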

Quickstart

Prerequisites

Pixi is required to run this project. If you have not installed it yet, follow the instructions at https://pixi.sh/latest/.

Install dependencies

pixi install

Run the pipeline

pixi run snakemake -c 1

The -c flag sets the number of CPU cores Snakemake may use; increase it to run independent jobs in parallel. The default Snakemake target builds data/results/HDD_v1.RDS.
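
If you launch the pipeline from a script, the core count can be derived from the machine rather than hard-coded. The sketch below only wraps the command shown above; the function names snakemake_command and run_pipeline are hypothetical, and defaulting to all available cores is an assumption, not a project recommendation.

```python
import os
import subprocess


def snakemake_command(cores=None):
    """Build the `pixi run snakemake -c N` command as an argument list."""
    if cores is None:
        cores = os.cpu_count() or 1  # os.cpu_count() can return None
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return ["pixi", "run", "snakemake", "-c", str(cores)]


def run_pipeline(cores=None):
    """Run the pipeline from the repository root; raises if the command fails."""
    subprocess.run(snakemake_command(cores), check=True)
```

For example, run_pipeline(4) is equivalent to typing pixi run snakemake -c 4 in the repository root.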

Generate the QC report

pixi run knit_qc
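
Once the pipeline and the QC step have both finished, the outputs listed under "What the pipeline produces" should exist on disk. A minimal sketch for confirming that, assuming it is run from the repository root (the helper name missing_outputs is hypothetical; the paths are copied from that list):

```python
from pathlib import Path

# Paths copied from the "What the pipeline produces" list.
EXPECTED_OUTPUTS = [
    "data/results/HDD_v1.RDS",
    "data/results/HDD_v1_csvs",
    "data/procdata/colData.csv",
    "data/procdata/experiments",
    "qc/hdd_quality_control.html",
]


def missing_outputs(repo_root="."):
    """Return the expected outputs that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).exists()]
```

An empty list means every expected file and directory is present; any path returned points at a step that did not complete.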

Configuration

Data sources, versions, and filtering rules are controlled in config/pipeline.yaml. Update this file to:

  • Pin different dataset versions or URLs.
  • Adjust BindingDB filtering logic (organisms, columns, assays).
  • Change Morgan fingerprint radii and dimensions.

Repository layout

  • config/: pipeline configuration.
  • workflow/: Snakemake rules and scripts for data preparation and assembly.
  • data/rawdata/: raw downloads (not tracked in Git).
  • data/procdata/: processed intermediate datasets (not tracked in Git).
  • data/results/: final HDD_v1 output (not tracked in Git).
  • data/results/HDD_v1_csvs/: MAE-derived CSV exports (colData plus one CSV per assay).
  • qc/: QC notebook and rendered report.
  • docs/: project documentation (this file, usage notes, data sources, dev notes).

Additional documentation

  • docs/usage.md: how to configure and run the pipeline.
  • docs/data_sources.md: data source registry for HDD_v1 inputs.
  • docs/devnotes.md: engineering notes and decisions.
