HDD_v1 Data Pipeline

Authors: James Bannon, Michael Tran, Matthew Boccalon, Sisira Kadambat Nair

Description: Pipeline to build the Harmonized Drug Dataset Version 1 (HDD_v1) as a MultiAssayExperiment that harmonizes drug measurements, annotations, and fingerprints across multiple sources.

What the pipeline produces

data/results/HDD_v1.RDS: the Harmonized Drug Dataset Version 1 as a MultiAssayExperiment.
data/results/HDD_v1_csvs/: CSV exports derived from the MultiAssayExperiment (colData plus one CSV per assay), mirroring the RDS output.
data/procdata/colData.csv: compound metadata assembled from AnnotationDB, LINCS, JUMP-CP, and DeepChem.
data/procdata/experiments/: assay matrices for BindingDB, bioassays, DeepChem tasks, and fingerprint features.
qc/hdd_quality_control.html: quality control report (rendered from qc/hdd_quality_control.Rmd).
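
As a quick way to inspect the CSV exports, the sketch below lists each CSV file in an export directory with its row and column counts. It assumes the files use a .csv extension and start with a header row, neither of which is confirmed here; summarize_csv_exports is a hypothetical helper, not part of the pipeline.

```python
import csv
from pathlib import Path


def summarize_csv_exports(export_dir):
    """Return {file name: (data rows, columns)} for each CSV in export_dir.

    Assumes every file starts with a header row (an assumption, not
    something the pipeline documentation confirms).
    """
    summary = {}
    for path in sorted(Path(export_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])  # empty file -> zero columns
            n_rows = sum(1 for _ in reader)  # remaining lines are data rows
        summary[path.name] = (n_rows, len(header))
    return summary


if __name__ == "__main__":
    exports = summarize_csv_exports("data/results/HDD_v1_csvs")
    for name, (rows, cols) in exports.items():
        print(f"{name}: {rows} rows x {cols} columns")
```

Run it from the project root after the pipeline has finished; an empty or missing directory simply yields no output.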

Quickstart

Prerequisites

Pixi is required to run this project. If you have not installed it yet, follow the instructions at https://pixi.sh/latest/.

Install dependencies

pixi install

Run the pipeline

pixi run snakemake -c 1

The -c flag sets the number of cores Snakemake may use; increase it to run independent jobs in parallel. The default Snakemake target builds data/results/HDD_v1.RDS.
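
After a run, one way to confirm the build is to check that the documented outputs exist. The sketch below is a hypothetical helper (missing_outputs is not part of the pipeline). The paths are taken from the output list in this README, and it is an assumption that a single run writes all of them; the QC report is left out because it is generated by a separate command.

```python
from pathlib import Path

# Output paths as listed in this README, relative to the project root.
# Assumption: one pipeline run is expected to produce all of them.
EXPECTED_OUTPUTS = [
    "data/results/HDD_v1.RDS",
    "data/results/HDD_v1_csvs",
    "data/procdata/colData.csv",
    "data/procdata/experiments",
]


def missing_outputs(project_root="."):
    """Return the expected output paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing outputs:", ", ".join(missing))
    else:
        print("All expected outputs are present.")
```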

Generate the QC report

pixi run knit_qc

This renders qc/hdd_quality_control.Rmd to qc/hdd_quality_control.html.

Configuration

Data sources, versions, and filtering rules are set in config/pipeline.yaml. Edit this file to:

Pin different dataset versions or URLs.
Adjust BindingDB filtering logic (organisms, columns, assays).
Change Morgan fingerprint radii and dimensions.

Repository layout

config/: pipeline configuration.
workflow/: Snakemake rules and scripts for data preparation and assembly.
data/rawdata/: raw downloads (not tracked in Git).
data/procdata/: processed intermediate datasets (not tracked in Git).
data/results/: final HDD_v1 output (not tracked in Git).
data/results/HDD_v1_csvs/: MAE-derived CSV exports (colData plus one CSV per assay).
qc/: QC notebook and rendered report.
docs/: project documentation (this file, usage notes, data sources, dev notes).

Additional documentation

docs/usage.md: how to configure and run the pipeline.
docs/data_sources.md: data source registry for HDD_v1 inputs.
docs/devnotes.md: engineering notes and decisions.

Repository: bhklab/hdd-data-pipeline
