Authors: James Bannon, Michael Tran, Matthew Boccalon, Sisira Kadambat Nair
Contact: [email protected]
Description: Pipeline to build the Harmonized Drug Dataset Version 1 (HDD_v1) as a MultiAssayExperiment that harmonizes drug measurements, annotations, and fingerprints across multiple sources.
- `data/results/HDD_v1.RDS`: the Harmonized Drug Dataset Version 1 as a MultiAssayExperiment.
- `data/results/HDD_v1_csvs/`: MAE-derived CSV exports for parity with the RDS output.
- `data/procdata/colData.csv`: compound metadata assembled from AnnotationDB, LINCS, JUMP-CP, and DeepChem.
- `data/procdata/experiments/`: assay matrices for BindingDB, bioassays, DeepChem tasks, and fingerprint features.
- `qc/hdd_quality_control.html`: quality control report (rendered from `qc/hdd_quality_control.Rmd`).
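For readers working outside R, the CSV exports can be loaded with nothing but the standard library. The following is a minimal sketch, not part of the pipeline: `load_hdd_csvs` is a hypothetical helper, and it assumes the compound metadata is exported as `colData.csv` alongside one CSV per assay. The actual file names in `data/results/HDD_v1_csvs/` may differ.

```python
import csv
from pathlib import Path


def load_hdd_csvs(export_dir):
    """Read every CSV in export_dir into a list of row dictionaries.

    Returns (col_data, assays), where assays maps each remaining
    file stem to its rows.
    """
    export_dir = Path(export_dir)
    tables = {}
    for csv_path in sorted(export_dir.glob("*.csv")):
        with csv_path.open(newline="") as handle:
            tables[csv_path.stem] = list(csv.DictReader(handle))
    # Assumption: the compound metadata file is named colData.csv.
    col_data = tables.pop("colData", [])
    return col_data, tables


if __name__ == "__main__":
    export_path = Path("data/results/HDD_v1_csvs")
    if export_path.is_dir():
        col_data, assays = load_hdd_csvs(export_path)
        print(f"{len(col_data)} compounds, {len(assays)} assay tables")
```

All values are returned as strings, so numeric assay columns need converting before analysis.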
Pixi is required to run this project. If you have not installed it yet, follow the instructions at https://pixi.sh/latest/.
- `pixi install`
- `pixi run snakemake -c 1`

Increase `-c` for more cores. The default Snakemake target builds `data/results/HDD_v1.RDS`.
`pixi run knit_qc`

Data sources, versions, and filtering rules are controlled in `config/pipeline.yaml`. Update this file to:
- Pin different dataset versions or URLs.
- Adjust BindingDB filtering logic (organisms, columns, assays).
- Change Morgan fingerprint radii and dimensions.
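To illustrate how several radii and dimensions might combine, here is a rough sketch that expands them into one parameter set per fingerprint variant. The key names `radii` and `dimensions` and the helper `fingerprint_grid` are hypothetical, and the values are placeholders. Check `config/pipeline.yaml` for the real schema and for whether the pipeline actually builds every combination.

```python
from itertools import product

# Hypothetical config fragment; the real keys and values in
# config/pipeline.yaml may differ.
fingerprint_config = {"radii": [2, 3], "dimensions": [1024, 2048]}


def fingerprint_grid(config):
    """Return one parameter set per (radius, dimension) combination."""
    return [
        {"radius": radius, "n_bits": n_bits}
        for radius, n_bits in product(config["radii"], config["dimensions"])
    ]


for params in fingerprint_grid(fingerprint_config):
    print(params)
```

Under this assumption, adding a radius or a dimension multiplies the number of fingerprint variants, which is worth keeping in mind for run time and output size.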
- `config/`: pipeline configuration.
- `workflow/`: Snakemake rules and scripts for data preparation and assembly.
- `data/rawdata/`: raw downloads (not tracked in Git).
- `data/procdata/`: processed intermediate datasets (not tracked in Git).
- `data/results/`: final HDD_v1 output (not tracked in Git).
- `data/results/HDD_v1_csvs/`: MAE-derived CSV exports (colData plus one CSV per assay).
- `qc/`: QC notebook and rendered report.
- `docs/`: project documentation (this file, usage notes, data sources, dev notes).

- `docs/usage.md`: how to configure and run the pipeline.
- `docs/data_sources.md`: data source registry for HDD_v1 inputs.
- `docs/devnotes.md`: engineering notes and decisions.