Radiogenomics: Integrative Analysis Pipeline

Overview

This repository contains an end-to-end pipeline to discover, visualize, and validate associations between quantitative imaging features (radiomics) and molecular profiles (genomics) across cancer types. It provides:

Radiomics feature processing and feature-level clustering to select compact “medoid” feature sets
Dimensionality reduction utilities for fast 2D/3D visualization (PCA, t-SNE, UMAP)
Radiogenomic correlation analysis (Spearman and Pearson) with ID harmonization
Survival modeling and enrichment analyses (legacy/optional; see project pivot below)

Research hypothesis

Quantitative imaging features capture phenotypes driven by underlying molecular programs. Therefore, radiomic features should correlate with genomic pathway signatures. Note: clinical prediction (treatment response/survival) was investigated initially but is currently de‑emphasized after weak associations were observed (see pivot).

Project history and pivot

Original aim: evaluate whether radiomic features can serve as surrogate predictive markers for cancer diagnosis and survival outcomes, and whether radiomics/genomics complement each other for clinical prediction.
Findings: across explored cohorts, associations between (i) radiomics and genomics, and (ii) either modality and clinical outcomes, were weak.
Pivot: narrow the scope to characterizing radiomic–genomic relationships only, with emphasis on pathway-level signatures (MSigDB Hallmarks), and drop clinical analyses from the main line of inquiry.
Feature extraction change: moved from hand-crafted PyRadiomics features to deep learning features extracted with FMCIB (Foundation Model for Cancer Imaging Biomarkers). Current analyses center on these deep features.

Current objectives

Identify robust, reproducible associations between deep learning FMCIB radiomic features and Hallmarks-of-cancer gene signatures.
Reduce radiomic feature redundancy via spectral clustering and medoid selection to improve interpretability.
Characterize cross-cohort consistency (e.g., NSCLC and CPTAC cohorts) of radiogenomic signals.
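The redundancy-reduction objective can be illustrated with a minimal sketch. It is not the code in spectral_cluster.py; it assumes an absolute Spearman correlation matrix as the affinity and defines the medoid as the cluster member with the highest total similarity to the rest of its cluster. The function name select_medoid_features and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import SpectralClustering


def select_medoid_features(features: pd.DataFrame, n_clusters: int, seed: int = 10) -> pd.DataFrame:
    """Cluster feature columns by similarity and keep one medoid column per cluster."""
    # Assumed affinity: absolute Spearman correlation between features, in [0, 1].
    sim = features.corr(method="spearman").abs().fillna(0.0).to_numpy(copy=True)
    np.fill_diagonal(sim, 1.0)

    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=seed
    ).fit_predict(sim)

    medoids = []
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        # Medoid: the member most similar, in total, to the other members of its cluster.
        within = sim[np.ix_(members, members)].sum(axis=1)
        medoids.append(features.columns[members[np.argmax(within)]])
    return features[medoids]
```

Choosing the number of clusters (for example with the Gap statistic mentioned under File formats) is left out of this sketch.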

Repository organization

Radiogenomics/
├── Data_Analysis/                           # Analysis notebooks, scripts and helper utilities
│   ├── Correlations/                        # Radiogenomic correlation & selection pipelines
│   │   ├── LASSO_corr.py                    # Per-signature LASSO -> test-set Spearman correlations + heatmaps
│   │   ├── PCA_correlation.py               # PCA -> correlation pipeline (PCA selection + correlations/heatmaps)
│   │   ├── lasso_per_signature.py           # helpers to run per-signature LASSO jobs
│   │   ├── radiogenomic_correlation.py      # Multi-task LASSO & correlation pipeline (residualization, variance filtering)
│   │   └── radiomics_self_correlation.R     # small R utilities for radiomics self-correlation
│   ├── Machine_Learning_Models/             # Linear/LASSO/Ridge/Elastic Net models (R implementations)
│   ├── Enrichment/                          # Pathway/gene set enrichment (R scripts)
│   ├── radiomics/                           # radiomics analysis helpers and feature-selection kernels
│   │   └── feature-selection/
│   │       ├── spectral_cluster.py
│   │       ├── hierarchical_medoid_clustering.py
│   │       ├── feature_PCA.py
│   │       ├── feature_vif_pca.py
│   │       ├── feature_volume_corr.py
│   │       └── feature_volume_sPCA.py
│   ├── misc_tasks.ipynb                     # utility notebook: heatmaps, PCA plotting, LASSO presence scans
│   └── README.md
├── Preprocessing/                           # Data harmonization & preprocessing scripts
│   ├── extract_tpm_by_symbols.py
│   ├── Data_sampler.R
│   └── archived/                            # older preprocessing scripts
├── Radiomics/                               # FMCIB feature extraction + run scripts and configs
│   ├── readii-fmcib/
│   │   ├── run_fmcib.py
│   │   └── config/                          # per-dataset FMCIB configs (TCIA, CPTAC, etc.)
│   ├── readii_2_roqc/
│   └── bash_scripts/                        # dataset-specific run scripts
├── Snakemake/                               # Workflow definitions and Snakefiles (Correlations, Enrichment, Clinical)
│   ├── Correlations/
│   ├── Enrichment/
│   └── clinical_associations/
├── Old_script_version/                      # legacy scripts kept for reference (includes older visualization kernels)
│   └── Visualization_scripts/
│       ├── UMAP_cancer.R
│       ├── clinical_correlation_viz.R
│       ├── correlative_heatmaps.R
│       └── signature_distribution.R
└── data/                                    # Raw and processed data (see data/procdata and data/rawdata)

If a folder has its own README, start there for details (e.g., Data_Analysis/Enrichment, Machine_Learning_Models).

Data modalities and conventions

Radiomics (current): deep learning features from FMCIB (columns prefixed with pred_), with SampleID formatted like R01-###_#### where the suffix identifies an instance/region.
Radiomics (historical): traditional PyRadiomics features were evaluated earlier but are not the current focus.
Genomics: gene signatures (e.g., MSigDB Hallmarks) with samples as rows indexed by base IDs (e.g., R01-###).
Clinical: outcomes and covariates for survival analyses.

Best-practice conventions:

Keep radiomics rows filtered to permutation == original and region == full (where applicable).
Use standardized medoid feature matrices for downstream correlation so that features share a common scale and redundancy is reduced.
Maintain consistent sample ID schemes across modalities; derive base IDs by stripping the trailing _#### from SampleID when aligning to genomics.
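The conventions above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the repository's harmonization code: the column names permutation and region are taken to match the filter described above, the helper name align_radiomics_to_genomics is hypothetical, and keeping the first instance per base ID is an arbitrary choice made for the example.

```python
import pandas as pd


def align_radiomics_to_genomics(radiomics: pd.DataFrame, signatures: pd.DataFrame):
    """Return (pred_* features, signatures) restricted to shared base IDs, in the same row order."""
    rad = radiomics
    # Keep only unperturbed, full-region rows when those columns are present.
    for column, keep in (("permutation", "original"), ("region", "full")):
        if column in rad.columns:
            rad = rad[rad[column] == keep]

    # Base ID: SampleID without the trailing _#### suffix (R01-001_0001 -> R01-001).
    rad = rad.assign(BaseID=rad["SampleID"].str.replace(r"_\d+$", "", regex=True))

    # Assumption for this sketch: if several instances share a base ID, keep the first.
    rad = rad.drop_duplicates("BaseID").set_index("BaseID")

    shared = rad.index.intersection(signatures.index)
    feature_cols = [c for c in rad.columns if c.startswith("pred_")]
    return rad.loc[shared, feature_cols], signatures.loc[shared]
```

Returning both tables indexed by the same base IDs means downstream correlation code can rely on row order without re-merging.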

Installation

Any Python 3.9+ environment and R ≥ 4.0 should work. Typical R and Python packages are listed below.

R packages (typical)

Bioconductor core: DESeq2, GSVA, biomaRt, org.Hs.eg.db
Stats/ML: survival, survminer, glmnet, caret
Viz: ggplot2, pheatmap, plotly, heatmaply

Python packages (typical)

numpy, pandas, scipy
scikit-learn
matplotlib, seaborn
umap-learn (only for UMAP plots)
snakemake (if running workflows)
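To check which of the Python packages above are present and to record their versions for a run log, a small standard-library sketch is shown below. The function name installed_versions is hypothetical, and the distribution names are assumed to be the usual PyPI names for the packages listed.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the packages listed above.
PYTHON_PACKAGES = [
    "numpy", "pandas", "scipy", "scikit-learn",
    "matplotlib", "seaborn", "umap-learn", "snakemake",
]


def installed_versions(packages=PYTHON_PACKAGES):
    """Map each package name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```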

Datasets and signatures

Datasets: CPTAC-CCRCC, CPTAC-PDA, CPTAC-HNSCC, TCGA-KIRC, NSCLC-Radiogenomics, OCTANE
Gene signatures: MSigDB Hallmarks (CSV matrices with samples as rows and signatures as columns). Example files live under data/procdata/gene_signatures/.

Data syncing and curation are environment-specific. Keep signature matrices and radiomics matrices versioned and documented.

Best practices

Reproducibility: set seeds (e.g., --seed 10) and record package versions.
ID hygiene: consistently maintain R01-###_#### for radiomics and base IDs R01-### for genomics.
Scaling: visualization utilities standardize features; avoid re-scaling inputs twice.
Missing data: numeric coercion + median imputation is applied in visualization and clustering steps; for correlations, pairwise NaNs are skipped.
Plots at scale: t-SNE perplexity must be less than the number of observations; the script will auto-adjust when needed.
Storage: keep derived matrices (standardized, medoid) alongside cluster assignments to ease reproducibility.
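The missing-data and perplexity points above can be sketched as follows. The repository's visualization script may adjust perplexity differently; capping it at one less than the number of observations is an assumption made here, and the helper names prepare_matrix, safe_perplexity and embed_tsne are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def prepare_matrix(df: pd.DataFrame) -> np.ndarray:
    """Coerce to numeric, median-impute per column, then standardize once."""
    numeric = df.apply(pd.to_numeric, errors="coerce")
    numeric = numeric.fillna(numeric.median())
    # Columns that were entirely missing have no median and are dropped.
    numeric = numeric.dropna(axis=1, how="any")
    return StandardScaler().fit_transform(numeric)


def safe_perplexity(requested: float, n_obs: int) -> float:
    """Cap perplexity so it stays below the number of observations (assumed rule)."""
    return min(requested, n_obs - 1)


def embed_tsne(df: pd.DataFrame, perplexity: float = 30, seed: int = 10) -> np.ndarray:
    """Return a 2D t-SNE embedding of the prepared matrix."""
    X = prepare_matrix(df)
    tsne = TSNE(
        n_components=2,
        perplexity=safe_perplexity(perplexity, X.shape[0]),
        init="pca",
        random_state=seed,
    )
    return tsne.fit_transform(X)
```

Because prepare_matrix already standardizes, its output should not be passed through another scaler before embedding.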

Legacy components

These parts of the repo reflect the original broader scope and are kept for reference:

Data_Analysis/CoxPH_models, Snakemake/clinical_associations: clinical outcome modeling and workflows.
Old_script_version/ and Preprocessing/archived/: earlier enrichment and clinical preprocessing utilities.
PyRadiomics-derived feature scripts/outputs (historical); current analyses prioritize FMCIB deep features.

File formats

Inputs

Genomics: CSV, samples as rows (index), signatures as columns
Radiomics: CSV, rows=samples with SampleID and pred_* columns
Clinical: CSV/TSV with outcomes and covariates

Outputs

Correlations: CSV matrices and PNG heatmaps
Clustering: cluster assignments, Gap-statistic tables, spectral/t-SNE embeddings, medoid matrices
Models: RData objects and figures
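As an illustration of how the input formats map to a correlation output, the sketch below computes a feature-by-signature Spearman matrix and skips missing values pair by pair. It assumes both tables are already indexed by unique base IDs; the function name spearman_matrix and the min_pairs threshold are hypothetical and not taken from the repository scripts.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def spearman_matrix(features: pd.DataFrame, signatures: pd.DataFrame, min_pairs: int = 3) -> pd.DataFrame:
    """Spearman rho for each (feature, signature) pair over samples where both are observed."""
    shared = features.index.intersection(signatures.index)
    X, Y = features.loc[shared], signatures.loc[shared]

    rho = pd.DataFrame(np.nan, index=X.columns, columns=Y.columns)
    for f in X.columns:
        for s in Y.columns:
            # Pairwise deletion: use only samples with both values present.
            ok = X[f].notna() & Y[s].notna()
            if ok.sum() >= min_pairs:
                value, _ = spearmanr(X.loc[ok, f], Y.loc[ok, s])
                rho.loc[f, s] = value
    return rho
```

The result can be written with rho.to_csv(...) and passed to a heatmap function such as seaborn.heatmap to produce the CSV and PNG outputs listed above.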

Contributing

1) Fork → 2) Branch → 3) Commit → 4) Push → 5) PR

Last updated: November 2025

bhklab/CT-Radiogenomics