Merge pull request #1273 from cmlakhan/main

gaow · web-flow · commit 6bcdc815afdc · 2025-12-02T10:25:20.000-05:00
adding the updated text for Katie's python notebook
diff --git a/code/xqtl_modifier_score/ems_training.ipynb b/code/xqtl_modifier_score/ems_training.ipynb
@@ -7,8 +7,7 @@
    "source": [
     "# scEEMS Model Training Tutorial\n",
     "\n",
-    "This notebook implements the scEEMS (Single-Cell Enhanced Expression Modifier Scores) training methodology as outlined in the manuscript.  The scEEMS model predicts the probability of a genetic variant is a cell-specific eQTL based on fine-mapping data from single cell eQTL data and as a function of thousands of variant and gene annotations.\n",
-    "\n"
+    "This notebook implements the scEEMS (single-cell Enhanced Expression Modifier Scores) training methodology as outlined in the manuscript. scEEMS is a machine learning framework that predicts whether a genetic variant is a causal cell-type-specific eQTL as a function of 4,839 variant, gene, and variant-gene pair features. The models are trained on fine-mapped single-cell eQTL data from six brain cell types (astrocytes, excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, and oligodendrocyte precursor cells) from the ROSMAP cohort, with applications to identifying functional variants relevant to Alzheimer's disease.\n"
    ]
   },
   {
@@ -18,9 +17,32 @@
    "source": [
     "## Motivation\n",
     "\n",
-    "Traditional genetic studies only explain 5% of Alzheimer's disease heritability. The remaining 95% \"missing heritability\" occurs because bulk tissue approaches cannot capture regulatory effects that occur in specific cell types. Our machine learning approach identifies genetic variants that affect gene expression specifically in brain cell types like microglia, potentially explaining previously undetected disease mechanisms.\n",
+    "### The \"Missing Regulation\" Problem\n",
     "\n",
-    "- Our Motivation is to create a reproducible pipeline that accurately predicts functional genetic variants using cell-type specific genomic annotations, transforming complex research findings into a standardized tool for the broader scientific community."
+    "Most disease-associated GWAS variants lie in non-coding regions of the genome, where they likely modulate gene expression. However, bulk-tissue eQTL studies fail to explain the majority of these variants, a phenomenon termed \"missing regulation\" [(Connally et al., 2022)](https://elifesciences.org/articles/74970v1). This gap exists because the variants identified in eQTL studies differ systematically from those identified in disease GWAS [(Mostafavi et al., 2023)](https://www.nature.com/articles/s41588-023-01529-1):\n",
+    "\n",
+    "- **eQTLs** are enriched in promoter regions and affect genes under weaker selective constraint\n",
+    "- **GWAS variants** are enriched in distal enhancer regions and affect genes under stronger selective constraint\n",
+    "\n",
+    "Understanding how non-coding GWAS variants modulate gene expression is critical for uncovering disease mechanisms, but several challenges limit our ability to make these connections:\n",
+    "\n",
+    "1. **Cell-type specificity**: Bulk tissue approaches cannot capture regulatory effects that occur in specific cell types, particularly rare but disease-relevant populations like microglia in Alzheimer's disease\n",
+    "2. **Enhancer variants**: Disease-associated variants in distal enhancers often have weaker eQTL signals that fail to reach statistical significance, especially in underpowered single-cell studies\n",
+    "3. **Limited sample sizes**: Single-cell eQTL mapping has reduced statistical power compared to bulk studies, making it difficult to detect true regulatory signals in rare cell types\n",
+    "\n",
+    "### scEEMS Solution\n",
+    "\n",
+    "scEEMS addresses these challenges by predicting causal cell-type-specific eQTLs using machine learning trained on 4,839 genomic features, including:\n",
+    "- Deep learning-based variant effect predictions\n",
+    "- Cell-type-specific regulatory annotations  \n",
+    "- Activity-by-Contact (ABC) enhancer-gene linkages\n",
+    "- Distance and evolutionary constraint features\n",
+    "\n",
+    "By identifying functional variants in cell-type-specific contexts—particularly in enhancer regions—scEEMS aims to bridge the gap between non-coding GWAS variants and their target genes, improving our understanding of disease mechanisms in Alzheimer's disease.\n",
+    "\n",
+    "### Tutorial Objective\n",
+    "\n",
+    "This notebook provides a reproducible pipeline for training scEEMS models, demonstrating the complete methodology from data preparation through model evaluation. The goal is to enable the broader scientific community to apply this approach to their own cell-type-specific eQTL datasets and disease contexts."
    ]
   },
   {
@@ -30,21 +52,50 @@
    "source": [
     "## Methods Overview\n",
     "\n",
-    "### Feature-Weighted CatBoost Algorithm\n",
-    "We use a single **CatBoost** (Categorical Boosting) gradient boosting model optimized for genomic data. CatBoost builds an ensemble of decision trees sequentially, where each tree learns from previous errors, making it particularly effective for high-dimensional biological datasets with mixed data types.\n",
+    "### CatBoost Algorithm\n",
+    "We use [CatBoost](https://github.com/catboost/catboost), a gradient boosting framework that builds an ensemble of decision trees sequentially. CatBoost is effective for high-dimensional biological datasets with mixed data types.\n",
     "\n",
-    "**Our Training Strategy**: Single feature-weighted model (Model 5) that emphasizes biology-informed features while maintaining comprehensive genomic context.\n",
+    "### Model Training Strategy\n",
+    "\n",
+    "We train a CatBoost model that upweights deep learning-based variant effect prediction (DL-VEP) features 10-fold (feature weight = 10 for DL-VEP features vs. 1 for all other features). This configuration was selected as optimal based on the external validation and heritability analyses described in the manuscript.\n",
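The 10-to-1 feature weighting described above can be sketched as a per-feature weight vector. This is a minimal illustration, not the notebook's actual code: the helper `build_feature_weights` and the feature names are hypothetical. CatBoost does expose a `feature_weights` training parameter, but whether the authors use it for this weighting is an assumption.

```python
from typing import Iterable, List


def build_feature_weights(
    feature_names: List[str],
    dl_vep_features: Iterable[str],
    dl_weight: float = 10.0,
    default_weight: float = 1.0,
) -> List[float]:
    """Return one weight per feature, in the same order as feature_names."""
    dl_set = set(dl_vep_features)
    unknown = dl_set - set(feature_names)
    if unknown:
        raise ValueError(f"DL-VEP features missing from feature list: {sorted(unknown)}")
    return [dl_weight if name in dl_set else default_weight for name in feature_names]


# Hypothetical feature names, for illustration only.
feature_names = ["dl_vep_1", "distance_to_tss", "abc_score", "dl_vep_2"]
weights = build_feature_weights(feature_names, ["dl_vep_1", "dl_vep_2"])
print(weights)

# Assumption: a list like this could be passed to CatBoost, e.g.
# CatBoostClassifier(feature_weights=weights)
```

Keeping the weights in a list aligned with the feature order makes it easy to check that every DL-VEP feature was actually found before training starts.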
     "\n",
     "### Training Data Construction\n",
-    "**Data Source**: Chromosome 2 variants with non-overlapping splits\n",
-    "- **Training set**: 3,056 variants (80% of available data)\n",
-    "- **Testing set**: 761 variants (20% of available data)\n",
-    "- **Critical validation**: Zero variant overlap between training and testing sets\n",
-    "\n",
-    "**Label Definition (Y)**:\n",
-    "- Y = 1: Functional eQTL (PIP > 0.1) - variant significantly affects gene expression\n",
-    "- Y = 0: Non-functional variant (PIP < 0.01) - no detectable expression effect\n",
-    "- **Class distribution**: 9% positive rate reflects biological reality that most variants are non-functional"
+    "\n",
+    "**Data Source**: Fine-mapped single-cell eQTLs from six brain cell types (astrocytes, excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, and oligodendrocyte precursor cells) in the ROSMAP cohort.\n",
+    "\n",
+    "**Positive Class (Y=1)**:\n",
+    "- Variants with posterior inclusion probability (PIP) > 0.05 in a credible set whose maximum PIP exceeds 0.1, OR\n",
+    "- Variants with PIP > 0.5 regardless of credible set membership\n",
+    "\n",
+    "**Negative Class (Y=0)**: \n",
+    "- For each positive variant, we sample 10 negative variants from the same gene with PIP < 0.01, matched on variant type (SNP, insertion, deletion)\n",
+    "\n",
+    "**Test Set**:\n",
+    "- Positive variants: PIP > 0.90\n",
+    "- Negative variants: 10 matched variants per positive variant with PIP < 0.01\n",
+    "- Restricted to MEGA genes only\n",
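The labeling and negative-sampling rules above can be sketched in plain Python. This is an illustrative reading of the thresholds, not the notebook's implementation: the function names, the dictionary keys (`gene`, `variant_type`, `pip`, `cs_max_pip`), and the choice to keep all available negatives when fewer than 10 match are assumptions.

```python
import random
from typing import Dict, List, Optional


def label_variant(pip: float, cs_max_pip: Optional[float] = None) -> Optional[int]:
    """Return 1 (positive), 0 (eligible negative), or None (excluded).

    cs_max_pip is the maximum PIP in the variant's credible set,
    or None if the variant is not in any credible set.
    """
    if pip > 0.5:
        return 1
    if cs_max_pip is not None and cs_max_pip > 0.1 and pip > 0.05:
        return 1
    if pip < 0.01:
        return 0
    return None  # intermediate PIP: neither positive nor eligible negative


def sample_matched_negatives(
    positive: Dict, candidates: List[Dict], n: int = 10, seed: int = 0
) -> List[Dict]:
    """Sample up to n negatives from the same gene with the same variant type."""
    rng = random.Random(seed)
    pool = [
        v for v in candidates
        if v["gene"] == positive["gene"]
        and v["variant_type"] == positive["variant_type"]
        and label_variant(v["pip"], v.get("cs_max_pip")) == 0
    ]
    # Assumption: if fewer than n matches exist, keep all of them.
    return rng.sample(pool, min(n, len(pool)))


# Hypothetical toy records.
positive = {"gene": "GENE_A", "variant_type": "SNP", "pip": 0.62, "cs_max_pip": 0.62}
candidates = (
    [{"gene": "GENE_A", "variant_type": "SNP", "pip": 0.001} for _ in range(12)]
    + [{"gene": "GENE_A", "variant_type": "insertion", "pip": 0.001} for _ in range(3)]
    + [{"gene": "GENE_B", "variant_type": "SNP", "pip": 0.001} for _ in range(5)]
    + [{"gene": "GENE_A", "variant_type": "SNP", "pip": 0.03}]
)
negatives = sample_matched_negatives(positive, candidates)
print(len(negatives))  # 10
```

The test-set rule (positives with PIP > 0.90) would only change the positive threshold; the matched negative sampling stays the same.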
+    "\n",
+    "### Sample Weighting\n",
+    "\n",
+    "- Negative variants: weight = 1\n",
+    "- Positive variants: weight proportional to PIP\n",
+    "- Total weight balanced between positive and negative classes\n",
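One way to read the three weighting rules above: negatives keep weight 1, and positive weights are PIP values rescaled so that their sum equals the total negative weight. The sketch below implements that reading; the function name and the exact normalization are assumptions, since the text does not spell out the formula.

```python
from typing import List, Sequence


def compute_sample_weights(labels: Sequence[int], pips: Sequence[float]) -> List[float]:
    """Weight negatives at 1 and positives proportional to PIP,
    scaled so both classes carry the same total weight."""
    n_neg = sum(1 for y in labels if y == 0)
    pos_pip_total = sum(p for y, p in zip(labels, pips) if y == 1)
    if pos_pip_total <= 0:
        raise ValueError("need at least one positive variant with PIP > 0")
    scale = n_neg / pos_pip_total  # total negative weight is n_neg * 1
    return [p * scale if y == 1 else 1.0 for y, p in zip(labels, pips)]


# Hypothetical toy data: two positives, four negatives.
labels = [1, 1, 0, 0, 0, 0]
pips = [0.9, 0.3, 0.001, 0.002, 0.004, 0.008]
sample_weights = compute_sample_weights(labels, pips)
print(sample_weights)
```

In this toy case the positive with PIP 0.9 receives three times the weight of the positive with PIP 0.3, and the two positive weights together sum to the number of negatives (up to floating-point rounding). A list like this could be supplied as the `sample_weight` argument of `CatBoostClassifier.fit`.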
+    "\n",
+    "### Cross-Validation: Leave-One-Chromosome-Out (LOCO)\n",
+    "\n",
+    "For each of the 22 autosomes:\n",
+    "1. Train on variants from all other 21 chromosomes\n",
+    "2. Test on the held-out chromosome\n",
+    "3. Aggregate predictions from all 22 held-out chromosomes for final performance metrics\n",
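The three LOCO steps above can be sketched as a split generator plus a loop that collects held-out predictions. This is a minimal sketch under stated assumptions: `loco_splits`, `loco_predict`, and the `fit_predict` callback are hypothetical names, chromosomes are encoded as strings "1" to "22", and chromosomes with no variants are skipped so the example runs on toy data. A trivial baseline stands in for the CatBoost model.

```python
from typing import Callable, Iterator, List, Optional, Sequence, Tuple

AUTOSOMES = [str(c) for c in range(1, 23)]


def loco_splits(chroms: Sequence[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_chromosome, train_indices, test_indices)."""
    for held_out in AUTOSOMES:
        test_idx = [i for i, c in enumerate(chroms) if c == held_out]
        train_idx = [i for i, c in enumerate(chroms) if c in AUTOSOMES and c != held_out]
        if test_idx:  # skip chromosomes with no variants
            yield held_out, train_idx, test_idx


def loco_predict(
    chroms: Sequence[str],
    fit_predict: Callable[[List[int], List[int]], List[float]],
) -> List[Optional[float]]:
    """Aggregate held-out predictions across all chromosomes."""
    preds: List[Optional[float]] = [None] * len(chroms)
    for _, train_idx, test_idx in loco_splits(chroms):
        for i, p in zip(test_idx, fit_predict(train_idx, test_idx)):
            preds[i] = p
    return preds


# Hypothetical toy data: five variants on three chromosomes.
chroms = ["1", "1", "2", "2", "3"]
labels = [1, 0, 0, 0, 1]


def baseline(train_idx: List[int], test_idx: List[int]) -> List[float]:
    """Placeholder model: predict the training-fold positive rate."""
    rate = sum(labels[i] for i in train_idx) / len(train_idx)
    return [rate] * len(test_idx)


oof_preds = loco_predict(chroms, baseline)
print(oof_preds)
```

Because every variant is predicted exactly once, by a model that never saw its chromosome, the aggregated predictions can be scored together with a single metric.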
+    "\n",
+    "---\n",
+    "\n",
+    "### Toy Dataset Note\n",
+    "\n",
+    "This tutorial uses chromosome 2 data only for demonstration:\n",
+    "- Training: 3,056 variants \n",
+    "- Testing: 761 variants (non-overlapping)\n",
+    "- The full study trained models across all 22 chromosomes for each of 6 cell types"
    ]
   },
   {