
eBM: Enhanced BRAIN-MAGNET convolutional neural network modeling non-coding regulatory elements enhancer activity in human neural stem cells


(Figure: the authors' original graphical abstract)

Enhanced BRAIN-MAGNET (eBM) is an improved BRAIN-MAGNET convolutional neural network (CNN) trained to predict the enhancer activity of human non-coding regulatory elements (NCREs) from DNA sequence alone. Coupled with explainable-AI techniques, it facilitates clinical interpretation of disease-relevant variants in NCREs.

The training dataset was obtained from wet-lab experiments measuring the enhancer activity of NCREs. Briefly, the NCREs were identified in (H9-derived) human neural stem cells (NSCs) and transfected into NSCs and embryonic stem cells (ESCs) via ChIP-STARR-seq, a massively parallel reporter assay (MPRA) technique.

For more context and experimental details please consult the peer-reviewed paper Deng et al., BRAIN-MAGNET: A functional genomics atlas for interpretation of non-coding variants, Cell (2026).

Compared to the original BRAIN-MAGNET model, the eBM architecture is ~10,000 times smaller (in trainable parameters) while achieving slightly better generalization performance. Additionally, its exponential activations allow deriving sharper motifs.

In this repository I strove to provide reproducible and fairly performant code for data loading, data splitting, training models, loading trained models, making predictions (inference), calculating contribution scores, and running the motif discovery algorithm (tfmodisco-lite).

Alongside the code, this README links to all the data necessary to run it (plus some intermediate datasets that are somewhat computation-intensive to generate). As in the original repository, a "genomic atlas" of per-nucleotide contribution scores (cb) for ~148k NCREs is provided as an easy-to-query tabix-indexed dataset.

Table of contents

On the GitHub page, at the top right of the README, there is a button that shows an outline of this README and lets you search and navigate to the contents of interest.

Repository background

In early 2025 I decided to explore new career avenues. I had been interested in machine learning and rare diseases (epilepsies in particular) for a while, and my search eventually narrowed down to bioinformatics applied to (non-coding) genetics. It led me to Dr. Barakat's clinical genetics lab. He shared the latest preprint from his lab (the BRAIN-MAGNET preprint). Besides the excellent experimental results, I was fascinated by the computational results and their potential impact on rare diseases.

During the spring and summer of 2025 I set out to learn what was necessary to understand the contents of the preprint and to reproduce the machine learning and other computational aspects of the work.

This repository is the result of those efforts to reproduce (and improve) the CNN model and part of the associated data processing described in the original preprint (to which I did not contribute in any significant way).

A word of caution

The code and methods in this repository produce results that closely match those reported in the original preprint; nonetheless, this repository has not been peer-reviewed. At the time of writing (September 2025), I (@caenrigen) am still relatively new to the fields of genomics, machine learning, and bioinformatics. Feel free to contact me if you spot any bugs.

The git commit history of this repository reflects my learning journey and CNN training experiments, mixed with the mistakes I made along the way! I therefore do not recommend trying to make sense of it.

Original repository

This repository was initially forked from the original repository referenced in the paper. However, my code diverged completely and the few parts of the original repository that I did not touch have been removed from this one.

If you are looking for the R code that was used to generate the figures in the preprint/paper, please consult the original repository.

All the data used in my work here was obtained exclusively from what the authors made publicly available (the preprint, the original GitHub repository, the supplementary tables, and the cb-scores atlas derived from the NSC-trained model).

Focus of this repository

My repository focuses on: reproducibility; CNN performance and architecture; organized, reusable, and performant code (less RAM-hungry, can run on a laptop); and the contribution scores (cb), also known as importance scores, attribution scores, or simply attributions, derived from the CNN models via DeepLiftShap (a flavour of explainable AI).

Quick start for clinicians and the like: variant prioritization in NCREs

To start with, the authors of the paper provide some general recommendations:

  1. Population allele frequency (AF): verify that your variants are rare by checking their AF in population databases such as gnomAD, GEL, or similar. Prioritize variants that are rare or absent in the general population (the threshold depends on the disease frequency).
  2. Regulatory region annotation: determine whether your variants are located within the (enhanced) BRAIN-MAGNET NCRE atlas. Pay special attention to variants that fall within the high-activity NCRE category.
  3. Functional impact prediction: assess the potential regulatory impact of the variants. Prioritize those with high contribution (cb) scores or those disrupting predicted crucial motif sites. You can trace the motif sites via the authors' UCSC Genome Browser tracks (the cb scores of my repo have not yet been uploaded to the UCSC Genome Browser).
  4. Gene-phenotype relevance: evaluate whether the genes associated with the NCREs explain the patient's clinical phenotype.

Steps 2 and 3 are detailed in the nb_query_cb.ipynb notebook (via command line or Python). It shows how to query and, importantly, how to interpret the cb scores. A minimal illustration follows below.
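For a rough idea of what such a lookup involves, here is a sketch using pysam against the tabix-indexed atlas. The file name and the BED-like column layout are my assumptions for illustration; nb_query_cb.ipynb is the authoritative reference.

```python
# Hedged sketch: querying per-nucleotide cb scores from the tabix-indexed
# atlas with pysam. File name and column layout are assumptions.
import pysam

tbx = pysam.TabixFile("ncre_cb_scores.bed.gz")  # hypothetical local path

# Fetch scores overlapping a variant position (coordinates are 0-based, half-open)
for row in tbx.fetch("chr1", 1_000_000, 1_000_001):
    chrom, start, end, score = row.split("\t")[:4]  # assumed BED-like columns
    print(chrom, start, end, float(score))
```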

Visualize the dataset in UCSC

The original paper and repository provide UCSC Genome Browser tracks for the contribution scores. I have not uploaded tracks for the eBM-derived cb scores; contact me if you would find them useful.

Enhanced BRAIN-MAGNET vs BRAIN-MAGNET

In this brief comparison, I do not dive into detailed technical/methodological improvements over the original model and code; instead, I focus mainly on the final results.

BM model: ~65 million trainable parameters.

eBM model: ~5,600 trainable parameters (~10,000× smaller!) with slightly better performance; see nb_benchmark.ipynb for the model performance.

The eBM-derived contribution (cb) scores match those published by the authors while being more robust, less noisy, and sharper. You can find a comparison plot in nb_cb_scores_sample_old_vs_new.ipynb. The eBM cb scores were derived by averaging over an ensemble of five cross-validation fold models, maximizing the knowledge extracted from the enhancer activity dataset.

Additionally, the eBM model architecture and training setup were designed to be more robust against issues that can stem from training on padded one-hot-encoded DNA sequences (the NCREs in the BRAIN-MAGNET dataset vary from 500 to 1000 nucleotides in length). A sketch of such padding is shown below.
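For context, zero-padding means that positions beyond a short NCRE contribute all-zero columns, which a model can mistake for signal if not handled carefully. A minimal illustration (not the repository's exact encoder):

```python
# Illustrative zero-padded one-hot encoder (not the repository's exact code).
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_pad(seq: str, length: int = 1000) -> np.ndarray:
    """One-hot encode `seq` into shape (4, length); positions beyond the
    sequence stay all-zero, which is how padding enters the model."""
    x = np.zeros((4, length), dtype=np.float32)
    for i, base in enumerate(seq[:length].upper()):
        if base in BASES:  # N and other ambiguity codes stay all-zero
            x[BASES[base], i] = 1.0
    return x

print(one_hot_pad("ACGT").sum())  # 4.0: one hot bit per encoded position
```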

Repository overview

  • Install the ebm package: pip install -e . (installs all dependencies).
  • Python 3.13 recommended (>=3.9 might work too). Tested on macOS 13.7.6 (M1 Pro CPU/GPU), PyTorch 2.8.0, Lightning 2.5.0.post0.
  • Notebooks live in notebooks. Paired .py files are generated via Jupytext for clean diffs. Most notebooks expose a device variable (CPU/GPU).
  • R is used only in nb_get_seqs_from_hg38.ipynb to extract hg38 sequences; you can skip it (datasets are provided below). Everything else is Python.
  • Training uses the PyTorch Lightning framework; logging and live visualization via TensorBoard (a minimal sketch follows below).
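A minimal Lightning + TensorBoard run could look as follows; the model and datamodule names here are placeholders, not this repo's exact API (see the training notebooks for the real setup):

```python
# Hedged sketch of a PyTorch Lightning training run with TensorBoard logging.
import lightning as L
from lightning.pytorch.loggers import TensorBoardLogger

trainer = L.Trainer(
    max_epochs=50,                 # placeholder value
    accelerator="auto",            # picks CPU, CUDA, or Apple MPS
    logger=TensorBoardLogger("tb_logs", name="ebm"),
)
# trainer.fit(model, datamodule=dm)  # `model` (a LightningModule) and `dm` are placeholders
# Live visualization: run `tensorboard --logdir tb_logs` in a terminal
```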

Dependencies for attribution scores and motif discovery

Key extras on top of standard scientific packages and PyTorch/Lightning:

  • modisco-lite: motif discovery from attribution scores.
  • tangermeme==0.5.1: DeepLiftShap attributions (replaces the unmaintained shap fork); a usage sketch follows after this list. Note: tangermeme==1.0.0 currently lacks zero-padding support.
  • dinuc_shuf: fast dinucleotide shuffles for reference sequences used by DeepLift/SHAP.
  • logomaker: sequence logos for motifs and attribution visualization.
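As a rough sketch, computing attributions with tangermeme might look like this. The deep_lift_shap signature and its dinucleotide-shuffled default references reflect my reading of tangermeme's documentation, not code lifted from this repository:

```python
# Hedged sketch: DeepLiftShap attributions via tangermeme 0.5.1.
import torch
from tangermeme.deep_lift_shap import deep_lift_shap

# Stand-in model for the sketch; the real eBM CNN comes from models.py (make_bm_cnn)
model = torch.nn.Sequential(
    torch.nn.Conv1d(4, 8, kernel_size=15, padding=7),
    torch.nn.AdaptiveAvgPool1d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 1),
)

# Random valid one-hot batch of shape (N, 4, L) as placeholder input
idx = torch.randint(0, 4, (8, 1000))
X = torch.nn.functional.one_hot(idx, num_classes=4).permute(0, 2, 1).float()

attr = deep_lift_shap(model, X, n_shuffles=20)  # references: dinucleotide shuffles
print(attr.shape)  # expected: (8, 4, 1000), one score per input position
```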

Notebooks overview

The notebooks provide a practical path through the project; see the notebooks directory for the full list.

Model architecture

Small 1D CNN (~5.6k parameters) with reverse-complement awareness and stable training:

  • First layer: reverse-complement convolution (WSConv1dRevComp) + max pooling, with activation bounding for the stability of the exponential activation in the next layer.
  • Exponential activation (expm1) after the first layer, for sharper attributions.
  • Weight standardization for stability and offset-free activations (replacing BatchNorm).
  • Two additional WSConv1d blocks with max pooling, followed by a fully connected dense head.
  • See models.py, function make_bm_cnn, for the exact spec; an illustrative toy sketch follows below.
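To make the first two bullets concrete, here is a toy sketch of a reverse-complement-aware convolution followed by an expm1 activation. It is illustrative only: the real WSConv1dRevComp adds weight standardization and activation bounding (see models.py).

```python
# Toy sketch, NOT the repository's make_bm_cnn.
import torch
import torch.nn as nn

class RevCompConv1d(nn.Conv1d):
    """Scan both strands with shared weights; keep the per-position maximum."""
    def forward(self, x):  # x: one-hot (N, 4, L), channels ordered A,C,G,T
        fwd = super().forward(x)
        # Reverse complement = flip channels (A<->T, C<->G) and flip positions;
        # flip the output back along the length axis to realign positions.
        rev = super().forward(x.flip(1, 2)).flip(2)
        return torch.maximum(fwd, rev)

class Expm1(nn.Module):
    """Exponential activation, exp(x) - 1, used for sharper attributions."""
    def forward(self, x):
        return torch.expm1(x)

head = nn.Sequential(
    RevCompConv1d(4, 32, kernel_size=15, padding=7),
    nn.MaxPool1d(4),
    Expm1(),
)
print(head(torch.zeros(2, 4, 1000)).shape)  # torch.Size([2, 32, 250])
```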

Model weights

The model weights can be downloaded from the Hugging Face model repository. It includes per-fold checkpoints (5-fold CV) for every epoch; TensorBoard logs mirror the runs.

Two model versions are provided: one trained with a 10% test split and one with a 0% test split. The former demonstrates the model's generalization capabilities; the latter maximizes the knowledge extracted from the data and was used to derive the cb scores.

Datasets

The training dataset, along with intermediate and final datasets, is available from the Hugging Face dataset repository. It includes preprocessed hg38 sequences and labels, plus tabix-indexed predictions and cb scores for efficient downstream queries.

Note that the dataset containing the one-hot encoded sequences is generated automatically the first time DataModule.prepare_data() (from data_module.py) is invoked, as sketched below. You don't necessarily have to download it; it is provided for reproducibility.
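For example (the constructor arguments and the import path are hypothetical; check data_module.py for the real signature):

```python
# Hedged usage sketch; the DataModule constructor may require arguments.
from ebm.data_module import DataModule  # module path assumed from this README

dm = DataModule()   # hypothetical default construction
dm.prepare_data()   # generates the one-hot encoded dataset on first invocation
```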

Citation

If you found the contents of my repository useful, please cite my work (click Cite this repository at the top right of the GitHub page) as well as the original paper.

If appropriate, you can cite the trained model weights and the datasets specifically by clicking the DOI URL displayed on the Hugging Face repositories: huggingface.co/caenrigen/enhanced_brain_magnet and huggingface.co/datasets/caenrigen/enhanced_brain_magnet_hg38, respectively.

For disambiguation from the original model and publication, you can refer to my work as eBM (which stands for enhanced BRAIN-MAGNET).

Contact and feedback

For any inquiries or suggestions about this repository, you can get in touch by opening a new discussion, or a new issue for code- or dataset-related topics. Tag my username (@caenrigen) to ensure I get an email notification. Alternatively, find me on LinkedIn.