eBM: Enhanced BRAIN-MAGNET, a convolutional neural network modeling the enhancer activity of non-coding regulatory elements in human neural stem cells
Enhanced-BRAIN-MAGNET is an improved BRAIN-MAGNET convolutional neural network (CNN) model trained to predict the enhancer activity of human non-coding regulatory elements (NCREs) from DNA sequence alone. Coupled with explainable-AI techniques, it facilitates clinical interpretation of disease-relevant variants in NCREs.
The training dataset was obtained from wet-lab experiments measuring the enhancer activity of NCREs. Briefly, the NCREs were identified from (H9-derived) human neural stem cells (NSCs) and transfected into NSCs and embryonic stem cells (ESCs) via ChIP-STARR-seq, a massively parallel reporter assay (MPRA) technique.
For more context and experimental details please consult the peer-reviewed paper Deng et al., BRAIN-MAGNET: A functional genomics atlas for interpretation of non-coding variants, Cell (2026).
Compared to the original BRAIN-MAGNET model, the eBM architecture is ~10,000 times smaller (in terms of trainable parameters) while achieving slightly better generalization performance. Additionally, it yields sharper motifs (thanks to exponential activations).
In this repository I strove to provide reproducible and fairly performant code for data loading, data splitting, training models, loading trained models, making predictions (inference), calculating contribution scores, and running the motif discovery algorithm (tfmodisco-lite).
Alongside the code, this README links to all the data necessary to run it (and some intermediate datasets that are somewhat computation-intensive to generate). As in the original repository, a "genomic atlas" of per-nucleotide contribution scores (cb) for ~148k NCREs is provided as an easy-to-query tabix-indexed dataset.
On the GitHub page, at the top right of the README, there is a button that shows the outline of this README and lets you search and navigate to the contents of interest.
At the beginning of 2025 I decided to explore new career avenues. I have been interested in machine learning and rare diseases (epilepsies in particular) for a while. My search eventually narrowed down to bioinformatics applied to (non-coding) genetics and led me to Dr. Barakat's clinical genetics lab. He shared the latest preprint from his lab (the BRAIN-MAGNET preprint). Besides the excellent experimental results, I was fascinated by the computational results and their potential impact on rare diseases.
During the spring and summer of 2025 I set out to learn what was necessary to understand the contents of the preprint and to reproduce the machine learning and other computational aspects of the work.
This repository is the result of those efforts to reproduce (and improve) the CNN model and part of the associated data processing described in the original preprint (to which I did not contribute in significant ways).
The code and methods used in this repository produce results that are highly consistent with those reported in the original preprint; nonetheless, this repository has not been peer-reviewed. At the time of writing (September 2025), I (@caenrigen) am still relatively new to the fields of genomics, machine learning, and bioinformatics. Feel free to contact me if you spot any bugs.
The git commit history of this repository reflects my learning journey and CNN training experiments, mixed with the mistakes I made along the way! Therefore, I do not recommend trying to make sense of its history.
This repository was initially forked from the original repository referenced in the paper. However, my code diverged completely and the few parts of the original repository that I did not touch have been removed from this one.
If you are looking for the R code that was used to generate the figures in the preprint/paper, please consult the original repository.
All the data used in my work here was obtained exclusively from what the authors made publicly available (the preprint, the original GitHub repository, supplementary tables, and the cb-scores atlas derived from the NSC-trained model).
My repository focuses on reproducibility, CNN performance and architecture; on organized, reusable, and performant code (less RAM-hungry, able to run on a laptop); and on the contribution scores (cb) [aka importance scores, attribution scores, or simply attributions] derived from the CNN models via DeepLiftShap (a flavour of explainable-AI technique).
To start with, the authors of the paper provide some general recommendations:
- Population allele frequency (AF): Verify that your variants are rare by checking their AF in population databases such as gnomAD, GEL or similar. Prioritize variants with low (depending on disease frequency) or absent frequency in the general population.
- Regulatory region annotation: Determine whether your variants are located within the (enhanced) BRAIN-MAGNET NCRE atlas. Pay special attention to variants that fall within the high-activity NCRE category.
- Functional impact prediction: Assess the potential regulatory impact of the variants. Prioritize those with high contribution (`cb`) scores or with disruption of predicted crucial motif sites. You can trace the motif sites via the author-provided UCSC Genome Browser tracks (the `cb` scores of my repo have not yet been uploaded to the UCSC Genome Browser).
- Gene-phenotype relevance: Evaluate whether the genes associated with the NCREs explain the patient's clinical phenotype.
Steps 2 and 3 are detailed in the `nb_query_cb.ipynb` notebook (via the command line or Python). It shows how to query and, importantly, how to interpret the `cb` scores.
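For illustration, here is a minimal sketch of such a query with `pysam`; the file name and column layout are assumptions, and `nb_query_cb.ipynb` remains the authoritative reference:

```python
# Minimal sketch, assuming the cb scores live in a bgzipped, tabix-indexed,
# BED-like file. The file name and columns here are hypothetical; see
# nb_query_cb.ipynb for the real format.
# CLI equivalent: tabix cb_scores.bed.gz chr2:166000000-166000100
import pysam

tbx = pysam.TabixFile("cb_scores.bed.gz")  # hypothetical local path
# Fetch per-nucleotide scores overlapping a region of interest (0-based, half-open):
for row in tbx.fetch("chr2", 166_000_000, 166_000_100):
    fields = row.split("\t")
    print(fields)  # e.g. chrom, start, end, cb score (assumed column order)
```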
The original paper and repository provide UCSC Genome Browser tracks corresponding to the contribution scores. I have not uploaded tracks for the eBM-derived cb scores; contact me if you would find them useful.
In this brief comparison I do not dive into detailed technical/methodological improvements over the original model and code; instead, I focus mainly on the final results.
BM model: ~65 million trainable parameters.
eBM model: ~5,600 trainable parameters (~10,000× smaller!) with slightly better performance; see `nb_benchmark.ipynb` for model performance.
The eBM-derived contribution (cb) scores match those published by the authors while being more robust, less noisy, and sharper. You can find a comparison plot in `nb_cb_scores_sample_old_vs_new.ipynb`. The eBM cb scores were derived by averaging over an ensemble of five models from 5-fold cross-validation, maximizing the knowledge extracted from the enhancer activity dataset.
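The ensemble averaging itself is simple; a hypothetical sketch (file names invented for illustration):

```python
# Hypothetical sketch: average per-nucleotide attributions over the five
# cross-validation fold models to obtain the ensemble cb scores.
import numpy as np

fold_attrs = [np.load(f"attrs_fold{i}.npy") for i in range(5)]  # hypothetical files
cb_scores = np.stack(fold_attrs).mean(axis=0)  # (4, L) ensemble average
```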
Additionally, the eBM model architecture and training setup were designed to be more robust against issues that might stem from training on padded one-hot encoded DNA sequences (the BRAIN-MAGNET dataset has NCREs varying from 500 to 1000 nucleotides in length).
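For concreteness, a minimal sketch of what padded one-hot encoding looks like (the channel order A, C, G, T is an assumption; the repository's own encoding code is authoritative):

```python
# Minimal sketch: one-hot encode variable-length NCREs (500-1000 nt) and
# zero-pad to a fixed length. Channel order A, C, G, T is an assumption.
import numpy as np

BASE_TO_IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_pad(seq: str, length: int = 1000) -> np.ndarray:
    """Return a (4, length) array; positions past len(seq) stay all-zero."""
    x = np.zeros((4, length), dtype=np.float32)
    for i, base in enumerate(seq[:length].upper()):
        idx = BASE_TO_IDX.get(base)
        if idx is not None:  # ambiguous bases (e.g. N) stay all-zero
            x[idx, i] = 1.0
    return x

x = one_hot_pad("ACGT" * 150)  # a 600 nt sequence padded to 1000
```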
- Install the `ebm` package: `pip install -e .` (installs all dependencies).
- Python 3.13 recommended (>=3.9 might work too). Tested on macOS 13.7.6 (M1 Pro CPU/GPU), PyTorch 2.8.0, Lightning 2.5.0.post0.
- Notebooks live in `notebooks`. Paired `.py` files are generated via Jupytext for clean diffs. Most notebooks expose a `device` variable (CPU/GPU).
- R is used only in `nb_get_seqs_from_hg38.ipynb` to extract hg38 sequences; you can skip it (datasets are provided below). Everything else is Python.
- Training uses the PyTorch Lightning framework; logging and live visualization via TensorBoard.
Key extras on top of standard scientific packages and PyTorch/Lightning:
- `modisco-lite`: motif discovery from attribution scores.
- `tangermeme==0.5.1`: DeepLiftShap attributions (replaces the unmaintained `shap` fork). Note: `tangermeme==1.0.0` currently lacks zero-padding support.
- `dinuc_shuf`: fast dinucleotide shuffles for reference sequences used by DeepLift/SHAP.
- `logomaker`: sequence logos for motifs and attribution visualization.
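For orientation, a hedged sketch of computing DeepLiftShap attributions, assuming `tangermeme`'s `deep_lift_shap` entry point (v0.5.x); the repository's `nb_cb_scores.ipynb` documents the actual procedure, including the zero-padding handling:

```python
# Hedged sketch, assuming tangermeme==0.5.1 exposes deep_lift_shap as below;
# the real procedure (zero-padding, number of references) is in nb_cb_scores.ipynb.
import torch
from torch import nn
from tangermeme.deep_lift_shap import deep_lift_shap

# Toy stand-in model; the real eBM is built by models.py / make_bm_cnn.
model = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=15, padding=7), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(8, 1),
)

# Random one-hot batch of shape (N, 4, L) standing in for real NCREs.
torch.manual_seed(0)
idx = torch.randint(0, 4, (2, 1000))
X = torch.nn.functional.one_hot(idx, num_classes=4).permute(0, 2, 1).float()

# Per-nucleotide attributions vs. dinucleotide-shuffled references:
attrs = deep_lift_shap(model, X, n_shuffles=20)
print(attrs.shape)  # (2, 4, 1000)
```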
A practical path through the project:
- `nb_dataset_assembly.ipynb`: assemble the training dataset.
- `nb_dataset_overview.ipynb`: inspect labels and sequence lengths.
- `nb_trainining.ipynb`, `nb_training_results.ipynb`: train eBM with Lightning and review metrics.
- `nb_benchmark.ipynb`: eBM test-set performance.
- `nb_predictions_for_full_dataset_and_dinuc_shuffled.ipynb`: predictions on the full set and dinucleotide-shuffled references.
- `nb_inference_with_trained_model.ipynb`, `nb_inspect_predictions.ipynb`: load weights, run predictions, inspect outputs.
- `nb_checks_before_cb_score.ipynb`: validate and choose the number of dinuc-shuffled references.
- `nb_cb_scores.ipynb`: compute contribution (`cb`) scores.
- `nb_make_preds_tabix.ipynb`, `nb_make_cb_tabix.ipynb`: package predictions/`cb` into tabix-indexed files for fast queries.
- `nb_query_cb.ipynb`: query and interpret `cb` scores for genetic variants/regions (CLI and Python).
- `nb_select_seqs_for_modisco.ipynb`, `nb_modisco.ipynb`: motif discovery from attributions using `modisco-lite`.
- `nb_modisco_results.ipynb`: inspect modisco results and match them against the JASPAR database via tomtom (`memesuite-lite` fast implementation).
- `nb_get_seqs_from_hg38.ipynb`: extract hg38 sequences (R-only; optional, output provided below).
Small 1D CNN (~5.6k parameters) with reverse-complement awareness and stable training:
- First layer: reverse-complement convolution (`WSConv1dRevComp`) + max pooling. Activations are bounded for stability of the exponential activation (next layer).
- Exponential activation (`expm1`) after the first layer for sharper attributions.
- Weight standardization for stability and offset-free activations (replacing BatchNorm).
- Two additional `WSConv1d` blocks with max pooling, followed by a fully connected dense head.
- See `models.py`, function `make_bm_cnn`, for the exact spec.
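Below is a conceptual sketch of these building blocks, not the repository's exact code; layer sizes, the channel order (A, C, G, T), and the max-over-strands combination are assumptions, and `models.py` (`make_bm_cnn`) is the definitive spec:

```python
# Conceptual sketch of the building blocks described above; NOT the repo's
# exact code (see models.py, make_bm_cnn). Layer sizes, channel order
# (A, C, G, T), and the max-over-strands combination are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expm1(nn.Module):
    """Exponential activation exp(x) - 1, used for sharper attributions."""
    def forward(self, x):
        return torch.expm1(x)


class WSConv1dRevCompSketch(nn.Conv1d):
    """Weight-standardized Conv1d that also scans the reverse complement."""

    def standardized_weight(self):
        w = self.weight
        mean = w.mean(dim=(1, 2), keepdim=True)
        std = w.std(dim=(1, 2), keepdim=True)
        return (w - mean) / (std + 1e-5)  # zero-mean, unit-std per filter

    def forward(self, x):
        w = self.standardized_weight()
        # Flipping the channel axis maps A<->T and C<->G (order A, C, G, T);
        # flipping the kernel axis reverses the motif. Together: the
        # reverse-complement filter.
        w_rc = torch.flip(w, dims=(1, 2))
        fwd = F.conv1d(x, w, self.bias, padding=self.padding[0])
        rev = F.conv1d(x, w_rc, self.bias, padding=self.padding[0])
        return torch.maximum(fwd, rev)  # strand-symmetric response


# Illustrative stack mirroring the bullet points (sizes are made up):
sketch = nn.Sequential(
    WSConv1dRevCompSketch(4, 32, kernel_size=15, padding=7),
    nn.MaxPool1d(4),
    # (the real model bounds activations here before the exponential)
    Expm1(),
    # ... two more weight-standardized conv blocks + a dense head follow
)
```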
The model weights can be downloaded from the Hugging Face model repository. It includes per-fold checkpoints (5-fold CV) for every epoch; TensorBoard logs mirror the runs.
Two model versions are provided: one trained with a 0% test-set split and one with a 10% test-set split. The 10% version demonstrates the model's generalization capabilities, while the 0% version maximizes knowledge extraction from the dataset and was used to derive the cb scores.
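A checkpoint can be fetched programmatically with `huggingface_hub` (the file name below is hypothetical; browse the repository for the actual paths):

```python
# Hedged sketch: download a per-fold checkpoint from the Hugging Face model
# repo. The filename is hypothetical; browse the repo for actual paths.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="caenrigen/enhanced_brain_magnet",
    filename="fold0/last.ckpt",  # hypothetical path
)
```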
The training dataset, along with intermediate and final datasets, is available from the Hugging Face dataset repository. It includes preprocessed hg38 sequences and labels, plus tabix-indexed predictions and cb scores for efficient downstream queries.
Note that the dataset containing the one-hot encoded sequences is automatically generated the first time `DataModule.prepare_data()` (from `data_module.py`) is invoked. You don't necessarily have to download it; it is provided for reproducibility.
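A minimal sketch of triggering that generation, assuming the `DataModule` follows the standard Lightning API (the import path and constructor arguments may differ):

```python
# Hedged sketch, assuming the DataModule in data_module.py follows the
# standard Lightning API; constructor arguments may differ in the repo.
from ebm.data_module import DataModule  # assumed import path

dm = DataModule()   # check data_module.py for the actual arguments
dm.prepare_data()   # generates the one-hot encoded dataset on first run
```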
If you found the contents of my repository useful, please cite my work (click "Cite this repository" at the top right of the GitHub page) and the original paper.
If appropriate, you can specifically cite the trained model weights and the datasets by clicking the DOI URL displayed on the Hugging Face repositories: huggingface.co/caenrigen/enhanced_brain_magnet and huggingface.co/datasets/caenrigen/enhanced_brain_magnet_hg38, respectively.
For disambiguation from the original model and publication, you can refer to my work as eBM (which stands for enhanced BRAIN-MAGNET).
For any inquiries or suggestions about this repository, you can get in touch with me by opening a new discussion, or a new issue for code- or dataset-related topics. Tag my username (@caenrigen) to ensure I get an email notification. Alternatively, find me on LinkedIn.
