DeepSeMS is a Transformer-based large language model designed to reveal the hidden biosynthetic potential of microbial genomes. It predicts chemical structures (SMILES) directly from biosynthetic gene clusters (BGCs) identified by antiSMASH or DeepBGC.
Key Features:
- Predicts natural product structures directly from BGC sequences.
- Supports BGC annotations generated by antiSMASH and DeepBGC as input sources.
- Provides top-ranked SMILES, model confidence, and consensus scoring.
- User-friendly web server and a fully reproducible environment via Docker.

- Publication
- Web Server
- Installation & Setup
- Model Construction
- Requirements
- Preferred Hardware
DeepSeMS: a large language model reveals hidden biosynthetic potential of the global ocean microbiome. bioRxiv 2025.03.02.641084; doi: https://doi.org/10.1101/2025.03.02.641084
DeepSeMS provides an online web server that allows users to run the model without installing any software: https://biochemai.cstspace.cn/deepsems/
DeepSeMS can be run through Docker (recommended) or a local Conda environment.
Before running DeepSeMS, please complete the Required Files & Project Setup section below.
DeepSeMS requires several support files before it can run predictions. First, clone the repository:

```bash
git clone https://github.com/lab-of-biochemai/DeepSeMS.git
cd DeepSeMS
```
All required support files are hosted on Figshare (http://doi.org/10.6084/m9.figshare.29680658) and must be downloaded before running predictions, as they are too large to be stored in the GitHub repository. Download and place them as follows:
- Place the checkpoint files (checkpoint0.ckpt ... checkpoint9.ckpt) into the ./checkpoints/ directory.
- Unzip pfam.zip and copy all Pfam files (e.g., Pfam-A.hmm, Pfam-A.hmm.h3f, etc.) into the ./data/pfam/ directory.
Note: You may need to run hmmpress ./data/pfam/Pfam-A.hmm if index files are missing.
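For reference, hmmpress writes four index files ending in .h3f, .h3i, .h3m, and .h3p next to Pfam-A.hmm. The short sketch below (not part of the repository) checks for them and rebuilds them if needed, assuming HMMER is installed and hmmpress is on your PATH:

```python
import subprocess
from pathlib import Path

# Sketch: rebuild the Pfam HMM indexes with hmmpress if any are missing.
# Assumes Pfam-A.hmm is already in ./data/pfam/ and hmmpress is on PATH.
pfam_hmm = Path("./data/pfam/Pfam-A.hmm")
missing = [s for s in (".h3f", ".h3i", ".h3m", ".h3p")
           if not Path(str(pfam_hmm) + s).exists()]

if missing:
    print(f"Missing index files ({', '.join(missing)}); running hmmpress ...")
    subprocess.run(["hmmpress", str(pfam_hmm)], check=True)
else:
    print("Pfam index files are present.")
```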
Ensure your local directory is organized as follows before running the model:
```
DeepSeMS/
├── checkpoints/                        # Place model weights here (.ckpt)
├── data/
│   ├── pfam/                           # Place Pfam database files here
│   ├── data_set.csv                    # Data set for training
├── vocabs/                             # Vocabulary files
├── test/                               # Input files for prediction
│   ├── outputs/                        # Annotation and output result files
├── tokenizer/
│   ├── tokenizer.py                    # Tokenizer
├── models/                             # Model architecture code
├── calculate_molecular_properties.py   # Result post-processing
├── data_processing.py                  # Data processing for training
├── predict.py                          # Prediction script
├── train.py                            # Training script
├── environment.yml                     # Code environment file
└── README.md
```
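Before the first run, it can help to confirm that the downloaded support files are actually in place. The following is an illustrative check (not part of the repository) for the ten checkpoints and the Pfam database:

```python
from pathlib import Path

# Illustrative check (not part of DeepSeMS): confirm the support files are in place.
root = Path(".")

missing_ckpts = [f"checkpoint{i}.ckpt" for i in range(10)
                 if not (root / "checkpoints" / f"checkpoint{i}.ckpt").exists()]
if missing_ckpts:
    print("Missing checkpoints:", ", ".join(missing_ckpts))

pfam_ok = (root / "data" / "pfam" / "Pfam-A.hmm").exists()
if not pfam_ok:
    print("Missing Pfam database: data/pfam/Pfam-A.hmm")

if not missing_ckpts and pfam_ok:
    print("All required support files found.")
```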
DeepSeMS supports two input types (examples are provided in the repository):
| Source Tool | File Format | Example |
|---|---|---|
| antiSMASH | GenBank | `./test/antiSMASH_example.gbk` |
| DeepBGC | FASTA | `./test/DeepBGC_example.fa` |
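If you script over mixed inputs, the `--type` argument of predict.py (documented below) can be derived from the file extension. A minimal sketch; the `run_prediction` helper is hypothetical and not part of the repository:

```python
import subprocess
import sys
from pathlib import Path

def run_prediction(input_path: str) -> None:
    """Hypothetical helper: pick --type from the file extension and call predict.py."""
    suffix = Path(input_path).suffix.lower()
    bgc_type = "antismash" if suffix in {".gbk", ".gb"} else "deepbgc"
    subprocess.run(
        [sys.executable, "predict.py", "--input", input_path, "--type", bgc_type],
        check=True,
    )

run_prediction("./test/antiSMASH_example.gbk")
run_prediction("./test/DeepBGC_example.fa")
```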
Docker provides the simplest and most reproducible way to run DeepSeMS.
```bash
docker pull tingjunxu2022/deepsems:v1
```
Mount your current directory (containing code and data) to /deepsems inside the container:
```bash
# Assuming the DeepSeMS project directory is /home/user/DeepSeMS.
docker run -it -v /home/user/DeepSeMS:/deepsems tingjunxu2022/deepsems:v1 /bin/bash
```
Or run the container with GPU support (requires the NVIDIA Container Toolkit):
```bash
docker run --gpus all -it -v /home/user/DeepSeMS:/deepsems tingjunxu2022/deepsems:v1 /bin/bash
```
Inside the container, use predict.py to generate SMILES strings from a BGC file:
```bash
cd /deepsems
python predict.py
```
- Arguments:
  - `--input`: Path to the input file.
  - `--type`: Input format. Options: `antismash` (default) or `deepbgc`.
  - `--output`: Directory to save annotation and output result files (default: `./test/outputs/`).
  - `--pfam`: Directory containing the Pfam database files (default: `./data/pfam/`).
Run prediction on an antiSMASH GenBank file (the default input type):
```bash
python predict.py --input ./test/antiSMASH_example.gbk --type antismash
```
Run prediction on a DeepBGC FASTA file:
```bash
python predict.py --input ./test/DeepBGC_example.fa --type deepbgc
```
The results are printed to the console and saved to a .csv file in the output directory, named after the input file (e.g., ./test/outputs/antiSMASH_example/antiSMASH_example_result.csv).
Results are ranked by consensus across the 10 submodels and by predicted score, with the top-ranked structure being the one predicted most consistently.
Typical output:
```
------------------------------
Rank: 1
Predicted SMILES: CC(C)C1C=CC(=O)NCCC=CC(NC(=O)C(NC(=O)O)C(C)C)C(=O)N1
Predicted score: 87.86
Consensus count: 5
------------------------------
Rank: 2
Predicted SMILES: CCCCCCC=CC=CC(=O)NC(C(=O)NC1CCCCNC(=O)C=CC(CC)NC1=O)C(C)O
Predicted score: 85.68
Consensus count: 3
------------------------------
Rank: 3
Predicted SMILES: CCCCCCCCC=CCC(=O)NC(C(=O)NC1CC(=O)C=CC(C(C)C)NC1=O)C(C)C
Predicted score: 83.41
Consensus count: 2
...
```
Explanation:
| Field | Meaning |
|---|---|
| Predicted SMILES | The predicted valid chemical structure |
| Predicted score | Model confidence (higher = better) |
| Consensus count | Consistency among 10 submodels (higher = more reliable) |
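To make the ranking rule concrete, the snippet below re-derives the order of the three example predictions above by sorting on consensus count and then predicted score (illustration only; predict.py already outputs results in this order):

```python
# Illustration of the documented ranking rule: higher consensus count first,
# then higher predicted score. Data taken from the example output above.
predictions = [
    {"smiles": "CCCCCCC=CC=CC(=O)NC(C(=O)NC1CCCCNC(=O)C=CC(CC)NC1=O)C(C)O",
     "score": 85.68, "consensus": 3},
    {"smiles": "CC(C)C1C=CC(=O)NCCC=CC(NC(=O)C(NC(=O)O)C(C)C)C(=O)N1",
     "score": 87.86, "consensus": 5},
    {"smiles": "CCCCCCCCC=CCC(=O)NC(C(=O)NC1CC(=O)C=CC(C(C)C)NC1=O)C(C)C",
     "score": 83.41, "consensus": 2},
]

ranked = sorted(predictions, key=lambda p: (p["consensus"], p["score"]), reverse=True)
for rank, p in enumerate(ranked, start=1):
    print(f"Rank {rank}: consensus={p['consensus']}, score={p['score']}")
```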
You must manually download SAScorer from RDKit's official GitHub repository (https://github.com/rdkit/rdkit/tree/master/Contrib/SA_Score) and place its files (e.g., sascorer.py, fpscores.pkl.gz) into the ./sascorer directory.
Run calculate_molecular_properties.py to calculate molecular properties from a DeepSeMS result file:
```bash
# For example:
python calculate_molecular_properties.py --input_file ./test/outputs/antiSMASH_example/antiSMASH_example_result.csv
```
- Arguments:
  - `--input_file`: Path to the DeepSeMS result file.
  - `--output_dir`: (Optional) Directory to save the output file. If not specified, the output file will be saved in the same directory as the input file.
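For reference, once the SAScorer files are in ./sascorer they are used roughly as follows; this is a sketch of the standard RDKit Contrib usage, not the exact internals of calculate_molecular_properties.py:

```python
import sys
from rdkit import Chem
from rdkit.Chem import Descriptors

# Make the downloaded SAScorer module importable (expects sascorer.py and
# fpscores.pkl.gz in ./sascorer, as described above).
sys.path.append("./sascorer")
import sascorer

smiles = "CC(C)C1C=CC(=O)NCCC=CC(NC(=O)C(NC(=O)O)C(C)C)C(=O)N1"  # Rank 1 example above
mol = Chem.MolFromSmiles(smiles)
if mol is not None:
    print("Molecular weight:", round(Descriptors.MolWt(mol), 2))
    print("Synthetic accessibility score:", round(sascorer.calculateScore(mol), 2))
```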
predict.py currently supports processing a single BGC per run.
To efficiently run predictions on a large number of BGC files, you can use GNU parallel or a simple Bash loop.
```bash
# Ubuntu / Debian
sudo apt-get install parallel

# macOS
brew install parallel

# Conda (recommended, no sudo required)
conda install -c conda-forge parallel
```
Assume your input BGC files are stored in ./inputs/. Run antiSMASH .gbk files in parallel:
```bash
parallel -j N 'python predict.py --input {} --type antismash' ::: ./inputs/*.gbk
```
Run DeepBGC .fa files in parallel:
```bash
parallel -j N 'python predict.py --input {} --type deepbgc' ::: ./inputs/*.fa
```
- Notes:
  - `-j N`: number of parallel jobs.
  - Outputs will be written to the default output directory (or specify one via `--output`).
Alternatively, use a simple Bash loop:
```bash
for f in ./inputs/*.gbk; do
    python predict.py --input "$f" --type antismash
done
```
If you prefer to run DeepSeMS locally without Docker, we recommend using Conda.
Create a Conda environment from environment.yml:
```bash
conda env create -f environment.yml
conda activate deepsems
```
Or configure a Conda environment step by step:
```bash
conda create -n deepsems python=3.10
conda activate deepsems
conda install -c bioconda hmmer=3.3.2
pip install torch==2.1.0 torchtext==0.16.0
pip install biopython==1.79 pandas==2.0.3 rdkit==2023.03.1 numpy==1.26.0
```
👉 Please refer to the detailed model construction documentation: Model Construction.
The annotated versions have been tested; later versions should generally work.
- Language: Python 3.10
- Deep Learning: PyTorch 2.1.0, TorchText 0.16.0
- Bioinformatics: HMMER3 (v3.3.2), Biopython (v1.79), Pfam Database (v36.0)
- Chemistry: RDKit (v2023.03.1)
- Data Handling: Pandas (v2.0.3), NumPy (v1.26.0)
- CUDA 12.0 (tested)
- GPU VRAM: 24 GB (NVIDIA GeForce RTX 4090 tested)
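To check whether PyTorch can see a GPU matching these specifications, a quick sketch:

```python
import torch

# Report the PyTorch/CUDA setup available for prediction.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name} ({vram_gb:.0f} GB VRAM)")
```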