DeepSeMS is a Transformer-based large language model designed to reveal the hidden biosynthetic potential of microbial genomes. It predicts chemical structures (SMILES) directly from biosynthetic gene clusters (BGCs) identified by antiSMASH or DeepBGC.
Key Features:
- Predicts natural product structures directly from BGC sequences.
- Supports BGC annotations generated by antiSMASH and DeepBGC as input sources.
- Provides top-ranked SMILES, model confidence, and consensus scoring.
- User-friendly web server and a fully reproducible environment via Docker.

- Publication
- Web Server
- Installation & Setup
- Model Construction
- Requirements
- Preferred Hardware
DeepSeMS: a large language model reveals hidden biosynthetic potential of the global ocean microbiome. bioRxiv 2025.03.02.641084; doi: https://doi.org/10.1101/2025.03.02.641084
DeepSeMS provides an online web server that allows users to run the model without installing any software: https://biochemai.cstspace.cn/deepsems/
DeepSeMS can be run through Docker (recommended) or a local Conda environment.
Before running DeepSeMS, please complete the Required Files & Project Setup section below.
DeepSeMS requires several support files before it can run predictions. First, clone the repository:

```bash
git clone https://github.com/lab-of-biochemai/DeepSeMS.git
cd DeepSeMS
```
All required support files are hosted on Figshare (http://doi.org/10.6084/m9.figshare.29680658) and must be downloaded before running predictions, as they are too large to be stored in the GitHub repository. Download and place them as follows:
- Place the checkpoint files (checkpoint0.ckpt ... checkpoint9.ckpt) into the ./checkpoints/ directory.
- Unzip pfam.zip and copy all Pfam files (e.g., Pfam-A.hmm, Pfam-A.hmm.h3f, etc.) into the ./data/pfam/ directory.
Note: You may need to run hmmpress ./data/pfam/Pfam-A.hmm if index files are missing.
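For reference, hmmpress writes four index files ending in .h3f, .h3i, .h3m, and .h3p next to Pfam-A.hmm. The short sketch below (not part of the repository) checks for them and rebuilds them if needed, assuming HMMER is installed and hmmpress is on your PATH:

```python
import subprocess
from pathlib import Path

# Sketch: rebuild the Pfam HMM indexes with hmmpress if any are missing.
# Assumes Pfam-A.hmm is already in ./data/pfam/ and hmmpress is on PATH.
pfam_hmm = Path("./data/pfam/Pfam-A.hmm")
missing = [s for s in (".h3f", ".h3i", ".h3m", ".h3p")
           if not Path(str(pfam_hmm) + s).exists()]

if missing:
    print(f"Missing index files ({', '.join(missing)}); running hmmpress ...")
    subprocess.run(["hmmpress", str(pfam_hmm)], check=True)
else:
    print("Pfam index files are present.")
```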
Ensure your local directory is organized as follows before running the model:
```
DeepSeMS/
├── checkpoints/                        # Place model weights here (.ckpt)
├── data/
│   ├── pfam/                           # Place Pfam database files here
│   ├── data_set.csv                    # Data set for training
├── vocabs/                             # Vocabulary files
├── test/                               # Input files for prediction
│   ├── outputs/                        # Annotation and output result files
├── tokenizer/
│   ├── tokenizer.py                    # Tokenizer
├── models/                             # Model architecture code
├── calculate_molecular_properties.py   # Result post-processing
├── data_processing.py                  # Data processing for training
├── predict.py                          # Prediction script
├── train.py                            # Training script
├── environment.yml                     # Code environment file
└── README.md
```
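Before the first run, it can help to confirm that the downloaded support files are actually in place. The following is an illustrative check (not part of the repository) for the ten checkpoints and the Pfam database:

```python
from pathlib import Path

# Illustrative check (not part of DeepSeMS): confirm the support files are in place.
root = Path(".")

missing_ckpts = [f"checkpoint{i}.ckpt" for i in range(10)
                 if not (root / "checkpoints" / f"checkpoint{i}.ckpt").exists()]
if missing_ckpts:
    print("Missing checkpoints:", ", ".join(missing_ckpts))

pfam_ok = (root / "data" / "pfam" / "Pfam-A.hmm").exists()
if not pfam_ok:
    print("Missing Pfam database: data/pfam/Pfam-A.hmm")

if not missing_ckpts and pfam_ok:
    print("All required support files found.")
```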
DeepSeMS supports two input types (examples are provided in the repository):
| Source Tool | File Format | Example |
|---|---|---|
| antiSMASH | GenBank | `./test/antiSMASH_example.gbk` |
| DeepBGC | FASTA | `./test/DeepBGC_example.fa` |
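If you script over mixed inputs, the `--type` argument of predict.py (documented below) can be derived from the file extension. A minimal sketch; the `run_prediction` helper is hypothetical and not part of the repository:

```python
import subprocess
import sys
from pathlib import Path

def run_prediction(input_path: str) -> None:
    """Hypothetical helper: pick --type from the file extension and call predict.py."""
    suffix = Path(input_path).suffix.lower()
    bgc_type = "antismash" if suffix in {".gbk", ".gb"} else "deepbgc"
    subprocess.run(
        [sys.executable, "predict.py", "--input", input_path, "--type", bgc_type],
        check=True,
    )

run_prediction("./test/antiSMASH_example.gbk")
run_prediction("./test/DeepBGC_example.fa")
```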
Docker provides the simplest and most reproducible way to run DeepSeMS.
```bash
docker pull tingjunxu2022/deepsems:v1
```
Mount your current directory (containing code and data) to /deepsems inside the container:
```bash
# Assuming the DeepSeMS project directory is /home/user/DeepSeMS.
docker run -it -v /home/user/DeepSeMS:/deepsems tingjunxu2022/deepsems:v1 /bin/bash
```
Or run the container with GPU support (requires the NVIDIA Container Toolkit):
```bash
docker run --gpus all -it -v /home/user/DeepSeMS:/deepsems tingjunxu2022/deepsems:v1 /bin/bash
```
Inside the container, use predict.py to generate SMILES strings from a BGC file:
```bash
cd /deepsems
python predict.py
```
- Arguments:
  - `--input`: Path to the input file.
  - `--type`: Input format. Options: `antismash` (default) or `deepbgc`.
  - `--output`: Directory to save annotation and output result files (default: `./test/outputs/`).
  - `--pfam`: Directory containing the Pfam database files (default: `./data/pfam/`).
Run prediction on an antiSMASH GenBank file (the default input type):
```bash
python predict.py --input ./test/antiSMASH_example.gbk --type antismash
```
Run prediction on a DeepBGC FASTA file:
```bash
python predict.py --input ./test/DeepBGC_example.fa --type deepbgc
```
The results are printed to the console and saved to a .csv file in the output directory, named after the input file (e.g., ./test/outputs/antiSMASH_example/antiSMASH_example_result.csv).
Results are ranked by consensus across the 10 submodels and by predicted score, with the top-ranked structure being the one predicted most consistently.
Typical output:
```
------------------------------
Rank: 1
Predicted SMILES: CC(C)C1C=CC(=O)NCCC=CC(NC(=O)C(NC(=O)O)C(C)C)C(=O)N1
Predicted score: 87.86
Consensus count: 5
------------------------------
Rank: 2
Predicted SMILES: CCCCCCC=CC=CC(=O)NC(C(=O)NC1CCCCNC(=O)C=CC(CC)NC1=O)C(C)O
Predicted score: 85.68
Consensus count: 3
------------------------------
Rank: 3
Predicted SMILES: CCCCCCCCC=CCC(=O)NC(C(=O)NC1CC(=O)C=CC(C(C)C)NC1=O)C(C)C
Predicted score: 83.41
Consensus count: 2
...
```
Explanation:
| Field | Meaning |
|---|---|
| Predicted SMILES | The predicted valid chemical structure |
| Predicted score | Model confidence (higher = better) |
| Consensus count | Consistency among 10 submodels (higher = more reliable) |
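To make the ranking rule concrete, the snippet below re-derives the order of the three example predictions above by sorting on consensus count and then predicted score (illustration only; predict.py already outputs results in this order):

```python
# Illustration of the documented ranking rule: higher consensus count first,
# then higher predicted score. Data taken from the example output above.
predictions = [
    {"smiles": "CCCCCCC=CC=CC(=O)NC(C(=O)NC1CCCCNC(=O)C=CC(CC)NC1=O)C(C)O",
     "score": 85.68, "consensus": 3},
    {"smiles": "CC(C)C1C=CC(=O)NCCC=CC(NC(=O)C(NC(=O)O)C(C)C)C(=O)N1",
     "score": 87.86, "consensus": 5},
    {"smiles": "CCCCCCCCC=CCC(=O)NC(C(=O)NC1CC(=O)C=CC(C(C)C)NC1=O)C(C)C",
     "score": 83.41, "consensus": 2},
]

ranked = sorted(predictions, key=lambda p: (p["consensus"], p["score"]), reverse=True)
for rank, p in enumerate(ranked, start=1):
    print(f"Rank {rank}: consensus={p['consensus']}, score={p['score']}")
```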
You must manually download SAScorer from RDKit's official GitHub repository (https://github.com/rdkit/rdkit/tree/master/Contrib/SA_Score) and place its files (e.g., sascorer.py, fpscores.pkl.gz) into the ./sascorer directory.
Run calculate_molecular_properties.py to calculate molecular properties from a DeepSeMS result file:
```bash
# For example:
python calculate_molecular_properties.py --input_file ./test/outputs/antiSMASH_example/antiSMASH_example_result.csv
```
- Arguments:
  - `--input_file`: Path to the DeepSeMS result file.
  - `--output_dir`: (Optional) Directory to save the output file. If not specified, the output file will be saved in the same directory as the input file.
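For reference, once the SAScorer files are in ./sascorer they are used roughly as follows; this is a sketch of the standard RDKit Contrib usage, not the exact internals of calculate_molecular_properties.py:

```python
import sys
from rdkit import Chem
from rdkit.Chem import Descriptors

# Make the downloaded SAScorer module importable (expects sascorer.py and
# fpscores.pkl.gz in ./sascorer, as described above).
sys.path.append("./sascorer")
import sascorer

smiles = "CC(C)C1C=CC(=O)NCCC=CC(NC(=O)C(NC(=O)O)C(C)C)C(=O)N1"  # Rank 1 example above
mol = Chem.MolFromSmiles(smiles)
if mol is not None:
    print("Molecular weight:", round(Descriptors.MolWt(mol), 2))
    print("Synthetic accessibility score:", round(sascorer.calculateScore(mol), 2))
```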
predict.py currently supports processing a single BGC per run.
To efficiently run predictions on a large number of BGC files, you can use GNU parallel or a simple Bash loop.
```bash
# Ubuntu / Debian
sudo apt-get install parallel

# macOS
brew install parallel

# Conda (recommended, no sudo required)
conda install -c conda-forge parallel
```
Assume your input BGC files are stored in ./inputs/. Run antiSMASH .gbk files in parallel:
```bash
parallel -j N 'python predict.py --input {} --type antismash' ::: ./inputs/*.gbk
```
Run DeepBGC .fa files in parallel:
```bash
parallel -j N 'python predict.py --input {} --type deepbgc' ::: ./inputs/*.fa
```
- Notes:
  - `-j N`: number of parallel jobs.
  - Outputs will be written to the default output directory (or specify one via `--output`).
Alternatively, use a simple Bash loop:
```bash
for f in ./inputs/*.gbk; do
    python predict.py --input "$f" --type antismash
done
```
If you prefer to run DeepSeMS locally without Docker, we recommend using Conda.
Create a Conda environment from environment.yml:
```bash
conda env create -f environment.yml
conda activate deepsems
```
Or configure a Conda environment step by step:
```bash
conda create -n deepsems python=3.10
conda activate deepsems
conda install -c bioconda hmmer=3.3.2
pip install torch==2.1.0 torchtext==0.16.0
pip install biopython==1.79 pandas==2.0.3 rdkit==2023.03.1 numpy==1.26.0
```
👉 Please refer to the detailed model construction documentation: Model Construction.
The annotated versions have been tested; later versions should generally work.
- Language: Python 3.10
- Deep Learning: PyTorch 2.1.0, TorchText 0.16.0
- Bioinformatics: HMMER3 (v3.3.2), Biopython (v1.79), Pfam Database (v36.0)
- Chemistry: RDKit (v2023.03.1)
- Data Handling: Pandas (v2.0.3), NumPy (v1.26.0)
- CUDA 12.0 (tested)
- GPU VRAM: 24 GB (NVIDIA GeForce RTX 4090 tested)
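To check whether PyTorch can see a GPU matching these specifications, a quick sketch:

```python
import torch

# Report the PyTorch/CUDA setup available for prediction.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name} ({vram_gb:.0f} GB VRAM)")
```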