📄 Paper 📦 Models 🚀 Quick Start
PanSpace is a library for creating and querying vector-based indexes of bacterial genome assemblies. It enables fast similarity search across massive bacterial databases by learning compact embedding representations of genomes.
- FCGR Generation: Genomes are represented as Frequency matrices of Chaos Game Representations (FCGR)
- Embedding: FCGRs are mapped to n-dimensional vectors using a CNN encoder (CNNFCGR)
- Indexing & Search: Embeddings are indexed with FAISS for efficient similarity queries
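A toy sketch of these three stages with random data and a stand-in for the encoder; the real pipeline uses FCGRs computed from assemblies and the trained CNNFCGR model, so the shapes and the IndexFlatL2 choice here are illustrative assumptions only.

import faiss
import numpy as np

# Stage 1: FCGR matrices (random placeholders for 2^k x 2^k frequency matrices).
k, emb_dim, n_genomes = 8, 256, 1000
fcgrs = np.random.rand(n_genomes, 2**k, 2**k).astype("float32")

# Stage 2: embeddings (a trivial stand-in for the CNN encoder).
embeddings = np.ascontiguousarray(fcgrs.reshape(n_genomes, -1)[:, :emb_dim])

# Stage 3: exact FAISS index over the embeddings, then a similarity query.
index = faiss.IndexFlatL2(emb_dim)
index.add(embeddings)
distances, neighbors = index.search(embeddings[:1], 5)  # 5 nearest genomes to the first one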
- 🚀 Fast Queries: Millisecond-scale searches across millions of genomes
- 📊 FCGR-Based: Uses Chaos Game Representation for genome encoding
- 🧠 Deep Learning: CNN-based encoders for learning compact representations
- 🔍 FAISS Integration: Efficient similarity search at scale
- 📦 Pre-trained Models: Ready-to-use encoders and indexes available
- ⚙️ Flexible Training: Supports metric learning (with labels) or autoencoders (unsupervised)
- 🔄 Snakemake Pipelines: Automated workflows for batch processing
- Python 3.9 or 3.10 (TensorFlow compatibility)
- Conda or Mamba (recommended)
pip install panspace[cpu]
pip install panspace[gpu]

Or install the latest version from GitHub:

pip install "panspace[cpu] @ git+https://github.com/pg-space/panspace.git"
pip install "panspace[gpu] @ git+https://github.com/pg-space/panspace.git"

Alternatively, clone the repository and create a conda environment:

git clone https://github.com/pg-space/panspace.git
cd panspace

conda env create -f envs/cpu.yml
conda activate panspace-cpu

conda env create -f envs/gpu.yml
conda activate panspace-gpu

To launch the interactive Streamlit app:

panspace app

- Download the pre-trained encoder and index from Zenodo
- Extract the files:
.
├── checkpoints/
│   └── weights-CNNFCGR_Levels.keras
└── index/
    ├── panspace.index
    ├── labels.json
    └── *.json

- Run queries:
panspace query-smk \
--dir-sequences "path/to/assemblies/" \
--path-encoder "checkpoints/weights-CNNFCGR_Levels.keras" \
--path-index "index/panspace.index"
Download from Zenodo
| K-mer | Embedding Size | Model File | Status |
|---|---|---|---|
| 8 | 256 | triplet_semihard_loss-ranger-0.5-hq-256-CNNFCGR_Levels-level1-clip80.zip | ⭐ Recommended |
| 6 | 128 | Available in Zenodo | ✓ |
| 7 | 256 | Available in Zenodo | ✓ |
| 8 | 512 | Available in Zenodo | ✓ |
Each .zip contains:
- Encoder: checkpoints/<model-name>.keras
- Index: index/panspace.index
- Metadata: label mappings and configurations
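A quick way to sanity-check the downloaded artifacts from Python (a sketch; the label-file schema varies by release, and loading the encoder may require custom objects from panspace):

import json
import faiss
import tensorflow as tf

encoder = tf.keras.models.load_model("checkpoints/weights-CNNFCGR_Levels.keras", compile=False)
index = faiss.read_index("index/panspace.index")
with open("index/labels.json") as fh:
    labels = json.load(fh)

print(encoder.input_shape, encoder.output_shape)  # FCGR resolution and embedding size
print(index.d, index.ntotal)                      # embedding dimension and number of indexed genomes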
- Docker and Docker Compose installed
sudo docker-compose up --build

The app will be available at http://localhost:8501
To run in the background (detached mode):
sudo docker-compose up --build -d

To stop the app:

sudo docker-compose down

Place your files in the local folders:

- ./indexes/: your FAISS indexes and metadata
- ./sequences/: your FASTA files for querying
Perfect for querying existing databases without training.
# Download from Zenodo
wget https://zenodo.org/records/17402877/files/triplet_semihard_loss-ranger-0.5-hq-256-CNNFCGR_Levels-level1-clip80.zip
unzip triplet_semihard_loss-ranger-0.5-hq-256-CNNFCGR_Levels-level1-clip80.zip

Organize your FASTA files in a directory:
assemblies/
├── sample1.fa.gz
├── sample2.fa
└── sample3.fna
Using the Snakemake wrapper (recommended):
panspace query-smk \
--dir-sequences "assemblies/" \
--path-encoder "checkpoints/weights-CNNFCGR_Levels.keras" \
--path-index "index/panspace.index" \
--outdir "results/" \
--cores 8 \
--kmc-threads 2

With fast FCGR generation (requires FCGR extension):
panspace query-smk \
--dir-sequences "assemblies/" \
--path-encoder "checkpoints/weights-CNNFCGR_Levels.keras" \
--path-index "index/panspace.index" \
--fast-version \
--fcgr-bin <path/bin/fcgr> \
--outdir "results/" \
[--gpu]

Using Snakemake directly:
- Configure scripts/config_query.yml:

dir_sequences: "assemblies/"
outdir: "results/"
device: "cpu"  # or "gpu"
path_encoder: "checkpoints/weights-CNNFCGR_Levels.keras"
path_index: "index/panspace.index"
kmer: 8

- Run pipeline:
snakemake -s scripts/query.smk --cores 8 --use-conda
Create custom encoders and indexes for your specific dataset.
Option A: From FASTA files (single file)
panspace fcgr from-fasta \
--path-fasta assembly.fa \
--kmer 8 \
--path-save fcgr.npy

Option B: From k-mer counts
# First count k-mers with KMC3
kmc -k8 -fm assembly.fa assembly.kmc tmp/
# Then create FCGR
panspace fcgr from-kmer-counts \
--kmer 8 \
--path-kmer-counts assembly.kmc \
--path-save fcgr.npy
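For intuition, the sketch below builds an FCGR from a dictionary of k-mer counts using one common discrete CGR convention (corners A, C, G, T, each base refining the quadrant). panspace relies on ComplexCGR, whose orientation and k-mer ordering may differ, so treat this as illustrative rather than the library's exact layout.

import numpy as np

# (x_bit, y_bit) per nucleotide; one common corner layout, assumed here.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr_from_counts(kmer_counts: dict, k: int) -> np.ndarray:
    """Place each k-mer's count into a 2^k x 2^k frequency matrix."""
    size = 2 ** k
    matrix = np.zeros((size, size), dtype=np.float32)
    for kmer, count in kmer_counts.items():
        x = y = 0
        for base in kmer:
            bx, by = CORNERS[base]
            x = (x << 1) | bx  # each base halves the remaining quadrant
            y = (y << 1) | by
        matrix[y, x] = count
    return matrix

fcgr = fcgr_from_counts({"AACCGGTT": 12, "ACGTACGT": 3}, k=8)
np.save("fcgr.npy", fcgr)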
Option C: Batch processing with Snakemake (recommended for large datasets)

- Configure scripts/config_fcgr.yml
- Run:

snakemake -s scripts/create_fcgr.smk --cores 8 --use-conda
For faster processing with FCGR extension:
snakemake -s scripts/create_fcgr_fast.smk --cores 8 --use-conda \
--config fcgr_bin=/path/to/fcgr

Split your data into train/validation/test sets:
panspace trainer split-dataset \
--data-dir fcgr_data/ \
--output-dir splits/ \
--train-ratio 0.7 \
--val-ratio 0.15 \
--test-ratio 0.15

Output structure:
splits/
├── train/
├── val/
└── test/
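Conceptually the split is a shuffled partition of the FCGR files into those three folders, as in this minimal sketch (paths and ratios mirror the command above; the real command may also stratify by label):

import random
import shutil
from pathlib import Path

files = sorted(Path("fcgr_data/").glob("*.npy"))
random.seed(42)
random.shuffle(files)

n = len(files)
splits = {
    "train": files[: int(0.7 * n)],
    "val": files[int(0.7 * n): int(0.85 * n)],
    "test": files[int(0.85 * n):],
}

for name, subset in splits.items():
    outdir = Path("splits") / name
    outdir.mkdir(parents=True, exist_ok=True)
    for f in subset:
        shutil.copy(f, outdir / f.name)  # copy each FCGR into its split folder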
Choose a training strategy based on your data:
Triplet Loss (best for large datasets):
panspace trainer metric-learning \
--train-dir splits/train/ \
--val-dir splits/val/ \
--kmer 8 \
--embedding-dim 256 \
--batch-size 32 \
--epochs 100 \
--learning-rate 1e-4 \
--output-dir models/triplet/
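As a rough picture of what metric learning on FCGRs looks like, here is a minimal Keras sketch using TripletSemiHardLoss from tensorflow_addons. The architecture, loss implementation, and optimizer are assumptions for illustration and do not reproduce the actual CNNFCGR trainer.

import tensorflow as tf
import tensorflow_addons as tfa  # assumed here for the loss; not a panspace API

k, emb_dim = 8, 256
encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2**k, 2**k, 1)),  # FCGR as a single-channel image
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(emb_dim),
    tf.keras.layers.Lambda(lambda z: tf.math.l2_normalize(z, axis=1)),  # unit-norm embeddings
])
encoder.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                loss=tfa.losses.TripletSemiHardLoss(margin=0.5))
# encoder.fit(train_ds, validation_data=val_ds, epochs=100)
# where each batch yields (fcgr_batch, integer_species_labels)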
Contrastive Loss (one-shot learning):

panspace trainer one-shot \
--train-dir splits/train/ \
--val-dir splits/val/ \
--kmer 8 \
--embedding-dim 256 \
--margin 1.0 \
--epochs 100 \
--output-dir models/contrastive/

Extract the encoder:
panspace trainer extract-backbone-one-shot \
--model-path models/contrastive/model.keras \
--output-path models/contrastive/encoder.keras
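For the one-shot setup, a contrastive objective compares pairs of FCGRs through a shared encoder and penalizes similar pairs that are far apart and dissimilar pairs that are too close. The sketch below uses ContrastiveLoss from tensorflow_addons on pairwise distances; it is an illustrative stand-in, not panspace's implementation.

import tensorflow as tf
import tensorflow_addons as tfa  # assumed here for the loss; not a panspace API

k, emb_dim, margin = 8, 256, 1.0
encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2**k, 2**k, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(emb_dim),
])

# Siamese pair: the model outputs the distance between the two embeddings.
left = tf.keras.layers.Input(shape=(2**k, 2**k, 1))
right = tf.keras.layers.Input(shape=(2**k, 2**k, 1))
distance = tf.keras.layers.Lambda(lambda pair: tf.norm(pair[0] - pair[1], axis=1))(
    [encoder(left), encoder(right)])
siamese = tf.keras.Model([left, right], distance)
siamese.compile(optimizer="adam", loss=tfa.losses.ContrastiveLoss(margin=margin))
# siamese.fit([x_left, x_right], same_label_flags, epochs=100)
# Only the shared encoder is kept afterwards, which is what extract-backbone-one-shot does.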
Autoencoder:

panspace trainer autoencoder \
--train-dir splits/train/ \
--val-dir splits/val/ \
--kmer 8 \
--embedding-dim 256 \
--epochs 100 \
--output-dir models/autoencoder/

Extract the encoder:
panspace trainer split-autoencoder \
--model-path models/autoencoder/autoencoder.keras \
--output-encoder models/autoencoder/encoder.keras \
--output-decoder models/autoencoder/decoder.keras

Build a FAISS index from your trained encoder:
panspace index create \
--data-dir fcgr_data/ \
--encoder-path models/triplet/encoder.keras \
--output-index index/panspace.index \
--output-metadata index/metadata.json

Index types:
- Flat: Exact search, slower but accurate
- IVF1024,Flat: Inverted file index, faster with slight approximation
- HNSW32: Hierarchical graph, very fast
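These strings are standard FAISS index factory descriptors. The sketch below shows how such an index is built directly in FAISS (the embeddings file and output path are placeholders; panspace index create does this for you):

import faiss
import numpy as np

d = 256
embeddings = np.load("embeddings.npy").astype("float32")  # hypothetical matrix of encoder outputs

index = faiss.index_factory(d, "IVF1024,Flat")  # swap in "Flat" or "HNSW32" for the other layouts
if not index.is_trained:   # IVF indexes need a training pass to learn the coarse quantizer
    index.train(embeddings)
index.add(embeddings)
faiss.write_index(index, "index/panspace.index")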
From FCGR files:
panspace index query \
--query-fcgr query.npy \
--encoder-path models/triplet/encoder.keras \
--index-path index/panspace.index \
--metadata-path index/metadata.json \
--n-neighbors 10
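Under the hood a query is encode-then-search. A minimal sketch with assumed paths and a single FCGR file (the metadata schema used to map FAISS ids back to labels depends on how the index was built):

import json
import faiss
import numpy as np
import tensorflow as tf

encoder = tf.keras.models.load_model("models/triplet/encoder.keras", compile=False)
index = faiss.read_index("index/panspace.index")
with open("index/metadata.json") as fh:
    metadata = json.load(fh)

fcgr = np.load("query.npy").astype("float32")
embedding = encoder.predict(fcgr[np.newaxis, ..., np.newaxis]).astype("float32")  # (1, 2^k, 2^k, 1) -> (1, d)
distances, neighbor_ids = index.search(np.ascontiguousarray(embedding), 10)
print(distances[0], neighbor_ids[0])  # map ids back to labels via the metadata file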
From FASTA files, use the Snakemake wrapper shown above.

panspace --help

╭─ Commands ──────────────────────────────────────────────────────────────╮
│ app Run streamlit app for interactive queries │
│ fcgr Create FCGRs from fasta files or k-mer counts │
│ trainer Train encoders using metric learning or autoencoders │
│ index Create and query FAISS indexes │
│ query-smk Run Snakemake query pipeline │
│ data-curation Find outliers and mislabeled samples │
│ stats-assembly Compute assembly statistics (N50, contigs, etc.) │
│ utils Extract info from logs and text files │
│ what-to-do Step-by-step guide for new users │
│ docs Open documentation webpage │
╰─────────────────────────────────────────────────────────────────────────╯
# Create FCGR from FASTA
panspace fcgr from-fasta \
--path-fasta <file.fa> \
--kmer <k> \
--path-save <output.npy>
# Create FCGR from k-mer counts
panspace fcgr from-kmer-counts \
--kmer <k> \
--path-kmer-counts <kmc_output> \
--path-save <output.npy>
# Save FCGR as image
panspace fcgr to-image \
--path-fcgr <input.npy> \
--path-save <output.png>

# Split dataset
panspace trainer split-dataset \
--data-dir <dir> \
--output-dir <output> \
[--train-ratio 0.7] [--val-ratio 0.15]
# Metric learning (Triplet loss)
panspace trainer metric-learning \
--train-dir <train/> \
--val-dir <val/> \
--kmer <k> \
--embedding-dim <dim> \
[--batch-size 32] [--epochs 100]
# One-shot learning (Contrastive loss)
panspace trainer one-shot \
--train-dir <train/> \
--val-dir <val/> \
--kmer <k> \
--embedding-dim <dim> \
[--margin 1.0] [--epochs 100]
# Autoencoder (Unsupervised)
panspace trainer autoencoder \
--train-dir <train/> \
--val-dir <val/> \
--kmer <k> \
--embedding-dim <dim> \
[--epochs 100]
# Extract encoder from trained models
panspace trainer extract-backbone-one-shot \
--model-path <model.keras> \
--output-path <encoder.keras>
panspace trainer split-autoencoder \
--model-path <autoencoder.keras> \
--output-encoder <encoder.keras> \
--output-decoder <decoder.keras>

# Create index
panspace index create \
--data-dir <fcgr_data/> \
--encoder-path <encoder.keras> \
--output-index <panspace.index> \
--output-metadata <metadata.json>
# Query index
panspace index query \
--query-fcgr <query.npy> \
--encoder-path <encoder.keras> \
--index-path <panspace.index> \
--metadata-path <metadata.json> \
[--n-neighbors 10]
# Test index performance
panspace index test \
--test-dir <test/> \
--encoder-path <encoder.keras> \
--index-path <panspace.index> \
--metadata-path <metadata.json>

# Query with Snakemake wrapper
panspace query-smk \
--dir-sequences <assemblies/> \
--path-encoder <encoder.keras> \
--path-index <panspace.index> \
[--outdir results/] \
[--cores 8] \
[--fast-version] # requires FCGR extension
# See all options
panspace query-smk --help

For very large datasets (e.g., AllTheBacteria), use specialized k-mer counters:
With KMC3:
# Count k-mers
kmc -k8 -m64 -t8 -fm assembly.fa output tmp/
# Create FCGR
panspace fcgr from-kmer-counts \
--kmer 8 \
--path-kmer-counts output \
--path-save fcgr.npy

With FCGR Extension (faster):
# Install from https://github.com/pg-space/fcgr
fcgr -k 8 -i assembly.fa -o fcgr.npy

Process AllTheBacteria dataset:
# See scripts/allthebacteria_*.smk
snakemake -s scripts/allthebacteria_fcgr.smk \
--config input_dir=/path/to/allthebacteria \
--cores 32 \
--use-conda

Find outliers and potential mislabeling:
panspace data-curation \
--embeddings-path embeddings.npy \
--labels-path labels.json \
--output-dir curation_results/
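One simple, generic way to flag curation candidates from embeddings is to look for samples whose nearest-neighbor distances are unusually large. The sketch below illustrates that idea only; it is not necessarily the criterion panspace applies.

import faiss
import numpy as np

embeddings = np.load("embeddings.npy").astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

distances, _ = index.search(embeddings, 6)    # self-match plus 5 nearest neighbours
mean_nn_dist = distances[:, 1:].mean(axis=1)  # drop the self-match in column 0
threshold = mean_nn_dist.mean() + 3 * mean_nn_dist.std()
outliers = np.where(mean_nn_dist > threshold)[0]  # candidate outliers to inspect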
Compute N50, contig counts, and more:

panspace stats-assembly \
--fasta-path assembly.fa \
--output stats.json
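For reference, N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size:

def n50(contig_lengths):
    """Length of the contig where the sorted cumulative length reaches half the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    total = 0
    for length in lengths:
        total += length
        if total >= half:
            return length

print(n50([400_000, 300_000, 200_000, 100_000]))  # 300000 for this toy assembly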
panspace/
├── panspace/               # Core Python package
│   ├── cli/                # Command-line interface
│   ├── models/             # TensorFlow models (CNNFCGR)
│   ├── trainers/           # Training logic
│   ├── indexing/           # FAISS index management
│   └── streamlit_app/      # Interactive visualization
├── scripts/                # Snakemake workflows
│   ├── query.smk           # Query pipeline
│   ├── query_fast.smk      # Fast query with FCGR extension
│   ├── create_fcgr.smk     # FCGR generation
│   └── config_*.yml        # Configuration files
├── envs/                   # Conda environments
│   ├── cpu.yml             # CPU version
│   └── gpu.yml             # GPU version
├── tests/                  # Unit tests
└── docs/                   # Documentation
- Use GPU: 10-100x faster for encoding

  conda activate panspace-gpu

- Use FCGR Extension: 5-10x faster FCGR generation

  panspace query-smk --fast-version

- Parallel Processing: increase cores for Snakemake

  snakemake -s scripts/query.smk --cores 32

- Batch Queries: process multiple files at once with Snakemake
- Use appropriate index types for large databases
- Process large datasets in batches
- Configure KMC3 memory limits in config_*.yml
If you use PanSpace in your research, please cite:
@article{cartes2025panspace,
title={PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases},
author={Cartes, Jorge Avila and Ciccolella, Simone and Denti, Luca and Dandinasivara, Raghuram and Vedova, Gianluca Della and Bonizzoni, Paola and Sch{\"o}nhuth, Alexander},
journal={bioRxiv},
pages={2025--03},
year={2025},
publisher={Cold Spring Harbor Laboratory}
}

TensorFlow installation problems:
# Ensure correct Python version (3.9-3.10)
python --version
# Reinstall with conda
conda install -c conda-forge tensorflow

FCGR extension not found:
# Install from source
git clone https://github.com/pg-space/fcgr
cd fcgr && make install

Snakemake fails:
# Clear cache and retry
snakemake --unlock
rm -rf .snakemake/
snakemake -s scripts/query.smk --use-conda --cores 8

- AllTheBacteria: Comprehensive bacterial genome database
- KMC3: Fast k-mer counting
- FCGR Extension: Optimized FCGR generation
- ComplexCGR: Python FCGR library
- FAISS: Efficient similarity search
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- 📧 Email: [email protected]
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
This project is licensed under the GPL-3.0 License - see the LICENSE file for details.
- Built with TensorFlow and FAISS
- FCGR generation powered by KMC3 and custom extensions
- Inspired by deep metric learning approaches
PanSpace is developed and maintained by Jorge Avila Cartes
⭐ Star us on GitHub if PanSpace helps your research!


