
PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases

📄 Paper 📦 Models 🚀 Quick Start


Overview

PanSpace is a library for creating and querying vector-based indexes of bacterial genome assemblies. It enables fast similarity search across massive bacterial databases by learning compact embedding representations of genomes.

How It Works

  1. FCGR Generation: Genomes are represented as Frequency matrices of Chaos Game Representations (FCGR)
  2. Embedding: FCGRs are mapped to n-dimensional vectors using a CNN encoder (CNNFCGR)
  3. Indexing & Search: Embeddings are indexed with FAISS for efficient similarity queries
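
In code, the whole pipeline boils down to these three stages. The sketch below is purely illustrative (it is not the panspace API; file names such as fcgr.npy, encoder.keras and panspace.index are placeholders), but it shows how an FCGR, a Keras encoder and a FAISS index fit together:

# Illustrative sketch of the three stages (not the panspace API itself).
import numpy as np
import faiss
import tensorflow as tf

fcgr = np.load("fcgr.npy").astype("float32")            # 1. FCGR of the query genome
encoder = tf.keras.models.load_model("encoder.keras")   # 2. CNN encoder -> embedding
embedding = encoder.predict(fcgr[np.newaxis, ..., np.newaxis])

index = faiss.read_index("panspace.index")              # 3. FAISS similarity search
distances, neighbor_ids = index.search(embedding, 10)
print(neighbor_ids)  # row indices of the 10 closest genomes in the database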

Key Features

  • 🚀 Fast Queries: Millisecond-scale searches across millions of genomes
  • 📊 FCGR-Based: Uses Chaos Game Representation for genome encoding
  • 🧠 Deep Learning: CNN-based encoders for learning compact representations
  • 🔍 FAISS Integration: Efficient similarity search at scale
  • 📦 Pre-trained Models: Ready-to-use encoders and indexes available
  • ⚙️ Flexible Training: Supports metric learning (with labels) or autoencoders (unsupervised)
  • 🔄 Snakemake Pipelines: Automated workflows for batch processing

Installation

Requirements

  • Python 3.9 or 3.10 (TensorFlow compatibility)
  • Conda or Mamba (recommended)

Quick Install from PyPI

CPU Version

pip install panspace[cpu]

GPU Version

pip install panspace[gpu]

Install from GitHub repository

CPU

pip install "panspace[cpu] @ git+https://github.com/pg-space/panspace.git"

GPU

pip install "panspace[gpu] @ git+https://github.com/pg-space/panspace.git"

Install from source

Clone the repository

git clone https://github.com/pg-space/panspace.git
cd panspace

CPU Version

conda env create -f envs/cpu.yml
conda activate panspace-cpu

GPU Version

conda env create -f envs/gpu.yml
conda activate panspace-gpu

Quick Start

Try the Interactive App

panspace app
PanSpace app demo

Query with Pre-trained Models

  1. Download pre-trained encoder and index from Zenodo
  2. Extract the files:
    .
    ├── checkpoints/
    │   └── weights-CNNFCGR_Levels.keras
    └── index/
        ├── panspace.index
        ├── labels.json
        └── *.json
    
  3. Run queries:
    panspace query-smk \
        --dir-sequences "path/to/assemblies/" \
        --path-encoder "checkpoints/weights-CNNFCGR_Levels.keras" \
        --path-index "index/panspace.index"

Available Pre-trained Models

Download from Zenodo

  • K-mer 8, embedding size 256: triplet_semihard_loss-ranger-0.5-hq-256-CNNFCGR_Levels-level1-clip80.zip (Recommended)
  • K-mer 6, embedding size 128: available in Zenodo
  • K-mer 7, embedding size 256: available in Zenodo
  • K-mer 8, embedding size 512: available in Zenodo

Each .zip contains:

  • Encoder: checkpoints/<model-name>.keras
  • Index: index/panspace.index
  • Metadata: Label mappings and configurations

Running the App with Docker

Prerequisites

Docker and Docker Compose installed on your machine (the app is started with docker-compose).

Run the App

sudo docker-compose up --build

The app will be available at http://localhost:8501

To run in the background (detached mode):

sudo docker-compose up --build -d

Stop the App

sudo docker-compose down

Using Your Own Data

Place your files in the local folders:

  • ./indexes/ — your FAISS indexes and metadata
  • ./sequences/ — your FASTA files for querying

Complete Workflow

Option 1: Using Pre-trained Models (Recommended)

Perfect for querying existing databases without training.

Step 1: Download Models

# Download from Zenodo
wget https://zenodo.org/records/17402877/files/triplet_semihard_loss-ranger-0.5-hq-256-CNNFCGR_Levels-level1-clip80.zip
unzip triplet_semihard_loss-ranger-0.5-hq-256-CNNFCGR_Levels-level1-clip80.zip

Step 2: Prepare Query Sequences

Organize your FASTA files in a directory:

assemblies/
├── sample1.fa.gz
├── sample2.fa
└── sample3.fna

Step 3: Query the Index

Using the Snakemake wrapper (recommended):

panspace query-smk \
    --dir-sequences "assemblies/" \
    --path-encoder "checkpoints/weights-CNNFCGR_Levels.keras" \
    --path-index "index/panspace.index" \
    --outdir "results/" \
    --cores 8 \
    --kmc-threads 2

With fast FCGR generation (requires FCGR extension):

panspace query-smk \
    --dir-sequences "assemblies/" \
    --path-encoder "checkpoints/weights-CNNFCGR_Levels.keras" \
    --path-index "index/panspace.index" \
    --fast-version \
    --fcgr-bin <path/bin/fcgr> \
    --outdir "results/" \
    [--gpu]

Using Snakemake directly:

  1. Configure scripts/config_query.yml:

    dir_sequences: "assemblies/"
    outdir: "results/"
    device: "cpu"  # or "gpu"
    path_encoder: "checkpoints/weights-CNNFCGR_Levels.keras"
    path_index: "index/panspace.index"
    kmer: 8
  2. Run pipeline:

    snakemake -s scripts/query.smk --cores 8 --use-conda

Option 2: Training Your Own Models

Create custom encoders and indexes for your specific dataset.

Step 1: Generate FCGRs

Option A: From FASTA files (single file)

panspace fcgr from-fasta \
    --path-fasta assembly.fa \
    --kmer 8 \
    --path-save fcgr.npy

Option B: From k-mer counts

# First count k-mers with KMC3
kmc -k8 -fm assembly.fa assembly.kmc tmp/

# Then create FCGR
panspace fcgr from-kmer-counts \
    --kmer 8 \
    --path-kmer-counts assembly.kmc \
    --path-save fcgr.npy
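
Conceptually, the FCGR places the count of each k-mer into one cell of a 2^k x 2^k matrix using Chaos Game coordinates. The sketch below illustrates that mapping with one common corner convention for A, C, G and T; it is a toy example, and panspace's internal orientation of the matrix may differ:

# Minimal FCGR sketch: map each k-mer count to a cell of a 2^k x 2^k matrix.
# Illustrative only; the corner convention below is an assumption and may not
# match panspace's internal orientation.
import numpy as np

def fcgr_from_counts(counts: dict, k: int) -> np.ndarray:
    size = 2 ** k
    matrix = np.zeros((size, size), dtype=np.float32)
    corner = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
    for kmer, count in counts.items():
        row = col = 0
        for base in kmer:               # each base halves the square toward a corner
            r, c = corner[base]
            row = (row << 1) | r
            col = (col << 1) | c
        matrix[row, col] = count
    return matrix

fcgr = fcgr_from_counts({"ACGTACGT": 12, "TTTTTTTT": 3}, k=8)
print(fcgr.shape)  # (256, 256)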

Option C: Batch processing with Snakemake (recommended for large datasets)

  1. Configure scripts/config_fcgr.yml
  2. Run:
    snakemake -s scripts/create_fcgr.smk --cores 8 --use-conda

For faster processing with FCGR extension:

snakemake -s scripts/create_fcgr_fast.smk --cores 8 --use-conda \
    --config fcgr_bin=/path/to/fcgr

Step 2: Prepare Dataset

Split your data into train/validation/test sets:

panspace trainer split-dataset \
    --data-dir fcgr_data/ \
    --output-dir splits/ \
    --train-ratio 0.7 \
    --val-ratio 0.15 \
    --test-ratio 0.15

Output structure:

splits/
├── train/
├── val/
└── test/
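
If you need to script the split yourself, the same idea is a random partition of the FCGR .npy files into those three folders. A minimal sketch (illustrative only, not the panspace implementation, and without the options of split-dataset):

# Illustrative 70/15/15 random split of FCGR .npy files (not panspace's code).
import random
import shutil
from pathlib import Path

data_dir, out_dir = Path("fcgr_data"), Path("splits")
files = sorted(data_dir.glob("*.npy"))
random.seed(42)
random.shuffle(files)

n = len(files)
cuts = {"train": files[: int(0.7 * n)],
        "val": files[int(0.7 * n): int(0.85 * n)],
        "test": files[int(0.85 * n):]}

for split, split_files in cuts.items():
    (out_dir / split).mkdir(parents=True, exist_ok=True)
    for f in split_files:
        shutil.copy(f, out_dir / split / f.name)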

Step 3: Train Encoder

Choose a training strategy based on your data:

Option A: Metric Learning with Labels (Recommended)

Triplet Loss (best for large datasets):

panspace trainer metric-learning \
    --train-dir splits/train/ \
    --val-dir splits/val/ \
    --kmer 8 \
    --embedding-dim 256 \
    --batch-size 32 \
    --epochs 100 \
    --learning-rate 1e-4 \
    --output-dir models/triplet/
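
For intuition, triplet-based metric learning trains an encoder whose L2-normalized embeddings pull genomes with the same label together and push different labels apart. The toy Keras sketch below illustrates the setup with TensorFlow Addons' TripletSemiHardLoss; the architecture is a stand-in, not the actual CNNFCGR model:

# Toy metric-learning sketch (not the CNNFCGR architecture): a small CNN maps
# FCGRs to L2-normalized embeddings trained with semi-hard triplet loss.
import tensorflow as tf
import tensorflow_addons as tfa  # assumption: TF Addons is available for the loss

kmer, embedding_dim = 8, 256
side = 2 ** kmer

inputs = tf.keras.Input(shape=(side, side, 1))
x = tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu")(inputs)
x = tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dense(embedding_dim)(x)
outputs = tf.keras.layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(x)

encoder = tf.keras.Model(inputs, outputs)
encoder.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss=tfa.losses.TripletSemiHardLoss(),  # labels are integer class ids
)
# encoder.fit(train_dataset, validation_data=val_dataset, epochs=100)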

Contrastive Loss (one-shot learning):

panspace trainer one-shot \
    --train-dir splits/train/ \
    --val-dir splits/val/ \
    --kmer 8 \
    --embedding-dim 256 \
    --margin 1.0 \
    --epochs 100 \
    --output-dir models/contrastive/

Extract the encoder:

panspace trainer extract-backbone-one-shot \
    --model-path models/contrastive/model.keras \
    --output-path models/contrastive/encoder.keras

Option B: Unsupervised Learning (No Labels)

Autoencoder:

panspace trainer autoencoder \
    --train-dir splits/train/ \
    --val-dir splits/val/ \
    --kmer 8 \
    --embedding-dim 256 \
    --epochs 100 \
    --output-dir models/autoencoder/

Extract the encoder:

panspace trainer split-autoencoder \
    --model-path models/autoencoder/autoencoder.keras \
    --output-encoder models/autoencoder/encoder.keras \
    --output-decoder models/autoencoder/decoder.keras
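
For reference, the autoencoder route learns embeddings without labels by reconstructing each FCGR through a low-dimensional bottleneck; the encoder half is what split-autoencoder keeps for indexing. A toy sketch of the idea (not the architecture shipped with panspace):

# Toy convolutional autoencoder sketch (illustrative, not panspace's model):
# the bottleneck Dense layer is the embedding; the encoder half is extracted
# and used for indexing.
import tensorflow as tf

kmer, embedding_dim = 8, 256
side = 2 ** kmer

inputs = tf.keras.Input(shape=(side, side, 1))
x = tf.keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = tf.keras.layers.Flatten()(x)
bottleneck = tf.keras.layers.Dense(embedding_dim, name="embedding")(x)

y = tf.keras.layers.Dense((side // 8) * (side // 8) * 64, activation="relu")(bottleneck)
y = tf.keras.layers.Reshape((side // 8, side // 8, 64))(y)
y = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(y)
y = tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(y)
decoded = tf.keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same")(y)

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# After training, keep only the encoder part for indexing:
encoder = tf.keras.Model(inputs, bottleneck)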

Step 4: Create Index

Build a FAISS index from your trained encoder:

panspace index create \
    --data-dir fcgr_data/ \
    --encoder-path models/triplet/encoder.keras \
    --output-index index/panspace.index \
    --output-metadata index/metadata.json

Index types:

  • Flat: Exact search, slower but accurate
  • IVF1024,Flat: Inverted file index, faster with slight approximation
  • HNSW32: Hierarchical graph, very fast
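
These index types correspond to standard FAISS factory strings. The sketch below shows how they would be built from a matrix of embeddings (the embeddings array here is a stand-in for encoder output; note that IVF indexes need a training pass before vectors are added):

# Illustrative FAISS sketch: building the index types listed above from a
# matrix of embeddings of shape (n, d), float32.
import faiss
import numpy as np

d = 256
embeddings = np.random.rand(10_000, d).astype("float32")  # stand-in for encoder output

flat = faiss.index_factory(d, "Flat")            # exact search
ivf = faiss.index_factory(d, "IVF1024,Flat")     # inverted lists, approximate
hnsw = faiss.index_factory(d, "HNSW32")          # graph-based, very fast

ivf.train(embeddings)        # IVF needs a clustering pass; Flat and HNSW do not
for index in (flat, ivf, hnsw):
    index.add(embeddings)

faiss.write_index(flat, "panspace.index")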

Step 5: Query Your Index

From FCGR files:

panspace index query \
    --query-fcgr query.npy \
    --encoder-path models/triplet/encoder.keras \
    --index-path index/panspace.index \
    --metadata-path index/metadata.json \
    --n-neighbors 10

From FASTA files: use the Snakemake wrapper shown above.
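
For a sense of what a query involves in Python, here is a hedged sketch: it assumes the metadata JSON holds an ordered list of labels under a hypothetical "labels" key aligned with the index ids, which may not match the real metadata layout:

# Hedged sketch of an index query in Python (assumed file layout and metadata
# format; the real `panspace index query` implementation may differ).
import json
import numpy as np
import faiss
import tensorflow as tf

encoder = tf.keras.models.load_model("models/triplet/encoder.keras")
index = faiss.read_index("index/panspace.index")
with open("index/metadata.json") as fh:
    metadata = json.load(fh)   # assumption: contains labels ordered like the index

fcgr = np.load("query.npy").astype("float32")
embedding = encoder.predict(fcgr[np.newaxis, ..., np.newaxis])

distances, ids = index.search(embedding, 10)
for rank, (i, dist) in enumerate(zip(ids[0], distances[0]), start=1):
    print(rank, metadata["labels"][i], dist)   # hypothetical "labels" key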


CLI Reference

Main Commands

panspace --help
╭─ Commands ──────────────────────────────────────────────────╮
│ app              Run streamlit app for interactive queries              │
│ fcgr             Create FCGRs from fasta files or k-mer counts          │
│ trainer          Train encoders using metric learning or autoencoders   │
│ index            Create and query FAISS indexes                         │
│ query-smk        Run Snakemake query pipeline                           │
│ data-curation    Find outliers and mislabeled samples                   │
│ stats-assembly   Compute assembly statistics (N50, contigs, etc.)       │
│ utils            Extract info from logs and text files                  │
│ what-to-do       Step-by-step guide for new users                       │
│ docs             Open documentation webpage                             │
╰──────────────────────────────────────────────────────────────────────────╯

FCGR Commands

# Create FCGR from FASTA
panspace fcgr from-fasta \
    --path-fasta <file.fa> \
    --kmer <k> \
    --path-save <output.npy>

# Create FCGR from k-mer counts
panspace fcgr from-kmer-counts \
    --kmer <k> \
    --path-kmer-counts <kmc_output> \
    --path-save <output.npy>

# Save FCGR as image
panspace fcgr to-image \
    --path-fcgr <input.npy> \
    --path-save <output.png>

Training Commands

# Split dataset
panspace trainer split-dataset \
    --data-dir <dir> \
    --output-dir <output> \
    [--train-ratio 0.7] [--val-ratio 0.15]

# Metric learning (Triplet loss)
panspace trainer metric-learning \
    --train-dir <train/> \
    --val-dir <val/> \
    --kmer <k> \
    --embedding-dim <dim> \
    [--batch-size 32] [--epochs 100]

# One-shot learning (Contrastive loss)
panspace trainer one-shot \
    --train-dir <train/> \
    --val-dir <val/> \
    --kmer <k> \
    --embedding-dim <dim> \
    [--margin 1.0] [--epochs 100]

# Autoencoder (Unsupervised)
panspace trainer autoencoder \
    --train-dir <train/> \
    --val-dir <val/> \
    --kmer <k> \
    --embedding-dim <dim> \
    [--epochs 100]

# Extract encoder from trained models
panspace trainer extract-backbone-one-shot \
    --model-path <model.keras> \
    --output-path <encoder.keras>

panspace trainer split-autoencoder \
    --model-path <autoencoder.keras> \
    --output-encoder <encoder.keras> \
    --output-decoder <decoder.keras>

Index Commands

# Create index
panspace index create \
    --data-dir <fcgr_data/> \
    --encoder-path <encoder.keras> \
    --output-index <panspace.index> \
    --output-metadata <metadata.json>

# Query index
panspace index query \
    --query-fcgr <query.npy> \
    --encoder-path <encoder.keras> \
    --index-path <panspace.index> \
    --metadata-path <metadata.json> \
    [--n-neighbors 10]

# Test index performance
panspace index test \
    --test-dir <test/> \
    --encoder-path <encoder.keras> \
    --index-path <panspace.index> \
    --metadata-path <metadata.json>

Query Pipeline

# Query with Snakemake wrapper
panspace query-smk \
    --dir-sequences <assemblies/> \
    --path-encoder <encoder.keras> \
    --path-index <panspace.index> \
    [--outdir results/] \
    [--cores 8] \
    [--fast-version]  # requires FCGR extension

# See all options
panspace query-smk --help

Advanced Usage

Custom FCGR Generation

For very large datasets (e.g., AllTheBacteria), use specialized k-mer counters:

With KMC3:

# Count k-mers
kmc -k8 -m64 -t8 -fm assembly.fa output tmp/

# Create FCGR
panspace fcgr from-kmer-counts \
    --kmer 8 \
    --path-kmer-counts output \
    --path-save fcgr.npy

With FCGR Extension (faster):

# Install from https://github.com/pg-space/fcgr
fcgr -k 8 -i assembly.fa -o fcgr.npy

Batch Processing Examples

Process AllTheBacteria dataset:

# See scripts/allthebacteria_*.smk
snakemake -s scripts/allthebacteria_fcgr.smk \
    --config input_dir=/path/to/allthebacteria \
    --cores 32 \
    --use-conda

Data Curation

Find outliers and potential mislabeling:

panspace data-curation \
    --embeddings-path embeddings.npy \
    --labels-path labels.json \
    --output-dir curation_results/
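
One simple way to flag curation candidates (not necessarily the method panspace implements) is a nearest-neighbor label-consistency check on the embeddings: samples whose closest neighbors mostly carry a different label are suspicious. A hedged sketch, assuming labels.json is a plain list aligned with the embeddings:

# Hedged sketch of a nearest-neighbor label-consistency check (illustrative;
# not necessarily the method implemented by `panspace data-curation`).
import json
import numpy as np
import faiss

embeddings = np.load("embeddings.npy").astype("float32")   # (n, d)
with open("labels.json") as fh:
    labels = np.array(json.load(fh))                       # assumption: list of n labels

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
_, neighbors = index.search(embeddings, 6)                 # self + 5 nearest neighbors

for i, nn in enumerate(neighbors):
    neighbor_labels = labels[nn[1:]]                       # drop the self-match
    agreement = np.mean(neighbor_labels == labels[i])
    if agreement < 0.5:                                    # mostly disagreeing neighbors
        print(f"sample {i}: label={labels[i]}, neighbor agreement={agreement:.2f}")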

Assembly Statistics

Compute N50, contig counts, and more:

panspace stats-assembly \
    --fasta-path assembly.fa \
    --output stats.json
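
As a reminder, N50 is the contig length at which the cumulative sum of contig lengths, sorted from longest to shortest, first reaches half of the total assembly size. A minimal standalone sketch (stats-assembly computes this and more):

# Minimal N50 sketch (illustrative; not the panspace implementation).
import gzip

def contig_lengths(path: str):
    opener = gzip.open if path.endswith(".gz") else open
    lengths, current = [], 0
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

def n50(lengths):
    lengths = sorted(lengths, reverse=True)
    half, running = sum(lengths) / 2, 0
    for length in lengths:
        running += length
        if running >= half:
            return length

lengths = contig_lengths("assembly.fa")
print({"n_contigs": len(lengths), "total_bp": sum(lengths), "N50": n50(lengths)})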

Project Structure

panspace/
├── panspace/              # Core Python package
│   ├── cli/              # Command-line interface
│   ├── models/           # TensorFlow models (CNNFCGR)
│   ├── trainers/         # Training logic
│   ├── indexing/         # FAISS index management
│   └── streamlit_app/    # Interactive visualization
├── scripts/              # Snakemake workflows
│   ├── query.smk         # Query pipeline
│   ├── query_fast.smk    # Fast query with FCGR extension
│   ├── create_fcgr.smk   # FCGR generation
│   └── config_*.yml      # Configuration files
├── envs/                 # Conda environments
│   ├── cpu.yml          # CPU version
│   └── gpu.yml          # GPU version
├── tests/               # Unit tests
└── docs/                # Documentation

Performance Tips

Speed Optimization

  1. Use GPU: 10-100x faster for encoding

    conda activate panspace-gpu
  2. Use FCGR Extension: 5-10x faster FCGR generation

    panspace query-smk --fast-version
  3. Parallel Processing: Increase cores for Snakemake

    snakemake -s scripts/query.smk --cores 32
  4. Batch Queries: Process multiple files at once with Snakemake

Memory Optimization

  • Use appropriate index types for large databases
  • Process large datasets in batches
  • Configure KMC3 memory limits in config_*.yml
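
As an illustration of the second point, embeddings can be streamed into a FAISS index in chunks so the full matrix never has to sit in memory at once (a generic sketch with placeholder file names, not a panspace command):

# Sketch: add embeddings to a FAISS index in chunks to bound peak memory.
import faiss
import numpy as np

d = 256
index = faiss.IndexFlatL2(d)
embeddings = np.load("embeddings.npy", mmap_mode="r")   # memory-mapped, not fully loaded
for start in range(0, embeddings.shape[0], 10_000):
    chunk = np.ascontiguousarray(embeddings[start:start + 10_000], dtype="float32")
    index.add(chunk)
faiss.write_index(index, "panspace.index")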

Citation

If you use PanSpace in your research, please cite:

@article{cartes2025panspace,
  title={PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases},
  author={Cartes, Jorge Avila and Ciccolella, Simone and Denti, Luca and Dandinasivara, Raghuram and Vedova, Gianluca Della and Bonizzoni, Paola and Sch{\"o}nhuth, Alexander},
  journal={bioRxiv},
  pages={2025--03},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}

Troubleshooting

Common Issues

TensorFlow installation problems:

# Ensure correct Python version (3.9-3.10)
python --version

# Reinstall with conda
conda install -c conda-forge tensorflow

FCGR extension not found:

# Install from source
git clone https://github.com/pg-space/fcgr
cd fcgr && make install

Snakemake fails:

# Clear cache and retry
snakemake --unlock
rm -rf .snakemake/
snakemake -s scripts/query.smk --use-conda --cores 8


Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request


License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.


Acknowledgments

  • Built with TensorFlow and FAISS
  • FCGR generation powered by KMC3 and custom extensions
  • Inspired by deep metric learning approaches

Author

PanSpace is developed and maintained by Jorge Avila Cartes

⭐ Star us on GitHub if PanSpace helps your research!
