Campylobacter CRISPRscape

Automated bioinformatics workflow for the identification and characterisation of Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) arrays from bacterial genomes, with a focus on Campylobacter coli and C. jejuni.

For more detailed documentation, see the project website. There you will also find the user manual, which includes a quick-start guide and a detailed step-by-step description.

Index

Release roadmap
Workflow description
- Preparing input and databases
- Suggestions of programs/analyses to test
Output files
Project organisation
Licence
Citation

Release roadmap

Version 0.2: tidy code
- remove outdated steps
- move long commands in Snakefile to separate script
- (re)apply linting (Black, Styler)
- apply 'bash strict mode' and suppress R messages
- move parts of Snakefile to separate scripts?
Version 0.3: solid foundation
- validate proper functioning of CCTyper + CRISPRidentify
  - adjust helper scripts where necessary
- correct scripts for making tables, integrate with Snakemake and test!
  - CRISPR spacer table (#24)
  - CRISPR-Cas locus table
  - (make sure every analysis part produces an output: include in 'rule all')
Version 0.4: clear documentation
- review and update README and docs
Future additions:
- genome deduplication (dRep)
- CRISPR spacer target prediction
  - map to
    - masked ATB genomes
    - PLSDB
    - PhageScope
    - VIRE
    - MEGAISurv metagenomes
  - mini-benchmark different mapping algorithms
    - Sassy
    - KMA
    - SpacePHARER
  - (where feasible) connect spacer hits with functional annotations!
- Integrate downstream analyses with Snakemake?
  - run RMarkdown/Quarto notebooks automatically
- build a database like this spacerdb?

Workflow description

The analysis is recorded as a Snakemake workflow. Its dependencies (bioinformatics tools) are handled by Snakemake using the conda package manager, or rather its faster drop-in replacement, mamba. If you have not yet done so, please install mamba by following the instructions at: https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html.

After installing mamba, install Snakemake by following its installation instructions: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#full-installation (Note: the workflow was tested with Snakemake version 8.20.3 and is expected to work with any version from 5 onwards.)

When Snakemake has been set up, you can test if the workflow is ready to be run (dry-run) with:

snakemake --profile config -n

If that returns no errors, run the workflow by removing the -n (dry-run) option:

snakemake --profile config
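
The dry-run-then-run sequence above can also be scripted. The following is a minimal sketch and not part of the repository: the helper names `build_command` and `run_workflow` are hypothetical, and it assumes that `snakemake` is on the PATH and that the script is started from the repository root.

```python
import subprocess


def build_command(dry_run: bool) -> list[str]:
    """Return the snakemake call from this README, with -n for a dry-run."""
    command = ["snakemake", "--profile", "config"]
    if dry_run:
        command.append("-n")
    return command


def run_workflow() -> None:
    """Run the dry-run first; start the real run only if it succeeds."""
    # check=True raises CalledProcessError when snakemake exits non-zero,
    # so the real run is never started after a failed dry-run.
    subprocess.run(build_command(dry_run=True), check=True)
    subprocess.run(build_command(dry_run=False), check=True)
```

Calling `run_workflow()` then performs both steps in order and stops at the first failure.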

Note that the workflow is currently configured to run on the local machine (not on a high-performance computing (HPC) cluster or grid) and uses a maximum of 60 CPU threads. The number of threads to use can be configured in: config/config.yaml (overall workflow) and config/parameters.yaml (per step/tool).
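
To illustrate how an overall thread limit and per-step settings can interact: Snakemake reduces the threads of a rule to the number of cores it has been given when a rule asks for more. The sketch below mimics that behaviour under the assumption that the 60-thread maximum acts as such a cap. The function name and the step names are hypothetical; the real values live in the two configuration files named above.

```python
def effective_threads(requested: int, max_threads: int = 60) -> int:
    """Threads a step can actually use under the workflow-wide limit."""
    if requested < 1 or max_threads < 1:
        raise ValueError("thread counts must be positive integers")
    return min(requested, max_threads)


# Hypothetical per-step requests, for illustration only.
requested_per_step = {"step_a": 8, "step_b": 120}
capped = {step: effective_threads(n) for step, n in requested_per_step.items()}
print(capped)  # {'step_a': 8, 'step_b': 60}
```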

In its current state, the workflow:

Determines multilocus sequence types (MLST) for Campylobacter jejuni/coli using the public typing scheme from PubMLST and pyMLST (version 2.2.1).
Identifies CRISPR-Cas loci with CCTyper (version 1.8.0) and processes the resulting loci with CRISPRidentify (forked from version 1.2.1) to reduce false positives.
- This includes extra scripts to collect CRISPR-Cas information and extract sequences from the genome FASTA files.
Collects all CRISPR spacers and creates clusters of identical spacers using CD-HIT-EST (version 4.8.1).
Predicts whether contigs of the species of interest derive from chromosomal DNA, plasmids or viruses using both geNomad (version 1.8.0) and Jaeger (version 1.1.26).
Predicts the potential targets of spacers and whether they target chromosomal DNA (of the input genomes), plasmids or viruses using SpacePHARER (version 5.c2e680a) and KMA (version 1.5.0).
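
To make the spacer-clustering step concrete: grouping identical spacers amounts to collecting the sequences that share the same string. The sketch below illustrates that idea in plain Python. It is not how CD-HIT-EST works internally and it is not the workflow's own code. The function names, the spacer identifiers and the choice to treat a spacer and its reverse complement as identical are assumptions made for this example.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def canonical(sequence: str) -> str:
    """Strand-independent key: the lexicographically smaller orientation."""
    sequence = sequence.upper()
    return min(sequence, reverse_complement(sequence))


def cluster_identical(spacers: dict[str, str]) -> dict[str, list[str]]:
    """Map each unique (canonical) spacer sequence to the IDs carrying it."""
    clusters = defaultdict(list)
    for spacer_id, sequence in spacers.items():
        clusters[canonical(sequence)].append(spacer_id)
    return dict(clusters)


# Made-up spacers, for illustration only.
example = {
    "genomeA_1": "ACGTTGCA",
    "genomeB_1": "acgttgca",  # same sequence, lower case
    "genomeC_1": "TGCAACGT",  # reverse complement of the first
    "genomeC_2": "GGGTTTAA",
}
clusters = cluster_identical(example)
```

Here the first three spacers end up in one cluster and the fourth in a cluster of its own.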

Further steps will be added to the workflow after testing.

Preparing input and databases

Before running the workflow, the user needs to prepare input genomes and databases. See the corresponding documentation pages for details.

Suggestions of programs/analyses to test

Mash with CRISPR loci, and whole genomes (compare all-vs-all)
Map CRISPR spacers to all downloaded genomes (Bowtie, KMA and Sassy?), metagenome assemblies and other databases to predict targets
SpacerPlacer (see the input file format at https://github.com/fbaumdicker/SpacerPlacer?tab=readme-ov-file#spacer_fasta-input-format; may also require an extra conversion script?)

Output files

Ticked boxes indicate that documentation is available.

AllTheBacteria metadata
- ENA metadata
  - Cleaned-up and filtered metadata of included genomes
- Species classifications
  - Taxonomic classification by Sylph, as collected from AllTheBacteria
Contig chromosome/plasmid/virus predictions
CRISPR-Cas overview table
- Output from CCTyper, collected and combined in one CSV file
- Combined with CRISPRidentify output: a filtered CRISPR table, created by adding Cas and orientation data to the CRISPRidentify CSV
CRISPR spacer table
MLST
- Sequence Types (ST) of all included genomes
List of spacer-putative targets
- Output from mapping unique spacers to possible targets, separated by plasmid or phage and merged with database metadata
List of anti-phage systems per genome
- Output from PADLOC, combined in a single CSV file
Genome comparison all-vs-all
- by whole-genome MLST, average nucleotide identity (ANI) or similar(?)
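
Several of the outputs listed above are tool results "combined in one CSV file". A minimal sketch of such a merge, using only the Python standard library, is given below. The function name, the added `genome` column and the assumption that there is one tab-separated input table with a header row per genome are illustrative; the repository's own scripts may work differently.

```python
import csv
from pathlib import Path


def combine_tables(input_files: list[Path], output_csv: Path) -> int:
    """Merge tab-separated tables into one CSV; return the number of rows written."""
    rows = []
    fieldnames = ["genome"]
    for path in input_files:
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            for row in reader:
                # Record which file (assumed: one per genome) the row came from.
                row["genome"] = path.stem
                rows.append(row)
            # Keep the union of all columns, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
    with output_csv.open("w", newline="") as handle:
        # restval fills columns that a given input table does not have.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Taking the union of columns keeps the merge from failing when tables differ in their headers; missing values are written as empty fields.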

Project organisation

.
├── .gitignore
├── CITATION.cff
├── LICENSE
├── README.md
├── Snakefile          <- Python-based workflow description
├── bin                <- Code and programs used in this project/experiment
├── config             <- Configuration of Snakemake workflow
├── data               <- All project data, divided in subfolders
│   ├── processed      <- Final data, used for visualisation (e.g. tables)
│   ├── raw            <- Raw data, original, should not be modified (e.g. fastq files)
│   └── tmp            <- Intermediate data, derived from the raw data, but not yet ready for visualisation
├── doc                <- Project documentation, notes and experiment records
├── envs               <- Conda environments necessary to run the project/experiment
├── log                <- Log files from programs
└── results            <- Figures or reports generated from processed data

Licence

This project is licensed under the terms of the New BSD licence.

Citation

Please cite this project as described in the citation file.
