Module to build local snpEff database#9967

Open
pmoris wants to merge 3 commits into nf-core:master from pmoris:snpeff-build-config

@pmoris pmoris commented Feb 11, 2026

My attempt at improving the snpEff module(s) to make it possible to generate a snpEff config file and database directory from reference fasta/annotation/cds/protein files.

Motivation

For the use cases of my lab (small pathogen genomes), we have so far always created a custom snpEff database ourselves, rather than using the pre-built databases as recommended by snpEff. Our reasoning is that this makes it easier to control the versions of the reference genomes and to ensure we use the same reference for alignment and annotation.

I can imagine there are other use cases where custom genomes not present in the database are required as well.

Note: I haven't looked into https://annotation-cache.github.io/ yet, but I guess even there it makes sense to allow users to create a completely custom local database if they want to.

Caveats and todos

  • The current behaviour of the annotation module (specifically the cache) is not entirely clear to me yet, so the changes I've made to make the annotation module compatible with the output of the build module may break existing functionality. If there's no obvious way to consolidate the two approaches, we could keep the existing cache approach and make the custom database directory + config file optional?
    • I also removed the task.process output since I didn't need it, but it can likewise be added back.
  • Several comments / optional todos need to be cleaned up.
  • Add tests for build module.
  • Clean up meta.yml (there were some mistakes in there from before, e.g. meta map descriptions describe vcf files).
  • Check how the module behaves when running it on multiple references sequentially (there might be some workflows that want to process different samples with different references), or multiple references simultaneously.

Related info:

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Broadcast software version numbers to topic: versions - See version_topics
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

This module can build a snpEff config file and database from an
input reference fasta, annotation and optionally cds/protein files.
It outputs the database and config file separately.

The snpEff config is created from an existing template file
(user-provided, but could alternatively be found inside snpEff's
install directory), to which the appropriate genome lines for
the provided genome are appended.
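
As a rough illustration of the config step described above (not the module's actual code), the sketch below copies a template config and appends one `<name>.genome : <description>` line per genome, which is the entry format snpEff uses in its config file. The function name `write_snpeff_config` and the example genome name are hypothetical.

```python
from pathlib import Path


def write_snpeff_config(template: Path, output: Path, genomes: dict[str, str]) -> Path:
    """Copy a template snpEff config and append one genome entry per genome.

    `genomes` maps a database name to a free-text description.
    The template itself is left untouched.
    """
    text = template.read_text()
    # Make sure the appended entries start on their own line.
    if text and not text.endswith("\n"):
        text += "\n"
    # snpEff config entries take the form "<name>.genome : <description>".
    entries = "".join(
        f"{name}.genome : {description}\n" for name, description in genomes.items()
    )
    output.write_text(text + entries)
    return output
```

Writing to a separate output file rather than editing the template in place keeps a snpEff-installed template reusable across several references.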

The snpEff build command uses the -dataDir flag in
combination with a relative path, which snpEff resolves relative
to the provided config file. This overrides any data.dir definition
in the config file.
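
A minimal sketch of how such a build command could be assembled, assuming the database directory is expressed relative to the directory holding the config file, as described above. The helper name `snpeff_build_command`, the example paths, and the default choice of `-gff3` are illustrative assumptions; the module itself may construct the command differently.

```python
import os
from pathlib import Path


def snpeff_build_command(genome, config, data_dir, annotation_flag="-gff3"):
    """Return an argument list for `snpEff build` (hypothetical helper).

    The data directory is rewritten relative to the config file's directory,
    since that is where snpEff is described as resolving relative paths from.
    """
    relative_data_dir = os.path.relpath(data_dir, start=Path(config).parent)
    return [
        "snpEff", "build",
        "-config", str(config),
        "-dataDir", relative_data_dir,  # overrides data.dir from the config
        annotation_flag,                # annotation format, e.g. -gff3
        "-v",
        genome,                         # must match a "<genome>.genome" entry
    ]


if __name__ == "__main__":
    print(" ".join(snpeff_build_command("my_genome", "work/snpEff.config", "work/db")))
```

Returning a list rather than a single string makes it straightforward to pass the command to `subprocess.run` without shell quoting issues.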

Makes the snpEff annotation module compatible with the build module.
The module expects a VCF, snpEff config and snpEff database as its inputs.

To be determined whether or not this presents a breaking change compared
to the previous implementation (e.g. cache_command vs datadir).
@pmoris pmoris requested a review from maxulysse as a code owner February 11, 2026 10:42