This Nextflow workflow automates metagenomics analysis steps from quality filtering to the generation and curation of metagenome-assembled genomes (MAGs). It uses diverse strategies to limit the number and size of temporary/intermediate files.
The pipeline includes several state-of-the-art metagenomics programs such as MetaBAT, dRep, CheckM2, QUAST, and PhyloPhlAn.
It is easy to dive into the code of this project as it includes four main components:
- nextflow.config: allows one to specify the profile and parameters of the analyses, the computational requirements of each task, etc.
- .env file: contains additional parameters necessary for the execution of the workflow.
- main.nf: describes the workflow logic.
- modules/local folder: contains the Nextflow code for each bioinformatics program.
It has been tested to run properly on at least two Slurm-based high-performance computing (HPC) systems.
Boolean options let the user include or skip some components of the workflow: the Kaiju branch, the Kraken2/Bracken branch, and the co-assembly branch.
As illustrated in the Workflow diagram, the pipeline is made up of distinct branches, some of which are optional.
The primary input data are raw paired-end short-read FASTQ files, but the workflow also offers alternative entry points: you can start from prepared reads (reads that have already been trimmed and decontaminated) or from existing individual assemblies (produced with MEGAHIT).
- Overview
- Data
- Parameters
- Usage
- Output
- Credits
- CONTRIBUTION
- LICENSE
- Publications and additional resources
The pipeline processes metagenomic sequencing data. The different input data types include:
- Raw sequencing reads (short-read, paired-end, FASTQ format)
  - These reads undergo quality control, trimming, and host contamination removal before downstream analyses.
- Prepared reads (pre-processed reads after quality control)
  - Users may provide already cleaned and decontaminated reads to bypass the initial preprocessing steps.
- Individual assemblies (assembled contigs from tools like MEGAHIT)
  - The workflow supports skipping the assembly step by accepting pre-assembled contigs.
- Reference genomes (for host genome decontamination)
  - Bowtie2 indexes of host genomes (e.g., pig, cow) are required for removing host DNA contamination.
- Database files (for functional, taxonomic, and phylogenetic profiling)
  - Several external databases are used, including:
    - Kraken2 and Kaiju (taxonomic classification)
    - GTDB-Tk and PhyloPhlAn (phylogenetic analysis)
    - HUMAnN (functional profiling)
    - CheckM2 (MAG quality control)
Pipeline execution is configured using the nextflow.config file and command-line parameters. Key configurable parameters include:
- `--reads` - Path to raw sequencing reads (`*_R1.fastq.gz`, `*_R2.fastq.gz`).
- `--prepared_reads` - Path to pre-processed reads, if available.
- `--indiv_assemblies` - Path to pre-assembled contigs (e.g., MEGAHIT output).
- `--map_file` - Sample metadata file mapping sample names to sequencing files.
- `--coassembly_file` - Metadata file defining sample groups for co-assembly.
- `--skip_kraken` - Skip taxonomic classification using Kraken2 (default: false).
- `--skip_humann` - Skip functional profiling with HUMAnN (default: false).
- `--skip_kaiju` - Skip taxonomic classification using Kaiju (default: false).
- `--skip_coassembly` - Skip co-assembly of metagenomic samples (default: false).
- `--use_prepared_reads` - Use pre-filtered reads instead of raw reads (default: false).
- `--use_megahit_individual_assemblies` - Use precomputed individual assemblies instead of running MEGAHIT (default: false).
- `--cpus` - Number of CPU cores allocated per process.
- `--memory` - Amount of memory allocated per process.
- `--profile` - Specifies the computing environment profile (hpc, or your own).

A hypothetical launch command combining several of these options is shown below.
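For illustration only, a direct launch might look like the following minimal sketch. The reads glob, the profile name, and the use of Nextflow's standard single-dash `-profile` flag are assumptions to adapt to your setup (the pipeline is normally launched via run.sh, as described in the Usage section):

```bash
# Hypothetical example: all values below are placeholders
nextflow run main.nf \
  -profile hpc \
  --reads 'data/reads/*_R{1,2}.fastq.gz' \
  --map_file data/map_files/map_file.tsv \
  --skip_kaiju true \
  --skip_coassembly true
```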
This pipeline is built with Apptainer usage in mind; several of its steps run from Apptainer images. To build the proper Apptainer images, read further here.
This pipeline is built with the Nextflow language. If you are not familiar with it, you can read more here. The documentation is extensive and updated regularly. A good approach is to install Nextflow itself in a conda environment, for example as shown below.
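A minimal setup with conda could look like this (assuming the conda-forge and Bioconda channels are reachable from your system):

```bash
# Create a dedicated environment for Nextflow and verify the installation
conda create -n nextflow -c conda-forge -c bioconda nextflow
conda activate nextflow
nextflow -version
```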
Several databases are required to perform all steps of the pipeline.
| Database | Module | Program |
|---|---|---|
| Chocophlan | humann_run.nf | Humann3 |
| Uniref | humann_run.nf | Humann3 |
| Metaphlan | humann_run.nf | Humann3 |
| CheckM2 | checkm.nf | checkm2 |
| gtdb | gtdb_tk.nf | gtdbtk |
| Kraken | kraken2.nf | kraken2 |
| Kaiju | kaiju.nf | kaiju |
| Phylophlan | phylophlan.nf | phylophlan |
You can read more about how databases are set up in our databases documentation.
A Bowtie2 index of the host genome plus the phiX genome (included in the data/genomes folder) is needed for the decontamination step of the raw reads.
First, download the latest pig reference genome here:
```bash
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/GCF_000003025.6_Sscrofa11.1_genomic.fna.gz
```

Then unzip the file:

```bash
gunzip GCF_000003025.6_Sscrofa11.1_genomic.fna.gz
```

Combine the pig and PhiX genomes:

```bash
cat phiX.fa GCF_000003025.6_Sscrofa11.1_genomic.fna > Pig_PhiX_genomes.fna
```

Then, build the Bowtie2 index:
```bash
mkdir pig
conda activate bowtie2
sbatch \
  -D $PWD \
  --output $PWD/bowtie2-%j.out \
  --export=ALL \
  -J bowtie2-build \
  -c 8 \
  -p your_partition \
  --account=your_account \
  -t 300 \
  --wrap="bowtie2-build Pig_PhiX_genomes.fna pig/pig"
```

Finally, in the .env file, simply set the environment variable "GENOME" to the path of the folder containing the index files for the reference genome. In this example, if your Bowtie2 index files are /share/pig/pig.1.bt2, /share/pig/pig.2.bt2, etc., set the GENOME environment variable to /share/pig.
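In other words, the corresponding .env line would look like this (illustrative path):

```bash
# Path to the FOLDER containing the Bowtie2 index files, not the index prefix itself
GENOME=/share/pig
```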
Raw reads (paired-end, FASTQ format) must be located in the folder data/reads and the map file in data/map_files. The latter allows you to give a relevant sample name to the corresponding metagenomics reads.
The map file is a .tsv file with at least two mandatory columns: sample_read and file_name. The idea is to have an expression corresponding to the desired sample ID plus _R1 or _R2 in the first column and the basename of the corresponding FASTQ files in the second column. Here is an example:
| sample_read | file_name |
|---|---|
| sampleName1_R1 | SampleName1_S1_L001_R1_001 |
| sampleName1_R2 | SampleName1_S1_L001_R2_001 |
| sampleName2_R1 | SampleName2_S1_L001_R1_001 |
| sampleName2_R2 | SampleName2_S1_L001_R2_001 |
When setting the rename parameter to 'yes', the rename sub-workflow will rename the reads according to the values present in the sample_read column of the map file.
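For orientation, the expected input layout would thus look something like this (file names are illustrative):

```
data/
├── reads/
│   ├── SampleName1_S1_L001_R1_001.fastq.gz
│   ├── SampleName1_S1_L001_R2_001.fastq.gz
│   └── ...
└── map_files/
    └── map_file.tsv
```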
A good starting point to produce a compliant map file for this workflow is to use the following snippet with your raw input files.
Depending on how your raw reads are named, however, you may have to adjust the cut command.

```bash
cd data/reads
printf "sample_read\tfile_name\n" > ../map_files/map_file.tsv
for i in *.fastq.gz; do
  n=$(basename "$i" .fastq.gz); id=$(echo "$i" | cut -f 5 -d '.')
  printf "%s\t%s\n" "$id" "$n"
done >> ../map_files/map_file.tsv
```

Trying to co-assemble a large number of metagenomic samples (>10) with MEGAHIT is a tedious task that can take weeks to complete, or not complete at all! To overcome this issue, you have the option to perform several co-assemblies, each containing a reduced number of samples (~2-10 samples).
If you want to perform co-assemblies, you will have to prepare a distinct map file specifying the desired groups for co-assembly, e.g.:
| sample_read | file_name | project | sample_type | coassembly_group |
|---|---|---|---|---|
| sampleName1_R1 | SampleName1_S1_L001_R1_001 | project_x | type_x | group_x |
| sampleName1_R2 | SampleName1_S1_L001_R2_001 | project_x | type_x | group_x |
| sampleName2_R1 | SampleName2_S1_L001_R1_001 | project_y | type_y | group_y |
| sampleName2_R2 | SampleName2_S1_L001_R2_001 | project_y | type_y | group_y |
The co-assembly map file has three mandatory columns: sample_read, file_name and coassembly_group. One way to derive such a file is sketched below.
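As a hedged sketch, the extra columns could be appended to an existing map file with awk; the project, type, and group values below are placeholders that you must replace with your own grouping:

```bash
# Append project, sample_type and coassembly_group columns to an existing map file.
# Placeholder assignment: every sample ends up in 'group_1'; edit to define real groups.
awk 'BEGIN{FS=OFS="\t"}
     NR==1 {print $0, "project", "sample_type", "coassembly_group"; next}
     {print $0, "project_x", "type_x", "group_1"}' \
    data/map_files/map_file.tsv > data/map_files/coassembly_map_file.tsv
```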
Edit the run.sh script to set the --time parameter to an estimate of the pipeline duration.
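For example, if run.sh submits the pipeline through sbatch (an assumption; check your copy of the script for the exact line), the time limit might look like this:

```bash
# Illustrative: allow up to 4 days of wall time for the whole run
#SBATCH --time=4-00:00:00
```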
The .env file is a text file essential to the pipeline operation. It contains various parameters necessary for the execution of the processes. These parameters are mainly paths to databases and Slurm settings specific to your HPC environment. Make sure to fill in all parameters before running the analysis. A template of the .env file is provided in the project directory.
| Variable | Description |
|---|---|
| CONDA_SRC | The path to the conda.sh file of your miniconda3 installation. Typically, it is ~/miniconda3/etc/profile.d/conda.sh |
| NXF_APPTAINER_CACHEDIR | Nextflow caches Apptainer images in the apptainer or singularity directory, in the pipeline work directory, by default. Set this to the same value as WORKDIR to be on the safe side. |
| SLURM_ACCT | Your Slurm account |
| PARTITION | The partition for tasks that require less than 512GB |
| PARTITION_HIGH | Set this parameter to a partition with more than 1TB of memory |
| PARTITION_SUPER | Set this parameter to a partition with more than 2TB of memory |
| CLUSTERS | Specify the Slurm cluster |
| APPTAINER_IMGS | The location of your apptainer images |
| WORKDIR | The location of Nextflow's work directory. It can grow to several terabytes! Consider using a folder located on a scratch partition. |
| GENOME | Path of the folder containing the index files of a reference genome for decontamination |
| CHOCOPHLAN_DB | database location |
| UNIREF_DB | database location |
| METAPHLAN_DB | database location |
| CHECKM2_DB | database location |
| GTDB_DB | database location |
| KAIJU_DB | database location |
| KRAKEN2_DB | database location |
| PHYLO_DB | database location |
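For reference, a filled-in .env could look like the following; every value is an illustrative placeholder to replace with paths and names valid on your system:

```bash
CONDA_SRC=~/miniconda3/etc/profile.d/conda.sh
NXF_APPTAINER_CACHEDIR=/scratch/username/metagenomic_nf/work
SLURM_ACCT=my_account
PARTITION=standard
PARTITION_HIGH=highmem
PARTITION_SUPER=supermem
CLUSTERS=my_cluster
APPTAINER_IMGS=/share/apptainer_images
WORKDIR=/scratch/username/metagenomic_nf/work
GENOME=/share/pig
CHOCOPHLAN_DB=/share/databases/chocophlan
UNIREF_DB=/share/databases/uniref
METAPHLAN_DB=/share/databases/metaphlan
CHECKM2_DB=/share/databases/checkm2
GTDB_DB=/share/databases/gtdb
KAIJU_DB=/share/databases/kaiju
KRAKEN2_DB=/share/databases/kraken2
PHYLO_DB=/share/databases/phylophlan
```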
This file is read by Nextflow and sets some important parameters, for example the location of input reads, the name and location of the map file, the output folder, etc. Default values should work for most situations but, as HPC systems differ greatly between organizations, it is recommended to review and adjust them according to your HPC specifications and your data.
This script will launch or resume a run. Execute it with this simple command from the working directory:

```bash
bash run.sh
```

Three files located in the output directory (results/logs by default) can be consulted to follow the progress of the analysis:
- nextflow.log
- stdout.out
- stderr.err
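For instance, you could follow the main log live (assuming the default results/logs location):

```bash
tail -f results/logs/nextflow.log
```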
At the end of a run, a detailed report in HTML format named report-YYYY-MM-DD-HHMMSS.html is also produced in the output directory (results by default).
Pipeline outputs are directed to the folder specified in the results parameter of the profile used in nextflow.config; by default this is the output directory.
The metagenomic_nf workflow was written in the Nextflow language by Jean-Simon Brouard (AAFC/AAC Sherbrooke RDC). The main building blocks of this workflow come from the work of Devin Holman (AAFC/AAC Lacombe RDC), whereas the original scripts were written in Bash by Arun Kommadath (AAFC/AAC Lacombe). Sara Ricci, from the team of Renee Petri (AAFC/AAC Sherbrooke RDC), also contributed to adapting this workflow for use with cow samples. Mario Laterriere (AAFC/AAC Quebec RDC) contributed by writing documentation and adapting the code so that it can run seamlessly on any Slurm-based HPC infrastructure.
If you would like to contribute to this project, please review the guidelines in CONTRIBUTING.md and ensure you adhere to our CODE_OF_CONDUCT.md to foster a respectful and inclusive environment.
See the LICENSE file for details. Visit LicenseHub or tl;drLegal to view a plain-language summary of this license.
Copyright (c) His Majesty the King in Right of Canada, as represented by the Minister of Agriculture and Agri-Food, 2025.
The Metagenomic_nf workflow integrates various state-of-the-art tools for metagenomics analysis. Below are key references and resources for the tools used in the pipeline:
- Kraken2: https://ccb.jhu.edu/software/kraken2/
- Bracken: https://ccb.jhu.edu/software/bracken/
- Kaiju: https://github.com/bioinformatics-centre/kaiju
- MEGAHIT: https://github.com/voutcn/megahit
- MetaBAT2: https://bitbucket.org/berkeleylab/metabat/src/master/
- QUAST: http://bioinf.spbau.ru/quast
- GTDB-Tk: https://github.com/Ecogenomics/GTDBTk
- PhyloPhlAn: https://github.com/biobakery/phylophlan
For additional inquiries or troubleshooting, consult the Issues section of this repository.
