A consensus calling pipeline provided in Bash.
Initially, the pipeline was used to make quick alignments of fastq reads against a reference using smalt, now it’s mainly used for HIV and HCV consensus generation for diagnostics.
It does the following:
- Subsample reads with seqtk (optional).
- Make de novo alignment of sampled reads with velvet.
- If more than one reference is provided into
<reference_file>, then it runsselect_ref_whole.py, which selects the best reference sequence for each FASTQ file. - Align sampled reads (and only in the first iteration de novo contigs in triplicate) against reference with smalt.
- Create consensus with freebayes.
- Create vcf with lofreq.
- Calculate depth with samtools.
- If max number of iteration reached, call the final consensus sequence using final vcf file and the given ambiguity threshold. Otherwise, repeat from step 3.
cov_plot.Rplots the coverage of all iterations.wts.Rcombines consensus sequence, variants and coverage for the last iteration.
All the necessary references are in the References directory. Further information about cov_plot.R and wts.R is provided below.
To ensure you have all dependencies needed for SmaltAlign installed you can use the environment.yml file.
First you need to have Conda installed).
With the command conda env create -f <path>/environment.yml you will create a copy of the smaltalign environment.
You enter the environment with the command conda activate smaltalign (and leave it with conda deactivate).
For more information visit following link to Managing environments.
smaltalign.sh -r <reference_file> [options] <fastq_file/directory>
Options:
[-h or --help]
[-n or --numreads]
[-i or --iterations]
[-t or --varthres]
[-c or --mincov]
[-o or --outdir]
[-d or --indels]
[-m or --mergecov]
Run smaltalign.sh -h for detailed information of the options and the default parameters.
If you would like to run the script using one specific reference sequence, you can specify this reference into <reference_file> (in fasta format).
However, if you are not sure about the closest reference sequence, you can also specify several probable reference sequences in <reference_file> (only one file with all the sequences in fasta format). In this case, the script chooses the closest reference sequence from the set of given sequences and constructs the consensus sequence based on that.
If you would like to analyse only one FASTQ file at a time, you can specify it into <fastq_file/directory>. In this case, you can specify either the folder where the file is located, or the file itself.
However, if you would like to anlayse multiple FASTQ files at a time, you can specify the folder to all the FASTQ files into <fastq_file/directory>. In this case, you need to have all your FASTQ files in the same folder.
By default, the tool subsamples the sequencing reads to 200000. However, this number can be changed by the users.
If you would like to analyse all sequencing reads without subsampling, then use -n "all".
batch_influenza.sh [options]
Options:
[-h or --help]
[-r or --refdir]
[-s or --sampledir]
[-n or --numreads]
[-i or --iterations]
[-t or --varthres]
[-c or --mincov]
[-o or --outdir]
This script processes Influenza sequences with SmaltAlign:
- Iteration over all
.fastq.gzfiles in the current directory. - Create a folder for each sample containing segment 1-8 subfolders.
- If a reference folder containing a reference per segment is not provided or does not exist, then it runs
select_ref_segments.py, which selects the best reference sequence for each segment from a Influenza reference database (selected sequences from the NCBI Influenza Virus Database)
- Using the best reference sequence, run
smaltalign.shfor each segment, includingcov_plot.Randwts.R.
wts.R is an R script to combine consensus sequence, variants and coverage for the last iteration of all lofreq.vcf files in a directory.
It saves a _x_WTS.fasta file containing the consensus sequence with wobbles (at a certain threshold x) and a .csv file containing coverage and variant frequencies for every position.
The variant threshold and the minimal coverage have to be provided by the user either when using smaltalign.sh or batch_influenza.sh.
cov_plot.R is an R script to plot and save the coverage of all iterations of all .depth files in the working directory.
- Judith Bergadà-Pijuan*
- Maryam Zaheri
- Stefan Schmutz
- Osvaldo Zagordi
- Michael Huber**
*maintainer ; **group leader