This set of scripts is designed to align multiple samples of the same species to a reference genome, preprocess the alignments, and then call variants. The scripts use the following software, for the purposes given and in the order listed.
- `trimgalore` to trim adapters, clip the ends of the reads, and generate fastqc reports
- `bwa mem` for aligning
- `samtools sort -n` for sorting by name
- `samtools fixmate` for fixing mate information
- `samtools sort` for sorting by coordinate
- `samtools markdup` for marking duplicates
- `picard-tools AddOrReplaceReadGroups` for adding and replacing RG tags
- `picard-tools CleanSam` for setting the mapping quality to 0 for reads that are not aligned
- `samtools index` for indexing
- `samtools coverage` for coverage reports
- `bamtools stats` for alignment reports
- `bcftools` for variant calling
git clone https://github.com/evolozzy/NGS-Pipeline.git
- Make a subdirectory named `Data` in the folder containing your scripts and copy your files there, or change the line containing `DATASOURCE` in your `PARAMETERS` file and set it to the folder that contains your data.
- If you have two or more sets of reads to merge, keep them in separate directories in the `Data` directory.
- Make sure you have your reference file.
- Edit the `RGTAGS` file carefully; files belonging to the same sample should have the same SM (sample name).
- Carefully change the `PARAMETERS` file.
  - Set `REFERENCEFILE` to the path to the reference.
  - If you are running on multiple threads, set `THREADS` to the number of cores you want to use.
- Set the directories to be used in the `DIRECTORIES` file.
  - If you're not running the scripts from the directory that contains them, change the line containing `WD` to the path that contains your scripts.
- Install the required software, and set `PROGRAMPATHS`.
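
Before the first run, it can help to sanity-check the settings above by hand. The following is a minimal sketch, not code taken from the repository: the function names `missing_tools`, `check_inputs`, and `split_cores` are hypothetical, and the rule for dividing cores between parallel jobs is an assumption for illustration, not necessarily what the scripts compute.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; not part of the NGS-Pipeline scripts.

# Print every tool from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Print a message for each required input that is missing.
# $1: reference file (the REFERENCEFILE setting)
# $2: data directory (the DATASOURCE setting)
check_inputs() {
    [ -f "$1" ] || echo "reference file not found: $1"
    [ -d "$2" ] || echo "data directory not found: $2"
}

# Divide the available cores between parallel jobs (assumed rule:
# one job per file, capped at the core count, remaining cores shared evenly).
# $1: total cores (the THREADS setting), $2: number of input files
# Prints "<parallel jobs> <threads per job>".
split_cores() {
    threads=$1
    files=$2
    jobs=$files
    if [ "$jobs" -gt "$threads" ]; then jobs=$threads; fi
    if [ "$jobs" -lt 1 ]; then jobs=1; fi
    echo "$jobs $((threads / jobs))"
}

# Example calls with placeholder values (uncomment to try):
# missing_tools trim_galore bwa samtools picard-tools bamtools bcftools
# check_inputs reference.fa Data
# split_cores 8 3
```

An empty result from `missing_tools` and `check_inputs` means nothing was found to be missing. The executable names in the example call are the usual install names and may differ on your system, which is what `PROGRAMPATHS` is for.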
Inside the folder:
./runall.sh
Or outside the folder:
/path/to/scripts/runall.sh
If you encounter any errors during the process and want to clean up all the files created by the scripts:
./resetanalysis.sh
- Before running `runall.sh`, use `trimall.sh` to quality control the trimming process. Check out the fastqc reports after trimming and set `PARAMETERS` accordingly.
- Make sure that the core numbers are set properly. Try to use parallel more, but it depends on the number of files. For low numbers of files
- The script checks
- if the files are in place
- if the software is installed
- calculates a good way to use the cores available
- builds references from reference file
- Trimming is done with `trimgalore`.
- Aligning is done with `bwa`.
- Preprocessing is done with `samtools` and `picard-tools`.
  - First, the files are sorted by name and mate info is fixed.
  - Second, the files are sorted by coordinate and duplicates are marked.
  - Third, the files are cleaned from reads that were not aligned.
  - Last, RG tags are added.
- Variants are called with `bcftools`.
- The intermediate files can be kept, deleted, or archived to another location.
- The code also generates reports of trimming (fastqc reports), alignment, and coverage.
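
To make the sequence of steps concrete, here is a dry-run sketch that only prints the commands for a single paired-end sample rather than executing them. It is an illustration, not the contents of `runall.sh`: the function name `print_steps`, all file names, the read-group values, and the specific flags are assumptions, and cleaning is placed before RG tagging to follow the step description in the list above.

```shell
#!/bin/sh
# Dry run: prints one command per step for a paired-end sample.
# File names, read-group values, and flags are illustrative assumptions.

# $1: sample name, $2: reference FASTA, $3: threads for bwa mem
print_steps() {
    sample=$1
    ref=$2
    threads=$3
    cat <<EOF
trim_galore --paired --fastqc ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz
bwa index ${ref}
bwa mem -t ${threads} ${ref} ${sample}_R1_val_1.fq.gz ${sample}_R2_val_2.fq.gz > ${sample}.sam
samtools sort -n -o ${sample}.namesorted.bam ${sample}.sam
samtools fixmate -m ${sample}.namesorted.bam ${sample}.fixmate.bam
samtools sort -o ${sample}.sorted.bam ${sample}.fixmate.bam
samtools markdup ${sample}.sorted.bam ${sample}.markdup.bam
picard-tools CleanSam I=${sample}.markdup.bam O=${sample}.clean.bam
picard-tools AddOrReplaceReadGroups I=${sample}.clean.bam O=${sample}.rg.bam RGID=${sample} RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=${sample}
samtools index ${sample}.rg.bam
samtools coverage ${sample}.rg.bam
bamtools stats -in ${sample}.rg.bam
bcftools mpileup -f ${ref} ${sample}.rg.bam | bcftools call -mv -Oz -o ${sample}.vcf.gz
EOF
}

# Print the plan for a placeholder sample.
print_steps sampleA reference.fa 4
```

Printing the commands first makes it easy to review file naming and step order before anything touches the data. Note that `samtools fixmate -m` is used here because `samtools markdup` relies on the mate score tags it adds.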
Note: I haven't developed this project for some time, but I have plans to convert it to a snakemake pipeline when I have some extra time. If you need help with it, send me an email.