LSHillary/FecalViromeOptimisation

Fecal Virome Optimisation

Overview

Central repository for all code used in Hillary et al. 2025 (DNA extraction and virome processing methods strongly influence recovered human gut viral community characteristics). A link to bioRxiv will be posted here once generated. This repository is split into the following folders:

  • data - csv and tsv outputs generated by the snakemake data processing pipeline and required for running the R scripts that carry out the data analysis.
  • figures - figures featured in the manuscript and supplementary information. Some of these figures were tweaked in Affinity Designer to adjust overlapping labels.
  • scripts - contains both snakemake and python scripts used to process raw sequencing data, and R scripts used to analyse the processed tsvs and csvs, and to generate the figures for the manuscript
  • manuscript - manuscript Word files
  • _targets.R - Targets script that defines the R data analysis workflow

For sequencing data, all raw fastq files have been deposited in the NCBI SRA under BioProject PRJNA1331857 (SRR35523994 - SRR35524005). Assembled viral contigs and vOTU sequences are hosted here and currently undergoing GenBank deposition under BioProject PRJNA1331857.
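
The run accessions above are given as a range, so they can be enumerated programmatically, for example to build a download list. The following is a minimal Python sketch, assuming the range is inclusive and has no gaps (which would give twelve runs); the function name `sra_run_accessions` is hypothetical and not part of this repository.

```python
def sra_run_accessions(first: str, last: str) -> list[str]:
    """Enumerate SRA run accessions from `first` to `last`, inclusive.

    Assumes both accessions use the "SRR" prefix and that the range
    between them is contiguous (an assumption, not confirmed by the README).
    """
    prefix = "SRR"
    if not (first.startswith(prefix) and last.startswith(prefix)):
        raise ValueError("expected accessions starting with 'SRR'")
    start, stop = int(first[len(prefix):]), int(last[len(prefix):])
    if stop < start:
        raise ValueError("last accession precedes first accession")
    width = len(first) - len(prefix)  # preserve zero padding, if any
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]


runs = sra_run_accessions("SRR35523994", "SRR35524005")
for run in runs:
    print(run)
```

The resulting list could be passed to whichever SRA download tool is preferred; no particular tool is implied here.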

data

  • Biogeography - UHGV clustering and mapping analysis. Note that the file votus_metadata_extended.csv is available from https://portal.nersc.gov/UHGV/metadata/votus_metadata_extended.tsv (accessed June 12th 2025).
  • HostPrediction - iPHoP host predictions
  • Kmers - Khmer k-mer count data
  • Lifestyle - BACPHLIP predictions
  • MetaData - Anonymised donor metadata
  • MicrobialProfiling - SingleM microbial read data
  • Pharokka - Pharokka annotations
  • Quast - Quast reports of assembled contigs
  • RawReads - MultiQC report of raw reads
  • ReadMapping - CoverM outputs
  • Subsampling - subsampling analysis outputs
  • ViralContigs - Genomad reports and Quast reports

figures

Figures 1-4 and S1-S7, including Affinity Designer files where text required adjustment. Also included is the DAG of the targets data analysis workflow.

manuscript

Manuscript Word files for the main text and for the supplementary information.

scripts

Snakemake Pipeline Parent Scripts

  • FecalViromeDataProcessingStage1.smk - Snakemake pipeline for running FastQC on raw sequencing data
  • FecalViromeDataProcessingStage2.smk - Snakemake pipeline for running read QC on raw sequencing data and FastQC on the processed reads
  • FecalViromeDataProcessingStage3.smk - Snakemake pipeline for running the main data processing steps of the processed reads
  • BiogeographyMappingFecalUHGV.smk - An additional snakemake pipeline for clustering contigs with, and mapping reads to, the Unified Human Gut Virome database
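
The three stage pipelines appear to be numbered in the order they are meant to run. The Python sketch below only builds the command lines for running them in sequence. It assumes a standard `snakemake --snakefile ... --cores ...` invocation, and the core count is a placeholder. The commands actually used are recorded in RunFecalViromeProcessing.sh and may differ (for example, in cluster or environment options).

```python
import shlex

# Stage snakefiles as named in this README, in presumed execution order.
STAGES = [
    "FecalViromeDataProcessingStage1.smk",
    "FecalViromeDataProcessingStage2.smk",
    "FecalViromeDataProcessingStage3.smk",
]


def stage_commands(cores: int = 8, dry_run: bool = True) -> list[list[str]]:
    """Build one snakemake command per stage (nothing is executed here).

    `cores` is a placeholder value; `dry_run` adds snakemake's --dry-run
    flag so the plan can be inspected before any jobs are launched.
    """
    commands = []
    for snakefile in STAGES:
        cmd = ["snakemake", "--snakefile", snakefile, "--cores", str(cores)]
        if dry_run:
            cmd.append("--dry-run")
        commands.append(cmd)
    return commands


for cmd in stage_commands():
    print(shlex.join(cmd))
```

Building the commands as lists keeps them safe to pass to `subprocess.run` later, and defaulting to a dry run avoids launching cluster jobs by accident.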

Snakemake Child Scripts

  • 2-QC.smk - QCs raw reads
  • 2.2-EC_Dedup.smk - Error corrects and deduplicates QC'ed reads
  • 2.4-Khmer.smk - Quantifies exact k-mer abundance
  • 2.5-SortMeRna.smk - Quantifies the number of Ribosomal RNA reads in samples
  • 3-IndividualAssembly.smk - Assembles reads into contigs
  • 3.2-AssemblyQC.smk - Runs Quast on assembled contigs
  • 4-VirusIdentificationGenomad.smk - Uses Genomad to identify viral contigs
  • 5-Dereplication.smk - Dereplicates viral contigs into vOTU clusters
  • 6-Mapping_vOTUs.smk - Maps reads to dereplicated vOTUs
  • 7-HostPrediction.smk - Predicts hosts with iPHoP
  • 8-Annotation.smk - Annotates viral proteins and contigs with Pharokka and DefenseFinder
  • 9-Biogeography.smk - Clusters viral contigs with UHGV vOTUs
  • 11-SingleM.smk - Characterises microbial reads with SingleM
  • 12-BacPhlip.smk - Predicts viral lifestyle using BACPHLIP

Note - the child script for step 1 was incorporated into the parent script for stage 1, as this step only involved running FastQC on raw reads.
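
The child scripts carry numeric prefixes that indicate their position in the pipeline, but a plain alphabetical listing places 11-SingleM.smk and 12-BacPhlip.smk before 2-QC.smk. The following Python sketch lists them in step order. It assumes the text before the first hyphen is always a dot-separated number, which holds for the names listed above; `step_key` is a hypothetical helper, not part of this repository.

```python
def step_key(filename: str) -> tuple[int, ...]:
    """Turn a prefix such as '2.4' or '11' into a sortable tuple of ints."""
    prefix = filename.split("-", 1)[0]
    return tuple(int(part) for part in prefix.split("."))


# Child script names as listed in this README.
child_scripts = [
    "2-QC.smk",
    "2.2-EC_Dedup.smk",
    "2.4-Khmer.smk",
    "2.5-SortMeRna.smk",
    "3-IndividualAssembly.smk",
    "3.2-AssemblyQC.smk",
    "4-VirusIdentificationGenomad.smk",
    "5-Dereplication.smk",
    "6-Mapping_vOTUs.smk",
    "7-HostPrediction.smk",
    "8-Annotation.smk",
    "9-Biogeography.smk",
    "11-SingleM.smk",
    "12-BacPhlip.smk",
]

# Sorting by the numeric prefix keeps 11 and 12 after 9.
ordered = sorted(child_scripts, key=step_key)
for name in ordered:
    print(name)
```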

Miscellaneous scripts for HPC data processing

  • subsample_analysis_virus_finding_G.smk - Snakemake pipeline for rerunning the pipeline at a given depth X
  • subsample_analysis_assembly.smk - Snakemake pipeline for assembling contigs when subsampled
  • subsample_reads.smk - pipeline for subsampling reads
  • RunFecalViromeProcessing.sh - commands for running snakemake pipelines
  • RunSubsampling.sh - log of commands used to run the subsampling, which was deliberately not fully automated to avoid swamping the cluster.
  • vOTU_clustering - folder containing scripts originally taken from Camargo et al. 2023 (doi: 10.1093/nar/gkac1037) used in clustering viral contigs into vOTUs

R scripts

  • Functions_FVO.R - core functions used in the main analysis workflow
  • Run_FVO_manuscript.R - Parent script for running data analysis in R

stats

Supplementary tables S5 and S6, including relevant statistical outputs.
