Welcome! This pipeline analyzes off-target effects in CRISPR or Prime Editing experiments using paired-end NGS sequencing data with UMI (Unique Molecular Identifier) deduplication.
- UMI-based deduplication for accurate quantification
- Dual alignment strategy against both on-target and genome-wide references
- Local alignment filtering to remove sequences similar to known contaminants
- Soft-clipping analysis to identify and remove primer artifacts
- Automated batch processing for multiple samples
- R1: [Illumina Adapter] - [S + 8bp UMI] - [Genome-specific sequence]
- R2: Complementary sequence with target region
For the detailed oligo design and library preparation protocol, please refer to the GUIDE-seq Simplified Library Preparation Protocol.
- Linux/Unix environment (tested on Ubuntu 22.04)
- Python 3.7+
- Bash shell
- At least 16GB RAM (32GB recommended for human genome alignment)
- ~50GB free disk space for reference genome and analysis
Install the following tools with specific versions:
# Create conda environment
conda create -n offtarget python=3.8
conda activate offtarget
# Install core dependencies with specific versions
conda install -c bioconda umi_tools=1.1.6
conda install -c bioconda minimap2=2.28
conda install -c bioconda samtools=1.18
conda install -c bioconda pysam=0.21.0
conda install -c conda-forge biopython=1.83
# Or use the provided environment file
conda env create -f environment.yml
Download the human reference genome (GRCh38):
# Download GRCh38 reference genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
# Decompress
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
# Create minimap2 index (optional but recommended for speed)
minimap2 -d GRCh38.mmi GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
The pipeline requires paired-end NGS sequencing data in FASTQ format:
- R1 file: *_R1_001.fastq[.gz] - Contains UMI and target sequence
- R2 file: *_R2_001.fastq[.gz] - Contains complementary sequence
File naming convention: {sample_id}_S{sample_id}_L001_R{1,2}_001.fastq[.gz]
Example:
85_S85_L001_R1_001.fastq.gz
85_S85_L001_R2_001.fastq.gz
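The naming convention above can be used to pair R1 and R2 files before batch processing. The following is a minimal Python sketch, not the pipeline's own file discovery logic; the helper name `pair_fastq_files` and the regular expression are assumptions inferred from the convention.

```python
import re
from pathlib import Path

# Pattern inferred from {sample_id}_S{sample_id}_L001_R{1,2}_001.fastq[.gz]
FASTQ_RE = re.compile(r"^(?P<sample>.+)_S\d+_L001_R(?P<read>[12])_001\.fastq(\.gz)?$")


def pair_fastq_files(filenames):
    """Return {sample_id: (r1_file, r2_file)} for samples that have both mates."""
    found = {}
    for name in filenames:
        match = FASTQ_RE.match(Path(name).name)
        if match:
            found.setdefault(match.group("sample"), {})[match.group("read")] = name
    return {
        sample: (reads["1"], reads["2"])
        for sample, reads in found.items()
        if "1" in reads and "2" in reads
    }


pairs = pair_fastq_files([
    "85_S85_L001_R1_001.fastq.gz",
    "85_S85_L001_R2_001.fastq.gz",
    "86_S86_L001_R1_001.fastq.gz",  # no R2 mate, so this sample is skipped
])
print(pairs)
```

Samples with a missing mate are dropped silently here; a real run would probably want to report them.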
For homology filtering, provide a tab/space-separated file:
name sequence alignment_score_threshold
donor_1000bp ATCGATCGATCG 130
control_seq GCTAGCTAGCTA 100
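A file in this layout can be read with a few lines of Python. This is a sketch only: the function name `parse_reference_sequences` is hypothetical, and it assumes that any row whose third field is not an integer (such as a column header) should be skipped.

```python
def parse_reference_sequences(lines):
    """Parse 'name sequence threshold' rows separated by tabs or spaces."""
    records = []
    for line in lines:
        fields = line.split()  # splits on any run of tabs or spaces
        if len(fields) != 3:
            continue
        name, sequence, threshold = fields
        if not threshold.isdigit():
            continue  # assumed to be a header row
        records.append((name, sequence.upper(), int(threshold)))
    return records


example = [
    "name sequence alignment_score_threshold",
    "donor_1000bp\tATCGATCGATCG\t130",
    "control_seq GCTAGCTAGCTA 100",
]
records = parse_reference_sequences(example)
print(records)
```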
- Clone the repository:
git clone https://github.com/yourusername/offtarget-analysis.git
cd offtarget-analysis
- Make scripts executable:
chmod +x run_umi_pipeline_v5.sh
- Configure paths (important):
# Copy and edit the configuration template
cp config.template config.sh
nano config.sh
# Update these paths to match your system:
# - SCRIPT_DIR: Path to Python scripts
# - REF_DIR: Path to reference genomes
# - GRCH38_REF: Path to GRCh38 reference
Basic usage:
./run_umi_pipeline_v5.sh <input_folder>
With homology filtering:
./run_umi_pipeline_v5.sh <input_folder> <reference_sequences.txt>
Examples:
# Process all samples in the data directory
./run_umi_pipeline_v5.sh ./data/
# Process with homology filtering
./run_umi_pipeline_v5.sh ./data/ ./homology_sequences.txt
- Filters reads containing specified indicator sequences
- Default indicators:
  - R1: CTACAAGAGCGGTGAGt
  - R2: GGTGCCAGAGGTATTGGCGCTAGGGTCA
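The indicator check can be sketched in Python as below. This is an illustration under two assumptions that the pipeline script should be checked against: a read pair is kept only when each read contains its indicator, and matching is case-insensitive. The function names are hypothetical.

```python
# Default indicator sequences as listed above
INDICATOR_R1 = "CTACAAGAGCGGTGAGt"
INDICATOR_R2 = "GGTGCCAGAGGTATTGGCGCTAGGGTCA"


def has_indicator(read_seq, indicator):
    """Case-insensitive exact substring match (assumption: no mismatches allowed)."""
    return indicator.upper() in read_seq.upper()


def keep_pair(r1_seq, r2_seq, ind_r1=INDICATOR_R1, ind_r2=INDICATOR_R2):
    """Keep a pair only if both reads carry their indicator sequence."""
    return has_indicator(r1_seq, ind_r1) and has_indicator(r2_seq, ind_r2)


r1 = "AAACTACAAGAGCGGTGAGTCCC"
r2 = "TTGGTGCCAGAGGTATTGGCGCTAGGGTCAAA"
print(keep_pair(r1, r2))
print(keep_pair(r1, "ACGTACGTACGT"))
```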
- Removes reads similar to known sequences using local alignment
- Checks both forward and reverse complement
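The idea of scoring a read against a known sequence in both orientations can be illustrated with a small pure-Python Smith-Waterman scorer. This is a sketch, not the code in filter_similar_sequences.py: the scoring values (match 2, mismatch -1, gap -2) and the function names are assumptions, so the scores it produces are not comparable to the thresholds in the reference file.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    a, b = a.upper(), b.upper()
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def is_similar(read, reference, threshold):
    """True if the read or its reverse complement scores at or above the threshold."""
    forward = local_alignment_score(read, reference)
    reverse = local_alignment_score(reverse_complement(read), reference)
    return max(forward, reverse) >= threshold


reference = "ATCGATCGATCG"
print(is_similar("CGATCGATCGAT", reference, threshold=24))
```

The read in the example is the reverse complement of the reference, so it is caught only because both orientations are scored.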
- Extracts 9bp UMI from R1 reads
- Adds UMI to read headers
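The pipeline performs this step with umi_tools, which moves the barcode from the read sequence into the read name, separated by an underscore. The effect on a single FASTQ record can be sketched as follows; the function name `extract_umi` is hypothetical, and treating all nine leading bases as UMI is an assumption about the `--bc-pattern` in use.

```python
def extract_umi(header, seq, qual, umi_len=9):
    """Move the first umi_len bases of the read into its name as NAME_UMI."""
    umi = seq[:umi_len]
    name, _, comment = header.partition(" ")
    new_header = f"{name}_{umi}"
    if comment:
        new_header += f" {comment}"
    # Sequence and quality strings are trimmed by the same amount
    return new_header, seq[umi_len:], qual[umi_len:]


record = extract_umi("@read1 1:N:0:1", "ACGTACGTAGGGTTT", "I" * 15)
print(record)
```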
- Aligns to on-target reference
- Aligns to human genome (GRCh38)
- Retains properly paired reads
- Removes PCR duplicates based on UMI
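UMI-based deduplication treats reads that share both a mapping position and a UMI as copies of one original molecule. The sketch below shows the simplest form of that idea with exact UMI matching; by default `umi_tools dedup` uses a directional network method that also merges UMIs differing by a single base, which this example does not attempt. The dictionary keys are illustrative.

```python
def deduplicate(reads):
    """Keep the first read seen for each (chrom, pos, strand, umi) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"], read["umi"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    {"name": "a", "chrom": "chr1", "pos": 12345, "strand": "+", "umi": "ACGTACGTA"},
    {"name": "b", "chrom": "chr1", "pos": 12345, "strand": "+", "umi": "ACGTACGTA"},  # PCR duplicate of a
    {"name": "c", "chrom": "chr1", "pos": 12345, "strand": "+", "umi": "TTTTACGTA"},  # same site, new UMI
]
unique = deduplicate(reads)
print([r["name"] for r in unique])
```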
- Removes primer artifacts
- Filters soft-clipped reads
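Soft-clipping is recorded in the CIGAR string of each alignment as `S` operations at either end. A minimal sketch of measuring it is shown below; it is not the logic of soft_clip_filter_out.py, and the `max_clip=10` threshold and function names are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clipped base counts, ignoring hard clips."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
    if not ops:
        return (0, 0)
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)


def passes_soft_clip_filter(cigar, max_clip=10):
    """Keep alignments whose longest soft clip does not exceed max_clip."""
    return max(soft_clip_lengths(cigar)) <= max_clip


for cigar in ["100M", "5S95M", "20S70M10S"]:
    print(cigar, soft_clip_lengths(cigar), passes_soft_clip_filter(cigar))
```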
- Counts editing events at each position
- Generates summary statistics
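Counting events per position amounts to tallying (chromosome, position) pairs. The sketch below assumes each retained alignment has already been reduced to one such pair; how the pipeline derives that position from an alignment is not shown here, and `count_sites` is a hypothetical name.

```python
from collections import Counter


def count_sites(alignment_positions):
    """Tally (chrom, pos) pairs and return sorted (chrom, pos, count) rows."""
    counts = Counter(alignment_positions)
    return [(chrom, pos, n) for (chrom, pos), n in sorted(counts.items())]


table = count_sites([("chr1", 12345), ("chr2", 67890), ("chr1", 12345)])
for chrom, pos, n in table:
    print(chrom, pos, n)
```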
For each sample, the pipeline generates:
- dedup_on_target_{sample}_filtered_r1_filtered.bam - Final on-target alignments
- dedup_off_target_{sample}_filtered_r1_filtered.bam - Final off-target alignments
- *_align_sites.txt - Junction count tables
The *_align_sites.txt files contain:
chromosome position count
chr1 12345 10
chr2 67890 5
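For downstream analysis, a table in this layout can be loaded and ranked by count. This is a sketch that assumes whitespace-separated columns and skips any row whose position or count is not an integer, such as a header line; `read_align_sites` is a hypothetical helper.

```python
def read_align_sites(lines):
    """Parse 'chromosome position count' rows, highest count first."""
    sites = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # header or malformed row
        sites.append((fields[0], int(fields[1]), int(fields[2])))
    return sorted(sites, key=lambda site: site[2], reverse=True)


sites = read_align_sites([
    "chromosome position count",
    "chr2 67890 5",
    "chr1 12345 10",
])
print(sites)
```

With a real file, the same function can be called on an open file handle, for example `read_align_sites(open(path))`.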
To change the indicator sequences, edit run_umi_pipeline_v5.sh:
indicator_seq_r1="YOUR_R1_SEQUENCE"
indicator_seq_r2="YOUR_R2_SEQUENCE"
- Alignment scoring: Edit filter_similar_sequences.py
- Soft-clipping threshold: Edit soft_clip_filter_out.py
- UMI pattern: Modify --bc-pattern in the pipeline script
"command not found" errors
- Check all dependencies are installed
- Activate conda environment:
conda activate offtarget
Memory errors during alignment
- Reduce thread count (default: 8)
- Use pre-built minimap2 index
Low read counts after filtering
- Verify indicator sequences match your library
- Check reference genome paths
- Review filtering thresholds
MIT License
For questions or issues:
- Open an issue on GitHub
- Contact: bbakgosu@snu.ac.kr
This pipeline leverages these excellent tools: