🧬 Off-target Analysis Pipeline for CRISPR/Prime Editing

📋 Overview

Welcome! This pipeline analyzes off-target effects in CRISPR or Prime Editing experiments from paired-end NGS data, using UMI (Unique Molecular Identifier) deduplication.

✨ Key Features

🔍 UMI-based deduplication for accurate quantification
🎯 Dual alignment strategy against both on-target and genome-wide references
🧹 Local alignment filtering to remove sequences similar to known contaminants
✂️ Soft-clipping analysis to identify and remove primer artifacts
🚀 Automated batch processing for multiple samples

🧬 Sequencing Library Design

Read Structure

R1: [Illumina Adapter] - [S + 8bp UMI] - [Genome-specific sequence]
R2: Complementary sequence with target region
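
To make the R1 layout concrete, here is a minimal sketch that splits a read into its barcode and genome-specific part. It assumes that "S" is the IUPAC code for G or C and that the 9-base barcode sits at the very start of R1 (adapter not present in the read). Neither point is stated above, and `split_r1` is a hypothetical helper, not part of the pipeline.

```python
import re

# Assumption: "S" means G or C (IUPAC), followed by 8 UMI bases,
# and the barcode is the first thing in the read.
BARCODE_RE = re.compile(r"^([GC][ACGT]{8})([ACGTN]+)$")


def split_r1(seq):
    """Return (barcode, genomic_part), or None if R1 does not fit the layout."""
    match = BARCODE_RE.match(seq.upper())
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(split_r1("GACGTACGTTTTTAAAA"))
```

Reads that return `None` either start with A/T or carry an ambiguous base in the barcode, so they would not fit the assumed layout.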

For detailed oligo design and library preparation protocol, please refer to: 📖 GUIDE-seq Simplified Library Preparation Protocol

🔧 Requirements

System Requirements

Linux/Unix environment (tested on Ubuntu 22.04)
Python 3.7+
Bash shell
At least 16GB RAM (32GB recommended for human genome alignment)
~50GB free disk space for reference genome and analysis

📦 Dependencies

Install the following tools with specific versions:

# Create conda environment
conda create -n offtarget python=3.8
conda activate offtarget

# Install core dependencies with specific versions
conda install -c bioconda umi_tools=1.1.6
conda install -c bioconda minimap2=2.28
conda install -c bioconda samtools=1.18
conda install -c bioconda pysam=0.21.0
conda install -c conda-forge biopython=1.83

# Or use the provided environment file
conda env create -f environment.yml

🗺️ Reference Genome

Download the human reference genome (GRCh38):

# Download GRCh38 reference genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

# Decompress
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

# Create minimap2 index (optional but recommended for speed 🚀)
minimap2 -d GRCh38.mmi GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

📁 Input Data Format

Sequencing Data

The pipeline requires paired-end NGS data in FASTQ format:

R1 file: *_R1_001.fastq[.gz] - Contains UMI and target sequence
R2 file: *_R2_001.fastq[.gz] - Contains complementary sequence

File naming convention: {sample_id}_S{sample_id}_L001_R{1,2}_001.fastq[.gz]

Example:

85_S85_L001_R1_001.fastq.gz
85_S85_L001_R2_001.fastq.gz
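
If you want to check which samples in a folder form complete R1/R2 pairs before launching a run, something along these lines may help. It is only a sketch: `pair_fastqs` is a hypothetical helper, and the pattern accepts any lane number (`L\d{3}`) although only `L001` appears in the convention above.

```python
import re
from pathlib import Path

# Mirrors {sample_id}_S{n}_L001_R{1,2}_001.fastq[.gz]; the lane is generalized.
FASTQ_RE = re.compile(
    r"^(?P<sample>.+)_S\d+_L\d{3}_R(?P<read>[12])_001\.fastq(\.gz)?$"
)


def pair_fastqs(filenames):
    """Group file names by sample ID and keep only complete R1/R2 pairs."""
    found = {}
    for name in filenames:
        match = FASTQ_RE.match(Path(name).name)
        if match:
            found.setdefault(match.group("sample"), {})["R" + match.group("read")] = name
    return {s: (p["R1"], p["R2"]) for s, p in found.items() if len(p) == 2}


if __name__ == "__main__":
    files = ["85_S85_L001_R1_001.fastq.gz", "85_S85_L001_R2_001.fastq.gz"]
    print(pair_fastqs(files))
```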

📄 Reference Sequences File (Optional)

For homology filtering, provide a tab- or space-separated file with one reference sequence per line:

name    sequence    alignment_score_threshold
donor_1000bp    ATCGATCGATCG    130
control_seq     GCTAGCTAGCTA    100
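
As a quick sanity check of this file before a run, the three columns can be parsed as sketched below. This is not the pipeline's own parser: `parse_reference_sequences` is a hypothetical helper, and it assumes that a first line whose third column is not numeric is a header to skip.

```python
def parse_reference_sequences(text):
    """Parse 'name sequence threshold' lines into (name, sequence, threshold)."""
    refs = []
    for line in text.splitlines():
        fields = line.split()  # splits on tabs and on runs of spaces
        if len(fields) != 3:
            continue  # blank or malformed line
        name, seq, threshold = fields
        try:
            score = float(threshold)
        except ValueError:
            continue  # assumed header line, e.g. "alignment_score_threshold"
        refs.append((name, seq.upper(), score))
    return refs


if __name__ == "__main__":
    example = "name\tsequence\talignment_score_threshold\ndonor_1000bp\tATCGATCGATCG\t130\n"
    print(parse_reference_sequences(example))
```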

⚙️ Installation

Clone the repository:

git clone https://github.com/yourusername/offtarget-analysis.git
cd offtarget-analysis

Make scripts executable:

chmod +x run_umi_pipeline_v5.sh

Configure paths (Important! 🎯):

# Copy and edit the configuration template
cp config.template config.sh
nano config.sh

# Update these paths to match your system:
# - SCRIPT_DIR: Path to Python scripts
# - REF_DIR: Path to reference genomes  
# - GRCH38_REF: Path to GRCh38 reference

🚀 Usage

Basic Usage

./run_umi_pipeline_v5.sh <input_folder>

With Homology Filtering

./run_umi_pipeline_v5.sh <input_folder> <reference_sequences.txt>

💡 Examples

# Process all samples in the data directory
./run_umi_pipeline_v5.sh ./data/

# Process with homology filtering
./run_umi_pipeline_v5.sh ./data/ ./homology_sequences.txt

🔄 Pipeline Workflow

1. Parse meaningful reads 📝

Keeps only reads that contain the specified indicator sequences
Default indicators:
- R1: CTACAAGAGCGGTGAGt
- R2: GGTGCCAGAGGTATTGGCGCTAGGGTCA
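
The idea behind this step can be sketched as a plain substring test. The sketch assumes exact, case-insensitive matching and that both mates must carry their indicator. The real script may be more tolerant (for example, allowing mismatches), and `keep_pair` is a hypothetical name.

```python
# Default indicators from this README, upper-cased for comparison.
INDICATOR_R1 = "CTACAAGAGCGGTGAGT"
INDICATOR_R2 = "GGTGCCAGAGGTATTGGCGCTAGGGTCA"


def has_indicator(seq, indicator):
    """Case-insensitive exact substring match (assumption: no mismatches allowed)."""
    return indicator.upper() in seq.upper()


def keep_pair(r1_seq, r2_seq, ind_r1=INDICATOR_R1, ind_r2=INDICATOR_R2):
    """Keep a read pair only if each mate contains its indicator sequence."""
    return has_indicator(r1_seq, ind_r1) and has_indicator(r2_seq, ind_r2)


if __name__ == "__main__":
    r1 = "GACGTACGT" + INDICATOR_R1 + "AAAA"
    r2 = "TT" + INDICATOR_R2 + "CC"
    print(keep_pair(r1, r2))
```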

2. Homology filtering 🔍 (Optional)

Removes reads similar to the known sequences, judged by local alignment score against each sequence's alignment_score_threshold
Checks both forward and reverse complement
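
For intuition, the forward and reverse-complement check can be sketched with a small pure-Python Smith-Waterman scorer. The scoring values (match 2, mismatch -1, gap -2) and the "score >= threshold" rule are assumptions made for illustration. filter_similar_sequences.py may use a different aligner and different parameters.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with linear gap penalties."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def is_similar(read, ref, threshold):
    """True if the read or its reverse complement scores at least the threshold."""
    read, ref = read.upper(), ref.upper()
    return max(local_score(read, ref), local_score(revcomp(read), ref)) >= threshold


if __name__ == "__main__":
    print(is_similar("TTCCTGTAATCAA", "GATTACAGG", 18))
```

The quadratic loop is fine for a handful of short references but would be slow on millions of reads. It is meant to show the logic, not to replace an optimized aligner.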

3. UMI extraction 🏷️

Extracts the 9 bp barcode (S + 8 bp UMI) from R1 reads
Adds UMI to read headers
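
The effect of this step on a single FASTQ record can be sketched as follows. UMI-tools appends the extracted barcode to the read name after an underscore, and this sketch imitates that for a fixed-length barcode at the 5' end of R1. `extract_umi` is a hypothetical stand-in for illustration, not the command the pipeline runs.

```python
def extract_umi(header, seq, qual, umi_len=9):
    """Move the first umi_len bases of a read into its header.

    Returns (new_header, trimmed_seq, trimmed_qual), or None if the read
    is too short to contain a barcode plus at least one further base.
    """
    if len(seq) <= umi_len:
        return None
    umi = seq[:umi_len]
    name, _, comment = header.partition(" ")
    new_header = f"{name}_{umi}" + (f" {comment}" if comment else "")
    return new_header, seq[umi_len:], qual[umi_len:]


if __name__ == "__main__":
    print(extract_umi("@read1 1:N:0:85", "GACGTACGTTTTTAAAA", "I" * 17))
```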

4. Alignment 🎯

Aligns to on-target reference
Aligns to human genome (GRCh38)

5. Filter & Deduplicate 🧹

Retains properly paired reads
Removes PCR duplicates based on UMI
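
Conceptually, UMI deduplication collapses reads that share a mapping position and a UMI. The sketch below uses exact-match grouping only. UMI-tools' default method additionally merges UMIs that differ by sequencing errors, which this simplified version does not attempt. The tuple layout and `dedup_by_umi` are assumptions made for illustration.

```python
def dedup_by_umi(reads):
    """Keep the first read seen for each (chrom, pos, strand, umi) combination.

    reads: iterable of (read_name, chrom, pos, strand, umi) tuples.
    """
    kept = {}
    for read in reads:
        _name, chrom, pos, strand, umi = read
        kept.setdefault((chrom, pos, strand, umi), read)
    return list(kept.values())


if __name__ == "__main__":
    reads = [
        ("r1", "chr1", 12345, "+", "GACGTACGT"),
        ("r2", "chr1", 12345, "+", "GACGTACGT"),  # PCR duplicate of r1
        ("r3", "chr1", 12345, "+", "CTTTTAAAA"),  # same site, different molecule
    ]
    print(dedup_by_umi(reads))
```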

6. Artifact removal ✂️

Removes primer artifacts
Filters out reads whose soft-clipping exceeds the threshold set in soft_clip_filter_out.py
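
How much of a read is soft-clipped can be read directly from its CIGAR string, as in the sketch below. The 20% cutoff and the function names are hypothetical. Check soft_clip_filter_out.py for the criterion the pipeline actually applies.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_fraction(cigar):
    """Fraction of read bases that are soft-clipped (0.0 for '*' or empty)."""
    ops = CIGAR_RE.findall(cigar)
    # M, I, S, = and X consume read bases; D, N, H and P do not.
    read_len = sum(int(n) for n, op in ops if op in "MIS=X")
    clipped = sum(int(n) for n, op in ops if op == "S")
    return clipped / read_len if read_len else 0.0


def passes_soft_clip_filter(cigar, max_fraction=0.2):
    """Hypothetical rule: keep reads with at most max_fraction soft-clipped."""
    return soft_clip_fraction(cigar) <= max_fraction


if __name__ == "__main__":
    for cigar in ("100M", "10S90M", "30S70M"):
        print(cigar, soft_clip_fraction(cigar), passes_soft_clip_filter(cigar))
```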

7. Quantification 📊

Counts editing events at each position
Generates summary statistics
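
Counting events per position amounts to tallying (chromosome, position) pairs over the deduplicated reads. A minimal sketch, assuming each retained read has already been reduced to one such pair (`count_sites` is a hypothetical helper, and which coordinate the pipeline reports is not specified here):

```python
from collections import Counter


def count_sites(alignments):
    """Tally reads per (chromosome, position)."""
    return Counter((chrom, pos) for chrom, pos in alignments)


if __name__ == "__main__":
    hits = [("chr1", 12345), ("chr1", 12345), ("chr2", 67890)]
    for (chrom, pos), n in sorted(count_sites(hits).items()):
        print(chrom, pos, n, sep="\t")
```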

📊 Output Files

For each sample, the pipeline generates:

✅ Final Analysis Files

dedup_on_target_{sample}_filtered_r1_filtered.bam - Final on-target alignments
dedup_off_target_{sample}_filtered_r1_filtered.bam - Final off-target alignments
*_align_sites.txt - Junction count tables

📈 Output Format

The *_align_sites.txt files contain:

chromosome    position    count
chr1          12345       10
chr2          67890       5
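
For downstream analysis, a table in this format can be loaded and ranked with a few lines of Python. This sketch assumes whitespace-separated columns and skips any line whose position or count is not an integer, such as a header row. `load_sites` is a hypothetical helper.

```python
def load_sites(text):
    """Parse an *_align_sites.txt table into (chrom, pos, count), highest count first."""
    sites = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # header, blank, or malformed line
        sites.append((fields[0], int(fields[1]), int(fields[2])))
    return sorted(sites, key=lambda site: site[2], reverse=True)


if __name__ == "__main__":
    table = "chromosome\tposition\tcount\nchr2\t67890\t5\nchr1\t12345\t10\n"
    print(load_sites(table))
```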

🛠️ Customization

Modifying Indicator Sequences

Edit run_umi_pipeline_v5.sh:

indicator_seq_r1="YOUR_R1_SEQUENCE"
indicator_seq_r2="YOUR_R2_SEQUENCE"

Adjusting Filtering Parameters

Alignment scoring: Edit filter_similar_sequences.py
Soft-clipping threshold: Edit soft_clip_filter_out.py
UMI pattern: Modify --bc-pattern in the pipeline script

🐛 Troubleshooting

Common Issues & Solutions

🔴 "command not found" errors

Check all dependencies are installed
Activate conda environment: conda activate offtarget

🔴 Memory errors during alignment

Reduce thread count (default: 8)
Use pre-built minimap2 index

🔴 Low read counts after filtering

Verify indicator sequences match your library
Check reference genome paths
Review filtering thresholds

📄 License

MIT License

📧 Contact

For questions or issues:

🐛 Open an issue on GitHub
📧 Contact: bbakgosu@snu.ac.kr

🙏 Acknowledgments

This pipeline leverages these excellent tools: UMI-tools, minimap2, SAMtools, pysam, and Biopython (see Dependencies).
