Code for analyzing long-read genotyping and off-target finding

BaeLab/PrimeAssemblyAnalysis

🧬 Off-target Analysis Pipeline for CRISPR/Prime Editing

📋 Overview

Welcome! This pipeline analyzes off-target effects in CRISPR or prime editing experiments from paired-end NGS data, using UMI (unique molecular identifier) deduplication.

✨ Key Features

  • 🔍 UMI-based deduplication for accurate quantification
  • 🎯 Dual alignment strategy against both on-target and genome-wide references
  • 🧹 Local alignment filtering to remove sequences similar to known contaminants
  • ✂️ Soft-clipping analysis to identify and remove primer artifacts
  • 🚀 Automated batch processing for multiple samples

🧬 Sequencing Library Design

Read Structure

  • R1: [Illumina Adapter] - [S + 8bp UMI] - [Genome-specific sequence]
  • R2: Complementary sequence with target region

For the detailed oligo design and library preparation protocol, see: 📖 GUIDE-seq Simplified Library Preparation Protocol

🔧 Requirements

System Requirements

  • Linux/Unix environment (tested on Ubuntu 22.04)
  • Python 3.7+
  • Bash shell
  • At least 16GB RAM (32GB recommended for human genome alignment)
  • ~50GB free disk space for reference genome and analysis

📦 Dependencies

Install the following tools at the pinned versions:

# Create conda environment
conda create -n offtarget python=3.8
conda activate offtarget

# Install core dependencies with specific versions
conda install -c bioconda umi_tools=1.1.6
conda install -c bioconda minimap2=2.28
conda install -c bioconda samtools=1.18
conda install -c bioconda pysam=0.21.0
conda install -c conda-forge biopython=1.83

# Or use the provided environment file
conda env create -f environment.yml

🗺️ Reference Genome

Download the human reference genome (GRCh38):

# Download GRCh38 reference genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

# Decompress
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

# Create minimap2 index (optional but recommended for speed 🚀)
minimap2 -d GRCh38.mmi GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

📁 Input Data Format

Sequencing Data

The pipeline requires paired-end NGS data in FASTQ format:

  • R1 file: *_R1_001.fastq[.gz] - Contains UMI and target sequence
  • R2 file: *_R2_001.fastq[.gz] - Contains complementary sequence

File naming convention: {sample_id}_S{sample_id}_L001_R{1,2}_001.fastq[.gz]

Example:

85_S85_L001_R1_001.fastq.gz
85_S85_L001_R2_001.fastq.gz
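
The naming convention above can be checked before a run. The following is a minimal sketch, not part of the pipeline itself: the function name `pair_fastq_files` and the regular expression are illustrative assumptions derived only from the convention stated above.

```python
import re
from collections import defaultdict

# Pattern for the convention described above (an assumption based on the README text):
# {sample_id}_S{sample_id}_L001_R{1,2}_001.fastq[.gz]
FASTQ_NAME = re.compile(
    r"^(?P<sample>[^_]+)_S(?P<snum>[^_]+)_L001_R(?P<read>[12])_001\.fastq(?:\.gz)?$"
)


def pair_fastq_files(filenames):
    """Group file names by sample ID and keep only samples with both R1 and R2."""
    grouped = defaultdict(dict)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # ignore files that do not follow the convention
        grouped[match.group("sample")]["R" + match.group("read")] = name
    return {s: p for s, p in grouped.items() if p.keys() >= {"R1", "R2"}}


pairs = pair_fastq_files([
    "85_S85_L001_R1_001.fastq.gz",
    "85_S85_L001_R2_001.fastq.gz",
    "86_S86_L001_R1_001.fastq.gz",  # no R2 mate, so this sample is dropped
    "notes.txt",                    # does not match the convention
])
print(pairs)
```

Samples that lack one mate are silently dropped in this sketch; a stricter check could raise an error instead.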

📄 Reference Sequences File (Optional)

For homology filtering, provide a tab- or space-separated file with three columns:

name    sequence    alignment_score_threshold
donor_1000bp    ATCGATCGATCG    130
control_seq     GCTAGCTAGCTA    100
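
As an illustration of how such a file could be read, here is a small sketch. The names `ReferenceEntry` and `parse_reference_lines` are hypothetical, and whether the real file carries the header row shown above is an assumption: this sketch simply skips any row whose third field is not numeric.

```python
from typing import Iterable, List, NamedTuple


class ReferenceEntry(NamedTuple):
    name: str
    sequence: str
    score_threshold: float


def parse_reference_lines(lines: Iterable[str]) -> List[ReferenceEntry]:
    """Parse 'name sequence threshold' rows separated by tabs or spaces."""
    entries = []
    for line in lines:
        fields = line.split()  # splits on any run of tabs or spaces
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        name, sequence, threshold = fields
        try:
            score = float(threshold)
        except ValueError:
            continue  # assumed to be a header row such as the one shown above
        entries.append(ReferenceEntry(name, sequence.upper(), score))
    return entries


example = [
    "name    sequence    alignment_score_threshold",
    "donor_1000bp    ATCGATCGATCG    130",
    "control_seq\tGCTAGCTAGCTA\t100",
]
for entry in parse_reference_lines(example):
    print(entry)
```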

⚙️ Installation

  1. Clone the repository:
git clone https://github.com/yourusername/offtarget-analysis.git
cd offtarget-analysis
  2. Make scripts executable:
chmod +x run_umi_pipeline_v5.sh
  3. Configure paths (Important! 🎯):
# Copy and edit the configuration template
cp config.template config.sh
nano config.sh

# Update these paths to match your system:
# - SCRIPT_DIR: Path to Python scripts
# - REF_DIR: Path to reference genomes  
# - GRCH38_REF: Path to GRCh38 reference

🚀 Usage

Basic Usage

./run_umi_pipeline_v5.sh <input_folder>

With Homology Filtering

./run_umi_pipeline_v5.sh <input_folder> <reference_sequences.txt>

💡 Examples

# Process all samples in the data directory
./run_umi_pipeline_v5.sh ./data/

# Process with homology filtering
./run_umi_pipeline_v5.sh ./data/ ./homology_sequences.txt

🔄 Pipeline Workflow

1. Parse meaningful reads 📝

  • Keeps only reads that contain the specified indicator sequences
  • Default indicators:
    • R1: CTACAAGAGCGGTGAGt
    • R2: GGTGCCAGAGGTATTGGCGCTAGGGTCA
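
The indicator check can be pictured with the sketch below. It is not the pipeline's own code: it assumes an exact, case-insensitive substring match and that both mates must contain their indicator, and the function names are illustrative.

```python
# Default indicator sequences as listed above.
R1_INDICATOR = "CTACAAGAGCGGTGAGt"
R2_INDICATOR = "GGTGCCAGAGGTATTGGCGCTAGGGTCA"


def has_indicator(read_seq: str, indicator: str) -> bool:
    """Case-insensitive exact substring test (an assumption of this sketch)."""
    return indicator.upper() in read_seq.upper()


def keep_pair(r1_seq: str, r2_seq: str,
              r1_indicator: str = R1_INDICATOR,
              r2_indicator: str = R2_INDICATOR) -> bool:
    """Keep a read pair only if each mate contains its indicator sequence."""
    return has_indicator(r1_seq, r1_indicator) and has_indicator(r2_seq, r2_indicator)


r1 = "GACGTACGT" + "CTACAAGAGCGGTGAGT" + "ACGTACGT"
r2 = "TT" + "GGTGCCAGAGGTATTGGCGCTAGGGTCA" + "GGCC"
print(keep_pair(r1, r2))          # both indicators present
print(keep_pair(r1, "ACGTACGT"))  # R2 indicator missing
```

An exact match discards reads with a sequencing error inside the indicator; tolerating mismatches would require an alignment-based check instead.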

2. Homology filtering 🔍 (Optional)

  • Removes reads similar to known sequences using local alignment
  • Checks both forward and reverse complement
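
To make the idea concrete, here is a dependency-free sketch of a local-alignment similarity check on both strands. It uses a plain Smith-Waterman score with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) and the function names are assumptions of this example and may differ from what filter_similar_sequences.py uses, so thresholds from the reference sequences file are not directly comparable.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_similar(read: str, reference: str, threshold: float) -> bool:
    """True if the read or its reverse complement scores at or above the threshold."""
    read, reference = read.upper(), reference.upper()
    score = max(
        local_alignment_score(read, reference),
        local_alignment_score(reverse_complement(read), reference),
    )
    return score >= threshold


reference = "ATCGATCGATCG"
read = "TTTT" + reference + "AAAA"
print(is_similar(read, reference, threshold=20))
print(is_similar("GGGGGGGG", "ATATATAT", threshold=4))
```

This quadratic-time loop is only meant to show the logic; for real read volumes a compiled aligner would be the practical choice.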

3. UMI extraction 🏷️

  • Extracts the 9 bp UMI region from R1 reads (S + 8 bp UMI; see Read Structure)
  • Adds UMI to read headers
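
The effect of this step on a single FASTQ record can be sketched as follows. This mimics the general behavior of moving a leading barcode into the read name; the function name `extract_umi` is illustrative, and it assumes the 9 bp UMI region sits at the very start of R1.

```python
UMI_LENGTH = 9  # S + 8 bp UMI, as described above


def extract_umi(header: str, sequence: str, quality: str,
                umi_length: int = UMI_LENGTH):
    """Move the leading UMI bases into the read name and trim sequence and quality."""
    if len(sequence) < umi_length:
        raise ValueError("read is shorter than the UMI length")
    umi = sequence[:umi_length]
    # Append the UMI to the read name, i.e. before the first space in the header.
    name, sep, comment = header.partition(" ")
    new_header = f"{name}_{umi}{sep}{comment}"
    return new_header, sequence[umi_length:], quality[umi_length:]


header, seq, qual = extract_umi("@read1 1:N:0:85", "GACGTACGTTTGGCC", "I" * 15)
print(header)
print(seq)
print(qual)
```

Keeping the UMI in the read name lets it survive alignment, so the deduplication step can read it back from the BAM record.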

4. Alignment 🎯

  • Aligns to on-target reference
  • Aligns to human genome (GRCh38)

5. Filter & Deduplicate 🧹

  • Retains properly paired reads
  • Removes PCR duplicates based on UMI
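
The principle behind UMI deduplication can be shown with a simplified sketch: reads that share the same mapping position, strand, and UMI are treated as PCR copies of one molecule. This example matches UMIs exactly and keeps the read with the highest mapping quality; UMI-tools additionally groups UMIs that differ by sequencing errors, which is not reproduced here. The dictionary keys are illustrative.

```python
def deduplicate(reads):
    """Keep one read per (chrom, pos, strand, umi); the highest MAPQ wins."""
    best = {}
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"], read["umi"])
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())


reads = [
    {"name": "a", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "AAAAAAAAA", "mapq": 60},
    {"name": "b", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "AAAAAAAAA", "mapq": 30},  # duplicate of a
    {"name": "c", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "CCCCCCCCC", "mapq": 60},  # same site, new UMI
    {"name": "d", "chrom": "chr2", "pos": 500, "strand": "-", "umi": "AAAAAAAAA", "mapq": 20},  # same UMI, new site
]
unique = deduplicate(reads)
print([r["name"] for r in unique])
```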

6. Artifact removal ✂️

  • Removes primer artifacts
  • Filters soft-clipped reads
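
A soft-clip filter typically inspects the CIGAR string of each alignment. The sketch below is one way to do that and is not taken from soft_clip_filter_out.py: the function names and the `max_clip` value of 10 are hypothetical.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar: str):
    """Return (left, right) soft-clip lengths, ignoring flanking hard clips."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return 0, 0
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def passes_soft_clip_filter(cigar: str, max_clip: int = 10) -> bool:
    """Keep alignments whose longest soft clip does not exceed max_clip (hypothetical value)."""
    return max(soft_clip_lengths(cigar)) <= max_clip


for cigar in ["100M", "5S95M", "20S60M20S", "3H5S92M"]:
    print(cigar, soft_clip_lengths(cigar), passes_soft_clip_filter(cigar))
```

Long soft clips at a read end often indicate primer-derived sequence that does not match the genome, which is why they are a useful signal for artifact removal.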

7. Quantification 📊

  • Counts editing events at each position
  • Generates summary statistics

📊 Output Files

For each sample, the pipeline generates:

✅ Final Analysis Files

  • dedup_on_target_{sample}_filtered_r1_filtered.bam - Final on-target alignments
  • dedup_off_target_{sample}_filtered_r1_filtered.bam - Final off-target alignments
  • *_align_sites.txt - Junction count tables

📈 Output Format

The *_align_sites.txt files contain:

chromosome    position    count
chr1          12345       10
chr2          67890       5
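
For downstream analysis, a table in this format can be loaded with a few lines of Python. This is a sketch under the assumption that the columns are whitespace-separated as shown; it skips any row whose position or count is not an integer, so it works whether or not the file includes the header row. The function names are illustrative.

```python
def parse_align_sites(lines):
    """Parse 'chromosome position count' rows into (chrom, pos, count) tuples."""
    sites = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # skips the header row and blank or malformed lines
        sites.append((fields[0], int(fields[1]), int(fields[2])))
    return sites


def top_sites(sites, n=10):
    """Return the n sites with the highest counts."""
    return sorted(sites, key=lambda site: site[2], reverse=True)[:n]


example = [
    "chromosome    position    count",
    "chr1          12345       10",
    "chr2          67890       5",
]
sites = parse_align_sites(example)
print(sites)
print("total junctions:", sum(count for _, _, count in sites))
print("top site:", top_sites(sites, n=1))
```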

🛠️ Customization

Modifying Indicator Sequences

Edit run_umi_pipeline_v5.sh:

indicator_seq_r1="YOUR_R1_SEQUENCE"
indicator_seq_r2="YOUR_R2_SEQUENCE"  

Adjusting Filtering Parameters

  • Alignment scoring: Edit filter_similar_sequences.py
  • Soft-clipping threshold: Edit soft_clip_filter_out.py
  • UMI pattern: Modify --bc-pattern in the pipeline script

🐛 Troubleshooting

Common Issues & Solutions

🔴 "command not found" errors

  • Check that all dependencies are installed
  • Activate conda environment: conda activate offtarget

🔴 Memory errors during alignment

  • Reduce the thread count (default: 8)
  • Use a pre-built minimap2 index

🔴 Low read counts after filtering

  • Verify indicator sequences match your library
  • Check reference genome paths
  • Review filtering thresholds

📄 License

MIT License

📧 Contact

For questions or issues:

🙏 Acknowledgments

This pipeline leverages these excellent tools:

