This repository contains a modular and scalable Snakemake workflow for analyzing CUT&RUN (or ChIP-seq) data.
- 🧠 Automatic Sample Detection
  Supports various naming conventions, including `_R1.fastq.gz`, `_1.fastq.gz`, `.fastq.gz`, `_R1_001.fastq.gz`, etc.
- 🔁 SE/PE Mode Auto-Detection
  Automatically routes samples through the correct pipeline depending on whether the data is single-end or paired-end.
- ⚙️ Flexible and Configurable
  Centralized `config.yaml` to set input paths, number of threads, STAR index, genome size, bin size, and more.
- 🧬 Multimapping Handling
  Retains multi-mapping reads during STAR alignment, and includes a post-mapping `multimap_weight` function to adjust for `NH` tag weights (for accurate peak calling).
- 🚫 Blacklist Filtering (Optional)
  When `filter_blacklist: true` is set in `config.yaml`, ENCODE blacklist regions are automatically downloaded (based on genome) and applied to `bamCoverage` using `--blackListFileName`. This step replaces the older repeat masking logic.
- 📊 BigWig Generation with Normalization
  Converts BAM to bigWig using `deeptools` with and without normalization (e.g., RPKM), while excluding PCR duplicates (`--samFlagExclude 1024`).
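
The sample detection and SE/PE routing described in the features above can be sketched in a few lines of Python. This is only an illustration, not the workflow's actual code: the regular expression, the function name `detect_samples`, and the returned dictionary layout are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical pattern covering the suffixes listed above
# (_R1.fastq.gz, _1.fastq.gz, .fastq.gz, _R1_001.fastq.gz).
# The pattern used by the real workflow may differ.
FASTQ_RE = re.compile(
    r"^(?P<sample>.+?)(?:_R?(?P<mate>[12])(?:_001)?)?\.fastq\.gz$"
)


def detect_samples(filenames):
    """Group FASTQ file names by sample and label each sample SE or PE."""
    mates = defaultdict(dict)
    for name in filenames:
        match = FASTQ_RE.match(name)
        if match is None:
            continue  # not a recognized FASTQ file name
        mate = match.group("mate") or "1"  # no mate suffix: treat as read 1
        mates[match.group("sample")][mate] = name

    samples = {}
    for sample, files in mates.items():
        # Paired-end only when both mates were found for the same sample.
        mode = "PE" if {"1", "2"} <= files.keys() else "SE"
        samples[sample] = {"mode": mode, "files": [files[k] for k in sorted(files)]}
    return samples


if __name__ == "__main__":
    found = detect_samples(
        ["A_R1.fastq.gz", "A_R2.fastq.gz", "B.fastq.gz", "D_R1_001.fastq.gz"]
    )
    for sample, info in found.items():
        print(sample, info["mode"], info["files"])
```

One limitation of this sketch: a single-end file whose sample name itself ends in `_1` or `_2` is indistinguishable from a mate file by name alone, so a real implementation needs a rule for that case.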
- Sample Detection
  Automatically detects sample names based on filenames.
- Quality Trimming
  Uses `fastp` to trim adapters and remove low-quality reads.
- Alignment
  Aligns reads to the reference genome using `STAR`, retaining up to 100 multi-mapped hits.
- Multimap Weighting
  Applies fractional weighting to multi-mapped reads based on their `NH` tag values.
- Blacklist Filtering (Optional)
  Filters signal from known artefact regions via the ENCODE blacklist when `filter_blacklist` is enabled.
- BigWig Conversion
  Generates normalized (RPKM) and unnormalized bigWig files for visualization.
- Peak Calling
  Uses `MACS3` to call peaks from the aligned BAM files.
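
To make the multimap weighting step more concrete, here is a minimal sketch. It assumes each alignment of a read reported at `NH` loci is given weight `1/NH`, which is a common convention but is not confirmed by this README as the exact formula used. The function bodies, the `weighted_bin_counts` helper, and the `(chrom, start, nh)` tuple layout are hypothetical; in practice the `NH` value would be read from the BAM records.

```python
from collections import defaultdict


def multimap_weight(nh):
    """Fractional weight for one alignment of a read reported at `nh` loci.

    Assumes a 1/NH scheme; the workflow's own function may differ.
    """
    if nh < 1:
        raise ValueError("NH must be >= 1")
    return 1.0 / nh


def weighted_bin_counts(alignments, bin_size):
    """Sum fractional weights per (chrom, bin index).

    `alignments` is an iterable of (chrom, start, nh) tuples, a simplified
    stand-in for BAM records carrying an NH tag.
    """
    counts = defaultdict(float)
    for chrom, start, nh in alignments:
        counts[(chrom, start // bin_size)] += multimap_weight(nh)
    return dict(counts)


if __name__ == "__main__":
    demo = [("chr1", 10, 1), ("chr1", 20, 2), ("chr2", 120, 2)]
    print(weighted_bin_counts(demo, bin_size=100))
```

With this scheme a uniquely mapped read contributes a full count, while a read mapped to two loci contributes half a count at each, so the total signal per read stays at one.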
- Clone the repository:

      git clone https://github.com/Shall-We-Dance/CUTRUN_smk.git
      cd CUTRUN_smk

- Edit `./config/config.yaml` to specify your paths and parameters.

- Activate Snakemake and run the pipeline:

      snakemake --use-conda --cores 16

CUTRUN_smk/
├── config/
│   └── config.yaml           # Main configuration file
│
├── workflow/
│   ├── Snakefile             # Entry point Snakefile
│   ├── rules/                # Modular rule files
│   │   ├── fastp.smk
│   │   ├── star.smk
│   │   ├── macs3.smk
│   │   ├── bam_to_bigwig.smk
│   │   └── detect_samples.smk
│   └── envs/                 # Conda environments
│       ├── fastp.yaml
│       ├── star.yaml
│       ├── macs3.yaml
│       ├── bedtools.yaml
│       └── deeptools.yaml
│
├── results/                  # Final and intermediate output files
│   ├── fastp/
│   ├── star/
│   └── bigwig/
│
├── logs/                     # Log files for each step
│   ├── fastp/
│   ├── star/
│   └── bigwig/
│
├── resources/                # Resource files for each step
│   └── blacklist/
│
├── LICENSE
└── README.md
- STAR genome index must be prebuilt using `STAR --runMode genomeGenerate`.
- For blacklist functionality, the genome name must match one recognized by ENCODE (e.g., `hg38`, `mm10`).
- `samtools`, `deeptools`, and other tools will auto-scale to the number of available threads (default `max/4`).
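
As a rough illustration of how the blacklist and thread notes above could translate into a `bamCoverage` call, the sketch below assembles an argument list in Python. The helper names `default_threads` and `bamcoverage_args`, the file paths, and the reading of `max/4` as "one quarter of the available cores, at least one" are assumptions for this example. The flags themselves (`--binSize`, `--samFlagExclude`, `--numberOfProcessors`, `--normalizeUsing`, `--blackListFileName`) are standard `bamCoverage` options.

```python
import os


def default_threads(cpu_count=None):
    """One quarter of the available cores, never below 1.

    This is an assumed interpretation of the `max/4` default.
    """
    total = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, total // 4)


def bamcoverage_args(bam, bigwig, bin_size, threads, blacklist=None, normalize=None):
    """Build a bamCoverage argument list; optional flags are added only when set."""
    args = [
        "bamCoverage",
        "-b", bam,
        "-o", bigwig,
        "--binSize", str(bin_size),
        "--samFlagExclude", "1024",  # skip reads flagged as PCR duplicates
        "--numberOfProcessors", str(threads),
    ]
    if normalize:
        args += ["--normalizeUsing", normalize]  # e.g. "RPKM"
    if blacklist:
        args += ["--blackListFileName", blacklist]
    return args


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = bamcoverage_args(
        "results/star/sample.bam",
        "results/bigwig/sample.RPKM.bw",
        bin_size=10,
        threads=default_threads(),
        blacklist="resources/blacklist/hg38.bed",
        normalize="RPKM",
    )
    print(" ".join(cmd))
```

Building the command as a list keeps the optional blacklist and normalization flags out of the call entirely when they are disabled, which mirrors the behavior described for `filter_blacklist`.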
MIT License
For questions, issues, or contributions, please open an issue or pull request on GitHub.