MissingInputException in rule experiment_statistic_assigned_counts_combine_stats_dna_rna_merge_all #223

@graceoualline

Hello,
I'm encountering a MissingInputException when running the experiment pipeline. The error occurs during DAG building, before any jobs execute. This appears to be a workflow structure issue, but I've been unable to identify the root cause despite extensive debugging.

Error

Here is my command and error:

$ snakemake --sdm conda --configfile config_MS_MPRA_combined.yaml -c 30 -j 35  --workflow-profile profiles/MS_mpra/ --executor slurm --resources mem_mb=60000 --rerun-incomplete --keep-going -n --quiet rules
Using workflow specific profile /home/go274/project/MPRAsnakeflow/profiles/MS_mpra/ for setting default command line arguments.
Running MPRAsnakeflow version 0.5.4
host: r814u03n01.mccleary.ycrc.yale.edu
Building DAG of jobs...
MissingInputException in rule experiment_statistic_assigned_counts_combine_stats_dna_rna_merge_all in file "/home/go274/project/MPRAsnakeflow/workflow/rules/experiment/statistic/assigned_counts.smk", line 121:
Missing input files for rule experiment_statistic_assigned_counts_combine_stats_dna_rna_merge_all:
    output: results/experiments/MS_mpra_exp_bbmap/statistic/statistic_assigned_counts_merged_MS_mpra_assign_bbmap_default.tsv
    wildcards: project=MS_mpra_exp_bbmap, assignment=MS_mpra_assign_bbmap, config=default
    affected files:
        results/experiments/MS_mpra_exp_bbmap/statistic/assigned_counts/MS_mpra_assign_bbmap/default/combined/jurkat_0hr_merged_assigned_counts.statistic.tsv.gz
        results/experiments/MS_mpra_exp_bbmap/statistic/assigned_counts/MS_mpra_assign_bbmap/default/combined/jurkat_12hr_merged_assigned_counts.statistic.tsv.gz
        results/experiments/MS_mpra_exp_bbmap/statistic/assigned_counts/MS_mpra_assign_bbmap/default/combined/jurkat_48hr_merged_assigned_counts.statistic.tsv.gz
        results/experiments/MS_mpra_exp_bbmap/statistic/assigned_counts/MS_mpra_assign_bbmap/default/combined/jurkat_24hr_merged_assigned_counts.statistic.tsv.gz

As I understand it, this error occurs because Snakemake concludes that files matching results/experiments/{{project}}/statistic/assigned_counts/{{assignment}}/{{config}}/combined/{condition}_merged_assigned_counts.statistic.tsv.gz will never be created. The code trace below ends at the step whose output is this missing file.
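To make the pattern above concrete, here is a minimal sketch in plain Python (it does not call Snakemake) of how the four affected paths arise from it. It assumes the doubled braces are escapes, as in Python's str.format and in Snakemake's expand(), so that only {condition} is filled in first and the remaining wildcards are resolved per job. The helper names are hypothetical; the condition names and wildcard values are copied from the error message.

```python
# Input pattern quoted above; doubled braces are assumed to be escapes.
INPUT_PATTERN = (
    "results/experiments/{{project}}/statistic/assigned_counts/"
    "{{assignment}}/{{config}}/combined/"
    "{condition}_merged_assigned_counts.statistic.tsv.gz"
)

# Conditions taken from the "affected files" list in the error message.
CONDITIONS = ["jurkat_0hr", "jurkat_12hr", "jurkat_48hr", "jurkat_24hr"]

# Wildcard values reported by Snakemake for the failing job.
WILDCARDS = {
    "project": "MS_mpra_exp_bbmap",
    "assignment": "MS_mpra_assign_bbmap",
    "config": "default",
}


def expand_conditions(pattern, conditions):
    """Fill in {condition} only; '{{x}}' collapses to the wildcard '{x}'."""
    return [pattern.format(condition=c) for c in conditions]


def resolve(patterns, wildcards):
    """Substitute the per-job wildcard values into each pattern."""
    return [p.format(**wildcards) for p in patterns]


per_condition = expand_conditions(INPUT_PATTERN, CONDITIONS)
affected = resolve(per_condition, WILDCARDS)
for path in affected:
    print(path)
```

The printed paths are the ones Snakemake reports as missing, so the requested names themselves are consistent with the pattern.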

Code Trace

I traced the workflow for this bug, working backward from experiment_statistic_assigned_counts_combine_stats_dna_rna_merge_all. The farthest back I got was getFinalCounts in MPRAsnakeflow/workflow/rules/common.smk. The trace below is listed in execution order: the output of each step is the file that is fed into the next step.

  1. getFinalCounts() (workflow/rules/common.smk)
    • Output: results/experiments/{project}/%s/{condition}_%s_%s_final_counts[.sampling|].{config}.tsv.gz

  2. experiment_counts_dna_rna_merge_counts (workflow/rules/experiment/counts.smk)
    • Input: calls getFinalCounts() for DNA and RNA
    • Output: results/experiments/{project}/{raw_or_assigned}/{condition}_{replicate}.merged.config.{config}.tsv.gz

  3. experiment_assigned_counts_dna_rna_merge (workflow/rules/experiment/assigned_counts.smk)
    • Input: results/experiments/{project}/counts/{condition}_{replicate}.merged.config.{config}.tsv.gz
    • Output: results/experiments/{project}/statistic/assigned_counts/{assignment}/{config}/{condition}_{replicate}_merged_assigned_counts.statistic.tsv.gz

  4. experiment_statistic_assigned_counts_combine_stats_dna_rna_merge (workflow/rules/experiment/statistic/assigned_counts.smk)
    • Input: results/experiments/{project}/statistic/assigned_counts/{assignment}/{config}/{condition}_{replicate}_merged_assigned_counts.statistic.tsv.gz
    • Output: results/experiments/{project}/statistic/assigned_counts/{assignment}/{config}/combined/{condition}_merged_assigned_counts.statistic.tsv.gz

  5. experiment_statistic_assigned_counts_combine_stats_dna_rna_merge_all (line 121)
    • MISSING INPUT: results/experiments/{project}/statistic/assigned_counts/{assignment}/{config}/combined/{condition}_merged_assigned_counts.statistic.tsv.gz
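A side check on the patterns in steps 3 and 4: {condition} and {replicate} are separated only by an underscore, and my jurkat condition names contain underscores themselves, while K562, which has none, does not appear among the affected files. The sketch below is plain Python, not Snakemake's own matcher. It assumes both wildcards may match any non-empty text and lists every way such a file name could be split; possible_splits is a hypothetical helper. Whether the workflow constrains these wildcards, and whether this is related to the error at all, is something I have not confirmed.

```python
# Fixed tail of the step 3 output / step 4 input file name.
SUFFIX = "_merged_assigned_counts.statistic.tsv.gz"


def possible_splits(filename):
    """List every (condition, replicate) pair that
    '{condition}_{replicate}' + SUFFIX could produce for this file name,
    assuming both wildcards accept any non-empty string."""
    if not filename.endswith(SUFFIX):
        return []
    stem = filename[: -len(SUFFIX)]
    return [
        (stem[:i], stem[i + 1:])
        for i, ch in enumerate(stem)
        if ch == "_" and 0 < i < len(stem) - 1
    ]


for name in ("jurkat_0hr_1" + SUFFIX, "K562_1" + SUFFIX):
    print(name, possible_splits(name))
```

For the jurkat file the list has two entries, ("jurkat", "0hr_1") and ("jurkat_0hr", "1"); for the K562 file there is only ("K562", "1").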

Attempted Fixes

  1. Fixed wildcard inconsistency: The output from experiment_counts_dna_rna_merge_counts uses {raw_or_assigned} while the input to experiment_assigned_counts_dna_rna_merge hardcodes counts. I replaced all instances of {raw_or_assigned} with counts in experiment_counts_dna_rna_merge_counts. Issue persists.

  2. Commented out statistics targets: I commented out all instances of statistic_assigned_counts_merged in the main Snakefile (lines 160, 309-313, 400-404) to prevent the rule from being triggered as a default target. Issue persists.

  3. Cleared Snakemake cache: Removed .snakemake/ directory and partial output files. Issue persists.

Questions

I am at a loss as to how to debug this, as I am not familiar with Snakemake workflows. I ran the ENCODE and other example data through this pipeline and did not hit this bug.
Any guidance on debugging this would be greatly appreciated. Thank you!
Grace

Relevant Files

Config:

version: "0.5"
experiments:
  MS_mpra_exp_bbmap:
    bc_length: 20
    data_folder: /home/go274/palmer_scratch/practice/MS_mpra_exp_data
    experiment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/experiment.csv
    demultiplex: false
    assignments:
      MS_mpra_assign_bbmap:
        type: config
        assignment_name: MS_mpra_assign_bbmap
        assignment_config: default_config
    design_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/rev_seq_w_adapter.fasta
    label_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/labels.tsv.gz
    configs:
      default:
        filter:
            bc_threshold: 1
            min_dna_counts: 1
            min_rna_counts: 1
            outlier_detection:
              methods: none
              mad_bins: 20
              times_mad: 5
              times_zscore: 3
            DNA:
              min_counts: 1
            RNA:
              min_counts: 1

experiment.csv:

Condition,Replicate,DNA_BC_F,RNA_BC_F
jurkat_0hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_0hr.fastq.gz
jurkat_0hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_0hr.fastq.gz
jurkat_0hr,3,ms_plasmid_rep3.fastq.gz,ms_Jurkat_MS_rep3_0hr.fastq.gz
jurkat_0hr,4,ms_plasmid_rep4.fastq.gz,ms_Jurkat_MS_rep4_0hr.fastq.gz
jurkat_0hr,5,ms_plasmid_rep5.fastq.gz,ms_Jurkat_MS_rep5_0hr.fastq.gz
jurkat_12hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_12hr.fastq.gz
jurkat_12hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_12hr.fastq.gz
jurkat_12hr,3,ms_plasmid_rep3.fastq.gz,ms_Jurkat_MS_rep3_12hr.fastq.gz
jurkat_12hr,4,ms_plasmid_rep4.fastq.gz,ms_Jurkat_MS_rep4_12hr.fastq.gz
jurkat_12hr,5,ms_plasmid_rep5.fastq.gz,ms_Jurkat_MS_rep5_12hr.fastq.gz
jurkat_24hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_24hr.fastq.gz
jurkat_24hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_24hr.fastq.gz
jurkat_24hr,3,ms_plasmid_rep3.fastq.gz,ms_Jurkat_MS_rep3_24hr.fastq.gz
jurkat_24hr,4,ms_plasmid_rep4.fastq.gz,ms_Jurkat_MS_rep4_24hr.fastq.gz
jurkat_24hr,5,ms_plasmid_rep5.fastq.gz,ms_Jurkat_MS_rep5_24hr.fastq.gz
jurkat_48hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_48hr.fastq.gz
jurkat_48hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_48hr.fastq.gz
jurkat_48hr,3,ms_plasmid_rep3.fastq.gz,ms_Jurkat_MS_rep3_48hr.fastq.gz
jurkat_48hr,4,ms_plasmid_rep4.fastq.gz,ms_Jurkat_MS_rep4_48hr.fastq.gz
jurkat_48hr,5,ms_plasmid_rep5.fastq.gz,ms_Jurkat_MS_rep5_48hr.fastq.gz
K562,1,ms_plasmid_rep1.fastq.gz,ms_k562_rep1.fastq.gz
K562,2,ms_plasmid_rep2.fastq.gz,ms_k562_rep2.fastq.gz
K562,3,ms_plasmid_rep3.fastq.gz,ms_k562_rep3.fastq.gz
K562,4,ms_plasmid_rep4.fastq.gz,ms_k562_rep4.fastq.gz
K562,5,ms_plasmid_rep5.fastq.gz,ms_k562_rep5.fastq.gz
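For reference, a small sketch of how the table above groups into conditions and replicates. It uses only Python's csv module and is independent of how MPRAsnakeflow itself reads the file; replicates_by_condition is a hypothetical helper, and only three rows are inlined to keep the example self-contained.

```python
import csv
import io

# A few rows of experiment.csv, inlined so the sketch runs on its own.
SAMPLE = """Condition,Replicate,DNA_BC_F,RNA_BC_F
jurkat_0hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_0hr.fastq.gz
jurkat_0hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_0hr.fastq.gz
K562,1,ms_plasmid_rep1.fastq.gz,ms_k562_rep1.fastq.gz
"""


def replicates_by_condition(handle):
    """Group the Replicate column by the Condition column."""
    groups = {}
    for row in csv.DictReader(handle):
        groups.setdefault(row["Condition"], []).append(row["Replicate"])
    return groups


groups = replicates_by_condition(io.StringIO(SAMPLE))
for condition, replicates in groups.items():
    note = "contains '_'" if "_" in condition else "no '_'"
    print(condition, replicates, note)
```

Pointing the same function at the real file (open("experiment.csv", newline="")) should list five replicates for each of the five conditions.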
