
Using wildcards in VCF file names #3

@bschilder

In the 1000 Genomes dataset hosted at the URL below, all files follow the same naming pattern except (annoyingly) for the chrX VCF and its index file, which use ".v2.vcf.gz" instead of just ".vcf.gz" like the other files.

https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/

At the moment, my workaround is to download those files into the data/vcf/phased folder (renaming them to be consistent with the other VCFs) before running the ProHap Snakemake pipeline.

wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf.gz \
-O $HOME/projects/ProHap/data/vcf/phased/1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.vcf.gz \
&& gunzip $HOME/projects/ProHap/data/vcf/phased/1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.vcf.gz

But would it be possible to add a bit more flexibility to the pipeline by allowing for wildcards?

For example:

phased_FTP_URL: "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/"
phased_local_path: "" 
phased_vcf_file_name: "1kGP_high_coverage_Illumina.chr{chr}.filtered.SNV_INDEL_SV_phased_panel.vcf"

With the wildcard added:

phased_vcf_file_name: "1kGP_high_coverage_Illumina.chr{chr}.filtered.SNV_INDEL_SV_phased_panel*.vcf"

Not sure how hard this would be, but for cases like this it could be quite helpful.
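
To illustrate the idea, here is a rough sketch of how such a pattern could be resolved. This is not ProHap's actual code: the function name `resolve_vcf_name` and the `available` file listing are hypothetical, and I'm assuming the pattern is matched against decompressed file names with shell-style globbing.

```python
from fnmatch import fnmatchcase


def resolve_vcf_name(pattern: str, chrom: str, available: list[str]) -> str:
    """Fill in {chr}, then match the resulting glob against a file listing.

    Hypothetical helper: raises if the pattern is ambiguous or matches nothing.
    """
    glob_pattern = pattern.format(chr=chrom)
    matches = sorted(name for name in available if fnmatchcase(name, glob_pattern))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one match for {glob_pattern!r}, found {matches}"
        )
    return matches[0]


# Hypothetical listing of decompressed files (only two chromosomes shown)
available = [
    "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.vcf",
    "1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf",
]

# Same value as the proposed phased_vcf_file_name config entry
pattern = "1kGP_high_coverage_Illumina.chr{chr}.filtered.SNV_INDEL_SV_phased_panel*.vcf"

for chrom in ["1", "X"]:
    print(chrom, "->", resolve_vcf_name(pattern, chrom, available))
```

Requiring exactly one match per chromosome is a deliberate choice in this sketch, so that an overly loose wildcard fails loudly instead of silently picking the wrong file.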

Many thanks again!
Brian
