Hi,
I am encountering unexpected behaviour when using bcftools annotate with an additional INFO field as a matching key.
Background
I decomposed and normalised a multi-allelic VCF using bcftools norm with the --old-rec-tag option, specifying the tag name SOURCE_RECORD. After normalisation, some variants become identical in terms of CHROM, POS, REF and ALT. However, they remain distinguishable by SOURCE_RECORD because they originate from different sources.
I then generated a TSV file of per-variant metrics from the normalised VCF. I want to annotate the original VCF with these metrics using bcftools annotate. Because there are duplicate CHROM, POS, REF, ALT entries after normalisation, I followed the approach suggested in issue #2151, where an additional INFO field can be used as a matching key to disambiguate records.
Expected behaviour
When including SOURCE_RECORD as an additional key, I expect bcftools annotate to:
- Match records using CHROM, POS, REF, ALT and SOURCE_RECORD
- Treat SOURCE_RECORD as a literal string
- Only annotate records where the full SOURCE_RECORD value matches exactly
Observed behaviour
SOURCE_RECORD values have the format CHROM|POS|REF|ALT|USED_ALT_INDEX. When this field is used as an additional key, bcftools annotate does not appear to treat it as a strict literal string. Instead, it behaves as though the pipe characters are interpreted as OR separators. As a result:
- Records are matched if any substring between pipes matches
- The first duplicate variant receives the correct annotations
- Subsequent duplicates (same CHROM, POS, REF, ALT but different SOURCE_RECORD) incorrectly receive the annotations from the first occurrence, leading to incorrect assignment of metrics.
Example VCF:
#CHROM POS ID REF ALT QUAL FILTER INFO
chr5 123456 . C CT 35.2 . SOURCE_RECORD=chr5|123456|C|A,CT,G|2
chr5 123456 . C CT 21.6 . SOURCE_RECORD=chr5|123457|TA|AA,CA,TGA|3
Example TSV:
CHR POS REF ALT VARIANT SOURCE_RECORD METRIC1 METRIC2
chr1 123456 C CT chr1:123456-C/CT chr5|123456|C|A,CT,G|2 1.0 700
chr1 123456 C CT chr1:123456-C/CT chr5|123457|TA|AA,CA,TGA|3 0.0 70
Annotation command:
bcftools annotate ${vcf} \
--annotations ${tsv} \
--columns CHROM,POS,REF,ALT,-,SOURCE_RECORD,METRIC1,METRIC2 \
--include 'SOURCE_RECORD={SOURCE_RECORD}' \
--keep-sites \
--header-lines ${header}
Resulting VCF:
#CHROM POS ID REF ALT QUAL FILTER INFO
chr5 123456 . C CT 35.2 . SOURCE_RECORD=chr5|123456|C|A,CT,G|2;METRIC1=1.0;METRIC2=700
chr5 123456 . C CT 21.6 . SOURCE_RECORD=chr5|123457|TA|AA,CA,TGA|3;METRIC1=1.0;METRIC2=700
Note that the second variant is assigned the metrics of the first record (METRIC1=1.0, METRIC2=700) instead of its own (METRIC1=0.0, METRIC2=70).
If I remove the pipe characters in SOURCE_RECORD, the annotation behaves as expected and matching is correct, suggesting the issue is specifically related to how bcftools annotate interprets pipe characters in INFO fields used as matching keys.
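The pipe-removal workaround can be sketched roughly as follows. This is an illustration, not the exact commands used: the helper names sanitize_vcf and sanitize_tsv and the replacement delimiter SEP are hypothetical, the TSV is assumed to be tab-separated with a header row and SOURCE_RECORD in column 6 (as in the example above), and the VCF is assumed to be uncompressed text on stdin. Only the SOURCE_RECORD value is rewritten, so pipes in other INFO fields are left alone.

```shell
# Assumed replacement delimiter; choose one that cannot occur in your
# contig names or alleles, and use the same one for the VCF and the TSV.
SEP="${SEP:-_}"

# VCF: header lines pass through unchanged; in data lines only the
# SOURCE_RECORD entry of the INFO column (column 8) has its pipes replaced.
sanitize_vcf() {
  awk -v sep="$SEP" 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }
    {
      n = split($8, info, ";")
      out = ""
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^SOURCE_RECORD=/) gsub(/[|]/, sep, info[i])
        out = out (i > 1 ? ";" : "") info[i]
      }
      $8 = out
      print
    }'
}

# TSV: skip the header row, replace pipes in column 6 (SOURCE_RECORD) only.
sanitize_tsv() {
  awk -v sep="$SEP" 'BEGIN { FS = OFS = "\t" }
    NR > 1 { gsub(/[|]/, sep, $6) }
    { print }'
}
```

For example, `sanitize_vcf < input.vcf > input.nopipe.vcf` and `sanitize_tsv < metrics.tsv > metrics.nopipe.tsv` (file names are placeholders), after which the same annotation command can be run on the rewritten files.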
Questions
- Is there a way to force bcftools annotate to treat the INFO field as a strict literal string when used as a key?
- Alternatively (or additionally), would it be possible to make the format of --old-rec-tag customisable (for example, allowing the delimiter to be specified by the user)?
Thank you for your time.
Lisa