Skip to content

Cram explicit tlen#1993

Merged
daviesrob merged 3 commits intosamtools:developfrom
jkbonfield:cram-explicit-tlen
Apr 2, 2026
Merged

Cram explicit tlen#1993
daviesrob merged 3 commits intosamtools:developfrom
jkbonfield:cram-explicit-tlen

Conversation

@jkbonfield
Copy link
Copy Markdown
Contributor

  • Fix CRAM v4.0 CRAM_FLAG_EXPLICIT_TLEN usage. We needed to check the explicit_tlen variable as well as the cram flags, which led to removal of items from the cram stats and potentially changing data by erroneously believing the values to be constant.
  • Fix c->ref_end for embedded references when bizarrely asking to generate a consensus for embedding while also telling it what the external reference is called. A nonsensical set of options, but they should work.

Both of these show up in io_lib if I switch to using htslib for CRAM reading and writing.

When we have paired reads where the detected template size doesn't
match the recorded size, with CRAM 3.x we turn the reads into
"detached" mode as that enables the ability to store ISIZE explicitly.
However doing so removes many other benefits of deduplication, such as
the read name and POS vs MATE_POS fields.

With CRAM v4.0 if it's only the insert size which differs, which can
be a common issue, we have an additional CRAM_FLAG_EXPLICIT_TLEN flag.

Unfortunately may be set in the cram_flags later on, so we much check
a local variable too.  The impact of this was we build up the
frequency statistics for DS_TS (template size) and then incorrectly
remove the read-1 from it as we believe we're going into detached
mode.  We no longer remove the other read's size so the statistics are
correct.  This turned up in test data due to -20 / +20 becoming -20
only and hence encoding as a constant value.

Bizarrely this only happens when specifying a reference while also
asking it to generate consensus as a reference.  The two together leave
the memory copy of ref LN value as zero, which triggered incorrect
cram container header data and in turn triggers the isize validity
check to happen, forcing explicit-tlen storing.
Firstly, this was over-sizing as it used allocated memory rather than
observed maximum extents.

Secondly the combination of specifying a reference while also asking
it not to use that reference and to use the consensus instead left ref
LN_length variable as zero.  We no longer shrink c->ref_end back down
in this case.
The container ref_start and ref_end are absolute coordinates in the
reference covered by the alignments.  However extend_ref had ref_end
as the maximum reference allocated, meaning we had to track the actual
end point in another variable (max_pos).

Renamed max_pos to ref_end and now track the allocation in
ref_end_alloc.
@daviesrob daviesrob merged commit bf93316 into samtools:develop Apr 2, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants