Merged
Conversation
When we have paired reads where the detected template size doesn't match the recorded size, with CRAM 3.x we turn the reads into "detached" mode as that enables the ability to store ISIZE explicitly. However doing so removes many other benefits of deduplication, such as the read name and POS vs MATE_POS fields. With CRAM v4.0 if it's only the insert size which differs, which can be a common issue, we have an additional CRAM_FLAG_EXPLICIT_TLEN flag. Unfortunately may be set in the cram_flags later on, so we much check a local variable too. The impact of this was we build up the frequency statistics for DS_TS (template size) and then incorrectly remove the read-1 from it as we believe we're going into detached mode. We no longer remove the other read's size so the statistics are correct. This turned up in test data due to -20 / +20 becoming -20 only and hence encoding as a constant value. Bizarrely this only happens when specifying a reference while also asking it to generate consensus as a reference. The two together leave the memory copy of ref LN value as zero, which triggered incorrect cram container header data and in turn triggers the isize validity check to happen, forcing explicit-tlen storing.
Firstly, this was over-sizing as it used allocated memory rather than observed maximum extents. Secondly the combination of specifying a reference while also asking it not to use that reference and to use the consensus instead left ref LN_length variable as zero. We no longer shrink c->ref_end back down in this case.
daviesrob
reviewed
Apr 1, 2026
The container ref_start and ref_end are absolute coordinates in the reference covered by the alignments. However extend_ref had ref_end as the maximum reference allocated, meaning we had to track the actual end point in another variable (max_pos). Renamed max_pos to ref_end and now track the allocation in ref_end_alloc.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
explicit_tlenvariable as well as the cram flags, which led to removal of items from the cram stats and potentially changing data by erroneously believing the values to be constant.Both of these show up in io_lib if I switch to using htslib for CRAM reading and writing.