Skip to content

Add multi-region iterator support for tabix#1997

Open
HeyangQin wants to merge 1 commit intosamtools:developfrom
HeyangQin:tbx-multi-region
Open

Add multi-region iterator support for tabix#1997
HeyangQin wants to merge 1 commit intosamtools:developfrom
HeyangQin:tbx-multi-region

Conversation

@HeyangQin
Copy link
Copy Markdown

Summary

Closes #1913 (and addresses the multi-region query aspect of #1949).

This adds tbx_itr_regions() and tbx_itr_multi_next(), giving tabix-indexed files (VCF, BED, GFF, GAF, etc.) the same multi-region query capability that BAM/CRAM already have via sam_itr_regions().

The key feature is automatic deduplication: when query regions overlap, each record is returned exactly once. This is critical for use cases like "query all known cancer genes" where gene regions may be adjacent or overlapping.

Design

The core insight is that tabix and BAM share the same binning index format (TBI/CSI ↔ BAI/CSI), so hts_itr_multi_bam works directly as the query function for tabix — no new query function needed.

The one interface gap was that hts_itr_multi_next() hardcodes htsFile* as the data argument to readrec, but tbx_readrec expects tbx_t*. This is resolved by adding a void *readrec_data field to hts_itr_t:

  • When set (by tbx_itr_regions), it is passed to readrec instead of fd
  • When NULL (default, from calloc), existing BAM/CRAM behaviour is preserved
  • One field addition, three line changes in hts_itr_multi_next — fully backwards-compatible

Usage

// Parse regions (merges overlaps automatically)
int n_reg;
hts_reglist_t *reglist = hts_reglist_create(regions, n_regions,
    &n_reg, tbx, (hts_name2id_f)(tbx_name2id));

// Create multi-region iterator
hts_itr_t *itr = tbx_itr_regions(tbx, reglist, n_reg);

// Iterate — each record returned at most once
kstring_t s = {0, 0, NULL};
while (tbx_itr_multi_next(fp, itr, &s) >= 0) {
    // process s.s
}
tbx_itr_destroy(itr);

Changes

File Change
htslib/hts.h Add void *readrec_data to hts_itr_t
hts.c Use readrec_data when set in hts_itr_multi_next (3 call sites)
htslib/tbx.h Add tbx_itr_regions() declaration, tbx_itr_multi_next() macro
tbx.c Implement tbx_itr_regions(), tbx_pseek(), tbx_ptell()
htslib.map Export tbx_itr_regions

Test plan

  • htslib builds cleanly with no new warnings
  • Tested with a VCF containing 7 variants across 2 chromosomes:
    • Single-region query: correct record count
    • Non-overlapping multi-region: records from both regions returned
    • Overlapping regions (chr1:100-300 + chr1:200-400): 4 unique records, no duplicates
    • Cross-chromosome with overlap (3 regions, 2 overlapping): 7 unique records
  • samtools builds and passes markdup tests against updated htslib
  • Existing BAM/CRAM multi-region behaviour unaffected (readrec_data defaults to NULL)

Implement tbx_itr_regions() and tbx_itr_multi_next() to provide
multi-region query support for tabix-indexed files (VCF, BED, GFF,
GAF, etc.), matching the existing functionality for BAM/CRAM via
sam_itr_regions().

The multi-region iterator handles overlapping regions by merging
intervals and deduplicating records, so each record is returned at
most once regardless of how many query regions it falls in.

Implementation:
- Add readrec_data field to hts_itr_t, allowing hts_itr_multi_next()
  to pass format-specific context (tbx_t*) to the readrec callback
  instead of htsFile*.  When NULL (default), the existing behaviour
  of passing htsFile* is preserved.
- Add tbx_itr_regions() which creates a multi-region iterator using
  the existing hts_itr_multi_bam query function (tabix and BAM share
  the same binning index format).
- Add tbx_itr_multi_next() macro and tbx_pseek/tbx_ptell helpers.

Closes samtools#1913.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement multi-region iterators for tabix

1 participant