This milestone will be complete when GVL supports including non-personal variants, e.g. injecting variants from gnomAD, ClinVar, etc. for variant effect prediction.

**Implementation notes**

- Add a class (name WIP) for sites-only variants that can be initialized from either a VCF or a table (e.g. csv, tsv, arrow, parquet, anything polars supports) that has at least the columns CHROM, POS, REF, and ALT.
- Add methods for lazily filtering on record fields or any columns.
- Add a class for joined objects initialized from a sites-only object and a Dataset (or BED + ref genome). Names WIP, but something to the effect of SitesOnly+Ref or SitesOnly+Dataset. Similar to Datasets, SitesOnly+Dataset should be a lazy ragged array with indexable shape `(rows, samples)`, but now each row is a combination of a region and variant(s).
- Add a method to extract sites-only variants that overlap with each row of the Dataset, then use the function below to generate haplotypes with non-personal variants.
- For rapid prototyping, add JIT'd functions for applying 1 or more variants to 1 or more annotated sequences from a `gvl.Dataset`. A more efficient implementation should not rely on `gvl.Dataset.__getitem__` and should do this directly from the genotypes + reference instead.
- Add attributes (arrays with shape `(rows, samples)`) that flag each sequence as having the variant applied or not, and as having the variant already in the haplotype or not. These can simply be an optional return from the above function(s).
- Add methods `to_torch_dataset` and `to_dataloader`.
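
The overlap and variant-application steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not GVL's API: `Site`, `overlapping_sites`, and `apply_site` are hypothetical names, the sequence is assumed to be in reference coordinates (no personal indels applied yet), regions are 0-based half-open as in BED, and POS is 1-based as in VCF. The returned boolean corresponds to the "variant applied" flag; detecting a variant that is already present in a haplotype is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical sites-only record with the four required columns."""
    chrom: str
    pos: int  # 1-based position of the first REF base (VCF convention)
    ref: str
    alt: str


def overlapping_sites(sites, chrom, start, end):
    """Return sites whose REF allele lies fully inside [start, end) on chrom.

    start/end are 0-based half-open coordinates (BED convention).
    """
    return [
        s
        for s in sites
        if s.chrom == chrom
        and start <= s.pos - 1
        and s.pos - 1 + len(s.ref) <= end
    ]


def apply_site(seq, start, site):
    """Apply one site to a sequence whose first base sits at 0-based `start`.

    Returns (sequence, applied). The site is applied only if the bases at its
    position match REF; otherwise the sequence is returned unchanged.
    """
    offset = site.pos - 1 - start
    if offset < 0 or seq[offset:offset + len(site.ref)] != site.ref:
        return seq, False
    return seq[:offset] + site.alt + seq[offset + len(site.ref):], True


# Example: a region chr1:100-108 (0-based, half-open) and three candidate sites.
region_seq = "ACGTACGT"
sites = [
    Site("chr1", 103, "G", "T"),   # SNV inside the region
    Site("chr1", 200, "A", "C"),   # outside the region
    Site("chr2", 103, "G", "T"),   # different contig
]
hits = overlapping_sites(sites, "chr1", 100, 108)
new_seq, applied = apply_site(region_seq, 100, hits[0])
```

A JIT'd version would operate on byte arrays and offsets rather than Python strings, but the coordinate arithmetic would stay the same.
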
Overdue by 9 month(s) • Due by April 8, 2025 • 7/7 issues closed

This milestone will be completed when GVL supports returning spliced sequences. The primary use case for this is working with personalized exonic sequences. Partially addresses #24.

**Implementation notes**

When writing the dataset, this could be enabled by a `splice=True` flag and providing a BED file with named regions. Having unnamed regions should raise an error or warning. During reconstruction, regions with the same name will be spliced together in the order that they appear in the BED file. For example, `gvl.Dataset[0]` returns a spliced sequence corresponding to the first name appearing in the BED file. This will probably look like calling the `reconstruct_haplotype` function multiple times, once for each subsequence. This may also require disambiguating between regions and spliced regions in the documentation and/or `gvl.Dataset` attribute names.
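
The splicing rule described above (group regions by name, concatenate in order of appearance, reject unnamed regions) can be illustrated with a small sketch. It assumes a reference held as an in-memory dict of strings and BED-like tuples with 0-based half-open coordinates. `splice_by_name` is a hypothetical helper, not part of GVL, and plain slicing stands in for the per-subsequence haplotype reconstruction.

```python
def splice_by_name(regions, reference):
    """Concatenate same-named regions in the order they appear.

    regions: iterable of (chrom, start, end, name) tuples, 0-based half-open.
    reference: dict mapping chrom -> sequence string.
    Returns a dict name -> spliced sequence; dict insertion order follows the
    first appearance of each name, so the first key plays the role of row 0.
    """
    spliced = {}
    for chrom, start, end, name in regions:
        if not name:
            # Unnamed regions cannot be assigned to a spliced row.
            raise ValueError(f"unnamed region {chrom}:{start}-{end}")
        # One slice per subsequence, appended to whatever was already
        # collected for this name.
        spliced[name] = spliced.get(name, "") + reference[chrom][start:end]
    return spliced


# Example: two exons of geneA interleaved with one exon of geneB.
reference = {"chr1": "AAACCCGGGTTT"}
regions = [
    ("chr1", 0, 3, "geneA"),
    ("chr1", 9, 12, "geneB"),
    ("chr1", 6, 9, "geneA"),
]
spliced = splice_by_name(regions, reference)
```

Here the first row would be the spliced `geneA` sequence, since `geneA` is the first name in the region list.
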
No due date•0/2 issues closed