align_mapper.p_to_g does not support split codons across exon junctions (e.g. BRAF N581S) #453

@Hui-zju

Describe the bug

align_mapper.p_to_g currently fails when converting protein coordinates to genomic coordinates if the amino acid is encoded by a split codon across an exon junction.

A concrete example is BRAF N581S, which corresponds to a codon split across two exons in transcript NM_004333.6.

Steps to reproduce

    align_mapper.p_to_g(
        p_ac='NP_004324.2',
        p_start_pos=580,
        p_end_pos=580,
    )

The call raises an error:

    Unable to find transcript alignment for query:
    SELECT hgnc, tx_ac, tx_start_i, tx_end_i, alt_ac, alt_start_i,
           alt_end_i, alt_strand, alt_aln_method, ord, tx_exon_id, alt_exon_id
    FROM uta_20241220.tx_exon_aln_v
    WHERE tx_ac='NM_004333.6'
      AND alt_ac LIKE 'NC_00%'
      AND alt_aln_method='splign'
      AND 1966 BETWEEN tx_start_i AND tx_end_i
      AND 1969 BETWEEN tx_start_i AND tx_end_i
    ORDER BY CAST(
        SUBSTR(alt_ac, position('.' in alt_ac) + 1, LENGTH(alt_ac)) AS INT
    )

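For readers wondering where 1966 and 1969 in the query come from, here is a minimal sketch of the protein-to-transcript arithmetic. It assumes inter-residue (0-based, end-exclusive) coordinates and a CDS start offset of 226 for NM_004333.6; that offset is inferred from the query values, not stated anywhere in this issue, and `codon_tx_range` is a hypothetical helper, not part of the library.

```python
from __future__ import annotations


def codon_tx_range(p_pos_0based: int, cds_start_i: int) -> tuple[int, int]:
    """Return the 0-based, end-exclusive transcript range covered by one codon."""
    tx_start = cds_start_i + 3 * p_pos_0based  # three bases per residue
    return tx_start, tx_start + 3


# cds_start_i=226 is an assumption inferred from the failing query above.
tx_start, tx_end = codon_tx_range(580, 226)
print(tx_start, tx_end)
```

Under those assumptions the range comes out as 1966 to 1969, matching the two positions in the failing query.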
Expected behavior

p_to_g should handle protein positions whose underlying codon spans an exon junction and return the correct genomic coordinates (potentially as a compound or multi-interval mapping).

Acceptance Criteria

align_mapper.p_to_g successfully maps protein positions whose underlying codon spans an exon junction (split codons).

Possible reason(s)/Suggested Fix

BRAF N581 is encoded by a split codon across an exon boundary in NM_004333.6.

Relevant exon alignment records from uta_20241220.tx_exon_aln_v:

SELECT *
FROM uta_20241220.tx_exon_aln_v
WHERE tx_ac = 'NM_004333.6'
AND alt_ac LIKE 'NC_00%'
AND alt_aln_method = 'splign'
AND tx_start_i < 1975
AND tx_end_i > 1960
ORDER BY tx_start_i, tx_end_i;

ord | tx_start_i | tx_end_i | tx_exon_id | alt_exon_id | alt_ac
-- | -- | -- | -- | -- | --
13 | 1920 | 1967 | 7649344 | 8121846 | NC_000007.13
13 | 1920 | 1967 | 7649344 | 9507337 | NC_000007.14
14 | 1967 | 2086 | 7649345 | 8121847 | NC_000007.13
14 | 1967 | 2086 | 7649345 | 9507338 | NC_000007.14

The codon corresponding to protein position 581 spans transcript positions 1966–1969, which crosses the boundary at 1967 between exon ord=13 and exon ord=14.
However, the current query logic requires both positions (1966 and 1969) to fall between tx_start_i and tx_end_i of a single exon alignment row. No single row satisfies both conditions for a split codon, so the lookup fails.
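The difference between the containment predicate and an overlap predicate can be reproduced without a UTA instance. The sketch below loads the four rows from the table above into an in-memory SQLite table (a stand-in for `tx_exon_aln_v`, reduced to the columns shown) and runs both predicates. The overlap form mirrors the ad hoc query used earlier in this issue; it is not the library's current query.

```python
import sqlite3

# Rows copied from the exon alignment table in this issue.
ROWS = [
    (13, 1920, 1967, 7649344, 8121846, "NC_000007.13"),
    (13, 1920, 1967, 7649344, 9507337, "NC_000007.14"),
    (14, 1967, 2086, 7649345, 8121847, "NC_000007.13"),
    (14, 1967, 2086, 7649345, 9507338, "NC_000007.14"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tx_exon_aln_v (ord INT, tx_start_i INT, tx_end_i INT, "
    "tx_exon_id INT, alt_exon_id INT, alt_ac TEXT)"
)
conn.executemany("INSERT INTO tx_exon_aln_v VALUES (?, ?, ?, ?, ?, ?)", ROWS)

# Current behaviour: both positions must sit inside one exon row.
containment_sql = (
    "SELECT ord FROM tx_exon_aln_v WHERE alt_ac = ? "
    "AND ? BETWEEN tx_start_i AND tx_end_i "
    "AND ? BETWEEN tx_start_i AND tx_end_i ORDER BY ord"
)
# Alternative: any exon row that overlaps the half-open range [start, end).
overlap_sql = (
    "SELECT ord FROM tx_exon_aln_v WHERE alt_ac = ? "
    "AND tx_start_i < ? AND tx_end_i > ? ORDER BY ord"
)

contained = conn.execute(containment_sql, ("NC_000007.14", 1966, 1969)).fetchall()
overlapping = conn.execute(overlap_sql, ("NC_000007.14", 1969, 1966)).fetchall()
print(contained)    # no rows: neither exon contains both 1966 and 1969
print(overlapping)  # both exons (ord 13 and ord 14) overlap the codon
```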

Suggestion

Consider enhancing p_to_g to:

- detect codons spanning exon junctions, and
- support mapping them by aggregating multiple exon alignment segments rather than requiring a single tx_exon_aln_v row.
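One possible shape for the aggregation step is sketched below. It clips the transcript range to each overlapping exon and projects each piece onto the genome, returning one genomic interval per exon. `ExonAln` and `map_tx_range` are hypothetical names, the genomic coordinates in the usage example are made-up placeholders (not real BRAF positions), and the sketch assumes each exon aligns without gaps; only the column names and transcript coordinates come from this issue. BRAF is on the minus strand, so the example uses alt_strand = -1.

```python
from typing import List, NamedTuple, Tuple


class ExonAln(NamedTuple):
    """Subset of tx_exon_aln_v columns needed for the projection."""
    ord: int
    tx_start_i: int
    tx_end_i: int
    alt_start_i: int
    alt_end_i: int
    alt_strand: int


def map_tx_range(tx_start: int, tx_end: int, exons: List[ExonAln]) -> List[Tuple[int, int]]:
    """Project [tx_start, tx_end) onto the genome, one interval per overlapping exon.

    Assumes ungapped exon alignments (transcript and genomic lengths match).
    """
    segments = []
    for ex in sorted(exons, key=lambda e: e.tx_start_i):
        lo = max(tx_start, ex.tx_start_i)
        hi = min(tx_end, ex.tx_end_i)
        if lo >= hi:
            continue  # this exon does not overlap the requested range
        off_lo, off_hi = lo - ex.tx_start_i, hi - ex.tx_start_i
        if ex.alt_strand == 1:
            segments.append((ex.alt_start_i + off_lo, ex.alt_start_i + off_hi))
        else:
            # Minus strand: transcript offsets count down from alt_end_i.
            segments.append((ex.alt_end_i - off_hi, ex.alt_end_i - off_lo))
    return segments


# Placeholder genomic coordinates, for illustration only.
exons = [
    ExonAln(13, 1920, 1967, 5000, 5047, -1),
    ExonAln(14, 1967, 2086, 3000, 3119, -1),
]
segments = map_tx_range(1966, 1969, exons)
print(segments)
```

With these placeholder rows the codon maps to two intervals, one base from exon 13 and two bases from exon 14. Whether p_to_g should return such a list or a single compound object is a design decision left open here.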

Environment & Version

None

Metadata

Labels: enhancement (New feature or request)
