Transcript-level CPM sum and gene vs. transcript count discrepancy in multi-sample mode #485

@donghyeon321

Hi Bambu team,
Thank you for developing such a powerful and elegant tool for context-aware transcript quantification.
I have two questions regarding the output behavior when running Bambu in multi-sample mode:

1. CPM values do not sum to 1,000,000 per sample

I observed that in the CPM_transcript.txt file, the sum of CPM values for each sample is not 1,000,000, but rather around 800,000 to 900,000.

  • Is this expected behavior?
  • If so, what types of reads or transcripts are excluded from the CPM computation, causing the total to fall below 1 million?
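The per-sample sums described above can be reproduced with a short script. This is a minimal sketch, assuming CPM_transcript.txt is tab-delimited with identifier columns named TXNAME and GENEID followed by one column per sample; the helper name `column_sums` and the toy table are illustrative only and are not part of Bambu.

```python
import csv
import io


def column_sums(table_text, id_columns=("TXNAME", "GENEID")):
    """Sum every non-identifier column of a tab-delimited table.

    Assumption: the first row is a header and all remaining columns
    hold numeric per-sample values.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    totals = {}
    for row in reader:
        for name, value in row.items():
            if name in id_columns:
                continue  # skip identifier columns
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Toy table for illustration only, not real Bambu output.
toy_cpm = (
    "TXNAME\tGENEID\tsampleA\tsampleB\n"
    "tx1\tg1\t600000\t500000\n"
    "tx2\tg1\t250000\t400000\n"
)

print(column_sums(toy_cpm))
```

For a real file, the text could be read with `open("CPM_transcript.txt").read()` and passed to the same function; a per-sample total well below 1,000,000 would match the observation above.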

2. Gene-level total counts > Transcript-level total counts

When I compare the sum of raw counts:

  • From counts_gene.txt, the total counts per sample are about 32 million.
  • From counts_transcript.txt, the totals are around 28 million.

This seems counterintuitive: since a gene's count should be the sum of its transcripts' counts, I would expect the transcript-level total to be at least as large as the gene-level total.

  • Could you explain why the transcript-level sum is lower?
  • Does this have to do with multi-mapping reads, transcript filtering, or EM assignment behavior?
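To narrow down where the difference comes from, the transcript counts can be summed per gene and compared with the gene table. This is a minimal sketch, assuming both files are tab-delimited, that counts_transcript.txt carries a GENEID column, and that counts_gene.txt has a header column named GENEID; these column names, the function names, and the toy tables are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict


def gene_totals_from_transcripts(transcript_text, sample):
    """Sum transcript-level counts per gene for one sample column."""
    reader = csv.DictReader(io.StringIO(transcript_text), delimiter="\t")
    totals = defaultdict(float)
    for row in reader:
        totals[row["GENEID"]] += float(row[sample])
    return dict(totals)


def gene_differences(gene_text, transcript_text, sample, tol=1e-6):
    """Return {gene: gene_count - summed_transcript_count} where they differ."""
    summed = gene_totals_from_transcripts(transcript_text, sample)
    reader = csv.DictReader(io.StringIO(gene_text), delimiter="\t")
    diffs = {}
    for row in reader:
        gene = row["GENEID"]
        diff = float(row[sample]) - summed.get(gene, 0.0)
        if abs(diff) > tol:  # tolerance because counts may be fractional
            diffs[gene] = diff
    return diffs


# Toy tables for illustration only, not real Bambu output.
toy_transcripts = (
    "TXNAME\tGENEID\tsampleA\n"
    "tx1\tg1\t10\n"
    "tx2\tg1\t5\n"
    "tx3\tg2\t7\n"
)
toy_genes = (
    "GENEID\tsampleA\n"
    "g1\t18\n"
    "g2\t7\n"
)

print(gene_differences(toy_genes, toy_transcripts, "sampleA"))
```

Genes that appear in the result are the ones whose gene-level count is not fully accounted for by their transcripts, which may help identify which reads contribute to the gap.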

I would appreciate any clarification on this!
Thanks again for your work on Bambu.
