Transcript-level CPM sum and gene vs. transcript count discrepancy in multi-sample mode

Hi Bambu team, 
Thank you for developing such a powerful and elegant tool for context-aware transcript quantification. 
I have two questions regarding the output behavior when running Bambu in multi-sample mode:

**1. CPM values do not sum to 1,000,000 per sample** 

I observed that in the CPM_transcript.txt file, the sum of CPM values for each sample is not 1,000,000, but rather around 800,000 to 900,000. 
- Is this expected behavior? 
- If so, what types of reads or transcripts are excluded from the CPM computation, causing the total to fall below 1 million?
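
For reference, here is a minimal sketch of how the per-sample sums can be checked. It assumes the file is tab-delimited with two identifier columns (assumed here to be named `TXNAME` and `GENEID`) followed by one numeric column per sample. The toy table and the sample names are made up for illustration and are not my real data.

```python
import csv
import io


def column_sums(text, id_columns=("TXNAME", "GENEID")):
    """Sum every non-identifier column of a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    sums = {}
    for row in reader:
        for name, value in row.items():
            if name in id_columns:
                continue  # skip identifier columns (assumed names)
            sums[name] = sums.get(name, 0.0) + float(value)
    return sums


# Hypothetical CPM table standing in for CPM_transcript.txt
toy_cpm = (
    "TXNAME\tGENEID\tsampleA\tsampleB\n"
    "tx1\tg1\t500000\t400000\n"
    "tx2\tg1\t250000\t300000\n"
    "tx3\tg2\t100000\t200000\n"
)

sums = column_sums(toy_cpm)
print(sums)
```

On a real file, the same function can be called on the file's contents (for example `column_sums(open(path).read())`), adjusting `id_columns` if the header differs.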

**2. Gene-level total counts > Transcript-level total counts** 
 
When I compare the sum of raw counts:
- From counts_gene.txt, the total counts per sample are about 32 million. 
- From counts_transcript.txt, the totals are around 28 million.

This seems counterintuitive: since a gene's count should be the sum of its transcripts' counts, I would expect the transcript-level total to be equal to or greater than the gene-level total.
- Could you explain why the transcript-level sum is lower?
- Does this have to do with multi-mapping reads, transcript filtering, or EM assignment behavior?
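
The two observations might be related. As a back-of-the-envelope check (pure arithmetic on the approximate totals above, not a statement about how Bambu actually normalizes): if CPM is computed as `count / N * 1e6`, the CPM column sums to `1e6 * (sum of counts) / N`, so it reaches 1,000,000 only when `N` equals the sum of the counts being normalized. The function name below is hypothetical.

```python
def cpm_column_sum(row_total, denominator):
    """Sum of a CPM column when counts adding up to row_total
    are each scaled by 1e6 / denominator."""
    return row_total / denominator * 1_000_000


transcript_total = 28_000_000  # approximate sum of counts_transcript.txt
gene_total = 32_000_000        # approximate sum of counts_gene.txt

# Denominator equals the transcript total: the CPM column sums to one million.
same_total = cpm_column_sum(transcript_total, transcript_total)

# Denominator equals the larger gene-level total: the sum falls short.
larger_total = cpm_column_sum(transcript_total, gene_total)

print(same_total, larger_total)
```

With these rounded totals the second value is 875,000, which falls in the 800,000 to 900,000 range I see in the CPM file. I do not know whether this is what actually happens internally.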


I would appreciate any clarification on this!
Thanks again for your work on Bambu.

Transcript-level CPM sum and gene vs. transcript count discrepancy in multi-sample mode #485
