feat: add illico for rank_genes_groups #4038

ilan-gold wants to merge 8 commits into ig/exp_post_agg
❌ 8 Tests Failed:
Hey, good catch on the OVO column ordering. I have no idea where the bug comes from; it seems to be related to the group labels for PBMC.
So it seems to be due to `np.argsort` and `np.sort` not sorting the weird characters (`" "`, `"+"`, `"/"`, `"#"`, etc.) of the `bulk_labels` the same way. I will ship a fix sometime today or this weekend.

NB: I spent a bit of time looking into the PBMC dataset and it seems rather unusual: a lot of values are negative, leading to undefined logfoldchanges, and on top of that there seems to be very little value diversity in the dataset. As a result, a lot of genes end up with identical rank sums, but because illico and scanpy do not compute the z-score the exact same way (mathematically equivalent but programmatically different), gene ordering is impacted. The gene ordering affects the position of NaN values in the results, but the non-NaN values do match because they are very similar. I am not sure this dataset is a good test case.

NB2: details of the programmatic differences. illico computes `U = ranksum - n_tgt * (n_tgt + 1) / 2` and then `z = U - n_ref * n_tgt / 2`, while scanpy computes `z = ranksum - n_tgt * (n_tgt + 1 + n_ref) / 2`. Those are mathematically equivalent but result in differences on the order of 1e-9, changing gene order when rank sums are equal.
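The algebraic equivalence is easy to check by hand (a minimal sketch with made-up counts, not values from the PBMC dataset; the ~1e-9 discrepancies only appear once large floating-point arrays and a different order of operations are involved):

```python
# Sanity check: the illico-style and scanpy-style z numerators agree algebraically.
# The counts below are made up for illustration, not taken from the PBMC dataset.
ranksum = 12345.5  # rank sum of the target group (ties can produce .5 ranks)
n_tgt, n_ref = 100, 250

# illico-style: compute the Mann-Whitney U first, then subtract its mean
U = ranksum - n_tgt * (n_tgt + 1) / 2
z_illico = U - n_ref * n_tgt / 2

# scanpy-style: fold both correction terms into a single subtraction
z_scanpy = ranksum - n_tgt * (n_tgt + 1 + n_ref) / 2

# With small exact values the two agree to the last bit; on large float arrays
# the different grouping of operations can round differently.
print(z_illico, z_scanpy)  # -5204.5 -5204.5
```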
Interesting observation, and I'm glad we are aware of this now. This could be another target for scanpy 2.0. I will try a different dataset, but it's good to have this documented.
Forget my previous message, which was actually kind of off topic. This PBMC dataset was actually a very good test case.

NB: what is the situation with the Dask issue? The current version is not Dask-compatible.
That should come next; I want to keep this scoped to the in-memory stuff for now.

Another thing, if you're doing a batch of fixes this weekend: supporting

Also, for your
I'm seeing some instances of https://github.com/satijalab/seurat/blob/main/R/differential_expression.R#L1089, i.e., they calculate the mean expression with this "+1" offset, so they don't need the correction.

So we have three different things here, I guess. I think just giving the option to match scanpy (since that seems to be a project goal) is good, and then we can maybe revisit what Seurat does as a new parameter for scanpy 2.0.
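For concreteness, here is a hedged sketch of the two offset conventions being discussed. The epsilon value and the exact formulas are illustrative assumptions, not code lifted from either project:

```python
import numpy as np

# Per-group mean log1p-normalized expression; values are made up for illustration.
mean_log_tgt = np.array([0.8, 0.0, 2.1])
mean_log_ref = np.array([0.5, 1.2, 0.0])

# scanpy-style (assumed): undo log1p, then add a tiny epsilon before the ratio
eps = 1e-9
lfc_eps = np.log2((np.expm1(mean_log_tgt) + eps) / (np.expm1(mean_log_ref) + eps))

# Seurat-style (roughly, per the linked line): a "+1" pseudocount instead of epsilon,
# which keeps the log-fold change bounded when one group's mean expression is zero
lfc_pseudo = np.log2((np.expm1(mean_log_tgt) + 1) / (np.expm1(mean_log_ref) + 1))

print(lfc_eps)     # the epsilon version blows past +/-25 for zero-mean genes
print(lfc_pseudo)  # the pseudocount version stays moderate
```

The point is only that the choice of offset changes the extreme values, which is one of the "three different things" to reconcile or document.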
For this, I would be open to also making it part of a scanpy 2.0 set of defaults.

Overall, my ideal scenario would be an exact match; short of exact matches, like I said, we can just document the changes. It sounds like the biggest things that would close the gap are:

I think we can patch the

Overall, this is really cool and I'm super happy that things seem to be pretty close!
Hey, I just released version 0.5.0rc1, which should fix the issues identified on this test case.

Regarding the numerical precision question: so far I force the

Testing locally, it seems everything now runs smoothly for PBMC. Let me know how it goes with the rest of the CI.
TODOs:
See: https://github.com/scverse/scanpy/actions/runs/24088645078/job/70268566419?pr=4038
- `ovo` "column" in scores (cc @remydubois)
- `illico` (see the `filterwarnings` on the new test)
- `ovr` results are so far off
- `exp_post_agg` feature (feat: allow exponentiation post agg for log-fold-change in `rank_genes_groups` #4037) to `illico`
- we will add the `illico` backend (or all changes have been documented)
- illico #4012