fix(7571): bug fix for narm issue on gforce in int64 case #7572
Thanks @manmita! Please ping me when tests are passing, or if you're stuck.
NEWS.md
3. `fread("file://...")` works for file URIs with spaces, [#7550](https://github.com/Rdatatable/data.table/issues/7550). Thanks @aitap for the report and @MichaelChirico for the PR.

4. Fixed a bug in GForce grouped sum for `integer64` columns where `sum(value, na.rm = FALSE)` did not always return `NA` when any `NA_integer64_` was present in a group. [#7571](https://github.com/Rdatatable/data.table/issues/7571). Now, groups containing any `NA_integer64_` correctly return `NA`, matching base R behavior. Thanks to @rweberc for the report and @MichaelChirico for reproducible examples.
You could also mention yourself here, and ideally describe the root cause, so readers can better judge whether they could have been affected.
Thanks, will do
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #7572 +/- ##
=======================================
Coverage 98.99% 98.99%
=======================================
Files 87 87
Lines 16729 16729
=======================================
Hits 16561 16561
Misses 168 168
Hello @MichaelChirico, there is a coverage issue in dogroups.c.
inst/tests/tests.Rraw
id = c(1:8, 9, 9),
value = c(rep(NA_integer64_, 3L), as.integer64(4:10))
)
test(2358.1,
Use `test(options=c(datatable.optimize=2L))` to emphasize that this was a GForce problem.
Thanks, will do
NEWS.md
3. `fread("file://...")` works for file URIs with spaces, [#7550](https://github.com/Rdatatable/data.table/issues/7550). Thanks @aitap for the report and @MichaelChirico for the PR.

4. - Fixed a bug in GForce grouped sum for `integer64` columns where `sum(value, na.rm = FALSE)` when consecutive NA values are present in the same partition but different group. In that case only the first group was handled ([#7571](https://github.com/Rdatatable/data.table/issues/7571)). Thanks to @rweberc for the report, @MichaelChirico for reproducible examples, and @manmita for the fix.
I didn't want to anchor the NEWS too much on implementation details -- user reading this likely doesn't care all that much.
It still feels a bit lacking though -- maybe there's a more concise way to explain it?
What is the exact nature of the bug from a user point of view? Is it true that it always results in a group having "sum" 0 instead of the correct NA_integer64_?
In the code there is a variable called `howMany`, which holds the number of items in the current batch.
If each `id` group has 2 rows but batching is done in 4s, the sum becomes 0 for everything after the first NA encountered, although that first NA does set the first `id` group to the correct value.
For example, take `dt_short[, result_problem := sum(value, na.rm = FALSE), by = id]`, where `id` is 1,1,2,2,3,3 and `value` is NA,NA,NA,1,2,3.
Then the result is NA, 0, 5 if the batch size is 4.
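The behaviour described above can be sketched with a simplified model. This is an illustration under assumptions, not the actual gsum code: the function name, the `grp` array (a 0-based group index per row), and the plain sequential batching are all hypothetical, and `NA_integer64_` is assumed to be stored as `INT64_MIN`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: integer64 NA is represented by the INT64_MIN sentinel. */
#define NA_INT64 INT64_MIN

/* Hypothetical, simplified model of a batched grouped sum that leaves the
 * batch as soon as it meets an NA. ans must be zero-initialised by the caller. */
void sum_by_group_with_break(const int64_t *x, const int *grp, size_t n,
                             size_t batch, bool narm, int64_t *ans)
{
    assert(batch > 0);
    for (size_t start = 0; start < n; start += batch) {
        /* number of rows in this batch (the last batch may be shorter) */
        const size_t how_many = (n - start < batch) ? n - start : batch;
        for (size_t i = 0; i < how_many; i++) {
            const int64_t b = x[start + i];
            if (b == NA_INT64) {
                if (narm) continue;
                ans[grp[start + i]] = NA_INT64;
                break; /* bug: skips the rest of the batch, including other groups */
            }
            ans[grp[start + i]] += b;
        }
    }
}
```

Calling it with values NA,NA,NA,1,2,3, groups 0,0,1,1,2,2 and a batch of 4 leaves the second group at 0: the first NA ends the batch before rows 2 to 4 are visited.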
From the user's perspective, I guess we can say that when grouping over NA values with parallel processing, multiple groups containing NA values can fall into the same batch, but the break only handles the first group.
That we do things in parallel/in batches should not matter to user who just wants correct code.
> then the result is na, 0, 5 if the batch size is 4
It looks like it confirms my diagnosis -- the error always appears as an incorrect 0. Is it not possible that the NA_integer64_ is encountered later after the sum has already been initialized?
Actually, an NA result is still possible if the NA falls in a different batch. For example, NA, NA, NA, NA, NA, NA for ids 1,2,3,4,5,6 with batch size 4 gives NA, 0, 0, 0, NA, 0.
But maybe we can describe it as an "NA followed by zeros" kind of behavior in the case of consecutive NAs.
Hi @MichaelChirico, the coverage is failing at src/dogroups.c, which was fixed by @ben-schwen in this commit. That file is the only one shown in the Sentry report.

You can ignore it here.
This bug is really quite serious I think. Consider:

library(data.table)
library(bit64) # provides as.integer64()

DT = data.table(id = sample(letters, 1000, TRUE), value = as.integer64(sample(c(1:100, NA), 1000, TRUE)))
# keep only the groups where the GForce sum disagrees with base::sum
DT[,
  .(gforce_sum = sum(value)), by=id
][
  DT[, .(true_sum = base::sum(value)), by=id],
  on='id'
][
  gforce_sum != true_sum | is.na(gforce_sum) != is.na(true_sum)
]
#         id           gforce_sum true_sum
#     <char>                <i64>    <i64>
#  1:      p                 1768     2188
#  2:      g                 2489     2684
#  3:      t                 1451     1922
#  4:      h -9223372036854775096     <NA>
#  5:      n                  687     2410
#  6:      m                  266     1556
#  7:      k                  986     2101
#  8:      y                 1035     1398
#  9:      q -9223372036854774756     <NA>
# 10:      d                 1650     <NA>
# 11:      c -9223372036854774942     <NA>
# 12:      f                 1073     1674
# 13:      o                  710     1513
# 14:      v                  760     1660
# 15:      l                 1076     <NA>
# 16:      e                 1089     1736
# 17:      w -9223372036854774951     <NA>
# 18:      i                 1204     <NA>
# 19:      j                 1478     <NA>
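One possible reading of the huge negative values in that output, assuming `NA_integer64_` is stored as the `INT64_MIN` sentinel: once a group's running answer has been set to NA, any later non-NA value added without first checking for the sentinel produces a number just above `INT64_MIN`. The helper below is hypothetical and only illustrates the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: keep adding to a running sum without checking whether
 * it already holds the NA sentinel (assumed to be INT64_MIN). */
int64_t add_without_na_check(int64_t running, const int64_t *x, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        running += x[i]; /* an NA answer silently turns into an ordinary number */
    }
    return running;
}
```

Starting from `INT64_MIN` and adding values that total 712 gives -9223372036854775096, the figure shown for group `h`. Whether that group's remaining values really summed to 712 is not something the output confirms.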
It also affects

Yeah, seems like Will think it through, check out the example cases and the code again. Thanks

The current example did get fixed by the changes made in this branch, but I think we should add more test cases and test it thoroughly. I believe the current fix works because

Yes, sorry, to clarify, my example is from It would be good to ensure the same issue does not happen elsewhere.

Yeah, true. I think extensive testing is needed for this case; I will write some tests from my end and check mean/fastmean too.

In the gmean code, line 619: this is adding the NA to the sum, which creates an issue for int64 (`my_gx[i]` is the data item). Will check it out! I was wondering whether the fix for this should go in a separate PR or be part of this PR?

Good question. For now, let's put it in this PR. The code change seems like it will be small enough, and the new tests feel logically connected.
Sorry that I'm late to that party, but wouldn't it be easier (and also maybe easier to grasp) to mirror the e.g. remove the

    for (int i=0; i<howMany; i++) {
      const int64_t a = _ans[my_low[i]];
      if (a==INT64_MIN) continue;
      const int64_t b = my_gx[i];
      if (b==INT64_MIN) {
        if (!narm) _ans[my_low[i]]=INT64_MIN;
        continue;
      }
      _ans[my_low[i]] += b;
    }
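To try the loop suggested above in isolation, here is a minimal, self-contained sketch. The driver around the loop is an assumption: the function name, the `grp` array standing in for `my_low`, and the simple sequential batching are hypothetical and not taken from the data.table sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver around the suggested inner loop. grp[i] is the 0-based
 * group of row i; ans must be zero-initialised. NA is assumed to be INT64_MIN. */
void sum_by_group_no_break(const int64_t *x, const int *grp, size_t n,
                           size_t batch, bool narm, int64_t *ans)
{
    assert(batch > 0);
    for (size_t start = 0; start < n; start += batch) {
        const size_t how_many = (n - start < batch) ? n - start : batch;
        const int64_t *my_gx = x + start; /* values of this batch */
        const int *my_low = grp + start;  /* group index of each value */
        for (size_t i = 0; i < how_many; i++) {
            const int64_t a = ans[my_low[i]];
            if (a == INT64_MIN) continue; /* group is already NA: leave it */
            const int64_t b = my_gx[i];
            if (b == INT64_MIN) {
                if (!narm) ans[my_low[i]] = INT64_MIN;
                continue;                 /* keep scanning the rest of the batch */
            }
            ans[my_low[i]] += b;
        }
    }
}
```

With values NA,NA,NA,1,2,3 and groups 0,0,1,1,2,2 this yields NA, NA, 5 for any batch size when `narm` is false, because an NA only marks its own group and never ends the batch early.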
gforce = DT[, .(gforce_sum = sum(value)), by=id]
base = DT[, .(true_sum = base::sum(value)), by=id]
merged = merge(gforce, base, by="id", all=TRUE)
test(2358.4, options=c(datatable.optimize=2L),
I think we can make this just
test(2358.4, options=c(datatable.optimize=2L),
  merged$gforce_sum, merged$true_sum)

I think the mean is working correctly: I tested it with similar examples and the mean code gives correct results. I also updated the gsum code to follow its int counterpart.
MichaelChirico
left a comment
Thanks! I think it's ready to go. We can iterate more in future bugs/PRs.
Thank you @MichaelChirico @ben-schwen @jangorecki for your help and suggestions <3

Thanks @manmita. I've invited you to be a Project Member. Once you accept, you'll be able to create branches directly on this repo (Rdatatable/data.table). Please do so going forward, as it makes collaboration a bit easier. I'll also file a follow-up PR adding you as a contributor in the DESCRIPTION.
…#7572)
* fix(7571): bug fix for narm issue on gforce in int64 case
* fix(7571): test sequencing
* fix(7571): updated the NEWS.md
* trailing newline
* Use $V1
* fix(7571): added db optimize 2L
* refine NEWS
* fix(7571): add more tests and change to code similar to int for gsum
* fix(7571): added more tests for mean
* eliminate intermediate variable
* NEWS again
---------
Co-authored-by: Michael Chirico <[email protected]>
Closes #7571
Added a condition to check whether `ans` already holds NA when the element is not NA.
Removed the `break` statement that ran when the element is NA, as it caused the bug for consecutive NA values in the same partition but in different groups.