@manmita
Contributor

manmita commented Jan 7, 2026

Closes #7571

Added a condition so that a non-NA element is only added when the group's ans is not already NA.
Removed the break statement that ran when an element was NA, as it caused a bug when consecutive NA values fell in the same partition but belonged to different groups.

@MichaelChirico
Member

Thanks @manmita! Please ping me when tests are passing, or if you're stuck.

NEWS.md Outdated

3. `fread("file://...")` works for file URIs with spaces, [#7550](https://github.com/Rdatatable/data.table/issues/7550). Thanks @aitap for the report and @MichaelChirico for the PR.

4. Fixed a bug in GForce grouped sum for `integer64` columns where `sum(value, na.rm = FALSE)` did not always return `NA` when any `NA_integer64_` was present in a group. [#7571](https://github.com/Rdatatable/data.table/issues/7571). Now, groups containing any `NA_integer64_` correctly return `NA`, matching base R behavior. Thanks to @rweberc for the report and @MichaelChirico for reproducible examples.
Member

You could as well mention yourself here, and ideally write what was the root cause of that so readers are more likely to understand if they could have been affected

Contributor Author

Thanks, will do

@codecov

codecov bot commented Jan 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.99%. Comparing base (a54ee8d) to head (2365bdf).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7572   +/-   ##
=======================================
  Coverage   98.99%   98.99%           
=======================================
  Files          87       87           
  Lines       16729    16729           
=======================================
  Hits        16561    16561           
  Misses        168      168           

@manmita
Contributor Author

manmita commented Jan 7, 2026

Hello @MichaelChirico,

There is a coverage issue in dogroups.c.

id = c(1:8, 9, 9),
value = c(rep(NA_integer64_, 3L), as.integer64(4:10))
)
test(2358.1,
Member

Use test(options=c(datatable.optimize=2L)) to emphasize that this was a GForce problem

Contributor Author

Thanks, will do

NEWS.md Outdated

3. `fread("file://...")` works for file URIs with spaces, [#7550](https://github.com/Rdatatable/data.table/issues/7550). Thanks @aitap for the report and @MichaelChirico for the PR.

4. - Fixed a bug in GForce grouped sum for `integer64` columns where `sum(value, na.rm = FALSE)` when consecutive NA values are present in the same partition but different group. In that case only the first group was handled ([#7571](https://github.com/Rdatatable/data.table/issues/7571)). Thanks to @rweberc for the report, @MichaelChirico for reproducible examples, and @manmita for the fix.
Member

I didn't want to anchor the NEWS too much on implementation details -- user reading this likely doesn't care all that much.

It still feels a bit lacking though -- maybe there's a more concise way to explain it?

What is the exact nature of the bug from a user point of view? Is it true that it always results in a group having "sum" 0 instead of the correct NA_integer64_?

Contributor Author

@manmita, Jan 7, 2026

In the code we have a variable called howMany, which holds how many elements are in the current batch.
If each id group has 2 rows but batching is done in chunks of 4, then the sum becomes 0 for everything in that batch after the first NA encountered, although the first NA does set its own id group to the correct value.

For dt_short[, result_problem := sum(value, na.rm = FALSE), by = id], if id is 1,1,2,2,3,3 and value is NA,NA,NA,1,2,3,
then the result is NA, 0, 5 if the batch size is 4.
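
The batching behaviour described above can be modelled with a small C sketch. This is a simplified, hypothetical model and not the actual gsum code: the name buggy_gsum_sketch, the flat batching over consecutive rows, and the 0-based group indices are assumptions made for illustration, and the real batching scheme may differ. The only part taken from this PR is the idea that the inner loop stopped at the first NA it saw.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the pre-fix accumulation (not the real gsum code).
 * x:   values, with INT64_MIN standing in for NA_integer64_
 * grp: 0-based group index of each row
 * ans: per-group sums, zero-initialised by the caller */
static void buggy_gsum_sketch(const int64_t *x, const int *grp, int n,
                              int batch, int64_t *ans)
{
  assert(batch > 0);
  for (int start = 0; start < n; start += batch) {
    /* howMany: number of rows in this batch */
    const int howMany = (n - start < batch) ? n - start : batch;
    for (int i = 0; i < howMany; i++) {
      const int64_t elem = x[start + i];
      if (elem != INT64_MIN) {
        ans[grp[start + i]] += elem;
      } else {
        ans[grp[start + i]] = INT64_MIN;
        break;  /* skips the rest of the batch, including rows of other groups */
      }
    }
  }
}
```

Called with values NA, NA, NA, 1, 2, 3, groups 0, 0, 1, 1, 2, 2 and a batch size of 4, the sketch leaves the sums at NA, 0, 5, the result reported above; the second group should have been NA.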

Contributor Author

@manmita, Jan 7, 2026

From the user's perspective, I guess we can say that when grouping over NA values with parallel processing, multiple groups containing NA values can fall into the same batch, but the break means only the first group is handled.

Member

That we do things in parallel/in batches should not matter to a user who just wants correct code.

> then the result is na, 0, 5 if the batch size is 4

It looks like it confirms my diagnosis -- the error always appears as an incorrect 0. Is it not possible that the NA_integer64_ is encountered later after the sum has already been initialized?

Contributor Author

@manmita, Jan 7, 2026

Actually, an NA can still appear later if it is in a different batch: for example, NA, NA, NA, NA, NA, NA for ids 1,2,3,4,5,6 with batch size 4 gives NA, 0, 0, 0, NA, 0.

Contributor Author

But maybe we can describe it as an NA followed by zeros in the case of consecutive NAs.

@manmita
Contributor Author

manmita commented Jan 7, 2026

Hi @MichaelChirico, the coverage check is failing at src/dogroups.c, which was fixed by @ben-schwen in commit d6fb16c.

The Sentry report lists only that file. I was wondering how to add this to this PR?

@MichaelChirico
Member

> Hi @MichaelChirico, the coverage check is failing at src/dogroups.c, which was fixed by @ben-schwen in commit d6fb16c.
>
> The Sentry report lists only that file. I was wondering how to add this to this PR?

You can ignore it here.

@MichaelChirico
Member

This bug is really quite serious I think.

Consider:

DT = data.table(id = sample(letters, 1000, TRUE), value = as.integer64(sample(c(1:100, NA), 1000, TRUE)))

DT[,
  .(gforce_sum = sum(value)), by=id
][
  DT[, .(true_sum = base::sum(value)), by=id],
  on='id'
][
  gforce_sum != true_sum | is.na(gforce_sum) != is.na(true_sum)
]
#         id           gforce_sum true_sum
#     <char>                <i64>    <i64>
#  1:      p                 1768     2188
#  2:      g                 2489     2684
#  3:      t                 1451     1922
#  4:      h -9223372036854775096     <NA>
#  5:      n                  687     2410
#  6:      m                  266     1556
#  7:      k                  986     2101
#  8:      y                 1035     1398
#  9:      q -9223372036854774756     <NA>
# 10:      d                 1650     <NA>
# 11:      c -9223372036854774942     <NA>
# 12:      f                 1073     1674
# 13:      o                  710     1513
# 14:      v                  760     1660
# 15:      l                 1076     <NA>
# 16:      e                 1089     1736
# 17:      w -9223372036854774951     <NA>
# 18:      i                 1204     <NA>
# 19:      j                 1478     <NA>

@MichaelChirico
Member

It also affects mean(), and Cfastmean appears affected too

@manmita
Contributor Author

manmita commented Jan 8, 2026

Yeah, seems like it.

I will think it through and check the example cases and the code again.

Thanks

@manmita
Contributor Author

manmita commented Jan 8, 2026

The current example does get fixed by the changes made in this branch:

> DT[,
  .(gforce_sum = sum(value)), by=id
][
  DT[, .(true_sum = base::sum(value)), by=id],
  on='id'
][
  gforce_sum != true_sum | is.na(gforce_sum) != is.na(true_sum)
]
Empty data.table (0 rows and 3 cols): id,gforce_sum,true_sum

But I think we should add more test cases and test it thoroughly, for mean and Cfastmean too.

I believe the current fix works because we iterate through every element given and add it only if the sum of its group is not already NA.

@MichaelChirico
Member

yes, sorry, to clarify, my example is from master, not this PR. I'm just showing how bad the behavior is before the fix.

It would be good to ensure the same issue does not happen elsewhere.

@manmita
Contributor Author

manmita commented Jan 8, 2026

Yeah, true.

I think extensive testing is needed for this case; I will write some tests on my end and check mean/fastmean too.

@manmita
Contributor Author

manmita commented Jan 8, 2026

in the gmean code - line 619

 _ans[my_low[i]] += my_gx[i];  // let NA propagate when !narm

This adds the NA to the sum, which creates an issue for int64 (my_gx[i] is the data item): if the NA sentinel INT64_MIN is added to, say, 2000, the result is no longer INT64_MIN, so that should give wrong values. This is just speculation for now.

I will check it out!

I was wondering whether the fix for this should go in a separate PR or be part of this PR?
@MichaelChirico
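
The sentinel arithmetic mentioned above can be illustrated with a minimal C sketch. The helper names add_unchecked and add_na_aware are hypothetical and exist only for this illustration; the sketch says nothing about which accumulator type gmean actually uses, only about what happens when INT64_MIN is used as an NA marker in plain int64 addition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: add b to a the way a plain `+=` would, with no NA check.
 * The asserts only guard against signed overflow, which is undefined in C. */
static int64_t add_unchecked(int64_t a, int64_t b)
{
  assert(!(b > 0 && a > INT64_MAX - b));
  assert(!(b < 0 && a < INT64_MIN - b));
  return a + b;
}

/* Hypothetical helper: treat INT64_MIN as NA and keep it sticky. */
static int64_t add_na_aware(int64_t a, int64_t b)
{
  if (a == INT64_MIN || b == INT64_MIN) return INT64_MIN;
  return add_unchecked(a, b);
}
```

With these helpers, add_unchecked(INT64_MIN, 2000) is INT64_MIN + 2000 and no longer reads as NA, while add_na_aware(INT64_MIN, 2000) stays INT64_MIN. Similarly, INT64_MIN + 712 is -9223372036854775096, which is consistent with the large negative gforce_sum values in the table earlier in this thread.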

@MichaelChirico
Member

> In the gmean code, line 619:
>
>  _ans[my_low[i]] += my_gx[i];  // let NA propagate when !narm
>
> This adds the NA to the sum, which creates an issue for int64 (my_gx[i] is the data item): if the NA sentinel INT64_MIN is added to, say, 2000, the result is no longer INT64_MIN, so that should give wrong values. This is just speculation for now.
>
> I will check it out!
>
> I was wondering whether the fix for this should go in a separate PR or be part of this PR? @MichaelChirico

Good question. For now, let's put it in this PR. The code change seems like it will be small enough, and the new tests feel logically connected.

@ben-schwen
Member

ben-schwen commented Jan 8, 2026

Sorry that I'm late to that party, but wouldn't it be easier (and also maybe easier to grasp) to mirror the INTSXP case with int64 without the overflow part?

e.g. remove the if (!narm) condition and the else branch and just use

            for (int i=0; i<howMany; i++) {
              const int64_t a = _ans[my_low[i]];
              if (a==INT64_MIN) continue;
              const int64_t b = my_gx[i];
              if (b==INT64_MIN) {
                if (!narm) _ans[my_low[i]]=INT64_MIN;
                continue;
              }
              _ans[my_low[i]] += b;
            }
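
As a rough check of the loop suggested above, here is a simplified harness around it. The function name gsum_int64_sketch, the flat batching over consecutive rows, and the way my_gx and my_low are derived are assumptions made for illustration; only the body of the inner loop follows the suggestion, and the real gsum code may set these pointers up differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical harness (not the real gsum code).
 * x:   values, with INT64_MIN standing in for NA_integer64_
 * grp: 0-based group index of each row
 * ans: per-group sums, zero-initialised by the caller */
static void gsum_int64_sketch(const int64_t *x, const int *grp, int n,
                              int batch, bool narm, int64_t *ans)
{
  assert(batch > 0);
  for (int start = 0; start < n; start += batch) {
    const int howMany = (n - start < batch) ? n - start : batch;
    const int64_t *my_gx = x + start;  /* values of this batch */
    const int *my_low = grp + start;   /* group index of each value */
    for (int i = 0; i < howMany; i++) {
      const int64_t a = ans[my_low[i]];
      if (a == INT64_MIN) continue;    /* group is already NA: leave it alone */
      const int64_t b = my_gx[i];
      if (b == INT64_MIN) {
        if (!narm) ans[my_low[i]] = INT64_MIN;  /* NA affects this group only */
        continue;                      /* keep going: no break */
      }
      ans[my_low[i]] += b;
    }
  }
}
```

On the earlier example (values NA, NA, NA, 1, 2, 3 for groups 0, 0, 1, 1, 2, 2 with a batch size of 4), this sketch gives NA, NA, 5 when narm is false and 0, 1, 5 when narm is true.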

gforce = DT[, .(gforce_sum = sum(value)), by=id]
base = DT[, .(true_sum = base::sum(value)), by=id]
merged = merge(gforce, base, by="id", all=TRUE)
test(2358.4, options=c(datatable.optimize=2L),
Member

I think we can make this just

test(2358.4, options=c(datatable.optimize=2L),
     merged$gforce_sum, merged$true_sum)

@manmita
Contributor Author

manmita commented Jan 8, 2026

I think mean is working correctly: I tested it with similar examples and the mean code gives correct results.
Reason: INT_MIN / INT64_MIN is used as the sentinel value, so adding anything to it is still NA.

I also updated the gsum code to follow its int counterpart.

Member

@MichaelChirico left a comment

Thanks! I think it's ready to go. We can iterate more in future bugs/PRs.

@manmita
Contributor Author

manmita commented Jan 8, 2026

Thank you @MichaelChirico @ben-schwen @jangorecki for your help and suggestions <3

@MichaelChirico MichaelChirico merged commit b472d86 into Rdatatable:master Jan 8, 2026
10 of 11 checks passed
@MichaelChirico
Member

Thanks @manmita. I've invited you to be a Project Member.

Once you accept, you'll be able to create branches directly on this repo (Rdatatable/data.table). Please do so going forward, as it makes collaboration a bit easier.

I'll also file a follow-up PR adding you as a contributor in the DESCRIPTION.

manmita added a commit to manmita/data.table that referenced this pull request Jan 8, 2026
…#7572)

* fix(7571): bug fix for narm issue on gforce in int64 case

* fix(7571): test sequencing

* fix(7571): updated the NEWS.md

* trailing newline

* Use $V1

* fix(7571): added db optimize 2L

* refine NEWS

* fix(7571): add more tests and change to code similar to int for gsum

* fix(7571): added more tests for mean

* eliminate intermediate variable

* NEWS again

---------

Co-authored-by: Michael Chirico <[email protected]>

Development

Successfully merging this pull request may close these issues.

sum(value, na.rm = FALSE) returns unexpected 0s for integer64 field within NAs in group

4 participants