-
Notifications
You must be signed in to change notification settings - Fork 1k
delete rows by reference #7536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
delete rows by reference #7536
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7536 +/- ##
==========================================
- Coverage 98.97% 98.91% -0.06%
==========================================
Files 87 88 +1
Lines 16733 16853 +120
==========================================
+ Hits 16561 16670 +109
- Misses 172 183 +11 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Generated via commit 57a4bdc Download link for the artifact containing the test results: ↓ atime-results.zip
|
This comment was marked as resolved.
This comment was marked as resolved.
Yes. I have added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- setallocrow needs to return invisibly.
- I was expecting following example to show difference, it does not
x1 = data.table(a = 1:2)
x2 = x1
y1 = data.table(a = 1:2)
invisible(setallocrow(y1))
y2 = y1
delete_first_row = function(x) x[1L, .ROW := NULL]
delete_first_row(x1)
delete_first_row(y1)
x2
y2could verbose=T tell when object is re-allocated (as resizable) rather than truly updated in place?
we need example that shows why setallocrow() might be desired.
- does it make sense to mark resizable already in
as.data.table/fread/rbindlist?
|
I have added verbose messages to library(data.table)
dt = data.table(a = 1:5)
truelength(dt$a)
#> [1] 0
setallocrow(dt)
truelength(dt$a)
#> [1] 5
dt[3:5, .ROW := NULL]
truelength(dt$a)
#> [1] 5
setallocrow(dt)
truelength(dt$a)
#> [1] 2Created on 2025-12-29 with reprex v2.1.1 Thinking about the bigger picture of fast |
|
|
||
| Note that for efficiency no check is performed for duplicate assignments, i.e. if multiple values are passed for assignment to the same index, assignment to this index will occur repeatedly and sequentially; for a given use case, consider whether it makes sense to create your own test for duplicates, e.g. in production code. | ||
|
|
||
| Note that \code{.ROW := NULL} is a special case used to delete rows by reference. Unlike column assignment, this requires an \code{i} expression to specify which rows to delete, and does not support \code{by} or \code{keyby}. See \code{\link{.ROW}} or \code{\link{special-symbols}} for details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside for future consideration -- another use case would be DT[<optional i>, by=<grp>, having=<condition>, .ROW := NULL]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, be sure to point to ?setallocrow as the functional equivalent
Oh, I misunderstood
| \description{ | ||
| \code{.SD}, \code{.BY}, \code{.N}, \code{.I}, \code{.GRP}, and \code{.NGRP} are \emph{read-only} symbols for use in \code{j}. \code{.N} can be used in \code{i} as well. \code{.I} can be used in \code{by} as well. See the vignettes, Details and Examples here and in \code{\link{data.table}}. | ||
| \code{.EACHI} is a symbol passed to \code{by}; i.e. \code{by=.EACHI}, \code{.NATURAL} is a symbol passed to \code{on}; i.e. \code{on=.NATURAL} | ||
| \code{.ROW} is a symbol used with \code{:= NULL} to delete rows by reference; i.e. \code{DT[i, .ROW := NULL]} deletes the rows selected by \code{i}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just point to ?assign, the info here is a bit repetitive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add a section to the Reference semantics vignette about this.
E.g., illustrating any issues that might obtain from shared references to data.tables, what happens here?
DT1 = data.table(i = 1:10)
DT2 = DT1
DT2[i > 5, .ROW := NULL]
nrow(DT1)| \code{setallocrow} is a utility function that prepares columns for fast row operations (delete or insert (not implemented yet)) by reference and manages row capacity. | ||
| Before deleting or inserting rows by reference, columns must be resizable. | ||
| \code{setallocrow} ensures all columns are in the appropriate state by converting ALTREP columns to materialized form and reallocating |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs a bit of clarification -- when I read it, my thought was "oh, we can't do .ROW:=NULL on a table with ALTREP columns unless we first setallocrow()", but I don't see any examples/tests to that effect.
OTOH, I think (?) we strive to expand all ALTREP columns, so I haven't been able to come up with such an example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We dont really need setallocrow yet since it is kind of down when non growable vectors are encountered. Now you can use it to free up the used space after deleting many columns.
Once we introduce fast insert, using it to allocate more seems the more convincing option.
MichaelChirico
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is kind of 2 PRs -- one about setallocrow() and .ROW := NULL, are they totally independent?
Or perhaps, we need setallocrow() first in order to enable .ROW := NULL?
It might be easier to review in that sequence -- PR 1 adds setallocrow() (with no way to actually use it), PR 2 enables DELETE operations by reference. WDYT?
d = data.table(a = 4:1)
.Internal(inspect(d[["a"]]))
#@55b90c4c1c28 13 INTSXP g0c2 [REF(2)] (len=4, tl=0) 4,3,2,1
setallocrow(d)
.Internal(inspect(d[["a"]]))
#@55b90c4c1e68 13 INTSXP g0c2 [REF(1),gp=0x20] (len=4, tl=4) 4,3,2,1
d[, "x" := c(1,2,3,4)]
.Internal(inspect(d[["a"]]))
#@55b90c4c1e68 13 INTSXP g0c2 [REF(1),gp=0x20] (len=4, tl=4) 4,3,2,1
.Internal(inspect(d[["x"]]))
#@55b910172c58 14 REALSXP g0c3 [REF(2)] (len=4, tl=0) 1,2,3,4
I think it might be explicitly requested from user each time
|
Co-authored-by: Michael Chirico <[email protected]>
Co-authored-by: Michael Chirico <[email protected]>
Co-authored-by: Michael Chirico <[email protected]>
Co-authored-by: Michael Chirico <[email protected]>
I guess its the other way around. As written in the other response the only real usage for |
|
Maybe a somewhat crafted example, but I guess that shows the purpose of library(lobstr)
gc()
m1 = mem_used()
DT = data.table(A = 1:1000000, B = rnorm(1000000))
gc()
m2 = mem_used()
DT[A > 1, .ROW := NULL]
gc()
m3 = mem_used()
setallocrow(DT, n = 0)
gc()
m4 = mem_used()
m1; m2; m3; m4
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 633664 33.9 1259332 67.3 1259332 67.3
# Vcells 2250976 17.2 8388608 64.0 2729386 20.9
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 635771 34.0 1259332 67.3 1259332 67.3
# Vcells 3766173 28.8 8388608 64.0 5009106 38.3
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 647477 34.6 1259332 67.3 1259332 67.3
# Vcells 3855038 29.5 8388608 64.0 8236948 62.9
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 647485 34.6 1259332 67.3 1259332 67.3
# Vcells 2355042 18.0 8388608 64.0 8236948 62.9
# 53.50 MB
# 65.73 MB
# 67.10 MB
# 55.10 MB |

Closes #635
dt[i, .ROW := NULL]byclausesdt[i, .ROW := NULL, by]Currently only supports deletion of
i, but could also be integrated intodogroups.Do we need/want a functional form like
delete(x, irows). Depending on what we allow forirowsthis would need a rewrite/duplicate of the internals of[ievaluation.