Skip to content

Conversation

@ben-schwen
Copy link
Member

@ben-schwen ben-schwen commented Dec 27, 2025

Closes #635

  • basic functionality dt[i, .ROW := NULL]
  • support by clauses dt[i, .ROW := NULL, by]
  • documentation
  • NEWS
  • tests

Currently only supports deletion of i, but could also be integrated into dogroups.

Do we need/want a functional form like delete(x, irows). Depending on what we allow for irows this would need a rewrite/duplicate of the internals of [i evaluation.

@codecov
Copy link

codecov bot commented Dec 27, 2025

Codecov Report

❌ Patch coverage is 91.72932% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.91%. Comparing base (dc69b3d) to head (57a4bdc).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/deleterows.c 90.09% 10 Missing ⚠️
src/wrappers.c 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7536      +/-   ##
==========================================
- Coverage   98.97%   98.91%   -0.06%     
==========================================
  Files          87       88       +1     
  Lines       16733    16853     +120     
==========================================
+ Hits        16561    16670     +109     
- Misses        172      183      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link

github-actions bot commented Dec 27, 2025

  • HEAD=delete_by_ref slower P<0.001 for DT[by,verbose=TRUE] improved in #6296
    Comparison Plot

Generated via commit 57a4bdc

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 2 minutes and 41 seconds
Installing different package versions 42 seconds
Running and plotting the test cases 3 minutes and 52 seconds

@jangorecki

This comment was marked as resolved.

@ben-schwen
Copy link
Member Author

What if columns are not marked as resizable? AFAIU delete rows will have to create a resizable copy in order to remove it. Right? In such case I think we may want to provide a function to mark resizable explicitly, ahead of time, so further processing is more predicable.

Yes. I have added allocrow as a pendant to alloccol. In the future we can also use this to overallocate the number of rows so we go implement the fast insert #660

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. setallocrow needs to return invisibly.
  2. I was expecting following example to show difference, it does not
x1 = data.table(a = 1:2)
x2 = x1

y1 = data.table(a = 1:2)
invisible(setallocrow(y1))
y2 = y1

delete_first_row = function(x) x[1L, .ROW := NULL]

delete_first_row(x1)
delete_first_row(y1)

x2
y2

could verbose=T tell when object is re-allocated (as resizable) rather than truly updated in place?

we need example that shows why setallocrow() might be desired.

  1. does it make sense to mark resizable already in as.data.table / fread / rbindlist?

@ben-schwen
Copy link
Member Author

I have added verbose messages to allocrow. I'm also not convinced how useful allocrow is right now. In the future yes, it will be essential when overallocating the number rows, respectively shortening overallocated rows.

library(data.table)
dt = data.table(a = 1:5)
truelength(dt$a)
#> [1] 0
setallocrow(dt)
truelength(dt$a)
#> [1] 5
dt[3:5, .ROW := NULL]
truelength(dt$a)
#> [1] 5
setallocrow(dt)
truelength(dt$a)
#> [1] 2

Created on 2025-12-29 with reprex v2.1.1

Thinking about the bigger picture of fast insert we will need to think about how much we want to overallocate when inserting respectively when we want to tidy up.


Note that for efficiency no check is performed for duplicate assignments, i.e. if multiple values are passed for assignment to the same index, assignment to this index will occur repeatedly and sequentially; for a given use case, consider whether it makes sense to create your own test for duplicates, e.g. in production code.

Note that \code{.ROW := NULL} is a special case used to delete rows by reference. Unlike column assignment, this requires an \code{i} expression to specify which rows to delete, and does not support \code{by} or \code{keyby}. See \code{\link{.ROW}} or \code{\link{special-symbols}} for details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside for future consideration -- another use case would be DT[<optional i>, by=<grp>, having=<condition>, .ROW := NULL]

Copy link
Member

@MichaelChirico MichaelChirico Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, be sure to point to ?setallocrow as the functional equivalent

Oh, I misunderstood

\description{
\code{.SD}, \code{.BY}, \code{.N}, \code{.I}, \code{.GRP}, and \code{.NGRP} are \emph{read-only} symbols for use in \code{j}. \code{.N} can be used in \code{i} as well. \code{.I} can be used in \code{by} as well. See the vignettes, Details and Examples here and in \code{\link{data.table}}.
\code{.EACHI} is a symbol passed to \code{by}; i.e. \code{by=.EACHI}, \code{.NATURAL} is a symbol passed to \code{on}; i.e. \code{on=.NATURAL}
\code{.ROW} is a symbol used with \code{:= NULL} to delete rows by reference; i.e. \code{DT[i, .ROW := NULL]} deletes the rows selected by \code{i}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just point to ?assign, the info here is a bit repetitive

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add a section to the Reference semantics vignette about this.

E.g., illustrating any issues that might obtain from shared references to data.tables, what happens here?

DT1 = data.table(i = 1:10)
DT2 = DT1

DT2[i > 5, .ROW := NULL]
nrow(DT1)

\code{setallocrow} is a utility function that prepares columns for fast row operations (delete or insert (not implemented yet)) by reference and manages row capacity.
Before deleting or inserting rows by reference, columns must be resizable.
\code{setallocrow} ensures all columns are in the appropriate state by converting ALTREP columns to materialized form and reallocating
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a bit of clarification -- when I read it, my thought was "oh, we can't do .ROW:=NULL on a table with ALTREP columns unless we first setallocrow()", but I don't see any examples/tests to that effect.

OTOH, I think (?) we strive to expand all ALTREP columns, so I haven't been able to come up with such an example

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We dont really need setallocrow yet since it is kind of down when non growable vectors are encountered. Now you can use it to free up the used space after deleting many columns.

Once we introduce fast insert, using it to allocate more seems the more convincing option.

Copy link
Member

@MichaelChirico MichaelChirico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is kind of 2 PRs -- one about setallocrow() and .ROW := NULL, are they totally independent?

Or perhaps, we need setallocrow() first in order to enable .ROW := NULL?

It might be easier to review in that sequence -- PR 1 adds setallocrow() (with no way to actually use it), PR 2 enables DELETE operations by reference. WDYT?

@jangorecki
Copy link
Member

jangorecki commented Dec 29, 2025

  1. table that is already marked resizable, when it will get new column added, it should lose/reset it's resizable marker, or make newly added columns resizable:
d = data.table(a = 4:1)
.Internal(inspect(d[["a"]]))
#@55b90c4c1c28 13 INTSXP g0c2 [REF(2)] (len=4, tl=0) 4,3,2,1
setallocrow(d)
.Internal(inspect(d[["a"]]))
#@55b90c4c1e68 13 INTSXP g0c2 [REF(1),gp=0x20] (len=4, tl=4) 4,3,2,1
d[, "x" := c(1,2,3,4)]
.Internal(inspect(d[["a"]]))
#@55b90c4c1e68 13 INTSXP g0c2 [REF(1),gp=0x20] (len=4, tl=4) 4,3,2,1
.Internal(inspect(d[["x"]]))
#@55b910172c58 14 REALSXP g0c3 [REF(2)] (len=4, tl=0) 1,2,3,4

Thinking about the bigger picture of fast insert we will need to think about how much we want to overallocate when inserting respectively when we want to tidy up.

I think it might be explicitly requested from user each time

  1. I was expecting we can show why one might want to use setallocrow, any example? the one I made in my last post appeared to not be working...

ben-schwen and others added 5 commits January 8, 2026 18:24
Co-authored-by: Michael Chirico <[email protected]>
Co-authored-by: Michael Chirico <[email protected]>
Co-authored-by: Michael Chirico <[email protected]>
Co-authored-by: Michael Chirico <[email protected]>
@ben-schwen
Copy link
Member Author

It looks like this is kind of 2 PRs -- one about setallocrow() and .ROW := NULL, are they totally independent?

Or perhaps, we need setallocrow() first in order to enable .ROW := NULL?

It might be easier to review in that sequence -- PR 1 adds setallocrow() (with no way to actually use it), PR 2 enables DELETE operations by reference. WDYT?

I guess its the other way around. As written in the other response the only real usage for setallocrow now is to resize growable vectors to shorten versions to free up memory. Once we add fast insert, it could lead to potential speed gains when because we only need to allocate more memory once and not multiple times when growing.

@ben-schwen
Copy link
Member Author

Maybe a somewhat crafted example, but I guess that shows the purpose of setallocrow. Might want to also show the truelength() of dt$a after each step.

library(lobstr)
gc()
m1 = mem_used()
DT = data.table(A = 1:1000000, B = rnorm(1000000))
gc()
m2 = mem_used()
DT[A > 1, .ROW := NULL]
gc()
m3 = mem_used()
setallocrow(DT, n = 0)
gc()
m4 = mem_used()
m1; m2; m3; m4
#           used (Mb) gc trigger (Mb) max used (Mb)
# Ncells  633664 33.9    1259332 67.3  1259332 67.3
# Vcells 2250976 17.2    8388608 64.0  2729386 20.9
#           used (Mb) gc trigger (Mb) max used (Mb)
# Ncells  635771 34.0    1259332 67.3  1259332 67.3
# Vcells 3766173 28.8    8388608 64.0  5009106 38.3
#           used (Mb) gc trigger (Mb) max used (Mb)
# Ncells  647477 34.6    1259332 67.3  1259332 67.3
# Vcells 3855038 29.5    8388608 64.0  8236948 62.9
#           used (Mb) gc trigger (Mb) max used (Mb)
# Ncells  647485 34.6    1259332 67.3  1259332 67.3
# Vcells 2355042 18.0    8388608 64.0  8236948 62.9
# 53.50 MB
# 65.73 MB
# 67.10 MB
# 55.10 MB

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Delete rows by reference

3 participants