-
Notifications
You must be signed in to change notification settings - Fork 65
Description
What happens?
Calling con.table(name) on a name registered via con.register() immediately executes and materialises the entire result set, rather than returning a lazy relation. The equivalent con.sql("FROM name") returns a lazy relation with no immediate execution.
The description (from the docs) of the con.table() method is:
"Create a relation object for the named table"
So I can see an argument that this is not a bug. And the user probably should be using con.view(name).
But if it is not a bug, I think con.table() should raise an error when called against a view, or the docs should explicitly say that it will create a table corresponding to the view.
However, I think it is probably unintentional, and it's a bit of a footgun - DuckDB is usually lazy, and nowhere does it explicitly say that a call to con.table(name) will materialise the full table.
I am not a c++ programmer, and the following comes from investigation using an LLM, but the c++ seems to substantiate the view that this behaviour is unintentional
Evidence this behaviour is unintentional
Why this appears to be a code oversight rather than intended behaviour
RunQuery (the internal method that con.sql() routes through) already has a lazy path. It only takes that path when the params argument is None:
// In RunQuery — the lazy path is explicitly gated on params being None:
if (py::none().is(params)) {
// creates a lazy QueryRelation — no execution yet
}
// otherwise: executes immediately and materialises into a ColumnDataCollectionWhen con.sql() is exposed to Python, params is explicitly defaulted to None to ensure the lazy path is taken:
m.def("sql", &DuckDBPyConnection::RunQuery,
py::arg("query"), py::kw_only(), py::arg("alias") = "", py::arg("params") = py::none());However, when con.table() falls back to RunQuery for views, it omits the params argument entirely:
unique_ptr<DuckDBPyRelation> DuckDBPyConnection::Table(const string &tname) {
try {
return make_uniq<DuckDBPyRelation>(connection.Table(...)); // fails for views
} catch (const CatalogException &) {
auto sql_query = StringUtil::Format("from %s", ...);
return RunQuery(py::str(sql_query), tname); // <-- params not passed
}
}When params is omitted, the C++ default kicks in — which is py::list(), not py::none(). This means the lazy branch is never taken, and the result is always fully materialised.
The lazy infrastructure is in place and working correctly for con.sql(). The fallback in Table() appears to have simply omitted the py::none() argument that would enable it.
To Reproduce
Run with uv run this_script.py
# /// script
# dependencies = [
# "duckdb==1.4.4",
# "psutil",
# ]
# ///
"""
Reproducer for: con.table() on a registered view silently materialises all rows.
con.register() creates a temporary VIEW. When con.table() is called on that view,
it falls back to RunQuery() but passes py::list() (the C++ default) instead of
py::none() for the `params` argument. This skips the lazy QueryRelation path and
instead fully executes + materialises the query.
Expected: con.table("name") returns a lazy relation (same as con.sql("FROM name")).
Actual: con.table("name") materialises the entire result set, taking O(n) time.
"""
import os
import time
import duckdb
import psutil
def rss_mb():
"""Current process RSS in MB."""
return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024
con = duckdb.connect()
# Create a table with 100M rows (~800 MB of BIGINT data)
con.sql("CREATE TABLE big AS SELECT range AS id FROM range(100_000_000)")
rel = con.sql("SELECT * FROM big")
con.register("my_view", rel)
# --- con.sql(): lazy, no materialisation ---
mem_before_sql = rss_mb()
start = time.perf_counter()
r1 = con.sql("FROM my_view")
t_sql = time.perf_counter() - start
mem_after_sql = rss_mb()
# --- con.table(): materialises all rows ---
mem_before_table = rss_mb()
start = time.perf_counter()
r2 = con.table("my_view")
t_table = time.perf_counter() - start
mem_after_table = rss_mb()
# --- Show the relation types (proves materialisation) ---
print("=== Relation types ===")
print(
f" con.sql() returns: {type(r1.rel).__name__ if hasattr(r1, 'rel') else type(r1).__name__}"
)
print(
f" con.table() returns: {type(r2.rel).__name__ if hasattr(r2, 'rel') else type(r2).__name__}"
)
print()
# --- Timing ---
print("=== Timing ===")
print(f" con.sql('FROM my_view'): {t_sql:.4f}s")
print(f" con.table('my_view'): {t_table:.4f}s")
print(f" table() is {t_table / t_sql:.0f}x slower")
print()
# --- Memory ---
print("=== Process RSS memory ===")
print(f" Before con.sql(): {mem_before_sql:.0f} MB")
print(
f" After con.sql(): {mem_after_sql:.0f} MB (delta: +{mem_after_sql - mem_before_sql:.0f} MB)"
)
print(f" Before con.table(): {mem_before_table:.0f} MB")
print(
f" After con.table(): {mem_after_table:.0f} MB (delta: +{mem_after_table - mem_before_table:.0f} MB)"
)
print()
# --- EXPLAIN shows the difference ---
print("=== EXPLAIN con.sql() result (lazy — references the view) ===")
print(r1.explain())
print()
print(
"=== EXPLAIN con.table() result (materialised — scans a ColumnDataCollection) ==="
)
print(r2.explain())
OS:
Mac OS Tahoe 26.3
DuckDB Package Version:
1.4.4
Python Version:
3.14.0
Full Name:
Robin Linacre
Affiliation:
UK Ministry of Justice
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration to reproduce the issue?
- Yes, I have
Ref: moj-analytical-services/uk_address_matcher#253, where I first fell foul of this behaviour