What happens?
Hi Team,
First of all, thanks for all the hard work on DuckDB, it's an amazing product.
From my testing, it seems that DuckDB slows down significantly when querying Parquet data and serving the result as record batches. I'm not sure of the exact cause, but it's usually 2x/3x slower than Polars/PyArrow.
You can reproduce it below with `uv run xx.py`.
To Reproduce
```python
# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "duckdb==1.4.4",
#     "polars==1.38.1",
#     "pyarrow==23.0.1",
# ]
# ///
import time
from contextlib import contextmanager
from itertools import permutations

import duckdb
import polars as pl
import pyarrow as pa
import pyarrow.dataset
import pyarrow.parquet as pq

pa.show_info()

# Create a parquet file with all permutations of 'abcdefghijk'
perms = list(permutations('abcdefghijk'))
print(len(perms))
data = {'permutation': [''.join(p) for p in perms]}
table = pa.table(data)
pq.write_table(table, 'alphabet_test.parquet')

time_taken = {}

@contextmanager
def timer(name):
    start = time.time()
    try:
        yield
    finally:
        elapsed = time.time() - start
        print(f"{name}: {elapsed:.6f}s")
        time_taken[name] = elapsed

# Test DuckDB fetch_arrow_table
with timer("DuckDB fetch_arrow_table"):
    table_duckdb = duckdb.read_parquet('alphabet_test.parquet').fetch_arrow_table(batch_size=100_000)

# Test DuckDB fetch_arrow_reader (record batches)
with timer("DuckDB fetch_record_batch"):
    reader = duckdb.read_parquet('alphabet_test.parquet').fetch_arrow_reader(batch_size=100_000)
    batches_duckdb = []
    for batch in reader:
        batches_duckdb.append(batch)

# Test PyArrow RecordBatchReader
with timer("PyArrow RecordBatchReader"):
    ds = pa.dataset.dataset('alphabet_test.parquet')
    reader = pa.dataset.Scanner.from_dataset(ds, batch_size=100_000).to_reader()
    batches_pyarrow = []
    for batch in reader:
        batches_pyarrow.append(batch)

# Test Polars + PyArrow RecordBatchReader
with timer("Polars + PyArrow RecordBatchReader"):
    polars_batches = pl.scan_parquet('alphabet_test.parquet').collect_batches(chunk_size=100_000)
    reader = pa.RecordBatchReader.from_stream(polars_batches)
    batches_polars = []
    for batch in reader:
        batches_polars.append(batch)

print("\nSummary of time taken:")
for name, elapsed in time_taken.items():
    print(f"{name}: {elapsed:.6f}s")
print(f"DuckDB fetch_record_batch is {time_taken['DuckDB fetch_record_batch'] / time_taken['PyArrow RecordBatchReader']:.2f}x slower than PyArrow RecordBatchReader")
```

OS:
Darwin arm64
DuckDB Version:
1.4.4
DuckDB Client:
python
Hardware:
Apple M4
Full Name:
Valentino Chen
Affiliation:
Personal
Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?
- Yes, I have
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant data sets for reproducing the issue?
No - Other reason (please specify in the issue body)