fetch_record_batch is 2x/3x slower than raw pyArrow #359

@Lklmnovice

Description

What happens?

Hi Team,

First of all, thanks for all the hard work on DuckDB, it's an amazing product.

From my testing, DuckDB slows down significantly when querying parquet data and serving the result as record batches. I'm not sure of the exact cause, but consuming the result this way is usually 2x-3x slower than Polars/PyArrow.

You can reproduce it by saving the script below and running uv run xx.py

To Reproduce

# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "duckdb==1.4.4",
#     "polars==1.38.1",
#     "pyarrow==23.0.1",
# ]
# ///
import pyarrow.parquet as pq
import pyarrow.dataset  # required so pa.dataset is available below
import pyarrow as pa

import duckdb
import polars as pl
from itertools import permutations
import time
from contextlib import contextmanager



pa.show_info()

# Create a parquet file with all permutations of 'abcdefghijk'
perms = list(permutations('abcdefghijk'))
print(len(perms))
data = {'permutation': [''.join(p) for p in perms]}
table = pa.table(data)
pq.write_table(table, 'alphabet_test.parquet')


time_taken = {}
@contextmanager
def timer(name):
    start = time.time()
    try:
        yield
    finally:
        elapsed = time.time() - start
        print(f"{name}: {elapsed:.6f}s")
        time_taken[name] = elapsed


# Test DuckDB fetch_arrow
with timer("DuckDB fetch_arrow_table"):
    reader = duckdb.read_parquet('alphabet_test.parquet').fetch_arrow_table(batch_size=100_000)

# Test DuckDB fetch_record_batch
with timer("DuckDB fetch_record_batch"):
    reader = duckdb.read_parquet('alphabet_test.parquet').fetch_arrow_reader(batch_size=100_000)
    batches_duckdb = []
    for batch in reader:
        batches_duckdb.append(batch)

# Test PyArrow RecordBatchReader
with timer("PyArrow RecordBatchReader"):
    ds = pa.dataset.dataset('alphabet_test.parquet')
    reader = pa.dataset.Scanner.from_dataset(ds, batch_size=100_000).to_reader()
    batches_pyarrow = []
    for batch in reader:
        batches_pyarrow.append(batch)

# Test Polars + PyArrow RecordBatchReader
with timer("Polars + PyArrow RecordBatchReader"):
    ds = pl.scan_parquet('alphabet_test.parquet').collect_batches(chunk_size=100_000)
    reader = pa.RecordBatchReader.from_stream(ds)
    batches_pyarrow = []
    for batch in reader:
        batches_pyarrow.append(batch)

print("\nSummary of time taken:")
for name, elapsed in time_taken.items():
    print(f"{name}: {elapsed:.6f}s")
print(f"DuckDB fetch_record_batch was {time_taken['DuckDB fetch_record_batch'] / time_taken['PyArrow RecordBatchReader']:.2f}x slower than PyArrow RecordBatchReader")

OS:

Darwin arm64

DuckDB Version:

1.4.4

DuckDB Client:

python

Hardware:

Apple M4

Full Name:

Valentino Chen

Affiliation:

Personal

Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?

  • Yes, I have

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant data sets for reproducing the issue?

  • No - Other reason (please specify in the issue body)
