fix: share unified memory pools across native execution contexts within a task #3924
Open
andygrove wants to merge 12 commits into apache:main from
Conversation
…ation When Comet executes a shuffle, it creates two separate native plans (the child plan and the shuffle writer plan) that run concurrently in a pipelined fashion. Previously, each plan got its own memory pool at the full per-task limit, effectively allowing 2x the intended memory to be consumed. The new `fair_unified_task_shared` pool type shares a single CometFairMemoryPool across all native plans within the same Spark task. This ensures the total memory stays within the per-task limit while dynamically distributing memory among operators based on how many register as memory consumers (e.g. if the child plan is a simple scan+filter, the shuffle writer gets 100% of the pool). This is now the default for off-heap mode. Closes apache#3921 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
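The fair distribution described in this commit message can be illustrated with a minimal sketch. The function name `per_consumer_limit` is hypothetical and the logic is a simplification of how a fair pool caps each registered consumer; it is not Comet's actual implementation:

```rust
// Hypothetical illustration of a fair pool's per-consumer cap: a fixed
// budget is divided evenly among registered memory consumers, so if only
// the shuffle writer registers, it may use the whole pool.
fn per_consumer_limit(pool_size: u64, num_consumers: u64) -> u64 {
    if num_consumers == 0 {
        pool_size // nothing registered yet; no cap to enforce
    } else {
        pool_size / num_consumers
    }
}

fn main() {
    // 8 GiB task pool, only the shuffle writer registered as a consumer:
    assert_eq!(per_consumer_limit(8 << 30, 1), 8 << 30); // gets 100%
    // Two consumers (a child-plan operator plus the shuffle writer):
    assert_eq!(per_consumer_limit(8 << 30, 2), 4 << 30); // half each
    println!("ok");
}
```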
When using fair_unified_task_shared, multiple execution contexts on the same thread share a single Arc<dyn MemoryPool>. The tracing code was summing pool.reserved() for each registered context, double-counting the shared pool and reporting 2x the actual memory reservation. Deduplicate pools by Arc data pointer before summing so each underlying pool is only counted once.
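The deduplication described here can be sketched in plain Rust. The `MemoryPool` trait below is a minimal stand-in for DataFusion's trait (only `reserved()` is modeled), and `total_reserved` is an illustrative helper, not Comet's actual function:

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Minimal stand-in for DataFusion's MemoryPool trait; only the
// `reserved` accessor matters for this sketch.
trait MemoryPool {
    fn reserved(&self) -> usize;
}

struct FixedPool(usize);
impl MemoryPool for FixedPool {
    fn reserved(&self) -> usize {
        self.0
    }
}

// Sum `reserved()` across pools, counting each underlying pool once even
// when several execution contexts hold clones of the same `Arc`.
fn total_reserved(pools: &[Arc<dyn MemoryPool>]) -> usize {
    let mut seen = HashSet::new();
    pools
        .iter()
        // `Arc::as_ptr` identifies the underlying allocation; the cast to
        // `*const ()` discards the trait-object vtable metadata.
        .filter(|p| seen.insert(Arc::as_ptr(p) as *const () as usize))
        .map(|p| p.reserved())
        .sum()
}

fn main() {
    let shared: Arc<dyn MemoryPool> = Arc::new(FixedPool(100));
    // Two contexts share one pool; a third context has its own.
    let pools = vec![
        shared.clone(),
        shared.clone(),
        Arc::new(FixedPool(50)) as Arc<dyn MemoryPool>,
    ];
    // Naive summing would report 250; deduplicated summing reports 150.
    println!("{}", total_reserved(&pools));
}
```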
Make fair_unified_task_shared opt-in rather than the default to simplify review. Update docs to reflect the new default.
Add context about how Comet creates two concurrent native plans per Spark task during shuffle and why this matters for pool selection.
comphead
reviewed
Apr 13, 2026
| "The type of memory pool to be used for Comet native execution when running Spark in " + | ||
| "off-heap mode. Available pool types are `greedy_unified` and `fair_unified`. " + | ||
| "off-heap mode. Available pool types are `greedy_unified`, `fair_unified`, and " + | ||
| "`fair_unified_task_shared`. The `fair_unified_task_shared` pool is shared across " + |
Contributor
read the description twice, it is difficult to understand without example what exactly fair_unified_task_shared helps to achieve
Member
Author
Updated. Let me know if this is clearer now.
comphead
reviewed
Apr 13, 2026
Comet provides multiple memory pool implementations. The type of pool can be specified with `spark.comet.exec.memoryPool`.
When Comet executes a shuffle, it creates two separate native plans within the same Spark task: the child plan
Contributor
maybe it is not a plan, and rather a stage DAG?
plan is per operator: scan-> shuffle -> write
stage DAG is different, the plan above split into
Stage1: scan + shuffle write
Stage2: shuffle read + write
Member
Author
It is really DataFusion execution contexts. I updated.
comphead
reviewed
Apr 13, 2026
comphead
reviewed
Apr 13, 2026
native/core/src/execution/jni_api.rs
Outdated
let mut seen = std::collections::HashSet::new();
pools
    .values()
    .filter(|p| seen.insert(Arc::as_ptr(p) as *const () as usize))
- Clarify terminology: replace "native plans" with "execution contexts" to avoid confusion with Spark plan/stage concepts - Rewrite fair_unified_task_shared config description with concrete example of the 2x memory problem during shuffle - Use filter_map with then() for pool deduplication
Add greedy_unified_task_shared pool type for completeness alongside fair_unified_task_shared. Both _task_shared variants share a single memory pool across all native execution contexts in the same Spark task, preventing 2x memory consumption during shuffle.
Instead of adding new _task_shared pool variants, fix the existing fair_unified and greedy_unified pools to share a single pool instance across all native execution contexts within the same Spark task. This fixes the bug where concurrent execution contexts (e.g. pre-shuffle operators and shuffle writer) could each allocate up to the full per-task memory limit independently.
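A minimal sketch of the task-shared mechanism this commit describes, assuming pools are kept in a process-wide map keyed by the Spark task attempt id. The names (`Pool`, `pool_for_task`, `registry`) are illustrative, not Comet's actual API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Illustrative pool type; the real pool would be a `dyn MemoryPool`.
struct Pool {
    limit: usize,
}

// Process-wide registry of per-task pools, lazily initialized.
fn registry() -> &'static Mutex<HashMap<i64, Arc<Pool>>> {
    static POOLS: OnceLock<Mutex<HashMap<i64, Arc<Pool>>>> = OnceLock::new();
    POOLS.get_or_init(|| Mutex::new(HashMap::new()))
}

// Return the task's shared pool, creating it on first use. Every
// execution context created for the same task attempt id receives a
// clone of the same `Arc`, so the per-task limit is enforced once.
fn pool_for_task(task_attempt_id: i64, limit: usize) -> Arc<Pool> {
    let mut map = registry().lock().unwrap();
    Arc::clone(
        map.entry(task_attempt_id)
            .or_insert_with(|| Arc::new(Pool { limit })),
    )
}

fn main() {
    // Two execution contexts in task 42 get the same pool instance:
    let a = pool_for_task(42, 8 << 30);
    let b = pool_for_task(42, 8 << 30);
    assert!(Arc::ptr_eq(&a, &b));
    println!("shared limit: {}", a.limit);
}
```

The real mechanism would also need to remove the entry when the task finishes, so pools do not leak across task attempts.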
mbutrovich
reviewed
Apr 13, 2026
mbutrovich
reviewed
Apr 13, 2026
mbutrovich
approved these changes
Apr 13, 2026
Contributor
mbutrovich
left a comment
Minor nits, thanks @andygrove!
Which issue does this PR close?
Closes #3921
Rationale for this change
When Comet executes a shuffle, it creates two native execution contexts that run concurrently within the same Spark task (e.g. one for the pre-shuffle operators and one for the shuffle writer). Previously, each context created its own memory pool with the full per-task memory limit, effectively allowing 2x the intended memory to be consumed. This is a performance bug rather than a correctness issue, but it causes significantly higher memory usage than expected, leading to OOM errors that can only be worked around by over-provisioning memory.
What changes are included in this PR?
This PR makes the `fair_unified` and `greedy_unified` memory pools task-shared, so a single pool instance is reused across all native execution contexts within the same Spark task. This uses the same `TASK_SHARED_MEMORY_POOLS` mechanism that the on-heap `greedy_task_shared` and `fair_spill_task_shared` pools already use.

It also fixes a tracing bug where `total_reserved_for_thread()` and `unregister_and_total()` double-counted memory when multiple execution contexts shared the same pool `Arc`. These functions now deduplicate by `Arc` data pointer before summing `reserved()`.

No new configuration options are added. The existing `fair_unified` (default) and `greedy_unified` pool names are unchanged. There is no functional change: queries produce the same results, but memory usage is now correctly bounded.

How are these changes tested?
With this change I was able to run TPC-H and TPC-DS @ 1TB with just 8GB off-heap memory. Previously I was seeing OOM at 16 GB.