Background
PR #2521 added memory reservation debug logging (the `spark.comet.debug.memory` config and the `LoggingPool` wrapper). That PR also contained Python scripts for parsing and visualizing the memory debug logs, but those scripts were not merged. This issue tracks adding the analysis/visualization scripts as a follow-up.
Log Format
When `spark.comet.debug.memory=true` is set, the `LoggingPool` produces log lines like:

```
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(257820416) returning Err
[Task 486] MemoryPool[ExternalSorterMerge[6]].shrink(10485760)
[Task 486] MemoryPool[ExternalSorterMerge[6]].try_grow(68928) returning Ok
```
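For illustration, lines in this format can be matched with a regular expression along the following lines (a sketch; the exact pattern used in #2521 may differ):

```python
import re

# Assumed pattern for the log format above; the regex in #2521 may differ.
LINE_RE = re.compile(
    r"\[Task (?P<task>\d+)\] MemoryPool\[(?P<consumer>.+?)\]"
    r"\.(?P<method>try_grow|grow|shrink)\((?P<size>\d+)\)"
    r"(?: returning (?P<result>Ok|Err))?"
)

line = "[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok"
m = LINE_RE.search(line)
assert m is not None
print(m.group("task"), m.group("consumer"), m.group("method"), m.group("result"))
# -> 486 ExternalSorter[6] try_grow Ok
```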
Proposed Scripts
1. dev/scripts/mem_debug_to_csv.py — Parse logs to CSV
Parses the Spark executor/worker log file, filters by task ID, and tracks cumulative memory allocation per consumer (operator).
Key details from the #2521 implementation (a sketch incorporating them follows this list):
- Uses a regex to parse lines matching `[Task <id>] MemoryPool[<consumer>].<method>(<size>)`
- Tracks a running total per consumer: `grow`/`try_grow` add to the allocation, `shrink` subtracts from it
- For `try_grow` failures (the line contains "Err"), the allocation is not updated, but the row is annotated with an `ERR` label
- Outputs CSV with columns: `name`, `size`, `label`
- Accepts `--task <id>` to filter to a specific Spark task and `--file <path>` for the log file
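A minimal sketch of what this script could look like, folding in the review feedback from the section below (argument handling, output details, and variable names are assumptions, since the original script was never merged):

```python
#!/usr/bin/env python3
"""Sketch of dev/scripts/mem_debug_to_csv.py -- illustrative, not the merged code."""
import argparse
import re

# Assumed pattern for the log format shown above.
LINE_RE = re.compile(
    r"\[Task (?P<task>\d+)\] MemoryPool\[(?P<consumer>.+?)\]"
    r"\.(?P<method>try_grow|grow|shrink)\((?P<size>\d+)\)"
    r"(?: returning (?P<result>Ok|Err))?"
)


def main() -> None:
    parser = argparse.ArgumentParser(description="Convert Comet memory debug logs to CSV")
    # Positional log file and optional --task, per the review feedback below.
    parser.add_argument("file", help="path to the Spark executor/worker log")
    parser.add_argument("--task", type=int, default=None,
                        help="only include events for this Spark task")
    args = parser.parse_args()

    alloc: dict[str, int] = {}  # running total per consumer, in bytes
    print("name,size,label")
    with open(args.file) as f:
        for line in f:
            m = LINE_RE.search(line)
            if m is None:
                continue
            if args.task is not None and int(m.group("task")) != args.task:
                continue
            consumer = m.group("consumer")
            size = int(m.group("size"))
            alloc.setdefault(consumer, 0)  # the first event may be a shrink
            label = ""
            if m.group("method") == "shrink":
                alloc[consumer] -= size
            elif m.group("result") == "Err":
                label = "ERR"  # failed try_grow: annotate but do not update the total
            else:
                alloc[consumer] += size
            # f-string avoids the stray spaces that print(a, ",", b) inserts,
            # and the label variable keeps it to one row per event
            print(f"{consumer},{alloc[consumer]},{label}")


if __name__ == "__main__":
    main()
```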
2. dev/scripts/plot_memory_usage.py — Visualize memory usage
Reads the CSV output and produces a stacked area chart showing memory usage over time by consumer (operator).
Key details from the #2521 implementation (a sketch follows this list):
- Uses pandas and matplotlib
- Creates a time index from row order (each row is one sequential event)
- Pivots the data so each consumer is a column, forward-filling missing values
- Renders a stacked area chart (`plt.stackplot`)
- Annotates `try_grow` failures with red vertical dashed lines labeled "ERR"
- Saves the chart as a PNG (same path as the CSV but with a `_chart.png` suffix)
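Again an illustrative sketch rather than the merged code (column names match the CSV sketch above; chart styling details are assumptions):

```python
#!/usr/bin/env python3
"""Sketch of dev/scripts/plot_memory_usage.py -- illustrative, not the merged code."""
import sys

import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt
import pandas as pd


def main() -> None:
    csv_path = sys.argv[1]
    df = pd.read_csv(csv_path)
    df["time"] = range(len(df))  # row order serves as the time axis

    # One column per consumer; forward-fill so each consumer's last known
    # allocation persists between its events. DataFrame.ffill() replaces
    # the deprecated fillna(method="ffill").
    pivot = df.pivot(index="time", columns="name", values="size")
    pivot = pivot.ffill().fillna(0)

    fig, ax = plt.subplots(figsize=(12, 6))
    ax.stackplot(pivot.index, pivot.T.values, labels=list(pivot.columns))

    # Mark failed try_grow calls with red vertical dashed lines labeled "ERR".
    for t in df.loc[df["label"] == "ERR", "time"]:
        ax.axvline(t, color="red", linestyle="--")
        ax.text(t, ax.get_ylim()[1], "ERR", color="red", ha="center", va="bottom")

    ax.set_xlabel("event")
    ax.set_ylabel("bytes")
    ax.legend(loc="upper left", fontsize="small")
    out_path = csv_path.rsplit(".", 1)[0] + "_chart.png"
    fig.savefig(out_path, bbox_inches="tight")
    print(f"wrote {out_path}")


if __name__ == "__main__":
    main()
```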
Suggestions from PR #2521 Code Review
The following review feedback should be incorporated (the CSV-formatting and label fixes are illustrated after this list):
- Use a `#!/usr/bin/env python3` shebang and make the scripts executable (`chmod +x`)
- Fix CSV formatting — use f-strings (`f"{consumer},{alloc[consumer]}"`) instead of `print(consumer, ",", alloc[consumer])` to avoid extra spaces around values
- Fix ERR label handling — the original implementation printed two rows for the same event on `try_grow` failure (one with the ERR label, one without). Use a label variable so only one row is printed per event
- Handle the first occurrence being `shrink` — the original code assumed the first event for a consumer is always `grow`/`try_grow`, but the first event could be a `shrink`
- Fix the `--task` argument — `int(None)` fails with a `TypeError` when `--task` is not provided; make it optional or a positional argument
- Consider making `--file` a positional argument for simpler CLI usage
- Use `pandas.DataFrame.ffill()` instead of the deprecated `fillna(method='ffill')` (deprecated since pandas 2.1.0)
- Consider logging backtraces — when the backtrace feature is enabled, it could be useful to log backtraces on every call (not just errors) to trace precise allocation origins. This was suggested as an optional `trace!`-level enhancement to the Rust `LoggingPool`
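To make the CSV-formatting and label fixes concrete, here is a small before/after comparison (variable names such as `is_err` are assumptions, not taken from the PR):

```python
consumer = "ExternalSorter[6]"
alloc = {consumer: 256232960}
is_err = True  # assumed flag: this event is a try_grow that returned Err

# Before: stray spaces around the comma, and a failed try_grow produced
# two rows for the same event (one plain, one with ERR)
print(consumer, ",", alloc[consumer])
if is_err:
    print(consumer, ",", alloc[consumer], ",", "ERR")

# After: an f-string plus a label variable yields one clean row per event
label = "ERR" if is_err else ""
print(f"{consumer},{alloc[consumer]},{label}")
```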
Example Workflow
```shell
# Step 1: Run Spark with memory debug logging enabled
spark-submit --conf spark.comet.debug.memory=true ...

# Step 2: Parse the log and generate a CSV for a specific task
python3 dev/scripts/mem_debug_to_csv.py --task 486 /path/to/executor/log > /tmp/mem.csv

# Step 3: Generate a chart
python3 dev/scripts/plot_memory_usage.py /tmp/mem.csv
```

Reference
- PR #2521: chore: Add memory reservation debug logging and visualization
- Example charts from PR #2521 showing stacked memory usage per operator with ERR annotations for failed `try_grow` calls