
Add Python scripts for analyzing memory debug logs #3490

@andygrove

Background

PR #2521 added memory reservation debug logging (spark.comet.debug.memory config and LoggingPool wrapper). That PR also contained Python scripts for parsing and visualizing the memory debug logs, but those scripts were not merged. This issue tracks adding analysis/visualization scripts as a follow-up.

Log Format

When spark.comet.debug.memory=true is set, the LoggingPool produces log lines like:

[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(257820416) returning Err
[Task 486] MemoryPool[ExternalSorterMerge[6]].shrink(10485760)
[Task 486] MemoryPool[ExternalSorterMerge[6]].try_grow(68928) returning Ok
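
Every field (task id, consumer name, method, size, and result) is recoverable with a single regular expression. The pattern below is inferred from these example lines, not taken from the PR:

import re

line = "[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok"
m = re.search(r"\[Task (\d+)\] MemoryPool\[(.+)\]\.(\w+)\((\d+)\)\s*(.*)", line)
print(m.groups())
# ('486', 'ExternalSorter[6]', 'try_grow', '256232960', 'returning Ok')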

Proposed Scripts

1. dev/scripts/mem_debug_to_csv.py — Parse logs to CSV

Parses the Spark executor/worker log file, filters by task ID, and tracks cumulative memory allocation per consumer (operator).

Key details from the #2521 implementation:

  • Uses regex to parse lines matching [Task <id>] MemoryPool[<consumer>].<method>(<size>)
  • Tracks running total per consumer: grow/try_grow add to allocation, shrink subtracts
  • For try_grow failures (line contains "Err"), the allocation is not updated but the row is annotated with an ERR label
  • Outputs CSV with columns: name, size, label
  • Accepts --task <id> to filter to a specific Spark task and --file <path> for the log file
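
A minimal sketch of what this script might look like, folding in the review fixes listed further below. The argument handling, CSV header row, and regex are assumptions, not the merged implementation:

#!/usr/bin/env python3
# Sketch of dev/scripts/mem_debug_to_csv.py -- illustrative only.
import argparse
import re

# Matches e.g. "[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok"
PATTERN = re.compile(r"\[Task (\d+)\] MemoryPool\[(.+)\]\.(\w+)\((\d+)\)\s*(.*)")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("file", help="path to the Spark executor/worker log")
    parser.add_argument("--task", help="filter to a single Spark task id")
    args = parser.parse_args()

    alloc = {}  # running total of reserved bytes per consumer
    print("name,size,label")
    with open(args.file) as f:
        for line in f:
            m = PATTERN.search(line)
            if m is None:
                continue
            task, consumer, method, size, tail = m.groups()
            if args.task is not None and task != args.task:
                continue
            label = ""
            if "Err" in tail:
                label = "ERR"  # failed try_grow: annotate the row, keep the total
            elif method in ("grow", "try_grow"):
                alloc[consumer] = alloc.get(consumer, 0) + int(size)
            elif method == "shrink":
                # .get() handles a consumer whose first event is a shrink
                alloc[consumer] = alloc.get(consumer, 0) - int(size)
            print(f"{consumer},{alloc.get(consumer, 0)},{label}")

if __name__ == "__main__":
    main()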

2. dev/scripts/plot_memory_usage.py — Visualize memory usage

Reads the CSV output and produces a stacked area chart showing memory usage over time by consumer (operator).

Key details from the #2521 implementation:

  • Uses pandas and matplotlib
  • Creates a time index from row order (each row = sequential event)
  • Pivots data so each consumer is a column, forward-fills missing values
  • Renders a stacked area chart (plt.stackplot)
  • Annotates try_grow failures with red vertical dashed lines labeled "ERR"
  • Saves chart as PNG (same path as CSV but with _chart.png suffix)
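
A matching sketch for the plotting side, assuming the CSV header from the sketch above; the pandas and matplotlib calls are standard, everything else is illustrative:

#!/usr/bin/env python3
# Sketch of dev/scripts/plot_memory_usage.py -- illustrative only.
import sys

import matplotlib.pyplot as plt
import pandas as pd

def main(csv_path):
    df = pd.read_csv(csv_path)
    df["event"] = range(len(df))  # row order doubles as the time axis

    # one column per consumer, forward-filled so totals persist between events
    pivot = df.pivot(index="event", columns="name", values="size")
    pivot = pivot.ffill().fillna(0)  # ffill(), not the deprecated fillna(method=...)

    fig, ax = plt.subplots(figsize=(12, 6))
    ax.stackplot(pivot.index, pivot.T.values, labels=pivot.columns)

    # red dashed vertical line at each failed try_grow
    err_events = df.loc[df["label"] == "ERR", "event"]
    for i, event in enumerate(err_events):
        ax.axvline(event, color="red", linestyle="--",
                   label="ERR" if i == 0 else None)

    ax.set_xlabel("event")
    ax.set_ylabel("bytes reserved")
    ax.legend(loc="upper left")
    fig.savefig(csv_path.replace(".csv", "_chart.png"))

if __name__ == "__main__":
    main(sys.argv[1])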

Suggestions from PR #2521 Code Review

The following review feedback should be incorporated:

  1. Use #!/usr/bin/env python3 shebang and make scripts executable (chmod +x)
  2. Fix CSV formatting — use f-strings (f"{consumer},{alloc[consumer]}") instead of print(consumer, ",", alloc[consumer]) to avoid extra spaces around values
  3. Fix ERR label handling — the original implementation printed two rows for the same event on try_grow failure (one with ERR label, one without). Use a label variable so only one row is printed per event
  4. Handle first occurrence being shrink — the original code assumed the first event for a consumer is always grow/try_grow, but the first event could be a shrink
  5. Fix --task argument — int(None) fails with TypeError when --task is not provided; make it optional or a positional arg
  6. Consider making --file a positional argument for simpler CLI usage
  7. Use pandas.DataFrame.ffill() instead of fillna(method='ffill'), which has been deprecated since pandas 2.1.0
  8. Consider logging backtraces — when the backtrace feature is enabled, it could be useful to log backtraces on every call (not just errors) to trace precise allocation origins. This was suggested as an optional trace!-level enhancement to the Rust LoggingPool
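
Items 2 and 3 are easiest to see in a tiny standalone demo (all values below are made up):

# Demo of review items 2 and 3 (hypothetical values)
consumer = "ExternalSorter[6]"
alloc = {consumer: 256232960}

# Item 2: comma-separated print() inserts spaces around each argument...
print(consumer, ",", alloc[consumer])   # ExternalSorter[6] , 256232960
# ...while an f-string yields a clean CSV row:
print(f"{consumer},{alloc[consumer]}")  # ExternalSorter[6],256232960

# Item 3: compute the label once so each event emits exactly one row
failed = True  # e.g. a try_grow that returned Err
label = "ERR" if failed else ""
print(f"{consumer},{alloc[consumer]},{label}")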

Example Workflow

# Step 1: Run Spark with memory debug logging enabled
spark-submit --conf spark.comet.debug.memory=true ...

# Step 2: Parse the log and generate CSV for a specific task
python3 dev/scripts/mem_debug_to_csv.py --task 486 /path/to/executor/log > /tmp/mem.csv

# Step 3: Generate a chart
python3 dev/scripts/plot_memory_usage.py /tmp/mem.csv
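
With the suffix convention described earlier, step 3 writes the chart to /tmp/mem_chart.png alongside the CSV.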

Reference

  • PR #2521 — memory reservation debug logging (spark.comet.debug.memory and the LoggingPool wrapper)
