Memory not released between sessions causing OOM in ML processing jobs #10

@saveli

Problem

When processing multiple sessions or large sessions in ML jobs (e.g., LibreFace), the system runs out of memory (OOM) because memory is not properly released between processing sessions.

Symptoms

  • Jobs fail with out-of-memory errors when processing many sessions
  • Memory usage continuously increases throughout job execution
  • System becomes unresponsive during large batch processing
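The second symptom can be confirmed by logging how much memory each session leaves behind. The following is a minimal sketch, not code from the repository: `track_growth` and `process_one` are hypothetical names, and the default reader uses `tracemalloc`, which only sees allocations made through Python's allocator. To measure GPU memory, a reader based on `torch.cuda.memory_allocated()` could be passed in instead.

```python
import tracemalloc
from typing import Callable, Iterable, List, Tuple


def python_heap_bytes() -> int:
    """Bytes currently allocated through Python's allocator (per tracemalloc)."""
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    current, _peak = tracemalloc.get_traced_memory()
    return current


def track_growth(
    session_ids: Iterable[str],
    process_one: Callable[[str], object],
    read_bytes: Callable[[], int] = python_heap_bytes,
) -> List[Tuple[str, int]]:
    """Return (session_id, bytes still held after the session) for each session.

    process_one is a hypothetical stand-in for the per-session ML call.
    """
    growth = []
    for session_id in session_ids:
        before = read_bytes()
        process_one(session_id)
        growth.append((session_id, read_bytes() - before))
    return growth
```

If the reported growth stays clearly positive session after session, memory is being retained between sessions rather than merely peaking within one.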

Root Cause Analysis

The issue appears to stem from how Python and PyTorch manage memory:

  1. Objects remain in memory between session processing loops
  2. GPU memory (CUDA) accumulates without proper cleanup
  3. Python garbage collection doesn't release memory aggressively enough
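The first cause can be illustrated in isolation. This is a toy sketch, assuming the leak comes from references that outlive a session (for example, results appended to a long-lived list); `SessionResult` is a hypothetical stand-in for a large per-session object. It shows that `gc.collect()` cannot free an object that is still reachable.

```python
import gc
import weakref


class SessionResult:
    """Hypothetical stand-in for a large per-session object."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id


def is_alive_after_gc(keep_reference: bool) -> bool:
    """Report whether a session object survives gc.collect()."""
    results = []  # plays the role of a container that outlives the session
    obj = SessionResult("s1")
    probe = weakref.ref(obj)  # lets us observe the object without keeping it alive
    if keep_reference:
        results.append(obj)
    del obj
    gc.collect()
    return probe() is not None
```

With `keep_reference=True` the object survives collection because `results` still points to it; with `keep_reference=False` it is freed as soon as the last reference is dropped.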

Attempted Solutions (Insufficient)

The following approaches have been tried but don't fully resolve the issue:

In libreface_script.py and similar modules:

import gc
import torch

# Attempted cleanup between sessions
gc.collect()
torch.cuda.empty_cache()

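These calls alone do not prevent memory accumulation: gc.collect() only reclaims objects that are no longer referenced, and torch.cuda.empty_cache() only returns cached GPU blocks that no live tensor is still using.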
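One pattern that may make these calls more effective is to keep all large temporaries inside a per-session function, so their references disappear when the function returns, and only then run the cleanup. A minimal sketch under that assumption follows; `process_all` and `process_one` are hypothetical names, not functions from libreface_script.py, and the `torch` import is optional so the sketch also runs on machines without PyTorch.

```python
import gc
from typing import Callable, Dict, Iterable

try:
    import torch  # optional: the sketch degrades to CPU-only cleanup without it
except ImportError:
    torch = None


def release_cached_memory() -> None:
    """Collect unreachable objects, then return unused cached GPU blocks."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def process_all(
    session_ids: Iterable[str],
    process_one: Callable[[str], object],
) -> Dict[str, object]:
    """Run sessions sequentially, keeping only a small summary per session.

    process_one is hypothetical; it should return something small (paths,
    metrics) rather than tensors or frames, so nothing large stays referenced.
    """
    summaries = {}
    for session_id in session_ids:
        # Large temporaries live in process_one's frame and are dropped on return.
        summaries[session_id] = process_one(session_id)
        release_cached_memory()
    return summaries
```

This does not help if the model or a library keeps its own internal references, which is one reason process isolation may still be needed.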

Impact

  • Large batch jobs fail due to memory exhaustion
  • System resources are not efficiently utilized
  • Processing capacity is limited by leaked memory rather than by the actual memory requirements of a session

Potential Solutions

  1. Process isolation: Run each session in a separate subprocess that terminates and releases all memory
  2. Explicit object deletion: Manually delete large objects and call gc.collect() more aggressively
  3. Memory monitoring: Add memory usage tracking and automatic cleanup triggers
  4. Batch size limiting: Automatically limit concurrent sessions based on available memory
  5. Memory mapping: Use memory-mapped files for large datasets instead of loading into RAM
  6. Framework-specific cleanup: Implement PyTorch/TensorFlow specific memory management patterns
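Option 1 is the most robust of these, because the operating system reclaims all host and GPU memory when a process exits, regardless of lingering references. A minimal sketch, assuming each session can be processed by a standalone command: `build_command` is a hypothetical callable that returns the argument list for one session, since this issue does not specify how libreface_script.py is invoked.

```python
import subprocess
import sys
from typing import Callable, Iterable, List, Optional, Sequence


def run_session_isolated(
    command: Sequence[str],
    timeout_s: Optional[float] = None,
) -> subprocess.CompletedProcess:
    """Run one session in a child process and wait for it to exit.

    All memory the child allocated is released by the OS when it terminates.
    """
    return subprocess.run(
        list(command),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,  # failures are reported through returncode instead
    )


def process_sessions(
    session_ids: Iterable[str],
    build_command: Callable[[str], Sequence[str]],
) -> List[str]:
    """Process sessions one at a time; return the ids whose child failed."""
    failed = []
    for session_id in session_ids:
        result = run_session_isolated(build_command(session_id))
        if result.returncode != 0:
            failed.append(session_id)
    return failed
```

The trade-off is that each child pays the model-loading cost again, so this fits best when sessions are long relative to startup time. `sys.executable` can be used inside `build_command` to launch the same Python interpreter as the parent.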

Affected Modules

  • LibreFace processing (libreface_script.py)
  • Other ML modules with similar memory-intensive processing
  • Any module processing multiple sessions in sequence

System Requirements

This issue is more severe on systems with limited RAM and affects scalability of the DISCOVER framework for production workloads.
