Open
Description
Problem
When processing multiple sessions or large sessions in ML jobs (e.g., LibreFace), the system runs out of memory (OOM) because memory is not properly released between processing sessions.
Symptoms
- Jobs fail with out-of-memory errors when processing many sessions
- Memory usage continuously increases throughout job execution
- System becomes unresponsive during large batch processing
Root Cause Analysis
The likely cause is a combination of Python and PyTorch memory-management behaviors:
- Objects remain in memory between session processing loops
- GPU memory (CUDA) accumulates without proper cleanup
- Python garbage collection doesn't release memory aggressively enough
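One way to check the first point is to measure how much the Python heap grows after each session. The following is a minimal sketch using the standard-library tracemalloc module. The names process_session and _cache are hypothetical stand-ins that deliberately retain data; they are not taken from libreface_script.py. Note that tracemalloc only sees allocations made through Python's allocator, so GPU memory would have to be checked separately (for example with torch.cuda.memory_allocated()).

```python
import tracemalloc

_cache = []  # hypothetical leak: results kept alive across sessions


def process_session(session_id):
    """Stand-in for real per-session work; deliberately retains data."""
    data = bytearray(1_000_000)  # roughly 1 MB per session
    _cache.append(data)          # the reference that prevents release
    return len(data)


def measure_growth(session_ids):
    """Return the Python heap growth in bytes measured after each session."""
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    growth = []
    for sid in session_ids:
        process_session(sid)
        current, _ = tracemalloc.get_traced_memory()
        growth.append(current - baseline)
    tracemalloc.stop()
    return growth


if __name__ == "__main__":
    for i, g in enumerate(measure_growth(range(3))):
        print(f"after session {i}: +{g / 1e6:.1f} MB")
```

If the growth figure rises steadily from one session to the next, some object is still referenced after the loop iteration ends, and tracemalloc.take_snapshot() can then be used to locate where it was allocated.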
Attempted Solutions (Insufficient)
The following approaches have been tried but don't fully resolve the issue:
In libreface_script.py and similar modules:
import gc
import torch
# Attempted cleanup between sessions
gc.collect()
torch.cuda.empty_cache()
These calls are insufficient to prevent memory accumulation.
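A possible reason these calls have little effect: gc.collect() can only reclaim objects that are no longer referenced, and torch.cuda.empty_cache() only returns cached blocks that no live tensor is using. If a variable from the previous session is still bound when cleanup runs, nothing is freed. The sketch below illustrates this with a weak reference as a liveness probe. SessionResult and still_alive_after_cleanup are hypothetical names for illustration, and the torch import is optional so the example also runs without PyTorch.

```python
import gc
import weakref

try:
    import torch
except ImportError:  # the sketch still runs without PyTorch installed
    torch = None


class SessionResult:
    """Hypothetical stand-in for a large per-session object."""

    def __init__(self, n):
        self.payload = bytearray(n)


def release_memory():
    """Collect unreachable objects, then release unused cached CUDA blocks."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def still_alive_after_cleanup(drop_reference):
    """Return True if the session object survives the cleanup calls."""
    result = SessionResult(1_000_000)
    probe = weakref.ref(result)  # observes the object without keeping it alive
    if drop_reference:
        del result               # drop the last strong reference first
    release_memory()
    return probe() is not None


if __name__ == "__main__":
    print("without del:", still_alive_after_cleanup(False))
    print("with del:   ", still_alive_after_cleanup(True))
```

In the real script, the same reasoning would apply to models, tensors, and intermediate outputs that remain bound in the loop body or in module-level containers when the cleanup calls run.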
Impact
- Large batch jobs fail due to memory exhaustion
- System resources are not efficiently utilized
- Processing capacity is limited by memory leaks rather than actual requirements
Potential Solutions
- Process isolation: Run each session in a separate subprocess that terminates and releases all memory
- Explicit object deletion: Manually delete large objects and call gc.collect() more aggressively
- Memory monitoring: Add memory usage tracking and automatic cleanup triggers
- Batch size limiting: Automatically limit concurrent sessions based on available memory
- Memory mapping: Use memory-mapped files for large datasets instead of loading into RAM
- Framework-specific cleanup: Implement PyTorch/TensorFlow specific memory management patterns
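The process-isolation option above could look roughly like the following sketch, which runs each session in a fresh Python interpreter so that the operating system reclaims all CPU and GPU memory when the process exits. WORKER_CODE, run_session_isolated, and run_all are hypothetical names; in practice the worker would be the existing processing script invoked with a session argument, and the timeout value is an arbitrary assumption.

```python
import json
import subprocess
import sys

# Hypothetical per-session worker, executed in a fresh interpreter.
# A real worker would load the model and process exactly one session.
WORKER_CODE = """
import json, os, sys
session_id = sys.argv[1]
# ... load model, process the session, write outputs to disk ...
print(json.dumps({"session": session_id, "pid": os.getpid()}))
"""


def run_session_isolated(session_id, timeout=600):
    """Run one session in its own process and return its JSON summary."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER_CODE, str(session_id)],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the worker fails
    )
    return json.loads(proc.stdout)


def run_all(session_ids):
    """Process sessions sequentially, one subprocess per session."""
    return [run_session_isolated(sid) for sid in session_ids]


if __name__ == "__main__":
    for summary in run_all(["session_a", "session_b"]):
        print(summary)
```

The trade-off is startup cost: the model is reloaded for every session. If that proves too slow, handling a small fixed number of sessions per subprocess would bound memory growth while amortizing the load time.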
Affected Modules
- LibreFace processing (libreface_script.py)
- Other ML modules with similar memory-intensive processing
- Any module processing multiple sessions in sequence
System Requirements
This issue is more severe on systems with limited RAM, and it limits how well the DISCOVER framework scales to production workloads.