
Improve aggregate efficiency #9

Merged
rmarow merged 14 commits into main from improve-aggregate-efficiency on Sep 10, 2025

Conversation

rmarow (Contributor) commented Sep 9, 2025

This change reads the ingest files in dynamically, so the job no longer fails by running out of memory. Here is an example of the output for the March run:

```
Date range: 2025-03-01 to 2025-03-30
System memory: 14.9GB available

Loading data with memory monitoring...
Found 30 files, total size: 1064.6MB
Available memory: 14.9GB
Strategy: 8 huge files (>50MB), 3 large files (15-50MB), 19 small files (<15MB)
Processing huge file: noaa-metrics-2025-03-19.json (205.3MB)
  Loaded 936,731 rows, memory: 13.7GB available
Processing huge file: noaa-metrics-2025-03-25.json (140.0MB)
  Loaded 613,705 rows, memory: 13.7GB available
Processing huge file: noaa-metrics-2025-03-04.json (99.1MB)
  Loaded 441,177 rows, memory: 13.7GB available
Processing huge file: noaa-metrics-2025-03-18.json (98.9MB)
  Loaded 442,217 rows, memory: 13.6GB available
Processing huge file: noaa-metrics-2025-03-11.json (89.2MB)
  Loaded 396,732 rows, memory: 13.4GB available
Processing huge file: noaa-metrics-2025-03-24.json (82.9MB)
  Loaded 364,585 rows, memory: 13.3GB available
Processing huge file: noaa-metrics-2025-03-13.json (82.4MB)
  Loaded 364,933 rows, memory: 13.2GB available
Processing huge file: noaa-metrics-2025-03-21.json (57.4MB)
  Loaded 251,430 rows, memory: 13.2GB available
Processing 3 large files in pairs...
  Batch: ['noaa-metrics-2025-03-10.json', 'noaa-metrics-2025-03-05.json'] (43.7MB total)
    Combined: 198,236 rows
  Batch: ['noaa-metrics-2025-03-16.json'] (15.0MB total)
    Combined: 9 rows
Processing 19 small files in batches...
  Small batch 1: 486,972 rows
  Small batch 2: 176,662 rows
  Small batch 3: 22,230 rows
Final concatenation of 13 DataFrames...
Memory before final concat: 12.8GB available
MEMORY ERROR in final concatenation!
Falling back to streaming approach...
Using emergency streaming mode...
  noaa-metrics-2025-03-01.json (4.0MB, 12.6GB available)
    Added 17,815 rows (total: 17,815)
  noaa-metrics-2025-03-02.json (1.5MB, 12.6GB available)
    Added 6,881 rows (total: 24,696)
  noaa-metrics-2025-03-03.json (8.5MB, 12.6GB available)
    Added 38,840 rows (total: 63,536)
  noaa-metrics-2025-03-04.json (99.1MB, 12.6GB available)
    Added 441,177 rows (total: 504,713)
  noaa-metrics-2025-03-05.json (16.6MB, 12.1GB available)
    Added 78,175 rows (total: 582,888)
    Cleanup: 12.2GB available
  noaa-metrics-2025-03-06.json (11.6MB, 12.2GB available)
    Added 52,107 rows (total: 634,995)
  noaa-metrics-2025-03-07.json (13.1MB, 12.2GB available)
    Added 58,225 rows (total: 693,220)
  noaa-metrics-2025-03-08.json (2.0MB, 12.2GB available)
    Added 8,798 rows (total: 702,018)
  noaa-metrics-2025-03-09.json (13.6MB, 12.1GB available)
    Added 60,290 rows (total: 762,308)
  noaa-metrics-2025-03-10.json (27.1MB, 12.1GB available)
    Added 120,061 rows (total: 882,369)
    Cleanup: 12.1GB available
  noaa-metrics-2025-03-11.json (89.2MB, 12.1GB available)
    Added 396,732 rows (total: 1,279,101)
  noaa-metrics-2025-03-12.json (14.9MB, 11.9GB available)
    Added 65,467 rows (total: 1,344,568)
  noaa-metrics-2025-03-13.json (82.4MB, 11.8GB available)
    Added 364,933 rows (total: 1,709,501)
  noaa-metrics-2025-03-14.json (5.9MB, 11.7GB available)
    Added 28,435 rows (total: 1,737,936)
  noaa-metrics-2025-03-15.json (1.9MB, 11.8GB available)
    Added 9,412 rows (total: 1,747,348)
    Cleanup: 11.8GB available
  noaa-metrics-2025-03-16.json (15.0MB, 11.8GB available)
    Error: Unable to allocate 95.9 GiB for an array with shape (7368, 1747348) and data type object
  noaa-metrics-2025-03-17.json (14.2MB, 11.8GB available)
    Added 63,431 rows (total: 1,810,779)
  noaa-metrics-2025-03-18.json (98.9MB, 11.8GB available)
    Added 442,217 rows (total: 2,252,996)
  noaa-metrics-2025-03-19.json (205.3MB, 11.5GB available)
    Added 936,731 rows (total: 3,189,727)
  noaa-metrics-2025-03-20.json (10.6MB, 10.7GB available)
    Added 50,981 rows (total: 3,240,708)
  noaa-metrics-2025-03-21.json (57.4MB, 10.8GB available)
    Added 251,430 rows (total: 3,492,138)
    Cleanup: 10.7GB available
  noaa-metrics-2025-03-22.json (13.8MB, 10.7GB available)
    Added 73,147 rows (total: 3,565,285)
  noaa-metrics-2025-03-23.json (2.7MB, 10.8GB available)
    Added 13,091 rows (total: 3,578,376)
  noaa-metrics-2025-03-24.json (82.9MB, 10.8GB available)
    Added 364,585 rows (total: 3,942,961)
  noaa-metrics-2025-03-25.json (140.0MB, 10.7GB available)
    Added 613,705 rows (total: 4,556,666)
  noaa-metrics-2025-03-26.json (14.8MB, 10.5GB available)
    Added 63,324 rows (total: 4,619,990)
    Cleanup: 10.5GB available
  noaa-metrics-2025-03-27.json (2.2MB, 10.5GB available)
    Added 9,745 rows (total: 4,629,735)
  noaa-metrics-2025-03-28.json (1.3MB, 10.5GB available)
    Added 5,937 rows (total: 4,635,672)
  noaa-metrics-2025-03-29.json (7.8MB, 10.5GB available)
    Added 34,160 rows (total: 4,669,832)
  noaa-metrics-2025-03-30.json (6.0MB, 10.5GB available)
    Added 25,778 rows (total: 4,695,610)
Streaming complete: 4,695,610 rows from 29 files

Generating reports...
Preparing output...
Writing CSV report...
CSV report written to: /share/logs/noaa-web/report/noaa-downloads.csv

Sending email...
Email sent successfully!

============================================================
AGGREGATION COMPLETE!
============================================================
Final memory usage: 11.9GB available

real    1m24.272s
user    1m19.025s
sys     0m5.663s
```
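For reference, the size-bucketing step reported in the log ("8 huge files (>50MB), 3 large files (15-50MB), 19 small files (<15MB)") could be sketched as below; the cutoff constants and function name are illustrative, not the PR's actual code:

```python
# Sketch of the size-bucketing strategy from the log above. Cutoffs and
# names are assumptions; the script's real constants may differ.
HUGE_MB = 50    # files above this are loaded one at a time
LARGE_MB = 15   # files at or above this are loaded in small batches

def classify_by_size(sizes_mb: dict[str, float]) -> dict[str, list[str]]:
    """Bucket files into huge/large/small groups by their size in MB."""
    groups = {"huge": [], "large": [], "small": []}
    for name, mb in sizes_mb.items():
        if mb > HUGE_MB:
            groups["huge"].append(name)
        elif mb >= LARGE_MB:
            groups["large"].append(name)
        else:
            groups["small"].append(name)
    return groups
```

Huge files are then processed individually, large files in pairs, and small files in larger batches, which is what keeps peak memory bounded.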

@rmarow rmarow requested a review from sc0tts September 9, 2025 23:17
@rmarow rmarow marked this pull request as ready for review September 9, 2025 23:23
```python
def get_file_size_mb(filepath: Path) -> float:
    """Get file size in MB."""
    try:
        return os.path.getsize(filepath) / (1024 * 1024)
```
Collaborator:

In get_available_memory_gb(), the denominator is 1024^3.
For "similarity", I'd suggest either making this denominator 1024^2 or the other one 1024*1024*1024.
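One way to act on this suggestion is to name the conversion factors once; this is a hypothetical refactor (the constant names are mine, and the PR itself uses inline literals):

```python
import os
from pathlib import Path

# Named constants keep the two unit conversions visibly consistent.
BYTES_PER_MB = 1024 * 1024
BYTES_PER_GB = 1024 * 1024 * 1024  # would be used by get_available_memory_gb()

def get_file_size_mb(filepath: Path) -> float:
    """Get file size in MB, or 0.0 if the file cannot be stat'd."""
    try:
        return os.path.getsize(filepath) / BYTES_PER_MB
    except OSError:
        return 0.0
```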

"""Safely read a JSON file and return DataFrame."""
try:
if os.path.getsize(filepath) > 2:
return pd.read_json(filepath)
Collaborator:

Is there a reason not to use read_json_file_safe() here?
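For context, a read_json_file_safe() in the spirit of the snippet above might look like this; it is a hypothetical reconstruction, since the real helper's body is not shown in this diff:

```python
import os
import pandas as pd

def read_json_file_safe(filepath) -> pd.DataFrame:
    """Safely read a JSON file and return a DataFrame.

    Returns an empty DataFrame for empty or unreadable files instead of
    raising (sketch of the helper named in the review comment).
    """
    try:
        # A file holding only "[]" or "{}" is 2 bytes; skip those too.
        if os.path.getsize(filepath) > 2:
            return pd.read_json(filepath)
    except (OSError, ValueError):
        pass
    return pd.DataFrame()
```

Routing every read through one helper like this is what the reviewer is asking for: a single place where empty and corrupt files are handled.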

```python
        continue

    try:
        df = pd.read_json(filepath)
```
Collaborator:

use safe read method from above?

```python
    print(f"  {filepath.name} ({size_mb:.1f}MB, {available_gb:.1f}GB available)")

    # Skip files that are too large for current memory
    if size_mb > available_gb * 300:  # Conservative threshold
```
Collaborator:

Maybe define 300 as a constant.
And then in the print statement, include the size_mb and the 300*available_gb to give an indication of the "scale" of the issue that occurred here.
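One way to apply both suggestions; the constant name and message wording here are mine, not the PR's:

```python
# Hypothetical refactor: name the threshold, and report the file's size
# against the computed limit so the scale of the overflow is visible.
MAX_FILE_MB_PER_AVAILABLE_GB = 300  # conservative skip threshold

def should_skip_file(size_mb: float, available_gb: float) -> bool:
    """Return True when a file is too large for the current free memory."""
    limit_mb = available_gb * MAX_FILE_MB_PER_AVAILABLE_GB
    if size_mb > limit_mb:
        print(f"  Skipping: {size_mb:.1f}MB exceeds limit of {limit_mb:.1f}MB")
        return True
    return False
```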

sc0tts (Collaborator) left a comment:

Made lots of small suggestions, but they are not blocking.
Very nice summary stats!

@rmarow rmarow merged commit 570c1cc into main Sep 10, 2025
1 check passed
@rmarow rmarow deleted the improve-aggregate-efficiency branch September 10, 2025 20:42