
Improve efficiency on ingest #8

Merged
rmarow merged 12 commits into main from improve-efficiency on Sep 9, 2025
Conversation


@rmarow rmarow commented Sep 5, 2025

To test this I spun up a dev VM (https://github.com/nsidc/noaadata-vm).

Then I activated the conda environment with `. activate noaadata` (you may have to create the environment first if you are starting from a fresh one).

I edited line 36, changing `with open(NGINX_DOWNLOAD_LOG_FILE) as f:` to `with open("/share/logs/noaa-web-all/production/download.log") as f:`, then ran `time PYTHONPATH=. python noaa_metrics/cli.py ingest -s 2025-09-04 -e 2025-09-04` on this branch and again on main. I saw a huge improvement: runtime dropped from about 8 minutes to 3 minutes.

@rmarow rmarow requested a review from Copilot September 5, 2025 23:56
Copilot AI left a comment


Pull Request Overview

This PR improves the efficiency of log ingestion by optimizing date filtering and DNS lookup operations. The changes move date filtering to an earlier stage and implement batch DNS lookups with caching to dramatically reduce processing time.

Key changes:

  • Implemented early date filtering using direct string parsing before creating data structures
  • Added batch DNS lookups with threading and LRU caching to reduce network I/O bottlenecks
  • Restructured the processing pipeline to filter data earlier and reuse DNS lookup results (see the sketch after this list)
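
To make the shape of the change concrete, here is a rough sketch of the restructured flow. It uses the two helpers shown in the review threads below; the `ingest_log` name, the record shape, and the IP-extraction logic are assumptions for illustration, not the PR's actual code:

```python
import datetime as dt
from typing import Dict, List

def ingest_log(log_path: str, start_date: dt.date, end_date: dt.date) -> List[dict]:
    with open(log_path) as f:
        # 1. Early date filter: out-of-range lines are discarded before any
        #    per-line data structures are built.
        kept = [ln for ln in f if log_line_in_date_range(ln, start_date, end_date)]
    # 2. Deduplicate client IPs (nginx access logs lead with the IP), then
    #    resolve them all in one concurrent batch.
    unique_ips = {ln.split(" ", 1)[0] for ln in kept}
    hostnames: Dict[str, str] = batch_dns_lookups(unique_ips)
    # 3. Build full records only for the surviving lines, reusing the
    #    already-resolved hostnames.
    return [{"hostname": hostnames[ln.split(" ", 1)[0]], "line": ln} for ln in kept]
```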


rmarow and others added 2 commits September 5, 2025 18:07
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@rmarow rmarow marked this pull request as ready for review September 6, 2025 00:10
@rmarow rmarow requested a review from sc0tts September 6, 2025 00:16
def log_line_in_date_range(log_line: str, start_date: dt.date, end_date: dt.date) -> bool:
    """
    date filtering - parse date directly from string without splits.
    Uses direct substring indexing for maximum performance.
A Collaborator left a comment

I suggest slightly more detail here, something like:

This routine parses only the datetime section of the log_line. This saves significant processing time because the entire log_line can be ignored if it is not in the relevant date range.

I think this is a HUGE memory and cpu-time saver!
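
The function body isn't shown in this excerpt. As an illustration only, a minimal sketch of the direct-indexing approach the docstring describes, assuming the standard nginx timestamp layout (e.g. `[04/Sep/2025:12:34:56 +0000]`); the PR's actual implementation may differ:

```python
import datetime as dt

# Month abbreviations as they appear in nginx timestamps.
_MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
           "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

def log_line_in_date_range(log_line: str, start_date: dt.date, end_date: dt.date) -> bool:
    """Return True if the line's timestamp falls within [start_date, end_date]."""
    # The timestamp begins right after the first '[', e.g. [04/Sep/2025:12:34:56 +0000].
    i = log_line.find("[")
    if i == -1:
        return False
    month = _MONTHS.get(log_line[i + 4 : i + 7])
    if month is None:
        return False  # malformed line; skip it rather than crash
    day = int(log_line[i + 1 : i + 3])
    year = int(log_line[i + 8 : i + 12])
    return start_date <= dt.date(year, month, day) <= end_date
```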

    return COUNTRY_CODES[""]


def batch_dns_lookups(ip_addresses: Set[str]) -> Dict[str, str]:
A Collaborator left a comment

I'd suggest a varname for "ip_addresses" that indicates that this is already a list of unique values.

Something like...unique_ip_addresses or ip_address_set

I was concerned about a race condition with the Threadpool invocation until I saw that there is a "set()" operation on this list before it gets to this point.
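
For illustration, a minimal sketch of what a threaded, cached batch lookup with this signature could look like; the `_reverse_lookup` helper and the worker count are assumptions, not the PR's actual code:

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Dict, Set

@lru_cache(maxsize=None)
def _reverse_lookup(ip_address: str) -> str:  # hypothetical helper
    """Resolve one IP to a hostname; fall back to the IP itself on failure."""
    try:
        return socket.gethostbyaddr(ip_address)[0]
    except OSError:
        return ip_address

def batch_dns_lookups(ip_addresses: Set[str]) -> Dict[str, str]:
    """Resolve a set of unique IPs concurrently, returning {ip: hostname}."""
    ips = list(ip_addresses)  # fix an iteration order so results zip back cleanly
    with ThreadPoolExecutor(max_workers=32) as pool:
        hostnames = pool.map(_reverse_lookup, ips)
    return dict(zip(ips, hostnames))
```

The `lru_cache` keeps repeat IPs from ever hitting the network twice within a run, and the thread pool overlaps the lookups, which are almost entirely network-bound wait time.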

@sc0tts sc0tts left a comment


I didn't install and run the code, but the two big improvements I saw here are:

  • batching the ip address lookups
  • pre-selecting log entries by date...instead of processing every..single..entry.....and then selecting by date.

Made a couple of optional suggestions re a comment and a variable name, but those aren't blocking.

Approve!

@rmarow rmarow merged commit 34458de into main Sep 9, 2025
1 check passed
@rmarow rmarow deleted the improve-efficiency branch September 9, 2025 01:29