Conversation
Pull Request Overview
This PR improves the efficiency of log ingestion by optimizing date filtering and DNS lookup operations. The changes move date filtering to an earlier stage and implement batch DNS lookups with caching to dramatically reduce processing time.
Key changes:
- Implemented early date filtering using direct string parsing before creating data structures
- Added batch DNS lookups with threading and LRU caching to reduce network I/O bottlenecks
- Restructured the processing pipeline to filter data earlier and reuse DNS lookup results (sketched below)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
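None of the restructured pipeline itself is quoted in this overview, so here is a minimal sketch of the ordering the bullet points describe: filter raw lines by date first, resolve each unique IP once, then build records. The module path in the import, the "first field is the client IP" assumption, and the record shape are illustrative guesses; `log_line_in_date_range` and `batch_dns_lookups` are the two functions reviewed in the inline comments below.

```python
import datetime as dt
from typing import Dict, List

# Module path is an assumption for illustration; these are the two functions
# reviewed in the inline comments below.
from noaa_metrics.ingest import batch_dns_lookups, log_line_in_date_range


def ingest_lines(raw_lines: List[str], start: dt.date, end: dt.date) -> List[dict]:
    # 1. Cheap date filter on the raw strings, before building any data structures.
    in_range = [ln for ln in raw_lines if log_line_in_date_range(ln, start, end)]

    # 2. Collect unique client IPs so each address is resolved at most once
    #    (assumes the IP is the first whitespace-separated field, as in a
    #    standard nginx access log).
    unique_ips = {ln.split(" ", 1)[0] for ln in in_range}

    # 3. One batched, threaded DNS pass; the results are reused for every record.
    hostnames: Dict[str, str] = batch_dns_lookups(unique_ips)

    # 4. Only now build the per-record structures (record shape is illustrative).
    records = []
    for ln in in_range:
        ip = ln.split(" ", 1)[0]
        records.append({"raw": ln, "ip": ip, "host": hostnames.get(ip, "")})
    return records
```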
```python
def log_line_in_date_range(log_line: str, start_date: dt.date, end_date: dt.date) -> bool:
    """
    date filtering - parse date directly from string without splits.
    Uses direct substring indexing for maximum performance.
    """
```
I suggest slightly more detail here, something like:
This routine parses only the datetime section of the log_line. This saves significant processing time because the entire log_line can be ignored if it is not in the relevant date range.
I think this is a HUGE memory and cpu-time saver!
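The function body isn't quoted in this thread, so here is a minimal sketch of what the "direct substring indexing" approach described in the docstring could look like, assuming nginx's default `[dd/Mon/yyyy:HH:MM:SS +zzzz]` time_local format. The bracket-relative offsets and the month table are assumptions for illustration, not the PR's actual code.

```python
import datetime as dt

# Month abbreviations as they appear in nginx's default time_local format.
_MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def log_line_in_date_range(log_line: str, start_date: dt.date, end_date: dt.date) -> bool:
    """Parse only the [dd/Mon/yyyy:...] timestamp and test it against the range."""
    # Locate the opening bracket of the timestamp, then slice fixed offsets
    # within it instead of splitting or fully parsing the line.
    i = log_line.find("[")
    if i < 0:
        return False
    day = int(log_line[i + 1:i + 3])
    month = _MONTHS[log_line[i + 4:i + 7]]
    year = int(log_line[i + 8:i + 12])
    return start_date <= dt.date(year, month, day) <= end_date
```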
```python
    return COUNTRY_CODES[""]


def batch_dns_lookups(ip_addresses: Set[str]) -> Dict[str, str]:
```
I'd suggest a varname for "ip_addresses" that indicates that this is already a list of unique values.
Something like...unique_ip_addresses or ip_address_set
I was concerned about a race condition with the Threadpool invocation until I saw that there is a "set()" operation on this list before it gets to this point.
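The body of this function isn't quoted here either; a minimal sketch of threaded reverse lookups with an LRU cache might look like the following, assuming `socket.gethostbyaddr` for the reverse lookup, a worker count of 32, and an empty string for failed lookups — all assumptions, not the PR's exact code.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Dict, Set


@lru_cache(maxsize=None)
def _reverse_lookup(ip_address: str) -> str:
    """Resolve one IP to a hostname; cache results so repeat IPs cost nothing."""
    try:
        return socket.gethostbyaddr(ip_address)[0]
    except OSError:
        return ""  # Unresolvable addresses map to an empty hostname.


def batch_dns_lookups(ip_addresses: Set[str]) -> Dict[str, str]:
    """Resolve a set of unique IPs concurrently and return {ip: hostname}."""
    ips = list(ip_addresses)
    # Each worker only reads its own ip string and returns a value, so there is
    # no shared mutable state to race on (the set was deduplicated upstream).
    with ThreadPoolExecutor(max_workers=32) as pool:
        hostnames = list(pool.map(_reverse_lookup, ips))
    return dict(zip(ips, hostnames))
```

With the cache on the per-IP helper, repeated ingest runs in the same process (or overlapping IP sets) skip the network entirely.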
sc0tts left a comment
I didn't install and run the code, but the two big improvements I saw here are:
- batching the ip address lookups
- pre-selecting log entries by date...instead of processing every..single..entry.....and then selecting by date.
Made a couple of optional suggestions re a comment and a variable name, but those aren't blocking.
Approve!
To test this I spun up a dev VM (https://github.com/nsidc/noaadata-vm). Then I activated the conda environment with `. activate noaadata` (may have to create the environment if you are in a new environment). I edited line 36 from `with open(NGINX_DOWNLOAD_LOG_FILE) as f:` to `with open("/share/logs/noaa-web-all/production/download.log") as f:`, then ran `time PYTHONPATH=. python noaa_metrics/cli.py ingest -s 2025-09-04 -e 2025-09-04` on this branch and again on the `main` branch. I saw a huge improvement, down from about 8 minutes to 3 minutes.