Conversation
Pull Request Overview
This PR improves the efficiency of log ingestion by optimizing date filtering and DNS lookup operations. The changes move date filtering to an earlier stage and implement batch DNS lookups with caching to dramatically reduce processing time.
Key changes:
- Implemented early date filtering using direct string parsing before creating data structures
- Added batch DNS lookups with threading and LRU caching to reduce network I/O bottlenecks
- Restructured the processing pipeline to filter data earlier and reuse DNS lookup results (sketched below)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
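None of the restructured pipeline itself is quoted in this overview, so here is a minimal sketch of the ordering the bullet points describe: filter raw lines by date first, resolve each unique IP once, then build records. The module path in the import, the "first field is the client IP" assumption, and the record shape are illustrative guesses; `log_line_in_date_range` and `batch_dns_lookups` are the two functions reviewed in the inline comments below.

```python
import datetime as dt
from typing import Dict, List

# Module path is an assumption for illustration; these are the two functions
# reviewed in the inline comments below.
from noaa_metrics.ingest import batch_dns_lookups, log_line_in_date_range


def ingest_lines(raw_lines: List[str], start: dt.date, end: dt.date) -> List[dict]:
    # 1. Cheap date filter on the raw strings, before building any data structures.
    in_range = [ln for ln in raw_lines if log_line_in_date_range(ln, start, end)]

    # 2. Collect unique client IPs so each address is resolved at most once
    #    (assumes the IP is the first whitespace-separated field, as in a
    #    standard nginx access log).
    unique_ips = {ln.split(" ", 1)[0] for ln in in_range}

    # 3. One batched, threaded DNS pass; the results are reused for every record.
    hostnames: Dict[str, str] = batch_dns_lookups(unique_ips)

    # 4. Only now build the per-record structures (record shape is illustrative).
    records = []
    for ln in in_range:
        ip = ln.split(" ", 1)[0]
        records.append({"raw": ln, "ip": ip, "host": hostnames.get(ip, "")})
    return records
```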
```python
def log_line_in_date_range(log_line: str, start_date: dt.date, end_date: dt.date) -> bool:
    """
    date filtering - parse date directly from string without splits.
    Uses direct substring indexing for maximum performance.
    """
```
I suggest slightly more detail here, something like:
This routine parses only the datetime section of the log_line. This saves significant processing time because the entire log_line can be ignored if it is not in the relevant date range.
I think this is a HUGE memory and cpu-time saver!
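The function body isn't quoted in this thread, so here is a minimal sketch of what the "direct substring indexing" approach described in the docstring could look like, assuming nginx's default `[dd/Mon/yyyy:HH:MM:SS +zzzz]` time_local format. The bracket-relative offsets and the month table are assumptions for illustration, not the PR's actual code.

```python
import datetime as dt

# Month abbreviations as they appear in nginx's default time_local format.
_MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def log_line_in_date_range(log_line: str, start_date: dt.date, end_date: dt.date) -> bool:
    """Parse only the [dd/Mon/yyyy:...] timestamp and test it against the range."""
    # Locate the opening bracket of the timestamp, then slice fixed offsets
    # within it instead of splitting or fully parsing the line.
    i = log_line.find("[")
    if i < 0:
        return False
    day = int(log_line[i + 1:i + 3])
    month = _MONTHS[log_line[i + 4:i + 7]]
    year = int(log_line[i + 8:i + 12])
    return start_date <= dt.date(year, month, day) <= end_date
```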
```python
    return COUNTRY_CODES[""]


def batch_dns_lookups(ip_addresses: Set[str]) -> Dict[str, str]:
```
I'd suggest a varname for "ip_addresses" that indicates that this is already a list of unique values.
Something like...unique_ip_addresses or ip_address_set
I was concerned about a race condition with the Threadpool invocation until I saw that there is a "set()" operation on this list before it gets to this point.
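The body of this function isn't quoted here either; a minimal sketch of threaded reverse lookups with an LRU cache might look like the following, assuming `socket.gethostbyaddr` for the reverse lookup, a worker count of 32, and an empty string for failed lookups — all assumptions, not the PR's exact code.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Dict, Set


@lru_cache(maxsize=None)
def _reverse_lookup(ip_address: str) -> str:
    """Resolve one IP to a hostname; cache results so repeat IPs cost nothing."""
    try:
        return socket.gethostbyaddr(ip_address)[0]
    except OSError:
        return ""  # Unresolvable addresses map to an empty hostname.


def batch_dns_lookups(ip_addresses: Set[str]) -> Dict[str, str]:
    """Resolve a set of unique IPs concurrently and return {ip: hostname}."""
    ips = list(ip_addresses)
    # Each worker only reads its own ip string and returns a value, so there is
    # no shared mutable state to race on (the set was deduplicated upstream).
    with ThreadPoolExecutor(max_workers=32) as pool:
        hostnames = list(pool.map(_reverse_lookup, ips))
    return dict(zip(ips, hostnames))
```

With the cache on the per-IP helper, repeated ingest runs in the same process (or overlapping IP sets) skip the network entirely.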
sc0tts left a comment
I didn't install and run the code, but the two big improvements I saw here are:
- batching the ip address lookups
- pre-selecting log entries by date...instead of processing every..single..entry.....and then selecting by date.
Made a couple of optional suggestions re a comment and a variable name, but those aren't blocking.
Approve!
To test this I spun up a dev VM (https://github.com/nsidc/noaadata-vm). Then I activated the conda environment with `. activate noaadata` (may have to create the environment if you are in a new environment). I edited line 36 from `with open(NGINX_DOWNLOAD_LOG_FILE) as f:` to `with open("/share/logs/noaa-web-all/production/download.log") as f:`, then ran `time PYTHONPATH=. python noaa_metrics/cli.py ingest -s 2025-09-04 -e 2025-09-04` on this branch and again on the `main` branch. I saw a huge improvement, down from about 8 minutes to 3 minutes.