A high-performance command-line tool for correlating user activity from web server logs. It identifies users (by IP) who reached a specific "final" event and looks back through their history to find "earlier" trigger events that occurred before the final event.
This tool is built with performance as a primary requirement:
- Parallel Processing: All file I/O, parsing, and correlation are parallelized using `rayon`
- Optimized Hashing: Uses `ahash::AHashMap` for significantly faster hash operations
- Smart Matching: Defaults to hyper-fast literal string matching using an Aho-Corasick automaton; regex is used only when explicitly requested
This is a general-purpose auditing and enumeration tool. Example use cases:
- Marketing Attribution: "Of the users who reached checkout, which ones arrived with specific UTM parameters?"
- Security Analysis: "Which IPs that accessed admin pages also triggered security alerts?"
- User Journey Analysis: "Which users who completed signup had previously visited specific landing pages?"
```shell
cargo build --release
```

The compiled binary will be in `target/release/web-path-scanner-to-ips`.
```shell
web-path-scanner-to-ips [OPTIONS] --logs <LOGS>... --final-page <FINAL_PAGE> --find <FIND>...
```

- `-l, --logs <LOGS>...`: Log file paths or glob patterns (e.g., `/var/log/nginx/*.log`)
- `-f, --final-page <FINAL_PAGE>`: The "final" event string to match (e.g., "checkout")
- `-i, --find <FIND>...`: Earlier "trigger" event strings to find (e.g., "utm_source" "utm_campaign")
- `-e, --regex`: Treat patterns as regex instead of literal strings (slower but more flexible)
- `--strict-utf8`: Preserve legacy strict decoding (warn and skip any line containing invalid UTF-8)
- Default (lossy): Invalid UTF-8 sequences are replaced with the `U+FFFD` replacement character so processing never stalls. Each file prints a substitution summary when replacements occur.
- Strict mode: Enable `--strict-utf8` to mirror the old behavior: invalid byte ranges raise warnings and the entire offending line is skipped.
- Regardless of mode, hard I/O errors (e.g., truncated files) are still surfaced so operators stay informed.
The parser recognises several common access-log layouts out of the box:
- Apache / Nginx Combined Log Format
- AWS Application/Network Load Balancer access logs
- Amazon CloudFront distribution logs (tab-delimited)
- Microsoft IIS W3C Extended logs
- JSON-structured reverse proxy or application logs
- Key=value diagnostics (e.g., PHP-FPM style)
Lines that do not match these patterns are skipped without halting processing, and tests cover each format to guard against regressions.
```shell
web-path-scanner-to-ips \
  --logs /var/log/nginx/access.log \
  --final-page "checkout" \
  --find "utm_source" "utm_campaign"
```

This finds all IPs that:
- Reached a page containing "checkout"
- Had earlier log entries containing "utm_source" or "utm_campaign"
```shell
web-path-scanner-to-ips \
  --logs /var/log/nginx/*.log \
  --final-page "/api/purchase" \
  --find "promo_code=\w+" \
  --regex
```

This uses regex to match more complex patterns.
```shell
web-path-scanner-to-ips \
  -l "logs/access-*.log" \
  -f "signup" \
  -i "referrer=facebook" "referrer=google"
```

The tool supports standard Apache/Nginx Combined Log Format:
```
192.168.1.1 - - [07/Mar/2004:16:05:49 -0800] "GET /path?query=value HTTP/1.1" 200 1234 "referer" "user-agent"
```
The project follows a clean modular architecture:
- `main.rs`: Minimal CLI entry point
- `lib.rs`: Library root and orchestration
- `config.rs`: CLI argument parsing and matcher compilation
- `parser.rs`: Log line parsing (matcher-agnostic)
- `processor.rs`: Parallel log processing with rayon
- `reporter.rs`: Parallel report generation
- `error.rs`: Custom error types
- Parse CLI args → build compiled matchers (regex/Aho-Corasick) once
- Expand globs → get concrete file paths
- Parallel map-reduce:
  - Each thread processes file(s) with local state (no contention)
  - Thread-local states merged into final result
- Parallel report generation → format output strings in parallel
- Serial output → print to stdout
- Parallelism: Scales with CPU cores (uses all available cores by default)
- Memory: O(unique IPs) - only stores IPs that match criteria
- I/O: Buffered reading with parallel file processing
- Matching:
- Literal mode: O(n) with Aho-Corasick (very fast)
- Regex mode: O(n*m) where m is pattern complexity (slower)
Run the test suite:
```shell
cargo test
```

Run with verbose output:

```shell
cargo test -- --nocapture
```

Dependencies:

- `clap` - CLI argument parsing
- `glob` - File pattern expansion
- `regex` - Regular expression support
- `chrono` - Timestamp parsing
- `rayon` - Parallel processing
- `ahash` - Fast hashing
- `thiserror` - Error handling
- `aho-corasick` - Fast multi-pattern matching
This is a focused, performance-oriented tool. When contributing:
- Maintain the performance-first mindset
- Use `ahash::AHashMap` instead of `std::collections::HashMap`
- Leverage `rayon` for parallelism where appropriate
- Profile before and after changes with large log files
This project is dual-licensed under either of:
- Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
- Identify Final Events: The tool scans all logs to find when each IP hit the "final" event (e.g., checkout page). It tracks the latest timestamp for each IP.
- Identify Trigger Events: It also tracks all "earlier" trigger events (e.g., UTM parameters) for each IP with their timestamps.
- Temporal Correlation: For each IP that hit the final event, it filters trigger events to only those that occurred before the final event timestamp.
- Report: Outputs IPs that match both criteria, showing the final event time and all earlier trigger events.
This temporal filtering is crucial - it ensures you're seeing the user's journey chronologically, not just any matching events.