KnotFalse/Web-Path-Event-Tracer

Web Path Scanner to IPs

A high-performance command-line tool for correlating user activity from web server logs. It identifies users (by IP) who reached a specific "final" event and looks back through their history to find "earlier" trigger events that occurred before the final event.

🚀 Performance Features

This tool is built with performance as a primary requirement:

  • Parallel Processing: All file I/O, parsing, and correlation are parallelized using rayon
  • Optimized Hashing: Uses ahash::AHashMap for significantly faster hash operations
  • Smart Matching: Defaults to fast literal string matching with an Aho-Corasick automaton, falling back to regex only when explicitly requested

📋 Use Cases

This is a general-purpose auditing and enumeration tool. Example use cases:

  • Marketing Attribution: "Of the users who reached checkout, which ones arrived with specific UTM parameters?"
  • Security Analysis: "Which IPs that accessed admin pages also triggered security alerts?"
  • User Journey Analysis: "Which users who completed signup had previously visited specific landing pages?"

🔧 Installation

cargo build --release

The compiled binary will be in target/release/web-path-scanner-to-ips.

📖 Usage

web-path-scanner-to-ips [OPTIONS] --logs <LOGS>... --final-page <FINAL_PAGE> --find <FIND>...

Required Arguments

  • -l, --logs <LOGS>...: Log file paths or glob patterns (e.g., /var/log/nginx/*.log)
  • -f, --final-page <FINAL_PAGE>: The "final" event string to match (e.g., "checkout")
  • -i, --find <FIND>...: Earlier "trigger" event strings to find (e.g., "utm_source" "utm_campaign")

Optional Arguments

  • -e, --regex: Treat patterns as regex instead of literal strings (slower but more flexible)
  • --strict-utf8: Preserve legacy strict decoding (warn and skip any line containing invalid UTF-8)

🧵 UTF-8 Handling

  • Default (lossy): Invalid UTF-8 sequences are replaced with the \uFFFD replacement character so processing never stalls. Each file prints a substitution summary when replacements occur.
  • Strict mode: Enable --strict-utf8 to mirror the old behavior, where invalid byte ranges raise warnings and the entire offending line is skipped.
  • Regardless of mode, hard I/O errors (e.g., truncated files) are still surfaced so operators stay informed.

📚 Supported Log Formats

The parser recognises several common access-log layouts out of the box:

  • Apache / Nginx Combined Log Format
  • AWS Application/Network Load Balancer access logs
  • Amazon CloudFront distribution logs (tab-delimited)
  • Microsoft IIS W3C Extended logs
  • JSON-structured reverse proxy or application logs
  • Key=value diagnostics (e.g., PHP-FPM style)

Lines that do not match these patterns are skipped without halting processing, and tests cover each format to guard against regressions.

πŸ“ Examples

Example 1: Find UTM Parameters Before Checkout

web-path-scanner-to-ips \
  --logs /var/log/nginx/access.log \
  --final-page "checkout" \
  --find "utm_source" "utm_campaign"

This finds all IPs that:

  1. Reached a page containing "checkout"
  2. Had earlier log entries containing "utm_source" or "utm_campaign"

Example 2: Regex Matching with Multiple Log Files

web-path-scanner-to-ips \
  --logs /var/log/nginx/*.log \
  --final-page "/api/purchase" \
  --find "promo_code=\w+" \
  --regex

With --regex, each pattern is compiled as a regular expression, so promo_code=\w+ matches any promo code value rather than a fixed string.

Example 3: Glob Pattern for Logs

web-path-scanner-to-ips \
  -l "logs/access-*.log" \
  -f "signup" \
  -i "referrer=facebook" "referrer=google"

📊 Log Format

An example line in the standard Apache/Nginx Combined Log Format:

192.168.1.1 - - [07/Mar/2004:16:05:49 -0800] "GET /path?query=value HTTP/1.1" 200 1234 "referer" "user-agent"

πŸ—οΈ Architecture

The project follows a clean modular architecture:

  • main.rs: Minimal CLI entry point
  • lib.rs: Library root and orchestration
  • config.rs: CLI argument parsing and matcher compilation
  • parser.rs: Log line parsing (matcher-agnostic)
  • processor.rs: Parallel log processing with rayon
  • reporter.rs: Parallel report generation
  • error.rs: Custom error types

Performance Strategy

  1. Parse CLI args → Build compiled matchers (regex/Aho-Corasick) once
  2. Expand globs → Get concrete file paths
  3. Parallel map-reduce:
    • Each thread processes file(s) with local state (no contention)
    • Thread-local states merged into final result
  4. Parallel report generation → Format output strings in parallel
  5. Serial output → Print to stdout

📈 Performance Characteristics

  • Parallelism: Scales with CPU cores (uses all available cores by default)
  • Memory: O(unique matching IPs) - state is kept only for IPs that hit a final or trigger pattern
  • I/O: Buffered reading with parallel file processing
  • Matching:
    • Literal mode: O(n) with Aho-Corasick (very fast)
    • Regex mode: O(n*m) where m is pattern complexity (slower)

🧪 Testing

Run the test suite:

cargo test

Run with verbose output:

cargo test -- --nocapture

📦 Dependencies

  • clap - CLI argument parsing
  • glob - File pattern expansion
  • regex - Regular expression support
  • chrono - Timestamp parsing
  • rayon - Parallel processing
  • ahash - Fast hashing
  • thiserror - Error handling
  • aho-corasick - Fast multi-pattern matching

🤝 Contributing

This is a focused, performance-oriented tool. When contributing:

  • Maintain the performance-first mindset
  • Use ahash::AHashMap instead of std::collections::HashMap
  • Leverage rayon for parallelism where appropriate
  • Profile before and after changes with large log files

📄 License

This project is dual-licensed under either of:

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT License (LICENSE-MIT)

at your option.

πŸ” How It Works

  1. Identify Final Events: The tool scans all logs to find when each IP hit the "final" event (e.g., checkout page). It tracks the latest timestamp for each IP.

  2. Identify Trigger Events: It also tracks all "earlier" trigger events (e.g., UTM parameters) for each IP with their timestamps.

  3. Temporal Correlation: For each IP that hit the final event, it filters trigger events to only those that occurred before the final event timestamp.

  4. Report: Outputs IPs that match both criteria, showing the final event time and all earlier trigger events.

This temporal filtering is crucial: it ensures you are seeing each user's journey in chronological order, not just any co-occurring events.
