ngx-inference: NGINX module for Gateway API Inference Extensions

Overview

This project provides a native NGINX module (built with ngx-rust) that implements the Gateway API Inference Extension using Envoy's ext_proc protocol over gRPC.

It implements two standard components:

  • Endpoint Picker Processor (EPP): a gRPC exchange, following the Gateway API Inference Extension specification, that obtains an upstream endpoint selection and exposes the selected endpoint via the $inference_upstream NGINX variable.
  • Body-Based Routing (BBR): a direct in-module implementation that extracts the model name from JSON request bodies and injects a model header, following the OpenAI API specification and the Gateway API Inference Extension standards.
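To make the BBR behavior concrete, here is a minimal Python sketch of the model-detection logic (the `extract_model` helper and its defaults are illustrative; the actual module is implemented in Rust with ngx-rust):

```python
import json

def extract_model(body: bytes, default_model: str = "unknown",
                  max_body_size: int = 10 * 1024 * 1024):
    """Illustrative BBR model detection: return the value to inject into the
    model header, or None when the body exceeds the configured size limit."""
    if len(body) > max_body_size:
        return None  # too large for BBR processing
    try:
        payload = json.loads(body)
    except ValueError:  # invalid JSON (or undecodable bytes)
        return default_model
    model = payload.get("model") if isinstance(payload, dict) else None
    return model if isinstance(model, str) else default_model
```

For example, a body of {"model": "gpt-4"} yields gpt-4, which BBR would inject as the model header (default X-Gateway-Model-Name).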

Inference Module Architecture

flowchart TD
  A[Client Request] --> B[Core]
  subgraph NGINX Pod
    subgraph NGINX Container
      subgraph NGINX Process
        B --"(1) Request Body"--> C[Inference Module<br/> with Body-Based Routing]
      end
    end
  end
  C --"(2) gRPC (Request Headers)"--> D[EPP Service<br/>Endpoint Picker]
  D --"(3) Endpoint Header"--> C
  C --"(4) $inference_upstream"--> B
  B --"(5)"--> E[AI Workload Endpoint]

NGINX configuration

Example configuration snippet for a location using BBR followed by EPP:

# Load the compiled module (Linux: .so path; macOS local build: .dylib)
load_module /usr/lib/nginx/modules/libngx_inference.so;

http {
    server {
        listen 8080;

        # OpenAI-like API endpoint with both EPP and BBR
        location /responses {
            # Configure the inference module for direct BBR processing
            inference_bbr on;
            inference_bbr_max_body_size 52428800; # 50MB for AI workloads
            inference_bbr_default_model "gpt-3.5-turbo"; # Default model when none found

            # Configure the inference module for EPP (Endpoint Picker Processor)
            inference_epp on;
            inference_epp_endpoint "epp-server:9001"; # EPP service name
            inference_epp_timeout_ms 5000;
            inference_epp_failure_mode_allow off; # Fail-closed for production
            # inference_epp_tls off; # Disable TLS for development/testing
            # inference_epp_ca_file /etc/ssl/certs/ca.crt; # Custom CA file

            # Default upstream fallback when EPP fails and failure_mode_allow is on
            # inference_default_upstream "fallback-server:8080";

            # Proxy to the upstream chosen by EPP, using the $inference_upstream
            # variable set by the module. Note: a variable in proxy_pass requires
            # a resolver directive (or a matching upstream block) so NGINX can
            # resolve the hostname at request time.
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_pass http://$inference_upstream;
        }
    }
}
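On the EPP side, the value that ends up in $inference_upstream comes from a header mutation in the EPP response. A simplified Python sketch of that lookup (the `pick_upstream` helper and the list-of-pairs representation stand in for the actual ext_proc protobuf messages):

```python
def pick_upstream(header_mutation, header_name: str = "X-Inference-Upstream"):
    """Illustrative lookup: scan (name, value) pairs from an EPP header
    mutation for the configured upstream header (HTTP header names are
    case-insensitive) and return its value, or None if absent."""
    for name, value in header_mutation:
        if name.lower() == header_name.lower():
            return value
    return None
```

When the EPP response carries ("x-inference-upstream", "10.0.0.5:8000"), the module exposes 10.0.0.5:8000 as $inference_upstream for use in proxy_pass.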

Current behavior and defaults

  • BBR:

    • Directive inference_bbr on|off enables or disables the direct BBR implementation.
    • BBR follows the Gateway API Inference Extension specification: it parses JSON request bodies for the "model" field and sets the model header.
    • Directive inference_bbr_header_name configures the model header name to inject (default X-Gateway-Model-Name).
    • Directive inference_bbr_max_body_size sets the maximum body size for BBR processing, in bytes (default 10 MB).
    • Directive inference_bbr_default_model sets the model value used when no model is found in the request body (default unknown).
    • Hybrid memory/file support: small bodies stay in memory; large bodies are read from NGINX temporary files.
    • Buffer pre-allocation is capped at 1 MB to avoid large upfront allocations. In-memory accumulation may still grow up to the configured inference_bbr_max_body_size limit; large payloads spill to disk and are read incrementally.
  • EPP:

    • Directive inference_epp on|off enables/disables EPP functionality.
    • Directive inference_epp_endpoint sets the gRPC endpoint for standard EPP ext-proc server communication.
    • Directive inference_epp_header_name configures the upstream header name to read from EPP responses (default X-Inference-Upstream).
    • Directive inference_epp_timeout_ms sets the gRPC timeout for EPP communication (default 200ms).
    • Directive inference_epp_failure_mode_allow on|off controls fail-open vs fail-closed behavior (default off).
    • Directive inference_default_upstream sets a fallback upstream when EPP fails and inference_epp_failure_mode_allow is on.
    • Directive inference_epp_tls on|off enables TLS for gRPC connections (default on).
    • Directive inference_epp_ca_file /path/to/ca.crt specifies CA certificate file path for TLS verification (optional).
    • EPP follows the Gateway API Inference Extension specification: performs headers-only exchange, reads header mutations from responses, and sets the upstream header for endpoint selection.
    • The $inference_upstream NGINX variable exposes the EPP-selected endpoint (read from the header configured by inference_epp_header_name) and can be used in proxy_pass directives.
  • Fail-open/closed:

    • inference_epp_failure_mode_allow on|off controls EPP fail-open vs fail-closed behavior.
    • EPP fail-closed mode returns 500 Internal Server Error on EPP processing failures.
    • EPP fail-open mode continues processing when EPP fails; with inference_epp_failure_mode_allow on, you can configure inference_default_upstream as a fallback upstream.
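The fail-open/fail-closed rules above can be summarized in a small Python sketch (the `resolve_endpoint` helper is illustrative, not part of the module's API):

```python
def resolve_endpoint(epp_result, failure_mode_allow: bool, default_upstream=None):
    """Illustrative decision logic: epp_result is the endpoint selected by
    EPP, or None if EPP processing failed."""
    if epp_result is not None:
        return epp_result  # EPP succeeded: proxy to its selected endpoint
    if not failure_mode_allow:
        # Fail-closed: the module answers 500 Internal Server Error
        raise RuntimeError("EPP failed and failure_mode_allow is off")
    # Fail-open: use the configured fallback, if any (None leaves
    # $inference_upstream unset)
    return default_upstream
```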

Notes and assumptions

  • Standards Compliance:

    • Both EPP and BBR implementations follow the Gateway API Inference Extension specification.
    • EPP is compatible with reference EPP servers for endpoint selection.
    • BBR is compatible with the OpenAI API specification for model detection from JSON request bodies.
  • Header names:

    • BBR injects a model header (default X-Gateway-Model-Name); the name is configurable via inference_bbr_header_name.
    • EPP is expected to return an endpoint hint via a header mutation. The module reads a configurable upstream header (inference_epp_header_name, default X-Inference-Upstream) and exposes its value as $inference_upstream.
  • TLS:

    • TLS support for gRPC connections is enabled by default via the inference_epp_tls directive.
    • Use inference_epp_ca_file to specify a custom CA certificate file for TLS verification.
    • TLS can be disabled by setting inference_epp_tls off if needed for development or testing.
  • Body processing:

    • EPP follows the standard Gateway API specification with headers-only mode (no body streaming).
    • BBR implements hybrid memory/file processing: small bodies (smaller than client_body_buffer_size) stay in memory; larger bodies are read from NGINX temporary files.
    • Buffer pre-allocation is capped at 1 MB to avoid large upfront allocations. In-memory accumulation may still grow up to the configured inference_bbr_max_body_size limit; large payloads spill to disk and are read incrementally.
    • BBR respects configurable size limits via inference_bbr_max_body_size directive.
  • Request headers to ext-proc:

    • EPP implementation forwards incoming request headers per the Gateway API specification for endpoint selection context.
    • BBR implementation processes request bodies directly for model detection without external communication.
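The hybrid body-processing strategy can be sketched as follows (the `plan_body_read` helper and its return shape are illustrative; the thresholds mirror the defaults described above):

```python
PREALLOC_CAP = 1 * 1024 * 1024  # 1 MB cap on upfront buffer allocation

def plan_body_read(body_size: int, client_body_buffer_size: int,
                   max_body_size: int = 10 * 1024 * 1024):
    """Illustrative planning step: decide where the body is read from and
    how much buffer to pre-allocate for it."""
    if body_size > max_body_size:
        return ("reject", 0)  # exceeds inference_bbr_max_body_size
    source = "memory" if body_size < client_body_buffer_size else "temp_file"
    prealloc = min(body_size, PREALLOC_CAP)  # never pre-allocate more than 1 MB
    return (source, prealloc)
```

A 5 MB body, for instance, is read from a temporary file but still only triggers a 1 MB upfront allocation.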

Testing

For comprehensive testing information, examples, and troubleshooting guides, see tests/README.md.

Local Development Setup

For local development and testing without Docker:

  1. Setup local environment and build the module:

    # Setup local development environment
    make setup-local
    
    # Build the module
    make build
  2. Start local services and run tests:

    # Start local mock services (echo server on :8080 and mock ext-proc on :9001).
    # NGINX is started automatically by 'make test-local'.
    make start-local
    
    # Run configuration tests locally
    make test-local

Troubleshooting

  • If EPP endpoints are unreachable or not listening for gRPC, you may see BAD_GATEWAY when inference_epp_failure_mode_allow is off. Toggle it on to fail open during testing.
  • TLS error logging: the module provides detailed TLS certificate validation error messages (e.g., "invalid peer certificate: UnknownIssuer") instead of generic transport errors. Check the error log for specific TLS issues such as unknown issuers or certificate validation failures.
  • Ensure your EPP implementation is configured to return a header mutation for the upstream endpoint. The module parses response frames and searches for header_mutation entries.
  • BBR processes JSON directly in the module; ensure request bodies contain valid JSON with a "model" field.
  • Use debug-level error_log to verify module activation and observe processing details: BBR logs body reading and size-limit enforcement; EPP logs gRPC errors with detailed TLS diagnostics.

License

Apache-2.0 (to align with upstream projects).
