This project provides a native NGINX module (built with ngx-rust) that implements the Gateway API Inference Extension using Envoy's ext_proc protocol over gRPC.
It implements two standard components:
- Endpoint Picker Processor (EPP): gRPC exchange following the Gateway API Inference Extension specification to obtain upstream endpoint selection, exposing the selected endpoint via the `$inference_upstream` NGINX variable.
- Body-Based Routing (BBR): direct in-module implementation that extracts model names from JSON request bodies and injects a model header, following the OpenAI API specification and Gateway API Inference Extension standards.
Reference docs:
- NGF design doc: https://github.com/nginx/nginx-gateway-fabric/blob/main/docs/proposals/gateway-inference-extension.md
- EPP reference implementation: https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp
- Module configuration: docs/configuration.md
- Example configurations: docs/examples/README.md
```mermaid
flowchart TD
    A[Client Request] --> B[Core]
    subgraph NGINX Pod
        subgraph NGINX Container
            subgraph NGINX Process
                B --"(1) Request Body"--> C[Inference Module<br/>with Body-Based Routing]
            end
        end
    end
    C --"(2) gRPC (Request Headers)"--> D[EPP Service<br/>Endpoint Picker]
    D --"(3) Endpoint Header"--> C
    C --"(4) $inference_upstream"--> B
    B --"(5)"--> E[AI Workload Endpoint]
```
Example configuration snippet for a location using BBR followed by EPP:
```nginx
# Load the compiled module (Linux: .so path; macOS local build: .dylib)
load_module /usr/lib/nginx/modules/libngx_inference.so;

http {
    server {
        listen 8080;

        # OpenAI-like API endpoint with both EPP and BBR
        location /responses {
            # Configure the inference module for direct BBR processing
            inference_bbr on;
            inference_bbr_max_body_size 52428800;          # 50MB for AI workloads
            inference_bbr_default_model "gpt-3.5-turbo";   # Default model when none is found

            # Configure the inference module for EPP (Endpoint Picker Processor)
            inference_epp on;
            inference_epp_endpoint "epp-server:9001";      # EPP service name
            inference_epp_timeout_ms 5000;
            inference_epp_failure_mode_allow off;          # Fail-closed for production
            # inference_epp_tls off;                       # Disable TLS for development/testing
            # inference_epp_ca_file /etc/ssl/certs/ca.crt; # Custom CA file

            # Default upstream fallback when EPP fails and failure_mode_allow is on
            # inference_default_upstream "fallback-server:8080";

            # Proxy to the chosen upstream (determined by EPP)
            # using the $inference_upstream variable set by the module
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_pass http://$inference_upstream;
        }
    }
}
```
BBR:

- Directive `inference_bbr on|off` enables/disables the direct BBR implementation.
- BBR follows the Gateway API Inference Extension specification: it parses JSON request bodies directly for the `model` field and sets the model header.
- Directive `inference_bbr_header_name` configures the model header name to inject (default `X-Gateway-Model-Name`).
- Directive `inference_bbr_max_body_size` sets the maximum body size for BBR processing in bytes (default 10MB).
- Directive `inference_bbr_default_model` sets the default model value used when no model is found in the request body (default `unknown`).
- Hybrid memory/file support: small bodies stay in memory; large bodies are read from NGINX temporary files.
- Memory pre-allocation is capped at 1MB to avoid large upfront allocations. Actual in-memory accumulation may grow up to the configured `inference_bbr_max_body_size` limit; large payloads spill to disk and are read incrementally.
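
A minimal BBR-only location might look like the following sketch; the upstream name `llm-backend` and the request path are illustrative assumptions, not part of this project:

```nginx
# Sketch: BBR-only location; "llm-backend" is an illustrative upstream name.
location /v1/chat/completions {
    inference_bbr on;
    inference_bbr_header_name   "X-Gateway-Model-Name";  # default header name
    inference_bbr_default_model "unknown";               # used when no "model" field is found
    inference_bbr_max_body_size 10485760;                # 10MB (default)

    # A body such as {"model": "llama-3", ...} results in the proxied request
    # carrying the header X-Gateway-Model-Name: llama-3.
    proxy_pass http://llm-backend;
}
```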
EPP:

- Directive `inference_epp on|off` enables/disables EPP functionality.
- Directive `inference_epp_endpoint` sets the gRPC endpoint for communication with a standard EPP ext-proc server.
- Directive `inference_epp_header_name` configures the upstream header name to read from EPP responses (default `X-Inference-Upstream`).
- Directive `inference_epp_timeout_ms` sets the gRPC timeout for EPP communication (default 200ms).
- Directive `inference_epp_failure_mode_allow on|off` controls fail-open vs. fail-closed behavior (default `off`).
- Directive `inference_default_upstream` sets a fallback upstream used when EPP fails and `inference_epp_failure_mode_allow` is `on`.
- Directive `inference_epp_tls on|off` enables TLS for gRPC connections (default `on`).
- Directive `inference_epp_ca_file /path/to/ca.crt` specifies the CA certificate file path for TLS verification (optional).
- EPP follows the Gateway API Inference Extension specification: it performs a headers-only exchange, reads header mutations from responses, and sets the upstream header for endpoint selection.
- The `$inference_upstream` NGINX variable exposes the EPP-selected endpoint (read from the header configured by `inference_epp_header_name`) and can be used in `proxy_pass` directives, as shown in the sketch below.
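
A minimal EPP-only location (no BBR) might look like this sketch; the endpoint address and request path are illustrative assumptions:

```nginx
# Sketch: EPP-only location; "epp.example.svc:9002" is an illustrative endpoint.
location /v1/completions {
    inference_epp on;
    inference_epp_endpoint    "epp.example.svc:9002";
    inference_epp_timeout_ms  200;                     # default timeout
    inference_epp_header_name "X-Inference-Upstream";  # default header read from EPP responses

    # $inference_upstream is populated from the EPP header mutation
    proxy_pass http://$inference_upstream;
}
```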
Fail-open/closed:

- Directive `inference_epp_failure_mode_allow on|off` controls EPP fail-open vs. fail-closed behavior.
- Fail-closed mode returns `500 Internal Server Error` on EPP processing failures.
- Fail-open mode continues processing when EPP fails. When `inference_epp_failure_mode_allow` is `on`, you can configure `inference_default_upstream` to specify a fallback upstream (see the sketch below).
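
A fail-open configuration might look like the following sketch; the fallback upstream value mirrors the commented example above and is illustrative:

```nginx
# Sketch: fail-open EPP with a static fallback upstream.
location /responses {
    inference_epp on;
    inference_epp_endpoint "epp-server:9001";
    inference_epp_failure_mode_allow on;                # continue when EPP fails
    inference_default_upstream "fallback-server:8080";  # used if EPP is unavailable

    # When EPP fails, requests are routed to the fallback upstream instead of erroring
    # (this sketch assumes the fallback is surfaced via $inference_upstream).
    proxy_pass http://$inference_upstream;
}
```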
Standards Compliance:
- Both EPP and BBR implementations follow the Gateway API Inference Extension specification.
- EPP is compatible with reference EPP servers for endpoint selection.
- BBR is compatible with the OpenAI API specification for model detection from JSON request bodies.
Header names:

- BBR injects a model header (default `X-Gateway-Model-Name`). You can configure the name via `inference_bbr_header_name`.
- EPP should return an endpoint hint via header mutation. This module reads a configurable upstream header via `inference_epp_header_name` (default `X-Inference-Upstream`) and exposes its value as `$inference_upstream`.
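
If your backend or EPP server expects different header names, both can be overridden; the values below are illustrative assumptions:

```nginx
# Sketch: custom header names (illustrative values); place these alongside the
# other inference_* directives in the relevant location block.
inference_bbr_header_name "X-Model-Name";         # header BBR injects with the detected model
inference_epp_header_name "X-Selected-Endpoint";  # header read from EPP responses into $inference_upstream
```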
TLS:

- TLS support for gRPC connections is enabled by default via the `inference_epp_tls` directive.
- Use `inference_epp_ca_file` to specify a custom CA certificate file for TLS verification.
- TLS can be disabled with `inference_epp_tls off` if needed for development or testing (see the sketch below).
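
A TLS setup against an EPP server signed by a private CA might look like this sketch; the CA file path is an illustrative assumption:

```nginx
# Sketch: gRPC TLS with a custom CA bundle (illustrative path).
inference_epp_tls on;                             # default
inference_epp_ca_file /etc/ssl/certs/epp-ca.crt;  # CA used to verify the EPP server

# For plaintext gRPC during development or testing:
# inference_epp_tls off;
```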
Body processing:

- EPP follows the standard Gateway API Inference Extension specification with headers-only mode (no body streaming).
- BBR implements hybrid memory/file processing: small bodies (smaller than `client_body_buffer_size`) stay in memory; larger bodies are read from NGINX temporary files.
- Memory pre-allocation is capped at 1MB to avoid large upfront allocations. Actual in-memory accumulation may grow up to the configured `inference_bbr_max_body_size` limit; large payloads spill to disk and are read incrementally.
- BBR respects the configurable size limit set via the `inference_bbr_max_body_size` directive (see the sketch below).
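
As a sketch, the in-memory threshold comes from NGINX's standard `client_body_buffer_size` directive, while the module's own directive caps the total body size BBR will process; the values shown are illustrative:

```nginx
# Sketch: bodies up to 64k stay in memory; larger bodies spill to temp files.
# inference_bbr_max_body_size caps the body size BBR will process. Values are illustrative.
client_body_buffer_size     64k;
inference_bbr_max_body_size 20971520;  # 20MB
```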
Request headers to ext-proc:
- The EPP implementation forwards incoming request headers per the Gateway API specification to provide endpoint-selection context.
- The BBR implementation processes request bodies directly for model detection, without external communication.
For comprehensive testing information, examples, and troubleshooting guides, see tests/README.md.
For local development and testing without Docker:
Setup the local environment and build the module:

```bash
# Setup local development environment
make setup-local

# Build the module
make build
```
Start local services and run the tests:

```bash
# Start local mock services (echo server on :8080 and mock ext-proc on :9001).
# NGINX is started automatically by 'make test-local'.
make start-local

# Run configuration tests locally
make test-local
```
- If EPP endpoints are unreachable or not listening on gRPC, you may see `BAD_GATEWAY` when failure mode allow is off. Toggle `*_failure_mode_allow on` to fail open during testing.
- Enhanced TLS error logging: the module provides detailed TLS certificate validation error messages (e.g., "invalid peer certificate: UnknownIssuer") instead of generic transport errors. Check the error logs for specific TLS issues such as unknown issuers or certificate validation failures.
- Ensure your EPP implementation is configured to return a header mutation for the upstream endpoint. The module parses response frames and searches for `header_mutation` entries.
- BBR processes JSON directly in the module; ensure request bodies contain valid JSON with a `model` field.
- Use `error_log` and debug logging to verify module activation. BBR logs body reading and size-limit enforcement; EPP logs gRPC errors with detailed TLS diagnostics. Set `error_log` to `debug` to observe processing details (see the sketch below).
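
For example, debug-level logging can be enabled while testing; the log path is illustrative, and the `debug` level requires an NGINX binary built with `--with-debug`:

```nginx
# Sketch: verbose logging while testing (illustrative log path).
# The debug level requires an NGINX binary built with --with-debug.
error_log /var/log/nginx/error.log debug;
```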
Apache-2.0 (to align with upstream projects).