This project provides a native NGINX module (built with ngx-rust) that implements the Gateway API Inference Extension using Envoy's ext_proc protocol over gRPC.
It implements two standard components:
- Endpoint Picker Processor (EPP): gRPC exchange following the Gateway API Inference Extension specification to obtain upstream endpoint selection, exposing the selected endpoint via the `$inference_upstream` NGINX variable.
- Body-Based Routing (BBR): direct in-module implementation that extracts model names from JSON request bodies and injects a model header, following the OpenAI API specification and Gateway API Inference Extension standards.
Reference docs:
- NGF design doc: https://github.com/nginx/nginx-gateway-fabric/blob/main/docs/proposals/gateway-inference-extension.md
- EPP reference implementation: https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp
- Module configuration: docs/configuration.md
- Example configurations: docs/examples/README.md
```mermaid
flowchart TD
    A[Client Request] --> B[Core]
    subgraph NGINX Pod
        subgraph NGINX Container
            subgraph NGINX Process
                B --"(1) Request Body"--> C[Inference Module<br/>with Body-Based Routing]
            end
        end
    end
    C --"(2) gRPC (Request Headers)"--> D[EPP Service<br/>Endpoint Picker]
    D --"(3) Endpoint Header"--> C
    C --"(4) $inference_upstream"--> B
    B --"(5)"--> E[AI Workload Endpoint]
```
Example configuration snippet for a location using BBR followed by EPP:
```nginx
# Load the compiled module (Linux: .so path; macOS local build: .dylib)
load_module /usr/lib/nginx/modules/libngx_inference.so;

http {
    server {
        listen 8080;

        # OpenAI-like API endpoint with both EPP and BBR
        location /responses {
            # Configure the inference module for direct BBR processing
            inference_bbr on;
            inference_bbr_max_body_size 52428800;          # 50MB for AI workloads
            inference_bbr_default_model "gpt-3.5-turbo";   # Default model when none is found

            # Configure the inference module for EPP (Endpoint Picker Processor)
            inference_epp on;
            inference_epp_endpoint "epp-server:9001";      # EPP service name
            inference_epp_timeout_ms 5000;
            inference_epp_failure_mode_allow off;          # Fail-closed for production
            # inference_epp_tls off;                       # Disable TLS for development/testing
            # inference_epp_ca_file /etc/ssl/certs/ca.crt; # Custom CA file

            # Default upstream fallback when EPP fails and failure_mode_allow is on
            # inference_default_upstream "fallback-server:8080";

            # Proxy to the chosen upstream (determined by EPP)
            # using the $inference_upstream variable set by the module
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_pass http://$inference_upstream;
        }
    }
}
```
BBR:

- Directive `inference_bbr on|off` enables/disables the direct BBR implementation.
- BBR follows the Gateway API Inference Extension specification: it parses JSON request bodies directly for the `model` field and sets the model header.
- Directive `inference_bbr_header_name` configures the model header name to inject (default `X-Gateway-Model-Name`).
- Directive `inference_bbr_max_body_size` sets the maximum body size for BBR processing in bytes (default 10MB).
- Directive `inference_bbr_default_model` sets the default model value used when no model is found in the request body (default `unknown`).
- Hybrid memory/file support: small bodies stay in memory; large bodies are read from NGINX temporary files.
- Memory pre-allocation is capped at 1MB to avoid large upfront allocations. Actual in-memory accumulation may grow up to the configured `inference_bbr_max_body_size` limit; large payloads spill to disk and are read incrementally.
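
A minimal BBR-only location might look like the following sketch; the upstream name `llm-backend` and the request path are illustrative assumptions, not part of this project:

```nginx
# Sketch: BBR-only location; "llm-backend" is an illustrative upstream name.
location /v1/chat/completions {
    inference_bbr on;
    inference_bbr_header_name   "X-Gateway-Model-Name";  # default header name
    inference_bbr_default_model "unknown";               # used when no "model" field is found
    inference_bbr_max_body_size 10485760;                # 10MB (default)

    # A body such as {"model": "llama-3", ...} results in the proxied request
    # carrying the header X-Gateway-Model-Name: llama-3.
    proxy_pass http://llm-backend;
}
```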
EPP:

- Directive `inference_epp on|off` enables/disables EPP functionality.
- Directive `inference_epp_endpoint` sets the gRPC endpoint for communication with a standard EPP ext-proc server.
- Directive `inference_epp_header_name` configures the upstream header name to read from EPP responses (default `X-Inference-Upstream`).
- Directive `inference_epp_timeout_ms` sets the gRPC timeout for EPP communication (default 200ms).
- Directive `inference_epp_failure_mode_allow on|off` controls fail-open vs. fail-closed behavior (default `off`).
- Directive `inference_default_upstream` sets a fallback upstream used when EPP fails and `inference_epp_failure_mode_allow` is `on`.
- Directive `inference_epp_tls on|off` enables TLS for gRPC connections (default `on`).
- Directive `inference_epp_ca_file /path/to/ca.crt` specifies the CA certificate file path for TLS verification (optional).
- EPP follows the Gateway API Inference Extension specification: it performs a headers-only exchange, reads header mutations from responses, and sets the upstream header for endpoint selection.
- The `$inference_upstream` NGINX variable exposes the EPP-selected endpoint (read from the header configured by `inference_epp_header_name`) and can be used in `proxy_pass` directives, as shown in the sketch below.
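
A minimal EPP-only location (no BBR) might look like this sketch; the endpoint address and request path are illustrative assumptions:

```nginx
# Sketch: EPP-only location; "epp.example.svc:9002" is an illustrative endpoint.
location /v1/completions {
    inference_epp on;
    inference_epp_endpoint    "epp.example.svc:9002";
    inference_epp_timeout_ms  200;                     # default timeout
    inference_epp_header_name "X-Inference-Upstream";  # default header read from EPP responses

    # $inference_upstream is populated from the EPP header mutation
    proxy_pass http://$inference_upstream;
}
```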
Fail-open/closed:

- Directive `inference_epp_failure_mode_allow on|off` controls EPP fail-open vs. fail-closed behavior.
- Fail-closed mode returns `500 Internal Server Error` on EPP processing failures.
- Fail-open mode continues processing when EPP fails. When `inference_epp_failure_mode_allow` is `on`, you can configure `inference_default_upstream` to specify a fallback upstream (see the sketch below).
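
A fail-open configuration might look like the following sketch; the fallback upstream value mirrors the commented example above and is illustrative:

```nginx
# Sketch: fail-open EPP with a static fallback upstream.
location /responses {
    inference_epp on;
    inference_epp_endpoint "epp-server:9001";
    inference_epp_failure_mode_allow on;                # continue when EPP fails
    inference_default_upstream "fallback-server:8080";  # used if EPP is unavailable

    # When EPP fails, requests are routed to the fallback upstream instead of erroring
    # (this sketch assumes the fallback is surfaced via $inference_upstream).
    proxy_pass http://$inference_upstream;
}
```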
Standards Compliance:
- Both EPP and BBR implementations follow the Gateway API Inference Extension specification.
- EPP is compatible with reference EPP servers for endpoint selection.
- BBR is compatible with the OpenAI API specification for model detection from JSON request bodies.
Header names:

- BBR injects a model header (default `X-Gateway-Model-Name`). You can configure the name via `inference_bbr_header_name`.
- EPP should return an endpoint hint via header mutation. This module reads a configurable upstream header via `inference_epp_header_name` (default `X-Inference-Upstream`) and exposes its value as `$inference_upstream`.
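
If your backend or EPP server expects different header names, both can be overridden; the values below are illustrative assumptions:

```nginx
# Sketch: custom header names (illustrative values); place these alongside the
# other inference_* directives in the relevant location block.
inference_bbr_header_name "X-Model-Name";         # header BBR injects with the detected model
inference_epp_header_name "X-Selected-Endpoint";  # header read from EPP responses into $inference_upstream
```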
TLS:

- TLS support for gRPC connections is enabled by default via the `inference_epp_tls` directive.
- Use `inference_epp_ca_file` to specify a custom CA certificate file for TLS verification.
- TLS can be disabled with `inference_epp_tls off` if needed for development or testing (see the sketch below).
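
A TLS setup against an EPP server signed by a private CA might look like this sketch; the CA file path is an illustrative assumption:

```nginx
# Sketch: gRPC TLS with a custom CA bundle (illustrative path).
inference_epp_tls on;                             # default
inference_epp_ca_file /etc/ssl/certs/epp-ca.crt;  # CA used to verify the EPP server

# For plaintext gRPC during development or testing:
# inference_epp_tls off;
```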
Body processing:

- EPP follows the standard Gateway API Inference Extension specification with headers-only mode (no body streaming).
- BBR implements hybrid memory/file processing: small bodies (smaller than `client_body_buffer_size`) stay in memory; larger bodies are read from NGINX temporary files.
- Memory pre-allocation is capped at 1MB to avoid large upfront allocations. Actual in-memory accumulation may grow up to the configured `inference_bbr_max_body_size` limit; large payloads spill to disk and are read incrementally.
- BBR respects the configurable size limit set via the `inference_bbr_max_body_size` directive (see the sketch below).
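
As a sketch, the in-memory threshold comes from NGINX's standard `client_body_buffer_size` directive, while the module's own directive caps the total body size BBR will process; the values shown are illustrative:

```nginx
# Sketch: bodies up to 64k stay in memory; larger bodies spill to temp files.
# inference_bbr_max_body_size caps the body size BBR will process. Values are illustrative.
client_body_buffer_size     64k;
inference_bbr_max_body_size 20971520;  # 20MB
```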
Request headers to ext-proc:
- The EPP implementation forwards incoming request headers per the Gateway API specification to provide endpoint-selection context.
- The BBR implementation processes request bodies directly for model detection, without external communication.
For comprehensive testing information, examples, and troubleshooting guides, see tests/README.md.
For local development and testing without Docker:
Setup the local environment and build the module:

```bash
# Setup local development environment
make setup-local

# Build the module
make build
```
Start local services and run the tests:

```bash
# Start local mock services (echo server on :8080 and mock ext-proc on :9001).
# NGINX is started automatically by 'make test-local'.
make start-local

# Run configuration tests locally
make test-local
```
- If EPP endpoints are unreachable or not listening on gRPC, you may see `BAD_GATEWAY` when failure mode allow is off. Toggle `*_failure_mode_allow on` to fail open during testing.
- Enhanced TLS error logging: the module provides detailed TLS certificate validation error messages (e.g., "invalid peer certificate: UnknownIssuer") instead of generic transport errors. Check the error logs for specific TLS issues such as unknown issuers or certificate validation failures.
- Ensure your EPP implementation is configured to return a header mutation for the upstream endpoint. The module parses response frames and searches for `header_mutation` entries.
- BBR processes JSON directly in the module; ensure request bodies contain valid JSON with a `model` field.
- Use `error_log` and debug logging to verify module activation. BBR logs body reading and size-limit enforcement; EPP logs gRPC errors with detailed TLS diagnostics. Set `error_log` to `debug` to observe processing details (see the sketch below).
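
For example, debug-level logging can be enabled while testing; the log path is illustrative, and the `debug` level requires an NGINX binary built with `--with-debug`:

```nginx
# Sketch: verbose logging while testing (illustrative log path).
# The debug level requires an NGINX binary built with --with-debug.
error_log /var/log/nginx/error.log debug;
```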
Apache-2.0 (to align with upstream projects).