diff --git a/website/blog/2025-12-11-autoscale-inference-workloads-with-kaito/index.md b/website/blog/2025-12-11-autoscale-inference-workloads-with-kaito/index.md
new file mode 100644
index 000000000..acbeb48dc
--- /dev/null
+++ b/website/blog/2025-12-11-autoscale-inference-workloads-with-kaito/index.md
@@ -0,0 +1,274 @@
+---
+title: "Autoscale KAITO inference workloads on AKS using KEDA"
+date: "2026-01-15"
+description: "Learn how to autoscale KAITO inference workloads on AKS with KEDA to handle varying inference requests and optimize Kubernetes GPU utilization in AKS clusters."
+authors: ["andy-zhang", "sachi-desai"]
+tags: ["ai", "kaito"]
+---
+
+[Kubernetes AI Toolchain Operator](https://github.com/Azure/kaito) (KAITO) is an operator that simplifies and automates AI/ML model inference, tuning, and RAG in a Kubernetes cluster. With the recent [v0.8.0 release](https://github.com/Azure/kaito/releases/tag/v0.8.0), KAITO introduced intelligent autoscaling for inference workloads as an alpha feature! In this blog, we'll guide you through setting up event-driven autoscaling for vLLM inference workloads.
+
+
+## Introduction
+
+LLM inference serving is a core and widely used feature of KAITO. As the number of waiting inference requests grows, more inference instances need to be scaled out to keep requests from being blocked. Conversely, when the number of waiting inference requests declines, scaling in the inference instances improves GPU resource utilization. Kubernetes Event-driven Autoscaling (KEDA) is well suited for this kind of inference pod autoscaling: it enables event-driven, fine-grained scaling based on external metrics and triggers, and it supports a wide range of event sources (including custom metrics), allowing pods to scale precisely in response to workload demand. This flexibility and extensibility make KEDA ideal for dynamic, cloud-native applications that require responsive and efficient autoscaling.
+
+To enable intelligent autoscaling for KAITO inference workloads based on inference service metrics, the following components and features are used:
+
+- [Kubernetes Event-driven Autoscaling (KEDA)](https://github.com/kedacore/keda)
+
+- **[keda-kaito-scaler](https://github.com/kaito-project/keda-kaito-scaler)** – A dedicated KEDA external scaler for KAITO that eliminates the need for external dependencies such as Prometheus.
+
+- **KAITO `InferenceSet` CustomResourceDefinition (CRD) and controller** – A new CRD and controller built on top of the KAITO workspace for intelligent autoscaling, introduced as an alpha feature in KAITO `v0.8.0`.
+
+### Architecture
+
+The following diagram shows how keda-kaito-scaler integrates the KAITO InferenceSet with KEDA to autoscale inference workloads on AKS:
+
+![Architecture diagram showing keda-kaito-scaler integrating KAITO InferenceSet with KEDA to autoscale inference workloads on AKS](keda-kaito-scaler-arch.png)
+
+## Getting started
+
+### Create an AKS cluster with GPU auto-provisioning capabilities for KAITO
+
+Refer to the instructions on [how to create an AKS cluster with GPU auto-provisioning capabilities for KAITO](https://kaito-project.github.io/kaito/docs/azure).
+
+### Enable InferenceSet controller in KAITO
+
+The InferenceSet CRD and controller were introduced as an **alpha** feature in KAITO version `v0.8.0`. Built on top of the KAITO workspace, InferenceSet supports the scale subresource API for intelligent autoscaling. To use InferenceSet, the InferenceSet controller must be enabled during the KAITO installation.
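+
+If KAITO is already installed in your cluster, you can first check whether the InferenceSet API is available before upgrading. This is a quick, illustrative check that assumes the InferenceSet CRD is registered under KAITO's `kaito.sh` API group:
+
+```bash
+# List the API resources served by KAITO; InferenceSet should appear once the controller is enabled.
+kubectl api-resources --api-group=kaito.sh
+
+# Alternatively, look for the CRD directly.
+kubectl get crd | grep -i inferenceset
+```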
+
+The following Helm command installs (or upgrades) the KAITO workspace chart with the InferenceSet controller feature gate enabled:
+
+```bash
+export CLUSTER_NAME=kaito
+
+helm repo add kaito https://kaito-project.github.io/kaito/charts/kaito
+helm repo update
+helm upgrade --install kaito-workspace kaito/workspace \
+  --namespace kaito-workspace \
+  --create-namespace \
+  --set clusterName="$CLUSTER_NAME" \
+  --set featureGates.enableInferenceSetController=true \
+  --wait
+```
+
+### Install KEDA
+
+- **Option 1**: Enable the managed KEDA add-on. For instructions on enabling the KEDA add-on on an AKS cluster, refer to [Install the KEDA add-on on AKS](https://learn.microsoft.com/azure/aks/keda-deploy-add-on-cli).
+
+- **Option 2**: Install KEDA using the Helm chart.
+
+> The following example installs KEDA 2.x using the Helm chart. For other installation methods, refer to the [KEDA deployment documentation](https://github.com/kedacore/keda#deploying-keda).
+
+```bash
+helm repo add kedacore https://kedacore.github.io/charts
+helm install keda kedacore/keda --namespace kube-system
+```
+
+## Example Scenarios
+
+### Time-Based KEDA Scaler
+
+The KEDA cron scaler scales workloads according to time-based schedules, which makes it especially useful for workloads with predictable traffic patterns. It is ideal when peak hours are known ahead of time, allowing you to adjust resources proactively before demand rises. For more details about time-based scalers, refer to [Scale applications based on a cron schedule](https://keda.sh/docs/2.18/scalers/cron/).
+
+#### Example: Business Hours Scaling
+
+- Create a KAITO InferenceSet for running inference workloads
+
+The following example creates an InferenceSet for the phi-4-mini model:
+
+```bash
+cat <
+```
+
+### Metric-Based KEDA Scaler
+
+The metric-based KEDA scaler scales inference workloads on live vLLM serving metrics, such as the number of waiting requests, using the keda-kaito-scaler external scaler.
+
+- Install keda-kaito-scaler
+
+> This component is required only when using the metric-based KEDA scaler. Ensure that keda-kaito-scaler is installed in the same namespace as KEDA.
+
+```bash
+helm repo add keda-kaito-scaler https://kaito-project.github.io/keda-kaito-scaler/charts/kaito-project
+helm upgrade --install keda-kaito-scaler -n kube-system keda-kaito-scaler/keda-kaito-scaler
+```
+
+After a few seconds, a new deployment named `keda-kaito-scaler` should be running:
+
+```bash
+# kubectl get deployment keda-kaito-scaler -n kube-system
+NAME                READY   UP-TO-DATE   AVAILABLE   AGE
+keda-kaito-scaler   1/1     1            1           28h
+```
+
+The `keda-kaito-scaler` provides a simplified configuration interface for scaling vLLM inference workloads. It scrapes metrics directly from the inference pods, eliminating the need for a separate monitoring stack.
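+
+Because the scaler reads these metrics straight from the inference pods, you can inspect the same signal yourself. The following is an illustrative check (the pod name and port are assumptions; adjust them to your deployment) that port-forwards to an inference pod and looks at vLLM's `vllm:num_requests_waiting` gauge, the default metric used for scaling decisions:
+
+```bash
+# Find your inference pod first, e.g. with: kubectl get pods
+# Pod name and port below are illustrative; vLLM serves Prometheus metrics on /metrics.
+kubectl port-forward pod/phi-4-mini-0 8000:8000 &
+
+curl -s http://localhost:8000/metrics | grep vllm:num_requests_waiting
+```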
+
+#### Example: Create a KAITO InferenceSet with annotations for running inference workloads
+
+- The following example creates an InferenceSet for the phi-4-mini model, using annotations with the prefix `scaledobject.kaito.sh/` to supply parameter inputs for the KEDA KAITO scaler.
+
+  - `scaledobject.kaito.sh/auto-provision`
+    - required, if it's `true`, the KEDA KAITO scaler automatically provisions a ScaledObject based on the `InferenceSet` object
+  - `scaledobject.kaito.sh/max-replicas`
+    - required, maximum number of replicas for the target InferenceSet
+  - `scaledobject.kaito.sh/metricName`
+    - optional, specifies the metric name collected from the vLLM pod that is monitored to trigger the scaling operation; defaults to `vllm:num_requests_waiting`. See [vLLM Production Metrics](https://docs.vllm.ai/en/stable/usage/metrics/#general-metrics) for the full list of vLLM metrics
+  - `scaledobject.kaito.sh/threshold`
+    - required, specifies the threshold for the monitored metric that triggers the scaling operation
+
+```bash
+cat <
+```
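+
+Once the annotated InferenceSet is applied, keda-kaito-scaler should provision a ScaledObject for it, and KEDA then drives the InferenceSet's replica count through its scale subresource. The following is an illustrative way to verify this, assuming the InferenceSet is named `phi-4-mini` in the `default` namespace and that the CRD's resource name is `inferenceset`:
+
+```bash
+# Confirm the auto-provisioned ScaledObject exists and inspect its trigger configuration
+# (the generated ScaledObject name may differ; take it from the list output).
+kubectl get scaledobject -n default
+kubectl describe scaledobject phi-4-mini -n default
+
+# Watch the InferenceSet replicas change as vllm:num_requests_waiting crosses the configured threshold.
+kubectl get inferenceset phi-4-mini -n default -w
+```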