---
title: "Scaling multi-node LLM inference with NVIDIA Dynamo and NVIDIA GPUs on AKS (Part 2)"
date: 2026-01-22
description: "Learn how to scale multi-node LLM inference on Kubernetes using NVIDIA Dynamo, H100 GPUs, and Dynamo Planner tools to optimize throughput and latency."
authors:
- sachi-desai
- sertac-ozercan
tags: ["dynamo-series", "ai", "performance", "open-source"]
---

*This blog post is co-authored with
[Saurabh Aggarwal](https://www.linkedin.com/in/sa126/),
[Anish Maddipoti](https://www.linkedin.com/in/anish-maddipoti/),
[Amr Elmeleegy](https://www.linkedin.com/in/meleegy/), and
[Rohan Varma](https://www.linkedin.com/in/rohan-s-varma/) from NVIDIA.*

In our [previous post](https://blog.aks.azure.com/2025/10/24/dynamo-on-aks),
we demonstrated the power of the ND GB200 NVL72 platform, achieving a
staggering **1.2M tokens per second** across 10 nodes using NVIDIA Dynamo.
Today, we’re shifting focus from raw throughput to **developer velocity** and
**operational efficiency**.

We will explore how the [**Dynamo Planner**](https://github.com/ai-dynamo/dynamo/blob/main/docs/planner/sla_planner.md) and [**Dynamo Profiler**](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/profiler)
remove the guesswork from performance tuning.

<!-- truncate -->

## The Challenge: Balancing the "Rate Matching" Equation

Disaggregated serving separates the prefill phase (when the model first
processes the entire input sequence at once) and decode phase (when the
model starts sequentially generating output tokens) of inference
across distinct GPU nodes. This allows each phase to be independently
optimized with custom GPU counts and model parallelism configurations.

One of the main challenges in disaggregated serving is **rate matching**:
determining the right GPU allocation between prefill and decode stages to
meet a specific Service Level Objective (SLO). If you miscalculate the GPU
ratio between these stages, you face two "silent killers" of performance
(a short sizing sketch follows the list below):

* **Over-provisioned Prefill**: Your prompt processing is fast, but
requests bottleneck at the generation stage. This spikes *Inter-Token
Latency (ITL)* and leaves expensive compute nodes idle.
* **Under-provisioned Prefill**: Your decode GPUs sit starved for data.
This drives up *Time-To-First-Token (TTFT)* and inflates your
*Total Cost of Ownership (TCO)*.
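
To make the arithmetic behind rate matching concrete, here is a minimal
sizing sketch. The traffic numbers and the simple ceiling rule are
illustrative assumptions, not Dynamo’s exact policy:

```bash
# Minimal rate-matching sketch: pick per-stage worker counts so each stage
# keeps up with its offered token load. All numbers are illustrative.
python3 - <<'PY'
import math

offered_prefill_tps = 12000.0    # incoming prompt tokens/s (assumed)
offered_decode_tps = 900.0       # generated tokens/s (assumed)
prefill_cap_per_worker = 8500.0  # profiled prefill tokens/s per worker (assumed)
decode_cap_per_worker = 3300.0   # profiled decode tokens/s per worker (assumed)

num_prefill = math.ceil(offered_prefill_tps / prefill_cap_per_worker)
num_decode = math.ceil(offered_decode_tps / decode_cap_per_worker)
print(f"prefill workers: {num_prefill}, decode workers: {num_decode}")
PY
```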

Beyond rate matching, developers must also optimize model parallelism
parameters (data, tensor, and expert parallelism) to maintain high
["Goodput"](https://arxiv.org/abs/2401.09670): the throughput that is
actually delivered within your latency targets, rather than raw tokens
per second.

Exploring these configurations manually is technically challenging and
time-consuming, and it often results in suboptimal resource utilization.

## Dynamic Traffic: The Move to SLO-Driven Scaling

Static configurations are brittle. In production, traffic is rarely uniform:

* **Volatile Request Volume**: Traditional Horizontal Pod Autoscalers (HPAs)
scale on coarse signals such as CPU and memory, so they react too slowly
to sudden bursts of LLM traffic.
* **Shifting Sequence Patterns**: If your workload shifts from short chat
queries with a low input sequence length (ISL) to long-context document
analysis with a high ISL, a static disaggregated split instantly becomes
suboptimal, leaving prefill GPUs overworked and decode GPUs idle.

NVIDIA Dynamo addresses these gaps through two integrated components:
the **Dynamo Profiler** and the **SLO-based Dynamo Planner**.

---

### Let’s see it through an example application scenario

Consider a major airline’s mobile app that uses AI to offer personalized
rerouting during flight delays. This use case is a "stress test" for
inference: it is subject to massive, sudden bursts in traffic and highly
variable request patterns, such as a mix of short status queries and
long-context itinerary processing. To prevent latency spikes during these
peaks, the underlying system requires the precise orchestration offered
by a disaggregated architecture.

Using the
[Qwen3-32B-FP8](https://huggingface.co/Qwen/Qwen3-32B-FP8)
model, we can deploy an Airline Assistant with
strict SLA targets: TTFT ≤ 500ms and ITL ≤ 30ms.

During normal operations, passengers ask short queries like
"What's my flight status?" But when a major weather system causes
flight cancellations, passengers flood the app with complex rerouting
requests: long-context queries (~4,000 tokens) requiring detailed itinerary
responses (~500 tokens). This sudden surge of 200 concurrent users is
exactly the kind of real-world spike that breaks static configurations.

To build a truly efficient disaggregated AI inference system, you
need to transition from manual "guess-and-check" configurations
to an automated, SLO-driven approach. The core of this automation
lies in two distinct but deeply integrated components: the Dynamo
Profiler and the Dynamo Planner.

The first step in building your system is determining the "Golden Ratio"
of GPUs: how many should handle prefill versus decode, and what tensor
parallelism (TP) levels each should use.

### The Architect: Dynamo Profiler

The Dynamo Profiler is your pre-deployment simulation engine.
Instead of burning GPU hours testing every possible configuration, you
define your requirements in a **DynamoGraphDeploymentRequest (DGDR)**
manifest (sketched after the list below). The profiler then executes an
automated
["sweep"](https://github.com/ai-dynamo/dynamo/blob/main/docs/benchmarks/sla_driven_profiling.md#profiling-method)
of the search space:

* **Parallelization Mapping**: It tests different TP sizes for both prefill
and decode stages to find the lowest TTFT and ITL.
* **Hardware Simulation**: Using the **AI Configurator (AIC)** mode, the
profiler can simulate performance in just 20–30 seconds
based on pre-measured performance data, allowing for rapid
iteration before you ever touch a physical GPU.
* **Resulting Recommendation**: The output is a highly tuned
configuration that maximizes ["Goodput"](https://arxiv.org/abs/2401.09670):
the maximum throughput achievable while staying strictly within your
latency bounds.
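
To give a feel for the workflow, here is a minimal DGDR sketch. The field
names and values below are illustrative assumptions, not the authoritative
schema; consult the SLA-driven profiling docs linked above:

```bash
# Apply an illustrative DGDR manifest (field names are assumptions;
# see the Dynamo SLA-driven profiling docs for the real schema).
cat <<'EOF' | kubectl apply -f -
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: airline-assistant
spec:
  model: Qwen/Qwen3-32B-FP8
  sla:
    ttft: 500ms   # Time-To-First-Token target
    itl: 30ms     # Inter-Token Latency target
EOF
```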

Ultimately, app developers and AI engineers spend less time testing
different system setups and more time focusing on their airline
passengers’ needs.

### The Pilot: Dynamo Planner

Once your system is deployed, static configurations can't handle the
"jitter" of real-world traffic. This is where the Dynamo Planner takes
over as a runtime orchestration engine.

Unlike a traditional load balancer, the Dynamo Planner is **LLM-aware**.
It continuously monitors the live state of your cluster, specifically
looking at:

* **KV Cache Load**: It monitors memory utilization in the decode pool.
* **Prefill Queue Depth**: It tracks how many prompts are waiting to be
processed.

Using the performance bounds identified earlier by the profiler
(i.e., TTFT ≤ 500ms and ITL ≤ 30ms), the Planner
proactively scales the number of prefill and decode workers up or down. For
example, if a *sudden burst of long-context itinerary queries* floods the
system, the Planner detects the spike in the prefill queue and shifts available
GPU resources to the prefill pool *before* your TTFT violates its SLO.
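
You can watch these signals and the resulting decisions for yourself. A
minimal sketch, assuming a `dynamo` namespace and a Planner deployment
named `planner` (both names are assumptions; match them to your own
deployment):

```bash
# Follow the Planner's decision logs (namespace and name are assumptions)
kubectl logs -n dynamo deploy/planner --follow

# Watch prefill and decode worker pods scale in and out as the Planner acts
kubectl get pods -n dynamo --watch
```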

## Seeing it in Action

In our airline scenario, the system starts with 1 prefill worker and
1 decode worker. When the passenger surge hits, the Planner, checking on
its 60-second adjustment interval, detects the SLA violations:

```bash
Prefill calculation: 138.55 (p_thpt) / 4838.61 (p_engine_cap) = 1 (num_p)
Decode calculation: 27.27 (d_thpt) / 19381.08 (d_engine_cap) = 1 (num_d)
```

As traffic spikes to 200 concurrent passengers, the Planner recalculates:

```bash
Prefill calculation: 16177.75 (p_thpt) / 8578.39 (p_engine_cap) = 2 (num_p)
Decode calculation: 400.00 (d_thpt) / 3354.30 (d_engine_cap) = 1 (num_d)
Predicted number of engine replicas: prefill=2, decode=1
Updating prefill component VllmPrefillWorker to desired replica count 2
```
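
Reading these logs, the sizing rule appears to be the same ceiling
arithmetic sketched earlier, `replicas = ceil(observed_throughput /
per_engine_capacity)`: ceil(16177.75 / 8578.39) = 2 prefill workers,
while ceil(400.00 / 3354.30) = 1 decode worker still suffices.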

[See the Dynamo SLA Planner](https://asciinema.org/a/67XW4yXJIBmIe7bv)
in action as it automatically scales the Airline Assistant during a
traffic surge. The Planner scales to 2 prefill workers
while keeping 1 decode worker (the optimal configuration to handle the
surge while maintaining SLA targets). Within minutes, the new worker is
online and passengers are getting their rerouting options without
frustrating delays.

Now, you can try this yourself: run the NVIDIA Dynamo Profiler
to capture burst and request behavior, then use the SLO-based Planner to
translate latency targets into placement and scaling decisions on your AKS
cluster. Set it up in this order: profile under stress, define SLOs, and
let the Planner orchestrate your disaggregated inference system so that
sudden traffic spikes don’t turn into latency spikes.

After deploying Dynamo by following [these instructions](https://aka.ms/aks-dynamo),
get hands-on with the
[Qwen3-32B-FP8](https://huggingface.co/Qwen/Qwen3-32B-FP8)
model using the example in the [AKS Dynamo Part 2 sample](https://aka.ms/aks-dynamo-part-2).
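
Once the sample is running, you can send a representative rerouting request.
Here is a minimal sketch, assuming you have port-forwarded Dynamo’s
OpenAI-compatible frontend to local port 8000 (the service name and port
in your cluster may differ):

```bash
# Send a test chat request to the deployed model. The endpoint assumes a
# port-forwarded Dynamo frontend, e.g.:
#   kubectl port-forward svc/<your-frontend-service> 8000:8000
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B-FP8",
    "messages": [
      {"role": "user", "content": "My flight was cancelled due to weather. What are my rerouting options?"}
    ],
    "max_tokens": 500
  }'
```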

## Conclusion: Inference Without the Infrastructure Burden

The shift toward disaggregated serving is a necessity for the next
generation of reasoning-heavy and long-context LLMs. However, as we
have seen, the complexity of manually tuning these distributed systems
on Kubernetes can quickly become a bottleneck for even the most
experienced AI teams.

By utilizing the NVIDIA Dynamo Profiler, developers can move
from educated guessing to data-driven certainty, modeling performance
in seconds rather than days. When paired with the Dynamo Planner, this
static optimization becomes a dynamic, SLO-aware reality on AKS, capable of
weathering the unpredictable traffic spikes of production environments.

Ultimately, this suite transforms your inference stack from a series of
fragile configurations into a resilient, self-optimizing engine. For the AI
engineer, this means less time spent managing hardware limits and configuring
system scalability and more time spent delivering the high-quality,
interactive experiences that your users (and your passengers) expect.