
Commit c6a880d

Merge branch 'master' into kubevirt-blog
2 parents: ace060b + 9702258

File tree: 11 files changed (+339, -2 lines)


.gitignore

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,2 +1,3 @@
 .DS_Store
-.vscode
+.vscode
+package-lock.json
```


website/blog/2025-10-24-dynamo-on-aks/index.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -7,7 +7,7 @@ authors:
 - sachi-desai
 - sertac-ozercan
 - rita-zhang
-tags: ["ai", "gpu", "oss", "GB200"]
+tags: ["dynamo-series", "ai", "gpu", "oss", "GB200"]
 ---

 *This blog post is co-authored with
```

Lines changed: 210 additions & 0 deletions (new file; full contents below)

---
title: "Scaling multi-node LLM inference with NVIDIA Dynamo and NVIDIA GPUs on AKS (Part 2)"
date: 2026-01-22
description: "Learn how to scale multi-node LLM inference on Kubernetes using NVIDIA Dynamo, H100 GPUs, and Dynamo Planner tools to optimize throughput and latency."
authors:
- sachi-desai
- sertac-ozercan
tags: ["dynamo-series", "ai", "performance", "open-source"]
---

*This blog post is co-authored with [Saurabh Aggarwal](https://www.linkedin.com/in/sa126/), [Anish Maddipoti](https://www.linkedin.com/in/anish-maddipoti/), [Amr Elmeleegy](https://www.linkedin.com/in/meleegy/), and [Rohan Varma](https://www.linkedin.com/in/rohan-s-varma/) from NVIDIA.*

In our [previous post](https://blog.aks.azure.com/2025/10/24/dynamo-on-aks), we demonstrated the power of the ND GB200 NVL72 platform, achieving a staggering **1.2M tokens per second** across 10 nodes using NVIDIA Dynamo. Today, we’re shifting focus from raw throughput to **developer velocity** and **operational efficiency**.

We will explore how the [**Dynamo Planner**](https://github.com/ai-dynamo/dynamo/blob/main/docs/planner/sla_planner.md) and [**Dynamo Profiler**](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/profiler) remove the guesswork from performance tuning.

<!-- truncate -->

## The Challenge: Balancing the "Rate Matching" Equation

Disaggregated serving separates the prefill phase (when the model first processes the entire input sequence at once) and the decode phase (when the model sequentially generates output tokens) of inference across distinct GPU nodes. This allows each phase to be independently optimized with custom GPU counts and model parallelism configurations.

![Disaggregated serving with Dynamo](./disag-serving-with-dynamo.png)

One of the main challenges in disaggregated serving is **rate matching**: determining the right GPU allocation between the prefill and decode stages to meet a specific Service Level Objective (SLO). If you miscalculate the GPU ratio between these stages, you face two "silent killers" of performance:

* **Over-provisioned prefill**: Prompt processing is fast, but requests bottleneck at the generation stage. This spikes *Inter-Token Latency (ITL)* and leaves expensive compute nodes idle.
* **Under-provisioned prefill**: Your decode GPUs sit starved for data. This drives up *Time-To-First-Token (TTFT)* and inflates your *Total Cost of Ownership (TCO)*.

Beyond rate matching, developers must also optimize model parallelism parameters (data, tensor, and expert parallelism) to maintain high ["Goodput"](https://arxiv.org/abs/2401.09670): the useful throughput the system delivers while staying within its latency targets, rather than time spent waiting or doing redundant work.

Exploring these configurations manually is technically challenging and time-consuming, and it often results in suboptimal resource utilization.

## Dynamic Traffic: The Move to SLO-Driven Scaling

Static configurations are brittle. In production, traffic is rarely uniform:

* **Volatile request volume**: Traditional Horizontal Pod Autoscalers (HPAs) react too slowly to the traffic jitter of LLM workloads.
* **Shifting sequence patterns**: If your workload shifts from short chat queries (low input sequence length, or ISL) to long-context document analysis (high ISL), a static disaggregated split instantly becomes suboptimal, leaving prefill GPUs overworked and decode GPUs idle.

NVIDIA Dynamo addresses these gaps through two integrated components: the **Dynamo Profiler** and the SLO-based **Dynamo Planner**.

---

### Let’s see it through an example application scenario

Consider a major airline’s mobile app that uses AI to offer personalized rerouting during flight delays. This use case is a stress test for inference: it is subject to massive, sudden bursts in traffic and highly variable request patterns, such as a mix of short status queries and long-context itinerary processing. To prevent latency spikes during these peaks, the underlying system requires the precise orchestration offered by a disaggregated architecture.

Using the [Qwen3-32B-FP8](https://huggingface.co/Qwen/Qwen3-32B-FP8) model, we can deploy an Airline Assistant with strict SLA targets: TTFT ≤ 500ms and ITL (Inter-Token Latency) ≤ 30ms.

During normal operations, passengers ask short queries like "What's my flight status?" But when a major weather system causes flight cancellations, passengers flood the app with complex rerouting requests: long-context queries (~4,000 tokens) requiring detailed itinerary responses (~500 tokens). This sudden surge of 200 concurrent users is exactly the kind of real-world spike that breaks static configurations.

To build a truly efficient disaggregated AI inference system, you need to transition from manual "guess-and-check" configurations to an automated, SLO-driven approach. The core of this automation lies in two distinct but deeply integrated components: the Dynamo Profiler and the Dynamo Planner.

The first step in building your system is determining the "Golden Ratio" of GPUs: how many should handle prefill versus decode, and what tensor parallelism (TP) levels each should use.

### The Architect: Dynamo Profiler

The Dynamo Profiler is your pre-deployment simulation engine. Instead of burning GPU hours testing every possible configuration, you define your requirements in a **DynamoGraphDeploymentRequest (DGDR)** manifest (see the sketch after this list). The profiler then executes an automated ["sweep"](https://github.com/ai-dynamo/dynamo/blob/main/docs/benchmarks/sla_driven_profiling.md#profiling-method) of the search space:

* **Parallelization mapping**: It tests different TP sizes for both the prefill and decode stages to find the lowest TTFT and ITL.
* **Hardware simulation**: Using the **AI Configurator (AIC)** mode, the profiler can simulate performance in just 20–30 seconds based on pre-measured performance data, allowing for rapid iteration before you ever touch a physical GPU.
* **Resulting recommendation**: The output is a highly tuned configuration that maximizes ["Goodput"](https://arxiv.org/abs/2401.09670): the maximum throughput achievable while staying strictly within your latency bounds.
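
To make the DGDR idea concrete, here is a minimal, hypothetical sketch of what such a request could look like. The API group/version, field names (`model`, `sla`, `gpuBudget`, `profiler.mode`), namespace, and file name are illustrative assumptions, not the exact CRD schema; the SLA-driven profiling guide linked above documents the real format.

```bash
# Hypothetical DGDR manifest; field names are illustrative assumptions, not the exact CRD schema.
cat <<'EOF' > airline-assistant-dgdr.yaml
apiVersion: nvidia.com/v1alpha1   # assumption: the real API group/version may differ
kind: DynamoGraphDeploymentRequest
metadata:
  name: airline-assistant
spec:
  model: Qwen/Qwen3-32B-FP8       # model used throughout this walkthrough
  sla:
    ttft: 500ms                   # Time-To-First-Token target
    itl: 30ms                     # Inter-Token Latency target
  gpuBudget: 8                    # assumption: total GPUs the profiler may plan across
  profiler:
    mode: aic                     # assumption: AI Configurator (simulation) mode
EOF

# Hand the request to the profiler by applying it to the cluster (namespace is illustrative).
kubectl apply -f airline-assistant-dgdr.yaml -n dynamo-system
```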

Ultimately, app developers and AI engineers spend less time testing different system setups and more time focusing on their airline passengers’ needs.

### The Pilot: Dynamo Planner

Once your system is deployed, static configurations can't handle the "jitter" of real-world traffic. This is where the Dynamo Planner takes over as a runtime orchestration engine.

Unlike a traditional load balancer, the Dynamo Planner is **LLM-aware**. It continuously monitors the live state of your cluster, specifically looking at:

* **KV cache load**: It monitors memory utilization in the decode pool.
* **Prefill queue depth**: It tracks how many prompts are waiting to be processed.

Using the performance bounds identified earlier by the profiler (i.e., TTFT ≤ 500ms and ITL ≤ 30ms), the Planner proactively scales the number of prefill and decode workers up or down. For example, if a *sudden burst of long-context itinerary queries* floods the system, the Planner detects the spike in the prefill queue and shifts available GPU resources to the prefill pool *before* your TTFT violates its SLO.
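
To watch the Planner do this on your own cluster, a couple of ordinary kubectl commands are enough. The namespace, label selector, and deployment name below are assumptions about how your Dynamo install is organized; adjust them to match your environment.

```bash
# Watch prefill and decode worker pods appear and disappear as the Planner scales them.
# The namespace and label selector are assumptions; match them to your deployment.
kubectl get pods -n dynamo-system -l 'nvidia.com/dynamo-component in (prefill,decode)' -w

# Follow the Planner's own logs to see its scaling decisions as they are made.
# The deployment name "dynamo-planner" is illustrative.
kubectl logs deploy/dynamo-planner -n dynamo-system -f
```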

## Seeing it in Action

In our airline scenario, the system starts with 1 prefill worker and 1 decode worker. At every 60-second adjustment interval, the Planner recomputes how many workers each phase needs to stay within the SLA. Under normal load, that calculation confirms the 1/1 split:

```bash
Prefill calculation: 138.55 (p_thpt) / 4838.61 (p_engine_cap) = 1 (num_p)
Decode calculation: 27.27 (d_thpt) / 19381.08 (d_engine_cap) = 1 (num_d)
```

When the passenger surge hits and traffic spikes to 200 concurrent passengers, the Planner recalculates:

```bash
Prefill calculation: 16177.75 (p_thpt) / 8578.39 (p_engine_cap) = 2 (num_p)
Decode calculation: 400.00 (d_thpt) / 3354.30 (d_engine_cap) = 1 (num_d)
Predicted number of engine replicas: prefill=2, decode=1
Updating prefill component VllmPrefillWorker to desired replica count 2
```
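
The arithmetic behind these log lines is effectively a ceiling division: the required throughput for each phase divided by the measured per-engine capacity, rounded up to a whole number of replicas. Here is a minimal sketch of that calculation using the surge numbers above; the `replicas` helper is our own illustration, not Dynamo code.

```bash
#!/usr/bin/env bash
# Rate-matching sketch: replicas = ceil(required throughput / per-engine capacity).
replicas() {
  local required=$1 capacity=$2
  # awk does the floating-point division and rounds up to a whole replica count
  awk -v r="$required" -v c="$capacity" \
    'BEGIN { q = r / c; printf "%d\n", (q == int(q)) ? q : int(q) + 1 }'
}

echo "prefill replicas: $(replicas 16177.75 8578.39)"   # 16177.75 / 8578.39 ≈ 1.89 -> 2
echo "decode replicas:  $(replicas 400.00 3354.30)"     # 400.00 / 3354.30 ≈ 0.12 -> 1
```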

[See the Dynamo SLA Planner](https://asciinema.org/a/67XW4yXJIBmIe7bv) in action as it scales the Airline Assistant during a traffic surge. The Planner automatically scales to 2 prefill workers while keeping 1 decode worker (the optimal configuration to handle the surge while maintaining the SLA targets). Within minutes, the new worker is online and passengers are getting their rerouting options without frustrating delays.

Now you can try this yourself: run the NVIDIA Dynamo Profiler to capture burst and request behavior, then use the SLO-based Planner to translate latency targets into placement and scaling decisions on your AKS cluster. Set it up in this order: profile under stress, define SLOs, and let the Planner orchestrate your disaggregated inference system so it handles sudden traffic spikes without latency spikes.

After deploying Dynamo by following [these instructions](https://aka.ms/aks-dynamo), get hands-on with the [Qwen3-32B-FP8](https://huggingface.co/Qwen/Qwen3-32B-FP8) model using the example in the [AKS Dynamo Part 2 sample](https://aka.ms/aks-dynamo-part-2).
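
Once the sample is running, you can approximate the surge from this scenario with plain curl against the OpenAI-compatible endpoint served by the Dynamo frontend. The service name, port, namespace, and served model name below are assumptions about your install; a faithful test would also use ~4,000-token prompts rather than the short one shown here.

```bash
# Rough load-generation sketch: 200 concurrent requests, mirroring the surge scenario.
# Service name, port, and namespace are assumptions; point them at your Dynamo frontend.
kubectl port-forward svc/dynamo-frontend 8000:8000 -n dynamo-system &
sleep 3   # give the port-forward a moment to establish

for i in $(seq 1 200); do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-32B-FP8",
          "messages": [{"role": "user", "content": "My flight was cancelled. Rebuild my itinerary with alternatives."}],
          "max_tokens": 500
        }' > /dev/null &
done
wait
```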

## Conclusion: Inference Without the Infrastructure Burden

The shift toward disaggregated serving is a necessity for the next generation of reasoning-heavy and long-context LLMs. However, as we have seen, the complexity of manually tuning these distributed systems on Kubernetes can quickly become a bottleneck for even the most experienced AI teams.

By utilizing the NVIDIA Dynamo Profiler, developers can move from educated guessing to data-driven certainty, modeling performance in seconds rather than days. When paired with the Dynamo Planner, this static optimization becomes a dynamic, SLO-aware reality on AKS, capable of weathering the unpredictable traffic spikes of production environments.

Ultimately, this suite transforms your inference stack from a series of fragile configurations into a resilient, self-optimizing engine. For the AI engineer, this means less time spent managing hardware limits and configuring system scalability, and more time spent delivering the high-quality, interactive experiences that your users (and your passengers) expect.

0 commit comments
