
Commit c141f03

Add LLMs on Kubernetes blog post
1 parent daba2c0 commit c141f03

File tree

4 files changed: +322 -1 lines changed


src/assets/blog/chip.jpg

406 KB
Lines changed: 308 additions & 0 deletions

---
title: Running LLMs on Kubernetes
author: Lena Fuhrimann
date: 2025-11-26
tags: ["cloud", "infrastructure", "storage", "scaling", "serverless"]
excerpt:
  "Learn two ways to run large language models on Kubernetes: Ollama for
  simple, low-traffic setups and the vLLM Production Stack for high-traffic
  production workloads."
image: ../../assets/blog/chip.jpg
---

Large language models (LLMs) power many modern apps. Chatbots, coding helpers,
and document tools all use them. The question isn't whether you need LLMs, but
how to run them well. Kubernetes helps you deploy and manage these heavy
workloads next to your other services.

Running LLMs on Kubernetes gives you a few benefits. You get a standard way to
deploy them. You can easily manage GPU resources. Furthermore, you can scale up
when demand grows. Most importantly, though, you can keep your data private by
hosting models yourself instead of calling external APIs.

Here, we'll look at two ways to deploy LLMs on Kubernetes. First, we'll cover
**Ollama** for simple setups. Then, we'll explore the **vLLM** Production Stack
for high-traffic scenarios.

## Ollama

[Ollama](https://ollama.ai/) is popular because it's easy to use. You download a
model, and it just works. The
[Ollama Helm Chart](https://github.com/otwld/ollama-helm) brings this same ease
to Kubernetes.

### What You Need

For CPU-only use, you need Kubernetes 1.16.0 or newer. For GPU support with
NVIDIA or AMD cards, you need Kubernetes 1.26.0 or newer.
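
If you're not sure which version your cluster runs, a quick check with plain
`kubectl` is enough:

```bash
# Prints the client version and the cluster's server version
kubectl version
```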

### How to Install

Add the Helm repository and install:

```bash
helm repo add otwld https://helm.otwld.com/
helm repo update
helm install ollama otwld/ollama --namespace ollama --create-namespace
```

This sets up Ollama with good defaults. The service runs on port `11434`. You
can use the normal Ollama tools to talk to it.

To test your deployment, forward the port to your local machine and run a model:

```bash
kubectl port-forward -n ollama svc/ollama 11434:11434
```

Then, in another terminal, you can interact with Ollama:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```
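
If you have the Ollama CLI installed locally, you can also point it at the
forwarded port; the client reads the `OLLAMA_HOST` environment variable:

```bash
# Talk to the in-cluster Ollama through the port-forward
export OLLAMA_HOST=http://localhost:11434
ollama run llama3.2:1b "Why is the sky blue?"
```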

### Adding GPU Support

To use a GPU, create a file called `values.yaml`:

```yaml
ollama:
  gpu:
    enabled: true
    type: "nvidia"
    number: 1
```

Then update the install:

```bash
helm upgrade ollama otwld/ollama --namespace ollama --values values.yaml
```

### Downloading Models Early

Ollama downloads models when you first ask for them. This can be slow. You can
tell it to download models when the pod starts:

```yaml
ollama:
  gpu:
    enabled: true
    type: "nvidia"
    number: 1
  models:
    pull:
      - llama3.2:1b
```

### Making Custom Models

You can also make custom versions of models with different settings:

```yaml
ollama:
  models:
    create:
      - name: llama3.2-1b-large-context
        template: |
          FROM llama3.2:1b
          PARAMETER num_ctx 32768
    run:
      - llama3.2-1b-large-context
```

This creates a version of Llama 3.2 that can handle longer text.
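
Once the pod has created it, you can call the custom model by name like any
other model, again through the port-forward from above:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2-1b-large-context",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```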

### Opening Access from Outside

To let people reach the API from outside the cluster, add an Ingress:

```yaml
ollama:
  models:
    pull:
      - llama3.2:1b
ingress:
  enabled: true
  hosts:
    - host: ollama.example.com
      paths:
        - path: /
          pathType: Prefix
```

Now you can reach the API at `ollama.example.com`.
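
Assuming DNS for `ollama.example.com` points at your ingress controller, a
request from outside the cluster looks just like the local one:

```bash
curl http://ollama.example.com/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```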

### When to Use Ollama

Ollama is great when you want things simple. It works well for getting started
fast, running different models without much setup, or when you don't need to
handle lots of traffic. If you've used Ollama on your laptop, using it on
Kubernetes will feel familiar.

## vLLM

[vLLM](https://github.com/vllm-project/vllm) is built for speed. It uses
performance optimizations like
[Paged Attention](https://huggingface.co/docs/text-generation-inference/en/conceptual/paged_attention),
[Continuous Batching](https://huggingface.co/docs/transformers/main/en/continuous_batching),
and
[Prefix Caching](https://bentoml.com/llm/inference-optimization/prefix-caching)
to handle many requests at once. The
[vLLM Production Stack](https://github.com/vllm-project/production-stack) wraps
vLLM in a Kubernetes-friendly package with routing, monitoring, and caching.

### How It Works

The stack has three main parts. First, serving engines run the LLMs. Second, a
router sends requests to the right place. Third, monitoring tools (Prometheus
and Grafana) show you what's happening.

This setup lets you grow from one instance to many without changing your app
code. The router uses an API that works like OpenAI's, so you can swap it in
easily.

![Production Stack Architecture](https://private-user-images.githubusercontent.com/25103655/406084851-8f05e7b9-0513-40a9-9ba9-2d3acca77c0c.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NjQxNTQ1NTgsIm5iZiI6MTc2NDE1NDI1OCwicGF0aCI6Ii8yNTEwMzY1NS80MDYwODQ4NTEtOGYwNWU3YjktMDUxMy00MGE5LTliYTktMmQzYWNjYTc3YzBjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTExMjYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUxMTI2VDEwNTA1OFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg4M2E2YWY4YjI1MDQzM2JlYTgxYzE1N2IwMjgwMGM0YjRlYmE2MzdmZjRhYzAyYzk5OWUzNjdhMjYxOTRlZWImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.AzVWvUrxG7wEn4QXjo2TkENHQFqgSqvCfHtH1P65eiY)

### How to Install

Add the Helm repository and install with a config file:

```bash
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f values.yaml
```

A simple `values.yaml` looks like this:

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.2-3B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
```

After it's ready, you'll see two pods. One is the router. One runs the model:

```
NAME                                          READY   STATUS    AGE
vllm-deployment-router-859d8fb668-2x2b7       1/1     Running   2m
vllm-llama3-deployment-vllm-84dfc9bd7-vb9bs   1/1     Running   2m
```

### Using the API

Forward the router to your machine:

```bash
kubectl port-forward svc/vllm-router-service 30080:80
```

Check which models are available:

```bash
curl http://localhost:30080/v1/models
```

Send a chat message:

```bash
curl -X POST http://localhost:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'
```

### Running More Than One Model

You can run different models at the same time. The router sends each request to
the right one:

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.2-3B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
    - name: "mistral"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "mistralai/Mistral-7B-Instruct-v0.3"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "24Gi"
      requestGPU: 1
```
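
The router picks a backend based on the `model` field of each request, so
reaching the second model only means changing that value (with the
port-forward from the previous section still running):

```bash
curl -X POST http://localhost:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'
```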

### Logging In to Hugging Face

Some models need a Hugging Face account. You can add your token like this:

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.2-3B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
      env:
        - name: HF_TOKEN
          value: "your-huggingface-token"
```

For real deployments, store the token in a Kubernetes Secret instead.
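
Here is a minimal sketch of that, assuming the chart passes `env` entries
through to the pod unchanged; the Secret and key names are just examples.
First create the Secret, then reference it instead of the inline value:

```bash
# Keep the token out of values.yaml
kubectl create secret generic hf-token-secret \
  --from-literal=token=your-huggingface-token
```

```yaml
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret # example Secret created above
              key: token
```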

### Watching How It Runs

The stack comes with a Grafana dashboard. It shows you how many instances are
healthy, how fast requests finish, how long users wait for the first response,
how many requests are running or waiting, and how much GPU memory the cache
uses. This helps you spot problems and plan for growth.

### When to Use vLLM

Use vLLM and its production stack when you need to handle lots of requests fast.
Its router is smart about reusing cached work, which saves time and money. The
OpenAI-style API makes it easy to plug into existing apps. The monitoring tools
help you run it well in production.
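
Because the router exposes an OpenAI-style API, tools built on the official
OpenAI SDKs can usually be pointed at it without code changes. The official
Python SDK, for example, reads these environment variables (the key is a
placeholder for clients that insist on one):

```bash
export OPENAI_BASE_URL=http://localhost:30080/v1
export OPENAI_API_KEY=placeholder
```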

## Wrapping Up

Ollama and vLLM serve different needs.

Ollama with its Helm chart gets you running fast with little setup. It's good
for development, lighter workloads, and teams that want things simple.

The vLLM Production Stack gives you the tools for heavy traffic. The router,
multi-model support, and monitoring make it fit for production where speed and
uptime matter.

Both use standard Kubernetes and Helm, so they'll feel familiar if you know
containers. Pick based on how much traffic you expect and how much complexity
you're willing to manage.

> “Ollama runs one, vLLM a batch — take your time, pick a match”
>
> Lena F.

src/layouts/ContentDetailPage.astro

Lines changed: 9 additions & 1 deletion

@@ -211,7 +211,10 @@ const {

   .content-body :global(blockquote) {
     border-left: 4px solid var(--color-primary);
-    padding-left: 1.5rem;
+    background: var(--color-blockquote-bg);
+    padding: 1.5rem;
+    padding-bottom: 0.5rem;
+    font-size: 1.25rem;
     margin: 1.5rem 0;
     color: var(--color-gray);
     font-style: italic;
@@ -250,6 +253,11 @@ const {
     background: #f9f9f9;
   }

+  :global(.content-body img) {
+    max-width: 100%;
+    margin: 2rem 0;
+  }
+
   @media (max-width: 48rem) {
     .content-detail {
       padding: var(--spacing-md) 0;

src/pages/[lang]/customers/[...slug].astro

Lines changed: 5 additions & 0 deletions

@@ -179,6 +179,11 @@ const canonicalUrl = new URL(Astro.url.pathname, Astro.site).toString();
     color: var(--color-gray);
   }

+  :global(.content-body img) {
+    max-width: 100%;
+    margin: 2rem 0;
+  }
+
   @media (max-width: 48rem) {
     .results-highlight {
       padding: 1.5rem;
