This project is a Topo template and follows the Topo Template Format Specification.
Complete LLM chat application optimized for Arm CPU inference.
Features: SVE, NEON
This project demonstrates running large language models on CPU using llama.cpp compiled with Arm baseline optimizations and accelerated using NEON SIMD and SVE (when supported and enabled).
The stack includes:
- llama.cpp server with Arm NEON optimizations (SVE optional)
- Quantized Qwen3.5-0.8B model bundled in the image
- Simple web-based chat interface
- No GPU required - pure CPU inference
- Arm Hardware: An Arm system (physical or virtual). Note that SVE support in llama.cpp requires an Armv8.2-A (or newer) CPU with the SVE extension.
- Docker: For container orchestration with Topo
- LLM Model: A GGUF format model (e.g., Llama 3.1, Mistral, etc.)
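Before enabling SVE, it is worth confirming that the target CPU actually advertises it. A minimal sketch: on Arm Linux the kernel lists CPU features (including `sve`) on the `Features` line of `/proc/cpuinfo`; on hosts without SVE the flag is simply absent.

```shell
# On Arm Linux, /proc/cpuinfo's Features line includes "sve" when the CPU
# supports it; elsewhere the flag (or the file) is simply missing.
if grep -qw sve /proc/cpuinfo 2>/dev/null; then
  sve_state=yes
else
  sve_state=no
fi
echo "SVE supported: $sve_state"
```

If this reports `no`, leave `ENABLE_SVE` at its default of `OFF`; the NEON path is used regardless.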
Note:
`HF_MODEL` must point to a Hugging Face repo that contains at least one supported `.gguf` file. If the repo contains multiple `.gguf` files and `HF_MODEL_FILE` is unset, the build auto-selects a CPU-friendly quantization (preferring Q4_K_M). Sharded GGUFs and multimodal projector files (mmproj) are rejected with a clear error because this template only supports single-file text-model GGUFs today. Not all model repos include GGUF quantizations; look for repos with `-GGUF` in the name. The selected model is baked into the image at `/models/model.gguf`.
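The auto-selection rules above can be illustrated with a small sketch. The filenames here are hypothetical, and the real build script may differ; this only shows the filtering order: drop mmproj and sharded files, then prefer Q4_K_M, falling back to the first remaining candidate.

```shell
# Hypothetical repo listing; the real build inspects the Hugging Face repo.
files='model-Q8_0.gguf
model-Q4_K_M.gguf
mmproj-model-f16.gguf'

# Reject multimodal projectors (mmproj) and sharded files, then prefer Q4_K_M.
candidates=$(printf '%s\n' "$files" | grep -v '^mmproj' | grep -vE -- '-of-[0-9]+\.gguf$')
selected=$(printf '%s\n' "$candidates" | grep 'Q4_K_M' | head -n1)
[ -n "$selected" ] || selected=$(printf '%s\n' "$candidates" | head -n1)
echo "Selected: $selected"
```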
| Parameter | Description | Default |
|---|---|---|
| `HF_MODEL` | Hugging Face model repo ID containing `.gguf` files | `bartowski/Qwen_Qwen3.5-0.8B-GGUF` |
| `HF_MODEL_FILE` | Optional explicit GGUF filename | `""` |
| `ENABLE_SVE` | Enable SVE optimizations | `OFF` |
The easiest way to deploy is with `topo`. Download and install `topo` first, then clone this template:
    topo clone git@github.com:Arm-Examples/topo-v9-cpu-chat.git
    cd topo-v9-cpu-chat
    topo deploy --target <ip-address-of-target>

Use a different model:
    topo deploy --target <ip-address-of-target> \
      --arg HF_MODEL=unsloth/SmolLM2-135M-Instruct-GGUF

Force an exact GGUF file:
    topo deploy --target <ip-address-of-target> \
      --arg HF_MODEL=bartowski/Qwen_Qwen3.5-0.8B-GGUF \
      --arg HF_MODEL_FILE=Qwen_Qwen3.5-0.8B-Q5_K_M.gguf

Open your browser to `http://<ip-address-of-target>:3000` to start chatting!