This project demonstrates running ONNX language models directly in the browser using WebGPU and a bundled Transformers IIFE. It's a lightweight, local demo that downloads models at runtime and runs inference inside a Web Worker.
ONNX (Open Neural Network Exchange) is a model format that enables interoperability between AI frameworks (PyTorch, TensorFlow, Caffe2) and across platforms. This demo uses the ONNX Runtime Web backend via the Transformers.js library.
## What's in this repo
- A small frontend (`index.html`, `styles.css`, `app.js`) that talks to a worker for inference.
- A blob-based inlined worker (created from `app.js`) and a standalone worker (`public/worker.js`), both of which use the bundled Transformers IIFE (`public/transformers_lib.js`). A sketch of the blob-worker pattern appears below.
- A centralized `MODEL_REGISTRY` at `public/models.js` that contains model ids, friendly names, default dtypes, and whether the model exposes internal "thoughts".
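For context, a blob-based inlined worker is built by wrapping worker source (kept as a string in the main script) in a `Blob` and spawning a `Worker` from an object URL, so no separate worker file needs to be fetched. This is an illustrative sketch of that standard pattern, not the exact code in `app.js`:

```js
// Illustrative blob-worker pattern (not the exact code in app.js):
// the worker source lives as a string inside the main script.
const workerSource = `
  self.onmessage = (e) => {
    // ... inference code using the bundled Transformers IIFE ...
    self.postMessage({ type: 'status', data: 'worker alive' });
  };
`;
const blobUrl = URL.createObjectURL(new Blob([workerSource], { type: 'text/javascript' }));
const worker = new Worker(blobUrl);
```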
## What changed recently
- Model registry centralized: `public/models.js` is the source of truth for available models and their metadata (`friendly`, `dtype`, `thinking`). The UI populates the model dropdown from this registry.
- Workers receive the registry: the blob worker gets it via `postMessage`, and the standalone worker can be configured to `importScripts('public/models.js')` or receive it the same way. Workers prefer the registry's dtype when loading a model.
- Special/control tokens (ASCII `<|...|>` and fullwidth variants like `＜｜...｜＞`) and explicit end-of-sentence tokens such as `<｜end▁of▁sentence｜>` are now logged to the console but stripped from the UI output. This keeps control tokens out of the chat while leaving them available for debugging via console logs and `token_debug` messages; a sketch of this filtering follows this list.
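A minimal sketch of that filtering, assuming a single regex that covers ASCII and fullwidth delimiters in any combination (the repo's actual patterns may differ):

```js
// Hypothetical control-token filter: matches <|...|> with ASCII or fullwidth
// delimiters in any mix (e.g. <|im_end|>, <｜end▁of▁sentence｜>).
const CONTROL_TOKEN = /[<＜][|｜][^|｜]*[|｜][>＞]/g;

function stripControlTokens(text) {
  return text.replace(CONTROL_TOKEN, (tok) => {
    console.log('Special token', tok); // keep visible for debugging
    return '';                         // but hide it from the chat UI
  });
}
```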
## Quick usage
- Open the project root and serve or open `index.html` in a modern Chromium-based browser with WebGPU support. For a quick local server you can run:

  ```bash
  # from the project root
  python3 -m http.server 8000
  # then open http://localhost:8000 in your browser
  ```

- The UI performs a WebGPU check; if WebGPU is available, the model loader proceeds (a minimal version of such a check is sketched after this list).
- Select a model from the `Model:` dropdown (populated from `public/models.js`). The UI shows loading progress and a friendly model name.
- When the model is ready you can send messages in the chat input. Responses stream incrementally; `<think>...</think>` segments (if produced) appear in the Thought Panel.
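The WebGPU check can be as simple as probing `navigator.gpu` for an adapter. This is a minimal sketch; the demo's actual check may differ:

```js
// Minimal WebGPU availability probe (sketch).
async function hasWebGPU() {
  if (!navigator.gpu) return false;        // API not exposed at all
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;                 // null when no suitable GPU is found
}
```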
## Developer notes
- Centralized model registry: edit `public/models.js` to add or remove models. Each entry should look like (where `dtype` is one of the listed values):

  ```js
  'owner/model-name-ONNX': { friendly: 'Friendly Name', dtype: 'q4f16'|'q4'|'fp32', thinking: false }
  ```
- Worker behavior:
  - The blob worker (created from `app.js`) receives the registry via `worker.postMessage({ type: 'model_registry', data: MODEL_REGISTRY })` on startup.
  - The standalone worker (`public/worker.js`) currently accepts the `model_registry` message as well. Optionally you can have the standalone worker call `importScripts('public/models.js')` to read the registry directly instead of receiving it by `postMessage`.
  - When the worker loads a model it prefers the registry-defined `dtype` for that model; fallbacks exist for models not present in the registry. A sketch of this handshake appears at the end of this section.
- Token handling:
  - The workers detect both ASCII special tokens (`<|...|>`) and fullwidth variants (`＜｜...｜＞`), as well as an explicit fullwidth end-of-sentence token pattern like `<｜end▁of▁sentence｜>`. These are logged to the console (and emitted as `token_debug` messages) but removed from the UI output so users don't see control tokens.
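Put together, the worker side of the registry handshake looks roughly like the sketch below. The `model_registry` message type and `registry_received` log come from this README; the `load_model` message type and the `'q4'` fallback dtype are assumptions for illustration:

```js
// Sketch of the worker-side handshake; message names partly assumed.
let MODEL_REGISTRY = {};

self.onmessage = async (e) => {
  const { type, data } = e.data;
  if (type === 'model_registry') {
    MODEL_REGISTRY = data;
    console.log('registry_received');    // visible in DevTools → Console
  } else if (type === 'load_model') {    // hypothetical message type
    const entry = MODEL_REGISTRY[data.modelId];
    const dtype = entry?.dtype ?? 'q4';  // prefer the registry's dtype
    // ... load the model via the bundled Transformers IIFE with
    // { device: 'webgpu', dtype } and post progress/status back ...
  }
};
```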
## Testing tips
- Open DevTools → Console to inspect worker logs. Look for:
  - `registry_received` (worker acknowledged the registry)
  - `Model ready` or `Model load failed: ...` messages
  - `Special token (...)` or `End-of-turn token (...)` logs when token-debugging
- If a model fails to load due to memory or WebGPU errors, the worker falls back to a safer configuration where possible (e.g., `wasm`/`fp32`) and reports status messages to the UI. A sketch of this fallback follows.
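A hedged sketch of that fallback strategy, assuming the bundled IIFE exposes a `pipeline` factory on a `transformers` global (the actual global name in `transformers_lib.js` may differ):

```js
// Load-with-fallback sketch; `transformers.pipeline` is an assumption
// about what the bundled IIFE exposes — adjust to the real global.
async function loadWithFallback(modelId, dtype) {
  try {
    return await transformers.pipeline('text-generation', modelId, {
      device: 'webgpu',
      dtype, // registry-preferred quantization, e.g. 'q4f16'
    });
  } catch (err) {
    console.warn('Model load failed on WebGPU, retrying with wasm/fp32:', err);
    return await transformers.pipeline('text-generation', modelId, {
      device: 'wasm',
      dtype: 'fp32',
    });
  }
}
```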
## Contributing
- To add a model, update `public/models.js` and include a `dtype` suitable for the model (for quantized models use `q4`/`q4f16`; for small FP models use `fp32`).
- If you prefer the standalone worker to read the registry directly, replace the `model_registry` message handler in `public/worker.js` with a call to `importScripts('public/models.js')` and remove the `postMessage` from `app.js` that sends the registry (see the sketch below).
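The standalone-worker change amounts to one line. Note that `importScripts` resolves paths relative to the worker script's own URL, so the exact path argument depends on how `public/worker.js` is served:

```js
// public/worker.js — read the registry directly instead of waiting for
// a postMessage. Path is relative to the worker's URL; adjust as needed.
importScripts('models.js');
// This assumes models.js declares MODEL_REGISTRY at top level, making it
// visible to this script once imported.
```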
## License / Disclaimer
This is an experimental demo. Models referenced in the registry are loaded at runtime from Hugging Face (or other remotes) and may have their own licenses and terms. Use it only with models you have permission to load.
Enjoy running ONNX models in the browser!