This project demonstrates running ONNX language models directly in the browser using WebGPU and a bundled Transformers IIFE. It's a lightweight, local demo that downloads models at runtime and runs inference inside a Web Worker.
ONNX (Open Neural Network Exchange) is a model format that enables interoperability between AI frameworks (PyTorch, TensorFlow, Caffe2) and across platforms. This demo uses the ONNX Runtime Web backend via the Transformers.js library.
## What's in this repo
- A small frontend (`index.html`, `styles.css`, `app.js`) that talks to a worker for inference.
- A blob-based inlined worker (created from `app.js`) and a standalone worker (`public/worker.js`), both of which use the bundled Transformers IIFE (`public/transformers_lib.js`). A sketch of the blob-worker pattern appears below.
- A centralized `MODEL_REGISTRY` at `public/models.js` that contains model ids, friendly names, default dtypes, and whether the model exposes internal "thoughts".
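For context, a blob-based inlined worker is built by wrapping worker source (kept as a string in the main script) in a `Blob` and spawning a `Worker` from an object URL, so no separate worker file needs to be fetched. This is an illustrative sketch of that standard pattern, not the exact code in `app.js`:

```js
// Illustrative blob-worker pattern (not the exact code in app.js):
// the worker source lives as a string inside the main script.
const workerSource = `
  self.onmessage = (e) => {
    // ... inference code using the bundled Transformers IIFE ...
    self.postMessage({ type: 'status', data: 'worker alive' });
  };
`;
const blobUrl = URL.createObjectURL(new Blob([workerSource], { type: 'text/javascript' }));
const worker = new Worker(blobUrl);
```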
## What changed recently
- Model registry centralized: `public/models.js` is the source of truth for available models and their metadata (`friendly`, `dtype`, `thinking`). The UI populates the model dropdown from this registry.
- Workers receive the registry: the blob worker gets it via `postMessage`, and the standalone worker can be configured to `importScripts('public/models.js')` or receive it the same way. Workers prefer the registry's dtype when loading a model.
- Special/control tokens (ASCII `<|...|>` and fullwidth variants like `＜｜...｜＞`) and explicit end-of-sentence tokens such as `<｜end▁of▁sentence｜>` are now logged to the console but stripped from the UI output. This keeps control tokens out of the chat while leaving them available for debugging via console logs and `token_debug` messages; a sketch of this filtering follows this list.
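A minimal sketch of that filtering, assuming a single regex that covers ASCII and fullwidth delimiters in any combination (the repo's actual patterns may differ):

```js
// Hypothetical control-token filter: matches <|...|> with ASCII or fullwidth
// delimiters in any mix (e.g. <|im_end|>, <｜end▁of▁sentence｜>).
const CONTROL_TOKEN = /[<＜][|｜][^|｜]*[|｜][>＞]/g;

function stripControlTokens(text) {
  return text.replace(CONTROL_TOKEN, (tok) => {
    console.log('Special token', tok); // keep visible for debugging
    return '';                         // but hide it from the chat UI
  });
}
```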
## Quick usage
- Open the project root and serve or open `index.html` in a modern Chromium-based browser with WebGPU support. For a quick local server you can run:

  ```bash
  # from the project root
  python3 -m http.server 8000
  # then open http://localhost:8000 in your browser
  ```

- The UI performs a WebGPU check; if WebGPU is available, the model loader proceeds (a minimal version of such a check is sketched after this list).
- Select a model from the `Model:` dropdown (populated from `public/models.js`). The UI shows loading progress and a friendly model name.
- When the model is ready you can send messages in the chat input. Responses stream incrementally; `<think>...</think>` segments (if produced) appear in the Thought Panel.
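The WebGPU check can be as simple as probing `navigator.gpu` for an adapter. This is a minimal sketch; the demo's actual check may differ:

```js
// Minimal WebGPU availability probe (sketch).
async function hasWebGPU() {
  if (!navigator.gpu) return false;        // API not exposed at all
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;                 // null when no suitable GPU is found
}
```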
## Developer notes
- Centralized model registry: edit `public/models.js` to add or remove models. Each entry should look like (where `dtype` is one of the listed values):

  ```js
  'owner/model-name-ONNX': { friendly: 'Friendly Name', dtype: 'q4f16'|'q4'|'fp32', thinking: false }
  ```
- Worker behavior:
  - The blob worker (created from `app.js`) receives the registry via `worker.postMessage({ type: 'model_registry', data: MODEL_REGISTRY })` on startup.
  - The standalone worker (`public/worker.js`) currently accepts the `model_registry` message as well. Optionally you can have the standalone worker call `importScripts('public/models.js')` to read the registry directly instead of receiving it by `postMessage`.
  - When the worker loads a model it prefers the registry-defined `dtype` for that model; fallbacks exist for models not present in the registry. A sketch of this handshake appears at the end of this section.
- Token handling:
  - The workers detect both ASCII special tokens (`<|...|>`) and fullwidth variants (`＜｜...｜＞`), as well as an explicit fullwidth end-of-sentence token pattern like `<｜end▁of▁sentence｜>`. These are logged to the console (and emitted as `token_debug` messages) but removed from the UI output so users don't see control tokens.
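Put together, the worker side of the registry handshake looks roughly like the sketch below. The `model_registry` message type and `registry_received` log come from this README; the `load_model` message type and the `'q4'` fallback dtype are assumptions for illustration:

```js
// Sketch of the worker-side handshake; message names partly assumed.
let MODEL_REGISTRY = {};

self.onmessage = async (e) => {
  const { type, data } = e.data;
  if (type === 'model_registry') {
    MODEL_REGISTRY = data;
    console.log('registry_received');    // visible in DevTools → Console
  } else if (type === 'load_model') {    // hypothetical message type
    const entry = MODEL_REGISTRY[data.modelId];
    const dtype = entry?.dtype ?? 'q4';  // prefer the registry's dtype
    // ... load the model via the bundled Transformers IIFE with
    // { device: 'webgpu', dtype } and post progress/status back ...
  }
};
```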
## Testing tips
- Open DevTools → Console to inspect worker logs. Look for:
  - `registry_received` (worker acknowledged the registry)
  - `Model ready` or `Model load failed: ...` messages
  - `Special token (...)` or `End-of-turn token (...)` logs when token-debugging
- If a model fails to load due to memory or WebGPU errors, the worker falls back to a safer configuration where possible (e.g., `wasm`/`fp32`) and reports status messages to the UI. A sketch of this fallback follows.
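A hedged sketch of that fallback strategy, assuming the bundled IIFE exposes a `pipeline` factory on a `transformers` global (the actual global name in `transformers_lib.js` may differ):

```js
// Load-with-fallback sketch; `transformers.pipeline` is an assumption
// about what the bundled IIFE exposes — adjust to the real global.
async function loadWithFallback(modelId, dtype) {
  try {
    return await transformers.pipeline('text-generation', modelId, {
      device: 'webgpu',
      dtype, // registry-preferred quantization, e.g. 'q4f16'
    });
  } catch (err) {
    console.warn('Model load failed on WebGPU, retrying with wasm/fp32:', err);
    return await transformers.pipeline('text-generation', modelId, {
      device: 'wasm',
      dtype: 'fp32',
    });
  }
}
```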
## Contributing
- To add a model, update `public/models.js` and include a `dtype` suitable for the model (for quantized models use `q4`/`q4f16`; for small FP models use `fp32`).
- If you prefer the standalone worker to read the registry directly, replace the `model_registry` message handler in `public/worker.js` with a call to `importScripts('public/models.js')` and remove the `postMessage` from `app.js` that sends the registry (see the sketch below).
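The standalone-worker change amounts to one line. Note that `importScripts` resolves paths relative to the worker script's own URL, so the exact path argument depends on how `public/worker.js` is served:

```js
// public/worker.js — read the registry directly instead of waiting for
// a postMessage. Path is relative to the worker's URL; adjust as needed.
importScripts('models.js');
// This assumes models.js declares MODEL_REGISTRY at top level, making it
// visible to this script once imported.
```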
## License / Disclaimer
This is an experimental demo. Models referenced in the registry are loaded at runtime from Hugging Face (or other remotes) and may have their own licenses and terms. Use it only with models you have permission to load.
Enjoy running ONNX models in the browser!