Yuniko.Software.Qwen3Tokenizer

Native .NET tokenizer implementation for Qwen3 models. Lightweight byte-pair encoding with HuggingFace integration.

Features

Native .NET Implementation: Qwen3 tokenizer built specifically for .NET applications
Identical to HuggingFace: Produces the same token IDs and outputs as the official HuggingFace tokenizer
Qwen Model Family Support: Compatible with all Qwen3 variants (LLM, Embedding, Reranker, and Vision-Language models)
HuggingFace Integration: Load tokenizer files directly from HuggingFace model repositories
Configurable: Customize tokenization behavior through options and custom file providers
No External Dependencies: Does not require Python or other runtime dependencies

Installation

dotnet add package Yuniko.Software.Qwen3Tokenizer

Or via Package Manager:

Install-Package Yuniko.Software.Qwen3Tokenizer

Quick Start

using Yuniko.Software.Qwen3Tokenizer;

// Load from HuggingFace model (specify if it's for an embedding model)
var tokenizer = await Qwen3Tokenizer.FromHuggingFaceAsync(
    "Qwen/Qwen3-0.6B", 
    isForEmbeddingModel: false
);

// Encode text
int[] tokenIds = tokenizer.Encode("Hello, world!");
Console.WriteLine($"Token IDs: [{string.Join(", ", tokenIds)}]");
Console.WriteLine($"Token count: {tokenIds.Length}");

// Decode tokens
string decodedText = tokenizer.Decode(tokenIds);
Console.WriteLine($"Decoded: {decodedText}");

Usage Examples

Basic Tokenization

// For regular LLM models
var tokenizer = Qwen3Tokenizer.FromHuggingFace("Qwen/Qwen3-0.6B", isForEmbeddingModel: false);

// Encode text into token IDs
int[] ids = tokenizer.Encode("The quick brown fox jumps over the lazy dog");

// Decode token IDs back to text
string text = tokenizer.Decode(ids);

// Count tokens without full encoding
int tokenCount = tokenizer.CountTokens("Some text to count");

Working with Embedding Models

// For embedding models - adds pad token at the end when addSpecialTokens=true
var embeddingTokenizer = Qwen3Tokenizer.FromHuggingFace(
    "Qwen/Qwen3-Embedding-0.6B", 
    isForEmbeddingModel: true
);

// With special tokens (includes pad token at the end)
int[] withSpecial = embeddingTokenizer.Encode("Your text here", addSpecialTokens: true);

// Without special tokens
int[] withoutSpecial = embeddingTokenizer.Encode("Your text here", addSpecialTokens: false);

Console.WriteLine($"With special tokens: {withSpecial.Length} tokens");
Console.WriteLine($"Without special tokens: {withoutSpecial.Length} tokens");

Decoding with Special Tokens

// Encode text with special tokens
string chatMessage = "<|im_start|>user\nHello!<|im_end|>";
int[] tokens = tokenizer.Encode(chatMessage);

// Decode with special tokens preserved
string withSpecial = tokenizer.Decode(tokens, skipSpecialTokens: false);

// Decode with special tokens removed (default behavior)
string withoutSpecial = tokenizer.Decode(tokens, skipSpecialTokens: true);

Console.WriteLine($"With special tokens: {withSpecial}");
Console.WriteLine($"Without special tokens: {withoutSpecial}");

Detailed Encoding Information

// Get detailed information about tokens
var result = tokenizer.EncodeDetailed("Hello, world!");

for (int i = 0; i < result.Ids.Length; i++)
{
    Console.WriteLine($"Token: '{result.Tokens[i]}' | ID: {result.Ids[i]} | " +
                     $"Offset: {result.Offsets[i].Index}, Length: {result.Offsets[i].Length}");
}

ONNX Runtime Integration

// Prepare inputs for ONNX Runtime inference (dynamic length, no padding)
var inputs = tokenizer.PrepareForOnnx("Your input text here");

// Use with ONNX Runtime
// Note: Some models (e.g., embedding models) may not require position_ids
long[] inputIds = inputs.InputIds;
long[] attentionMask = inputs.AttentionMask;
long[] positionIds = inputs.PositionIds;

Loading from Local Files

// Load tokenizer from local vocabulary and merges files
var tokenizer = Qwen3Tokenizer.FromFiles(
    vocabPath: "/path/to/vocab.json",
    mergesPath: "/path/to/merges.txt",
    isForEmbeddingModel: false
);

Custom File Provider

// Implement custom file provider for advanced scenarios
public class CustomFileProvider : ITokenizerFileProvider
{
    public (string VocabPath, string MergesPath) GetFiles()
    {
        // Custom logic to provide tokenizer files
        return ("/path/to/vocab.json", "/path/to/merges.txt");
    }

    public Task<(string VocabPath, string MergesPath)> GetFilesAsync(
        CancellationToken cancellationToken = default)
    {
        return Task.FromResult(GetFiles());
    }
}

// Use custom provider
var tokenizer = Qwen3Tokenizer.FromProvider(new CustomFileProvider(), isForEmbeddingModel: false);

Accessing Vocabulary and Token Information

// Get vocabulary size
int vocabSize = tokenizer.VocabularySize;

// Access full vocabulary
IReadOnlyDictionary<string, int> vocab = tokenizer.Vocabulary;

// Access added tokens (special tokens and others)
IReadOnlyDictionary<string, int> addedTokens = tokenizer.AddedTokens;

// Access special token IDs
IReadOnlySet<int> specialTokenIds = tokenizer.SpecialTokenIds;

// Use predefined token constants
Console.WriteLine($"IM_END token: {Qwen3Tokens.ImEnd} (ID: {Qwen3Tokens.ImEndTokenId})");
Console.WriteLine($"PAD token: {Qwen3Tokens.EndOfText} (ID: {Qwen3Tokens.EndOfTextTokenId})");
Console.WriteLine($"IM_START token: {Qwen3Tokens.ImStart} (ID: {Qwen3Tokens.ImStartTokenId})");

For more examples, see the sample project.

Supported Models

Works with all Qwen3 model variants:

Qwen3 LLM models (text generation)
Qwen3-Embedding models (text embeddings)
Qwen3-Reranker models (document reranking)
Qwen3-VL models (vision-language)

Requirements

.NET 10.0 or later

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please visit the GitHub repository for more information.

Support

For issues, questions, or suggestions, please open an issue on GitHub.

⭐ If you find this project useful, please consider giving it a star on GitHub! ⭐

Your support helps make this project more visible to other developers who might benefit from a native .NET implementation of the Qwen3 tokenizer.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.github		.github
python		python
samples/Yuniko.Software.Qwen3Tokenizer.Sample		samples/Yuniko.Software.Qwen3Tokenizer.Sample
src/Yuniko.Software.Qwen3Tokenizer		src/Yuniko.Software.Qwen3Tokenizer
tests/Yuniko.Software.Qwen3Tokenizer.Tests		tests/Yuniko.Software.Qwen3Tokenizer.Tests
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
Directory.Packages.props		Directory.Packages.props
LICENSE		LICENSE
README.md		README.md
Yuniko.Software.Qwen3Tokenizer.slnx		Yuniko.Software.Qwen3Tokenizer.slnx
run_tests.ps1		run_tests.ps1
run_tests.sh		run_tests.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Yuniko.Software.Qwen3Tokenizer

Features

Installation

Quick Start

Usage Examples

Basic Tokenization

Working with Embedding Models

Decoding with Special Tokens

Detailed Encoding Information

ONNX Runtime Integration

Loading from Local Files

Custom File Provider

Accessing Vocabulary and Token Information

Supported Models

Requirements

License

Contributing

Support

About

Uh oh!

Releases 7

Packages

Contributors 3

Uh oh!

Languages

License

yuniko-software/qwen3-tokenizer-dotnet

Folders and files

Latest commit

History

Repository files navigation

Yuniko.Software.Qwen3Tokenizer

Features

Installation

Quick Start

Usage Examples

Basic Tokenization

Working with Embedding Models

Decoding with Special Tokens

Detailed Encoding Information

ONNX Runtime Integration

Loading from Local Files

Custom File Provider

Accessing Vocabulary and Token Information

Supported Models

Requirements

License

Contributing

Support

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 7

Packages 0

Contributors 3

Uh oh!

Languages

Packages