[Model] Qwen3TTS #1090

@ashok-arora

Description

Detailed description of the requested feature

Support for quantization and deployment of Qwen3-TTS-style models within the NVIDIA optimization stack, ideally including compatibility with TensorRT-LLM or a clearly defined alternative pipeline.

Specifically, the request is for:

- Ability to quantize non-Transformer / non-text-generation models (e.g., TTS pipelines) using a unified workflow similar to LLMs
- Support for multi-component models, including:
  - text encoder (Transformer-based)
  - acoustic model (autoregressive / diffusion / codec-based)
  - vocoder (CNN-based waveform generator)

End-to-end export pipeline:

- PyTorch → Quantization → ONNX → TensorRT engine(s)
- Guidance or tooling for:
  - handling models not implemented in Hugging Face Transformers
  - exporting models with custom forward passes or generation loops
- Optional: partial support for prefill/decode-style optimization where applicable (e.g., transformer submodules)

This would enable efficient deployment of modern TTS systems on NVIDIA GPUs with reduced latency and memory usage.

Describe alternatives you've considered

  1. torch AO library (`torch.ao.quantization`)
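For reference, a minimal sketch of this alternative: dynamic INT8 quantization of the Linear layers in a toy encoder stand-in. This path covers Linear-heavy submodules (e.g., the text encoder), but Conv-heavy vocoders would instead need static, calibration-based quantization, and neither produces a TensorRT engine directly:

```python
# Sketch of the torch.ao alternative: weights of nn.Linear become INT8,
# activations stay FP32 and are quantized on the fly (no calibration data).
# The Sequential below is a toy stand-in, not the real text encoder.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    encoder, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(2, 32)
with torch.no_grad():
    y = quantized(x)
print(y.shape)  # torch.Size([2, 32])
```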

Target hardware/use case

  1. NVIDIA GPUs (e.g., A5000)
