@AlexanderMakarov

Summary

This PR introduces streaming transcription functionality to VOXD, enabling real-time incremental typing as you speak. Additionally, it includes improvements to Python version handling in installation scripts (inspired by PR #15).

🎙️ Streaming Transcription Feature

Overview

VOXD now supports streaming transcription by default, which means text appears incrementally as you speak, not after recording stops. This provides a more natural and responsive voice-typing experience.

Key Features

  • Real-time typing: Text appears word-by-word or phrase-by-phrase as it's transcribed (typically every 2 seconds or 3 words)
  • Chunk-based processing: Audio is processed in overlapping chunks (default: 3 seconds, with a 0.5-second overlap) for continuous transcription
  • Incremental updates: Text is typed incrementally during recording, making it feel like natural voice-typing
  • Seamless experience: You see your words appear in real-time, providing immediate feedback
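
To make the chunk-based processing concrete, here is a minimal sketch of splitting a sample buffer into overlapping windows. The function name `iter_overlapping_chunks` and the 16 kHz sample rate are assumptions made for illustration; this is not the code in `streaming_transcriber.py`.

```python
from typing import Iterator, Sequence


def iter_overlapping_chunks(
    samples: Sequence[float],
    sample_rate: int = 16000,      # assumed rate, not taken from the PR
    chunk_seconds: float = 3.0,    # mirrors streaming_chunk_seconds
    overlap_seconds: float = 0.5,  # mirrors streaming_overlap_seconds
) -> Iterator[Sequence[float]]:
    """Yield windows of chunk_seconds; each starts (chunk - overlap) after the previous one."""
    chunk_len = int(chunk_seconds * sample_rate)
    step = chunk_len - int(overlap_seconds * sample_rate)
    if chunk_len <= 0 or step <= 0:
        raise ValueError("chunk must be positive and longer than the overlap")
    start = 0
    while start < len(samples):
        yield samples[start:start + chunk_len]
        if start + chunk_len >= len(samples):
            break  # the last window already reached the end of the buffer
        start += step


# 7 seconds of silence at 16 kHz gives two full windows and one shorter tail.
lengths = [len(c) for c in iter_overlapping_chunks([0.0] * (7 * 16000))]
```

The overlap means a word cut off at the end of one window is heard in full at the start of the next, at the cost of transcribing the shared half second twice.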

How It Works

  1. Press hotkey to start → VOXD begins recording and transcribing
  2. As you speak → Text appears incrementally in your focused application
  3. Press hotkey again → Finalizes any remaining transcription and copies to clipboard
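
Step 2 implies that only the newly transcribed part is typed each time. A minimal sketch of that idea follows, assuming the transcriber hands back a growing cumulative transcript; the helper `new_text_to_type` is hypothetical, and treating `streaming_min_chars_to_type` as a lower bound on the typed delta is an assumption about how that setting is used.

```python
def new_text_to_type(already_typed: str, transcript: str, min_chars: int = 3) -> str:
    """Return the suffix of transcript that has not been typed yet.

    Returns "" when the transcript no longer starts with what was typed
    (the model revised earlier words) or when the new part is shorter
    than min_chars, so very small fragments wait for the next update.
    """
    if not transcript.startswith(already_typed):
        return ""
    delta = transcript[len(already_typed):]
    return delta if len(delta) >= min_chars else ""


# Simulate three successive partial transcripts arriving during recording.
typed = ""
for partial in ["hello", "hello world", "hello world again"]:
    typed += new_text_to_type(typed, partial)
```

A real implementation also has to decide what to do when the model revises earlier words; this sketch simply types nothing in that case.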

Implementation Details

New Components:

  • StreamingWhisperTranscriber (src/voxd/core/streaming_transcriber.py): Processes audio in chunks and emits incremental text updates
  • StreamingCoreProcessThread (src/voxd/core/streaming_core.py): Orchestrates streaming recording, transcription, and typing for GUI/tray modes

Configuration Options:
streaming_enabled: true # Enable/disable streaming mode
streaming_chunk_seconds: 3.0 # Audio chunk size in seconds
streaming_overlap_seconds: 0.5 # Overlap between chunks
streaming_emit_interval_seconds: 2.0 # Minimum time between text updates
streaming_emit_word_count: 3 # Minimum words before emitting text
streaming_typing_delay: 0.01 # Delay between typed characters
streaming_min_chars_to_type: 3 # Minimum characters before typing
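
As a rough illustration of how the two emit settings could interact, the sketch below emits pending text once either threshold is reached. The `EmitPolicy` class is hypothetical, and combining the thresholds with "or" is an assumption based on the "every 2 seconds or 3 words" wording above, not a reading of the actual code.

```python
from dataclasses import dataclass


@dataclass
class EmitPolicy:
    interval_seconds: float = 2.0  # mirrors streaming_emit_interval_seconds
    word_count: int = 3            # mirrors streaming_emit_word_count

    def should_emit(self, pending_text: str, seconds_since_last_emit: float) -> bool:
        """Emit when enough words are pending or enough time has passed."""
        words = len(pending_text.split())
        if words == 0:
            return False  # nothing to type, regardless of elapsed time
        return (
            words >= self.word_count
            or seconds_since_last_emit >= self.interval_seconds
        )


policy = EmitPolicy()
policy.should_emit("one two three", 0.1)  # enough words pending
policy.should_emit("one", 2.5)            # enough time elapsed
```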

Modes Supported:

  • ✅ CLI mode (voxd --rh)
  • ✅ GUI mode (voxd --gui)
  • ✅ Tray mode (voxd --tray)

Backward Compatibility:
Streaming is enabled by default but can be disabled via config to use the traditional "record-then-transcribe" behavior.

🐍 Python Version Improvements

This PR also includes improvements from PR #15 that remove hard-coded Python version checks:

  • Before: Only supported specific versions (3.9, 3.10, 3.11, 3.12, 3.13)
  • After: Uses >= 3.9 check, making it compatible with future Python versions automatically

Changes:

  • Updated packaging/voxd.wrapper to use version comparison (>= 3.9) instead of hard-coded version lists
  • Improved Python version detection logic to be more flexible and future-proof
  • Updated venv creation to use the latest available Python version
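
The difference between the two approaches can be shown in a few lines. This is only a Python illustration of the comparison logic; the wrapper script itself may be written differently, and `is_supported` is a hypothetical helper.

```python
import sys
from typing import Tuple

MIN_PYTHON: Tuple[int, int] = (3, 9)

# Old style: an explicit allow-list that silently rejects newer releases.
HARD_CODED = {(3, 9), (3, 10), (3, 11), (3, 12), (3, 13)}


def is_supported(version: Tuple[int, ...], minimum: Tuple[int, int] = MIN_PYTHON) -> bool:
    """Compare (major, minor) against a lower bound instead of a fixed list."""
    return tuple(version[:2]) >= minimum


# A future 3.14 passes the lower-bound check but is missing from the allow-list.
future = (3, 14)
accepted_by_bound = is_supported(future)
accepted_by_list = future in HARD_CODED

# The running interpreter can be checked the same way.
current_ok = is_supported(sys.version_info)
```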

Testing

Tested on:

  • ✅ CLI mode with hotkey-controlled recording
  • ✅ GUI mode with button-triggered recording
  • ✅ Tray mode with hotkey-triggered recording
  • ✅ Python version detection with various Python versions

Streaming transcription works as expected, providing real-time feedback during dictation. The Python version improvements ensure compatibility with future Python releases.

Benefits

  1. Better UX: Users see their words appear in real-time, making voice-typing feel more natural
  2. Immediate feedback: No need to wait until recording stops to see transcribed text
  3. Future-proof: Python version handling supports upcoming Python versions automatically
  4. Backward compatible: Can be disabled if users prefer the old behavior

Related

@mattsn0w

Hello @AlexanderMakarov,
This looks like a promising change. I would very much like streaming to be included and to be the default behavior.

I tested your PR on my Omarchy 3.2.x thinkpad (T490) and have some feedback.
I will point out that some of the issues I encountered are likely because my initial install of voxd was done using the release package voxd-1.7.0-1-x86_64.pkg.tar.zst for Arch Linux, whereas I tested your patch using setup.sh.

  1. v1.7.0 installs packaging/voxd.wrapper to /usr/bin/voxd.
  2. The systemd service unit file packaging/voxd-tray.service has ExecStart= set to the voxd.wrapper script at /usr/bin/voxd. This should be updated to point to the user-relative path.

If you do an Actions build of this in your fork, I can test it.

@AlexanderMakarov
Author

Hi @mattsn0w,
Thank you for the feedback and for testing it!

I haven't tried voxd.wrapper and have worked only with setup.sh; I'm on Linux Mint 21.3. I will try to fix the issues you mentioned anyway.

Meanwhile, the idea of adding streaming to voxd led me to the need to speed up whisper.cpp, and I am now migrating to https://github.com/SYSTRAN/faster-whisper, which promises 4x speed for the same Whisper models. Streaming requires at least 2x transcription speed, and I don't have a (proper) GPU on my laptop. faster-whisper is a different beast, but it is tuned for real-time transcription, provides an embedded Python API, and offers word-level timestamps, which are very handy. So I will first try to implement this migration in my fork, https://github.com/AlexanderMakarov/voxd, because I don't have proper speech-to-text on my laptop yet.

@AlexanderMakarov
Author

@mattsn0w I've implemented the fix. BTW, this is not something introduced by my changes but the general behavior of the repo: installation from the package uses different paths than setup.sh.

As for my idea of switching to faster-whisper: I found that updating the VOXD repo with it is not the best way, so I switched to the simpler "Soupawhisper" repo (no UI, only notifications). I implemented streaming in my fork of it: https://github.com/AlexanderMakarov/soupawhisper

Note that with streaming, transcription quality drops significantly (with Whisper models).
