A system-level speech-to-text application for Linux with OS-level integration, designed for Wayland environments.
This repository contains development notes and specifications for a voice typing application that provides OS-level text entry on Linux. The goal is a unified, system-wide solution that works consistently across all applications: browsers, IDEs, email clients, and more.
- Cloud-first approach using Whisper API (OpenAI)
- Optional local Whisper support (Base/Large models)
- Future support for custom fine-tuned Whisper models
- Alternative: Gemini multimodal audio for single-stage processing
- Wayland support via ydotool virtual keyboard
- Works across all applications (browsers, IDEs, terminals, etc.)
- No need for application-specific extensions or plugins
- System tray integration with persistent background operation
- F13 key toggle (or custom hotkey) for recording control
- Tap to start, tap to stop recording
- Audio feedback with distinct beep tones (start/stop)
- Audio is processed in chunks of roughly 15 seconds, so text appears after a short lag rather than in real time
- Custom word/phrase replacement for technical terms
- Handles proper nouns, company names, domain-specific vocabulary
- Persistent storage with export capability
- Version control support for cross-system sync
- Two-stage approach: Whisper transcription → LLM formatting
- Single-stage approach: Gemini multimodal with integrated cleanup
- Automatic punctuation and paragraph spacing
- Removal of filler words (ums, ahs, pause words)
- Remember last microphone selection
- Auto-default to previously used device
- Built-in level testing tool (View → Check Level)
- Smart feedback on audio levels (clipping detection, volume suggestions)
- Decibel meter for quick diagnostics
- Secure API key storage (OpenAI or other providers)
- Microphone selection memory
- Personal dictionary persistence
- User preferences and settings
- System tray operation by default
- Minimal, non-intrusive window behavior
- No unnecessary configuration dialogs
- Smart defaults, with sensible decisions already made for the user
- Voice typing with OS-level keyboard entry via ydotool
- Cloud-based Whisper transcription
- Basic text cleanup and formatting
- F13 hotkey control with audio feedback
- System tray integration
- Personal dictionary with basic replacements
- Advanced microphone management and level testing
- LLM-based post-processing integration
- Personal dictionary export/import
- Enhanced UI feedback and status indicators
- Local Whisper model support
- Custom fine-tuned model integration
- Multi-language support (Hebrew, others)
- Advanced pre-processing options
- Clear speech - User articulation and speaking manner
- Microphone selection - Context-appropriate device (studio vs. mobile)
- Pre-processing - Optional noise removal and signal optimization
- STT Model - Whisper (cloud or local)
- Post-processing - Dictionary replacement and text cleanup
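
The post-processing step listed above (dictionary replacement and text cleanup) could look roughly like the following. This is a minimal sketch, not the project's implementation: the dictionary entries, the filler-word list, and the function names `apply_dictionary`, `remove_fillers`, and `clean_transcript` are all assumptions made for illustration.

```python
import re
from typing import Dict

# Hypothetical personal dictionary: misheard form -> preferred form.
PERSONAL_DICTIONARY: Dict[str, str] = {
    "why do tool": "ydotool",
    "pipe wire": "PipeWire",
    "whisper": "Whisper",
}

# Illustrative filler words only; a real list would be user-tunable.
FILLER_PATTERN = re.compile(r"\b(?:um+|uh+|ah+)\b[,.]?\s*", re.IGNORECASE)


def apply_dictionary(text: str, dictionary: Dict[str, str]) -> str:
    """Replace whole-word, case-insensitive matches, longest phrases first."""
    for phrase in sorted(dictionary, key=len, reverse=True):
        replacement = dictionary[phrase]
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash interpretation in the value.
        text = pattern.sub(lambda _match: replacement, text)
    return text


def remove_fillers(text: str) -> str:
    """Drop filler words and collapse any doubled whitespace left behind."""
    cleaned = FILLER_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def clean_transcript(text: str, dictionary: Dict[str, str]) -> str:
    """Run filler removal first, then dictionary replacement."""
    return apply_dictionary(remove_fillers(text), dictionary)
```

Sorting phrases longest-first keeps a multi-word entry from being partially rewritten by a shorter entry that overlaps it.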
Option A: Two-Stage
Audio → Whisper API → Personal Dictionary → LLM Cleanup → Clipboard → ydotool
Option B: Single-Stage (Gemini)
Audio → Gemini Multimodal → Personal Dictionary → Clipboard → ydotool
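
Both options share the same shape: one transcription call followed by text stages, with the LLM cleanup present only in Option A. The sketch below shows one possible way to express that with injected callables. Every name here (`run_pipeline` and the `stub_*` functions) is hypothetical, and the stubs stand in for real API calls, which are not shown.

```python
from typing import Callable, Optional

Transcriber = Callable[[bytes], str]  # audio bytes -> text
TextStage = Callable[[str], str]      # text -> text


def run_pipeline(
    audio: bytes,
    transcribe: Transcriber,
    apply_dictionary: TextStage,
    llm_cleanup: Optional[TextStage] = None,
) -> str:
    """Option A passes llm_cleanup; Option B leaves it as None."""
    text = transcribe(audio)
    text = apply_dictionary(text)
    if llm_cleanup is not None:
        text = llm_cleanup(text)
    return text  # the caller would hand this to the clipboard / ydotool


# Stubs standing in for the real services (assumptions for illustration).
def stub_whisper(audio: bytes) -> str:
    return "um hello from pipe wire"


def stub_gemini(audio: bytes) -> str:
    return "Hello from pipe wire."


def stub_dictionary(text: str) -> str:
    return text.replace("pipe wire", "PipeWire")


def stub_llm_cleanup(text: str) -> str:
    text = text.replace("um ", "")
    return text[0].upper() + text[1:] + "."


option_a = run_pipeline(b"", stub_whisper, stub_dictionary, stub_llm_cleanup)
option_b = run_pipeline(b"", stub_gemini, stub_dictionary)
```

Keeping the stages as plain callables means either option can be swapped in without touching the output path.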
- AI/LLM workflows - Rapid context input for AI tools
- Email composition - Conversational dictation with cleanup
- Blog writing - Outlines and draft generation
- Code documentation - Comments and technical documentation
- Prompt engineering - Complex prompts dictated naturally
- Linux with Wayland support
- ydotool for virtual keyboard
- Python 3.8+
- Audio capture capability (PipeWire/PulseAudio)
- Whisper API access (OpenAI) or local Whisper installation
- Optional: LLM API access (OpenAI GPT, Gemini, etc.)
- Audio libraries for recording and feedback
- System tray framework
- OpenAI API key for Whisper and optional GPT cleanup
- Alternative: Gemini API for multimodal processing
- Microphone (USB recommended for quality)
- Custom key mapping for F13 or alternative hotkey
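
As an illustration of the ydotool requirement above, text could be handed to the `ydotool type` subcommand from Python roughly as follows. This is a sketch under assumptions: the helper names `build_type_command` and `type_text` are hypothetical, and ydotool typically needs its `ydotoold` daemon running with access to `/dev/uinput`, which this code does not check.

```python
import shutil
import subprocess
from typing import List


def build_type_command(text: str) -> List[str]:
    """Build the argument list for `ydotool type`; no shell is involved."""
    return ["ydotool", "type", text]


def type_text(text: str) -> None:
    """Type text into the focused window via ydotool (assumed to be installed)."""
    if shutil.which("ydotool") is None:
        raise RuntimeError("ydotool not found on PATH")
    # check=True raises CalledProcessError if ydotool exits with an error.
    subprocess.run(build_type_command(text), check=True)
```

Passing an argument list rather than a shell string means the transcribed text never needs shell quoting.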
Be conservative about enabling these features at the same time:
- VAD (Voice Activity Detection)
- Background noise removal
- Personal dictionary application
- LLM-based formatting
Each adds latency, especially for local models.
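
One way to see what each feature costs is to time the stages individually before deciding which to enable together. The following is a minimal sketch with hypothetical names (`run_timed` and the stage labels); the stages shown are trivial placeholders, not real VAD or noise-removal code.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]  # (label, function)


def run_timed(data: Any, stages: List[Stage]) -> Tuple[Any, Dict[str, float]]:
    """Run stages in order and record wall-clock seconds spent in each."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Placeholder stages (assumptions for illustration only).
stages: List[Stage] = [
    ("dictionary", lambda text: text.replace("pipe wire", "PipeWire")),
    ("formatting", lambda text: text.strip()),
]

result, timings = run_timed("  pipe wire works  ", stages)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.3f} ms")
```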
- Punctuation - Critical for readability
- Paragraph spacing - Provides structure
- Personal dictionary accuracy
These three provide the most value for minimal overhead.
- Custom Whisper fine-tune integration
- Multi-language support (Hebrew, others)
- Advanced VAD options (user-configurable)
- Background noise removal integration
- Cloud sync for personal dictionary
- Usage analytics and accuracy tracking
This project follows a "spec-led development" approach where comprehensive context documentation drives implementation. The goal is to create a tool that solves a real personal need first, then share it if it proves valuable to others.
- Cloud-first: Prioritize speed and accuracy over cost savings
- OS-level: Work everywhere, not just in specific applications
- Simple controls: A hardware button is preferred over VAD or wake words
- Smart defaults: Make decisions, don't burden users with choices
- Progressive enhancement: Start simple, add complexity as needed
Currently in the specification and planning phase.
The goal is to reach a viable prototype for personal use soon.
- Fine-tuned Whisper model (in progress)
- Personal dictionary compilation
- Audio preprocessing pipeline exploration
- Development target: AMD GPU with ROCm
- Local models tested: Whisper Base, Large, Turbo
- Wayland environment: KDE Plasma on Ubuntu
- Initial: English only