Linux Voice Typing Application: Design / Project Notes

A system-level speech-to-text application for Linux with OS-level integration, designed for Wayland environments.

Overview

This repository contains development notes and specifications for a voice typing application that provides OS-level text entry on Linux. The goal is a unified, system-wide solution that works consistently across all applications: browsers, IDEs, email clients, and more.

Core Features

Speech-to-Text Engine

Cloud-first approach using Whisper API (OpenAI)
Optional local Whisper support (Base/Large models)
Future support for custom fine-tuned Whisper models
Alternative: Gemini multimodal audio for single-stage processing
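
A minimal sketch of the cloud-first path, assuming the official openai Python package (version 1 or later) and an OPENAI_API_KEY environment variable. The helper names encode_wav and transcribe_file, the 16 kHz mono format, and the "whisper-1" model choice are illustrative assumptions, not decisions recorded in these notes.

```python
import io
import struct
import wave
from typing import Sequence


def encode_wav(samples: Sequence[int], sample_rate: int = 16000) -> bytes:
    """Pack signed 16-bit mono samples into an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buffer.getvalue()


def transcribe_file(path: str, model: str = "whisper-1") -> str:
    """Send a recorded audio file to the hosted Whisper endpoint."""
    # Imported lazily so the rest of the module works without the package.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model=model, file=audio)
    return result.text
```

A local Whisper backend could sit behind the same transcribe_file signature, so the rest of the application would not need to know which engine produced the text.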

OS-Level Integration

Wayland support via ydotool virtual keyboard
Works across all applications (browsers, IDEs, terminals, etc.)
No need for application-specific extensions or plugins
System tray integration with persistent background operation
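
The ydotool step could look roughly like the sketch below. It relies only on ydotool's "type" subcommand. The function names are hypothetical, and the sketch assumes the ydotool binary is on PATH and that its ydotoold daemon is already running, which recent ydotool releases expect.

```python
import shutil
import subprocess
from typing import List


def build_type_command(text: str) -> List[str]:
    """Return the argv that asks ydotool to type `text` as key events."""
    # Passing a list (no shell) keeps quotes and spaces in `text` intact.
    # Assumption: text that begins with "-" may need extra handling so that
    # ydotool does not read it as an option.
    return ["ydotool", "type", text]


def type_text(text: str) -> bool:
    """Type `text` into the focused window; return True on success."""
    if not text or shutil.which("ydotool") is None:
        return False
    result = subprocess.run(build_type_command(text), check=False)
    return result.returncode == 0
```

Because the keystrokes are injected below the compositor, the receiving application needs no plugin, which is the point of the OS-level approach.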

Input Control

F13 key toggle (or custom hotkey) for recording control
Tap to start, tap to stop recording
Audio feedback with distinct beep tones (start/stop)
~15-second lag, accepted so that audio can be processed in optimally sized chunks (not real-time)
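
The tap-to-start, tap-to-stop behaviour is a two-state toggle. The sketch below models only that state machine; hotkey capture and audio playback are left out, and the class name and beep frequencies are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical tone frequencies (Hz) so start and stop sound distinct.
BEEP_HZ = {"start": 880, "stop": 440}


@dataclass
class RecordingToggle:
    """Tracks whether recording is active across hotkey presses."""

    recording: bool = False
    events: List[str] = field(default_factory=list)

    def on_hotkey(self) -> str:
        """Flip the state and return the event that should be signalled."""
        self.recording = not self.recording
        event = "start" if self.recording else "stop"
        self.events.append(event)
        return event

    def beep_for(self, event: str) -> int:
        """Return the tone frequency to play for an event."""
        return BEEP_HZ[event]
```

Each press of F13 (or the configured hotkey) would call on_hotkey once, then play the tone returned by beep_for.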

Post-Processing Pipeline

Personal Dictionary

Custom word/phrase replacement for technical terms
Handles proper nouns, company names, domain-specific vocabulary
Persistent storage with export capability
Version control support for cross-system sync
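
One way the replacement step might work is sketched below: whole-phrase, case-insensitive substitution, with the dictionary exported as sorted JSON so that diffs stay small under version control. The function names and the JSON format are assumptions; the notes do not fix a storage format.

```python
import json
import re
from typing import Dict


def apply_dictionary(text: str, replacements: Dict[str, str]) -> str:
    """Replace whole-word matches of each key, ignoring case."""
    if not replacements:
        return text
    # Longest keys first, so a longer phrase wins over a shorter prefix.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in replacements.items()}
    # Fall back to the matched text if the lowercase lookup misses.
    return pattern.sub(
        lambda m: lookup.get(m.group(0).lower(), m.group(0)), text
    )


def export_dictionary(replacements: Dict[str, str]) -> str:
    """Serialise the dictionary with stable key order for clean diffs."""
    return json.dumps(replacements, indent=2, sort_keys=True)
```

For example, a dictionary mapping "why do tool" to "ydotool" would correct a common mis-transcription without touching unrelated words.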

Text Cleanup Options

Two-stage approach: Whisper transcription → LLM formatting
Single-stage approach: Gemini multimodal with integrated cleanup
Automatic punctuation and paragraph spacing
Removal of filler words (ums, ahs, pause words)
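
Filler removal does not necessarily need an LLM. A rough, rule-based sketch follows; the filler list and the function name are assumptions, and a real implementation would need care with words such as "uh-huh".

```python
import re

# Matches standalone fillers plus a trailing comma and whitespace.
FILLERS = re.compile(r"\b(?:um+|uh+|ah+)\b,?\s*", re.IGNORECASE)


def remove_fillers(text: str) -> str:
    """Strip common filler words and tidy the punctuation left behind."""
    cleaned = FILLERS.sub("", text)
    # Drop commas or spaces stranded directly before end punctuation.
    cleaned = re.sub(r"[,\s]+([.!?])", r"\1", cleaned).strip()
    # Restore the capital if a sentence-initial filler was removed.
    return cleaned[:1].upper() + cleaned[1:]
```

The word boundaries keep ordinary words that merely begin with a filler, such as "umbrella", intact.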

Microphone Management

Remember last microphone selection
Auto-default to previously used device
Built-in level testing tool (View → Check Level)
Smart feedback on audio levels (clipping detection, volume suggestions)
Decibel meter for quick diagnostics
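
The level check could be built on two measurements: RMS level in dBFS and the fraction of samples at or near full scale. The sketch below assumes signed 16-bit PCM samples; the thresholds (-40 dBFS, 1% clipped) and the function names are illustrative guesses, not tuned values.

```python
import math
from typing import Sequence

FULL_SCALE = 32768  # magnitude of the largest 16-bit sample


def rms_dbfs(samples: Sequence[int]) -> float:
    """Return the RMS level in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / FULL_SCALE) if rms > 0 else float("-inf")


def clipped_fraction(samples: Sequence[int], threshold: int = 32700) -> float:
    """Return the share of samples at or above the clipping threshold."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if abs(s) >= threshold) / len(samples)


def level_advice(samples: Sequence[int]) -> str:
    """Turn the measurements into a short suggestion for the user."""
    if clipped_fraction(samples) > 0.01:
        return "Clipping detected: lower the input gain."
    if rms_dbfs(samples) < -40:
        return "Too quiet: raise the gain or move closer to the microphone."
    return "Level looks fine."
```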

Persistent Storage

Secure API key storage (OpenAI or other providers)
Microphone selection memory
Personal dictionary persistence
User preferences and settings
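
A sketch of how settings might be persisted, assuming a single JSON file under the XDG config directory. The directory name "voice-typing" and the function names are hypothetical. Owner-only file permissions are a baseline rather than real secret storage; a system keyring would be a stronger home for the API key.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict, Mapping, Optional

APP_DIR_NAME = "voice-typing"  # hypothetical application directory


def config_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the settings file, honouring XDG_CONFIG_HOME when set."""
    env = os.environ if env is None else env
    base = env.get("XDG_CONFIG_HOME") or os.path.join(
        env.get("HOME", "."), ".config"
    )
    return Path(base) / APP_DIR_NAME / "settings.json"


def save_settings(path: Path, settings: Dict[str, Any]) -> None:
    """Write settings as JSON, creating the file readable by owner only."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The 0o600 mode applies only when the file is first created.
    fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(settings, handle, indent=2)


def load_settings(path: Path) -> Dict[str, Any]:
    """Read settings, returning an empty dict on first run."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {}
```

The last-used microphone and user preferences would be ordinary keys in this file, for example {"microphone": "USB Mic"}.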

User Interface

System tray operation by default
Minimal, non-intrusive window behavior
No unnecessary configuration dialogs
Smart defaults, with executive decisions built in

Initial Implementation Priorities

Phase 1 (MVP)

Voice typing with OS-level keyboard entry via ydotool
Cloud-based Whisper transcription
Basic text cleanup and formatting
F13 hotkey control with audio feedback
System tray integration
Personal dictionary with basic replacements

Phase 2 (Enhanced)

Advanced microphone management and level testing
LLM-based post-processing integration
Personal dictionary export/import
Enhanced UI feedback and status indicators

Phase 3 (Advanced)

Local Whisper model support
Custom fine-tuned model integration
Multi-language support (Hebrew, others)
Advanced pre-processing options

Technical Architecture

The Complete STT Stack

Clear speech - User articulation and speaking manner
Microphone selection - Context-appropriate device (studio vs. mobile)
Pre-processing - Optional noise removal and signal optimization
STT Model - Whisper (cloud or local)
Post-processing - Dictionary replacement and text cleanup

Processing Options

Option A: Two-Stage

Audio → Whisper API → Personal Dictionary → LLM Cleanup → Clipboard → ydotool

Option B: Single-Stage (Gemini)

Audio → Gemini Multimodal → Personal Dictionary → Clipboard → ydotool
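
Both options share one shape: a transcription step, a list of text stages, and a delivery step. The sketch below shows that shape with every stage passed in as a plain function; all names are hypothetical, and the real stages would call Whisper, Gemini, an LLM, and ydotool.

```python
from typing import Callable, Iterable, List

Stage = Callable[[str], str]


def run_text_stages(text: str, stages: Iterable[Stage]) -> str:
    """Apply each text stage in order to the running transcript."""
    for stage in stages:
        text = stage(text)
    return text


def build_pipeline(
    transcribe: Callable[[bytes], str],
    stages: List[Stage],
    deliver: Callable[[str], None],
) -> Callable[[bytes], str]:
    """Combine transcription, text stages, and delivery into one call."""

    def run(audio: bytes) -> str:
        text = run_text_stages(transcribe(audio), stages)
        deliver(text)  # e.g. copy to the clipboard, then type via ydotool
        return text

    return run
```

Under this sketch, Option A would pass a Whisper transcriber with [dictionary, llm_cleanup] as stages, and Option B a Gemini transcriber with [dictionary] alone.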

Use Cases

AI/LLM workflows - Rapid context input for AI tools
Email composition - Conversational dictation with cleanup
Blog writing - Outlines and draft generation
Code documentation - Comments and technical documentation
Prompt engineering - Complex prompts dictated naturally

Requirements

System Requirements

Linux with Wayland support
ydotool for virtual keyboard
Python 3.8+
Audio capture capability (PipeWire/PulseAudio)
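
A startup check for these requirements might look like the sketch below. It tests the Python version, the WAYLAND_DISPLAY environment variable, and whether ydotool is on PATH. The function name and messages are assumptions, and audio capture is not probed here.

```python
import os
import shutil
import sys
from typing import Callable, List, Mapping, Optional, Tuple


def check_requirements(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
    version: Tuple[int, int] = sys.version_info[:2],
) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    if version < (3, 8):
        problems.append("Python 3.8 or newer is required.")
    if not env.get("WAYLAND_DISPLAY"):
        problems.append("No Wayland session detected (WAYLAND_DISPLAY unset).")
    if which("ydotool") is None:
        problems.append("ydotool was not found on PATH.")
    return problems
```

The parameters exist so the checks can be exercised without changing the real environment.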

Dependencies

Whisper API access (OpenAI) or local Whisper installation
Optional: LLM API access (OpenAI GPT, Gemini, etc.)
Audio libraries for recording and feedback
System tray framework

Configuration

API Keys

OpenAI API key for Whisper and optional GPT cleanup
Alternative: Gemini API for multimodal processing

Hardware

Microphone (USB recommended for quality)
Custom key mapping for F13 or alternative hotkey

Performance Considerations

Processing Overhead

Be conservative about enabling these features at the same time:

VAD (Voice Activity Detection)
Background noise removal
Personal dictionary application
LLM-based formatting

Each adds latency, especially for local models.
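
To see where the latency goes, each stage can be timed separately. The sketch below is one hypothetical way to do that for text stages; run_timed and the stage names are illustrative.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_timed(
    text: str, stages: List[Tuple[str, Callable[[str], str]]]
) -> Tuple[str, Dict[str, float]]:
    """Run named stages in order and record seconds spent in each."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        started = time.perf_counter()
        text = stage(text)
        timings[name] = time.perf_counter() - started
    return text, timings
```

Comparing the recorded timings with and without a given feature shows whether it is worth its cost before it is enabled by default.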

Priority Optimizations

Punctuation - Critical for readability
Paragraph spacing - Provides structure
Personal dictionary accuracy

These three provide the most value for minimal overhead.

Future Enhancements

Custom Whisper fine-tune integration
Multi-language support (Hebrew, others)
Advanced VAD options (user-configurable)
Background noise removal integration
Cloud sync for personal dictionary
Usage analytics and accuracy tracking

Development Philosophy

This project follows a "spec-led development" approach where comprehensive context documentation drives implementation. The goal is to create a tool that solves a real personal need first, then share it if it proves valuable to others.

Design Principles

Cloud-first: Prioritize speed and accuracy over cost savings
OS-level: Work everywhere, not just in specific applications
Simple controls: A hardware button is preferred over VAD or wake words
Smart defaults: Make decisions, don't burden users with choices
Progressive enhancement: Start simple, add complexity as needed

Project Status

Currently in specification and planning phase.

The goal is to reach a viable prototype for personal use soon.

Related Projects

Fine-tuned Whisper model (in progress)
Personal dictionary compilation
Audio preprocessing pipeline exploration

Technical Notes

Hardware Context

Development target: AMD GPU with ROCm
Local models tested: Whisper Base, Large, Turbo
Wayland environment: KDE Plasma on Ubuntu

Language Support

Initial: English only

danielrosehill/Linux-Voice-Typing-App-Notes
