Linux Voice Typing Application: Design / Project Notes

A system-level speech-to-text application for Linux with OS-level integration, designed for Wayland environments.

Overview

This repository contains development notes and specifications for a voice typing application that provides OS-level text entry on Linux. The goal is to create a unified, system-wide solution that works consistently across all applications: browsers, IDEs, email clients, and more.

Core Features

Speech-to-Text Engine

  • Cloud-first approach using Whisper API (OpenAI)
  • Optional local Whisper support (Base/Large models)
  • Future support for custom fine-tuned Whisper models
  • Alternative: Gemini multimodal audio for single-stage processing
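
The cloud-first-with-local-fallback idea could be sketched as below. All function names here are illustrative, not part of any spec; the cloud call assumes the official `openai` Python package (imported lazily so the rest of the module works without it).

```python
from typing import Optional

def pick_engine(api_key: Optional[str], prefer_local: bool = False) -> str:
    """Choose an STT backend: cloud Whisper when an API key is available,
    otherwise (or on explicit request) fall back to a local model."""
    if prefer_local or not api_key:
        return "local-whisper-base"
    return "whisper-api"

def transcribe_cloud(audio_path: str, api_key: str) -> str:
    """Hedged sketch of the cloud path using the official openai package."""
    from openai import OpenAI  # assumes `pip install openai`
    client = OpenAI(api_key=api_key)
    with open(audio_path, "rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text
```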

OS-Level Integration

  • Wayland support via ydotool virtual keyboard
  • Works across all applications (browsers, IDEs, terminals, etc.)
  • No need for application-specific extensions or plugins
  • System tray integration with persistent background operation
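
Injecting the finished text through ydotool's virtual keyboard could look like the sketch below (the chunk size and function names are assumptions; ydotool requires the ydotoold daemon to be running on Wayland).

```python
import subprocess

def chunk_text(text: str, size: int = 512):
    """Split long transcripts so each `ydotool type` invocation stays small."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def type_at_cursor(text: str) -> None:
    """Type text at the current cursor position via ydotool's virtual
    keyboard, so it works in any focused application."""
    for piece in chunk_text(text):
        subprocess.run(["ydotool", "type", piece], check=True)
```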

Input Control

  • F13 key toggle (or custom hotkey) for recording control
  • Tap to start, tap to stop recording
  • Audio feedback with distinct beep tones (start/stop)
  • ~15-second lag for optimal processing chunks (not real-time)
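
The tap-to-start / tap-to-stop behavior is a tiny state machine; a minimal sketch (class and callback names are illustrative, with beeps left as pluggable callbacks so any audio backend fits):

```python
class RecordingToggle:
    """Tap-to-start / tap-to-stop state machine bound to the F13 hotkey."""

    def __init__(self, on_start=None, on_stop=None):
        self.recording = False
        self.on_start = on_start or (lambda: None)
        self.on_stop = on_stop or (lambda: None)

    def toggle(self) -> str:
        self.recording = not self.recording
        if self.recording:
            self.on_start()   # e.g. play the distinct "start" beep
            return "recording"
        self.on_stop()        # e.g. play the "stop" beep, then hand off audio
        return "stopped"
```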

Post-Processing Pipeline

Personal Dictionary

  • Custom word/phrase replacement for technical terms
  • Handles proper nouns, company names, domain-specific vocabulary
  • Persistent storage with export capability
  • Version control support for cross-system sync
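
A personal dictionary along these lines could be a small JSON-backed replacement table (a sketch; the class name and storage layout are assumptions, chosen so the file can live in a git repo for cross-system sync):

```python
import json
import re
from pathlib import Path

class PersonalDictionary:
    """Word/phrase replacements applied to transcripts after STT."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})

    def apply(self, text: str) -> str:
        for wrong, right in self.entries.items():
            # \b word boundaries keep short entries from matching inside words
            text = re.sub(rf"\b{re.escape(wrong)}\b", right, text,
                          flags=re.IGNORECASE)
        return text

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(self.entries, indent=2, ensure_ascii=False))

    @classmethod
    def load(cls, path: Path) -> "PersonalDictionary":
        return cls(json.loads(path.read_text()))
```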

Text Cleanup Options

  • Two-stage approach: Whisper transcription → LLM formatting
  • Single-stage approach: Gemini multimodal with integrated cleanup
  • Automatic punctuation and paragraph spacing
  • Removal of filler words (ums, ahs, pause words)
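
The filler-word and spacing cleanup can be done cheaply with regexes before (or instead of) an LLM pass; a sketch, with the filler list as an assumption:

```python
import re

# Common filler words; trailing comma/period and whitespace go with them.
FILLERS = re.compile(r"\b(um+|uh+|ah+|er+|hmm+)\b[,.]?\s*", re.IGNORECASE)

def cleanup(text: str) -> str:
    """Lightweight pre-LLM cleanup: strip fillers, collapse the leftover
    whitespace, and remove stray spaces before punctuation."""
    text = FILLERS.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\s+([,.!?])", r"\1", text)
    return text
```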

Microphone Management

  • Remember last microphone selection
  • Auto-default to previously used device
  • Built-in level testing tool (View → Check Level)
  • Smart feedback on audio levels (clipping detection, volume suggestions)
  • Decibel meter for quick diagnostics
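
The level meter and clipping detection reduce to simple math on a capture buffer; a sketch assuming normalized float samples in [-1.0, 1.0] (as PipeWire/PulseAudio clients typically provide), with the clip threshold as an assumption:

```python
import math

def level_report(samples, clip_threshold: float = 0.99):
    """Compute a dBFS reading and a clipping flag for one capture buffer."""
    if not samples:
        return {"dbfs": float("-inf"), "clipping": False}
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    dbfs = 20 * math.log10(rms) if rms > 0 else float("-inf")
    clipping = any(abs(s) >= clip_threshold for s in samples)
    return {"dbfs": dbfs, "clipping": clipping}
```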

Persistent Storage

  • Secure API key storage (OpenAI or other providers)
  • Microphone selection memory
  • Personal dictionary persistence
  • User preferences and settings
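
Settings persistence could be plain JSON under the XDG config directory; a sketch with assumed paths and default values. (API keys deserve better than a plain file; the desktop secret service via the `keyring` package would be a more secure home for those.)

```python
import json
import os
from pathlib import Path

# XDG-style config location; app name is an assumption.
CONFIG_DIR = Path(os.environ.get("XDG_CONFIG_HOME",
                                 Path.home() / ".config")) / "voice-typing"

def load_settings(path=None) -> dict:
    path = path or CONFIG_DIR / "settings.json"
    if path.exists():
        return json.loads(path.read_text())
    # Smart defaults when no settings file exists yet
    return {"microphone": None, "hotkey": "F13"}

def save_settings(settings: dict, path=None) -> None:
    path = path or CONFIG_DIR / "settings.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2))
```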

User Interface

  • System tray operation by default
  • Minimal, non-intrusive window behavior
  • No unnecessary configuration dialogs
  • Smart defaults with executive decisions built-in

Initial Implementation Priorities

Phase 1 (MVP)

  1. Voice typing with OS-level keyboard entry via ydotool
  2. Cloud-based Whisper transcription
  3. Basic text cleanup and formatting
  4. F13 hotkey control with audio feedback
  5. System tray integration
  6. Personal dictionary with basic replacements

Phase 2 (Enhanced)

  1. Advanced microphone management and level testing
  2. LLM-based post-processing integration
  3. Personal dictionary export/import
  4. Enhanced UI feedback and status indicators

Phase 3 (Advanced)

  1. Local Whisper model support
  2. Custom fine-tuned model integration
  3. Multi-language support (Hebrew, others)
  4. Advanced pre-processing options

Technical Architecture

The Complete STT Stack

  1. Clear speech - User articulation and speaking manner
  2. Microphone selection - Context-appropriate device (studio vs. mobile)
  3. Pre-processing - Optional noise removal and signal optimization
  4. STT Model - Whisper (cloud or local)
  5. Post-processing - Dictionary replacement and text cleanup

Processing Options

Option A: Two-Stage

Audio → Whisper API → Personal Dictionary → LLM Cleanup → Clipboard → ydotool

Option B: Single-Stage (Gemini)

Audio → Gemini Multimodal → Personal Dictionary → Clipboard → ydotool
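
Both options reduce to a linear chain of stages, so the two variants can be one configuration switch apart. A minimal sketch with stubbed stages standing in for the real implementations (every name here is illustrative):

```python
def run_pipeline(audio, stages):
    """Pass captured audio through each stage in order; each stage takes
    the previous stage's output and returns the next stage's input."""
    data = audio
    for stage in stages:
        data = stage(data)
    return data

# Stubs standing in for the real stages:
transcribe = lambda audio: "send it to why dot tool"         # Whisper API
dictionary = lambda text: text.replace("why dot tool", "ydotool")
llm_cleanup = lambda text: text[0].upper() + text[1:] + "."  # GPT/Gemini pass

option_a = [transcribe, dictionary, llm_cleanup]  # two-stage
option_b = [transcribe, dictionary]               # single-stage: Gemini cleans up in-model
```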

Use Cases

  • AI/LLM workflows - Rapid context input for AI tools
  • Email composition - Conversational dictation with cleanup
  • Blog writing - Outlines and draft generation
  • Code documentation - Comments and technical documentation
  • Prompt engineering - Complex prompts dictated naturally

Requirements

System Requirements

  • Linux with Wayland support
  • ydotool for virtual keyboard
  • Python 3.8+
  • Audio capture capability (PipeWire/PulseAudio)

Dependencies

  • Whisper API access (OpenAI) or local Whisper installation
  • Optional: LLM API access (OpenAI GPT, Gemini, etc.)
  • Audio libraries for recording and feedback
  • System tray framework

Configuration

API Keys

  • OpenAI API key for Whisper and optional GPT cleanup
  • Alternative: Gemini API for multimodal processing

Hardware

  • Microphone (USB recommended for quality)
  • Custom key mapping for F13 or alternative hotkey

Performance Considerations

Processing Overhead

Be conservative with simultaneous features:

  • VAD (Voice Activity Detection)
  • Background noise removal
  • Personal dictionary application
  • LLM-based formatting

Each adds latency, especially for local models.

Priority Optimizations

  1. Punctuation - Critical for readability
  2. Paragraph spacing - Provides structure
  3. Personal dictionary accuracy

These three provide the most value for minimal overhead.

Future Enhancements

  • Custom Whisper fine-tune integration
  • Multi-language support (Hebrew, others)
  • Advanced VAD options (user-configurable)
  • Background noise removal integration
  • Cloud sync for personal dictionary
  • Usage analytics and accuracy tracking

Development Philosophy

This project follows a "spec-led development" approach where comprehensive context documentation drives implementation. The goal is to create a tool that solves a real personal need first, then share it if it proves valuable to others.

Design Principles

  • Cloud-first: Prioritize speed and accuracy over cost savings
  • OS-level: Work everywhere, not just in specific applications
  • Simple controls: a hardware button beats VAD or wake words
  • Smart defaults: Make decisions, don't burden users with choices
  • Progressive enhancement: Start simple, add complexity as needed

Project Status

Currently in the specification and planning phase. The goal is to reach a viable prototype for personal use soon.

Related Projects

  • Fine-tuned Whisper model (in progress)
  • Personal dictionary compilation
  • Audio preprocessing pipeline exploration

Technical Notes

Hardware Context

  • Development target: AMD GPU with ROCm
  • Local models tested: Whisper Base, Large, Turbo
  • Wayland environment: KDE Plasma on Ubuntu

Language Support

  • Initial: English only

About

Planning notes for a tool I've been working on for a while!
