A minimal, real-time voice assistant powered by WebRTC streaming, featuring ultra-low latency audio processing with professional-grade components.
- WebRTC Streaming: Ultra-low latency audio streaming via FastRTC
- Advanced VAD: Silero voice activity detection with configurable parameters
- Speech Recognition: Whisper ASR for accurate speech-to-text transcription
- Neural TTS: High-quality Kokoro text-to-speech with natural-sounding voices
- AI Integration: Support for both Ollama (local) and OpenRouter GPT-5 Nano
- Context-Aware: Maintains conversation history for coherent responses (sketched below)
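The conversation history is the usual chat-message list that both Ollama's and OpenRouter's chat endpoints accept; a minimal sketch of that structure (the system prompt and helper are illustrative, not taken from the project code):

```python
# Chat history as role/content messages, the shape both Ollama's and
# OpenRouter's chat APIs accept.
history = [{"role": "system", "content": "You are a helpful voice assistant."}]

def add_turn(user_text: str, assistant_text: str) -> None:
    """Record one exchange so later replies stay coherent with earlier ones."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
```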
```
┌─────────────┐    WebRTC     ┌──────────────────┐
│   Browser   │──────────────►│     FastRTC      │
│  (Client)   │   Ultra-low   │      Stream      │
└─────────────┘    latency    └──────────────────┘
                                        │
                                        ▼
                              ┌──────────────────┐
                              │    Silero VAD    │
                              │ (Voice Activity) │
                              └──────────────────┘
                                        │
                                        ▼
                              ┌──────────────────┐
                              │   Whisper ASR    │
                              │ (Speech-to-Text) │
                              └──────────────────┘
                                        │
                                        ▼
                              ┌──────────────────┐
                              │  LLM Processing  │
                              │ (Ollama/OpenAI)  │
                              └──────────────────┘
                                        │
                                        ▼
                              ┌──────────────────┐
                              │    Kokoro TTS    │
                              │ (Text-to-Speech) │
                              └──────────────────┘
                                        │
                                        ▼
                              ┌──────────────────┐
                              │   Audio Stream   │
                              │ (Back to client) │
                              └──────────────────┘
```
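In code, the whole pipeline hangs off one streaming handler. A minimal sketch using FastRTC's `Stream`/`ReplyOnPause` pattern (the comments mark where the Whisper, LLM, and Kokoro stages slot in; this sketch only echoes audio and is not the project's actual handler):

```python
from fastrtc import ReplyOnPause, Stream

def respond(audio):
    # `audio` arrives as (sample_rate, numpy_samples) once Silero VAD
    # decides the user has stopped speaking. In the real app this is
    # where Whisper ASR, the LLM call, and Kokoro TTS run in sequence;
    # this sketch simply echoes the utterance back.
    yield audio

# VAD-gated, bidirectional audio: speech in, synthesized speech out.
stream = Stream(ReplyOnPause(respond), modality="audio", mode="send-receive")

if __name__ == "__main__":
    stream.ui.launch()  # FastRTC's built-in test UI
```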
- Python: 3.12 or higher
- uv: Modern Python package manager
- System Dependencies:
- Build tools (gcc/g++)
- Python development headers
- espeak-ng for phonemization
- AI Backend: Either:
  - Ollama with the `gemma3:4b` model (local), OR
  - An OpenRouter API key for GPT-5 Nano (cloud)
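On Debian/Ubuntu, the system dependencies above can be installed up front like this (package names assumed for current releases; `setup.sh` below handles the equivalent automatically):

```bash
sudo apt-get update
sudo apt-get install -y build-essential python3-dev espeak-ng
```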
```bash
git clone https://github.com/voidstarr/minimal-voice-assistant.git
cd minimal-voice-assistant
```

The setup script will automatically:
- Install `uv` if not present
- Install system dependencies (gcc, python3-dev, espeak)
- Create a virtual environment
- Install all Python dependencies
- Generate SSL certificates for HTTPS
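If you ever need to regenerate the self-signed certificate by hand, a standard `openssl` one-liner produces the same kind of `cert.pem`/`key.pem` pair (the exact flags `setup.sh` uses may differ):

```bash
openssl req -x509 -newkey rsa:4096 -nodes \
  -keyout ssl_certs/key.pem -out ssl_certs/cert.pem \
  -days 365 -subj "/CN=localhost"
```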
```bash
chmod +x setup.sh
./setup.sh
```

Choose ONE of the following options:
Option 1: Ollama (local)

- Install Ollama from ollama.ai
- Pull the required model:

```bash
ollama pull gemma3:4b
```
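Optionally, confirm the model responds before launching the assistant:

```bash
ollama run gemma3:4b "Reply with one short sentence."
```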
Option 2: OpenRouter (cloud)

- Create a `.env` file:

```bash
echo "OPENROUTER_API_KEY=your_api_key_here" > .env
```
- Get your API key from OpenRouter
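For reference, OpenRouter exposes an OpenAI-compatible chat API, so the cloud path boils down to a call like this sketch (the `openai/gpt-5-nano` model ID and the helper are assumptions, not code from the project):

```python
import os
from openai import OpenAI  # OpenRouter speaks the OpenAI wire format

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # from your .env
)

def ask_llm(messages: list[dict]) -> str:
    """Send the conversation history and return the assistant's reply."""
    response = client.chat.completions.create(
        model="openai/gpt-5-nano",  # assumed OpenRouter model ID
        messages=messages,
    )
    return response.choices[0].message.content
```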
```bash
chmod +x run.sh
./run.sh
```

The assistant will start on https://localhost:7860
1. Open the Interface: Navigate to `https://localhost:7860` in your browser and accept the self-signed certificate warning (this is expected for local development)
2. Grant Microphone Access: Allow browser microphone permissions when prompted
3. Start Speaking: The assistant will automatically detect when you start and stop speaking using advanced VAD
4. Receive Responses: The AI will process your speech and respond with natural voice
Edit `voice_assistant.py` to adjust VAD parameters:

```python
self.vad_options = SileroVadOptions(
    threshold=0.5,                # Speech detection sensitivity
    min_speech_duration_ms=250,   # Minimum speech length
    max_speech_duration_s=30.0,   # Maximum speech length
    min_silence_duration_ms=500,  # Silence before processing
    window_size_samples=1024,     # VAD processing window
    speech_pad_ms=200,            # Padding around speech
)
```

Change the voice in the `generate_tts` method:
```python
samples, sample_rate = self.kokoro.create(
    text, voice="af_heart", speed=1.0, lang="en-us"
)
```

Available voices depend on your Kokoro model configuration.
```
minimal-voice-assistant/
├── voice_assistant.py       # Main application
├── pyproject.toml           # Project dependencies
├── setup.sh                 # Setup script
├── run.sh                   # Run script
├── models/                  # TTS models
│   └── kokoro-v1.0.onnx     # Kokoro TTS model
├── ssl_certs/               # SSL certificates
│   ├── cert.pem
│   └── key.pem
└── README.md                # This file
```
Issue: Browser shows security warning
Solution: This is expected for self-signed certificates. Click "Advanced" and "Proceed" to continue. For production, use proper SSL certificates.
Issue: No audio detected
Solution:
- Ensure HTTPS is enabled (required for WebRTC)
- Grant microphone permissions in browser
- Check browser console for errors
- Verify microphone works in other applications (or run the quick check below)
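As a final check outside the browser, a couple of lines of Python confirm the OS is delivering audio at all (assumes the `sounddevice` package, which is not one of the project's dependencies):

```python
import numpy as np
import sounddevice as sd  # pip install sounddevice

# Record one second from the default microphone and report the peak level.
recording = sd.rec(16000, samplerate=16000, channels=1, dtype="float32")
sd.wait()
print("peak amplitude:", float(np.abs(recording).max()))  # ~0.0 means silence
```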
Issue: "Ollama connection failed" error
Solution:
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if needed
ollama serve

# Pull the model
ollama pull gemma3:4b
```

Issue: Phonemizer or espeak errors
Solution: Reinstall system dependencies:
```bash
# Ubuntu/Debian
sudo apt-get install espeak espeak-data libespeak-dev

# Fedora/RHEL
sudo dnf install espeak espeak-devel
```

Issue: TTS or ASR models fail to load
Solution:
- Ensure `models/kokoro-v1.0.onnx` is present (a quick load check follows this list)
- Check sufficient RAM (4GB+ recommended)
- Verify Python 3.12+ is installed
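To tell a missing model apart from a corrupt one, try loading it directly (this assumes `onnxruntime` is available, as the Kokoro ONNX stack typically pulls it in):

```python
from pathlib import Path

import onnxruntime as ort

model_path = Path("models/kokoro-v1.0.onnx")
assert model_path.exists(), f"{model_path} is missing"

# Raises a descriptive error if the file is truncated or corrupt.
ort.InferenceSession(str(model_path))
print("model loads OK")
```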
If you prefer not to use the setup script:
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment
uv venv

# Activate environment
source .venv/bin/activate   # Linux/Mac
# or
.venv\Scripts\activate      # Windows

# Install dependencies
uv pip install -e .
```

For development without HTTPS (note: WebRTC may not work):
```python
# In voice_assistant.py, modify the launch call:
interface.launch(
    server_name="0.0.0.0",
    server_port=7860,
    share=False
)
```

- Latency: ~200-500ms end-to-end (depends on LLM)
- CPU Usage: Moderate (optimized for CPU-only operation)
- RAM Usage: ~2-4GB (with models loaded)
- Network: Minimal (except for OpenRouter API calls)
Contributions are welcome! Please feel free to submit issues or pull requests.