An event-driven, multi-service voice agent system that integrates speech-to-text (STT), language models (LLM), and text-to-speech (TTS) to create an intelligent voice assistant.
```bash
# Clone the repository
git clone https://github.com/ndwang/voice_agent.git
cd voice_agent

# Install dependencies using uv
uv sync
```

Note: The default setup assumes CUDA 12.6. Optional components like blivedm, ChatTTS, Genie TTS, and Edge-TTS require separate installation or system setup. See the Installation Guide for details.
```bash
uv run python scripts/start_services.py
```

This script starts the STT Service, TTS Service, and the Orchestrator.
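Since the services come up asynchronously, it can help to poll them before sending requests. The helper below is a hypothetical sketch (not part of the repo's scripts); the orchestrator port comes from the endpoint list in this README.

```python
import time
import urllib.error
import urllib.request

def wait_for_service(url: str, timeout: float = 30.0) -> bool:
    """Poll `url` until it responds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(0.5)  # service not up yet; retry shortly
    return False

# Example: wait for the orchestrator on its default port
# wait_for_service("http://localhost:8000")
```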
To stop everything:
```bash
uv run python scripts/stop_services.py
```

- Talk: Simply speak into your microphone.
- Toggle: Use `Ctrl+Shift+L` to enable/disable listening.
- Cancel: Use `Ctrl+Shift+C` to stop the current response.
- Web UI: Visit http://localhost:8000/ui for the control panel.
- Orchestrator: http://localhost:8000 (UI: `/ui`)
- STT: http://localhost:8001
- TTS: http://localhost:8003
- OCR (optional): http://localhost:8004
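For scripting against these services, the port layout above can be captured in a small helper. This is a hypothetical convenience, not an API shipped by the repo; the ports are the ones listed above.

```python
# Service-to-port mapping taken from this README's endpoint list.
SERVICE_PORTS = {
    "orchestrator": 8000,
    "stt": 8001,
    "tts": 8003,
    "ocr": 8004,  # optional service
}

def base_url(service: str, host: str = "localhost") -> str:
    """Return the base URL for a named service."""
    return f"http://{host}:{SERVICE_PORTS[service]}"
```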
For detailed guides on architecture, configuration, and service details, visit our Documentation Site.
To run the documentation site locally:
```bash
uv pip install mkdocs-material mkdocs-mermaid2-plugin
uv run mkdocs serve -a 127.0.0.1:8010
```

Then visit http://localhost:8010.
The system uses a microservices architecture coordinated by an asynchronous Event Bus.
```mermaid
graph LR
    User --> STT[STT Service]
    STT --> ORCH[Orchestrator]
    ORCH --> LLM[LLM Provider]
    LLM --> ORCH
    ORCH --> TTS[TTS Service]
    TTS --> User
```
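The coordination pattern behind this graph can be sketched as a minimal asynchronous pub/sub event bus. All class, topic, and handler names below are illustrative, not the repo's actual API; the point is only how the Orchestrator can react to events published by other services.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable

class EventBus:
    """Minimal async pub/sub bus (illustrative, not the repo's implementation)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[Any], Awaitable[None]]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], Awaitable[None]]) -> None:
        self._subscribers[topic].append(handler)

    async def publish(self, topic: str, payload: Any) -> None:
        # Deliver the event to every subscriber of the topic concurrently.
        await asyncio.gather(*(h(payload) for h in self._subscribers[topic]))

async def demo() -> list[str]:
    bus = EventBus()
    log: list[str] = []

    async def on_transcript(text: str) -> None:
        # In the real system, the Orchestrator would forward this to the LLM.
        log.append(f"orchestrator got: {text}")

    bus.subscribe("stt.transcript", on_transcript)  # hypothetical topic name
    await bus.publish("stt.transcript", "hello")
    return log
```

Decoupling services through topics like this is what lets STT, the Orchestrator, and TTS run as independent processes.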
For a deep dive, see the Architecture Overview.
All settings are managed in config.yaml. See the Configuration Guide for details.
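As a rough illustration of the single-file layout, a config.yaml might group settings per service. Every key below is hypothetical (only the ports and hotkeys echo values stated earlier in this README); the authoritative schema is the repo's own config.yaml and the Configuration Guide.

```yaml
# Hypothetical structure — the real keys live in the repo's config.yaml.
orchestrator:
  port: 8000
  hotkeys:
    toggle_listening: ctrl+shift+l
    cancel_response: ctrl+shift+c
stt:
  port: 8001
tts:
  port: 8003
```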
MIT License — see LICENSE.