A voice-enabled AI assistant that combines Retrieval-Augmented Generation (RAG) with speech recognition and speech synthesis. It accepts voice or text input and answers using ingested documents, web search, and the conversation history.
- Voice Interaction: Voice input and output using ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) models
- RAG-Powered Knowledge Base: Document retrieval from a vector database, with responses generated from the retrieved passages
- Multi-Modal Interface: Supports both voice and text-based interactions
- Web Search Integration: Searches the internet for up-to-date information
- Conversational Memory: Maintains context through conversation history
- Extensible Tool System: Modular architecture with support for adding new tools and capabilities
- ASR Model: Faster Whisper (Medium variant) for accurate speech recognition
- Embedding Model: Sentence Transformers (all-MiniLM-L6-v2) for document embeddings
- LLM: TinyLlama-1.1B-Chat for response generation
- TTS Model: Glow-TTS for natural speech synthesis
- Vector Database: Qdrant for efficient similarity search
- Web Search: DuckDuckGo integration for real-time information
The system consists of several key components:
- VoiceRAGAgent: Core agent class that orchestrates all components and manages the interaction flow
- AudioHandler: Manages voice input/output, including recording and speech synthesis
- TaskHandler: Processes user queries and determines appropriate actions
- ComponentInitializer: Handles initialization of all AI models and components
- Tools: Modular system including:
  - SearchDocumentsTool: RAG knowledge base search
  - WebSearchTool: Internet search capability
  - SaveNoteTool: Note-taking functionality
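
The modular tool system can be pictured as a small registry of objects that share one interface. The sketch below is an illustration only: the `Tool` base class, the `ToolRegistry` class, and the `run` method are hypothetical names, and the actual interface in Tools.py may differ.

```python
from abc import ABC, abstractmethod


class Tool(ABC):
    """Hypothetical base class; the real interface in Tools.py may differ."""

    name: str = ""
    description: str = ""

    @abstractmethod
    def run(self, query: str) -> str:
        """Execute the tool and return a text result."""


class SaveNoteTool(Tool):
    """Toy stand-in for the note-taking tool: keeps notes in a list."""

    name = "save_note"
    description = "Store a short note in memory."

    def __init__(self) -> None:
        self.notes: list[str] = []

    def run(self, query: str) -> str:
        self.notes.append(query)
        return f"Saved note #{len(self.notes)}"


class ToolRegistry:
    """Maps tool names to instances so an agent can look them up by name."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool name: {tool.name}")
        self._tools[tool.name] = tool

    def run(self, name: str, query: str) -> str:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name].run(query)

    def list_tools(self) -> list[str]:
        return sorted(self._tools)


registry = ToolRegistry()
registry.register(SaveNoteTool())
print(registry.run("save_note", "buy milk"))
```

Under this design, adding a capability means writing one new `Tool` subclass and registering it; the agent loop itself does not change.
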
- Real-time voice input processing with automatic silence detection
- Speech output synthesized with Glow-TTS
- Seamless switching between voice and text modes
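
Automatic silence detection is commonly done by measuring the energy of each audio frame and stopping once several consecutive frames fall below a threshold. The following is a minimal sketch of that idea, not the logic in audio_handler.py: the function names, the RMS threshold of 500, and the three-frame cutoff are all assumptions chosen for illustration.

```python
import math
from typing import Iterable, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def frames_until_silence(
    frames: Iterable[Sequence[int]],
    threshold: float = 500.0,    # assumed RMS cutoff for 16-bit audio
    max_silent_frames: int = 3,  # assumed run of quiet frames that ends recording
) -> int:
    """Return how many frames are consumed before recording should stop."""
    silent_run = 0
    consumed = 0
    for frame in frames:
        consumed += 1
        if rms(frame) < threshold:
            silent_run += 1
            if silent_run >= max_silent_frames:
                break
        else:
            # Any loud frame resets the count of consecutive quiet frames.
            silent_run = 0
    return consumed


loud = [1000] * 4
quiet = [0] * 4
print(frames_until_silence([loud, quiet, loud, quiet, quiet, quiet, loud]))
```

A single quiet frame between two loud ones does not stop the recording, which keeps short pauses in speech from cutting the user off.
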
- RAG-based document retrieval for accurate information access
- Web search integration for real-time information
- Context-aware response generation
- Conversation memory for maintaining context
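
Conversation memory of this kind is often a bounded buffer of recent turns that is rendered into the next prompt. The sketch below assumes that approach; the `ConversationMemory` class and its `max_turns` limit are hypothetical and are not taken from AgentState.py.

```python
from collections import deque


class ConversationMemory:
    """Keeps the most recent (user, assistant) turns; older turns are dropped."""

    def __init__(self, max_turns: int = 5) -> None:
        # max_turns is an assumed limit, not a value taken from config.py
        self._turns = deque(maxlen=max_turns)

    def add(self, user: str, assistant: str) -> None:
        self._turns.append((user, assistant))

    def as_prompt(self) -> str:
        """Render the history as plain text to prepend to the next LLM prompt."""
        lines = []
        for user, assistant in self._turns:
            lines.append(f"User: {user}")
            lines.append(f"Assistant: {assistant}")
        return "\n".join(lines)

    def clear(self) -> None:
        self._turns.clear()

    def __len__(self) -> int:
        return len(self._turns)


memory = ConversationMemory(max_turns=2)
memory.add("Hi", "Hello!")
memory.add("What is RAG?", "Retrieval-augmented generation.")
memory.add("Thanks", "You're welcome.")
print(memory.as_prompt())  # only the last two turns remain
```

Capping the history matters for a small model such as TinyLlama-1.1B-Chat, since every stored turn consumes part of a limited context window.
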
- Voice input control ("voice" to start/stop)
- System status checks
- Memory management
- Tool listing and help commands
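
The command handling can be sketched as a small dispatcher that checks for reserved words before treating input as a query. Only the "voice" toggle is named in this README; the `CommandLoop` class and the "status" and "clear" command names below are hypothetical placeholders.

```python
class CommandLoop:
    """Toy dispatcher; only the "voice" command name comes from the README."""

    def __init__(self) -> None:
        self.voice_mode = False
        self.history: list[str] = []

    def handle(self, line: str) -> str:
        command = line.strip().lower()
        if command == "voice":
            # Toggle: the same word starts and stops voice input.
            self.voice_mode = not self.voice_mode
            return "voice input on" if self.voice_mode else "voice input off"
        if command == "status":  # hypothetical command name
            mode = "voice" if self.voice_mode else "text"
            return f"mode={mode}, turns={len(self.history)}"
        if command == "clear":  # hypothetical command name
            self.history.clear()
            return "memory cleared"
        # Anything else is treated as a user query.
        self.history.append(line)
        return f"query received: {line}"


loop = CommandLoop()
for text in ["voice", "status", "voice", "What is Qdrant?", "status"]:
    print(loop.handle(text))
```
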
- Knowledge Base Queries: Access information from ingested documents with natural language
- Real-time Information: Get up-to-date information through web searches
- Interactive Conversations: Engage in context-aware dialogue
- Voice-First Interaction: Hands-free operation for various tasks
- PDF document processing with chunking
- Vector embedding generation
- Efficient storage in Qdrant vector database
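
The ingestion steps above (chunk, embed, store, then search by similarity) can be illustrated without any external dependencies. This is a toy sketch under stated assumptions: `toy_embed` is a 26-dimensional letter-count vector standing in for the Sentence Transformers model, a plain Python list stands in for Qdrant, and the chunk size and overlap values are arbitrary. It is not the code in data_ingestion.py.

```python
import math


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: counts of the letters a-z."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query (brute-force search)."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
print(top_k("zzz", ["aaaa", "zzzz", "mmmm"], k=1))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.
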
- Speech-to-text conversion
- Query understanding and routing
- Context-aware response generation
- Text-to-speech synthesis
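
The four query steps above can be wired together as one function that takes the speech and tool components as plain callables. The sketch below is a simplified assumption about how such a flow might look: `route_query` uses a keyword heuristic with an invented `WEB_HINTS` set, whereas the actual TaskHandler may route differently (for example, by asking the LLM).

```python
import re
from typing import Callable

WEB_HINTS = {"latest", "today", "news", "current", "weather"}  # assumed keywords


def route_query(query: str) -> str:
    """Pick a tool name for the query using a simple keyword heuristic."""
    text = query.lower().strip()
    if text.startswith(("note ", "remember ")):
        return "save_note"
    words = set(re.findall(r"[a-z]+", text))
    if words & WEB_HINTS:
        return "web_search"
    return "search_documents"  # default: look in the ingested documents


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    tools: dict[str, Callable[[str], str]],
    speak: Callable[[str], bytes],
) -> bytes:
    """Speech-to-text, routing, response generation, then text-to-speech."""
    text = transcribe(audio)
    tool_name = route_query(text)
    answer = tools[tool_name](text)
    return speak(answer)


# Stub components so the flow can be exercised without any models loaded.
stub_tools = {
    name: (lambda q, n=name: f"{n}:{q}")
    for name in ("save_note", "web_search", "search_documents")
}
result = run_pipeline(
    b"what is the latest news?",
    transcribe=lambda a: a.decode(),
    tools=stub_tools,
    speak=lambda t: t.encode(),
)
print(result)
```

Passing the components in as callables keeps the flow testable with stubs, so the routing logic can be checked without loading Whisper, the LLM, or the TTS model.
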
- main_agent.py: Core agent implementation
- audio_handler.py: Voice I/O management
- task_handler.py: Query processing and routing
- Tools.py: Implementation of various tools
- data_ingestion.py & rebuild_db.py: Document processing and storage
- AgentState.py: State management
- config.py: System configuration
The system is configured through config.py, which allows customization of:
- Model selections and parameters
- Audio processing settings
- Database configurations
- System behaviors and timeouts
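
As an illustration of how such settings might be grouped, the sketch below uses a frozen dataclass. The three model names are the ones listed in this README; the `AgentConfig` class, every field name, and the audio, database, and retrieval values are assumptions and do not reflect the actual contents of config.py.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    # Model choices named in this README
    asr_model_size: str = "medium"
    embedding_model: str = "all-MiniLM-L6-v2"
    llm_name: str = "TinyLlama-1.1B-Chat"
    # Everything below is an assumed example value
    sample_rate_hz: int = 16000
    silence_timeout_s: float = 2.0
    qdrant_collection: str = "documents"
    top_k: int = 3


default_config = AgentConfig()
# replace() builds a modified copy; the frozen original stays unchanged.
fast_config = replace(default_config, top_k=1, silence_timeout_s=1.0)
print(fast_config)
```

Freezing the dataclass prevents components from mutating shared settings at runtime; variations are created as explicit copies instead.
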
- Enhanced multi-document support
- Improved context understanding
- Additional tool integrations
- Extended web search capabilities
- Sandeep (@sandeep231004)
- Last Updated: 2025-05-27
This voice-enabled RAG agent combines speech recognition, document retrieval, web search, a compact LLM, and speech synthesis into a single interactive system that accepts voice or text input and returns context-aware responses.