VoiceVisionReasoner is a research prototype that processes speech + visual context together. It produces grounded responses, avoids punitive refusals, and respects user intent.
This project explores multimodal failures of current assistants and proposes a simple approach:
- Whisper ASR (speech → text)
- Vision captioning (image → description)
- Joint reasoning (text + vision → answer)
- Intent-preserving safety checks
Most AI assistants treat modalities as independent:
- Speech → transcription → answer
- Image → caption → answer
This siloed pipeline often produces:
- generic replies
- moralizing responses
- refusals framed as safety
- advice that ignores emotional context
- hallucinations unrelated to user environment
Most of these failures are not caused by adversarial users; they arise because the system fails to interpret human signals in context.
VoiceVisionReasoner attempts to reduce these failures by combining:
- Speech transcription
- Visual description
- Joint reasoning
- Safety and tone checks
```
Audio (wav)     → Whisper ASR   → Transcript
Image (jpg/png) → Caption Model → Caption

Transcript + Caption → LLM → Reasoned Answer
                              ↓
                      Tone / Safety Analysis
```
Whisper generates transcription from spoken audio.
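A minimal sketch of this step, assuming the open-source `openai-whisper` package and the sample audio file shipped with the repo:

```python
# Transcription sketch (assumes the open-source `openai-whisper` package).
import whisper

# Load a small checkpoint; larger models trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe the spoken query; result["text"] holds the transcript string.
result = model.transcribe("examples/query.wav")
transcript = result["text"].strip()
print(transcript)
```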
A captioning model describes the visible environment.
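A sketch of the captioning step, assuming a BLIP checkpoint from Hugging Face (the repository's actual caption model may differ); `examples/desk.png` is the sample image used in the quick-start command below:

```python
# Captioning sketch using a BLIP checkpoint (one possible choice of caption model).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("examples/desk.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Generate a short natural-language description of the visible environment.
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)
print(caption)
```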
A language model produces answers grounded in transcript + caption.
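A sketch of the joint-reasoning step: transcript and caption are combined into one grounded prompt. `ask_llm` is a hypothetical placeholder for whichever LLM backend the project uses.

```python
# Joint-reasoning sketch: combine transcript + caption into a single grounded prompt.
# `ask_llm` is a placeholder; wire it to your LLM client of choice.
def build_prompt(transcript: str, caption: str) -> str:
    return (
        "You are a helpful multimodal assistant.\n"
        f'The user said: "{transcript}"\n'
        f'Their surroundings look like: "{caption}"\n'
        "Answer with concrete, non-judgmental steps grounded in what is visible."
    )

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

# Example: answer = ask_llm(build_prompt(transcript, caption))
```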
Lightweight checks detect:
- toxic tone
- hallucinations
- punitive refusals
- moralizing language
The system repairs intent only when needed.
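For illustration, a hypothetical keyword-based version of such a check might look like the following; the real checks could be model-based, and hallucination detection would additionally compare the answer against the transcript and caption.

```python
# Illustrative, hypothetical tone check: flag punitive-refusal or moralizing phrasing
# with simple pattern matching.
import re

REFUSAL_PATTERNS = [r"\bI (can't|cannot|won't) help\b", r"\bI'm not able to assist\b"]
MORALIZING_PATTERNS = [r"\byou should(?:n't| not) have\b", r"\bit is irresponsible\b"]

def flag_response(answer: str) -> dict:
    """Return per-category boolean flags for a draft answer."""
    def hit(patterns):
        return any(re.search(p, answer, re.IGNORECASE) for p in patterns)
    return {
        "punitive_refusal": hit(REFUSAL_PATTERNS),
        "moralizing": hit(MORALIZING_PATTERNS),
    }
```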
Typical LLMs treat:
- emotion as danger
- uncertainty as hallucination
- user vulnerability as policy violation
VoiceVisionReasoner takes a different stance:
Safety = collaboration, not punishment. Use context → propose constructive actions → preserve user intent.
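A hypothetical sketch of that flow, reusing the placeholder `ask_llm` and the flag dictionary from the sketches above (not the project's actual implementation):

```python
# Hypothetical intent-preserving repair: if a draft answer was flagged, re-prompt
# for a constructive rewrite instead of refusing.
def repair_if_needed(draft_answer: str, flags: dict) -> str:
    if not any(flags.values()):
        return draft_answer  # nothing to fix; keep the original answer
    repair_prompt = (
        "Rewrite the answer below so it stays helpful, keeps the user's original goal, "
        "and avoids lecturing, moralizing, or refusing:\n\n" + draft_answer
    )
    return ask_llm(repair_prompt)
```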
```bash
git clone https://github.com/kritibehl/VoiceVisionReasoner.git
cd VoiceVisionReasoner
pip install -r requirements.txt
```

```bash
python app.py --audio_path examples/query.wav --image_path examples/desk.png
```
The program outputs:
- speech transcript
- image caption
- joint reasoning answer
- safety/tone indicators
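For illustration, the combined result could be represented as a structure like the following (field names are hypothetical; values are taken from the example below):

```python
# Hypothetical shape of the combined output (field names are illustrative only).
example_output = {
    "transcript": "Why does my workspace make me feel stressed?",
    "caption": "a cluttered desk with cables, notebooks, and unopened mail",
    "answer": "Clutter increases decision fatigue. Keep items you use daily within reach. ...",
    "safety": {"punitive_refusal": False, "moralizing": False},
}
```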
User (spoken): “Why does my workspace make me feel stressed?”
Image: A cluttered desk with cables, notebooks, and unopened mail.
System output: Clutter increases decision fatigue. Keep items you use daily within reach. Move infrequently used items into a drawer or box. Take a 5-minute break afterward.
The answer is grounded, actionable, and non-judgmental.
- FairEval-Suite: human-aligned evaluation for LLM responses; assesses clarity, relevance, tone, and hallucination risk. https://github.com/kritibehl/FairEval-Suite
- JailBreakDefense: intent-preserving prompt repair; converts adversarial or distressed prompts into constructive goals. https://github.com/kritibehl/JailBreakDefense

VoiceVisionReasoner extends the principles behind these projects to multimodal contexts.
- Transparency: show intermediate steps.
- User respect: avoid moralizing or punitive refusals.
- Visual grounding: anchor reasoning in observed context.
- Actionable assistance: offer specific steps, not disclaimers.
- Latency not optimized
- Captioning errors can propagate
- Safety layer minimal
- No personalization or memory
- Not clinical or crisis tooling
- Prototype, not production
- Real-time multimodal reasoning
- Accessibility for low-vision users
- Emotion-aware speech analysis
- Caption confidence scoring
- Integration with FairEval metrics
MIT License. Research contributions are welcome.
Kriti Behl
GitHub: https://github.com/kritibehl
Medium: https://medium.com/@kriti0608
Zenodo: (FairEval DOI linked in main repository)