An advanced speech recognition system that goes beyond transcription to detect emotional context in spoken language. This student project combines OpenAI's Whisper for speech recognition with emotion detection models to analyze both vocal patterns and textual content.
This is an individual student project developed to explore the intersection of speech recognition and emotion detection. The goal was to create a practical application that demonstrates how AI can understand not just the words we say, but the emotional context behind them.
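At its core, the pipeline transcribes speech with Whisper and then scores emotions twice: once on the raw audio and once on the transcript. Here is a minimal sketch of that idea, assuming the `openai-whisper` and `transformers` packages; the `analyze` function and its structure are illustrative, not the project's actual API:

```python
# Minimal sketch of the dual-analysis idea; assumes `openai-whisper` and
# `transformers` are installed. Names here are illustrative, not the repo's API.
import whisper
from transformers import pipeline

def analyze(audio_path: str):
    # 1) Speech-to-text with Whisper (language is auto-detected).
    asr_model = whisper.load_model("small")
    transcript = asr_model.transcribe(audio_path)["text"]

    # 2) Emotion from vocal patterns, straight from the waveform.
    audio_emotions = pipeline(
        "audio-classification", model="superb/wav2vec2-base-superb-er"
    )(audio_path)

    # 3) Emotion from the words themselves, via the transcript.
    text_emotions = pipeline(
        "text-classification",
        model="j-hartmann/emotion-english-distilroberta-base",
        top_k=None,
    )(transcript)

    return transcript, audio_emotions, text_emotions
```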
## Features

- Dual Emotion Analysis: Detects emotions from both audio characteristics and textual content
- Multi-language Support: Automatic language detection with Whisper ASR
- Real-time Processing: Fast analysis with cached models for better performance
- Interactive Visualizations: Beautiful charts comparing audio vs. text emotions (see the sketch after this list)
- Web Interface: User-friendly Streamlit application
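As an illustration of the visualization feature, here is a hedged sketch of charting audio and text scores side by side in Streamlit; the numbers below are placeholders, and the real app derives them from the models:

```python
# Placeholder sketch of the audio-vs-text comparison chart; the real app
# computes these scores from the two emotion models instead of hard-coding them.
import pandas as pd
import streamlit as st

scores = pd.DataFrame(
    {
        "audio": {"anger": 0.10, "happiness": 0.60, "neutral": 0.20, "sadness": 0.10},
        "text": {"anger": 0.05, "happiness": 0.70, "neutral": 0.15, "sadness": 0.10},
    }
)
st.bar_chart(scores)  # one row per emotion, one series per modality
```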
Check out the deployed website here!
## Installation

- Clone the repository

  ```bash
  git clone https://github.com/antarades/emotion-aware-automatic-speech-recognition.git
  cd emotion-aware-automatic-speech-recognition
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Install system dependencies (for audio processing)

  ```bash
  # macOS
  brew install ffmpeg

  # Windows
  choco install ffmpeg
  ```
## Usage

Launch the web interface:

```bash
streamlit run app.py
```

Or run the CLI pipeline directly:

```bash
# Analyze an audio file
python src/pipeline.py --mode file --audio_file path/to/audio.wav

# Record and analyze audio via terminal
python src/pipeline.py --mode record
```
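Both CLI modes share one entry point. As a hypothetical sketch (the actual argument handling in `src/pipeline.py` may differ), the flags above could be wired up with `argparse` like this:

```python
# Hypothetical sketch of how the CLI flags above could be parsed; the actual
# argument handling in src/pipeline.py may differ.
import argparse

parser = argparse.ArgumentParser(description="Emotion-aware ASR pipeline")
parser.add_argument("--mode", choices=["file", "record"], required=True,
                    help="analyze an existing file or record from the microphone")
parser.add_argument("--audio_file", help="path to the audio file (file mode only)")
args = parser.parse_args()
```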
## Project Structure

```
emotion-aware-asr/
├── src/
│   ├── asr_whisper.py       # Whisper ASR wrapper
│   ├── emotion_model.py     # Audio and text emotion models
│   ├── pipeline.py          # CLI pipeline
│   └── record_audio.py      # Terminal audio recording utility
├── app.py                   # Streamlit web application
├── home-image.svg           # Home page illustration
├── requirements.txt
├── packages.txt
└── README.md
```
## Models

- ASR: OpenAI Whisper (small variant) for accurate speech-to-text conversion with multi-language support
- Audio Analysis: `superb/wav2vec2-base-superb-er` detects anger, happiness, neutrality, and sadness from vocal patterns
- Text Analysis: `j-hartmann/emotion-english-distilroberta-base` detects joy, sadness, anger, fear, surprise, disgust, and neutrality from text content
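One practical detail: the SUPERB ER checkpoint emits abbreviated labels such as `hap` and `neu`, so some mapping to readable names is presumably applied before results are displayed. A hedged sketch, with the mapping assumed rather than taken from the repo:

```python
# Hedged example: the SUPERB ER checkpoint emits abbreviated labels (e.g. "hap"),
# so a mapping to readable names is assumed here before printing.
from transformers import pipeline

LABELS = {"ang": "anger", "hap": "happiness", "neu": "neutral", "sad": "sadness"}

audio_clf = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")
for result in audio_clf("clip.wav"):  # top predictions with confidence scores
    print(LABELS.get(result["label"], result["label"]), round(result["score"], 3))
```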
## Future Enhancements

- Web-based Audio Recording: Direct audio recording capability within the web interface
- Additional Language Support: Expanded emotion detection for non-English languages
## Customization

You can customize the application by adjusting the options below (the first two are sketched after the list):

- Model Size: Change the Whisper model size in `app.py` (tiny, base, small, medium, large)
- Language Forcing: Force a specific language in the `transcribe_audio` function instead of relying on auto-detection
- Emotion Thresholds: Modify the confidence thresholds in `emotion_model.py`
- UI Styling: Customize the Streamlit interface in the CSS section of `app.py`
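For the first two options, here is an illustrative snippet of what the change looks like with the `openai-whisper` API; the exact code in `app.py` and `transcribe_audio` may differ:

```python
# Illustrative snippet for the Model Size and Language Forcing options;
# the exact code in app.py and transcribe_audio may differ.
import whisper

model = whisper.load_model("base")  # swap "small" for tiny/base/medium/large
result = model.transcribe("clip.wav", language="en")  # force English, skip auto-detect
print(result["text"])
```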
## Performance Notes

- The small Whisper model provides a good balance between accuracy and speed
- Emotion detection takes approximately 2-5 seconds, depending on audio length
- Models are cached after first load for faster subsequent processing (a caching sketch follows this list)
- Audio recording is currently terminal-based; web recording is planned for a future version
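A plausible pattern behind the caching note is Streamlit's `st.cache_resource`, which keeps a loaded model alive across reruns; the app's actual implementation may differ:

```python
# A plausible caching pattern behind "cached after first load", using
# Streamlit's cache_resource; the app's actual implementation may differ.
import streamlit as st
import whisper

@st.cache_resource  # loaded once per server process, reused across reruns
def get_asr_model(size: str = "small"):
    return whisper.load_model(size)
```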
## License

This project is licensed under the MIT License.
Built by Antara Srivastava 📧 [email protected] 🌐 github.com/antarades


