This repository contains my complete solution for the SHL Hiring Assessment hosted on Kaggle. The task is to predict grammar proficiency scores (0 to 5) from audio clips of spoken English by candidates. I explored multiple deep learning and machine learning approaches using both audio signals and transcribed text.
Final Best Approach: A multi-modal ensemble combining audio features + Whisper transcripts, fed into an MLP head.
Given a dataset of audio responses and grammar scores:
- Predict a continuous grammar score for new audio clips.
- Evaluation Metric: Mean Squared Error (MSE)
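The metric can be computed in a few lines of NumPy (a minimal sketch; the scores shown are illustrative, not from the dataset):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Example: predicted vs. ground-truth grammar scores
print(mse([4.0, 2.5, 5.0], [3.5, 3.0, 4.5]))  # → 0.25
```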
Each training sample includes:
- `.wav` audio file (spoken answer)
- `label` (grammar proficiency, float between 0 and 5)
- Used `librosa` to extract:
  - Waveform features
- Models:
  - `facebook/wav2vec2-base-960h` for audio features
  - `XGBoostRegressor` with `RandomizedSearchCV` for tuning
  - Deep MLP Regressor for the required output
- Insights:
  - Fast to compute, but limited by shallow semantics
- Used Whisper to transcribe audio to text
- Processed text with:
  - BERT tokenizer + embeddings (`bert-base-uncased`)
- Fed into:
  - MLP
- Strength: Captured syntactic and grammatical errors well
- Used `facebook/wav2vec2-base-960h` to extract embeddings from raw waveforms
- Pros:
  - Learned rich acoustic representations
- Combined:
  - WavLM audio embeddings
  - Whisper transcripts → BERT embeddings
- Concatenated into a single feature vector
- Fed into a custom MLP regressor
- Result: lowest MSE on the validation set
```
Input (Audio Features + Text Embeddings)
                 ↓
           Concatenation
                 ↓
            BatchNorm1d
                 ↓
       MLP (ReLU + Dropout)
                 ↓
            Linear Out
                 ↓
           Grammar Score
```
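The diagram above maps onto a small PyTorch module like the following. This is a sketch: the hidden width, dropout rate, and 768-dimensional inputs are assumptions, not the repo's tuned values.

```python
import torch
import torch.nn as nn

class GrammarMLP(nn.Module):
    """Regression head matching the diagram (dimensions are illustrative)."""
    def __init__(self, audio_dim=768, text_dim=768, hidden=256, p_drop=0.3):
        super().__init__()
        in_dim = audio_dim + text_dim              # concatenated multimodal vector
        self.net = nn.Sequential(
            nn.BatchNorm1d(in_dim),                # BatchNorm1d
            nn.Linear(in_dim, hidden),             # MLP
            nn.ReLU(),                             # (ReLU
            nn.Dropout(p_drop),                    #  + Dropout)
            nn.Linear(hidden, 1),                  # Linear Out
        )

    def forward(self, audio_emb, text_emb):
        x = torch.cat([audio_emb, text_emb], dim=1)   # Concatenation
        return self.net(x).squeeze(1)                 # Grammar score per clip

model = GrammarMLP()
scores = model(torch.randn(4, 768), torch.randn(4, 768))  # shape (4,)
```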