Spoken "Hey Ditto" activation using deep learning. Model trained on both synthetic and real human voices along with samples of background noise from various scenes around the world.
- Install required packages: `pip install -r requirements.txt`
- Run `python main.py` to test activation on your default mic.
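A rough sketch of what a live-mic test loop might look like, assuming the `sounddevice` package and a saved Keras model that takes a raw 1-second, 16 kHz waveform; the model path and the activation threshold below are illustrative placeholders, not the repo's actual values:

```python
# Minimal sketch of a live-mic activation loop (sounddevice + Keras model).
# The model path and the 0.9 threshold are illustrative, not the repo's values.
import numpy as np
import sounddevice as sd
import tensorflow as tf

SAMPLE_RATE = 16000  # assumed: 1-second clips at 16 kHz
model = tf.keras.models.load_model("models/heydittonet_v3.keras")  # hypothetical path

while True:
    # Record one second from the default microphone and block until done.
    clip = sd.rec(SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    prob = float(model.predict(clip.squeeze()[np.newaxis, ...], verbose=0).squeeze())
    if prob > 0.9:  # illustrative activation threshold
        print("Hey Ditto detected!")
```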
The latest architecture removes the LSTM in favor of a pure CNN approach with Squeeze-and-Excitation attention blocks. Research shows that for 1-second keyword spotting, LSTMs are unnecessary and pure CNNs achieve state-of-the-art results.
Key Features:
- Depthwise Separable Convolutions - Efficient like MobileNet, reduces parameters
- Squeeze-and-Excitation (SE) Attention - Learns which frequency channels are important
- Global Average Pooling - More robust than flatten, replaces LSTM
- Progressive Dropout - 0.1 → 0.15 → 0.2 → 0.25 → 0.5 prevents overfitting
- L2 Regularization - On all conv and dense layers
- No LSTM - Pure CNN is sufficient for 1-second audio! (A minimal sketch of these blocks follows this list.)
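Below is a minimal Keras sketch of the building blocks listed above (depthwise-separable convolution, SE attention, global average pooling, progressive dropout, L2 regularization); the layer counts, dropout schedule, input shape, and L2 factor are illustrative, not the repo's exact configuration:

```python
# Sketch of a depthwise-separable CNN with Squeeze-and-Excitation attention.
# All sizes, dropout rates, and the L2 factor are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

L2 = regularizers.l2(1e-4)  # shared L2 penalty for conv and dense layers

def se_block(x, reduction=8):
    """Squeeze-and-Excitation: reweight channels by learned importance."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                      # squeeze to one value per channel
    s = layers.Dense(channels // reduction, activation="relu",
                     kernel_regularizer=L2)(s)                   # bottleneck
    s = layers.Dense(channels, activation="sigmoid",
                     kernel_regularizer=L2)(s)                   # per-channel weights in [0, 1]
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                             # rescale feature map

def ds_conv_block(x, filters, dropout):
    """Depthwise-separable conv -> BN -> ReLU -> SE -> pool -> dropout."""
    x = layers.SeparableConv2D(filters, 3, padding="same",
                               depthwise_regularizer=L2,
                               pointwise_regularizer=L2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = se_block(x)
    x = layers.MaxPooling2D()(x)
    return layers.Dropout(dropout)(x)                            # progressive dropout per block

inputs = layers.Input(shape=(64, 64, 1))        # illustrative spectrogram shape (time, mel, 1)
x = ds_conv_block(inputs, 32, 0.1)
x = ds_conv_block(x, 64, 0.15)
x = ds_conv_block(x, 128, 0.2)
x = layers.GlobalAveragePooling2D()(x)          # replaces flatten / LSTM
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid", kernel_regularizer=L2)(x)
model = tf.keras.Model(inputs, outputs)
```

The SE block is the channel-attention step: global average pooling squeezes each channel to a scalar, a small bottleneck MLP produces per-channel weights, and those weights rescale the feature map before pooling.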
Training Metrics (v3):
The previous architecture used a simplified CNN backbone with a Bidirectional LSTM for temporal modeling.
Training Metrics (v2):
The original architecture used a TimeDistributed CNN wrapper to process split spectrograms.
For detailed interactive architecture diagrams, see the docs/ folder:
- `docs/index.html` - Overview of all architectures
- `docs/heydittonet_v1.html` - v1 architecture details
- `docs/heydittonet_v2.html` - v2 architecture details
- `docs/heydittonet_v3.html` - v3 architecture details
Modern keyword spotting research (TC-ResNet, FCA-Net) shows that:
- 1-second audio doesn't need long-range temporal modeling
- Spectrograms already encode time on one axis - a CNN can learn temporal patterns (see the feature-extraction sketch after this list)
- Pure CNN is simpler, faster, and easier to deploy to edge devices
- Attention mechanisms (SE blocks) effectively capture channel importance
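To make the spectrogram point concrete, here is a sketch of turning a 1-second clip into a log-mel spectrogram whose first axis is time, assuming `librosa`; the library choice, sample rate, mel-band count, and hop length are all assumptions and may differ from the repo's actual feature frontend:

```python
# Sketch of feature extraction for a 1-second clip, assuming librosa.
# Sample rate, mel-band count, FFT size, and hop length are illustrative only.
import librosa
import numpy as np

SAMPLE_RATE = 16000

def log_mel_spectrogram(audio: np.ndarray) -> np.ndarray:
    """1 s of audio -> (time, mel, 1) log-spectrogram, with time along one axis."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SAMPLE_RATE, n_mels=64, n_fft=512, hop_length=256)
    log_mel = librosa.power_to_db(mel)           # shape: (n_mels, frames)
    return log_mel.T[..., np.newaxis]            # transpose so time is the first axis

audio = np.zeros(SAMPLE_RATE, dtype=np.float32)  # placeholder 1-second clip
print(log_mel_spectrogram(audio).shape)          # (63, 64, 1) with these settings
```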
Set the model version in main.py:
MODEL_SELECT = 2 # 0 for v1, 1 for v2, 2 for v3
TRAIN = True # Set to True to train



