
HeyDittoNet

Spoken "Hey Ditto" activation using deep learning. Model trained on both synthetic and real human voices along with samples of background noise from various scenes around the world.

Getting Started

  1. Install required packages: pip install -r requirements.txt
  2. Run python main.py to test activation on your default microphone.

Model Versions

HeyDittoNet v3 (Latest) - Pure CNN with Attention

The latest architecture removes the LSTM in favor of a pure CNN with Squeeze-and-Excitation attention blocks. Research shows that for 1-second keyword spotting, LSTMs are unnecessary and pure CNNs achieve state-of-the-art results. The core building blocks are sketched in code after the feature list below.

Key Features:

  • Depthwise Separable Convolutions - MobileNet-style efficiency with far fewer parameters than standard convolutions
  • Squeeze-and-Excitation (SE) Attention - Learns which feature channels matter most
  • Global Average Pooling - More robust than flattening; replaces the LSTM
  • Progressive Dropout - 0.1 → 0.15 → 0.2 → 0.25 → 0.5 prevents overfitting
  • L2 Regularization - Applied to all conv and dense layers
  • No LSTM - A pure CNN is sufficient for 1-second audio!
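
A minimal Keras sketch of how these pieces fit together. The input shape, filter counts, and SE ratio are illustrative assumptions, not the repo's actual hyperparameters; only the dropout schedule follows the values listed above.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

def se_block(x, ratio=8):
    # Squeeze-and-Excitation: reweight feature channels by learned importance
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)               # squeeze
    s = layers.Dense(channels // ratio, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)  # excite
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                     # rescale channels

def conv_block(x, filters, dropout):
    # Depthwise separable conv + SE attention + pooling + dropout
    x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu",
                               pointwise_regularizer=regularizers.l2(1e-4))(x)
    x = layers.BatchNormalization()(x)
    x = se_block(x)
    x = layers.MaxPooling2D()(x)
    return layers.Dropout(dropout)(x)

inputs = layers.Input(shape=(64, 64, 1))  # (time frames, mel bins, 1) -- assumed
x = inputs
for filters, rate in [(32, 0.1), (64, 0.15), (128, 0.2), (256, 0.25)]:
    x = conv_block(x, filters, rate)      # progressive dropout per block
x = layers.GlobalAveragePooling2D()(x)    # replaces flatten/LSTM
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid",
                       kernel_regularizer=regularizers.l2(1e-4))(x)
model = tf.keras.Model(inputs, outputs)

Global average pooling collapses each feature map to a single value, which is what makes a flatten or recurrent stage unnecessary here.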

[Figure: HeyDittoNet v3 architecture diagram]

Training Metrics (v3):

[Figure: HeyDittoNet v3 training metrics]


HeyDittoNet v2 - CNN + Bidirectional LSTM

The previous architecture uses a simplified CNN backbone with a Bidirectional LSTM for temporal modeling.
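
A hedged sketch of this pattern, assuming a 64x64 spectrogram input (the real layer sizes may differ): the CNN output is reshaped so its time axis becomes the LSTM's sequence axis.

import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(64, 64, 1))      # (time frames, mel bins, 1) -- assumed
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D()(x)                  # -> (32, 32, 32)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D()(x)                  # -> (16, 16, 64)
x = layers.Reshape((16, 16 * 64))(x)          # keep time axis, flatten the rest
x = layers.Bidirectional(layers.LSTM(64))(x)  # temporal modeling in both directions
outputs = layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)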

[Figure: HeyDittoNet v2 architecture diagram]

Training Metrics (v2):

[Figure: HeyDittoNet v2 training loss]


HeyDittoNet v1 - TimeDistributed CNN + LSTM

The original architecture uses a TimeDistributed CNN wrapper to process spectrograms split into sequential chunks.
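
A hedged sketch of the idea, with an assumed split of a 64-frame spectrogram into 8 chunks (the actual chunk count and sizes may differ): a small CNN embeds each chunk, and an LSTM models the chunk sequence.

import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(8, 8, 64, 1))  # (chunks, frames per chunk, mel bins, 1) -- assumed
cnn = tf.keras.Sequential([                 # small CNN applied to each chunk
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
])
x = layers.TimeDistributed(cnn)(inputs)     # embed each chunk independently
x = layers.LSTM(64)(x)                      # model the sequence of chunk embeddings
outputs = layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)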

[Figure: HeyDittoNet v1 architecture diagram]

Architecture Documentation

For detailed interactive architecture diagrams, see the docs/ folder:

  • docs/index.html - Overview of all architectures
  • docs/heydittonet_v1.html - v1 architecture details
  • docs/heydittonet_v2.html - v2 architecture details
  • docs/heydittonet_v3.html - v3 architecture details

Why v3 Removes LSTM

Modern keyword spotting research (TC-ResNet, FCA-Net) shows that:

  1. 1-second audio doesn't need long-range temporal modeling
  2. Spectrograms already encode time on one axis, so a CNN can learn temporal patterns directly (see the snippet after this list)
  3. A pure CNN is simpler, faster, and easier to deploy to edge devices
  4. Attention mechanisms (SE blocks) effectively capture channel importance
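
Point 2 is easy to see from the feature shape. A quick illustration using librosa (an assumption for this example; the repo's actual preprocessing may differ):

import numpy as np
import librosa

y = np.random.randn(16000).astype(np.float32)  # 1 second of (fake) 16 kHz audio
mel = librosa.feature.melspectrogram(y=y, sr=16000, n_mels=64)
print(mel.shape)  # (64, 32): mel bins x time frames -- a 2-D image a CNN can convolve over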

Configuration

Set the model version in main.py:

MODEL_SELECT = 2  # 0 for v1, 1 for v2, 2 for v3
TRAIN = True      # Set to True to train
