Spoken "Hey Ditto" activation using deep learning. Model trained on both synthetic and real human voices along with samples of background noise from various scenes around the world.
- Install required packages: `pip install -r requirements.txt`
- Run `python main.py` to test activation on your default mic.
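A rough sketch of what a live-mic test loop might look like, assuming the `sounddevice` package and a saved Keras model that takes a raw 1-second, 16 kHz waveform; the model path and the activation threshold below are illustrative placeholders, not the repo's actual values:

```python
# Minimal sketch of a live-mic activation loop (sounddevice + Keras model).
# The model path and the 0.9 threshold are illustrative, not the repo's values.
import numpy as np
import sounddevice as sd
import tensorflow as tf

SAMPLE_RATE = 16000  # assumed: 1-second clips at 16 kHz
model = tf.keras.models.load_model("models/heydittonet_v3.keras")  # hypothetical path

while True:
    # Record one second from the default microphone and block until done.
    clip = sd.rec(SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    prob = float(model.predict(clip.squeeze()[np.newaxis, ...], verbose=0).squeeze())
    if prob > 0.9:  # illustrative activation threshold
        print("Hey Ditto detected!")
```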
The latest architecture removes the LSTM in favor of a pure CNN approach with Squeeze-and-Excitation attention blocks. Research shows that for 1-second keyword spotting, LSTMs are unnecessary and pure CNNs achieve state-of-the-art results.
Key Features:
- Depthwise Separable Convolutions - Efficient like MobileNet, reduces parameters
- Squeeze-and-Excitation (SE) Attention - Learns which frequency channels are important
- Global Average Pooling - More robust than flatten, replaces LSTM
- Progressive Dropout - 0.1 → 0.15 → 0.2 → 0.25 → 0.5 prevents overfitting
- L2 Regularization - On all conv and dense layers
- No LSTM - Pure CNN is sufficient for 1-second audio! (A minimal sketch of these blocks follows this list.)
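Below is a minimal Keras sketch of the building blocks listed above (depthwise-separable convolution, SE attention, global average pooling, progressive dropout, L2 regularization); the layer counts, dropout schedule, input shape, and L2 factor are illustrative, not the repo's exact configuration:

```python
# Sketch of a depthwise-separable CNN with Squeeze-and-Excitation attention.
# All sizes, dropout rates, and the L2 factor are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

L2 = regularizers.l2(1e-4)  # shared L2 penalty for conv and dense layers

def se_block(x, reduction=8):
    """Squeeze-and-Excitation: reweight channels by learned importance."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                      # squeeze to one value per channel
    s = layers.Dense(channels // reduction, activation="relu",
                     kernel_regularizer=L2)(s)                   # bottleneck
    s = layers.Dense(channels, activation="sigmoid",
                     kernel_regularizer=L2)(s)                   # per-channel weights in [0, 1]
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                             # rescale feature map

def ds_conv_block(x, filters, dropout):
    """Depthwise-separable conv -> BN -> ReLU -> SE -> pool -> dropout."""
    x = layers.SeparableConv2D(filters, 3, padding="same",
                               depthwise_regularizer=L2,
                               pointwise_regularizer=L2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = se_block(x)
    x = layers.MaxPooling2D()(x)
    return layers.Dropout(dropout)(x)                            # progressive dropout per block

inputs = layers.Input(shape=(64, 64, 1))        # illustrative spectrogram shape (time, mel, 1)
x = ds_conv_block(inputs, 32, 0.1)
x = ds_conv_block(x, 64, 0.15)
x = ds_conv_block(x, 128, 0.2)
x = layers.GlobalAveragePooling2D()(x)          # replaces flatten / LSTM
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid", kernel_regularizer=L2)(x)
model = tf.keras.Model(inputs, outputs)
```

The SE block is the channel-attention step: global average pooling squeezes each channel to a scalar, a small bottleneck MLP produces per-channel weights, and those weights rescale the feature map before pooling.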
Training Metrics (v3):
The previous architecture used a simplified CNN backbone with a Bidirectional LSTM for temporal modeling.
Training Metrics (v2):
The original architecture used a TimeDistributed CNN wrapper to process split spectrograms.
For detailed interactive architecture diagrams, see the docs/ folder:
- `docs/index.html` - Overview of all architectures
- `docs/heydittonet_v1.html` - v1 architecture details
- `docs/heydittonet_v2.html` - v2 architecture details
- `docs/heydittonet_v3.html` - v3 architecture details
Modern keyword spotting research (TC-ResNet, FCA-Net) shows that:
- 1-second audio doesn't need long-range temporal modeling
- Spectrograms already encode time on one axis - a CNN can learn temporal patterns (see the feature-extraction sketch after this list)
- Pure CNN is simpler, faster, and easier to deploy to edge devices
- Attention mechanisms (SE blocks) effectively capture channel importance
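To make the spectrogram point concrete, here is a sketch of turning a 1-second clip into a log-mel spectrogram whose first axis is time, assuming `librosa`; the library choice, sample rate, mel-band count, and hop length are all assumptions and may differ from the repo's actual feature frontend:

```python
# Sketch of feature extraction for a 1-second clip, assuming librosa.
# Sample rate, mel-band count, FFT size, and hop length are illustrative only.
import librosa
import numpy as np

SAMPLE_RATE = 16000

def log_mel_spectrogram(audio: np.ndarray) -> np.ndarray:
    """1 s of audio -> (time, mel, 1) log-spectrogram, with time along one axis."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SAMPLE_RATE, n_mels=64, n_fft=512, hop_length=256)
    log_mel = librosa.power_to_db(mel)           # shape: (n_mels, frames)
    return log_mel.T[..., np.newaxis]            # transpose so time is the first axis

audio = np.zeros(SAMPLE_RATE, dtype=np.float32)  # placeholder 1-second clip
print(log_mel_spectrogram(audio).shape)          # (63, 64, 1) with these settings
```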
Set the model version in main.py:
MODEL_SELECT = 2 # 0 for v1, 1 for v2, 2 for v3
TRAIN = True # Set to True to train



