Author: BillMillerCoding
This repository contains a character-level, CTC-based speech recognition system built from the ground up using PyTorch and torchaudio. The project explores how modern speech recognition systems work internally, without relying on large pretrained models.
The system supports:
- Training on the LibriSpeech dataset
- Offline transcription of `.wav` files
- Live microphone transcription
- GPU acceleration via CUDA
The codebase is structured as a clean, modular pipeline covering data loading, preprocessing, modeling, training, checkpointing, and inference.
This project makes extensive use of PyTorch checkpoint files with the .pth extension. These files are central to training, resuming experiments, and deploying trained models.
Trained model checkpoints (.pth) are stored using Git Large File Storage (LFS).
Ensure Git LFS is installed before cloning this repository.
A `.pth` file is a serialized PyTorch object created using `torch.save()`. In this project, checkpoints store:
- Model weights (`model_state_dict`)
- Optimizer state (`optimizer_state_dict`)
- Training progress (epoch number)
Rather than saving the entire model object, only the state dictionaries are saved (see the sketch after this list). This approach:
- Avoids serialization issues
- Keeps checkpoints architecture-agnostic
- Requires the model class to be re-instantiated before loading
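A minimal sketch of this save/restore pattern (the function names, file name, and `map_location` choice here are illustrative, not the project's exact code; the dictionary keys match those listed above):

```python
# Minimal sketch of the state-dict checkpoint pattern (illustrative only).
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pth"):
    torch.save({
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pth"):
    # The model (and optimizer) must already be constructed with the right
    # architecture; only their parameters are restored from the file.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["epoch"]
```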
- Training: Runs can resume from any saved checkpoint, restoring both model parameters and optimizer state.
- Inference: Trained checkpoints can be loaded for offline or live speech recognition.
- Experimentation: Multiple checkpoints represent different stages of training and allow comparison between runs.
Many of the main entry points in this repository allow interactive selection of a checkpoint from the current directory at runtime.
Note: Checkpoints are roughly ordered by training progress (usually reflected in their filenames). Higher-numbered checkpoints generally represent later stages of training. Some checkpoints were trained on CPU, while later ones were trained on GPU.
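For illustration only, a checkpoint picker of this kind can be as simple as the following (the actual prompts and sorting in the scripts may differ):

```python
# Illustrative checkpoint picker: list .pth files in the current directory
# and ask the user to choose one by index.
from pathlib import Path

def choose_checkpoint() -> Path:
    checkpoints = sorted(Path(".").glob("*.pth"))
    for i, ckpt in enumerate(checkpoints):
        print(f"[{i}] {ckpt.name}")
    index = int(input("Select a checkpoint: "))
    return checkpoints[index]
```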
The speech recognition model follows a classic CNN + BiLSTM + CTC design:
- Input: log-scaled mel spectrograms
- Convolutional layers: extract time-frequency features and reduce temporal resolution
- Bidirectional LSTM: models long-range temporal dependencies
- Linear classifier: outputs character probabilities per timestep
- CTC loss: aligns variable-length audio with variable-length text
The model operates at the character level, predicting letters, spaces, and special tokens.
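A minimal sketch of this kind of architecture is shown below; the layer counts, channel widths, kernel sizes, and character-set size are assumptions for illustration, not the exact configuration used in the repository:

```python
# Sketch of a CNN + BiLSTM + CTC model. All hyperparameters are assumptions.
import torch
import torch.nn as nn

class SpeechRecognizer(nn.Module):
    def __init__(self, n_mels: int = 80, n_chars: int = 29, hidden: int = 256):
        super().__init__()
        # 2D convolutions over (mel bins, time); each stride-2 layer halves both axes
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.rnn = nn.LSTM(
            input_size=32 * (n_mels // 4), hidden_size=hidden,
            num_layers=2, bidirectional=True, batch_first=True,
        )
        # n_chars: e.g. CTC blank + 26 letters + space + apostrophe
        self.classifier = nn.Linear(2 * hidden, n_chars)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time) log-mel spectrogram
        x = self.cnn(spec)                               # (batch, 32, n_mels/4, time/4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (batch, time/4, features)
        x, _ = self.rnn(x)
        return self.classifier(x)                        # per-timestep character logits

# Training would pair these logits (as log-probabilities of shape
# (time, batch, classes)) with nn.CTCLoss(blank=0).
```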
This project began as an experiment to see whether a speech recognition system could be built from near-scratch with a deep understanding of each component.
Research into industry-standard speech recognition pipelines pointed to mel spectrograms as a way to transform audio into a format compatible with image-inspired neural architectures. This project applies those principles to speech, bridging concepts from computer vision and audio processing.
Working with raw audio datasets posed significant challenges. LibriSpeech consists of `.wav` files, which required the following steps (sketched in code after the list):
- Resampling
- Mel spectrogram conversion
- Log-amplitude scaling
- Custom batching and padding logic
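The sketch below shows roughly how these steps map onto torchaudio; the sample rate, mel settings, and collate behaviour here are assumptions rather than the project's exact values:

```python
# Preprocessing sketch: resample -> mel spectrogram -> log scale, then pad
# variable-length examples into a batch. Parameter values are assumptions.
import torch
import torchaudio

TARGET_SR = 16_000
mel = torchaudio.transforms.MelSpectrogram(sample_rate=TARGET_SR, n_mels=80)

def preprocess(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    """(channels, samples) waveform -> log-scaled mel spectrogram (n_mels, frames)."""
    if orig_sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, orig_sr, TARGET_SR)
    spec = mel(waveform.mean(dim=0))       # mix down to mono
    return torch.log(spec + 1e-9)          # log-amplitude scaling

def collate(items):
    """Pad variable-length spectrograms to the longest item in the batch."""
    specs = [preprocess(w, sr) for w, sr in items]
    lengths = torch.tensor([s.shape[-1] for s in specs])
    max_len = int(lengths.max())
    padded = torch.stack([
        torch.nn.functional.pad(s, (0, max_len - s.shape[-1])) for s in specs
    ])
    return padded, lengths
```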
For a project of this complexity, purely conversational or trial-and-error coding was insufficient. While LLMs were useful for small, isolated problems, architectural decisions required careful research and deliberate design.
This project was largely completed approximately one year prior to its upload, and tooling and AI assistance may have since improved.
This project revealed how many interacting components exist in real ML systems:
- Padding and sequence alignment
- Length tracking through CNN layers (see the sketch below)
- CTC constraints
- Loss stability
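Length tracking, for instance, means recomputing each utterance's frame count after the strided convolutions so that CTC loss receives the downsampled lengths. A sketch, assuming the same stride-2, kernel-3, padding-1 convolutions as in the model sketch above:

```python
# Sketch: with two stride-2 convolutions, each spectrogram's frame count shrinks
# by roughly a factor of 4, and CTC needs the *downsampled* input lengths.
import torch

def downsampled_lengths(frame_lengths: torch.Tensor) -> torch.Tensor:
    lengths = frame_lengths
    for _ in range(2):  # two conv layers with kernel 3, stride 2, padding 1
        lengths = torch.div(lengths + 2 * 1 - 3, 2, rounding_mode="floor") + 1
    return lengths

# ctc = torch.nn.CTCLoss(blank=0)
# loss = ctc(log_probs, targets, downsampled_lengths(input_lengths), target_lengths)
```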
While demanding, the experience was highly rewarding and has informed future ML work.
Building everything from scratch was educational but not always optimal. Pretrained models and established toolkits may provide a gentler learning curve. Feedback and suggestions are welcome.
- Python 3.10+
- PyTorch (CUDA recommended)
- torchaudio
- numpy
- scipy
- sounddevice (for live inference)
- Trained model checkpoints (`.pth`) are stored using Git Large File Storage (LFS).
- Ensure Git LFS is installed before cloning this repository.
If using Conda: `conda env create -f voiceenv.yml`

- Clone the repository
- Open it in your editor of choice
- Run any file containing a `__main__` entry point:
  - `VoiceRecognition.py`: Train a new model or resume training from an existing checkpoint.
  - `Predict.py`: Perform offline transcription on a `.wav` file.
  - `LivePredictionLoop.py`: Record audio from your microphone and perform live transcription.
Many scripts will prompt you to select an existing checkpoint interactively.
This repository represents an educational exploration into speech recognition rather than a production-ready system. The emphasis is on understanding how modern speech models work internally.
Feedback, suggestions, and improvements are welcome.