Author: BillMillerCoding
This repository contains a character-level, CTC-based speech recognition system built from the ground up using PyTorch and torchaudio. The project explores how modern speech recognition systems work internally, without relying on large pretrained models.
The system supports:
- Training on the LibriSpeech dataset
- Offline transcription of `.wav` files
- Live microphone transcription
- GPU acceleration via CUDA
The codebase is structured as a clean, modular pipeline covering data loading, preprocessing, modeling, training, checkpointing, and inference.
This project makes extensive use of PyTorch checkpoint files with the .pth extension. These files are central to training, resuming experiments, and deploying trained models.
Trained model checkpoints (.pth) are stored using Git Large File Storage (LFS).
Ensure Git LFS is installed before cloning this repository.
A `.pth` file is a serialized PyTorch object created using `torch.save()`. In this project, checkpoints store:
- Model weights (`model_state_dict`)
- Optimizer state (`optimizer_state_dict`)
- Training progress (epoch number)
Rather than saving the entire model object, only the state dictionaries are saved (see the sketch after this list). This approach:
- Avoids serialization issues
- Keeps checkpoints architecture-agnostic
- Requires the model class to be re-instantiated before loading
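A minimal sketch of this save/restore pattern (the function names, file name, and `map_location` choice here are illustrative, not the project's exact code; the dictionary keys match those listed above):

```python
# Minimal sketch of the state-dict checkpoint pattern (illustrative only).
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pth"):
    torch.save({
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pth"):
    # The model (and optimizer) must already be constructed with the right
    # architecture; only their parameters are restored from the file.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["epoch"]
```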
- Training: Runs can resume from any saved checkpoint, restoring both model parameters and optimizer state.
- Inference: Trained checkpoints can be loaded for offline or live speech recognition.
- Experimentation: Multiple checkpoints represent different stages of training and allow comparison between runs.
Many of the main entry points in this repository allow interactive selection of a checkpoint from the current directory at runtime.
Note: Checkpoints are roughly ordered by training progress (usually reflected in their filenames). Higher-numbered checkpoints generally represent later stages of training. Some checkpoints were trained on CPU, while later ones were trained on GPU.
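For illustration only, a checkpoint picker of this kind can be as simple as the following (the actual prompts and sorting in the scripts may differ):

```python
# Illustrative checkpoint picker: list .pth files in the current directory
# and ask the user to choose one by index.
from pathlib import Path

def choose_checkpoint() -> Path:
    checkpoints = sorted(Path(".").glob("*.pth"))
    for i, ckpt in enumerate(checkpoints):
        print(f"[{i}] {ckpt.name}")
    index = int(input("Select a checkpoint: "))
    return checkpoints[index]
```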
The speech recognition model follows a classic CNN + BiLSTM + CTC design:
- Input: log-scaled mel spectrograms
- Convolutional layers: extract time-frequency features and reduce temporal resolution
- Bidirectional LSTM: models long-range temporal dependencies
- Linear classifier: outputs character probabilities per timestep
- CTC loss: aligns variable-length audio with variable-length text
The model operates at the character level, predicting letters, spaces, and special tokens.
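A minimal sketch of this kind of architecture is shown below; the layer counts, channel widths, kernel sizes, and character-set size are assumptions for illustration, not the exact configuration used in the repository:

```python
# Sketch of a CNN + BiLSTM + CTC model. All hyperparameters are assumptions.
import torch
import torch.nn as nn

class SpeechRecognizer(nn.Module):
    def __init__(self, n_mels: int = 80, n_chars: int = 29, hidden: int = 256):
        super().__init__()
        # 2D convolutions over (mel bins, time); each stride-2 layer halves both axes
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.rnn = nn.LSTM(
            input_size=32 * (n_mels // 4), hidden_size=hidden,
            num_layers=2, bidirectional=True, batch_first=True,
        )
        # n_chars: e.g. CTC blank + 26 letters + space + apostrophe
        self.classifier = nn.Linear(2 * hidden, n_chars)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time) log-mel spectrogram
        x = self.cnn(spec)                               # (batch, 32, n_mels/4, time/4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (batch, time/4, features)
        x, _ = self.rnn(x)
        return self.classifier(x)                        # per-timestep character logits

# Training would pair these logits (as log-probabilities of shape
# (time, batch, classes)) with nn.CTCLoss(blank=0).
```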
This project began as an experiment to see whether a speech recognition system could be built from near-scratch with a deep understanding of each component.
Research into industry-standard speech recognition pipelines pointed to mel spectrograms as a way to transform audio into a format compatible with image-inspired neural architectures. This project applies those principles to speech, bridging concepts from computer vision and audio processing.
Working with raw audio datasets posed significant challenges. LibriSpeech consists of `.wav` files, which required the following steps (sketched in code after the list):
- Resampling
- Mel spectrogram conversion
- Log-amplitude scaling
- Custom batching and padding logic
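The sketch below shows roughly how these steps map onto torchaudio; the sample rate, mel settings, and collate behaviour here are assumptions rather than the project's exact values:

```python
# Preprocessing sketch: resample -> mel spectrogram -> log scale, then pad
# variable-length examples into a batch. Parameter values are assumptions.
import torch
import torchaudio

TARGET_SR = 16_000
mel = torchaudio.transforms.MelSpectrogram(sample_rate=TARGET_SR, n_mels=80)

def preprocess(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    """(channels, samples) waveform -> log-scaled mel spectrogram (n_mels, frames)."""
    if orig_sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, orig_sr, TARGET_SR)
    spec = mel(waveform.mean(dim=0))       # mix down to mono
    return torch.log(spec + 1e-9)          # log-amplitude scaling

def collate(items):
    """Pad variable-length spectrograms to the longest item in the batch."""
    specs = [preprocess(w, sr) for w, sr in items]
    lengths = torch.tensor([s.shape[-1] for s in specs])
    max_len = int(lengths.max())
    padded = torch.stack([
        torch.nn.functional.pad(s, (0, max_len - s.shape[-1])) for s in specs
    ])
    return padded, lengths
```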
For a project of this complexity, purely conversational or trial-and-error coding was insufficient. While LLMs were useful for small, isolated problems, architectural decisions required careful research and deliberate design.
This project was largely completed approximately one year prior to its upload, and tooling and AI assistance may have since improved.
This project revealed how many interacting components exist in real ML systems:
- Padding and sequence alignment
- Length tracking through CNN layers (see the sketch below)
- CTC constraints
- Loss stability
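Length tracking, for instance, means recomputing each utterance's frame count after the strided convolutions so that CTC loss receives the downsampled lengths. A sketch, assuming the same stride-2, kernel-3, padding-1 convolutions as in the model sketch above:

```python
# Sketch: with two stride-2 convolutions, each spectrogram's frame count shrinks
# by roughly a factor of 4, and CTC needs the *downsampled* input lengths.
import torch

def downsampled_lengths(frame_lengths: torch.Tensor) -> torch.Tensor:
    lengths = frame_lengths
    for _ in range(2):  # two conv layers with kernel 3, stride 2, padding 1
        lengths = torch.div(lengths + 2 * 1 - 3, 2, rounding_mode="floor") + 1
    return lengths

# ctc = torch.nn.CTCLoss(blank=0)
# loss = ctc(log_probs, targets, downsampled_lengths(input_lengths), target_lengths)
```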
While demanding, the experience was highly rewarding and has informed future ML work.
Building everything from scratch was educational but not always optimal. Pretrained models and established toolkits may provide a gentler learning curve. Feedback and suggestions are welcome.
- Python 3.10+
- PyTorch (CUDA recommended)
- torchaudio
- numpy
- scipy
- sounddevice (for live inference)
- Trained model checkpoints (`.pth`) are stored using Git Large File Storage (LFS).
- Ensure Git LFS is installed before cloning this repository.
If using Conda: `conda env create -f voiceenv.yml`

- Clone the repository
- Open it in your editor of choice
- Run any file containing a `__main__` entry point:
  - `VoiceRecognition.py`: Train a new model or resume training from an existing checkpoint.
  - `Predict.py`: Perform offline transcription on a `.wav` file.
  - `LivePredictionLoop.py`: Record audio from your microphone and perform live transcription.
Many scripts will prompt you to select an existing checkpoint interactively.
This repository represents an educational exploration into speech recognition rather than a production-ready system. The emphasis is on understanding how modern speech models work internally.
Feedback, suggestions, and improvements are welcome.