Speech Recognition ML Algorithm

Author: BillMillerCoding


Overview

This repository contains a character-level, CTC-based speech recognition system built from the ground up using PyTorch and torchaudio. The project explores how modern speech recognition systems work internally, without relying on large pretrained models.

The system supports:

  • Training on the LibriSpeech dataset
  • Offline transcription of .wav files
  • Live microphone transcription
  • GPU acceleration via CUDA

The codebase is structured as a clean, modular pipeline covering data loading, preprocessing, modeling, training, checkpointing, and inference.


Checkpoints (.pth Files)

This project makes extensive use of PyTorch checkpoint files (.pth). They are central to training, resuming experiments, and deploying trained models. The trained checkpoints in this repository are stored with Git Large File Storage (LFS); ensure Git LFS is installed before cloning.

https://git-lfs.github.com/

What is a .pth File?

A .pth file is a serialized PyTorch object created using torch.save(). In this project, checkpoints store:

  • Model weights (model_state_dict)
  • Optimizer state (optimizer_state_dict)
  • Training progress (epoch number)

Rather than saving the entire model object, only the state dictionaries are saved (see the sketch after this list). This approach:

  • Avoids serialization issues
  • Keeps checkpoints architecture-agnostic
  • Requires the model class to be re-instantiated before loading
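
A minimal sketch of this save/restore pattern, using the key names listed above (the helper functions and file paths are illustrative, not the repository's exact API):

import torch

def save_checkpoint(model, optimizer, epoch, path):
    # Only the state dictionaries and the epoch counter are serialized.
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path, device="cpu"):
    # The model class must already be instantiated; only its weights are restored.
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"] + 1   # epoch to resume training from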

How Checkpoints Are Used

  • Training: Training can resume from any saved checkpoint, restoring both model parameters and optimizer momentum.
  • Inference: Trained checkpoints can be loaded for offline or live speech recognition.
  • Experimentation: Multiple checkpoints represent different stages of training and allow comparison between runs.

Many of the main entry points in this repository allow interactive selection of a checkpoint from the current directory at runtime.
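
A hypothetical sketch of what such a runtime prompt could look like (the repository's actual prompts may differ):

import glob

def choose_checkpoint():
    # List every .pth file in the current directory and let the user pick one by index.
    paths = sorted(glob.glob("*.pth"))
    for i, path in enumerate(paths):
        print(f"[{i}] {path}")
    index = int(input("Select a checkpoint: "))
    return paths[index]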

Note: Checkpoints are roughly ordered by training progress (usually reflected in their filenames). Higher-numbered checkpoints generally represent later stages of training. Some checkpoints were trained on CPU, while later ones were trained on GPU.


Model Architecture

The speech recognition model follows a classic CNN + BiLSTM + CTC design:

  1. Input: Log-scaled mel spectrograms
  2. Convolutional Layers:
    • Extract time-frequency features
    • Reduce temporal resolution
  3. Bidirectional LSTM:
    • Models long-range temporal dependencies
  4. Linear Classifier:
    • Outputs character probabilities per timestep
  5. CTC Loss:
    • Aligns variable-length audio with variable-length text

The model operates at the character level, predicting letters, spaces, and special tokens.
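
A condensed sketch of this CNN + BiLSTM + CTC layout; the layer sizes, class count, and names are assumptions for illustration, not the repository's actual configuration:

import torch
import torch.nn as nn

class SpeechModel(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_classes=29):
        # n_classes: 26 letters + space + apostrophe + CTC blank (assumed character set)
        super().__init__()
        # 1-2. Convolutional front end: extract time-frequency features, halve time/frequency resolution.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 2), padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=(1, 1), padding=1),
            nn.ReLU(),
        )
        # 3. Bidirectional LSTM over the (reduced) time axis.
        self.lstm = nn.LSTM(32 * (n_mels // 2), hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        # 4. Linear classifier: character logits per timestep.
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                     # x: (batch, 1, n_mels, time)
        x = self.cnn(x)                       # (batch, 32, n_mels/2, time/2)
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, time/2, 32 * n_mels/2)
        x, _ = self.lstm(x)
        return self.fc(x)                     # (batch, time/2, n_classes)

# 5. CTC loss aligns variable-length audio with variable-length text.
#    For training, apply log_softmax and pass (time, batch, classes) to the loss.
ctc_loss = nn.CTCLoss(blank=0)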


Purpose and Outcome

This project began as an experiment to see whether a speech recognition system could be built from near-scratch with a deep understanding of each component.

Motivation

After researching industry-standard speech recognition pipelines, mel spectrograms emerged as a way to transform audio into a format compatible with image-inspired neural architectures. This project applies those principles to speech, bridging concepts from computer vision and audio processing.

Lessons Learned

External Datasets

Working with raw audio datasets posed significant challenges. Preparing the LibriSpeech audio for training required (see the sketch after this list):

  • Resampling
  • Mel spectrogram conversion
  • Log-amplitude scaling
  • Custom batching and padding logic
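
A rough sketch of that preprocessing and padding path using torchaudio transforms; the sample rate, mel settings, and collate function are assumptions, not the repository's exact code:

import torch
import torchaudio

SAMPLE_RATE = 16000
mel = torchaudio.transforms.MelSpectrogram(sample_rate=SAMPLE_RATE, n_mels=80)

def preprocess(waveform, orig_sample_rate):
    # Resample to a fixed rate, convert to a mel spectrogram, then log-scale the amplitudes.
    if orig_sample_rate != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, orig_sample_rate, SAMPLE_RATE)
    spec = mel(waveform)                  # (channel, n_mels, time)
    return torch.log(spec + 1e-9)         # log amplitude, small epsilon for numerical stability

def collate(batch):
    # Pad variable-length spectrograms along the time axis and keep the true lengths for CTC.
    specs = [preprocess(w, sr).squeeze(0).transpose(0, 1) for w, sr in batch]   # (time, n_mels)
    lengths = torch.tensor([s.shape[0] for s in specs])
    padded = torch.nn.utils.rnn.pad_sequence(specs, batch_first=True)           # (batch, max_time, n_mels)
    return padded, lengths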

The Limits of "Vibe Coding"

For a project of this complexity, purely conversational or trial-and-error coding was insufficient. While LLMs were useful for small, isolated problems, architectural decisions required careful research and deliberate design.

This project was largely completed approximately one year prior to its upload, and tooling and AI assistance may have since improved.

ML System Complexity

This project revealed how many interacting components exist in real ML systems (a small sketch follows the list below):

  • Padding and sequence alignment
  • Length tracking through CNN layers
  • CTC constraints
  • Loss stability
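
For example, when a stride-2 convolution shortens the time axis, the frame counts that CTC sees must be recomputed per utterance. A sketch under the assumptions of the front end above (function names are illustrative):

import torch
import torch.nn as nn

def conv_output_length(length, kernel_size=3, stride=2, padding=1):
    # Length of the time axis after one Conv2d layer (the standard PyTorch formula).
    return (length + 2 * padding - kernel_size) // stride + 1

ctc = nn.CTCLoss(blank=0, zero_infinity=True)   # zero_infinity helps loss stability on degenerate batches

def ctc_step(log_probs, targets, spec_lengths, target_lengths):
    # log_probs: (time, batch, n_classes) after log_softmax; CTC needs the post-CNN frame counts.
    input_lengths = torch.tensor([conv_output_length(l) for l in spec_lengths])
    return ctc(log_probs, targets, input_lengths, target_lengths)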

While demanding, the experience was highly rewarding and has informed future ML work.

Drawbacks of a Naïve Approach

Building everything from scratch was educational but not always optimal. Pretrained models and established toolkits may provide a gentler learning curve. Feedback and suggestions are welcome.


How to Use

Requirements

  • Python 3.10+
  • PyTorch (CUDA recommended)
  • torchaudio
  • numpy
  • scipy
  • sounddevice (for live inference)
  • Git LFS (the trained .pth checkpoints are stored with Git Large File Storage; install it before cloning)

If using Conda:

conda env create -f voiceenv.yml

Running the Project

  1. Clone the repository
  2. Open it in your editor of choice
  3. Run any file containing a __main__ entry point:
  • VoiceRecognition.py: Train a new model or resume training from an existing checkpoint.
  • Predict.py: Perform offline transcription on a .wav file.
  • LivePredictionLoop.py: Record audio from your microphone and perform live transcription.

Many scripts will prompt you to select an existing checkpoint interactively.
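
For context, offline transcription of a .wav file boils down to the steps below. This is a hypothetical sketch with an assumed character set and simple greedy CTC decoding, not the exact contents of Predict.py:

import torch
import torchaudio

ALPHABET = "_ abcdefghijklmnopqrstuvwxyz'"   # index 0 is the CTC blank (assumed ordering)

def greedy_decode(logits):
    # Take the most likely character at every timestep, collapse repeats, drop blanks.
    ids = logits.argmax(dim=-1).tolist()
    chars, prev = [], None
    for i in ids:
        if i != prev and i != 0:
            chars.append(ALPHABET[i])
        prev = i
    return "".join(chars)

def transcribe(path, model, preprocess):
    # preprocess: resample + log-mel conversion as sketched earlier.
    waveform, sr = torchaudio.load(path)
    features = preprocess(waveform, sr).unsqueeze(0)   # (1, 1, n_mels, time) for the CNN
    with torch.no_grad():
        logits = model(features)[0]                    # (time, n_classes)
    return greedy_decode(logits)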


Final Notes

This repository represents an educational exploration into speech recognition rather than a production-ready system. The emphasis is on understanding how modern speech models work internally.

Feedback, suggestions, and improvements are welcome.
