Lab of Machine Learning

A comprehensive collection of machine learning laboratories covering audio processing, computer vision, sequence modeling, and generative models. This repository demonstrates practical implementations of state-of-the-art deep learning techniques across multiple modalities.

📋 Table of Contents

  • Overview
  • Labs
  • Technologies Used
  • Setup and Requirements
  • Summary of Applications
  • License
  • Contributing
  • Contact

🎯 Overview

This repository contains nine machine learning laboratories that explore different aspects of deep learning:

  • Audio Processing & Generation: Signal processing, classification, and synthesis
  • Computer Vision: Image classification and semantic segmentation
  • Sequence Modeling: Text and speech recognition using CTC
  • Generative Models: GANs, VAEs, and Diffusion models for image and audio generation

Each lab includes detailed Jupyter notebooks with implementations, visualizations, and analysis.

🔬 Labs

Lab 1: Basic Audio Processing

Directory: LAb -1 basic and Lab 2 audio/basic_audio/
Notebook: LAB1_BASIC_PROCESSING.ipynb

Introduction to audio signal processing fundamentals using PyTorch and torchaudio.

Key Concepts:

  • Audio file loading and manipulation (WAV, MP3 formats)
  • Waveform visualization and spectrogram analysis
  • Audio resampling (upsampling/downsampling)
  • Signal-to-Noise Ratio (SNR) manipulation
  • Audio filtering (low-pass filters)
  • Musical chord generation

Datasets:

  • VOiCES dataset samples (16kHz voice recordings)
  • Steam train whistle audio (44.1kHz MP3)
  • Piano sound samples (C1, E1, G1, A1, B1 notes)

Technical Highlights:

  • Demonstrated audio format conversions and compression
  • Created musical combinations (bichords and triads) through signal mixing
  • Illustrated how downsampling affects audio quality
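
A minimal sketch of the torchaudio workflow used in this lab; the audio file name is a placeholder and the exact parameters in the notebook may differ.

```python
import math
import torch
import torchaudio
import torchaudio.transforms as T

# Load an audio file (placeholder name) and inspect its native sample rate
waveform, sample_rate = torchaudio.load("speech_sample.wav")

# Downsample to 8 kHz to illustrate how resampling affects quality
downsampled = T.Resample(orig_freq=sample_rate, new_freq=8000)(waveform)

# Spectrogram for visualization
spectrogram = T.Spectrogram(n_fft=1024)(waveform)

# Mix two sine waves into a simple bichord (C4 + E4) and save it
sr = 16000
t = torch.arange(0, 1.0, 1 / sr)
bichord = 0.5 * torch.sin(2 * math.pi * 261.63 * t) + 0.5 * torch.sin(2 * math.pi * 329.63 * t)
torchaudio.save("bichord.wav", bichord.unsqueeze(0), sr)
```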

Lab 2: Audio Classification

Directory: LAb -1 basic and Lab 2 audio/Audioclass/
Notebook: Lab2_AudioClassification_(1).ipynb

Binary classification of audio samples using Convolutional Neural Networks.

Key Concepts:

  • CNN architectures for audio processing
  • Spectrogram-based feature extraction
  • Audio augmentation techniques
  • Training and validation loops

Technical Highlights:

  • Demonstrates end-to-end audio classification pipeline
  • Uses spectrograms as visual representations of audio for CNN processing
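
A minimal sketch of a spectrogram-based binary audio classifier of the kind described above; layer sizes and the 16 kHz assumption are illustrative, not the notebook's exact model.

```python
import torch
import torch.nn as nn
import torchaudio.transforms as T

class AudioCNN(nn.Module):
    def __init__(self, n_mels=64):
        super().__init__()
        # Waveform -> log-mel spectrogram, treated as a 1-channel image
        self.to_spec = nn.Sequential(
            T.MelSpectrogram(sample_rate=16000, n_mels=n_mels),
            T.AmplitudeToDB(),
        )
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 1)  # single logit for binary classification

    def forward(self, waveform):            # waveform: (batch, 1, time)
        spec = self.to_spec(waveform)       # (batch, 1, n_mels, frames)
        x = self.features(spec).flatten(1)
        return self.classifier(x)

model = AudioCNN()
logits = model(torch.randn(4, 1, 16000))    # four one-second clips at 16 kHz
loss = nn.BCEWithLogitsLoss()(logits.squeeze(1), torch.ones(4))
```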

Lab 3: Pet Classification

Directory: Petclassification/
Notebook: MainLAB2_Pet_classification.ipynb

Binary image classification distinguishing between cats and dogs using CNNs.

Key Concepts:

  • Convolutional Neural Networks with residual blocks
  • Batch Normalization and Dropout for regularization
  • Data augmentation (RandomHorizontalFlip, RandomResizedCrop)
  • Transfer learning concepts
  • Confusion matrix analysis

Dataset:

  • Oxford Pet Dataset: 7,349 images (6,349 training, 1,000 test)
  • Images resized to 160×160 pixels
  • Classes: Cat (2,047 training), Dog (4,302 training)

Results:

  • Training accuracy: ~85-92%
  • Test accuracy: ~72-73% with data augmentation
  • Model parameters: 1,181,009
  • Best performance at epoch 8-10

Technical Highlights:

  • 5 convolutional blocks with feature extraction + classifier
  • Binary Cross Entropy loss with Adam optimizer
  • Visualization of misclassified images
  • Generalization testing on new animals (jaguar, fox, lion)
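
A minimal sketch of the augmentation pipeline and a residual block of the kind listed above; channel counts are illustrative rather than the notebook's exact values.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Training-time augmentation: random crop + horizontal flip, as in the lab
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(160, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Identity skip connection followed by a non-linearity
        return torch.relu(x + self.body(x))
```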

Lab 4: Pet Segmentation

Directory: petsegmentation/
Notebook: MAIN_LAB3_Pet_segmentation_(1)_(1).ipynb

Pixel-level semantic segmentation of pet images into background, cat, and dog classes.

Key Concepts:

  • Encoder-Decoder CNN architecture (Fully Convolutional Network)
  • U-Net architecture with skip connections
  • Convolutional blocks with batch normalization
  • MaxPooling and Upsampling layers
  • Intersection over Union (IoU) metric

Dataset:

  • Oxford Pet Dataset with segmentation masks
  • 7,349 images with pixel-level annotations
  • Classes: 0=background, 1=cat, 2=dog

Results:

  • Overall accuracy: 84.5%
  • Overall IoU: 68.5%
  • Model parameters: 1,614,570
  • Training showed improvement from 56% to 90% accuracy

Technical Highlights:

  • 3-level encoder-decoder with 64 base width
  • Cross Entropy Loss averaged over all pixels
  • IoU metric penalizes false positives more heavily than accuracy
  • Comparison of basic Encoder-Decoder vs U-Net architectures
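
A minimal sketch of per-class IoU for the three classes (0=background, 1=cat, 2=dog); this is a generic implementation, not copied from the notebook.

```python
import torch

def per_class_iou(pred, target, num_classes=3):
    """pred, target: (batch, H, W) integer class maps."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        intersection = (pred_c & target_c).sum().float()
        union = (pred_c | target_c).sum().float()
        ious.append((intersection / union).item() if union > 0 else float("nan"))
    return ious

# Usage: logits from the segmentation network have shape (batch, 3, H, W)
logits = torch.randn(2, 3, 160, 160)
target = torch.randint(0, 3, (2, 160, 160))
print(per_class_iou(logits.argmax(dim=1), target))
```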

Lab 5: Handwritten Text Recognition with CTC

Directory: handwritten/
Notebook: HandwrittenCTC.ipynb

Handwritten text line recognition using Connectionist Temporal Classification for alignment-free sequence learning.

Key Concepts:

  • CTC Loss for sequence-to-sequence learning
  • Hybrid CNN-RNN architecture (Residual CNN + Bidirectional LSTM)
  • Residual blocks with stride and dropout
  • Edit distance (Levenshtein distance) for evaluation
  • CTC decoding with blank removal

Dataset:

  • IAM Handwriting Database (5,000 line images)
  • 4,500 training, 500 test samples
  • Images: 400×32 pixels (grayscale)
  • Character set: 74 characters (letters, digits, punctuation, space)

Architecture:

  1. Feature extraction: 9 residual CNN blocks (1→32→64→128 channels)
  2. Sequence conversion: 400×32 → 100×128 features
  3. Temporal modeling: Bidirectional LSTM (128→256)
  4. Classification: Linear layer to 74 character classes

Results:

  • Character accuracy: Improves with training (requires 50+ epochs)
  • CTC Loss: Decreased from 3.66 to ~1.28 over 5 epochs
  • Model parameters: 1,614,570

Technical Highlights:

  • CTC enables alignment-free training without character-level timestamps
  • Bidirectional LSTM outperforms simple LSTM
  • Correlation between text length and recognition errors
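
A minimal sketch of how CTC loss and greedy decoding fit together, assuming 100 time steps and a blank symbol at index 0 on top of the 74 characters; the notebook's exact label encoding may differ.

```python
import torch
import torch.nn as nn

T_STEPS, BATCH, NUM_CLASSES = 100, 4, 75          # 74 characters + 1 CTC blank (index 0)
log_probs = torch.randn(T_STEPS, BATCH, NUM_CLASSES).log_softmax(dim=2)

targets = torch.randint(1, NUM_CLASSES, (BATCH, 30))          # encoded character labels
input_lengths = torch.full((BATCH,), T_STEPS, dtype=torch.long)
target_lengths = torch.randint(10, 31, (BATCH,), dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)

def greedy_decode(log_probs):
    """Best-path decoding: take the argmax per frame, collapse repeats, drop blanks."""
    best = log_probs.argmax(dim=2).transpose(0, 1)            # (batch, time)
    decoded = []
    for seq in best:
        collapsed = torch.unique_consecutive(seq)
        decoded.append([int(c) for c in collapsed if c != 0])
    return decoded
```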

Lab 6: CTC and Wave2Vec for ASR

Directory: ctc/
Notebook: Copy_of_Lab_3_CTC_and_Wave2Vec_for_ASR_PROF.ipynb

Advanced Automatic Speech Recognition using CTC and Wave2Vec models.

Key Concepts:

  • CTC applications for speech recognition
  • Wave2Vec models for self-supervised speech representation learning
  • End-to-end ASR systems

Technical Highlights:

  • Demonstrates state-of-the-art speech recognition techniques
  • Combines traditional CTC with modern self-supervised learning
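
A minimal sketch of greedy CTC decoding on top of a pretrained Wave2Vec 2.0 bundle from torchaudio; the notebook may use a different checkpoint and decoding setup, and the audio file name is a placeholder.

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("speech_sample.wav")           # placeholder file
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)                            # (batch, frames, num_labels)

# Greedy CTC decode: best label per frame, collapse repeats, drop blanks
labels = bundle.get_labels()                                  # blank token "-" at index 0
indices = torch.unique_consecutive(emissions[0].argmax(dim=-1))
transcript = "".join(labels[int(i)] for i in indices if labels[int(i)] != "-")
print(transcript.replace("|", " "))                           # "|" marks word boundaries
```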

Lab 7: Manga Character Generation with DDPM

Directory: manga/
Notebook: Manga_DDPM.ipynb

Generate manga-style character faces using Denoising Diffusion Probabilistic Models.

Key Concepts:

  • Diffusion models with forward and reverse processes
  • U-Net architecture for noise prediction
  • Variance scheduling (β schedule)
  • Reparameterization trick for sampling
  • Residual blocks with attention mechanisms

Dataset:

  • Anime character faces (12,000 images)
  • Images resized to 32×32 pixels
  • RGB images normalized to [-1, 1]

Diffusion Process:

  1. Forward: Gradually add Gaussian noise over T=300 timesteps (see the sketch after this list)
    • x_t = √(α_t) x_{t-1} + √(1-α_t) ε
  2. Model: U-Net with (16, 32, 64, 128) features predicts noise
  3. Sampling: Start from pure noise x_T, iteratively denoise to x_0
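
The forward step above has the usual closed form q(x_t | x_0), which lets training jump from a clean image directly to any timestep. A minimal sketch, assuming a linear β schedule; the hyperparameters are illustrative, not the notebook's exact values.

```python
import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear β schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative product of α_t

def q_sample(x0, t, noise=None):
    """Jump straight from clean images x0 to the noisy x_t in one step."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise, noise

# Training objective: the U-Net predicts the noise that was added
x0 = torch.rand(8, 3, 32, 32) * 2 - 1          # batch of images in [-1, 1]
t = torch.randint(0, T, (8,))
x_t, noise = q_sample(x0, t)
# loss = F.mse_loss(unet(x_t, t), noise)       # unet is the (16, 32, 64, 128) model above
```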

Results:

  • Model parameters: 2,430,595
  • Training loss: 0.28 → 0.13-0.19 over 20 epochs
  • Generated recognizable manga-style faces
  • Sampling requires 300 forward passes

Technical Highlights:

  • Experimented with different timestep counts (50, 100, 200, 300)
  • Fewer timesteps = faster but lower quality
  • More timesteps = better quality but slower generation

Lab 8: Variational Autoencoder (VAE)

Directory: VAE/
Notebook: VAE.ipynb

Generate realistic face images using Variational Autoencoders for learning compressed latent representations.

Key Concepts:

  • Encoder-Decoder architecture with probabilistic latent space
  • Reparameterization trick: z = μ + σ·ε
  • ELBO (Evidence Lower Bound) loss
  • KL divergence regularization
  • Latent space interpolation

Dataset:

  • CelebA dataset (subset)
  • Images resized to 64×48 pixels
  • RGB images normalized to [-1, 1]

Architecture:

  • Encoder: 3 conv layers (3→32→64→128) producing μ and log_var
  • Latent sampling: z ~ N(μ, σ²) using reparameterization
  • Decoder: 3 transposed conv layers (128→64→32→3)

Loss Function:

  • Reconstruction error: MAE between input and output
  • KL divergence: 0.5 × Σ(μ² + σ² - 1 - log σ²)
  • Total: rec_error + λ·KL_div (λ = 1e-4); see the sketch below
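
A minimal sketch of the reparameterization trick and this loss; the encoder and decoder bodies are omitted, and their names below are placeholders.

```python
import torch
import torch.nn.functional as F

LAMBDA = 1e-4

def reparameterize(mu, log_var):
    """z = mu + sigma * eps, with eps ~ N(0, I)."""
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)

def vae_loss(x, x_rec, mu, log_var):
    rec_error = F.l1_loss(x_rec, x)                                   # MAE reconstruction
    kl_div = 0.5 * torch.mean(mu.pow(2) + log_var.exp() - 1 - log_var)
    return rec_error + LAMBDA * kl_div

# Usage with a hypothetical encoder/decoder pair:
# mu, log_var = encoder(x)
# z = reparameterize(mu, log_var)
# loss = vae_loss(x, decoder(z), mu, log_var)
```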

Results:

  • Encoder parameters: 952,160
  • Decoder parameters: 564,835
  • Training loss: 0.46 → 0.33 over 100 epochs
  • Generated faces show recognizable features

Technical Highlights:

  • Latent space enables smooth interpolation between faces
  • LAMBDA experimentation (1e-5, 1e-7, 1e-9)
  • Different latent space sizes tested (32, 64, 128)

Lab 9: Audio GAN

Directory: Gans/
Notebook: Lab_6_Audio_GAN_(1)_(1).ipynb

Generate musical instrument audio (tambourine sounds) using Generative Adversarial Networks.

Key Concepts:

  • Generator: ConvTranspose1D + Linear layers
  • Discriminator: Multi-layer perceptron with dropout
  • Adversarial training with BCE loss
  • Label smoothing for stability
  • Audio normalization and rescaling

Dataset:

  • FSD Kaggle 2018 dataset (Freesound Dataset)
  • Tambourine samples: 4 seconds at 16kHz (64,000 samples)
  • Audio normalized to [-1, 1]

Architecture:

  • Generator:
    • Input: 100-dim noise vector
    • Output: 64,000-sample waveform
  • Discriminator:
    • Input: 64,000-sample waveform
    • Output: Binary classification (real/fake)

Training Strategy:

  • Alternating discriminator and generator updates
  • Generator trained multiple times per discriminator step
  • Label smoothing (0.9 for real, 0.1 for fake)
  • Batch size: 128
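
A minimal sketch of one training step following this strategy; the generator and discriminator definitions are omitted, and the discriminator is assumed to output a single logit per sample.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
REAL, FAKE = 0.9, 0.1                      # smoothed labels

def train_step(G, D, opt_g, opt_d, real_audio, noise_dim=100, g_steps=3):
    batch = real_audio.size(0)

    # --- Discriminator update ---
    opt_d.zero_grad()
    fake_audio = G(torch.randn(batch, noise_dim)).detach()
    d_loss = bce(D(real_audio), torch.full((batch, 1), REAL)) + \
             bce(D(fake_audio), torch.full((batch, 1), FAKE))
    d_loss.backward()
    opt_d.step()

    # --- Generator updates (trained more often than D, e.g. 3:1) ---
    for _ in range(g_steps):
        opt_g.zero_grad()
        fake_audio = G(torch.randn(batch, noise_dim))
        g_loss = bce(D(fake_audio), torch.full((batch, 1), REAL))  # try to fool D
        g_loss.backward()
        opt_g.step()

    return d_loss.item(), g_loss.item()
```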

Technical Highlights:

  • Generated tambourine-like sounds after sufficient training
  • Tested stronger discriminator architectures
  • Experimented with training ratios (3:1 G:D updates)

🛠 Technologies Used

Deep Learning Frameworks

  • PyTorch: Primary deep learning framework
  • torchaudio: Audio processing
  • torchvision: Computer vision utilities

Key Libraries

  • librosa: Audio analysis and feature extraction
  • numpy: Numerical computations
  • matplotlib: Visualization
  • Pillow (PIL): Image processing

Architectures Implemented

  • Convolutional Neural Networks (CNNs)
  • Residual Networks (ResNets)
  • U-Net
  • Recurrent Neural Networks (RNNs/LSTMs)
  • Encoder-Decoder architectures
  • Generative Adversarial Networks (GANs)
  • Variational Autoencoders (VAEs)
  • Denoising Diffusion Probabilistic Models (DDPM)

Training Techniques

  • Data augmentation (flipping, cropping, noise addition)
  • Batch/Instance normalization
  • Dropout regularization
  • Adam optimizer
  • Learning rate scheduling
  • Label smoothing

Loss Functions

  • Binary Cross Entropy (BCE)
  • Cross Entropy Loss
  • CTC Loss
  • ELBO (Reconstruction + KL Divergence)
  • L1/L2 Loss

Evaluation Metrics

  • Accuracy, Precision, Recall
  • Confusion matrices
  • Intersection over Union (IoU)
  • Edit distance (Levenshtein)
  • Signal-to-Noise Ratio (SNR)

🚀 Setup and Requirements

Prerequisites

```bash
# Python 3.8+
pip install torch torchvision torchaudio
pip install librosa numpy matplotlib pillow
pip install jupyter notebook
```

Running the Labs

  1. Clone this repository:

```bash
git clone https://github.com/<your-username>/Lab-of-Machine-Learning.git
cd Lab-of-Machine-Learning
```

  2. Navigate to the desired lab directory:

```bash
cd "LAb -1 basic and Lab 2 audio/basic_audio"
# or
cd Petclassification
# etc.
```

  3. Open the Jupyter notebook:

```bash
jupyter notebook
```

  4. Run the cells sequentially to reproduce the results.

Dataset Notes

Some labs require downloading datasets:

  • Oxford Pet Dataset: Automatically downloaded by torchvision
  • IAM Handwriting Database: May require registration
  • CelebA: Available through torchvision
  • FSD Kaggle 2018: Available on Kaggle
  • Anime Faces: Custom dataset (subset from Kaggle)
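
For the labs that use the Oxford Pet Dataset, a minimal sketch of the torchvision download; the root path and transform are placeholders, not the notebooks' exact settings.

```python
from torchvision import datasets, transforms

tf = transforms.Compose([transforms.Resize((160, 160)), transforms.ToTensor()])
train_set = datasets.OxfordIIITPet(
    root="data", split="trainval", target_types="category",
    transform=tf, download=True,
)
# For Lab 4, target_types="segmentation" returns the pixel-level masks instead.
print(len(train_set), train_set[0][0].shape)
```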

📊 Summary of Applications

| Lab | Domain | Task | Key Technique |
|-----|--------|------|---------------|
| 1 | Audio | Signal Processing | Resampling, Filtering |
| 2 | Audio | Classification | CNN |
| 3 | Vision | Classification | ResNet-style CNN |
| 4 | Vision | Segmentation | U-Net |
| 5 | Vision/Text | Recognition | CTC + BiLSTM |
| 6 | Audio/Speech | ASR | CTC + Wave2Vec |
| 7 | Vision | Generation | DDPM |
| 8 | Vision | Generation | VAE |
| 9 | Audio | Generation | GAN |

πŸ“ License

This repository is for educational purposes. Please refer to individual dataset licenses for usage restrictions.


🤝 Contributing

This is a collection of laboratory work. For suggestions or improvements, please open an issue or submit a pull request.


📧 Contact

For questions or collaborations, please reach out through GitHub issues.


Note: Training times and results may vary depending on hardware specifications. GPU acceleration is recommended for Labs 3-9.
