A comprehensive collection of machine learning laboratories covering audio processing, computer vision, sequence modeling, and generative models. This repository demonstrates practical implementations of state-of-the-art deep learning techniques across multiple modalities.
This repository contains nine machine learning laboratories that explore various aspects of deep learning:
- Audio Processing & Generation: Signal processing, classification, and synthesis
- Computer Vision: Image classification and semantic segmentation
- Sequence Modeling: Text and speech recognition using CTC
- Generative Models: GANs, VAEs, and Diffusion models for image and audio generation
Each lab includes detailed Jupyter notebooks with implementations, visualizations, and analysis.
Directory: LAb -1 basic and Lab 2 audio/basic_audio/
Notebook: LAB1_BASIC_PROCESSING.ipynb
Introduction to audio signal processing fundamentals using PyTorch and torchaudio.
Key Concepts:
- Audio file loading and manipulation (WAV, MP3 formats)
- Waveform visualization and spectrogram analysis
- Audio resampling (upsampling/downsampling)
- Signal-to-Noise Ratio (SNR) manipulation
- Audio filtering (low-pass filters)
- Musical chord generation
Datasets:
- VOiCES dataset samples (16kHz voice recordings)
- Steam train whistle audio (44.1kHz MP3)
- Piano sound samples (C1, E1, G1, A1, B1 notes)
Technical Highlights:
- Demonstrated audio format conversions and compression
- Created musical combinations (bichords and triads) through signal mixing
- Illustrated how downsampling affects audio quality (see the sketch below)
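For concreteness, here is a minimal, self-contained sketch of the three core operations above: resampling, low-pass filtering, and SNR-controlled noise mixing. It is not the notebook's code; the 440 Hz test tone, the 8 kHz target rate, and the 1 kHz cutoff are illustrative stand-ins for the lab's audio files and settings.

```python
import math
import torch
import torchaudio.functional as F
import torchaudio.transforms as T

SAMPLE_RATE = 16_000  # matches the 16 kHz VOiCES samples
t = torch.arange(0, 1.0, 1 / SAMPLE_RATE)
waveform = torch.sin(2 * math.pi * 440.0 * t).unsqueeze(0)   # shape: (channels, samples)

# Downsample 16 kHz -> 8 kHz (quality loss becomes audible for speech)
downsampled = T.Resample(orig_freq=SAMPLE_RATE, new_freq=8_000)(waveform)

# Low-pass biquad filter with a 1 kHz cutoff
filtered = F.lowpass_biquad(waveform, SAMPLE_RATE, cutoff_freq=1000.0)

def add_noise_at_snr(signal: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix white noise into `signal` so the result has the requested SNR in dB."""
    noise = torch.randn_like(signal)
    scale = torch.sqrt(signal.pow(2).mean() / (noise.pow(2).mean() * 10 ** (snr_db / 10)))
    return signal + scale * noise

noisy = add_noise_at_snr(waveform, snr_db=10.0)
print(downsampled.shape, filtered.shape, noisy.shape)
```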
Directory: LAb -1 basic and Lab 2 audio/Audioclass/
Notebook: Lab2_AudioClassification_(1).ipynb
Binary classification of audio samples using Convolutional Neural Networks.
Key Concepts:
- CNN architectures for audio processing
- Spectrogram-based feature extraction
- Audio augmentation techniques
- Training and validation loops
Technical Highlights:
- Demonstrates end-to-end audio classification pipeline
- Uses spectrograms as visual representations of audio for CNN processing (sketched below)
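A minimal sketch of this kind of pipeline follows: the waveform is converted to a log-mel spectrogram and passed through a small 2D CNN with a binary output. The layer widths, mel settings, and one-second 16 kHz inputs are assumptions, not the notebook's exact architecture.

```python
import torch
import torch.nn as nn
import torchaudio.transforms as T

class AudioCNN(nn.Module):
    def __init__(self, n_mels: int = 64, n_classes: int = 2):
        super().__init__()
        # Waveform -> mel spectrogram -> log scale, treated as a 1-channel image
        self.features = nn.Sequential(
            T.MelSpectrogram(sample_rate=16_000, n_mels=n_mels),
            T.AmplitudeToDB(),
        )
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        spec = self.features(waveform).unsqueeze(1)   # (batch, 1, n_mels, time)
        return self.classifier(self.cnn(spec))

logits = AudioCNN()(torch.randn(4, 16_000))  # 4 one-second clips at 16 kHz
print(logits.shape)                          # torch.Size([4, 2])
```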
Directory: Petclassification/
Notebook: MainLAB2_Pet_classification.ipynb
Binary image classification distinguishing between cats and dogs using CNNs.
Key Concepts:
- Convolutional Neural Networks with residual blocks
- Batch Normalization and Dropout for regularization
- Data augmentation (RandomHorizontalFlip, RandomResizedCrop)
- Transfer learning concepts
- Confusion matrix analysis
Dataset:
- Oxford Pet Dataset: 7,349 images (6,349 training, 1,000 test)
- Images resized to 160×160 pixels
- Classes: Cat (2,047 training), Dog (4,302 training)
Results:
- Training accuracy: ~85-92%
- Test accuracy: ~72-73% with data augmentation
- Model parameters: 1,181,009
- Best performance at epoch 8-10
Technical Highlights:
- 5 convolutional blocks for feature extraction, followed by a classifier head (a sketch of the building blocks follows below)
- Binary Cross Entropy loss with Adam optimizer
- Visualization of misclassified images
- Generalization testing on new animals (jaguar, fox, lion)
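To make the residual blocks, regularization, and augmentation concrete, here is a hedged sketch. The 160-pixel crop matches the lab's input size; the channel count, dropout rate, and the use of `BCEWithLogitsLoss` (a numerically stabler variant of BCE with a sigmoid) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Training-time augmentation as listed above; scale range is an assumption
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(160, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

class ResidualBlock(nn.Module):
    """Conv -> BN -> ReLU -> Dropout -> Conv -> BN, plus an identity shortcut."""
    def __init__(self, channels: int, p_drop: float = 0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Dropout2d(p_drop),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))

criterion = nn.BCEWithLogitsLoss()   # binary cat-vs-dog objective
block = ResidualBlock(32)
print(block(torch.randn(2, 32, 160, 160)).shape)
```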
Directory: petsegmentation/
Notebook: MAIN_LAB3_Pet_segmentation_(1)_(1).ipynb
Pixel-level semantic segmentation of pet images into background, cat, and dog classes.
Key Concepts:
- Encoder-Decoder CNN architecture (Fully Convolutional Network)
- U-Net architecture with skip connections
- Convolutional blocks with batch normalization
- MaxPooling and Upsampling layers
- Intersection over Union (IoU) metric
Dataset:
- Oxford Pet Dataset with segmentation masks
- 7,349 images with pixel-level annotations
- Classes: 0=background, 1=cat, 2=dog
Results:
- Overall accuracy: 84.5%
- Overall IoU: 68.5%
- Model parameters: 1,614,570
- Training showed improvement from 56% to 90% accuracy
Technical Highlights:
- 3-level encoder-decoder with 64 base width
- Cross Entropy Loss averaged over all pixels
- The IoU metric penalizes false positives more heavily than pixel accuracy does (see the sketch below)
- Comparison of basic Encoder-Decoder vs U-Net architectures
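A minimal per-class IoU computation for the three-class setup (0=background, 1=cat, 2=dog) can be sketched as follows; it is not the notebook's exact evaluation code. Because every false positive enlarges the union in the denominator, IoU falls faster than plain pixel accuracy when a class is over-predicted.

```python
import torch

def per_class_iou(pred: torch.Tensor, target: torch.Tensor, n_classes: int = 3):
    """pred and target hold integer class labels with identical shapes (e.g. H x W)."""
    ious = []
    for c in range(n_classes):
        pred_c, target_c = pred == c, target == c
        intersection = (pred_c & target_c).sum().item()
        union = (pred_c | target_c).sum().item()
        ious.append(intersection / union if union > 0 else float("nan"))
    return ious

pred = torch.randint(0, 3, (160, 160))     # dummy prediction map
target = torch.randint(0, 3, (160, 160))   # dummy ground-truth mask
print(per_class_iou(pred, target))
```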
Directory: handwritten/
Notebook: HandwrittenCTC.ipynb
Handwritten text line recognition using Connectionist Temporal Classification for alignment-free sequence learning.
Key Concepts:
- CTC Loss for sequence-to-sequence learning
- Hybrid CNN-RNN architecture (Residual CNN + Bidirectional LSTM)
- Residual blocks with stride and dropout
- Edit distance (Levenshtein distance) for evaluation
- CTC decoding with blank removal
Dataset:
- IAM Handwriting Database (5,000 line images)
- 4,500 training, 500 test samples
- Images: 400×32 pixels (grayscale)
- Character set: 74 characters (letters, digits, punctuation, space)
Architecture:
- Feature extraction: 9 residual CNN blocks (1→32→64→128 channels)
- Sequence conversion: 400×32 → 100×128 features
- Temporal modeling: Bidirectional LSTM (128→256)
- Classification: Linear layer to 74 character classes
Results:
- Character accuracy: Improves with training (requires 50+ epochs)
- CTC Loss: Decreased from 3.66 to ~1.28 over 5 epochs
- Model parameters: 1,614,570
Technical Highlights:
- CTC enables alignment-free training without character-level timestamps (see the sketch below)
- Bidirectional LSTM outperforms simple LSTM
- Correlation between text length and recognition errors
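Below is a minimal sketch of the CTC pieces described above: computing `nn.CTCLoss` on dummy log-probabilities and greedily decoding the best path by collapsing repeats and dropping blanks. The tensor sizes and the assumption that the blank occupies index 0 (giving 75 outputs on top of the 74 characters) are illustrative, not taken from the notebook.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 75          # 74 characters + 1 CTC blank (assumed at index 0)
T_STEPS, BATCH = 100, 2   # 100 time steps after the CNN, as described above

log_probs = torch.randn(T_STEPS, BATCH, NUM_CLASSES).log_softmax(dim=2)  # (T, N, C)
targets = torch.randint(1, NUM_CLASSES, (BATCH, 20))                     # label indices, no blanks
input_lengths = torch.full((BATCH,), T_STEPS, dtype=torch.long)
target_lengths = torch.full((BATCH,), 20, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

def greedy_decode(log_probs, blank=0):
    """Best-path decoding: argmax per frame, collapse repeats, drop blanks."""
    best = log_probs.argmax(dim=2).transpose(0, 1)  # (N, T)
    decoded = []
    for seq in best:
        out, prev = [], blank
        for idx in seq.tolist():
            if idx != prev and idx != blank:
                out.append(idx)
            prev = idx
        decoded.append(out)
    return decoded

print(loss.item(), len(greedy_decode(log_probs)[0]))
```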
Directory: ctc/
Notebook: Copy_of_Lab_3_CTC_and_Wave2Vec_for_ASR_PROF.ipynb
Advanced Automatic Speech Recognition using CTC and Wave2Vec models.
Key Concepts:
- CTC applications for speech recognition
- Wave2Vec models for self-supervised speech representation learning
- End-to-end ASR systems
Technical Highlights:
- Demonstrates state-of-the-art speech recognition techniques
- Combines traditional CTC with modern self-supervised learning (a minimal inference sketch follows)
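As a hedged illustration of the Wave2Vec side, the sketch below runs torchaudio's pre-trained `WAV2VEC2_ASR_BASE_960H` bundle and greedily decodes its CTC emissions. The audio path is a placeholder, the input is assumed to be a mono file, and this bundle may differ from the model used in the notebook.

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()
labels = bundle.get_labels()

waveform, sample_rate = torchaudio.load("speech_sample.wav")  # placeholder path, mono audio
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)   # (batch, time, num_labels) CTC emissions

# Greedy CTC decode: argmax per frame, collapse repeats, drop the blank ("-")
indices = emissions[0].argmax(dim=-1).tolist()
transcript, prev = [], None
for i in indices:
    if i != prev and labels[i] != "-":
        transcript.append(labels[i])
    prev = i
print("".join(transcript).replace("|", " "))   # "|" marks word boundaries in this label set
```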
Directory: manga/
Notebook: Manga_DDPM.ipynb
Generate manga-style character faces using Denoising Diffusion Probabilistic Models.
Key Concepts:
- Diffusion models with forward and reverse processes
- U-Net architecture for noise prediction
- Variance scheduling (β schedule)
- Reparameterization trick for sampling
- Residual blocks with attention mechanisms
Dataset:
- Anime character faces (12,000 images)
- Images resized to 32×32 pixels
- RGB images normalized to [-1, 1]
Diffusion Process:
- Forward: Gradually add Gaussian noise over T=300 timesteps
- x_t = √(α_t) · x_{t-1} + √(1 − α_t) · ε (see the sketch after this list)
- Model: U-Net with (16, 32, 64, 128) features predicts noise
- Sampling: Start from pure noise x_T, iteratively denoise to x_0
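A minimal sketch of the forward (noising) process in its closed form, x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε, which follows from iterating the per-step update above. T=300 matches the lab; the linear β range (1e-4 to 0.02) and the tensor shapes are assumptions.

```python
import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T)            # linear beta (variance) schedule, assumed range
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)    # cumulative product, i.e. alpha-bar_t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = sqrt(abar_t)*x_0 + sqrt(1 - abar_t)*eps."""
    if noise is None:
        noise = torch.randn_like(x0)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)   # broadcast over (B, C, H, W)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise, noise

x0 = torch.rand(8, 3, 32, 32) * 2 - 1            # fake image batch in [-1, 1]
t = torch.randint(0, T, (8,))                    # random timestep per sample
x_t, eps = q_sample(x0, t)
# Training step, conceptually: loss = MSE(unet(x_t, t), eps)
print(x_t.shape)
```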
Results:
- Model parameters: 2,430,595
- Training loss: 0.28 → 0.13-0.19 over 20 epochs
- Generated recognizable manga-style faces
- Sampling requires 300 forward passes
Technical Highlights:
- Experimented with different timestep counts (50, 100, 200, 300)
- Fewer timesteps = faster but lower quality
- More timesteps = better quality but slower generation
Directory: VAE/
Notebook: VAE.ipynb
Generate realistic face images using Variational Autoencoders for learning compressed latent representations.
Key Concepts:
- Encoder-Decoder architecture with probabilistic latent space
- Reparameterization trick: z = μ + σ·ε (sketched below)
- ELBO (Evidence Lower Bound) loss
- KL divergence regularization
- Latent space interpolation
Dataset:
- CelebA dataset (subset)
- Images resized to 64×48 pixels
- RGB images normalized to [-1, 1]
Architecture:
- Encoder: 3 conv layers (3→32→64→128) producing μ and log_var
- Latent sampling: z ~ N(μ, σ²) using reparameterization
- Decoder: 3 transposed conv layers (128→64→32→3)
Loss Function:
- Reconstruction error: MAE between input and output
- KL divergence: 0.5 · Σ(exp(log_var) + μ² − 1 − log_var)
- Total: rec_error + λ·KL_div (λ=1e-4)
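A minimal sketch of the reparameterization trick and this loss, assuming the encoder outputs `mu` and `log_var` for a 64-dimensional latent space and averaging (rather than summing) the KL term; shapes and helper names are illustrative, not the notebook's.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I), keeping sampling differentiable."""
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)

def vae_loss(x_hat, x, mu, log_var, lam=1e-4):
    rec_error = F.l1_loss(x_hat, x)                                       # MAE reconstruction term
    kl_div = 0.5 * torch.mean(log_var.exp() + mu.pow(2) - 1.0 - log_var)  # KL vs. N(0, I)
    return rec_error + lam * kl_div

mu, log_var = torch.zeros(16, 64), torch.zeros(16, 64)   # pretend encoder outputs
z = reparameterize(mu, log_var)
x = torch.rand(16, 3, 48, 64)                            # placeholder image batch
loss = vae_loss(torch.rand_like(x), x, mu, log_var)
print(z.shape, loss.item())
```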
Results:
- Encoder parameters: 952,160
- Decoder parameters: 564,835
- Training loss: 0.46 → 0.33 over 100 epochs
- Generated faces show recognizable features
Technical Highlights:
- Latent space enables smooth interpolation between faces
- Experimented with the KL weight λ (1e-5, 1e-7, 1e-9)
- Different latent space sizes tested (32, 64, 128)
Directory: Gans/
Notebook: Lab_6_Audio_GAN_(1)_(1).ipynb
Generate musical instrument audio (tambourine sounds) using Generative Adversarial Networks.
Key Concepts:
- Generator: ConvTranspose1D + Linear layers
- Discriminator: Multi-layer perceptron with dropout
- Adversarial training with BCE loss
- Label smoothing for stability
- Audio normalization and rescaling
Dataset:
- FSD Kaggle 2018 dataset (Freesound Dataset)
- Tambourine samples: 4 seconds at 16kHz (64,000 samples)
- Audio normalized to [-1, 1]
Architecture:
- Generator:
- Input: 100-dim noise vector
- Output: 64,000-sample waveform
- Discriminator:
- Input: 64,000-sample waveform
- Output: Binary classification (real/fake)
Training Strategy:
- Alternating discriminator and generator updates
- Generator trained multiple times per discriminator step
- Label smoothing (0.9 for real, 0.1 for fake)
- Batch size: 128
Technical Highlights:
- Generated tambourine-like sounds after sufficient training
- Tested stronger discriminator architectures
- Experimented with training ratios (3:1 G:D updates); a single-update sketch follows below
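Below is a condensed sketch of one adversarial update following the strategy above, with 0.9/0.1 label smoothing. It assumes `G` maps a 100-dim noise vector to a waveform and `D` ends in a sigmoid (so plain BCE applies); model definitions, optimizers, and the data loader are omitted, and only a single generator pass is shown even though the lab trains G several times per D step.

```python
import torch
import torch.nn as nn

def gan_step(G, D, opt_g, opt_d, real, noise_dim=100, device="cpu"):
    bce = nn.BCELoss()
    batch = real.size(0)
    real_labels = torch.full((batch, 1), 0.9, device=device)   # smoothed "real" label
    fake_labels = torch.full((batch, 1), 0.1, device=device)   # smoothed "fake" label

    # Discriminator update: real batch vs. detached generated batch
    opt_d.zero_grad()
    fake = G(torch.randn(batch, noise_dim, device=device))
    d_loss = bce(D(real), real_labels) + bce(D(fake.detach()), fake_labels)
    d_loss.backward()
    opt_d.step()

    # Generator update: push D's output on fakes toward the "real" label
    opt_g.zero_grad()
    g_loss = bce(D(fake), real_labels)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```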
- PyTorch: Primary deep learning framework
- torchaudio: Audio processing
- torchvision: Computer vision utilities
- librosa: Audio analysis and feature extraction
- numpy: Numerical computations
- matplotlib: Visualization
- Pillow (PIL): Image processing
- Convolutional Neural Networks (CNNs)
- Residual Networks (ResNets)
- U-Net
- Recurrent Neural Networks (RNNs/LSTMs)
- Encoder-Decoder architectures
- Generative Adversarial Networks (GANs)
- Variational Autoencoders (VAEs)
- Denoising Diffusion Probabilistic Models (DDPM)
- Data augmentation (flipping, cropping, noise addition)
- Batch/Instance normalization
- Dropout regularization
- Adam optimizer
- Learning rate scheduling
- Label smoothing
- Binary Cross Entropy (BCE)
- Cross Entropy Loss
- CTC Loss
- ELBO (Reconstruction + KL Divergence)
- L1/L2 Loss
- Accuracy, Precision, Recall
- Confusion matrices
- Intersection over Union (IoU)
- Edit distance (Levenshtein)
- Signal-to-Noise Ratio (SNR)
```bash
# Python 3.8+
pip install torch torchvision torchaudio
pip install librosa numpy matplotlib pillow
pip install jupyter notebook
```
- Clone this repository:
```bash
git clone https://github.com/<your-username>/Lab-of-Machine-Learning.git
cd Lab-of-Machine-Learning
```
- Navigate to the desired lab directory:
```bash
cd "LAb -1 basic and Lab 2 audio/basic_audio"
# or
cd Petclassification
# etc.
```
- Open the Jupyter notebook:
```bash
jupyter notebook
```
- Run the cells sequentially to reproduce the results.
Some labs require downloading datasets:
- Oxford Pet Dataset: Automatically downloaded by torchvision
- IAM Handwriting Database: May require registration
- CelebA: Available through torchvision
- FSD Kaggle 2018: Available on Kaggle
- Anime Faces: Custom dataset (subset from Kaggle)
| Lab | Domain | Task | Key Technique |
|---|---|---|---|
| 1 | Audio | Signal Processing | Resampling, Filtering |
| 2 | Audio | Classification | CNN |
| 3 | Vision | Classification | ResNet-style CNN |
| 4 | Vision | Segmentation | U-Net |
| 5 | Vision/Text | Recognition | CTC + BiLSTM |
| 6 | Audio/Speech | ASR | CTC + Wave2Vec |
| 7 | Vision | Generation | DDPM |
| 8 | Vision | Generation | VAE |
| 9 | Audio | Generation | GAN |
This repository is for educational purposes. Please refer to individual dataset licenses for usage restrictions.
This is a collection of laboratory work. For suggestions or improvements, please open an issue or submit a pull request.
For questions or collaborations, please reach out through GitHub issues.
Note: Training times and results may vary depending on hardware specifications. GPU acceleration is recommended for Labs 3-9.