AI Learning

Learnings from implementing deep learning models from scratch, with configuration management (using Hydra) and NVTX profiling.

📁 Project Structure

ai-learning/
├── src/encoding_101/
│   ├── models/              # Autoencoder implementations
│   │   ├── base.py         # Base autoencoder class with MAR@5 visualization
│   │   ├── vanilla_autoencoder.py  # Vanilla + NVTX-enabled autoencoders
│   │   └── cnn_autoencoder.py      # CNN + NVTX-enabled autoencoders
│   ├── mixins/             # Reusable functionality
│   │   └── nvtx.py         # NVTX profiling mixin for any model
│   ├── training/           # Training utilities
│   │   └── trainer.py      # Main training orchestration
│   ├── data.py             # CIFAR-10 data module (Lightning)
│   ├── metrics.py          # Evaluation metrics (MAR@5)
│   ├── utils.py            # Visualization utilities
│   └── visualization/      # Advanced plotting and analysis tools
├── configs/                # Hydra configuration management
│   ├── config.yaml         # Main configuration with defaults
│   ├── model/              # Model configurations
│   │   ├── nvtx_vanilla_autoencoder.yaml
│   │   └── cnn_autoencoder.yaml
│   ├── training/           # Training configurations
│   │   ├── default.yaml
│   │   └── profiling.yaml
│   ├── data/               # Data module configurations
│   │   └── cifar10.yaml
│   ├── profiling/          # NVTX profiling settings
│   │   └── default.yaml
│   └── experiment/         # Pre-configured experiments
│       └── cnn_comparison.yaml
├── scripts/
│   ├── hydra_training.py   # Hydra-powered training script (pure loguru)
│   └── test_hydra_config.py # Configuration validation
├── bin/
│   └── run                 # Docker runtime script (uv-powered)
├── logs/                   # Comprehensive logging system
│   ├── hydra_runs/         # Default Hydra runs with full logs
│   ├── experiments/        # Experiment-specific runs
│   └── tensorboard_logs/   # TensorBoard visualization data
├── data/                   # CIFAR-10 dataset storage
├── pyproject.toml          # Project dependencies (uv managed)
└── uv.lock                 # Locked dependencies

🎯 Hydra Configuration Management

This project uses Hydra for configuration management, enabling:

  • Hierarchical configurations with composable components
  • Experiment tracking with automatic logging
  • Parameter sweeps and multi-run experiments
  • Reproducible research with complete configuration capture
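
To make the composition concrete, the entry point for such a setup is usually a small Hydra-decorated function that instantiates whatever the composed config describes. The sketch below is a minimal, hypothetical version of what scripts/hydra_training.py could look like; the exact keys (cfg.model, cfg.data, cfg.training.trainer) and the Lightning import are assumptions based on the configs shown later in this README, not the repository's actual code.

# Hypothetical sketch of a Hydra entry point, not the repo's actual script.
# Assumes configs/config.yaml composes the model, data, and training groups.
import hydra
import pytorch_lightning as pl
from hydra.utils import instantiate
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="../configs", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Print the fully resolved configuration for reproducibility
    print(OmegaConf.to_yaml(cfg))

    # `_target_` in the YAML tells Hydra which class to build
    model = instantiate(cfg.model)
    datamodule = instantiate(cfg.data)

    trainer = pl.Trainer(
        max_epochs=cfg.training.max_epochs,
        accelerator=cfg.training.trainer.accelerator,
        devices=cfg.training.trainer.devices,
        precision=cfg.training.trainer.precision,
    )
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    main()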

🚀 Quick Start with Hydra

# Local training (no Docker) - recommended for development
bin/run train-local

# Override specific parameters
bin/run train-local model=cnn_autoencoder training.max_epochs=20

# Run pre-configured experiments
bin/run train-local --config-name=experiment/cnn_comparison

# Multi-run parameter sweeps
bin/run train-local model=nvtx_vanilla_autoencoder,cnn_autoencoder training.max_epochs=5,10,20 --multirun

# Profiling mode
bin/run train-local training=profiling

# Direct script usage (alternative)
uv run python scripts/hydra_training.py
uv run python scripts/hydra_training.py model=cnn_autoencoder training.max_epochs=20

📊 Configuration Examples

Model Configurations

# configs/model/cnn_autoencoder.yaml
model:
  _target_: src.encoding_101.models.cnn_autoencoder.NVTXCNNAutoencoder
  latent_dim: 128
  visualize_mar: true
  enable_nvtx: true

Training Configurations

# configs/training/default.yaml
max_epochs: 10
batch_size: 64
learning_rate: 1e-4
trainer:
  accelerator: auto
  devices: [0]
  precision: 16-mixed

Experiment Configurations

# configs/experiment/cnn_comparison.yaml
defaults:
  - /model: cnn_autoencoder
  - /training: default

model:
  latent_dim: 256  # Larger latent space
training:
  max_epochs: 20
  batch_size: 128
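
To sanity-check what a composed configuration resolves to without launching a training run, Hydra's compose API can be used from a short script or test. This is presumably the role of scripts/test_hydra_config.py, but the snippet below is only a hedged sketch of that idea; the relative config path and the override strings are illustrative assumptions.

# Sketch: print a fully composed config without training (illustrative only).
# Assumes execution from the scripts/ directory, so ../configs is the config root.
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(config_path="../configs", version_base=None):
    # Compose the base config plus CLI-style overrides
    cfg = compose(
        config_name="config",
        overrides=["model=cnn_autoencoder", "training.max_epochs=20"],
    )
    print(OmegaConf.to_yaml(cfg))  # shows exactly what training would receive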

📝 Pure Loguru Logging System

📁 Log Structure

logs/
├── hydra_runs/                    # Default configuration runs
│   └── [experiment_name]/
│       ├── .hydra/               # Complete Hydra metadata
│       │   ├── config.yaml       # Resolved configuration
│       │   ├── overrides.yaml    # CLI overrides used
│       │   └── hydra.yaml        # Hydra internal settings
│       └── [experiment_name].log # Full training logs
├── experiments/                   # Custom experiment runs
│   └── [experiment_name]/        # Same structure as above
└── tensorboard_logs/             # TensorBoard data

🎨 Log Output Examples

📝 Loguru logging configured - logs will be saved to: /path/to/log
🔮 Hydra Configuration:
🏗️ Instantiating model...
✅ Model class loaded: NVTXCNNAutoencoder
📊 Instantiating data module...
✅ Data module loaded: CIFAR10DataModule
🚀 Starting training process...
✅ Training complete!
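
The wiring behind this output is small: remove loguru's default stderr sink, then add sinks that point into the run directory Hydra created. The following is a hedged sketch of that setup (the actual script's file name, levels, and format may differ):

# Sketch: route loguru output into the current Hydra run directory.
# Must be called inside a @hydra.main context; sink options are assumptions.
import sys
from pathlib import Path

from hydra.core.hydra_config import HydraConfig
from loguru import logger


def setup_logging() -> Path:
    run_dir = Path(HydraConfig.get().runtime.output_dir)
    log_file = run_dir / f"{HydraConfig.get().job.name}.log"

    logger.remove()                                         # drop the default handler
    logger.add(sys.stderr, level="INFO")                    # readable console output
    logger.add(log_file, level="DEBUG", rotation="50 MB")   # full file log

    logger.info(f"📝 Loguru logging configured - logs will be saved to: {log_file}")
    return log_file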

🚀 Training Options

This project supports both local training (recommended for development) and Docker training (for full containerization).

🏠 Local Training (Recommended)

  • Faster iteration: No Docker overhead
  • Direct access: Work with your local environment
  • Easy debugging: Direct Python debugging
  • Resource efficient: Uses your system resources directly

# Setup once (installs uv and dependencies)
bin/run setup

# Train locally with Hydra
bin/run train-local
bin/run train-local model=cnn_autoencoder training.max_epochs=20

🐳 Docker Training (Full Containerization)

The project uses uv for ultra-fast Python package management in Docker, providing 10-100x faster dependency installation compared to pip.

  • Consistent environment: Same environment everywhere
  • GPU isolation: Containerized CUDA drivers
  • Production-ready: Matches deployment environment
  • Full profiling: Complete NVTX profiling support

# Train with Docker
bin/run hydra-train
bin/run hydra-train model=cnn_autoencoder training.max_epochs=20

Prerequisites

For Local Training (Recommended)

# One-time setup - installs uv and all dependencies
bin/run setup

For Docker Training (Optional)

The bin/run script will automatically install missing Docker dependencies:

  1. Docker & Docker Compose: Install Docker
    • Auto-installed: bin/run detects and installs Docker + Docker Compose automatically
  2. NVIDIA Docker Runtime: Install nvidia-docker2
    • Auto-installed: bin/run detects and installs NVIDIA Docker runtime automatically
  3. GPU Support: No host GPU drivers needed!
    • Container-only: All CUDA drivers and GPU libraries are provided by the NVIDIA PyTorch container
    • Server-safe: No destructive changes to your host system

💡 Quick Start: For most users, just run bin/run setup then bin/run train-local!
🐳 Docker Users: Run any bin/run hydra-train command and it will guide you through Docker setup!
🛡️ Server-Safe: All GPU drivers stay safely contained within Docker containers!

🏠 Local Development Quick Start

# One-time setup (installs uv and dependencies)
bin/run setup

# Basic training
bin/run train-local

# Run experiments
bin/run train-local --config-name=experiment/cnn_comparison

# Override parameters
bin/run train-local model=cnn_autoencoder training.max_epochs=20

# Multi-run parameter sweeps
bin/run train-local model=nvtx_vanilla_autoencoder,cnn_autoencoder --multirun

# Start TensorBoard
bin/run tensorboard-uv

# List available models
bin/run list-models-local

🐳 Docker Quick Start

# Basic training with Docker
bin/run hydra-train

# Run experiments with Docker
bin/run hydra-train --config-name=experiment/cnn_comparison

# Override parameters with Docker
bin/run hydra-train model=cnn_autoencoder training.max_epochs=20

# Multi-run parameter sweeps with Docker
bin/run hydra-train model=nvtx_vanilla_autoencoder,cnn_autoencoder --multirun

# Start TensorBoard with Docker
bin/run tensorboard

# Interactive development shell
bin/run shell

# List available models
bin/run list-models

🔧 Advanced Features

# Setup local development environment (required for local training)
bin/run setup

# Legacy training commands (still supported for backwards compatibility)
bin/run train 20 128        # Legacy: 20 epochs, batch size 128 (Docker)
bin/run train-nvtx 5 32     # Legacy: NVTX training, 5 epochs, batch size 32 (Docker)

# Development tools
bin/run shell               # Interactive development shell (Docker)
bin/run cleanup             # Clean up Docker resources
bin/run export-requirements # Export filtered requirements.txt
bin/run list-models         # List available models (Docker)
bin/run list-models-local   # List available models (local)

🔍 NVTX Profiling for Performance Optimization

NVTX (NVIDIA Tools Extension) annotations provide semantic information about your training loop, enabling detailed performance analysis and bottleneck identification.

🎨 NVTX Integration

All models support NVTX profiling through the NVTXProfilingMixin:

# Available NVTX-enabled models
from src.encoding_101.models.vanilla_autoencoder import NVTXVanillaAutoencoder
from src.encoding_101.models.cnn_autoencoder import NVTXCNNAutoencoder

# NVTX can be enabled/disabled per model
model = NVTXVanillaAutoencoder(enable_nvtx=True)
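
Conceptually, the mixin only needs to wrap blocks of work in NVTX ranges when enable_nvtx is set and do nothing otherwise. The snippet below is a simplified, hypothetical sketch of that pattern built on torch.cuda.nvtx; the real NVTXProfilingMixin in mixins/nvtx.py may differ in structure and naming.

# Simplified sketch of an NVTX-gating mixin (illustrative, not the repo's code).
from contextlib import contextmanager

import torch


class NVTXProfilingMixin:
    """Adds an opt-in nvtx_range helper to any model class."""

    def __init__(self, *args, enable_nvtx: bool = False, **kwargs):
        super().__init__(*args, **kwargs)
        # Only emit ranges when requested and a CUDA device is present
        self.enable_nvtx = enable_nvtx and torch.cuda.is_available()

    @contextmanager
    def nvtx_range(self, name: str):
        # Push a named range while profiling is enabled; otherwise a no-op
        if self.enable_nvtx:
            torch.cuda.nvtx.range_push(name)
        try:
            yield
        finally:
            if self.enable_nvtx:
                torch.cuda.nvtx.range_pop()

A model that inherits such a mixin can then wrap any region, e.g. with self.nvtx_range("Forward Pass"): ..., and the labelled range shows up on the Nsight Systems timeline.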

🎨 NVTX Color Scheme

Operation          Color     Purpose
Training Steps     Green     Forward/backward training passes
Validation         Blue      Validation forward passes
Forward Pass       Yellow    Model forward computation
Loss Computation   Orange    Loss calculation
Metrics            Purple    MAR@5 and other metrics
Visualization      Pink      Image processing and logging
Data Transfer      Cyan      CPU↔GPU data movement
Memory Ops         Magenta   Memory allocation tracking

📊 Using NVTX Profiling

# With Hydra configuration (local)
bin/run train-local training=profiling

# With Hydra configuration (Docker)
bin/run hydra-train training=profiling

# Direct script usage (local)
uv run python scripts/hydra_training.py training=profiling

# Legacy profiling with Docker (still supported)
bin/run profile

# Manual profiling in container
docker-compose run --rm profiling \
    nsys profile --trace=nvtx,cuda --output=/app/profiling_output/profile \
    uv run scripts/hydra_training.py training=profiling

# Analyze results on a local Mac/Windows machine: download the generated *.qdrep file and open it in the NVIDIA Nsight Systems GUI

NVTX Annotation Hierarchy

🚀 EPOCH N - TRAINING START
├── Training Step
│   ├── Data Unpack          [should be fast]
│   ├── Forward Pass         [main computation]
│   │   ├── Encoder Forward  [first half]
│   │   └── Decoder Forward  [second half]
│   └── Loss Computation     [should be minimal]
🔍 EPOCH N - VALIDATION START
├── Validation Step
│   ├── Val Forward Pass
│   ├── MAR@5 Computation    [periodic, can be expensive]
│   └── Visualization        [periodic, I/O bound]
✅ EPOCH N - VALIDATION END
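
That nesting falls out naturally when the ranges are opened inside the Lightning hooks. The toy example below shows the idea, reusing the hypothetical nvtx_range helper from the mixin sketch above; the encoder/decoder layers and the loss are stand-ins, not the repository's actual model.

# Toy example of how nested NVTX ranges produce the hierarchy shown above.
# Assumes the hypothetical NVTXProfilingMixin sketch from earlier is in scope.
import pytorch_lightning as pl
import torch
import torch.nn.functional as F


class ToyNVTXAutoencoder(NVTXProfilingMixin, pl.LightningModule):
    def __init__(self, enable_nvtx: bool = True):
        super().__init__(enable_nvtx=enable_nvtx)
        self.encoder = torch.nn.Linear(32 * 32 * 3, 128)  # stand-in encoder
        self.decoder = torch.nn.Linear(128, 32 * 32 * 3)  # stand-in decoder

    def training_step(self, batch, batch_idx):
        with self.nvtx_range("Training Step"):
            with self.nvtx_range("Data Unpack"):
                images, _ = batch
                flat = images.flatten(start_dim=1)

            with self.nvtx_range("Forward Pass"):
                with self.nvtx_range("Encoder Forward"):
                    latent = self.encoder(flat)
                with self.nvtx_range("Decoder Forward"):
                    reconstruction = self.decoder(latent)

            with self.nvtx_range("Loss Computation"):
                loss = F.mse_loss(reconstruction, flat)

        self.log("train_loss", loss)
        return loss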

🧪 Advanced Usage

Multi-Run Experiments

# Compare different models (local - recommended)
bin/run train-local model=nvtx_vanilla_autoencoder,cnn_autoencoder --multirun

# Compare different models (Docker)
bin/run hydra-train model=nvtx_vanilla_autoencoder,cnn_autoencoder --multirun

# Parameter sweeps (local - recommended)
bin/run train-local training.learning_rate=1e-3,1e-4,1e-5 model.latent_dim=64,128,256 --multirun

# Parameter sweeps (Docker)
bin/run hydra-train training.learning_rate=1e-3,1e-4,1e-5 model.latent_dim=64,128,256 --multirun

# Batch size optimization (local)
bin/run train-local training.batch_size=32,64,128,256 --multirun

# Batch size optimization (Docker)
bin/run hydra-train training.batch_size=32,64,128,256 --multirun

# Direct script usage (alternative)
uv run python scripts/hydra_training.py model=nvtx_vanilla_autoencoder,cnn_autoencoder --multirun

📚 Learning Resources


Need Help? Check the configuration examples or run:

# Main help and commands
bin/run help

# Local training help
bin/run train-local --help

# Docker training help
bin/run hydra-train --help

# List available models (local)
bin/run list-models-local

# List available models (Docker)
bin/run list-models

# Direct script help (alternative)
uv run python scripts/hydra_training.py --help
