Learnings from implementing deep learning models from scratch with configuration management (using Hydra) and NVTX profiling.
ai-learning/
├── src/encoding_101/
│   ├── models/                       # Autoencoder implementations
│   │   ├── base.py                   # Base autoencoder class with MAR@5 visualization
│   │   ├── vanilla_autoencoder.py    # Vanilla + NVTX-enabled autoencoders
│   │   └── cnn_autoencoder.py        # CNN + NVTX-enabled autoencoders
│   ├── mixins/                       # Reusable functionality
│   │   └── nvtx.py                   # NVTX profiling mixin for any model
│   ├── training/                     # Training utilities
│   │   └── trainer.py                # Main training orchestration
│   ├── data.py                       # CIFAR-10 data module (Lightning)
│   ├── metrics.py                    # Evaluation metrics (MAR@5)
│   ├── utils.py                      # Visualization utilities
│   └── visualization/                # Advanced plotting and analysis tools
├── configs/                          # Hydra configuration management
│   ├── config.yaml                   # Main configuration with defaults
│   ├── model/                        # Model configurations
│   │   ├── nvtx_vanilla_autoencoder.yaml
│   │   └── cnn_autoencoder.yaml
│   ├── training/                     # Training configurations
│   │   ├── default.yaml
│   │   └── profiling.yaml
│   ├── data/                         # Data module configurations
│   │   └── cifar10.yaml
│   ├── profiling/                    # NVTX profiling settings
│   │   └── default.yaml
│   └── experiment/                   # Pre-configured experiments
│       └── cnn_comparison.yaml
├── scripts/
│   ├── hydra_training.py             # Hydra-powered training script (pure loguru)
│   └── test_hydra_config.py          # Configuration validation
├── bin/
│   └── run                           # Docker runtime script (uv-powered)
├── logs/                             # Comprehensive logging system
│   ├── hydra_runs/                   # Default Hydra runs with full logs
│   ├── experiments/                  # Experiment-specific runs
│   └── tensorboard_logs/             # TensorBoard visualization data
├── data/                             # CIFAR-10 dataset storage
├── pyproject.toml                    # Project dependencies (uv managed)
└── uv.lock                           # Locked dependencies
This project uses Hydra for configuration management, enabling:
- Hierarchical configurations with composable components
- Experiment tracking with automatic logging
- Parameter sweeps and multi-run experiments
- Reproducible research with complete configuration capture
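As a rough sketch of how a Hydra entry point like this typically looks (illustrative names only; scripts/hydra_training.py is the authoritative version and its exact contents may differ):

```python
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="../configs", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Hydra hands us the fully composed configuration:
    # defaults from configs/ plus any command-line overrides.
    print(OmegaConf.to_yaml(cfg))

    # Objects declared with _target_ in the YAML are built directly from config.
    model = hydra.utils.instantiate(cfg.model)
    datamodule = hydra.utils.instantiate(cfg.data)

    # ... hand model and datamodule to the Lightning trainer configured by cfg.training ...


if __name__ == "__main__":
    main()
```

Because everything the run needs lives in `cfg`, any value can be overridden from the command line (as in the examples below) without code changes.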
# Local training (no Docker) - recommended for development
bin/run train-local
# Override specific parameters
bin/run train-local model=cnn_autoencoder training.max_epochs=20
# Run pre-configured experiments
bin/run train-local --config-name=experiment/cnn_comparison
# Multi-run parameter sweeps
bin/run train-local model=nvtx_vanilla_autoencoder,cnn_autoencoder training.max_epochs=5,10,20 --multirun
# Profiling mode
bin/run train-local training=profiling
# Direct script usage (alternative)
uv run python scripts/hydra_training.py
uv run python scripts/hydra_training.py model=cnn_autoencoder training.max_epochs=20

# configs/model/cnn_autoencoder.yaml
model:
  _target_: src.encoding_101.models.cnn_autoencoder.NVTXCNNAutoencoder
  latent_dim: 128
  visualize_mar: true
  enable_nvtx: true

# configs/training/default.yaml
max_epochs: 10
batch_size: 64
learning_rate: 1e-4
trainer:
  accelerator: auto
  devices: [0]
  precision: 16-mixed

# configs/experiment/cnn_comparison.yaml
defaults:
- /model: cnn_autoencoder
- /training: default
model:
  latent_dim: 256  # Larger latent space

training:
  max_epochs: 20
  batch_size: 128

logs/
├── hydra_runs/                      # Default configuration runs
│   └── [experiment_name]/
│       ├── .hydra/                  # Complete Hydra metadata
│       │   ├── config.yaml          # Resolved configuration
│       │   ├── overrides.yaml       # CLI overrides used
│       │   └── hydra.yaml           # Hydra internal settings
│       └── [experiment_name].log    # Full training logs
├── experiments/                     # Custom experiment runs
│   └── [experiment_name]/           # Same structure as above
└── tensorboard_logs/                # TensorBoard data
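Because each run's resolved configuration and CLI overrides are captured under `.hydra/`, a past run can be inspected (or re-created) directly from those files. A minimal sketch, assuming a hypothetical run-directory name:

```python
from omegaconf import OmegaConf

# Hypothetical run folder; substitute a real directory under logs/hydra_runs/
run_dir = "logs/hydra_runs/2024-01-01_12-00-00"

cfg = OmegaConf.load(f"{run_dir}/.hydra/config.yaml")           # fully resolved config
overrides = OmegaConf.load(f"{run_dir}/.hydra/overrides.yaml")  # CLI overrides used

print(OmegaConf.to_yaml(cfg))
print(list(overrides))  # e.g. ["model=cnn_autoencoder", "training.max_epochs=20"]
```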
📝 Loguru logging configured - logs will be saved to: /path/to/log
🔮 Hydra Configuration:
🏗️ Instantiating model...
✅ Model class loaded: NVTXCNNAutoencoder
📦 Instantiating data module...
✅ Data module loaded: CIFAR10DataModule
🚀 Starting training process...
✅ Training complete!
This project supports both local training (recommended for development) and Docker training (for full containerization).

Local training advantages:
- Faster iteration: No Docker overhead
- Direct access: Work with your local environment
- Easy debugging: Direct Python debugging
- Resource efficient: Uses your system resources directly
# Setup once (installs uv and dependencies)
bin/run setup
# Train locally with Hydra
bin/run train-local
bin/run train-local model=cnn_autoencoder training.max_epochs=20

The project uses uv for ultra-fast Python package management in Docker, providing 10-100x faster dependency installation compared to pip.

Docker training advantages:
- Consistent environment: Same environment everywhere
- GPU isolation: Containerized CUDA drivers
- Production-ready: Matches deployment environment
- Full profiling: Complete NVTX profiling support
# Train with Docker
bin/run hydra-train
bin/run hydra-train model=cnn_autoencoder training.max_epochs=20

# One-time setup - installs uv and all dependencies
bin/run setup

The bin/run script will automatically install missing Docker dependencies:
- Docker & Docker Compose: Auto-installed. `bin/run` detects and installs Docker + Docker Compose automatically (manual alternative: Install Docker).
- NVIDIA Docker Runtime: Auto-installed. `bin/run` detects and installs the NVIDIA Docker runtime automatically (manual alternative: install nvidia-docker2).
- GPU Support: No host GPU drivers needed!
  - Container-only: All CUDA drivers and GPU libraries are provided by the NVIDIA PyTorch container.
  - Server-safe: No destructive changes to your host system.
💡 Quick Start: For most users, just run `bin/run setup` then `bin/run train-local`!

🐳 Docker Users: Run any `bin/run hydra-train` command and it will guide you through Docker setup!

🛡️ Server-Safe: All GPU drivers stay safely contained within Docker containers!
# One-time setup (installs uv and dependencies)
bin/run setup
# Basic training
bin/run train-local
# Run experiments
bin/run train-local --config-name=experiment/cnn_comparison
# Override parameters
bin/run train-local model=cnn_autoencoder training.max_epochs=20
# Multi-run parameter sweeps
bin/run train-local model=nvtx_vanilla_autoencoder,cnn_autoencoder --multirun
# Start TensorBoard
bin/run tensorboard-uv
# List available models
bin/run list-models-local

# Basic training with Docker
bin/run hydra-train
# Run experiments with Docker
bin/run hydra-train --config-name=experiment/cnn_comparison
# Override parameters with Docker
bin/run hydra-train model=cnn_autoencoder training.max_epochs=20
# Multi-run parameter sweeps with Docker
bin/run hydra-train model=nvtx_vanilla_autoencoder,cnn_autoencoder --multirun
# Start TensorBoard with Docker
bin/run tensorboard
# Interactive development shell
bin/run shell
# List available models
bin/run list-models

# Setup local development environment (required for local training)
bin/run setup
# Legacy training commands (still supported for backwards compatibility)
bin/run train 20 128 # Legacy: 20 epochs, batch size 128 (Docker)
bin/run train-nvtx 5 32 # Legacy: NVTX training, 5 epochs, batch size 32 (Docker)
# Development tools
bin/run shell # Interactive development shell (Docker)
bin/run cleanup # Clean up Docker resources
bin/run export-requirements # Export filtered requirements.txt
bin/run list-models # List available models (Docker)
bin/run list-models-local       # List available models (local)

NVTX (NVIDIA Tools Extension) annotations provide semantic information about your training loop, enabling detailed performance analysis and bottleneck identification.
All models support NVTX profiling through the NVTXProfilingMixin (a minimal sketch of this pattern follows the color table below):
# Available NVTX-enabled models
from src.encoding_101.models.vanilla_autoencoder import NVTXVanillaAutoencoder
from src.encoding_101.models.cnn_autoencoder import NVTXCNNAutoencoder
# NVTX can be enabled/disabled per model
model = NVTXVanillaAutoencoder(enable_nvtx=True)

| Operation | Color | Purpose |
|---|---|---|
| Training Steps | Green | Forward/backward training passes |
| Validation | Blue | Validation forward passes |
| Forward Pass | Yellow | Model forward computation |
| Loss Computation | Orange | Loss calculation |
| Metrics | Purple | MAR@5 and other metrics |
| Visualization | Pink | Image processing and logging |
| Data Transfer | Cyan | CPU→GPU data movement |
| Memory Ops | Magenta | Memory allocation tracking |
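To make the mixin pattern concrete, here is a minimal sketch of how NVTX ranges can wrap Lightning hooks. The class and method names below are illustrative assumptions, not the actual contents of src/encoding_101/mixins/nvtx.py:

```python
import torch


class SimpleNVTXMixin:
    """Illustrative mixin (hypothetical): wraps training steps in NVTX ranges."""

    def __init__(self, *args, enable_nvtx: bool = True, **kwargs):
        super().__init__(*args, **kwargs)
        # Only emit ranges when requested and when CUDA is actually available.
        self.enable_nvtx = enable_nvtx and torch.cuda.is_available()

    def _nvtx_push(self, name: str) -> None:
        if self.enable_nvtx:
            torch.cuda.nvtx.range_push(name)

    def _nvtx_pop(self) -> None:
        if self.enable_nvtx:
            torch.cuda.nvtx.range_pop()

    def training_step(self, batch, batch_idx):
        self._nvtx_push(f"Training Step {batch_idx}")
        try:
            # Delegate to the real model's training_step; the mixin only annotates.
            return super().training_step(batch, batch_idx)
        finally:
            self._nvtx_pop()
```

An NVTX-enabled model would then inherit the mixin before the base autoencoder (e.g. `class NVTXVanillaAutoencoder(SimpleNVTXMixin, VanillaAutoencoder)`) so the annotated step runs around the original one; the real project classes may be structured differently.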
# With Hydra configuration (local)
bin/run train-local training=profiling
# With Hydra configuration (Docker)
bin/run hydra-train training=profiling
# Direct script usage (local)
uv run python scripts/hydra_training.py training=profiling
# Legacy profiling with Docker (still supported)
bin/run profile
# Manual profiling in container
docker-compose run --rm profiling \
nsys profile --trace=nvtx,cuda --output=/app/profiling_output/profile \
uv run scripts/hydra_training.py training=profiling
# Analyze results on a local Mac/Windows machine: download the generated *.qdrep file and open it in NVIDIA Nsight Systems

🚀 EPOCH N - TRAINING START
├── Training Step
│   ├── Data Unpack           [should be fast]
│   ├── Forward Pass          [main computation]
│   │   ├── Encoder Forward   [first half]
│   │   └── Decoder Forward   [second half]
│   └── Loss Computation      [should be minimal]
🔍 EPOCH N - VALIDATION START
├── Validation Step
│   ├── Val Forward Pass
│   ├── MAR@5 Computation     [periodic, can be expensive]
│   └── Visualization         [periodic, I/O bound]
✅ EPOCH N - VALIDATION END
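If the named ranges above are not fine-grained enough, PyTorch can additionally emit an NVTX range for every autograd operator, and these show up in the same nsys timeline. A small standalone sketch (assumes a CUDA-capable environment; not part of the training script):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
x = torch.randn(64, 512, device="cuda")

# Every autograd op inside this context is wrapped in its own NVTX range,
# so per-operator timing appears alongside the custom ranges when profiling with nsys.
with torch.autograd.profiler.emit_nvtx():
    y = model(x)
    y.sum().backward()
```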
# Compare different models (local - recommended)
bin/run train-local model=nvtx_vanilla_autoencoder,cnn_autoencoder --multirun
# Compare different models (Docker)
bin/run hydra-train model=nvtx_vanilla_autoencoder,cnn_autoencoder --multirun
# Parameter sweeps (local - recommended)
bin/run train-local training.learning_rate=1e-3,1e-4,1e-5 model.latent_dim=64,128,256 --multirun
# Parameter sweeps (Docker)
bin/run hydra-train training.learning_rate=1e-3,1e-4,1e-5 model.latent_dim=64,128,256 --multirun
# Batch size optimization (local)
bin/run train-local training.batch_size=32,64,128,256 --multirun
# Batch size optimization (Docker)
bin/run hydra-train training.batch_size=32,64,128,256 --multirun
# Direct script usage (alternative)
uv run python scripts/hydra_training.py model=nvtx_vanilla_autoencoder,cnn_autoencoder --multirun

Additional resources:

- Hydra Documentation - Configuration management framework
- Loguru Documentation - Beautiful Python logging
- uv Documentation - Ultra-fast Python package manager
- uv Docker Integration - Best practices for uv in Docker
- NVIDIA Nsight Systems - Visual profiler for analyzing results
- PyTorch Lightning - Training framework used
Need Help? Check the configuration examples or run:
# Main help and commands
bin/run help
# Local training help
bin/run train-local --help
# Docker training help
bin/run hydra-train --help
# List available models (local)
bin/run list-models-local
# List available models (Docker)
bin/run list-models
# Direct script help (alternative)
uv run python scripts/hydra_training.py --help