This repository contains a comprehensive solution for the Stanford RNA 3D Folding Kaggle competition, implementing advanced machine learning techniques for predicting RNA 3D structures from sequence data. The project demonstrates expertise in bioinformatics, deep learning, and scientific computing.

firebitsbr/Stanford-RNA-3D-Folding

Stanford RNA 3D Folding: Kaggle Competition

License: MIT Python 3.13 PyTorch CUDA 12.8

Predicting 3D RNA structure from sequence using deep learning

Competition • Documentation • Notebooks



🎯 Overview

This repository contains a complete solution pipeline for the Stanford RNA 3D Folding Kaggle competition, which challenges participants to predict 3D atomic coordinates of RNA molecules from their nucleotide sequences.

Competition Goal: Predict the 3D spatial coordinates (x, y, z) for each nucleotide in RNA sequences to advance computational biology and drug discovery.

Approach: Deep learning models (LSTM- and Transformer-based) trained on RNA sequences with multiple sequence alignment (MSA) features.
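Before a sequence reaches an LSTM or Transformer it has to be turned into numbers. One common option, one-hot encoding, might look like the sketch below. The `one_hot_encode` helper and the `ACGU` alphabet are assumptions made for illustration; they are not taken from the repository's `RNADataProcessor`, and MSA features are left out.

```python
# Hypothetical helper for illustration; not the repository's RNADataProcessor.
NUCLEOTIDES = "ACGU"  # assumed alphabet; any other symbol maps to all zeros


def one_hot_encode(sequence):
    """Return one row of four 0/1 flags per nucleotide in the sequence."""
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == letter else 0 for letter in NUCLEOTIDES])
    return rows


encoded = one_hot_encode("GGAU")
for base, row in zip("GGAU", encoded):
    print(base, row)
```

Each row can then be fed to a model as a four-dimensional input vector, one per residue.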


✨ Key Features

  • 🧬 End-to-end RNA 3D structure prediction pipeline
  • 🚀 GPU-accelerated training (CUDA 12.8 support for sm_61+ architectures)
  • 📊 Interactive Jupyter notebooks for EDA, training, and submission
  • 🔄 Automated preprocessing with MSA feature extraction
  • 📤 One-click Kaggle submission with secure credential handling
  • 🧪 Production-ready model deployment with validation and post-processing
  • 📝 Comprehensive documentation and code examples
  • 🛠️ Modular architecture for easy experimentation

📁 Project Structure

Stanford-RNA-3D-Folding/
├── stanford_rna3d/              # Main project directory
│   ├── data/                    # Data storage
│   │   ├── raw/                 # Competition data (sequences, labels, MSA)
│   │   ├── processed/           # Preprocessed features
│   │   ├── interim/             # Intermediate processing files
│   │   └── external/            # External datasets
│   ├── notebooks/               # Jupyter notebooks
│   │   ├── 00_competition_overview.ipynb
│   │   ├── 01_eda.ipynb         # Exploratory data analysis
│   │   ├── 02_baseline.ipynb    # Baseline model training
│   │   ├── 03_advanced.ipynb    # Advanced architectures
│   │   └── 04_submission.ipynb  # Submission generation + Kaggle upload
│   ├── src/                     # Source code modules
│   │   ├── __init__.py
│   │   ├── models.py            # Neural network architectures
│   │   └── data_processing.py   # Data preprocessing utilities
│   ├── scripts/                 # Utility scripts
│   │   ├── 00_environment_manager.py
│   │   ├── 01_create_env.py     # Virtual environment setup
│   │   ├── 02_setup_project.py  # Project structure setup
│   │   ├── 03_submit_late.py    # CLI submission tool
│   │   ├── pii_scanner.py
│   │   ├── system_specs_checker.py
│   │   └── .venv/               # Virtual environment (after setup)
│   ├── checkpoints/             # Saved model weights
│   ├── submissions/             # Generated submission files
│   ├── configs/                 # Configuration files
│   ├── docs/                    # Detailed documentation
│   ├── tests/                   # Unit tests
│   ├── requirements.txt         # Python dependencies
│   └── Makefile                 # Build automation
├── scripts/                     # Root-level scripts
│   └── setup_dev_env.py
├── LICENSE                      # MIT License
├── README.md                    # This file
├── .gitignore
└── mypy.ini                     # Type checking config

🚀 Quick Start

Prerequisites

  • Python 3.13.5+ (Tested with Python 3.13.9)
  • CUDA 12.8 (for GPU support, optional)
  • Git
  • 8GB+ RAM (16GB+ recommended)
  • NVIDIA GPU with sm_61+ compute capability (e.g., GTX 1060 or better)

Installation

# Clone the repository
git clone https://github.com/maurorisonho/Stanford-RNA-3D-Folding.git
cd Stanford-RNA-3D-Folding

# Navigate to the scripts directory inside the project
cd stanford_rna3d/scripts

# Run automated setup (creates virtual environment and project structure)
python3.13 01_create_env.py
python3.13 02_setup_project.py

# Activate virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Navigate back to stanford_rna3d directory
cd ..

# Install all dependencies (PyTorch 2.9.1 with CUDA 12.8 support)
pip install -r requirements.txt

# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "from src import RNADataProcessor, SimpleRNAPredictor; print('✓ Modules imported successfully')"

Alternative: Manual Installation

cd Stanford-RNA-3D-Folding/stanford_rna3d

# Create virtual environment manually
python3.13 -m venv scripts/.venv
source scripts/.venv/bin/activate

# Install dependencies
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

Verify System Configuration

cd scripts
source .venv/bin/activate
python3.13 system_specs_checker.py

This will display your system specifications and verify that all required libraries are installed.

Download Competition Data

# Option 1: Manual download from Kaggle
# Visit https://www.kaggle.com/competitions/stanford-rna-3d-folding/data
# Download and extract to stanford_rna3d/data/raw/

# Option 2: Using Kaggle API (requires kaggle.json in ~/.kaggle/)
kaggle competitions download -c stanford-rna-3d-folding -p data/raw/
unzip data/raw/stanford-rna-3d-folding.zip -d data/raw/
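After downloading, it can help to confirm that the archive actually unpacked into `data/raw/`. The sketch below simply lists the CSV files it finds there. The `list_csv_files` helper is hypothetical, and no particular file names are assumed, since the competition's file list is not given here.

```python
from pathlib import Path


def list_csv_files(raw_dir):
    """Return the sorted names of CSV files directly inside raw_dir."""
    return sorted(p.name for p in Path(raw_dir).glob("*.csv"))


raw_dir = "data/raw"  # path taken from the download commands above
files = list_csv_files(raw_dir)
if files:
    print(f"Found {len(files)} CSV file(s): {', '.join(files)}")
else:
    print(f"No CSV files in {raw_dir}; check the download and unzip steps.")
```

Run it from the `stanford_rna3d` directory so that the relative path resolves.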

🔧 Environment Setup

Automated Setup (Recommended)

cd stanford_rna3d/scripts

# Step 1: Create virtual environment
python3.13 01_create_env.py

# Step 2: Set up project structure
python3.13 02_setup_project.py

# Step 3: Activate environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Step 4: Install dependencies
cd ..
pip install -r requirements.txt

Manual Setup

cd stanford_rna3d

# Create virtual environment in scripts directory
python3.13 -m venv scripts/.venv
source scripts/.venv/bin/activate

# Upgrade pip and install dependencies
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

Verify Installation

cd scripts
source .venv/bin/activate
python3.13 system_specs_checker.py

Expected output:

✓ Python 3.13.9
✓ PyTorch 2.9.1+cu128
✓ NumPy 2.3.5
✓ Pandas 2.3.3
✓ Matplotlib 3.10.7

Note: CUDA availability depends on your GPU drivers. The installed PyTorch version (2.9.1) supports CUDA 12.8.
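A version report like the one above can be approximated with the standard library alone. The sketch below is not the actual `system_specs_checker.py`; the `installed_versions` helper is hypothetical, and the distribution names are assumptions inferred from the expected output.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names are assumptions based on the expected output above.
report = installed_versions(["torch", "numpy", "pandas", "matplotlib"])
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Reading metadata this way avoids importing the packages themselves, so the check stays fast even when PyTorch is installed.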


💻 Usage

Starting Jupyter Lab

cd stanford_rna3d
source scripts/.venv/bin/activate
jupyter lab

This will open Jupyter Lab in your browser at http://localhost:8888.

1. Exploratory Data Analysis

Open and run notebooks/01_eda.ipynb to:

  • Explore RNA sequences and structure
  • Analyze label distributions
  • Visualize MSA features
  • Understand data characteristics

2. Train Baseline Model

Open and run notebooks/02_baseline.ipynb to:

  • Preprocess RNA sequences
  • Train a simple LSTM-based model
  • Evaluate model performance
  • Save trained model to checkpoints/

3. Train Advanced Model

Open and run notebooks/03_advanced.ipynb to:

  • Experiment with Transformer architectures
  • Implement attention mechanisms
  • Try graph neural networks
  • Compare model performance

4. Generate Submission

Open and run notebooks/04_submission.ipynb to:

  • Load best model from checkpoints/
  • Generate predictions for test set
  • Apply post-processing (smoothing, normalization)
  • Create submission CSV in submissions/
  • Upload to Kaggle (interactive button)
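The smoothing step mentioned above is not specified further, so the following is only one plausible reading: a centered moving average over consecutive residues. The `smooth_coordinates` function and its `window` parameter are hypothetical and are not taken from `04_submission.ipynb`.

```python
def smooth_coordinates(coords, window=3):
    """Centered moving average over a list of (x, y, z) tuples.

    Near the ends of the chain the window shrinks to the residues that exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(coords)):
        lo = max(0, i - half)
        hi = min(len(coords), i + half + 1)
        neighbours = coords[lo:hi]
        smoothed.append(tuple(sum(axis) / len(neighbours) for axis in zip(*neighbours)))
    return smoothed


raw = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 3.0, 0.0), (9.0, 0.0, 0.0)]
for point in smooth_coordinates(raw):
    print(point)
```

A small window damps single-residue outliers such as the `y = 3.0` spike in the example while leaving the overall trace of the backbone intact.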

🎮 GPU Support

CUDA Configuration

The project automatically detects and uses available GPUs. PyTorch is installed with CUDA 12.8 support for compatibility with compute capability 6.1+ GPUs (e.g., GTX 1060, RTX series).

Verification:

cd stanford_rna3d
source scripts/.venv/bin/activate
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda if torch.cuda.is_available() else \"N/A\"}')"

Expected output (with GPU):

PyTorch: 2.9.1+cu128
CUDA available: True
CUDA version: 12.8

Expected output (CPU only):

PyTorch: 2.9.1+cu128
CUDA available: False
CUDA version: N/A

Note: If CUDA is not available, ensure you have the appropriate NVIDIA drivers installed for your GPU.

Disable CUDA (CPU-only mode)

If you need to force CPU execution, hide all GPUs before PyTorch initializes CUDA, ideally before importing torch:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
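Put together, a CPU-only start-up might look like the sketch below. The `device` variable is illustrative; the `ImportError` fallback is there only so the snippet also runs where PyTorch is not installed.

```python
import os

# Hide every GPU before torch is imported; changing this after CUDA has been
# initialized may have no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

try:
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    # torch is not installed in this environment; fall back to a plain label.
    device = "cpu"

print(f"Using device: {device}")
```

With the variable set to an empty string, `torch.cuda.is_available()` returns `False`, so the snippet selects the CPU.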

📤 Kaggle Submission

Method 1: Interactive Notebook Upload (Recommended)

  1. Open notebooks/04_submission.ipynb
  2. Run all cells to generate submission
  3. Click the "🚀 Upload to Kaggle" button in the last cell
  4. Enter credentials when prompted (input is hidden for security)

Method 2: CLI Upload

python scripts/03_submit_late.py submissions/submission_20251103_215343.csv -m "First submission"

Method 3: Manual Upload

  1. Generate submission: Run 04_submission.ipynb
  2. Download CSV from submissions/
  3. Upload at: https://www.kaggle.com/competitions/stanford-rna-3d-folding/submit

Kaggle API Setup

Option A: Environment Variables (Recommended for security)

export KAGGLE_USERNAME="your_username"
export KAGGLE_KEY="your_api_key"

Option B: Kaggle JSON file

mkdir -p ~/.kaggle
# Create ~/.kaggle/kaggle.json with:
# {"username": "your_username", "key": "your_api_key"}
chmod 600 ~/.kaggle/kaggle.json
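A script that needs these credentials could look them up as sketched below. The `load_kaggle_credentials` helper is hypothetical, and checking the environment variables before the JSON file is a choice made for this sketch, not a statement about how `03_submit_late.py` or the Kaggle client resolves them.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(config_path=None):
    """Return (username, key), preferring environment variables over kaggle.json."""
    username = os.environ.get("KAGGLE_USERNAME")
    key = os.environ.get("KAGGLE_KEY")
    if username and key:
        return username, key

    path = Path(config_path) if config_path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        raise FileNotFoundError(f"No credentials in environment or at {path}")
    data = json.loads(path.read_text(encoding="utf-8"))
    return data["username"], data["key"]


try:
    user, _key = load_kaggle_credentials()
    print(f"Credentials found for {user}")  # never print the key itself
except FileNotFoundError as exc:
    print(exc)
```

Keeping the lookup in one function makes it easy to avoid logging or echoing the API key anywhere else in the code.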

⚠️ Security Note: Never commit kaggle.json or API keys to the repository!


📓 Notebooks

| Notebook | Description | Key Features |
|---|---|---|
| 00_competition_overview.ipynb | Competition introduction and objectives | Problem statement, evaluation metrics |
| 01_eda.ipynb | Exploratory data analysis | Sequence statistics, label distributions, visualizations |
| 02_baseline.ipynb | Baseline model training | Simple LSTM, data preprocessing, training loop |
| 03_advanced.ipynb | Advanced architectures | Transformers, attention mechanisms, GNNs |
| 04_submission.ipynb | Submission generation | Model loading, inference, post-processing, Kaggle upload |

📚 Documentation

Comprehensive documentation is available in stanford_rna3d/docs/.


📊 Performance

Hardware Requirements

  • Minimum: 8GB RAM, CPU-only (slow training)
  • Recommended: 16GB RAM, NVIDIA GPU with 4GB+ VRAM
  • Optimal: 32GB RAM, NVIDIA RTX GPU with 8GB+ VRAM

Training Times (Approximate)

| Configuration | Baseline Model | Advanced Model |
|---|---|---|
| CPU-only | ~4 hours | ~12 hours |
| GTX 1060 (6GB) | ~45 minutes | ~2 hours |
| RTX 3080 (10GB) | ~20 minutes | ~1 hour |

Evaluation Metric

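Model quality is reported here as the Mean Absolute Error (MAE) between predicted and true atomic coordinates:

$$ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $$

where $y_i$ and $\hat{y}_i$ are the true and predicted coordinate values. Note that the official Kaggle leaderboard for this competition is scored with TM-score, so MAE is best treated as a local proxy rather than the leaderboard metric.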
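The MAE formula can be sketched in plain Python as follows. The `mean_absolute_error` function is illustrative rather than the repository's implementation, and it reads $N$ as the number of individual coordinate values (three per residue), which is an interpretation of the formula.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE over two equally long sequences of (x, y, z) coordinates."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    errors = [
        abs(t - p)
        for true_xyz, pred_xyz in zip(y_true, y_pred)
        for t, p in zip(true_xyz, pred_xyz)
    ]
    return sum(errors) / len(errors)


true_coords = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
pred_coords = [(0.5, 0.0, 0.0), (1.0, 1.0, 3.0)]
print(mean_absolute_error(true_coords, pred_coords))  # 0.25
```

In the example, the absolute errors sum to 1.5 over six coordinate values, which gives 0.25.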


🤝 Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements.txt
pip install pytest mypy flake8 "black[jupyter]"  # the jupyter extra lets black format .ipynb files

# Run tests
pytest tests/

# Type checking
mypy src/

# Code formatting
black src/ notebooks/

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

MIT License

Copyright (c) 2025 Mauro Risonho de Paula Assumpção

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

📧 Contact

Author: Mauro Risonho de Paula Assumpção


🙏 Acknowledgments

  • Stanford University for hosting the competition
  • Kaggle for providing the platform
  • PyTorch Team for the deep learning framework
  • BioPython for bioinformatics utilities
  • The RNA structure prediction research community


⭐ If you find this project helpful, please consider giving it a star! ⭐

Made with ❤️ for the RNA structure prediction community
