Predicting 3D RNA structure from sequence using deep learning
Competition • Documentation • Notebooks
- Overview
- Key Features
- Project Structure
- Quick Start
- Environment Setup
- Usage
- GPU Support
- Kaggle Submission
- Notebooks
- Documentation
- Performance
- Contributing
- License
- Contact
This repository contains a complete solution pipeline for the Stanford RNA 3D Folding Kaggle competition, which challenges participants to predict 3D atomic coordinates of RNA molecules from their nucleotide sequences.
Competition Goal: Predict the 3D spatial coordinates (x, y, z) for each nucleotide in RNA sequences to advance computational biology and drug discovery.
Approach: Deep learning models (LSTM/Transformer-based) trained on RNA sequences with multiple sequence alignment (MSA) features.
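As a rough illustration of how a nucleotide sequence can be turned into numeric model input, the sketch below one-hot encodes an RNA string. The helper name `one_hot_encode` and the `ACGU` alphabet ordering are assumptions made for this example; they are not taken from `src/data_processing.py`, which may encode sequences differently and also adds MSA features.

```python
# Hypothetical helper for illustration only; not the project's actual preprocessing.
NUCLEOTIDES = "ACGU"  # assumed alphabet order


def one_hot_encode(sequence: str) -> list[list[int]]:
    """Return one row per nucleotide; unknown symbols (e.g. 'N') become all zeros."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("GGAUC")
print(len(encoded), encoded[0])  # 5 [0, 0, 1, 0]
```

Each row has four entries, so a sequence of length L becomes an L x 4 matrix that a recurrent or attention-based model can consume.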
- End-to-end RNA 3D structure prediction pipeline
- GPU-accelerated training (CUDA 12.8 support for sm_61+ architectures)
- Interactive Jupyter notebooks for EDA, training, and submission
- Automated preprocessing with MSA feature extraction
- One-click Kaggle submission with secure credential handling
- Production-ready model deployment with validation and post-processing
- Comprehensive documentation and code examples
- Modular architecture for easy experimentation
Stanford-RNA-3D-Folding/
├── stanford_rna3d/                    # Main project directory
│   ├── data/                          # Data storage
│   │   ├── raw/                       # Competition data (sequences, labels, MSA)
│   │   ├── processed/                 # Preprocessed features
│   │   ├── interim/                   # Intermediate processing files
│   │   └── external/                  # External datasets
│   ├── notebooks/                     # Jupyter notebooks
│   │   ├── 00_competition_overview.ipynb
│   │   ├── 01_eda.ipynb               # Exploratory data analysis
│   │   ├── 02_baseline.ipynb          # Baseline model training
│   │   ├── 03_advanced.ipynb          # Advanced architectures
│   │   └── 04_submission.ipynb        # Submission generation + Kaggle upload
│   ├── src/                           # Source code modules
│   │   ├── __init__.py
│   │   ├── models.py                  # Neural network architectures
│   │   └── data_processing.py         # Data preprocessing utilities
│   ├── scripts/                       # Utility scripts
│   │   ├── 00_environment_manager.py
│   │   ├── 01_create_env.py           # Virtual environment setup
│   │   ├── 02_setup_project.py        # Project structure setup
│   │   ├── 03_submit_late.py          # CLI submission tool
│   │   ├── pii_scanner.py
│   │   ├── system_specs_checker.py
│   │   └── .venv/                     # Virtual environment (after setup)
│   ├── checkpoints/                   # Saved model weights
│   ├── submissions/                   # Generated submission files
│   ├── configs/                       # Configuration files
│   ├── docs/                          # Detailed documentation
│   ├── tests/                         # Unit tests
│   ├── requirements.txt               # Python dependencies
│   └── Makefile                       # Build automation
├── scripts/                           # Root-level scripts
│   └── setup_dev_env.py
├── LICENSE                            # MIT License
├── README.md                          # This file
├── .gitignore
└── mypy.ini                           # Type checking config
- Python 3.13.5+ (Tested with Python 3.13.9)
- CUDA 12.8 (for GPU support, optional)
- Git
- 8GB+ RAM (16GB+ recommended)
- NVIDIA GPU with sm_61+ compute capability (e.g., GTX 1060 or better)
# Clone the repository
git clone https://github.com/maurorisonho/Stanford-RNA-3D-Folding.git
cd Stanford-RNA-3D-Folding
# Navigate to the project directory
cd stanford_rna3d/scripts
# Run automated setup (creates virtual environment and project structure)
python3.13 01_create_env.py
python3.13 02_setup_project.py
# Activate virtual environment
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Navigate back to stanford_rna3d directory
cd ..
# Install all dependencies (PyTorch 2.9.1 with CUDA 12.8 support)
pip install -r requirements.txt
# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "from src import RNADataProcessor, SimpleRNAPredictor; print('✓ Modules imported successfully')"
cd Stanford-RNA-3D-Folding/stanford_rna3d
# Create virtual environment manually
python3.13 -m venv scripts/.venv
source scripts/.venv/bin/activate
# Install dependencies
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
cd scripts
source .venv/bin/activate
python3.13 system_specs_checker.py
This will display your system specifications and verify that all required libraries are installed.
# Option 1: Manual download from Kaggle
# Visit https://www.kaggle.com/competitions/stanford-rna-3d-folding/data
# Download and extract to stanford_rna3d/data/raw/
# Option 2: Using Kaggle API (requires kaggle.json in ~/.kaggle/)
kaggle competitions download -c stanford-rna-3d-folding -p data/raw/
unzip data/raw/stanford-rna-3d-folding.zip -d data/raw/
cd stanford_rna3d/scripts
# Step 1: Create virtual environment
python3.13 01_create_env.py
# Step 2: Set up project structure
python3.13 02_setup_project.py
# Step 3: Activate environment
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Step 4: Install dependencies
cd ..
pip install -r requirements.txt
cd stanford_rna3d
# Create virtual environment in scripts directory
python3.13 -m venv scripts/.venv
source scripts/.venv/bin/activate
# Upgrade pip and install dependencies
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
cd scripts
source .venv/bin/activate
python3.13 system_specs_checker.py
Expected output:
✓ Python 3.13.9
✓ PyTorch 2.9.1+cu128
✓ NumPy 2.3.5
✓ Pandas 2.3.3
✓ Matplotlib 3.10.7
Note: CUDA availability depends on your GPU drivers. The installed PyTorch version (2.9.1) supports CUDA 12.8.
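A check of this kind can be approximated with the standard library alone. The sketch below is not the code of `system_specs_checker.py`; the function name `installed_versions` and the package list are assumptions chosen to mirror the expected output shown above.

```python
import importlib.metadata
import platform

# Assumed package list; the real checker may inspect a different set.
REQUIRED = ["torch", "numpy", "pandas", "matplotlib"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print(f"Python {platform.python_version()} on {platform.system()}")
    for name, version in installed_versions(REQUIRED).items():
        status = version if version is not None else "NOT INSTALLED"
        print(f"{name}: {status}")
```

Reading versions from package metadata avoids importing heavy libraries such as PyTorch just to confirm that they are present.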
cd stanford_rna3d
source scripts/.venv/bin/activate
jupyter lab
This will open Jupyter Lab in your browser at http://localhost:8888.
Open and run notebooks/01_eda.ipynb to:
- Explore RNA sequences and structure
- Analyze label distributions
- Visualize MSA features
- Understand data characteristics
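The kind of sequence statistics explored in the EDA notebook can be sketched in a few lines. The function `sequence_stats` and the toy input are hypothetical and only illustrate the idea; the notebook itself works on the competition files and may compute different summaries.

```python
from collections import Counter


def sequence_stats(sequences):
    """Summarise lengths and nucleotide composition for a list of RNA strings."""
    if not sequences:
        raise ValueError("need at least one sequence")
    lengths = [len(seq) for seq in sequences]
    counts = Counter("".join(sequences).upper())
    total = sum(counts.values())
    return {
        "count": len(sequences),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": sum(lengths) / len(lengths),
        # Fraction of each symbol across all sequences
        "composition": {base: counts[base] / total for base in sorted(counts)},
        "gc_content": (counts["G"] + counts["C"]) / total if total else 0.0,
    }


stats = sequence_stats(["GGCC", "AU"])  # toy data, not competition sequences
print(stats["mean_length"], round(stats["gc_content"], 3))
```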
Open and run notebooks/02_baseline.ipynb to:
- Preprocess RNA sequences
- Train a simple LSTM-based model
- Evaluate model performance
- Save trained model to checkpoints/
Open and run notebooks/03_advanced.ipynb to:
- Experiment with Transformer architectures
- Implement attention mechanisms
- Try graph neural networks
- Compare model performances
Open and run notebooks/04_submission.ipynb to:
- Load best model from checkpoints/
- Generate predictions for test set
- Apply post-processing (smoothing, normalization)
- Create submission CSV in submissions/
- Upload to Kaggle (interactive button)
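The smoothing step mentioned in the list above could, for example, be a moving average over consecutive residues. This is a minimal sketch under that assumption; the notebook's actual post-processing is not shown in this README, and the name `smooth_coordinates` and the default window size are illustrative choices.

```python
def smooth_coordinates(coords, window=3):
    """Centered moving average along the chain; the window shrinks at both ends.

    coords is a sequence of (x, y, z) tuples, one per nucleotide.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(coords)):
        lo, hi = max(0, i - half), min(len(coords), i + half + 1)
        neighbours = coords[lo:hi]
        smoothed.append(
            tuple(sum(p[axis] for p in neighbours) / len(neighbours) for axis in range(3))
        )
    return smoothed


raw = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
print(smooth_coordinates(raw))
```

A larger window removes more noise but also flattens genuine bends in the backbone, so the window size is worth validating against held-out structures.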
The project automatically detects and uses available GPUs. PyTorch is installed with CUDA 12.8 support for compatibility with compute capability 6.1+ GPUs (e.g., GTX 1060, RTX series).
Verification:
cd stanford_rna3d
source scripts/.venv/bin/activate
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda if torch.cuda.is_available() else \"N/A\"}')"
Expected output (with GPU):
PyTorch: 2.9.1+cu128
CUDA available: True
CUDA version: 12.8
Expected output (CPU only):
PyTorch: 2.9.1+cu128
CUDA available: False
CUDA version: N/A
Note: If CUDA is not available, ensure you have the appropriate NVIDIA drivers installed for your GPU.
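Automatic device detection of the kind described above is commonly written as a small helper. The sketch below is an assumption about how this could look, not the project's code; `pick_device` is a hypothetical name, and it falls back to the CPU when PyTorch is not installed.

```python
import os


def pick_device(force_cpu: bool = False) -> str:
    """Return "cuda" when a GPU is usable, otherwise "cpu"."""
    if force_cpu:
        # Hiding all devices is only reliable before CUDA has been initialised.
        os.environ["CUDA_VISIBLE_DEVICES"] = ""
        return "cpu"
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Using device: {pick_device()}")
```

The returned string can be passed to `torch.device(...)` or to a model's `.to(...)` call.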
If you need to force CPU execution:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
- Open notebooks/04_submission.ipynb
- Run all cells to generate submission
- Click the "Upload to Kaggle" button in the last cell
- Enter credentials when prompted (input is hidden for security)
python scripts/03_submit_late.py submissions/submission_20251103_215343.csv -m "First submission"
- Generate submission: Run 04_submission.ipynb
- Download CSV from submissions/
- Upload at: https://www.kaggle.com/competitions/stanford-rna-3d-folding/submit
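For the command-line route, a submission script can wrap the official Kaggle CLI command `kaggle competitions submit`. The sketch below shows one way to do that; it is not the implementation of `03_submit_late.py`, and the function names and the `dry_run` default are assumptions for illustration.

```python
import shlex
import subprocess
from pathlib import Path

COMPETITION = "stanford-rna-3d-folding"  # slug used in the competition URLs above


def build_submit_command(csv_path, message):
    """Build the argument list for `kaggle competitions submit`."""
    path = Path(csv_path)
    if path.suffix.lower() != ".csv":
        raise ValueError(f"expected a .csv file, got: {path}")
    return [
        "kaggle", "competitions", "submit",
        "-c", COMPETITION,
        "-f", str(path),
        "-m", message,
    ]


def submit(csv_path, message, dry_run=True):
    """Print the command when dry_run is set; otherwise run it and return its exit code."""
    command = build_submit_command(csv_path, message)
    if dry_run:
        print(shlex.join(command))
        return 0
    return subprocess.run(command, check=False).returncode


submit("submissions/submission_20251103_215343.csv", "First submission")
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the message contains spaces or special characters.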
Option A: Environment Variables (Recommended for security)
export KAGGLE_USERNAME="your_username"
export KAGGLE_KEY="your_api_key"
Option B: Kaggle JSON file
mkdir -p ~/.kaggle
# Create ~/.kaggle/kaggle.json with:
# {"username": "your_username", "key": "your_api_key"}
chmod 600 ~/.kaggle/kaggle.json
Never commit kaggle.json or API keys to the repository!
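The two credential options can be combined in code by preferring environment variables and falling back to the JSON file. The following is a sketch under that assumption; `load_kaggle_credentials` is a hypothetical helper, not part of the Kaggle package or of this repository, and the permission check simply mirrors the `chmod 600` advice above.

```python
import json
import os
import stat
from pathlib import Path


def load_kaggle_credentials(config_path=None):
    """Return credentials from KAGGLE_USERNAME/KAGGLE_KEY, else from kaggle.json."""
    username = os.environ.get("KAGGLE_USERNAME")
    key = os.environ.get("KAGGLE_KEY")
    if username and key:
        return {"username": username, "key": key, "source": "environment"}

    path = Path(config_path) if config_path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        raise FileNotFoundError(f"no environment credentials and {path} does not exist")

    # On POSIX systems, refuse files that group or other users can access.
    mode = stat.S_IMODE(path.stat().st_mode)
    if os.name == "posix" and mode & 0o077:
        raise PermissionError(f"{path} is accessible to other users; run: chmod 600 {path}")

    data = json.loads(path.read_text(encoding="utf-8"))
    return {"username": data["username"], "key": data["key"], "source": str(path)}
```

A caller would typically invoke `load_kaggle_credentials()` once at start-up and avoid printing or logging the returned key.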
| Notebook | Description | Key Features |
|---|---|---|
| 00_competition_overview.ipynb | Competition introduction and objectives | Problem statement, evaluation metrics |
| 01_eda.ipynb | Exploratory data analysis | Sequence statistics, label distributions, visualizations |
| 02_baseline.ipynb | Baseline model training | Simple LSTM, data preprocessing, training loop |
| 03_advanced.ipynb | Advanced architectures | Transformers, attention mechanisms, GNNs |
| 04_submission.ipynb | Submission generation | Model loading, inference, post-processing, Kaggle upload |
Comprehensive documentation is available in stanford_rna3d/docs/:
- ENVIRONMENT_SETUP.md: Environment configuration guide
- DATA_DOWNLOAD.md: Data acquisition instructions
- EXECUTION_PIPELINE.md: End-to-end workflow
- TECHNICAL_DETAILS.md: Model architectures and algorithms
- SOLUTION_WRITEUP.md: Complete solution documentation
- WORKFLOW_README.md: Development workflow
- pii_scanner_README.md: Security scanning tools
- Minimum: 8GB RAM, CPU-only (slow training)
- Recommended: 16GB RAM, NVIDIA GPU with 4GB+ VRAM
- Optimal: 32GB RAM, NVIDIA RTX GPU with 8GB+ VRAM
| Configuration | Baseline Model | Advanced Model |
|---|---|---|
| CPU-only | ~4 hours | ~12 hours |
| GTX 1060 (6GB) | ~45 minutes | ~2 hours |
| RTX 3080 (10GB) | ~20 minutes | ~1 hour |
The competition uses Mean Absolute Error (MAE) between predicted and true atomic coordinates.
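The formula that originally accompanied this statement is not present here. As an assumption, a common definition of MAE over 3D coordinates averages the absolute error across every x, y and z value, which can be sketched as follows. The function name is illustrative, and the official metric should be confirmed on the competition's evaluation page.

```python
def mean_absolute_error(predicted, target):
    """Average |prediction - truth| over all x, y, z values of all nucleotides."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        raise ValueError("need at least one coordinate triple")
    total = 0.0
    for pred_point, true_point in zip(predicted, target):
        total += sum(abs(p - t) for p, t in zip(pred_point, true_point))
    return total / (3 * len(predicted))


pred = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
true = [(1.0, 0.0, 0.0), (1.0, 3.0, 1.0)]
print(mean_absolute_error(pred, true))  # 0.5
```

Note that a plain coordinate-wise error depends on how the predicted structure is positioned and oriented relative to the reference, so structures are usually compared in a common frame.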
Contributions are welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Commit changes: git commit -m 'Add amazing feature'
- Push to branch: git push origin feature/amazing-feature
- Open a Pull Request
# Install development dependencies
pip install -r requirements.txt
pip install pytest mypy flake8 black
# Run tests
pytest tests/
# Type checking
mypy src/
# Code formatting
black src/ notebooks/
This project is licensed under the MIT License; see the LICENSE file for details.
MIT License
Copyright (c) 2025 Mauro Risonho de Paula Assumpção
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Author: Mauro Risonho de Paula Assumpção
- GitHub: @maurorisonho
- Kaggle: Stanford RNA 3D Folding Competition
- Stanford University for hosting the competition
- Kaggle for providing the platform
- PyTorch Team for the deep learning framework
- BioPython for bioinformatics utilities
- The RNA structure prediction research community
⭐ If you find this project helpful, please consider giving it a star! ⭐
Made with ❤️ for the RNA structure prediction community