An ensemble machine learning solution for predicting melting points of organic compounds from SMILES strings and molecular descriptors. This repository contains my solution to the Kaggle "Thermophysical Property: Melting Point" competition.
Predicting the melting point of organic molecules is a long-standing challenge in chemistry and chemical engineering. Melting point is critical for drug design, material selection, and process safety, yet experimental measurements are often costly, time-consuming, or unavailable.
Build ML models that predict the melting point (in Kelvin) of organic compounds from their molecular descriptors.
Mean Absolute Error (MAE) - Lower is better.
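For measured melting points $y_i$ and predictions $\hat{y}_i$ over $n$ compounds:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$$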
| Split | Samples | Percentage |
|---|---|---|
| Train | 2,662 | 80% |
| Test | 666 | 20% |
| Total | 3,328 | 100% |
- `train.csv`: Features (SMILES) + target (`Tm`)
- `test.csv`: Features only, no target
- `sample_submission.csv`: Template with columns `[id, Tm]`
- `id`: Unique identifier
- `SMILES`: Molecular string representation
- `Group 1..N`: Descriptor features
- `Tm`: Melting point in Kelvin (train only)
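A quick way to load and sanity-check the files (paths assumed relative to the repo root):

```python
import pandas as pd

train = pd.read_csv("train.csv")                    # SMILES + descriptor groups + Tm
test = pd.read_csv("test.csv")                      # same features, no Tm
submission = pd.read_csv("sample_submission.csv")   # columns: id, Tm

print(train.shape, test.shape)  # expected: 2,662 train rows and 666 test rows
```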
Extensive molecular feature extraction using RDKit:
- Basic Descriptors: MolWt, LogP, TPSA, HBond donors/acceptors, Rotatable bonds
- Ring Features: Aromatic/Aliphatic/Saturated ring counts, Ring density
- Charge Features: Gasteiger charges (mean, std, max, min)
- Fragment Counts: Benzene, phenol, ester, ether, aldehyde, ketone, etc.
- Morgan Fingerprints: 1024-bit with radius 3
- MACCS Keys: 167-bit structural keys
- Interaction Features: HBond capacity, flexibility, polarity index, etc.
Total: ~1,233 features after processing
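A minimal sketch of this featurization step, assuming standard RDKit APIs; the `featurize` helper and the exact descriptor subset shown here are illustrative, not the notebook's verbatim code:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, MACCSkeys

def featurize(smiles: str):
    """Map one SMILES string to a flat numeric feature vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparseable SMILES
    basic = [
        Descriptors.MolWt(mol),             # molecular weight
        Descriptors.MolLogP(mol),           # LogP
        Descriptors.TPSA(mol),              # topological polar surface area
        Descriptors.NumHDonors(mol),        # H-bond donors
        Descriptors.NumHAcceptors(mol),     # H-bond acceptors
        Descriptors.NumRotatableBonds(mol), # flexibility
    ]
    morgan = list(AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=1024))
    maccs = list(MACCSkeys.GenMACCSKeys(mol))  # 167-bit structural keys
    return np.array(basic + morgan + maccs, dtype=float)
```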
Ensemble of three gradient boosting models with optimized hyperparameters:
| Model | n_estimators | learning_rate | max_depth |
|---|---|---|---|
| LightGBM | 2000 | 0.02 | 10 |
| XGBoost | 1500 | 0.03 | 8 |
| CatBoost | 1500 | 0.03 | 8 |
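A sketch of how the three regressors could be instantiated with the table's hyperparameters; the seeds and any unlisted settings are assumptions, not the notebook's exact config:

```python
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

models = {
    "lgbm": LGBMRegressor(n_estimators=2000, learning_rate=0.02, max_depth=10,
                          random_state=42),
    "xgb": XGBRegressor(n_estimators=1500, learning_rate=0.03, max_depth=8,
                        random_state=42),
    "cat": CatBoostRegressor(n_estimators=1500, learning_rate=0.03, depth=8,
                             random_seed=42, verbose=0),
}
```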
- 5-Fold Cross Validation
- Ensemble weight optimization with `scipy.optimize` (see the sketch after this list)
- Final weights: ~35% LightGBM, ~32% XGBoost, ~33% CatBoost
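A hedged sketch of the weight search: minimize out-of-fold MAE over convex combinations of the three models' predictions. The array names (`oof_preds`, `y_true`) and the SLSQP setup are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import mean_absolute_error

def best_weights(oof_preds, y_true):
    """oof_preds: (n_samples, n_models) out-of-fold predictions from the CV loop."""
    n_models = oof_preds.shape[1]

    def ensemble_mae(w):
        return mean_absolute_error(y_true, oof_preds @ w)

    # Constrain weights to a convex combination: non-negative, summing to 1
    constraints = {"type": "eq", "fun": lambda w: w.sum() - 1.0}
    bounds = [(0.0, 1.0)] * n_models
    x0 = np.full(n_models, 1.0 / n_models)  # start from equal weights
    result = minimize(ensemble_mae, x0, method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return result.x
```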
| Model | CV MAE |
|---|---|
| LightGBM | 28.70 |
| XGBoost | 28.57 |
| CatBoost | 28.75 |
| Ensemble | 28.16 |
Install the dependencies:

```bash
pip install rdkit pandas numpy scikit-learn lightgbm xgboost catboost optuna scipy joblib tqdm matplotlib seaborn
```

```bash
# Clone the repository
git clone https://github.com/adityapawar327/melting-point-prediction.git
cd melting-point-prediction

# Run the notebook
jupyter notebook ensemble-ml-for-melting-point-prediction.ipynb
```

Repository layout:

```
melting-point-prediction/
├── README.md
├── LICENSE
├── .gitignore
├── ensemble-ml-for-melting-point-prediction.ipynb   # Main solution notebook
└── submission.csv                                   # Final predictions
```
- `rdkit` - Molecular descriptor calculation
- `pandas`, `numpy` - Data manipulation
- `scikit-learn` - ML utilities
- `lightgbm` - LightGBM model
- `xgboost` - XGBoost model
- `catboost` - CatBoost model
- `optuna` - Hyperparameter optimization
- `scipy` - Weight optimization
- `matplotlib`, `seaborn` - Visualization
- SMILES Canonicalization: Standardizing molecular representations (sketched after this list)
- Comprehensive Feature Engineering: 1200+ molecular features
- Ensemble Learning: Combining multiple gradient boosting models
- Optimal Weight Finding: Using scipy optimization for ensemble weights
- 5-Fold Cross Validation: Robust model evaluation
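A minimal sketch of the canonicalization step with RDKit (the notebook's exact preprocessing may differ):

```python
from rdkit import Chem

def canonicalize(smiles: str):
    """Return the canonical SMILES, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

# Two spellings of benzene map to the same canonical form
assert canonicalize("C1=CC=CC=C1") == canonicalize("c1ccccc1")
```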
Kaggle: Thermophysical Property - Melting Point
This project is licensed under the MIT License - see the LICENSE file for details.
Aditya Pawar
- GitHub: @adityapawar327
- Kaggle: Profile
- Kaggle for hosting the competition
- RDKit developers for the excellent cheminformatics library
- The machine learning community for open-source implementations
⭐ If you found this helpful, please star the repository!