Heart disease remains one of the leading causes of mortality worldwide. Early detection can significantly improve patient outcomes by enabling timely diagnosis and preventive measures.
This project builds two complementary approaches to predict heart disease:
- Classical Machine Learning models (Logistic Regression, Random Forest, XGBoost) with explainability and calibration.
- Deep Learning model (Keras ANN) using TensorFlow with data balancing (SMOTE) and EarlyStopping.
The combination of ML and DL allows us to compare interpretability vs. flexibility in predictive modeling.
- Dataset Preprocessing with `Pipeline` and `ColumnTransformer`
- Imbalanced Data Handling (Stratified split, CV evaluation)
- Model Training & Comparison: Logistic Regression, Random Forest, and optional XGBoost
- Robust Evaluation: ROC-AUC, F1-score, Precision-Recall curves
- Explainability: SHAP feature importance and local explanations
- Model Calibration for reliable probability predictions
- Artifacts Saved: trained pipeline + test predictions
- StandardScaler preprocessing
- SMOTE applied for class imbalance
- Model Architecture: Multi-layer Perceptron (Dense layers with ReLU + Sigmoid)
- Initialization: He-uniform for better convergence
- Regularization: EarlyStopping to avoid overfitting
- Evaluation: Confusion Matrix, Classification Report, ROC Curve, AUC
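The deep learning bullets above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the layer sizes, the `patience` value, the 13 input columns, and the synthetic data are all assumptions chosen to keep the example self-contained.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_ann(n_features, hidden_units=(16, 8)):
    """Small MLP: ReLU hidden layers with He-uniform init, sigmoid output.

    The hidden layer sizes are illustrative assumptions.
    """
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for units in hidden_units:
        model.add(
            layers.Dense(units, activation="relu", kernel_initializer="he_uniform")
        )
    # A single sigmoid unit outputs the probability of the positive class.
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# Stop training once validation loss stops improving and keep the best weights.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)

# Synthetic stand-in for the scaled feature matrix (assumed 13 input columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

model = build_ann(n_features=X.shape[1])
history = model.fit(
    X, y,
    validation_split=0.2,
    epochs=20,
    batch_size=16,
    callbacks=[early_stop],
    verbose=0,
)
proba = model.predict(X, verbose=0).ravel()
```

With real data, the scaler should be fit on the training split only, so that validation and test rows do not leak into the preprocessing statistics.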
graph TD
A[Data] --> B[Preprocessing]
subgraph "Classical ML"
B --> C1[Logistic Regression]
B --> C2[Random Forest]
B --> C3[XGBoost]
end
subgraph "Deep Learning - Keras"
B --> D["ANN (Dense Layers)"]
end
C1 --> E[Evaluation]
C2 --> E[Evaluation]
C3 --> E[Evaluation]
D --> E[Evaluation]
E --> F["Explainability (ML only)"]
E --> G[Deployment Artifacts]
- Python (NumPy, Pandas, Matplotlib, Seaborn)
- Scikit-learn (Pipelines, Logistic Regression, Random Forest)
- TensorFlow / Keras (ANN model)
- Imbalanced-learn (SMOTE for handling imbalance)
- Joblib (Model saving)
- SHAP (Interpretability, optional)
- Jupyter Notebook (Experiments & documentation)
- Source: UCI Heart Disease Dataset
- Size: 303 samples × 14 features
- Target: `1` = Disease, `0` = No Disease
Preprocessing:
- Median imputation for numeric features (ML pipeline)
- Most frequent imputation for categorical (ML pipeline)
- Standard scaling (numeric)
- One-hot encoding (categorical, ML pipeline)
- SMOTE oversampling (DL pipeline)
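As a rough sketch of how the ML preprocessing steps listed above might be wired together with scikit-learn, consider the example below. The column names, the numeric/categorical split, and the synthetic data are hypothetical placeholders, not the notebook's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column split; adjust to the columns in your copy of the dataset.
NUMERIC_COLS = ["age", "chol", "thalach"]
CATEGORICAL_COLS = ["cp", "thal"]


def build_pipeline():
    """Median impute + scale numerics, mode impute + one-hot categoricals."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC_COLS),
        ("cat", categorical, CATEGORICAL_COLS),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


def make_synthetic_data(n=200, seed=0):
    """Synthetic stand-in data so the sketch runs without the real CSV."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame({
        "age": rng.integers(30, 80, n).astype(float),
        "chol": rng.normal(240, 40, n),
        "thalach": rng.normal(150, 20, n),
        "cp": rng.integers(0, 4, n),
        "thal": rng.integers(0, 3, n),
    })
    X.loc[::25, "chol"] = np.nan  # a few missing values to exercise imputation
    y = (X["age"] + rng.normal(0, 10, n) > 55).astype(int)
    return X, y


X, y = make_synthetic_data()

# Stratified split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = build_pipeline()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="roc_auc")

pipeline.fit(X_train, y_train)
test_proba = pipeline.predict_proba(X_test)[:, 1]
```

Because the imputers, scaler, and encoder live inside the pipeline, each cross-validation fold refits them on its own training portion, which avoids leaking statistics from the held-out fold.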
git clone <your-repo-url>
cd HeartDiseasePrediction
pip install -r requirements.txt
Dependencies:
- numpy
- pandas
- scikit-learn
- matplotlib
- seaborn
- shap (optional)
- joblib
- tensorflow
- imbalanced-learn
jupyter notebook HeartDiseasePrediction.ipynb  # Classical ML models
jupyter notebook Heart_Disease.ipynb           # Deep Learning (Keras ANN)
Example Inference with ML Saved Model:
import joblib
model = joblib.load("artifacts/best_pipeline_logreg.joblib")
pred = model.predict(new_data)  # new_data must be in the preprocessed format

| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 86.9% | 85.7% | 90.9% | 88.2% | 0.91 |
| Random Forest | (evaluated, not selected) | - | - | - | - |
| XGBoost (optional) | (evaluated, not selected) | - | - | - | - |

| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| ANN (Keras) | 85.3% | 85.0% | 86.4% | 85.7% | 0.91 |
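As a quick sanity check on the tables above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the reported values. The helper name `f1_score_from` is only an illustration. The recomputed scores round to the reported 88.2% and 85.7%.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the results tables.
logreg_f1 = f1_score_from(0.857, 0.909)
ann_f1 = f1_score_from(0.850, 0.864)

print(f"Logistic Regression F1: {logreg_f1:.3f}")
print(f"ANN (Keras) F1: {ann_f1:.3f}")
```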
Visualizations:
- Class balance
- Confusion Matrix (ML & DL)
- ROC Curves
- Precision-Recall Curves
- Calibration Plot (ML only)
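For the calibration plot, one possible way to produce the underlying numbers with scikit-learn is sketched below. The choice of estimator, the `sigmoid` method, the bin count, and the synthetic data are assumptions for illustration and may differ from what the notebook does.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the real dataset.
X, y = make_classification(n_samples=600, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Wrap a classifier so its scores are mapped to calibrated probabilities
# using internal 5-fold cross-validation on the training data.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="sigmoid",
    cv=5,
)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)[:, 1]

# Points for the calibration plot: observed positive rate per bin
# against the mean predicted probability in that bin.
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=5)

# Brier score: mean squared error of the probabilities (lower is better).
brier = brier_score_loss(y_test, proba)
```

Plotting `mean_pred` on the x-axis against `frac_pos` on the y-axis, alongside the diagonal, gives the usual reliability diagram. Points near the diagonal indicate well-calibrated probabilities.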
- Assist healthcare professionals in early screening of heart disease.
- Can be integrated into clinical decision support systems.
- Demonstrates the tradeoff between explainability (ML) and flexibility (DL).
- Small dataset size (303 samples).
- External validation needed on real-world clinical data.
- Deep learning model can benefit from more data + hyperparameter tuning.
- Explore 1D CNNs or hybrid ensemble models.
- Potential for mobile deployment & real-time monitoring.
HeartDiseasePrediction/
├── HeartDiseasePrediction.ipynb   # Classical ML models
├── Heart_Disease.ipynb            # Deep Learning (Keras ANN)
├── README.md
└── LICENSE
Pull requests are welcome! For major changes, please open an issue first.
MIT License