## Table of Contents

- Project Overview
- Project Structure
- Installation
- Makefile Commands
- Modeling Details
- MLflow Tracking
- Notebooks
- Notes
- Reproducibility
- Authors & Contacts
- License
## Project Overview

This repository contains the full data science pipeline for preprocessing, modeling, evaluating, and explaining clinical outcomes related to laser circumcision procedures. It focuses specifically on predicting the `Bleeding_Edema_Outcome` complication using multiple supervised learning approaches. The workflow includes data cleaning, feature engineering, model training with different sampling strategies, evaluation, and SHAP-based explainability.
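The train-then-evaluate loop described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual code: the real pipeline lives in `modeling/train.py` and reads the processed parquet features.

```python
# Minimal sketch of the train/evaluate loop (illustration only; the real
# pipeline in modeling/train.py reads the processed parquet features).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Stand-in for X.parquet / y_Bleeding_Edema_Outcome.parquet:
# an imbalanced binary-classification problem.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.85, 0.15], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]          # probability of complication
ap = average_precision_score(y_te, scores)        # the project's chosen metric
print(f"average_precision: {ap:.3f}")
```

Average precision is used (rather than accuracy) because the complication outcome is a minority class, where accuracy is easy to inflate by always predicting "no complication".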
## Project Structure

```
circ_milan/
├── assets/                  # Slide decks and static visuals
│   ├── CUT_MD.svg
│   └── my_slides.html
├── data/                    # Datasets at different stages
│   ├── external/            # Original source files
│   ├── raw/                 # Raw ingested data
│   │   └── Laser_Circumcision_Excel_31.03.2024.xlsx
│   ├── interim/             # Intermediate cleaned files
│   └── processed/           # Final data for modeling
│       ├── training/        # Training features and labels
│       │   ├── X.parquet
│       │   └── y_Bleeding_Edema_Outcome.parquet
│       └── inference/       # Inference features and outputs
│           ├── df_inference_process.parquet
│           └── X.parquet
├── images/                  # Exported plots and figures
│   └── figures/
├── mlruns/                  # MLflow tracking server backend logs
├── preprocessing/           # Data cleaning & feature engineering
│   ├── __init__.py
│   ├── preprocessing.py     # Cleans raw data and saves interim/processed
│   └── feat_gen.py          # Generates model-ready feature sets
├── modeling/                # Modeling & explainability scripts
│   ├── __init__.py
│   ├── train.py             # Train LR, RF, SVM with sampling pipelines
│   ├── evaluation.py        # Evaluate model performance
│   ├── explainer.py         # Select best model & build SHAP explainer
│   ├── explanations_training.py   # Compute SHAP values on training data
│   ├── explanations_inference.py  # Compute SHAP values on inference data
│   └── predict.py           # Run production predictions
├── models/                  # Stored model artifacts & metrics
│   ├── results/             # Logs & metrics per outcome
│   │   └── Bleeding_Edema_Outcome/
│   └── eval/                # Evaluation reports per outcome
│       └── Bleeding_Edema_Outcome/
├── notebooks/               # Jupyter notebooks for analysis & reporting
│   ├── circ_milan_eda.ipynb
│   ├── circ_milan_model_artifacts_dash.ipynb
│   ├── circ_milan_model_results.ipynb
│   ├── circ_milan_model_explanations.ipynb
│   └── post_modeling_eda.ipynb
├── unittests/               # Unit tests for core modules
├── config.py                # Central configuration settings
├── constants.py             # Global constants
├── functions.py             # General helper functions
├── project_functions.py     # Project-specific utilities
├── requirements.txt         # Python dependencies
├── setup.py                 # Packaging/install script
├── Makefile                 # Automates setup, training, evaluation, inference
└── README.md                # Project overview and usage instructions
```
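The `interim` → `processed` split above can be illustrated with a toy cleaning and feature-generation step. Column names and rules here are hypothetical; the real logic lives in `preprocessing/preprocessing.py` and `preprocessing/feat_gen.py`.

```python
# Hypothetical illustration of the cleaning -> feature-generation split.
# Column names and cleaning rules are invented for this sketch; the real
# logic lives in preprocessing/preprocessing.py and preprocessing/feat_gen.py.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Interim stage: normalize column names, drop rows missing the label."""
    df = df.rename(columns=lambda c: c.strip().replace(" ", "_"))
    return df.dropna(subset=["Bleeding_Edema_Outcome"])

def make_features(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    """Processed stage: split into model-ready X and y."""
    y = df["Bleeding_Edema_Outcome"].astype(int)
    X = df.drop(columns=["Bleeding_Edema_Outcome"]).select_dtypes("number")
    return X, y

raw = pd.DataFrame({
    "Age ": [35, 42, 29],                       # note the stray space
    "Bleeding_Edema_Outcome": [0, 1, None],     # one unlabeled row
})
X, y = make_features(clean(raw))
print(X.shape, y.tolist())
```

In the actual pipeline the two stages write their outputs to `data/interim/` and `data/processed/` as parquet files rather than returning them in memory.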
## Installation

1. Clone the repo:

   ```bash
   git clone https://github.com/your-username/circ_milan.git
   cd circ_milan
   ```

2. Create an environment:

   - Conda:

     ```bash
     conda create -n conda_circ_311 python=3.11
     conda activate conda_circ_311
     ```

   - venv:

     ```bash
     python -m venv venv_circ_311
     source venv_circ_311/bin/activate
     ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Makefile Commands

| Command | Description |
|---|---|
| `make create_venv` | Create a virtual environment |
| `make requirements` | Install dependencies |
| `make preproc_pipeline` | Run preprocessing + feature generation for training |
| `make train_all_models` | Train LR, RF, and SVM models |
| `make eval_all_models` | Evaluate all trained models |
| `make preproc_train_eval` | Full pipeline: preprocessing → training → evaluation |
| `make model_explaining_training` | Run SHAP explainability on training data |
| `make preproc_pipeline_inf` | Run preprocessing + feature generation for inference |
| `make predict` | Run inference and output predictions |
| `make mlflow_ui` | Launch MLflow UI on port 5501 |

To list available commands:

```bash
make help
```

## Modeling Details

- Outcome: `Bleeding_Edema_Outcome`
- Sampling pipelines:
  - `orig` (original data)
  - `smote` (Synthetic Minority Oversampling)
  - `over` (random oversampling)
- Models:
  - Logistic Regression (`lr`)
  - Random Forest (`rf`)
  - Support Vector Machine (`svm`)
- Metric: `average_precision`
- Explainability: SHAP feature attributions via `explainer.py`
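As a rough sketch of what an `over` sampling pipeline does, here is a hand-rolled random-oversampling step feeding a Random Forest. This is an illustration only: the project's actual pipelines may use a library such as imbalanced-learn for `smote` and `over`, and the helper below is invented for this example.

```python
# Hand-rolled "over" (random oversampling) step, for illustration.
# The project's real sampling pipelines may use imbalanced-learn instead.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

def random_oversample(X, y, rng):
    """Duplicate rows of each class (with replacement) until classes balance."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_max, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Resample the TRAINING split only -- oversampling before the split would
# leak duplicated rows into the test set.
X_res, y_res = random_oversample(X_tr, y_tr, rng)
rf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
ap = average_precision_score(y_te, rf.predict_proba(X_te)[:, 1])
print(f"rf + over: average_precision = {ap:.3f}")
```

The same train-only rule applies to SMOTE, which synthesizes new minority samples by interpolating between minority-class neighbors instead of duplicating rows.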
## MLflow Tracking

All runs, parameters, and metrics are tracked with MLflow.

Launch the UI:

```bash
make mlflow_ui
```

## Notebooks

- `circ_milan_eda.ipynb` – Exploratory Data Analysis
- `circ_milan_model_results.ipynb` – Model performance visuals
- `circ_milan_model_explanations.ipynb` – SHAP visualizations
- `post_modeling_eda.ipynb` – Further diagnostics
## Notes

- SHAP outputs and model artifacts are stored in `data/processed/` and `models/`
- Inference predictions are saved to `./data/processed/inference/predictions_Bleeding_Edema_Outcome.csv`

## Reproducibility

Run the full pipeline with:

```bash
make preproc_train_eval
```

## Authors & Contacts

- Leonid Shpaner, M.S., Data Scientist | Adjunct Professor
- Giuseppe Saitta, M.D., Medical Consultant (data provider and clinical insights)

## License

This project is licensed under the MIT License. Research and educational use only; all rights reserved unless stated otherwise.