SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis
Official PyTorch implementation of SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis
N. A. Adarsh Pritam, Jeba Shiney O, Sanyam Jain
[Paper] | [Code]
- First Systematic Benchmark: Controlled evaluation of preprocessing complexity (basic geometric vs. advanced artifact removal) across GANs (StyleGAN2-ADA) and Diffusion Models (DDPM) for dermoscopic melanoma synthesis.
- Architecture > Preprocessing: Demonstrates that generative model choice has stronger impact than preprocessing complexity on both image fidelity and diagnostic utility.
- StyleGAN2-ADA Superiority: Achieves the lowest FID (≈65.5) and KID (≈0.05) with better class anchoring, while diffusion models produce higher sample variance at the cost of perceptual fidelity.
- Significant Clinical Impact: Synthetic augmentation delivers 8-15% absolute melanoma F1-score improvements, with ViT-B/16 reaching F1 ≈ 0.88 and ROC-AUC ≈ 0.98 (≈14% improvement over baselines).
- Reproducible Framework: Unified assessment combining generative metrics (FID, IS, KID), downstream performance across five architectures (CNNs and transformers), and interpretability analysis via Grad-CAM on 14,116 dermoscopic images.
Overall experimental design showing dual preprocessing pipelines, generative model training, synthetic data augmentation, and downstream classifier evaluation for melanoma diagnosis.
Table: Overview of curated dermatology dataset used in our study. The dataset combines ISIC 2025 (MLK10k) and HAM10000 sources.
| Class | Abbr. | Images | Percentage |
|---|---|---|---|
| Nevus | NV | 7,424 | 52.60% |
| Basal Cell Carcinoma | BCC | 3,026 | 21.43% |
| Benign Keratosis-like | BKL | 1,637 | 11.60% |
| Melanoma | MEL | 1,563 | 11.03% |
| Squamous Cell Carcinoma | SCC | 466 | 3.34% |
| Total | | 14,116 | 100.00% |
Figure: General framework of SkinGenBench showing the two preprocessing pipelines (Basic and Advanced), generative model training (StyleGAN2-ADA and DDPM), and evaluation through image quality metrics and downstream classification tasks.
Image Subset Nomenclature:
| Source | Basic Preprocessing (BS) | Advanced Preprocessing (AD) |
|---|---|---|
| Ground Truth | BS_GT | AD_GT |
| StyleGAN2-ADA | BS_GN | AD_GN |
| DDPM | BS_DF | AD_DF |
| Ground-Truth Aug. | BS_GTA | AD_GTA |
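For bookkeeping in experiment scripts, the eight subset tags above can be generated programmatically. A minimal sketch; the dictionary keys and helper name are illustrative, not part of the released code — only the tag format (e.g. `BS_GN`) comes from the table:

```python
# Hypothetical helper mirroring the subset nomenclature table.
SOURCES = {
    "ground_truth": "GT",
    "stylegan2_ada": "GN",
    "ddpm": "DF",
    "ground_truth_aug": "GTA",
}

PIPELINES = {"basic": "BS", "advanced": "AD"}

def subset_tag(source: str, pipeline: str) -> str:
    """Build a subset tag such as 'BS_GN' from a source and pipeline name."""
    return f"{PIPELINES[pipeline]}_{SOURCES[source]}"

print(subset_tag("stylegan2_ada", "basic"))   # -> BS_GN
print(subset_tag("ddpm", "advanced"))         # -> AD_DF
```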
Alliance University, Bangalore & Østfold University College, Norway
- Clone the repository:

```shell
git clone https://github.com/adarsh-crafts/SkinGenBench.git
cd SkinGenBench
```

- Create a virtual environment and install the dependencies:

```shell
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

- Install PyTorch following the instructions on the PyTorch official site.
Contents of `requirements.txt`:

```
torch>=2.0.0
torchvision>=0.15.0
numpy
opencv-python
matplotlib
scikit-learn
scipy
tqdm
h5py
pandas
Pillow
```
The pretrained checkpoints used to initialize fine-tuning of StyleGAN2-ADA and DDPM are listed in the table below:
| Model | Configuration | File |
|---|---|---|
| StyleGAN2-ADA | FFHQ 256×256 pretrained | NVIDIA CDN |
| DDPM | CelebA-HQ 256×256 pretrained | Hugging Face |
Train StyleGAN2-ADA, DDPM, and the classifiers using the provided configurations in each nested directory.
Training Details:
| Configuration | Minimum | Maximum |
|---|---|---|
| GPU | NVIDIA RTX 4060 8 GB × 1 | NVIDIA L4 22 GB × 1 |
| RAM | 8 GB | 22 GB |
| Input Resolution | 256×256×3 | 256×256×3 |
Figure: t-SNE embeddings showing ground truth (GT), StyleGAN2-ADA (GN), and DDPM (DF) distributions for basic (left) and advanced (right) preprocessing pipelines.
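A t-SNE projection like the one above can be reproduced with scikit-learn once per-image features have been extracted. A minimal sketch using random stand-in features; the arrays, subset sizes, and dimensionality are placeholders, not the paper's actual Inception features:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for extracted image features; real runs would use the
# GT / GN / DF subsets' Inception activations instead.
features = {
    "GT": rng.normal(0.0, 1.0, size=(100, 64)),
    "GN": rng.normal(0.5, 1.0, size=(100, 64)),
    "DF": rng.normal(1.0, 1.0, size=(100, 64)),
}

stacked = np.vstack(list(features.values()))
labels = np.repeat(list(features.keys()), [len(v) for v in features.values()])

# perplexity must be smaller than the number of samples
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(stacked)
print(embedding.shape)  # (300, 2)
```

The 2-D `embedding` rows, colored by `labels`, give the scatter plots shown in the figure.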
Fréchet Inception Distance (FID) - Lower is better
| Model | Basic Pipeline (BS) | Advanced Pipeline (AD) |
|---|---|---|
| StyleGAN2-ADA (BS_GN / AD_GN) | 79.36 | 65.47 |
| DDPM (BS_DF / AD_DF) | 83.04 | 90.22 |
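FID fits a Gaussian to the Inception features of each image set and measures the Fréchet distance between the two Gaussians. A minimal numpy/scipy sketch of the distance itself, with random stand-in features in place of Inception activations:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID between two Gaussians fitted to feature sets:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*sqrt(S1 @ S2))."""
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerics
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))        # stand-in for real-image features
fake = rng.normal(size=(500, 8)) + 0.5  # stand-in for generated features

fid = frechet_distance(real.mean(0), np.cov(real, rowvar=False),
                       fake.mean(0), np.cov(fake, rowvar=False))
print(round(fid, 3))
```

Identical distributions give a distance of 0; the table's values come from 2048-d Inception-v3 pool features rather than these toy arrays.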
Kernel Inception Distance (KID) - Lower is better
| Model | Basic Pipeline (BS) | Advanced Pipeline (AD) |
|---|---|---|
| StyleGAN2-ADA (BS_GN / AD_GN) | 0.0664 | 0.0546 |
| DDPM (BS_DF / AD_DF) | 0.0684 | 0.0772 |
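KID is the unbiased squared maximum mean discrepancy (MMD) between the two feature sets under a cubic polynomial kernel. A self-contained sketch on stand-in features; real evaluations would use Inception activations:

```python
import numpy as np

def polynomial_kernel(x, y):
    """Cubic polynomial kernel (x.y/d + 1)^3, the standard KID kernel."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid(x, y):
    """Unbiased squared MMD estimate between feature sets x and y."""
    m, n = len(x), len(y)
    k_xx = polynomial_kernel(x, x)
    k_yy = polynomial_kernel(y, y)
    k_xy = polynomial_kernel(x, y)
    # exclude the diagonal for the unbiased within-set terms
    sum_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    sum_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(sum_xx + sum_yy - 2.0 * k_xy.mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))
fake = rng.normal(size=(200, 16)) + 0.3
print(round(kid(real, fake), 4))
```

In practice the estimate is averaged over several random feature subsets, which is why KID also comes with a standard deviation.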
Inception Score (IS) - Higher is better
| Model | Basic Pipeline (BS) | Advanced Pipeline (AD) |
|---|---|---|
| StyleGAN2-ADA (BS_GN / AD_GN) | 3.22 | 2.77 |
| DDPM (BS_DF / AD_DF) | 2.50 | 2.45 |
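IS exponentiates the average KL divergence between per-image class posteriors p(y|x) and the marginal p(y), rewarding predictions that are individually confident yet collectively diverse. A small sketch on synthetic probability rows (in real use the rows are Inception-v3 softmax outputs):

```python
import numpy as np

def inception_score(probs):
    """IS from class-probability rows p(y|x): exp(mean_x KL(p(y|x) || p(y)))."""
    p_y = probs.mean(axis=0, keepdims=True)
    kl = (probs * (np.log(probs) - np.log(p_y))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident, class-balanced predictions -> high IS; uniform rows -> IS = 1
confident = np.eye(5)[np.tile(np.arange(5), 20)] * 0.96 + 0.01
confident /= confident.sum(axis=1, keepdims=True)
uniform = np.full((100, 5), 0.2)
print(round(inception_score(confident), 2), inception_score(uniform))
```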
Best Results: Pipeline A2 (Basic Preprocessing + StyleGAN2-ADA Augmentation)
| Model | Macro-F1 | Balanced Acc | MCC | ROC-AUC | Accuracy | Brier Score ↓ |
|---|---|---|---|---|---|---|
| ViT-B/16 | 0.8393 | 0.8348 | 0.8515 | 0.9822 | 0.8985 | 0.0302 |
| ResNet-50 | 0.8393 | 0.8433 | 0.8525 | 0.9802 | 0.8989 | 0.0314 |
| VGG-16 | 0.8181 | 0.8167 | 0.8243 | 0.9774 | 0.8797 | 0.0365 |
| EfficientNet-B0 | 0.7977 | 0.7871 | 0.8033 | 0.9698 | 0.8657 | 0.0402 |
| ResNet-18 | 0.7525 | 0.7424 | 0.7594 | 0.9611 | 0.8360 | 0.0479 |
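The table's multi-class metrics map directly onto scikit-learn functions. A toy sketch with simulated 5-class labels; the 80%-faithful noise model is illustrative only, not the paper's evaluation protocol:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, matthews_corrcoef

rng = np.random.default_rng(0)
# simulated 5-class labels (0=NV .. 4=SCC); predictions are mostly correct
y_true = rng.integers(0, 5, size=500)
y_pred = np.where(rng.random(500) < 0.8, y_true, rng.integers(0, 5, size=500))

print("macro-F1:", round(f1_score(y_true, y_pred, average="macro"), 4))
print("balanced acc:", round(balanced_accuracy_score(y_true, y_pred), 4))
print("MCC:", round(matthews_corrcoef(y_true, y_pred), 4))
```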
Best Results: Pipeline A2 (Basic Preprocessing + StyleGAN2-ADA Augmentation)
| Model | MEL F1 | Sensitivity | Specificity | Precision | ROC-AUC | PR-AUC | DOR |
|---|---|---|---|---|---|---|---|
| ViT-B/16 | 0.8831 | 0.8564 | 0.9798 | 0.9115 | 0.9802 | 0.9511 | 288.94 |
| ResNet-50 | 0.8663 | 0.8401 | 0.9758 | 0.8941 | 0.9787 | 0.9445 | 211.93 |
| VGG-16 | 0.8438 | 0.8108 | 0.9730 | 0.8796 | 0.9729 | 0.9228 | 154.56 |
| EfficientNet-B0 | 0.8126 | 0.7781 | 0.9667 | 0.8503 | 0.9633 | 0.9043 | 101.76 |
| ResNet-18 | 0.7724 | 0.7390 | 0.9576 | 0.8089 | 0.9542 | 0.8774 | 63.88 |
- MEL F1-score gains: +8–15% across all architectures
- ViT-B/16: MEL F1 improved from 0.7401 → 0.8831 (+14.3 percentage points)
- ResNet-50: MEL F1 improved from 0.7362 → 0.8663 (+13.0 percentage points)
- All models achieved ROC-AUC > 0.96 for melanoma detection
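The binary melanoma metrics above, including the diagnostic odds ratio (DOR), follow directly from MEL-vs-rest confusion-matrix counts. A small sketch; the example counts are illustrative, not taken from the paper:

```python
def melanoma_metrics(tp, fn, fp, tn):
    """Binary (MEL vs. rest) metrics from confusion-matrix counts,
    with the diagnostic odds ratio DOR = (TP*TN) / (FP*FN)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    dor = (tp * tn) / (fp * fn)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1, "dor": dor}

# Illustrative counts only
print(melanoma_metrics(tp=335, fn=56, fp=33, tn=1600))
```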
| Pipeline | Description | Best Use Case |
|---|---|---|
| A2 | Basic preprocessing + StyleGAN2-ADA | Recommended: Best overall performance |
| A3 | Basic preprocessing + DDPM | Good diversity, lower fidelity |
| B2 | Advanced preprocessing + StyleGAN2-ADA | Marginal gains over A2 |
| B3 | Advanced preprocessing + DDPM | Lowest performance |
| A4/B4 | Standard augmentation only (no synthetic) | Baseline comparison |
Key Finding: Generative architecture choice (GAN vs Diffusion) has a stronger influence on diagnostic performance than preprocessing complexity (Basic vs Advanced).
Figure: Grad-CAM visualizations comparing ResNet-50 and ViT-B/16 across different preprocessing pipelines and generative models. ResNet-50 produces compact, lesion-aligned saliency maps, while ViT-B/16 shows broader attention patterns. Synthetic samples exhibit more irregular activations, with AD_DF showing the smoothest, most anatomically coherent results.
If you find this work useful, please cite our paper:
```bibtex
@misc{pritam2025skingenbenchgenerativemodelpreprocessing,
  title={SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis},
  author={N. A. Adarsh Pritam and Jeba Shiney O and Sanyam Jain},
  year={2025},
  eprint={2512.17585},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2512.17585},
}
```