DownScaleXR is a controlled architectural study that isolates the effect of early spatial downsampling operators on:
- generalization under noisy supervision
- decision bias (false positives vs false negatives)
- CPU inference latency and stability
The study uses intentionally simple CNNs to prevent representation capacity from masking architectural behavior.
This is not a model performance exercise. This is a mechanistic investigation of inductive bias under real deployment constraints.
This project was driven by three practical realities:
- CPU-only deployment: Many clinical and edge environments cannot rely on GPUs.
- Noisy, limited data: Medical datasets amplify architectural bias.
- Architecture literacy gap: Pooling and strided convolutions are often treated as interchangeable — they are not.
Core research question: How does spatial compression itself shape decision boundaries under limited supervision and CPU constraints?
To keep the study interpretable and controlled:
- A LeNet-style CNN was used to minimize confounding factors.
- Modern architectures (ResNet, MobileNet, EfficientNet) were intentionally avoided.
- Skip connections, depthwise convolutions, and compound scaling dilute the observable effect of early downsampling.
This project isolates downsampling behavior — not representational capacity.
All variants share identical depth, width, and parameter count (~11M).
AvgPool:
- Smooths spatial activations
- Acts as an implicit regularizer
- Produces conservative decision boundaries

MaxPool:
- Amplifies high-activation regions
- Improves recall but increases false positives
- Prone to pathology over-prediction

Strided convolution:
- Learnable downsampling
- Under limited data, collapses to MaxPool-like behavior
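The module definitions are not shown in this README; below is a minimal PyTorch sketch (PyTorch is assumed from the .pt checkpoints; the conv_block helper and 5x5 kernel are illustrative) of how the three variants can differ in a single downsampling block while keeping parameter parity:

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, kind: str) -> nn.Sequential:
    """One conv stage followed by 2x spatial downsampling.

    For "strided", the convolution itself downsamples (stride=2), so all
    three variants keep an identical parameter count.
    """
    stride = 2 if kind == "strided" else 1
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=stride, padding=2),
              nn.ReLU(inplace=True)]
    if kind == "avgpool":
        layers.append(nn.AvgPool2d(2))   # smooths activations (implicit regularizer)
    elif kind == "maxpool":
        layers.append(nn.MaxPool2d(2))   # keeps only peak activations
    elif kind != "strided":
        raise ValueError(f"unknown downsampling kind: {kind}")
    return nn.Sequential(*layers)

# Parameter parity check: all three variants report the same count.
for kind in ("avgpool", "maxpool", "strided"):
    block = conv_block(1, 6, kind)
    print(kind, sum(p.numel() for p in block.parameters()))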
Study objectives:
- Quantify performance vs bias trade-offs
- Measure real CPU latency and throughput
- Examine generalization gaps under noise
- Track everything via MLflow + DagsHub
```
DownScaleXR/
├─ configs/          # YAML-driven experiment configuration
├─ data/             # Raw and preprocessed CXR data
├─ model/            # Best checkpoints per variant
├─ artifacts/        # Metrics, plots, and inference visualizations
├─ notebooks/        # MLflow analysis & comparison
├─ scripts/          # Entry points and preprocessing
├─ src/              # Core training, models, experiments
├─ requirements.txt
└─ README.md
```
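The YAML schema in `configs/` is not shown in this README; as a hedged sketch, loading a hypothetical `configs/lenet_maxpool.yaml` might look like this (the file name and keys are assumptions):

```python
import yaml  # PyYAML

# Hypothetical file name and keys; the real schema lives in configs/.
with open("configs/lenet_maxpool.yaml") as f:
    cfg = yaml.safe_load(f)

downsampling = cfg["model"]["downsampling"]  # e.g. "maxpool", "avgpool", "strided"
learning_rate = cfg["train"]["lr"]
```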
Three LeNet variants were trained on the same chest X-ray dataset with identical hyperparameters to evaluate how different downsampling strategies affect performance and efficiency.
| Model Name | Downsampling | Val AUC | Val F1 | Val Precision | Val Recall | Val Accuracy | Train Accuracy | Inference Time (ms) | Throughput (FPS) | Parameters | Model Size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| lenet_strided | strided | 0.895 | 0.820 | 0.697 | 0.997 | 0.727 | 0.989 | 50.77 | 608.65 | 11.4 M | 43.47 |
| lenet_avgpool | avgpool | 0.890 | 0.814 | 0.688 | 0.997 | 0.715 | 0.980 | 78.20 | 395.13 | 11.4 M | 43.47 |
| lenet_maxpool | maxpool | 0.854 | 0.837 | 0.723 | 0.992 | 0.757 | 0.996 | 168.70 | 183.17 | 11.4 M | 43.47 |
Observation:
All models have roughly the same parameter count and model size (~11M params, 43 MB). Differences arise primarily from downsampling strategy, impacting inference speed, throughput, and class-specific decision biases.
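The validation metrics above are standard scikit-learn quantities; here is a minimal sketch of how they can be computed, assuming binary labels and sigmoid scores (this README does not show the evaluation code):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def classification_metrics(y_true: np.ndarray, y_score: np.ndarray,
                           threshold: float = 0.5) -> dict:
    """Compute the metrics reported in the comparison table."""
    y_pred = (y_score >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),  # uses raw scores, not thresholded
    }
```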
The test set predictions highlight differences in model behavior based on downsampling strategy:
| Model | Confusion Matrix |
|---|---|
| lenet_avgpool | ![]() |
| lenet_maxpool | ![]() |
| lenet_strided | ![]() |
Insights:
- AvgPool: Balanced errors; moderate false positives and false negatives. Conservative decision boundaries.
- MaxPool: High recall for pneumonia but over-predicts pathology. Bias toward positive class.
- Strided Conv: Behavior similar to MaxPool; collapses to the same decision bias on limited data.
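The confusion matrices themselves are not embedded in this copy; they can be regenerated along these lines with scikit-learn and matplotlib (the class labels and output path under `artifacts/inference/` are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

def save_confusion_matrix(y_true, y_pred, variant: str) -> None:
    """Plot and save a test-set confusion matrix for one model variant."""
    disp = ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred, display_labels=["normal", "pneumonia"])
    disp.ax_.set_title(f"lenet_{variant} test confusion matrix")
    plt.savefig(f"artifacts/inference/confusion_{variant}.png", bbox_inches="tight")
    plt.close()
```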
CPU latency findings:
- FLOPs do not correlate with wall-clock latency on CPU.
- Memory access patterns and operator behavior dominate runtime.
- All models have the same parameter count and model size.
- Observed differences arise solely from downsampling behavior.

Per-variant summary:
- AvgPool → Best stability–accuracy balance
- MaxPool → Highest F1, worst latency
- Strided Conv → Fastest throughput, unstable bias
CPU realism exposes architectural costs often hidden by theoretical efficiency metrics.
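The benchmarking code is not included in this README; here is a minimal sketch of how CPU latency and throughput might be measured for a checkpoint (batch size, warm-up count, and input shape are assumptions):

```python
import time
import torch

def benchmark_cpu(model: torch.nn.Module, batch: torch.Tensor,
                  warmup: int = 10, iters: int = 50) -> tuple[float, float]:
    """Return (mean latency per batch in ms, throughput in images/s)."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):                 # warm-up stabilizes threads/caches
            model(batch)
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        elapsed = time.perf_counter() - start
    latency_ms = elapsed / iters * 1000.0
    throughput_fps = batch.shape[0] * iters / elapsed
    return latency_ms, throughput_fps

# Hypothetical usage: a saved checkpoint and a 224x224 grayscale CXR batch.
# model = torch.load("model/lenet_strided/best_model.pt", map_location="cpu")
# print(benchmark_cpu(model, torch.randn(32, 1, 224, 224)))
```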
Key comparative plots are saved under `artifacts/comparision/`.

Key takeaways:
- Downsampling is a bias control mechanism, not a neutral operation
- Small datasets amplify pooling-induced decision bias
- Strided convolution is not inherently superior under limited data
- CPU deployment reshuffles architectural trade-offs
All experiments were logged to DagsHub MLflow to ensure reproducibility, allow easy comparison, and facilitate structured analysis.
- Tracking URI: https://dagshub.com/Y-R-A-V-R-5/DownScaleXR.mlflow
- Experiment Name: DownScaleXR
- Purpose:
  - Store all metrics (train/validation), parameters, and tags
  - Log artifacts such as plots and model checkpoints
  - Enable side-by-side comparisons of different downsampling strategies
- Minimal Usage:

```python
import mlflow

# Set tracking URI
mlflow.set_tracking_uri("https://dagshub.com/Y-R-A-V-R-5/DownScaleXR.mlflow")

# Select experiment
mlflow.set_experiment("DownScaleXR")
```

Logged artifacts:
- Model checkpoints: `model/<variant>/best_model.pt`
- Plots: `artifacts/comparision/*.png` and `artifacts/inference/*.png`
- Configuration files: `configs/*.yaml`
Logged metrics:
- Performance: `accuracy`, `precision`, `recall`, `f1_score`, `auc`
- Efficiency: `inference_time_ms`, `throughput_fps`, `model_parameters`, `model_size_mb`
- Tracking: metrics logged per epoch for both training and validation
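The training loop is not reproduced here; below is a minimal sketch of per-epoch logging against the same tracking server (run name, parameters, and placeholder metric values are illustrative; DagsHub typically requires authentication):

```python
import mlflow

mlflow.set_tracking_uri("https://dagshub.com/Y-R-A-V-R-5/DownScaleXR.mlflow")
mlflow.set_experiment("DownScaleXR")

with mlflow.start_run(run_name="lenet_maxpool"):
    mlflow.log_params({"downsampling": "maxpool", "lr": 1e-3})  # illustrative params
    for epoch in range(3):
        # In the real loop these values come from training/validation passes;
        # placeholder numbers keep the sketch runnable.
        mlflow.log_metric("train_accuracy", 0.90 + 0.02 * epoch, step=epoch)
        mlflow.log_metric("val_accuracy", 0.70 + 0.01 * epoch, step=epoch)
    mlflow.log_metric("inference_time_ms", 168.70)  # efficiency metrics logged once
```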
Using MLflow + DagsHub ensures reproducibility, enables easy experiment comparisons, and provides structured logging of both performance and efficiency metrics.
This work demonstrates:
- Constraint-first thinking
- Architectural literacy beyond plug-and-play models
- Ability to isolate variables and reason about bias
- CPU-realistic performance evaluation
- Reproducible, inspectable R&D workflow