DownScaleXR — Downsampling-Induced Bias Under CPU Constraints in Chest X-ray CNNs

Overview

DownScaleXR is a controlled architectural study that isolates the effect of early spatial downsampling operators on:

  • generalization under noisy supervision
  • decision bias (false positives vs false negatives)
  • CPU inference latency and stability

The study uses intentionally simple CNNs to prevent representation capacity from masking architectural behavior.

This is not a model performance exercise. This is a mechanistic investigation of inductive bias under real deployment constraints.


Motivation

This project was driven by three practical realities:

  • CPU-only deployment
    Many clinical and edge environments cannot rely on GPUs.

  • Noisy, limited data
    Medical datasets amplify architectural bias.

  • Architecture literacy gap
    Pooling and strided convolutions are often treated as interchangeable — they are not.

Core Question

How does spatial compression itself shape decision boundaries under limited supervision and CPU constraints?


Architectural Scope & Design Choices

To keep the study interpretable and controlled:

  • A LeNet-style CNN was used to minimize confounding factors.
  • Modern architectures (ResNet, MobileNet, EfficientNet) were intentionally avoided: their skip connections, depthwise convolutions, and compound scaling would dilute the observable effect of early downsampling.

This project isolates downsampling behavior — not representational capacity.


Downsampling Strategies Studied

All variants share identical depth, width, and parameter count (~11M); an illustrative sketch of the three operators follows the descriptions below.

AvgPool

  • Smooths spatial activations
  • Acts as an implicit regularizer
  • Produces conservative decision boundaries

MaxPool

  • Amplifies high-activation regions
  • Improves recall but increases false positives
  • Prone to pathology over-prediction

Strided Convolutions

  • Learnable downsampling
  • Under limited data, collapses to MaxPool-like behavior
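
To make the comparison concrete, the sketch below shows how the three operators drop into an otherwise identical LeNet-style stem. This is an illustrative PyTorch sketch only: the channel widths, kernel sizes, and class names are assumptions and do not reproduce the ~11M-parameter variants trained in the experiments.

import torch
import torch.nn as nn

def make_downsample(kind: str, channels: int) -> nn.Module:
    """Return a 2x spatial downsampling operator of the requested kind."""
    if kind == "avgpool":
        return nn.AvgPool2d(kernel_size=2, stride=2)        # smooths activations
    if kind == "maxpool":
        return nn.MaxPool2d(kernel_size=2, stride=2)         # keeps peak activations
    if kind == "strided":
        return nn.Conv2d(channels, channels, kernel_size=3,
                         stride=2, padding=1)                 # learnable downsampling
    raise ValueError(f"unknown downsampling kind: {kind}")

class LeNetVariant(nn.Module):
    """LeNet-style stem in which only the downsampling operator differs."""
    def __init__(self, kind: str, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            make_downsample(kind, 32),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            make_downsample(kind, 64),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# e.g. model = LeNetVariant("strided")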

Objectives

  • Quantify performance vs bias trade-offs
  • Measure real CPU latency and throughput
  • Examine generalization gaps under noise
  • Track everything via MLflow + DagsHub

Project Structure

DownScaleXR/
├─ configs/ # YAML-driven experiment configuration
├─ data/ # Raw and preprocessed CXR data
├─ model/ # Best checkpoints per variant
├─ artifacts/ # Metrics, plots, and inference visualizations
├─ notebooks/ # MLflow analysis & comparison
├─ scripts/ # Entry points and preprocessing
├─ src/ # Core training, models, experiments
├─ requirements.txt
└─ README.md


Experiments

Three LeNet variants were trained on the same chest X-ray dataset with identical hyperparameters to evaluate how different downsampling strategies affect performance and efficiency.

| Model Name | Downsampling | Val AUC | Val F1 | Val Precision | Val Recall | Val Accuracy | Train Accuracy | Inference Time (ms) | Throughput (FPS) | Parameters | Model Size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| lenet_strided | strided | 0.895 | 0.820 | 0.697 | 0.997 | 0.727 | 0.989 | 50.77 | 608.65 | 11.4 M | 43.47 |
| lenet_avgpool | avgpool | 0.890 | 0.814 | 0.688 | 0.997 | 0.715 | 0.980 | 78.20 | 395.13 | 11.4 M | 43.47 |
| lenet_maxpool | maxpool | 0.854 | 0.837 | 0.723 | 0.992 | 0.757 | 0.996 | 168.70 | 183.17 | 11.4 M | 43.47 |

Observation:
All models have roughly the same parameter count and model size (~11M params, 43 MB). Differences arise primarily from downsampling strategy, impacting inference speed, throughput, and class-specific decision biases.


Confusion Matrices

The test set predictions highlight differences in model behavior based on downsampling strategy:

Confusion matrix plots for each variant (lenet_avgpool, lenet_maxpool, lenet_strided) are included in the repository's artifacts.

Insights:

  • AvgPool: Balanced errors; moderate false positives and false negatives. Conservative decision boundaries.
  • MaxPool: High recall for pneumonia but over-predicts pathology. Bias toward positive class.
  • Strided Conv: Behavior similar to MaxPool; collapses to same decision bias on limited data.
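
The positive-class bias described above can be read off each confusion matrix as a false positive rate versus a false negative rate. The snippet below illustrates the computation with hypothetical counts; these are not the project's actual test-set numbers.

import numpy as np

# Hypothetical confusion matrix: rows = true class, cols = predicted class,
# class 0 = normal, class 1 = pneumonia.
cm = np.array([[180, 70],    # 70 false positives (normal flagged as pneumonia)
               [  5, 345]])  # 5 false negatives (pneumonia missed)

tn, fp, fn, tp = cm.ravel()
fpr = fp / (fp + tn)   # how often normals are flagged as pathology
fnr = fn / (fn + tp)   # how often pneumonia is missed
print(f"FPR={fpr:.3f}  FNR={fnr:.3f}")  # a MaxPool-like profile: high FPR, low FNR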

Performance vs CPU Efficiency

Observations

  • FLOPs do not correlate with wall-clock latency on CPU.
  • Memory access patterns and operator behavior dominate runtime.
  • All models have the same parameter count and model size.
  • Observed differences arise solely from downsampling behavior.

Trade-offs

  • AvgPool → Best stability–accuracy balance
  • MaxPool → Highest F1, worst latency
  • Strided Conv → Fast throughput, unstable bias

CPU realism exposes architectural costs often hidden by theoretical efficiency metrics.
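
For reference, CPU latency and throughput figures like those in the table above are typically measured with a warm-up phase followed by timed forward passes under torch.no_grad(). The sketch below shows one such measurement loop; the batch size, input resolution, and model constructor are assumptions, not the project's exact benchmarking code.

import time
import torch

def benchmark_cpu(model: torch.nn.Module, batch_size: int = 32,
                  input_size: int = 224, warmup: int = 5, runs: int = 20):
    """Measure mean per-batch latency (ms) and throughput (images/s) on CPU."""
    model.eval()
    x = torch.randn(batch_size, 1, input_size, input_size)
    with torch.no_grad():
        for _ in range(warmup):              # warm-up to stabilize caches/threads
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = time.perf_counter() - start
    latency_ms = 1000 * elapsed / runs
    throughput = batch_size * runs / elapsed
    return latency_ms, throughput

# e.g. latency, fps = benchmark_cpu(LeNetVariant("avgpool"))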


Visual Analysis

Key comparative plots:

  • Accuracy vs Latency
  • Accuracy vs Model Size
  • Accuracy vs Throughput
  • CPU Efficiency Overview
  • Generalization Gap (Train vs Val)
  • Validation Performance Summary

(The corresponding plot images are included in the repository's artifacts/ directory.)


Architectural Conclusions

  1. Downsampling is a bias control mechanism, not a neutral operation
  2. Small datasets amplify pooling-induced decision bias
  3. Strided convolution is not inherently superior under limited data
  4. CPU deployment reshuffles architectural trade-offs

Inference Visualizations

  • Side-by-side comparisons (Confusion Matrix, ROC, P-R Curve) for lenet_avgpool, lenet_maxpool, and lenet_strided
  • Model comparison (Accuracy & F1 Score)


MLflow Tracking

All experiments were logged to DagsHub MLflow to ensure reproducibility, allow easy comparison, and facilitate structured analysis.

  • Tracking URI:
    https://dagshub.com/Y-R-A-V-R-5/DownScaleXR.mlflow

  • Experiment Name:
    DownScaleXR

  • Purpose:

    • Store all metrics (train/validation), parameters, and tags
    • Log artifacts such as plots and model checkpoints
    • Enable side-by-side comparisons of different downsampling strategies
  • Minimal Usage:

import mlflow

# Set tracking URI
mlflow.set_tracking_uri("https://dagshub.com/Y-R-A-V-R-5/DownScaleXR.mlflow")

# Select experiment
mlflow.set_experiment("DownScaleXR")
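
Continuing from the setup above, a run could log parameters, per-epoch metrics, and artifacts roughly as follows. The values and run name here are placeholders rather than the project's actual training code; the metric and path names mirror the Logged Artifacts and Logged Metrics sections below.

# Illustrative continuation of the minimal setup above (placeholder values)
with mlflow.start_run(run_name="lenet_avgpool"):
    mlflow.log_params({"downsampling": "avgpool", "lr": 1e-3, "batch_size": 32})

    for epoch in range(3):
        # In the real loop these values come from training/validation passes
        mlflow.log_metric("train_accuracy", 0.90 + 0.01 * epoch, step=epoch)
        mlflow.log_metric("val_accuracy", 0.70 + 0.01 * epoch, step=epoch)

    # Efficiency metrics and artifacts are logged once per run
    mlflow.log_metric("inference_time_ms", 78.2)
    mlflow.log_artifact("model/lenet_avgpool/best_model.pt")  # assumes the checkpoint exists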

Logged Artifacts

  • Model checkpoints: model/<variant>/best_model.pt
  • Plots: artifacts/comparision/*.png and artifacts/inference/*.png
  • Configuration files: configs/*.yaml

Logged Metrics

  • Performance: accuracy, precision, recall, f1_score, auc
  • Efficiency: inference_time_ms, throughput_fps, model_parameters, model_size_mb
  • Tracking: Metrics logged per epoch for both training and validation

Using MLflow + DagsHub ensures reproducibility, enables easy experiment comparisons, and provides structured logging of both performance and efficiency metrics.


What This Project Signals

This work demonstrates:

  • Constraint-first thinking
  • Architectural literacy beyond plug-and-play models
  • Ability to isolate variables and reason about bias
  • CPU-realistic performance evaluation
  • Reproducible, inspectable R&D workflow

About

Architecture-level study of how early CNN downsampling choices affect bias, generalization, and CPU inference behavior under constrained settings.
