SafetyLens is a research project investigating multi-dimensional content safety detection in conversational AI systems through a combined lens of model architecture and explainable AI.
Using the DICES-350 dataset of human–AI conversations, this project implements and evaluates three modeling paradigms:
- Logistic Regression (TF-IDF baseline) – F1: 0.422
- Single-Task RoBERTa (specialized binary classifier) – F1: 0.543
- Multi-Task RoBERTa (shared encoder with 2-head and 4-head variants) – F1: 0.469, 0.461
Beyond predictive performance, SafetyLens emphasizes how models reason about safety decisions. Each architecture is analyzed using three complementary explainability methods:
- LIME (Local Interpretable Model-agnostic Explanations)
- SHAP (SHapley Additive exPlanations)
- Integrated Gradients (gradient-based attribution for transformers)
Key Research Insight: While transformer-based models substantially outperform linear baselines, multi-task learning underperforms single-task specialization under severe class imbalance. Safety dimensions with fewer than 10% positive examples induce negative transfer, degrading performance on otherwise stronger tasks. However, multi-task models exhibit more semantically grounded and interpretable reasoning, highlighting a trade-off between performance and explanation quality.
Web Application: An interactive interface is provided for model comparison and real-time explainability visualization. See the Interactive Demo section.
| Model | F1 Score | Precision | Recall | Architecture |
|---|---|---|---|---|
| Logistic Regression | 0.422 | 0.419 | 0.425 | TF-IDF + sklearn |
| Single-Task RoBERTa | 0.543 | 0.531 | 0.556 | Binary classifier |
| Multi-Task 2-head | 0.469 | 0.458 | 0.481 | Overall + Harmful |
| Multi-Task 4-head | 0.461 | 0.449 | 0.473 | All dimensions |
The single-task transformer achieves the strongest overall performance, confirming the value of contextual representations for safety detection. Multi-task variants underperform despite weighted loss functions, suggesting limits to shared representations when auxiliary tasks are highly imbalanced.
RQ1: Can transformers outperform traditional baselines for multidimensional safety detection?
RQ2: Does multi-task learning improve performance across safety dimensions?
RQ3: How do models reason differently about high-confidence vs. uncertain predictions?
-
Class imbalance significantly impacts multi-task learning
- Q3_bias: ~9% unsafe examples
- Q6_policy: ~10–11% unsafe examples
- Tasks with <10% positives show 30–40% lower F1
- Adding these tasks causes negative transfer in shared encoders
-
Single-task models excel in focused classification
- +29% F1 improvement over logistic regression baseline
- Dedicated capacity enables sharper decision boundaries
- Particularly effective for moderately balanced tasks
-
Explainability reveals systematic reasoning differences
- Logistic regression relies heavily on explicit keywords
- Single-task transformers emphasize syntactic and structural cues
- Multi-task transformers balance semantic content and structure
- Q_overall: Overall safety assessment across all categories
- Q2_harmful: Harmful, offensive, or dangerous content
- Q3_bias: Bias, stereotypes, and discriminatory language
- Q6_policy: Platform policy violations and governance issues
Each dimension captures distinct linguistic and semantic phenomena, making the dataset a strong benchmark for studying architectural trade-offs in safety modeling.
DICES-350: Diverse Conversational AI Safety Examples
- 350 adversarial human–AI conversations
- Multi-dimensional safety annotations from a diverse rater pool
- Conversation-level train/validation/test splits to prevent leakage
- Severe class imbalance across safety dimensions
Citation: DICES-350
Run notebooks/eda.ipynb to process the raw dataset. This creates data/processed/dices_350_binary.csv with binary safety labels for four dimensions:
Q_overall_binary– Overall safetyQ2_harmful_binary– Harmful contentQ3_bias_binary– Bias and stereotypesQ6_policy_binary– Policy violations
Key preprocessing steps:
- Conversion from multi-class to binary labels
- Removal of ambiguous annotations
- Conversation-level data splitting
- Class distribution analysis and visualization
Logistic Regression Baseline
- TF-IDF vectorization (max 5,000 features)
- L2 regularization
- Class-weight balancing
Single-Task RoBERTa
- RoBERTa-base pretrained encoder
- Binary classification head
- Fine-tuned on Q_overall only
- Max sequence length: 512 tokens
Multi-Task RoBERTa
- Shared RoBERTa encoder
- Task-specific classification heads
- 2-head variant: Q_overall + Q2_harmful
- 4-head variant: all safety dimensions
- Weighted BCE loss to address imbalance
LIME
- Model-agnostic local explanations
- Perturbation-based surrogate modeling
- Applicable to all architectures
SHAP
- Game-theoretic feature attribution
- Exact explanations for linear models
- Unreliable for transformer architectures
Integrated Gradients
- Gradient-based token attribution
- Transformer-specific method
- Captures attribution concentration and diffuseness
SafetyLens/
├── app/ # Interactive web application
│ ├── backend/ # FastAPI backend
│ │ ├── api/ # Prediction & explanation endpoints
│ │ │ ├── explain.py
│ │ │ └── predict.py
│ │ ├── explainability/ # Backend explainability logic
│ │ │ ├── ig_explainer.py
│ │ │ ├── lime_explainer.py
│ │ │ └── shap_explainer.py
│ │ ├── models/ # Model loading utilities
│ │ │ └── loader.py
│ │ ├── utils/ # Backend utilities
│ │ │ └── download_models.py
│ │ ├── main.py # FastAPI entry point
│ │ └── requirements.txt
│ └── frontend/ # React + Vite frontend
│ ├── src/
│ │ ├── components/ # UI stages (input, model, explain, output)
│ │ └── pages/ # Home / About pages
│ ├── public/ # Static assets
│ │ ├── paper.pdf
│ │ └── poster.pdf
│ ├── index.html
│ ├── vite.config.js
│ └── package.json
│
│
├── data/
│ ├── raw/ # Original DICES-350 dataset
│ │ └── diverse_safety_adversarial_dialog_350.csv
│ └── processed/ # Processed binary dataset
│ ├── dices_350_binary.csv
│ └── processed_distribution.png
│
├── docs/ # Research documentation
│ ├── paper.pdf # Final paper
│ ├── poster.pdf # Conference poster
│ ├── experiments.md # Experimental design details
│ └── hypotheses.md # Research hypotheses
│
├── notebooks/ # Exploratory & prototyping notebooks
│ ├── eda.ipynb
│ ├── integrated_gradients.ipynb
│ ├── multitask_2_head_training.ipynb
│ └── multitask_4_head_training.ipynb
│
├── models/
│ ├── checkpoints/ # Trained model artifacts
│ │ ├── best_singletask.pt
│ │ ├── best_multitask_2.pt
│ │ ├── best_multitask_4.pt
│ │ ├── logistic_regression_model.pkl
│ │ └── tfidf_vectorizer.pkl
│ ├── logistic_regression.py
│ ├── single_task_transformer.py
│ └── multi_task_transformer.py
│
├── scripts/ # Reproducible experiment scripts
│ ├── models/
│ │ ├── train_logistic_regression.py
│ │ ├── train_singletask.py
│ │ └── train_multitask.py
│ ├── explainability/
│ │ ├── run_integrated_gradients.py
│ │ ├── run_lime.py
│ │ └── run_shap.py
│ └── evaluation/
│ ├── experiment1_2.py
│ └── experiment3.py
│
├── results/
│ ├── evaluation/ # Quantitative experiment outputs
│ │ ├── experiment1.csv
│ │ ├── experiment2.csv
│ │ └── experiment3/
│ ├── explainability/
│ │ ├── ig/ # Integrated Gradients outputs
│ │ ├── lime/ # LIME explanations
│ │ └── shap/ # SHAP outputs
│ ├── figures/ # Plots used in paper/poster
│ │ ├── experiment1/
│ │ ├── experiment2/
│ │ ├── experiment3/
│ │ └── lime/
│ └── predictions/ # Test-set predictions
│
├── src/
│ ├── data_preprocessing.py
│ ├── utils/paths.py # Shared utilities
│ └── visualizations.py
│
├── Dockerfile
├── docker-start.sh
├── docker-compose.yaml
├── render.yaml
├── requirements.txt
├── LICENSE
└── README.md
An interactive web application is provided for real-time model comparison and explainability visualization. This interface serves as a supplementary tool for exploring research results without running the full training pipeline.
Live Demo: https://atremante26.github.io/SafetyLens/
Choose the method that best fits your needs:
Simply visit the live demo - no setup required:
Run the complete application with all models preloaded. Choose between manual download (faster) or automatic download (easier):
Option 2A: Pre-download Models (Faster)
Download models once, reuse in future builds:
# Download pre-trained models (one-time setup)
mkdir -p models/checkpoints && cd models/checkpoints
gdown 1NSGLiM2M8l_h2N0m0DDKjVWsh2aQlnL4 # logreg (~5 MB)
gdown 1WHjq8UaTlRb2SqudGZCv5RiUQL42NQSB # vectorizer (~10 MB)
gdown 1DX2oY2zPX7DgH6_F2j6BxvL6CNAd1IUH # singletask (~500 MB)
gdown 1FZdKRT3E4mISFQ2HnRCODxXJQkF1xeBf # multitask_2 (~500 MB)
gdown 11AQtrY6veTF_j337g2f9YaSBxipfsdor # multitask_4 (~500 MB)
cd ../..
# Start Docker containers (builds in ~2-3 minutes)
./docker-start.shOption 2B: Auto-download During Build (Easier)
Let Docker download models automatically:
# Just start Docker (models download during first build)
./docker-start.sh
# First build: ~5-10 minutes (downloads models)
# Subsequent builds: ~2-3 minutes (models cached in image)Access: http://localhost:5173/SafetyLens/
Stop:
# Press Ctrl+C, then:
docker-compose downRun backend and frontend separately for development work:
Backend:
# Activate virtual environment
source safetylens/bin/activate # or safetylens\Scripts\activate on Windows
# Download models (if not already done - see Option 2) or train them.
# Start FastAPI server
uvicorn app.backend.main:app --reload --port 8000Frontend (in new terminal):
# Install Node dependencies
cd app/frontend
npm install
# Start development server
npm run devAccess:
- Frontend: http://localhost:5173
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs
# Python 3.11+
python --version
# Install dependencies
pip install -r requirements.txtpython scripts/models/train_logistic_regression.py \
--mod_out models/checkpoints/logistic_regression_model.pkl \
--preds_out results/predictions/logistic_regression/test_preds.csvpython scripts/models/train_singletask.py \
--ckpt_out models/checkpoints/best_singletask.pt \
--preds_out results/predictions/single_task_transformer/test_preds.csvpython scripts/models/train_multitask.py \
--tasks 2 \
--ckpt_out models/checkpoints/best_multitask_2.pt \
--preds_out results/predictions/multi_task_transformer/test_preds_2.csv python scripts/models/train_multitask.py \
--tasks 4 \
--ckpt_out models/checkpoints/best_multitask_4.pt \
--preds_out results/predictions/multi_task_transformer/test_preds_4.csvSingle-Task:
python scripts/explainability/run_integrated_gradients.py \
--checkpoint models/checkpoints/best_singletask.pt \
--model_type singletask \
--task Q_overall \
--n_samples 10
--output_dir results/explainability/ig/ig_single_q.csvMulti-Task 2-Head:
python scripts/explainability/run_integrated_gradients.py \
--checkpoint models/checkpoints/best_multitask_2.pt \
--model_type multitask \
--task Q_overall \
--n_samples 30
--output_dir results/explainability/ig/ig_2task_q.csvMulti-Task 4-Head:
python scripts/explainability/run_integrated_gradients.py \
--checkpoint models/checkpoints/best_multitask_4.pt \
--model_type multitask \
--task Q_overall \
--n_samples 30
--output_dir results/explainability/ig/ig_4task_q.csvLogistic Regression:
python scripts/explainability/run_shap.py \
--ckpt models/checkpoints/best_multitask_4.pt \
--data_csv data/processed/dices_350_binary.csv \
--model_type multitask \
--task Q_overall \
--out_csv results/explainability/shap/multi_shap_q_overall.csvSingle-Task Transformer:*
python scripts/explainability/run_shap.py \
--ckpt models/checkpoints/best_singletask.pt \
--data_csv data/processed/dices_350_binary.csv \
--model_type singletask \
--task Q_overall \
--out_csv results/explainability/shap/single_shap_q_overall.csvMulti-Task Transformer:
python scripts/explainability/run_shap.py \
--ckpt models/checkpoints/logistic_regression_model.pkl \
--data_csv data/processed/dices_350_binary.csv \
--vectorizer models/checkpoints/tfidf_vectorizer.pkl \
--model_type logreg \
--task Q_overall \
--out_csv results/explainability/shap/logreg_shap_q_overall.csvLogistic Regression:
python scripts/explainability/run_lime.py \
--model_type logreg \
--checkpoint models/checkpoints/logistic_regression_model.pkl \
--output_dir results/explainability/limeSingle-Task Transformer:*
python scripts/explainability/run_lime.py \
--model_type singletask \
--checkpoint models/checkpoints/best_singletask.pt \
--output_dir results/explainability/limeMulti-Task Transformer:
python scripts/explainability/run_lime.py \
--model_type multitask \
--checkpoint models/checkpoints/best_multitask_4.pt \
--output_dir results/explainability/limepython scripts/evaluation/experiment1_2.py --experiment 1python scripts/evaluation/experiment1_2.py --experiment 2Multi-Task Transformer:
python scripts/explainability/run_integrated_gradients.py \
--checkpoint models/checkpoints/best_multitask_4.pt \
--model_type multitask \
--task Q_overall \
--n_samples 30 \
--confidence_split high_vs_borderline \
--n_steps 50 \
--output_dir results/evaluation/experiment3/experiment3_multi4.csvSingle-Task Transformer:
python scripts/explainability/run_integrated_gradients.py \
--checkpoint models/checkpoints/best_singletask.pt \
--model_type singletask \
--task Q_overall \
--n_samples 30 \
--confidence_split high_vs_borderline \
--n_steps 50 \
--output_dir results/evaluation/experiment3/experiment3_single.csvSingle-Task Summary:
python scripts/evaluation/experiment3.py \
--ig_results results/evaluation/experiment3_multi4.csv \
--metric concentration \
--output results/evaluation/experiment3/experiment3_multi4_summary.csvMulti-Task Summary:
python scripts/evaluation/experiment3.py \
--ig_results results/evaluation/experiment3_single.csv \
--metric concentration \
--output results/evaluation/experiment3/experiment3_single_summary.csvpython src/visualizations.py # Creates plots in results/figures/Academic project completed for COSC-243: Natural Language Processing at Amherst College (Fall 2025).
