
Encrypted VPN traffic classification using Random Forest, HistGradientBoosting and SHAP


Siddarthkutumbaka/Encrypted-Traffic-Classification-ML


Encrypted Traffic Classification with Random Forest + SHAP

This project trains supervised ML models to classify encrypted VPN and non-VPN traffic from flow-level features.

  • Dataset: CIC VPN 2016 (VPN and non-VPN traffic with 14 application labels)
  • Models:
    • Baseline: RandomForestClassifier
    • Boosted: HistGradientBoostingClassifier (tree-based boosting)
  • Performance (test set):
    • Random Forest accuracy: ~0.90
    • Boosted model accuracy: ~0.90–0.91
    • Macro ROC-AUC: ~0.93
  • Explainability: SHAP TreeExplainer to understand which flow features (duration, total bytes, inter-arrival time, etc.) drive predictions.

Project Structure

Encrypted-Traffic-Classification-ML/
├── notebooks/
│   └── 01_exploration.ipynb   # Data loading, RF + HGB models, confusion matrix, SHAP plots
├── reports/
│   ├── confusion_matrix_rf.png
│   └── confusion_matrix_hgb.png
├── requirements.txt           # Python dependencies
└── .gitignore

How to Run

# 1. Clone the repo
git clone https://github.com/Siddarthkutumbaka/Encrypted-Traffic-Classification-ML.git
cd Encrypted-Traffic-Classification-ML

# 2. Create a virtual environment (optional but recommended)
python3 -m venv venv
source venv/bin/activate    # macOS/Linux
# venv\Scripts\activate     # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Open the notebook
jupyter notebook notebooks/01_exploration.ipynb

Key Results
  • Roughly 0.90 test accuracy across 14 encrypted traffic classes (browsing, chat, VoIP, VPN subtypes, etc.).
  • Confusion matrices show strong separation between VPN sub-classes.
  • SHAP analysis highlights duration, total_biat, mean_biat, flowPktsPerSecond, etc. as the most influential features.

Potential Future Work
  • Deploy as an online classifier (REST API or streaming).
  • Compare with deep learning models (1D CNN / LSTM on flow sequences).
  • Enhance adversarial robustness and generalization to new VPN protocols.
