Skip to content

Real-time SMS spam detection using ML models in simulated cellular networks. Compares 4 algorithms with comprehensive performance analysis.

License

Notifications You must be signed in to change notification settings

breezy-codes/machine-learning-for-spam-sms

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Machine Learning for Real-Time SMS Spam Detection in Cellular Networks

Python scikit-learn License: GPL v3

Overview

This project evaluates the effectiveness of multiple machine learning models for real-time spam detection in cellular networks. Using a comprehensive SMS dataset, we train, evaluate, and simulate four different ML models in a realistic cellular network environment to determine the most effective approach for real-time spam filtering. View the academic report here.

Key Features

  • 4 Machine Learning Models: Logistic Regression, Naive Bayes, Random Forest, and Support Vector Machine
  • Real-time Simulation: Simulated cellular network environment with baseband units and radio units
  • Comprehensive Evaluation: Performance metrics including accuracy, precision, recall, and F1-score
  • Automated Spam Detection: Real-time alerting system for high spam volume detection
  • Dataset Generation: Custom spam dataset generator for testing on unseen data

Research Focus

The project addresses the critical need for effective spam detection in cellular networks by:

  • Comparing multiple ML algorithms in a realistic network simulation
  • Evaluating real-time performance under cellular network constraints
  • Analyzing model effectiveness for different spam patterns and volumes
  • Providing insights into the most suitable algorithms for mobile network deployment

Model Performance Summary

Model Training Accuracy Simulation Accuracy Simulation Time Best Use Case
Logistic Regression 99% 88% 8m 47s High precision spam detection
Naive Bayes 89% 81% 9m 39s Fast processing, good recall
Random Forest 89% 85% 20m 55s Balanced performance
Support Vector Machine 89% 82% 25m 53s Complex pattern recognition

Project Structure

machine-learning-for-spam-sms/
β”œβ”€β”€ πŸ“ models/                          # ML Model Development
β”‚   β”œβ”€β”€ models.ipynb                    # Model training & evaluation
β”‚   β”œβ”€β”€ spam_data.csv                   # Training dataset
β”‚   └── *.pkl                          # Trained model files
β”œβ”€β”€ πŸ“ simulation/                      # Network Simulation
β”‚   β”œβ”€β”€ simulation.ipynb                # Main simulation notebook
β”‚   β”œβ”€β”€ πŸ“ data/                       # Generated test datasets
β”‚   β”œβ”€β”€ πŸ“ logs/                       # Simulation logs by model
β”‚   └── πŸ“ results/                    # Performance results & figures
β”œβ”€β”€ πŸ“ spam-generator/                  # Dataset Generation
β”‚   β”œβ”€β”€ generator.py                    # Spam dataset generator
β”‚   └── conversations.py               # Conversation templates
β”œβ”€β”€ πŸ“ markdown/                        # Documentation
β”‚   β”œβ”€β”€ models.md                       # Model implementation details
β”‚   β”œβ”€β”€ simulation.md                   # Simulation methodology
β”‚   └── install_instructions.md         # Setup instructions
└── requirements.txt                    # Python dependencies

Key Components

  • Machine Learning Models: Four different algorithms trained on SMS spam data
  • Cellular Network Simulation: Realistic network topology with baseband and radio units
  • Real-time Processing: Stream processing of SMS messages with spam detection
  • Performance Monitoring: Comprehensive logging and alerting system
  • Dataset Generation: Custom spam generator for testing model robustness

Quick Start

Prerequisites

  • Python 3.8 or higher
  • Virtual environment (recommended)

Installation

  1. Clone the repository

    git clone https://github.com/breezy-codes/machine-learning-for-spam-sms.git
    cd machine-learning-for-spam-sms
  2. Set up virtual environment

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .\.venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt

For detailed setup instructions, see: Setting Up a Python Virtual Environment

πŸ€– Running the Machine Learning Models

Train and evaluate all four ML models using the comprehensive Jupyter notebook:

jupyter notebook models/models.ipynb

What the Models Do

  • Data Preprocessing: Text cleaning, tokenization, and vectorization
  • Model Training: Hyperparameter tuning with cross-validation
  • Performance Evaluation: Accuracy, precision, recall, F1-score metrics
  • Model Persistence: Saves trained models as .pkl files

Detailed guide: Model Implementation Notes

Running the Cellular Network Simulation

Experience real-time spam detection in a simulated cellular environment:

jupyter notebook simulation/simulation.ipynb

Simulation Features

  • Network Topology: Multiple baseband units with radio units
  • Real-time Processing: Stream-based message processing
  • Spam Detection: Live classification with alerting system
  • Performance Analytics: Comprehensive logging and metrics collection
  • Load Testing: Handles high-volume message streams

Detailed guide: Simulation Methodology

Results & Analysis

Model Performance Comparison

The simulation reveals interesting trade-offs between different algorithms:

  • Logistic Regression: Highest precision (99% spam detection) but lower recall
  • Random Forest: Best balanced performance with 85% accuracy
  • Naive Bayes: Fastest processing with good spam recall (90%)
  • SVM: Robust to outliers but computationally intensive

Real-time Performance Insights

  • Processing Speed: Naive Bayes processes messages fastest
  • Memory Usage: Logistic Regression has smallest memory footprint
  • Accuracy vs Speed: Random Forest offers best accuracy/speed balance
  • Alert Response: All models successfully trigger spam volume alerts

πŸ› οΈ Technical Architecture

Machine Learning Pipeline

  1. Data Preprocessing: Text normalization, stop word removal, stemming
  2. Feature Extraction: TF-IDF vectorization with n-grams
  3. Model Training: Cross-validation with hyperparameter optimization
  4. Evaluation: Multi-metric assessment on held-out test data

Cellular Network Simulation

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Radio Unit    │───▢│  Baseband Unit   │───▢│  Core Network   β”‚
β”‚  (Message RX)   β”‚     β”‚  (ML Processing) β”‚     β”‚ (Spam Alerts)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  • Radio Units: Simulate message reception from mobile devices
  • Baseband Units: Apply ML models for real-time spam classification
  • Core Network: Aggregate results and trigger spam volume alerts
  • Logging System: Captures all decisions and performance metrics

Customization & Extension

Adding New Models

  1. Train your model in models/models.ipynb
  2. Save as .pkl file in the models/ directory
  3. Add simulation code in simulation/simulation.ipynb
  4. Update logging and results directories

Modifying Network Topology

  • Adjust baseband unit count in simulation parameters
  • Configure radio unit connections per baseband
  • Customize message processing rates and volumes

Custom Dataset Generation

Use the spam generator to create targeted test scenarios:

from spam_generator.generator import generate_spam_dataset
dataset = generate_spam_dataset(volume=1000, spam_ratio=0.3)

Dependencies

Key libraries used in this project:

  • scikit-learn: Machine learning algorithms and evaluation
  • pandas: Data manipulation and analysis
  • numpy: Numerical computing
  • matplotlib/seaborn: Data visualization
  • nltk: Natural language processing
  • simpy: Discrete event simulation
  • jupyter: Interactive development environment

Contributing

Contributions are welcome! Areas for improvement:

  • Additional ML algorithms (Deep Learning, XGBoost)
  • Enhanced network simulation (5G features, edge computing)
  • Real-world dataset integration
  • Performance optimization
  • Mobile deployment strategies

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

References

  • SMS Spam Collection Dataset
  • Cellular Network Architecture Standards
  • Machine Learning for Telecommunications
  • Real-time Stream Processing Techniques

Built with ❀️ for telecommunications and machine learning research

About

Real-time SMS spam detection using ML models in simulated cellular networks. Compares 4 algorithms with comprehensive performance analysis.

Topics

Resources

License

Stars

Watchers

Forks