This project evaluates the effectiveness of multiple machine learning models for real-time spam detection in cellular networks. Using a comprehensive SMS dataset, we train, evaluate, and simulate four different ML models in a realistic cellular network environment to determine the most effective approach for real-time spam filtering. View the academic report here.
- 4 Machine Learning Models: Logistic Regression, Naive Bayes, Random Forest, and Support Vector Machine
- Real-time Simulation: Simulated cellular network environment with baseband units and radio units
- Comprehensive Evaluation: Performance metrics including accuracy, precision, recall, and F1-score
- Automated Spam Detection: Real-time alerting system for high spam volume detection
- Dataset Generation: Custom spam dataset generator for testing on unseen data
The project addresses the critical need for effective spam detection in cellular networks by:
- Comparing multiple ML algorithms in a realistic network simulation
- Evaluating real-time performance under cellular network constraints
- Analyzing model effectiveness for different spam patterns and volumes
- Providing insights into the most suitable algorithms for mobile network deployment
| Model | Training Accuracy | Simulation Accuracy | Simulation Time | Best Use Case |
|---|---|---|---|---|
| Logistic Regression | 99% | 88% | 8m 47s | High precision spam detection |
| Naive Bayes | 89% | 81% | 9m 39s | Fast processing, good recall |
| Random Forest | 89% | 85% | 20m 55s | Balanced performance |
| Support Vector Machine | 89% | 82% | 25m 53s | Complex pattern recognition |
machine-learning-for-spam-sms/
├── models/                      # ML Model Development
│   ├── models.ipynb             # Model training & evaluation
│   ├── spam_data.csv            # Training dataset
│   └── *.pkl                    # Trained model files
├── simulation/                  # Network Simulation
│   ├── simulation.ipynb         # Main simulation notebook
│   ├── data/                    # Generated test datasets
│   ├── logs/                    # Simulation logs by model
│   └── results/                 # Performance results & figures
├── spam-generator/              # Dataset Generation
│   ├── generator.py             # Spam dataset generator
│   └── conversations.py         # Conversation templates
├── markdown/                    # Documentation
│   ├── models.md                # Model implementation details
│   ├── simulation.md            # Simulation methodology
│   └── install_instructions.md  # Setup instructions
└── requirements.txt             # Python dependencies
- Machine Learning Models: Four different algorithms trained on SMS spam data
- Cellular Network Simulation: Realistic network topology with baseband and radio units
- Real-time Processing: Stream processing of SMS messages with spam detection
- Performance Monitoring: Comprehensive logging and alerting system
- Dataset Generation: Custom spam generator for testing model robustness
- Python 3.8 or higher
- Virtual environment (recommended)
- Clone the repository

      git clone https://github.com/breezy-codes/machine-learning-for-spam-sms.git
      cd machine-learning-for-spam-sms

- Set up virtual environment

      python -m venv .venv
      source .venv/bin/activate  # On Windows: .\.venv\Scripts\activate

- Install dependencies

      pip install -r requirements.txt

For detailed setup instructions, see: Setting Up a Python Virtual Environment
Train and evaluate all four ML models using the comprehensive Jupyter notebook:

    jupyter notebook models/models.ipynb

- Data Preprocessing: Text cleaning, tokenization, and vectorization
- Model Training: Hyperparameter tuning with cross-validation
- Performance Evaluation: Accuracy, precision, recall, F1-score metrics
- Model Persistence: Saves trained models as `.pkl` files
Detailed guide: Model Implementation Notes
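
The training steps listed above can be sketched with scikit-learn as follows. This is a minimal illustration, not the notebook's actual code: the toy messages stand in for `models/spam_data.csv`, and the parameter grid, fold count, and scoring choice are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the real training data (1 = spam, 0 = ham).
texts = [
    "WIN a free prize now, click the link",
    "Free cash prize, claim now",
    "Urgent: claim your free reward today",
    "You won a free holiday, reply WIN now",
    "Are we still meeting for lunch today?",
    "Can you call me when you get home?",
    "See you at the gym after work",
    "Running late, be home around six",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorization and classification chained so both are fitted per fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Small grid over the regularization strength (assumed values).
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    cv=2,           # tiny fold count only because the toy dataset is tiny
    scoring="f1",
)
search.fit(texts, labels)

best_model = search.best_estimator_
prediction = best_model.predict(["claim your free prize now"])
print(search.best_params_, prediction)
```

Keeping the vectorizer inside the pipeline means the TF-IDF vocabulary is learned only from each training fold, which avoids leaking validation text into the features.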
Experience real-time spam detection in a simulated cellular environment:

    jupyter notebook simulation/simulation.ipynb

- Network Topology: Multiple baseband units with radio units
- Real-time Processing: Stream-based message processing
- Spam Detection: Live classification with alerting system
- Performance Analytics: Comprehensive logging and metrics collection
- Load Testing: Handles high-volume message streams
Detailed guide: Simulation Methodology
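
The alerting behaviour described above could look roughly like the sliding-window monitor below. Everything here is hypothetical and chosen for illustration: the `SpamVolumeMonitor` class, its `window_size` and `alert_ratio` parameters, and the keyword-based `keyword_classifier`, which stands in for a trained model.

```python
from collections import deque


class SpamVolumeMonitor:
    """Flag an alert when the spam share of recent messages reaches a threshold."""

    def __init__(self, window_size=100, alert_ratio=0.5):
        self.window = deque(maxlen=window_size)  # keeps only the latest verdicts
        self.alert_ratio = alert_ratio

    def record(self, is_spam):
        """Store one verdict and return True while the alert condition holds."""
        self.window.append(bool(is_spam))
        if len(self.window) < self.window.maxlen:
            return False  # not enough history to judge volume yet
        return sum(self.window) / len(self.window) >= self.alert_ratio


def keyword_classifier(message):
    """Hypothetical stand-in for a trained model's prediction."""
    return any(word in message.lower() for word in ("free", "prize", "winner"))


monitor = SpamVolumeMonitor(window_size=4, alert_ratio=0.5)
stream = [
    "See you at lunch",
    "FREE prize inside",
    "Call me later",
    "You are a winner",
    "Claim your free gift",
]
alerts = [monitor.record(keyword_classifier(msg)) for msg in stream]
print(alerts)  # [False, False, False, True, True]
```

A bounded `deque` keeps memory constant regardless of stream length, which matters when the monitor runs for the whole simulation.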
The simulation reveals interesting trade-offs between different algorithms:
- Logistic Regression: Highest precision (99% spam detection) but lower recall
- Random Forest: Best balanced performance with 85% accuracy
- Naive Bayes: Fastest processing with good spam recall (90%)
- SVM: Robust to outliers but computationally intensive
- Processing Speed: Naive Bayes processes messages fastest
- Memory Usage: Logistic Regression has smallest memory footprint
- Accuracy vs Speed: Random Forest offers best accuracy/speed balance
- Alert Response: All models successfully trigger spam volume alerts
- Data Preprocessing: Text normalization, stop word removal, stemming
- Feature Extraction: TF-IDF vectorization with n-grams
- Model Training: Cross-validation with hyperparameter optimization
- Evaluation: Multi-metric assessment on held-out test data
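
For reference, the four metrics used in the evaluation step can be computed from predictions as in this small sketch. The function name `classification_metrics` and the example labels are made up for illustration; the notebook may rely on scikit-learn's metric helpers instead.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of flagged messages that are spam
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of spam that was flagged
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)  # every metric is 0.75 for this example
```

Precision and recall pull in different directions for spam filtering: high precision avoids blocking legitimate messages, while high recall avoids letting spam through.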
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Radio Unit    │───▶│  Baseband Unit   │───▶│  Core Network   │
│  (Message RX)   │    │ (ML Processing)  │    │  (Spam Alerts)  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
- Radio Units: Simulate message reception from mobile devices
- Baseband Units: Apply ML models for real-time spam classification
- Core Network: Aggregate results and trigger spam volume alerts
- Logging System: Captures all decisions and performance metrics
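
The message flow between these components might be modelled along the following lines. This is a simplified sketch, not the simulation notebook's code: the class names, method names, `spam_alert_threshold`, and the keyword rule used in place of a trained model are all assumptions.

```python
class CoreNetwork:
    """Aggregates verdicts from baseband units and tracks spam volume."""

    def __init__(self, spam_alert_threshold=2):
        self.spam_alert_threshold = spam_alert_threshold
        self.log = []  # (baseband id, message, verdict) for every decision
        self.spam_count = 0

    def report(self, bbu_id, message, is_spam):
        self.log.append((bbu_id, message, is_spam))
        if is_spam:
            self.spam_count += 1

    def alert_active(self):
        return self.spam_count >= self.spam_alert_threshold


class BasebandUnit:
    """Classifies each message and forwards the verdict to the core."""

    def __init__(self, bbu_id, classify, core):
        self.bbu_id = bbu_id
        self.classify = classify  # any callable: message -> bool
        self.core = core

    def process(self, message):
        self.core.report(self.bbu_id, message, self.classify(message))


class RadioUnit:
    """Receives messages from devices and hands them to its baseband unit."""

    def __init__(self, ru_id, baseband):
        self.ru_id = ru_id
        self.baseband = baseband

    def receive(self, message):
        self.baseband.process(message)


def looks_like_spam(message):
    """Hypothetical rule standing in for a trained model."""
    text = message.lower()
    return "free" in text or "win" in text


core = CoreNetwork(spam_alert_threshold=2)
bbu = BasebandUnit("bbu-1", looks_like_spam, core)
radios = {"ru-1": RadioUnit("ru-1", bbu), "ru-2": RadioUnit("ru-2", bbu)}

for ru_id, text in [("ru-1", "Lunch at noon?"),
                    ("ru-2", "Free entry, text WIN"),
                    ("ru-1", "Win cash today")]:
    radios[ru_id].receive(text)

print(core.spam_count, core.alert_active())  # 2 True
```

Passing the classifier in as a plain callable keeps the topology independent of the model, so any of the four models could be swapped in without touching the network classes.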
- Train your model in `models/models.ipynb`
- Save as a `.pkl` file in the `models/` directory
- Add simulation code in `simulation/simulation.ipynb`
- Update logging and results directories
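
As a rough sketch of the save and load steps above, a scikit-learn pipeline can be pickled and restored as shown here. The file name `my_model.pkl`, the toy data, and the choice to pickle the vectorizer together with the classifier are assumptions; the existing `.pkl` files in the repository may be structured differently.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data standing in for the real training set (1 = spam, 0 = ham).
texts = ["free prize now", "claim free cash", "lunch tomorrow?", "call me later"]
labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# A temporary directory is used here; in the project this would be models/.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "my_model.pkl"  # hypothetical file name
    with model_path.open("wb") as fh:
        pickle.dump(model, fh)
    with model_path.open("rb") as fh:
        loaded = pickle.load(fh)

sample = ["free cash prize", "see you at lunch"]
same = list(model.predict(sample)) == list(loaded.predict(sample))
print(same)  # True
```

Pickling the whole pipeline keeps the fitted vocabulary and the classifier in one file, so the simulation can classify raw text directly. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.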
- Adjust baseband unit count in simulation parameters
- Configure radio unit connections per baseband
- Customize message processing rates and volumes
Use the spam generator to create targeted test scenarios:

    from spam_generator.generator import generate_spam_dataset
    dataset = generate_spam_dataset(volume=1000, spam_ratio=0.3)

Key libraries used in this project:
- scikit-learn: Machine learning algorithms and evaluation
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- matplotlib/seaborn: Data visualization
- nltk: Natural language processing
- simpy: Discrete event simulation
- jupyter: Interactive development environment
Contributions are welcome! Areas for improvement:
- Additional ML algorithms (Deep Learning, XGBoost)
- Enhanced network simulation (5G features, edge computing)
- Real-world dataset integration
- Performance optimization
- Mobile deployment strategies
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
- SMS Spam Collection Dataset
- Cellular Network Architecture Standards
- Machine Learning for Telecommunications
- Real-time Stream Processing Techniques
Built with ❤️ for telecommunications and machine learning research