A comprehensive cybersecurity solution that uses machine learning algorithms to detect phishing URLs and spam messages in real-time. The project includes a Flask web application, trained ML models, and a browser extension for proactive protection against phishing attacks.
- Overview
- Features
- Project Structure
- Machine Learning Algorithms
- Feature Engineering
- Key Functions
- Installation
- Usage
- API Endpoints
- Browser Extension
- Technologies Used
This project implements a multi-layered phishing detection system that combines:
- Machine Learning Models: Random Forest, SVM, and Logistic Regression classifiers
- Feature Extraction: 17+ URL-based features for comprehensive analysis
- Web Application: Flask-based API for real-time URL scanning
- Browser Extension: Chrome extension for proactive protection while browsing
- Spam Detection: Text-based spam detection using TF-IDF vectorization
- Real-time URL Analysis: Instant phishing detection for any URL
- Multiple ML Models: Ensemble approach using three different algorithms
- Batch Processing: Analyze multiple URLs simultaneously
- Browser Integration: Chrome extension with Gmail integration
- Spam Detection: Email and message spam classification
- Confidence Scoring: Probability-based prediction confidence
- Visual Warnings: In-browser alerts for suspicious links
- Model Persistence: Pre-trained models for quick deployment
```
Detection-of-Phishing-URLs-using-Machine-Learning/
├── app.py                          # Flask web application
├── phishing_detector.py            # Core ML model training script
├── phishing_detector_v2.py         # Enhanced version with additional features
├── phishing_detector_using_gpu.py  # GPU-accelerated training
├── spam_detector.py                # Spam detection module
├── train_spam_model.py             # Spam model training script
├── retrain_model.py                # Model retraining utility
├── phishing_detector.pkl           # Trained phishing detection model
├── spam_detector.pkl               # Trained spam detection model
├── requirements.txt                # Python dependencies
├── extension/                      # Browser extension files
│   ├── manifest.json               # Extension configuration
│   ├── background.js               # Background service worker
│   ├── content.js                  # Content script for all pages
│   ├── gmail-content.js            # Gmail-specific integration
│   ├── popup.html                  # Extension popup interface
│   ├── popup.js                    # Popup logic
│   └── options.js                  # Extension settings
└── templates/                      # HTML templates for Flask app
```
- Algorithm Type: Ensemble learning method
- Configuration: 100 decision trees (`n_estimators=100`)
- Advantages:
- Handles non-linear relationships well
- Resistant to overfitting
- Provides feature importance rankings
- Use Case: Primary model for phishing detection
- Algorithm Type: Kernel-based classifier
- Configuration: RBF (Radial Basis Function) kernel
- Preprocessing: Requires feature scaling using `StandardScaler`
- Advantages:
- Effective in high-dimensional spaces
- Memory efficient
- Robust against outliers
- Use Case: Alternative model for complex decision boundaries
- Algorithm Type: Linear classification model
- Configuration: Maximum 1000 iterations for convergence
- Preprocessing: Feature scaling applied
- Advantages:
- Fast training and prediction
- Provides probability estimates
- Interpretable coefficients
- Use Case: Baseline model and probability calibration
The system trains all three models and selects the best performer based on F1-Score, which balances precision and recall. This ensures optimal detection of phishing URLs while minimizing false positives.
The system extracts 17+ features from each URL:
- `url_length`: Total character count of the URL
- `domain_length`: Length of the domain name
- `path_length`: Length of the URL path
- `query_length`: Length of query parameters
- `path_depth`: Number of directory levels in the path
- `subdomain_count`: Number of subdomains
- `has_ip`: Binary flag for an IP address used instead of a domain name
- `digits_in_domain`: Count of numeric characters in the domain
- `letters_in_domain`: Count of alphabetic characters in the domain
- `dash_count`: Number of hyphens (-)
- `dot_count`: Number of dots (.)
- `underscore_count`: Number of underscores (_)
- `question_count`: Number of question marks (?)
- `equal_count`: Number of equals signs (=)
- `and_count`: Number of ampersands (&)
- `at_count`: Number of @ symbols
- `is_https`: Binary flag for HTTPS protocol
- `has_port`: Binary flag for custom port usage
- `has_suspicious_words`: Detects keywords like 'secure', 'account', 'login', 'verify', 'bank'
- `domain_entropy`: Shannon entropy of the domain (measures randomness)
- `path_entropy`: Shannon entropy of the path
```python
import numpy as np

def calculate_entropy(s):
    """Calculate Shannon entropy to measure randomness"""
    if not s:
        return 0
    entropy = 0
    for x in set(s):
        p_x = s.count(x) / len(s)
        entropy += -p_x * np.log2(p_x)
    return entropy
```

Higher entropy values indicate more random character distributions, a pattern common in phishing URLs. For example, a dictionary-word domain such as `google` scores about 1.9 bits, while a ten-character random string scores around 3.3.
Purpose: Extracts all features from a given URL for ML prediction
Parameters:
- `url` (str): The URL to analyze
Returns: Dictionary of feature names and values
Logic:
- Parses the URL using `urlparse` and `tldextract`
- Calculates structural metrics (lengths, counts)
- Identifies suspicious patterns
- Computes entropy for randomness detection
- Returns feature dictionary
Example:
```python
features = detector.extract_features("https://secure-paypal-update.com/signin")
# Returns: {'url_length': 39, 'has_suspicious_words': 1, ...}
```

Purpose: Trains multiple ML models and compares their performance
Parameters:
- `X` (array): Feature matrix (n_samples × n_features)
- `y` (array): Labels (0 = legitimate, 1 = phishing)
Returns: Dictionary containing model performance metrics
Logic:
- Splits data into 80% training, 20% testing
- Applies `StandardScaler` for SVM and Logistic Regression
- Trains Random Forest, SVM, and Logistic Regression
- Evaluates using accuracy, precision, recall, and F1-score
- Selects best model based on F1-score
- Stores all models for ensemble predictions
Metrics Calculated:
- Accuracy: Overall correctness
- Precision: True positives / (True positives + False positives)
- Recall: True positives / (True positives + False negatives)
- F1-Score: Harmonic mean of precision and recall
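The comparison loop described above can be sketched as follows; the synthetic data and variable names are illustrative stand-ins, not the project's actual code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 17-feature URL matrix.
X, y = make_classification(n_samples=500, n_features=17, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)

# (model, needs_scaling) pairs: only SVM and Logistic Regression are scaled.
candidates = {
    "Random Forest": (RandomForestClassifier(n_estimators=100, random_state=42), False),
    "SVM": (SVC(kernel="rbf", probability=True, random_state=42), True),
    "Logistic Regression": (LogisticRegression(max_iter=1000), True),
}

scores = {}
for name, (model, needs_scaling) in candidates.items():
    Xtr = scaler.transform(X_train) if needs_scaling else X_train
    Xte = scaler.transform(X_test) if needs_scaling else X_test
    model.fit(Xtr, y_train)
    scores[name] = f1_score(y_test, model.predict(Xte))

best = max(scores, key=scores.get)  # winner by F1-score
print(best, round(scores[best], 3))
```

Selecting on F1-score rather than accuracy matters here because phishing datasets are often imbalanced.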
Purpose: Predicts whether a URL is phishing or legitimate
Parameters:
- `url` (str): URL to classify
- `model_name` (str, optional): Specific model to use (defaults to the best model)
Returns: Dictionary with prediction, confidence, and model used
Logic:
- Extracts features from the URL
- Applies feature scaling if needed (for SVM/Logistic Regression)
- Makes prediction using specified or best model
- Calculates confidence score using `predict_proba()`
- Returns structured result
Example:
```python
result = detector.predict("http://192.168.1.1/bank-login")
# Returns: {
#     'prediction': 'Phishing',
#     'confidence': 0.95,
#     'model_used': 'Random Forest',
#     'features': {...}
# }
```

Purpose: Persist and load trained models
Saved Components:
- All trained models (Random Forest, SVM, Logistic Regression)
- `StandardScaler` object (for consistent feature scaling)
- Feature names list
- Best model identifier
Logic: Uses joblib for efficient serialization of scikit-learn objects
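A minimal sketch of this pattern; the bundle keys below are assumptions for illustration and may not match the exact structure of `phishing_detector.pkl`:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Toy model and scaler standing in for the real trained objects.
np.random.seed(0)
X = np.random.rand(20, 17)
y = np.random.randint(0, 2, 20)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

bundle = {
    "models": {"Random Forest": model},
    "scaler": StandardScaler().fit(X),
    "feature_names": [f"f{i}" for i in range(17)],
    "best_model": "Random Forest",
}

path = os.path.join(tempfile.gettempdir(), "phishing_detector.pkl")
joblib.dump(bundle, path)        # save everything in one file

loaded = joblib.load(path)       # restore for prediction
print(loaded["best_model"])
```

Bundling the scaler with the models guarantees that prediction-time features are scaled exactly as they were during training.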
Purpose: Loads and preprocesses URL datasets from CSV files
Parameters:
- `file_path` (str): Path to a CSV file with 'URL' and 'Label' columns
Returns: Tuple of (urls, labels)
Logic:
- Reads CSV using pandas
- Validates required columns exist
- Removes rows with missing values
- Cleans URLs using `clean_url()`
- Validates URLs using `is_valid_url()`
- Converts labels to numeric (0/1) using `process_label()`
- Handles errors gracefully, skipping invalid entries
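A simplified sketch of that pipeline, with an inline CSV standing in for the dataset file and the cleaning/validation helpers reduced to a `strip()` (names here are illustrative):

```python
import io

import pandas as pd

csv_text = """URL,Label
https://example.com,good
http://paypa1-login.tk/verify,bad
,good
"""

def load_dataset(source):
    df = pd.read_csv(source)
    if not {"URL", "Label"}.issubset(df.columns):   # validate required columns
        raise ValueError("CSV must contain 'URL' and 'Label' columns")
    df = df.dropna(subset=["URL", "Label"])         # drop rows with missing values
    urls = [u.strip() for u in df["URL"]]           # minimal cleaning
    labels = [0 if lbl == "good" else 1 for lbl in df["Label"]]
    return urls, labels

urls, labels = load_dataset(io.StringIO(csv_text))
print(urls, labels)  # the row with a missing URL is skipped
```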
Purpose: Handles missing or invalid feature values
Logic:
- Detects NaN values in feature matrix
- Uses `SimpleImputer` with a mean strategy to fill missing values
- Falls back to zero-filling if imputation fails
- Ensures no NaN values remain before model training
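Sketched with a toy feature matrix (the zero-fill fallback is shown here via `np.nan_to_num`, an assumption about the implementation):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

if np.isnan(X).any():
    try:
        # Replace each NaN with its column mean.
        X = SimpleImputer(strategy="mean").fit_transform(X)
    except Exception:
        X = np.nan_to_num(X)  # fallback: zero-fill

print(X)  # [[1. 2.] [2. 4.] [3. 3.]]
```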
Purpose: Cleans and normalizes text for spam detection
Logic:
- Converts to lowercase
- Removes URLs using regex
- Removes email addresses
- Removes numbers
- Removes punctuation
- Normalizes whitespace
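A plausible sketch of these steps; the exact regular expressions in `spam_detector.py` may differ:

```python
import re
import string

def preprocess_text(text):
    text = text.lower()                                    # lowercase
    text = re.sub(r"http\S+|www\.\S+", " ", text)          # remove URLs
    text = re.sub(r"\S+@\S+", " ", text)                   # remove email addresses
    text = re.sub(r"\d+", " ", text)                       # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    return " ".join(text.split())                          # normalize whitespace

print(preprocess_text("WIN $1000 now!!! Visit http://spam.tk or mail a@b.com"))
# -> "win now visit or mail"
```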
Purpose: Extracts features from text messages
Features Extracted:
- TF-IDF features: Term frequency-inverse document frequency vectors
- Length features: Total character count, word count
- Stylistic features: Capital letters, exclamation marks, question marks
- Capital ratio: Proportion of uppercase characters
Returns: Combined feature vector for classification
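The combination can be sketched by stacking the TF-IDF matrix with hand-crafted columns (the feature choices mirror the list above; the exact construction in the project may differ):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

messages = [
    "Congratulations! You've WON a FREE prize!!!",
    "Hi, are we still meeting for lunch tomorrow?",
]

tfidf = TfidfVectorizer().fit_transform(messages)

# Length and stylistic features, one row per message.
extra = np.array([
    [len(m),                                  # character count
     len(m.split()),                          # word count
     m.count("!"),                            # exclamation marks
     m.count("?"),                            # question marks
     sum(c.isupper() for c in m) / len(m)]    # capital ratio
    for m in messages
])

features = hstack([tfidf, csr_matrix(extra)])  # combined feature vector
print(features.shape)
```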
Purpose: Classifies text as spam or ham (legitimate)
Returns: Dictionary with prediction ('spam'/'ham'), confidence, and model used
Purpose: API endpoint for single URL phishing detection
Request Body:
```json
{
  "url": "https://example.com",
  "model": "Random Forest"  // optional
}
```

Response:
```json
{
  "url": "https://example.com",
  "prediction": "Legitimate",
  "prediction_numeric": 0,
  "confidence": 0.92,
  "model_used": "Random Forest"
}
```

Purpose: Batch processing of multiple URLs
Request Body:
```json
{
  "urls": ["url1", "url2", "url3"],
  "model": "Random Forest"
}
```

Returns: Array of prediction results
Purpose: Spam detection for text messages
Request Body:
```json
{
  "text": "Congratulations! You've won $1000...",
  "model": "Random Forest"
}
```

- Python 3.8+
- pip package manager
- Clone the repository:
```bash
git clone https://github.com/yourusername/Detection-of-Phishing-URLs-using-Machine-Learning.git
cd Detection-of-Phishing-URLs-using-Machine-Learning
```

- Create a virtual environment (recommended):

```bash
python -m venv .venv
.venv\Scripts\activate       # Windows
# source .venv/bin/activate  # Linux/Mac
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Train models (if not using pre-trained):

```bash
python phishing_detector.py
python train_spam_model.py
```

- Run the Flask application:

```bash
python app.py
```

The application will be available at http://localhost:5000
- Navigate to `http://localhost:5000`
- Enter a URL in the input field
- Click "Check URL" to get instant results
- View prediction, confidence score, and model used
Python Example:
```python
import requests

url = "http://localhost:5000/api/detect"
data = {"url": "https://suspicious-site.com"}
response = requests.post(url, json=data)
print(response.json())
```

cURL Example:

```bash
curl -X POST http://localhost:5000/api/detect \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}'
```

Batch Example:

```python
import requests

url = "http://localhost:5000/api/batch_detect"
data = {
    "urls": [
        "https://google.com",
        "http://suspicious-paypal.tk",
        "https://github.com"
    ]
}
response = requests.post(url, json=data)
results = response.json()
```

| Endpoint | Method | Description | Request Body |
|---|---|---|---|
| `/` | GET | Main web interface | - |
| `/api/detect` | POST | Single URL detection | `{"url": "...", "model": "..."}` |
| `/api/batch_detect` | POST | Multiple URL detection | `{"urls": [...], "model": "..."}` |
| `/api/detect_spam` | POST | Spam text detection | `{"text": "...", "model": "..."}` |
| `/api/models` | GET | List available models | - |
| `/detect` | GET/POST | Form-based detection | Form data |
| `/batch` | GET | Batch detection page | - |
- Real-time Link Scanning: Automatically checks links on web pages
- Visual Indicators: Color-coded badges (green=safe, red=phishing, yellow=suspicious)
- Gmail Integration: Scans links in Gmail messages
- Click Protection: Warns users before visiting phishing sites
- Customizable Settings: Configure API endpoint and sensitivity
- Load Extension in Chrome:
  - Open `chrome://extensions/`
  - Enable "Developer mode"
  - Click "Load unpacked"
  - Select the `extension` folder
- Configure Settings:
  - Click the extension icon
  - Go to Options
  - Set the API endpoint (default: `http://localhost:5000`)
- `background.js`: Service worker for API communication
- `content.js`: Scans and marks links on all web pages
- `gmail-content.js`: Specialized Gmail integration
- `popup.js`: Extension popup interface
- `options.js`: Settings management
- Flask: Web framework for API
- scikit-learn: Machine learning library
- pandas: Data manipulation
- NumPy: Numerical computing
- joblib: Model serialization
- Random Forest: Ensemble classifier
- SVM: Support Vector Machine
- Logistic Regression: Linear classifier
- TF-IDF: Text vectorization for spam detection
- HTML/CSS/JavaScript: Web interface
- Chrome Extension API: Browser integration
- tldextract: Domain extraction
- BeautifulSoup: HTML parsing
- urllib: URL parsing
- Flask-CORS: Cross-origin resource sharing
The models are evaluated using:
- Accuracy: Overall prediction correctness
- Precision: Ratio of true phishing detections to all phishing predictions
- Recall: Ratio of detected phishing URLs to all actual phishing URLs
- F1-Score: Harmonic mean balancing precision and recall
Typical performance metrics:
- Random Forest: F1-Score ~0.95+
- SVM: F1-Score ~0.92+
- Logistic Regression: F1-Score ~0.88+
To retrain models with new data:
- Prepare a CSV file with columns: `URL`, `Label` (good/bad)
- Update the file path in `phishing_detector.py`
- Run the training script:

```bash
python phishing_detector.py
```

For spam detection:

```bash
python train_spam_model.py
```

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is created for educational purposes as part of a Cyber Security course.
Shubham
- GitHub: @shub15
- Dataset sources for phishing URL training
- scikit-learn documentation and community
- Flask framework developers
- Chrome Extension API documentation
Note: This system is designed for educational and research purposes. Always use multiple layers of security when protecting against phishing attacks.