
Phishing URL Detection using Machine Learning

A comprehensive cybersecurity solution that uses machine learning algorithms to detect phishing URLs and spam messages in real-time. The project includes a Flask web application, trained ML models, and a browser extension for proactive protection against phishing attacks.

🎯 Overview

This project implements a multi-layered phishing detection system that combines:

  • Machine Learning Models: Random Forest, SVM, and Logistic Regression classifiers
  • Feature Extraction: 17+ URL-based features for comprehensive analysis
  • Web Application: Flask-based API for real-time URL scanning
  • Browser Extension: Chrome extension for proactive protection while browsing
  • Spam Detection: Text-based spam detection using TF-IDF vectorization

✨ Features

  • Real-time URL Analysis: Instant phishing detection for any URL
  • Multiple ML Models: Ensemble approach using three different algorithms
  • Batch Processing: Analyze multiple URLs simultaneously
  • Browser Integration: Chrome extension with Gmail integration
  • Spam Detection: Email and message spam classification
  • Confidence Scoring: Probability-based prediction confidence
  • Visual Warnings: In-browser alerts for suspicious links
  • Model Persistence: Pre-trained models for quick deployment

πŸ“ Project Structure

Detection-of-Phishing-URLs-using-Machine-Learning/
├── app.py                          # Flask web application
├── phishing_detector.py            # Core ML model training script
├── phishing_detector_v2.py         # Enhanced version with additional features
├── phishing_detector_using_gpu.py  # GPU-accelerated training
├── spam_detector.py                # Spam detection module
├── train_spam_model.py             # Spam model training script
├── retrain_model.py                # Model retraining utility
├── phishing_detector.pkl           # Trained phishing detection model
├── spam_detector.pkl               # Trained spam detection model
├── requirements.txt                # Python dependencies
├── extension/                      # Browser extension files
│   ├── manifest.json              # Extension configuration
│   ├── background.js              # Background service worker
│   ├── content.js                 # Content script for all pages
│   ├── gmail-content.js           # Gmail-specific integration
│   ├── popup.html                 # Extension popup interface
│   ├── popup.js                   # Popup logic
│   └── options.js                 # Extension settings
└── templates/                      # HTML templates for Flask app

🤖 Machine Learning Algorithms

1. Random Forest Classifier

  • Algorithm Type: Ensemble learning method
  • Configuration: 100 decision trees (n_estimators=100)
  • Advantages:
    • Handles non-linear relationships well
    • Resistant to overfitting
    • Provides feature importance rankings
  • Use Case: Primary model for phishing detection

2. Support Vector Machine (SVM)

  • Algorithm Type: Kernel-based classifier
  • Configuration: RBF (Radial Basis Function) kernel
  • Preprocessing: Requires feature scaling using StandardScaler
  • Advantages:
    • Effective in high-dimensional spaces
    • Memory efficient
    • Robust against outliers
  • Use Case: Alternative model for complex decision boundaries

3. Logistic Regression

  • Algorithm Type: Linear classification model
  • Configuration: Maximum 1000 iterations for convergence
  • Preprocessing: Feature scaling applied
  • Advantages:
    • Fast training and prediction
    • Provides probability estimates
    • Interpretable coefficients
  • Use Case: Baseline model and probability calibration

Model Selection Strategy

The system trains all three models and selects the best performer based on F1-Score, which balances precision and recall. This ensures optimal detection of phishing URLs while minimizing false positives.

πŸ” Feature Engineering

The system extracts 21 features from each URL, grouped into five categories:

URL Structure Features

  1. url_length: Total character count of the URL
  2. domain_length: Length of the domain name
  3. path_length: Length of the URL path
  4. query_length: Length of query parameters
  5. path_depth: Number of directory levels in path

Domain-Based Features

  6. subdomain_count: Number of subdomains
  7. has_ip: Binary flag for IP address instead of domain name
  8. digits_in_domain: Count of numeric characters in domain
  9. letters_in_domain: Count of alphabetic characters in domain

Character Analysis Features

  10. dash_count: Number of hyphens (-)
  11. dot_count: Number of dots (.)
  12. underscore_count: Number of underscores (_)
  13. question_count: Number of question marks (?)
  14. equal_count: Number of equals signs (=)
  15. and_count: Number of ampersands (&)
  16. at_count: Number of @ symbols

Security Features

  17. is_https: Binary flag for HTTPS protocol
  18. has_port: Binary flag for custom port usage
  19. has_suspicious_words: Detects keywords like 'secure', 'account', 'login', 'verify', 'bank'

Entropy Features

  20. domain_entropy: Shannon entropy of domain (measures randomness)
  21. path_entropy: Shannon entropy of path

Feature Extraction Logic

import numpy as np

def calculate_entropy(s):
    """Calculate Shannon entropy to measure randomness"""
    if not s:
        return 0
    entropy = 0
    for x in set(s):
        p_x = s.count(x) / len(s)
        entropy += -p_x * np.log2(p_x)
    return entropy

Higher entropy values indicate more random character distributions, a pattern common in auto-generated phishing domains.
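To see why entropy separates natural-looking domains from random-looking ones, compare a familiar word with a random string. The helper below recomputes the same quantity as calculate_entropy above, rewritten with Counter only so the snippet runs standalone (the name shannon_entropy is ours, not the project's):

```python
from collections import Counter
from math import log2

def shannon_entropy(s):
    """Same computation as calculate_entropy above, via Counter (standalone demo)."""
    if not s:
        return 0.0
    counts = Counter(s)
    return -sum((c / len(s)) * log2(c / len(s)) for c in counts.values())

print(round(shannon_entropy("google"), 2))    # 1.92 -- skewed distribution, word-like
print(round(shannon_entropy("xk9q2vb7"), 2))  # 3.0  -- uniform distribution, random-looking
```

Eight distinct characters used once each give exactly log2(8) = 3 bits, the maximum for a string of that length.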

🔑 Key Functions

PhishingURLDetector Class

extract_features(url)

Purpose: Extracts all features from a given URL for ML prediction

Parameters:

  • url (str): The URL to analyze

Returns: Dictionary of feature names and values

Logic:

  1. Parses URL using urlparse and tldextract
  2. Calculates structural metrics (lengths, counts)
  3. Identifies suspicious patterns
  4. Computes entropy for randomness detection
  5. Returns feature dictionary

Example:

features = detector.extract_features("https://secure-paypal-update.com/signin")
# Returns: {'url_length': 45, 'has_suspicious_words': 1, ...}
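A minimal sketch of how several of these features can be computed with the standard library (the function name and the crude IP/keyword checks below are illustrative assumptions, not the project's actual implementation):

```python
import re
from urllib.parse import urlparse

# Keywords taken from the has_suspicious_words description above.
SUSPICIOUS_WORDS = ['secure', 'account', 'login', 'verify', 'bank']

def extract_features_sketch(url):
    """Illustrative subset of the URL features described above."""
    parsed = urlparse(url)
    domain = parsed.netloc
    return {
        'url_length': len(url),
        'domain_length': len(domain),
        'path_length': len(parsed.path),
        'path_depth': parsed.path.count('/'),
        'dash_count': url.count('-'),
        'dot_count': url.count('.'),
        'at_count': url.count('@'),
        'is_https': 1 if parsed.scheme == 'https' else 0,
        # Crude IPv4 check, optionally with a port.
        'has_ip': 1 if re.fullmatch(r'\d{1,3}(\.\d{1,3}){3}(:\d+)?', domain) else 0,
        'has_suspicious_words': 1 if any(w in url.lower() for w in SUSPICIOUS_WORDS) else 0,
    }

f = extract_features_sketch("https://secure-paypal-update.com/signin")
print(f['has_suspicious_words'])  # 1 ('secure' appears in the domain)
```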

train_models(X, y)

Purpose: Trains multiple ML models and compares their performance

Parameters:

  • X (array): Feature matrix (n_samples × n_features)
  • y (array): Labels (0=legitimate, 1=phishing)

Returns: Dictionary containing model performance metrics

Logic:

  1. Splits data into 80% training, 20% testing
  2. Applies StandardScaler for SVM and Logistic Regression
  3. Trains Random Forest, SVM, and Logistic Regression
  4. Evaluates using accuracy, precision, recall, and F1-score
  5. Selects best model based on F1-score
  6. Stores all models for ensemble predictions

Metrics Calculated:

  • Accuracy: Overall correctness
  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • F1-Score: Harmonic mean of precision and recall
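The training flow described above can be sketched with scikit-learn as follows. This is a simplified stand-in, assuming the hyperparameters stated earlier (100 trees, RBF kernel, max_iter=1000); the function name and return shape are ours:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_models_sketch(X, y):
    """Train the three classifiers described above and pick the best by F1-score."""
    # 80/20 split, as described in the Logic steps.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler().fit(X_train)

    # (model, needs_scaling): scaling is applied only to SVM and Logistic Regression.
    models = {
        'Random Forest': (RandomForestClassifier(n_estimators=100, random_state=42), False),
        'SVM': (SVC(kernel='rbf', probability=True), True),
        'Logistic Regression': (LogisticRegression(max_iter=1000), True),
    }
    scores = {}
    for name, (model, needs_scaling) in models.items():
        Xtr = scaler.transform(X_train) if needs_scaling else X_train
        Xte = scaler.transform(X_test) if needs_scaling else X_test
        model.fit(Xtr, y_train)
        scores[name] = f1_score(y_test, model.predict(Xte))
    best = max(scores, key=scores.get)
    return best, scores
```

Selecting by F1-score rather than raw accuracy matters when phishing and legitimate URLs are imbalanced in the dataset.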

predict(url, model_name=None)

Purpose: Predicts whether a URL is phishing or legitimate

Parameters:

  • url (str): URL to classify
  • model_name (str, optional): Specific model to use (defaults to best model)

Returns: Dictionary with prediction, confidence, and model used

Logic:

  1. Extracts features from the URL
  2. Applies feature scaling if needed (for SVM/Logistic Regression)
  3. Makes prediction using specified or best model
  4. Calculates confidence score using predict_proba()
  5. Returns structured result

Example:

result = detector.predict("http://192.168.1.1/bank-login")
# Returns: {
#   'prediction': 'Phishing',
#   'confidence': 0.95,
#   'model_used': 'Random Forest',
#   'features': {...}
# }

save_model(filename) & load_model(filename)

Purpose: Persist and load trained models

Saved Components:

  • All trained models (Random Forest, SVM, Logistic Regression)
  • StandardScaler object (for consistent feature scaling)
  • Feature names list
  • Best model identifier

Logic: Uses joblib for efficient serialization of scikit-learn objects


parse_csv_file(file_path)

Purpose: Loads and preprocesses URL datasets from CSV files

Parameters:

  • file_path (str): Path to CSV file with 'URL' and 'Label' columns

Returns: Tuple of (urls, labels)

Logic:

  1. Reads CSV using pandas
  2. Validates required columns exist
  3. Removes rows with missing values
  4. Cleans URLs using clean_url()
  5. Validates URLs using is_valid_url()
  6. Converts labels to numeric (0/1) using process_label()
  7. Handles errors gracefully, skipping invalid entries

handle_feature_nan(X)

Purpose: Handles missing or invalid feature values

Logic:

  1. Detects NaN values in feature matrix
  2. Uses SimpleImputer with mean strategy to fill missing values
  3. Falls back to zero-filling if imputation fails
  4. Ensures no NaN values remain before model training
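That imputation-with-fallback logic can be sketched as (the function name is ours; SimpleImputer with strategy='mean' is standard scikit-learn):

```python
import numpy as np
from sklearn.impute import SimpleImputer

def handle_feature_nan_sketch(X):
    """Mean-impute NaNs; fall back to zero-filling if imputation fails."""
    X = np.asarray(X, dtype=float)
    if not np.isnan(X).any():
        return X                                   # fast path: nothing to fix
    try:
        return SimpleImputer(strategy='mean').fit_transform(X)
    except Exception:
        return np.nan_to_num(X, nan=0.0)           # fallback: zero-fill
```

One caveat worth noting: SimpleImputer silently drops columns that are entirely NaN, which would change the feature count, so the zero-fill fallback also guards against that.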

SpamDetector Class

preprocess_text(text)

Purpose: Cleans and normalizes text for spam detection

Logic:

  1. Converts to lowercase
  2. Removes URLs using regex
  3. Removes email addresses
  4. Removes numbers
  5. Removes punctuation
  6. Normalizes whitespace
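A minimal sketch applying those six steps in order (the regexes are plausible choices, not necessarily the project's exact patterns):

```python
import re
import string

def preprocess_text_sketch(text):
    """Apply the six cleaning steps above, in order."""
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)   # 2. strip URLs
    text = re.sub(r'\S+@\S+', ' ', text)                 # 3. strip email addresses
    text = re.sub(r'\d+', ' ', text)                     # 4. strip numbers
    text = text.translate(                               # 5. strip punctuation
        str.maketrans('', '', string.punctuation))
    return ' '.join(text.split())                        # 6. normalize whitespace

print(preprocess_text_sketch("WIN $1000 now! Visit http://spam.tk or mail win@spam.tk"))
# -> "win now visit or mail"
```

The ordering matters: URLs and email addresses must be removed before punctuation stripping, or "http://spam.tk" would degrade into ordinary-looking tokens.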

extract_features(text)

Purpose: Extracts features from text messages

Features Extracted:

  • TF-IDF features: Term frequency-inverse document frequency vectors
  • Length features: Total character count, word count
  • Stylistic features: Capital letters, exclamation marks, question marks
  • Capital ratio: Proportion of uppercase characters

Returns: Combined feature vector for classification


predict(text, model_name=None)

Purpose: Classifies text as spam or ham (legitimate)

Returns: Dictionary with prediction ('spam'/'ham'), confidence, and model used


Flask Application Functions (app.py)

/api/detect (POST)

Purpose: API endpoint for single URL phishing detection

Request Body:

{
  "url": "https://example.com",
  "model": "Random Forest"  // optional
}

Response:

{
  "url": "https://example.com",
  "prediction": "Legitimate",
  "prediction_numeric": 0,
  "confidence": 0.92,
  "model_used": "Random Forest"
}
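A sketch of how such an endpoint can be wired up in Flask. The StubDetector stands in for the loaded PhishingURLDetector so the snippet runs standalone; the real app.py presumably loads phishing_detector.pkl instead:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

class StubDetector:
    """Stand-in for the trained PhishingURLDetector (illustrative only)."""
    def predict(self, url, model_name=None):
        return {'prediction': 'Legitimate', 'confidence': 0.92,
                'model_used': model_name or 'Random Forest'}

detector = StubDetector()

@app.route('/api/detect', methods=['POST'])
def api_detect():
    data = request.get_json(silent=True) or {}
    url = data.get('url')
    if not url:
        return jsonify({'error': "missing 'url' field"}), 400
    result = detector.predict(url, model_name=data.get('model'))
    return jsonify({
        'url': url,
        'prediction': result['prediction'],
        'prediction_numeric': 1 if result['prediction'] == 'Phishing' else 0,
        'confidence': result['confidence'],
        'model_used': result['model_used'],
    })
```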

/api/batch_detect (POST)

Purpose: Batch processing of multiple URLs

Request Body:

{
  "urls": ["url1", "url2", "url3"],
  "model": "Random Forest"
}

Returns: Array of prediction results


/api/detect_spam (POST)

Purpose: Spam detection for text messages

Request Body:

{
  "text": "Congratulations! You've won $1000...",
  "model": "Random Forest"
}

💻 Installation

Prerequisites

  • Python 3.8+
  • pip package manager

Setup

  1. Clone the repository:
git clone https://github.com/yourusername/Detection-of-Phishing-URLs-using-Machine-Learning.git
cd Detection-of-Phishing-URLs-using-Machine-Learning
  2. Create a virtual environment (recommended):
python -m venv .venv
.venv\Scripts\activate  # Windows
# source .venv/bin/activate  # Linux/Mac
  3. Install dependencies:
pip install -r requirements.txt
  4. Train models (if not using pre-trained):
python phishing_detector.py
python train_spam_model.py
  5. Run the Flask application:
python app.py

The application will be available at http://localhost:5000

🚀 Usage

Web Interface

  1. Navigate to http://localhost:5000
  2. Enter a URL in the input field
  3. Click "Check URL" to get instant results
  4. View prediction, confidence score, and model used

API Usage

Python Example:

import requests

url = "http://localhost:5000/api/detect"
data = {"url": "https://suspicious-site.com"}
response = requests.post(url, json=data)
print(response.json())

cURL Example:

curl -X POST http://localhost:5000/api/detect \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}'

Batch Processing

import requests

url = "http://localhost:5000/api/batch_detect"
data = {
    "urls": [
        "https://google.com",
        "http://suspicious-paypal.tk",
        "https://github.com"
    ]
}
response = requests.post(url, json=data)
results = response.json()

🌐 API Endpoints

| Endpoint          | Method   | Description            | Request Body                    |
|-------------------|----------|------------------------|---------------------------------|
| /                 | GET      | Main web interface     | -                               |
| /api/detect       | POST     | Single URL detection   | {"url": "...", "model": "..."}  |
| /api/batch_detect | POST     | Multiple URL detection | {"urls": [...], "model": "..."} |
| /api/detect_spam  | POST     | Spam text detection    | {"text": "...", "model": "..."} |
| /api/models       | GET      | List available models  | -                               |
| /detect           | GET/POST | Form-based detection   | Form data                       |
| /batch            | GET      | Batch detection page   | -                               |

🔌 Browser Extension

Features

  • Real-time Link Scanning: Automatically checks links on web pages
  • Visual Indicators: Color-coded badges (green=safe, red=phishing, yellow=suspicious)
  • Gmail Integration: Scans links in Gmail messages
  • Click Protection: Warns users before visiting phishing sites
  • Customizable Settings: Configure API endpoint and sensitivity

Installation

  1. Load Extension in Chrome:

    • Open chrome://extensions/
    • Enable "Developer mode"
    • Click "Load unpacked"
    • Select the extension folder
  2. Configure Settings:

    • Click extension icon
    • Go to Options
    • Set API endpoint (default: http://localhost:5000)

Extension Components

  • background.js: Service worker for API communication
  • content.js: Scans and marks links on all web pages
  • gmail-content.js: Specialized Gmail integration
  • popup.js: Extension popup interface
  • options.js: Settings management

πŸ› οΈ Technologies Used

Backend

  • Flask: Web framework for API
  • scikit-learn: Machine learning library
  • pandas: Data manipulation
  • NumPy: Numerical computing
  • joblib: Model serialization

Machine Learning

  • Random Forest: Ensemble classifier
  • SVM: Support Vector Machine
  • Logistic Regression: Linear classifier
  • TF-IDF: Text vectorization for spam detection

Frontend

  • HTML/CSS/JavaScript: Web interface
  • Chrome Extension API: Browser integration

Utilities

  • tldextract: Domain extraction
  • BeautifulSoup: HTML parsing
  • urllib: URL parsing
  • Flask-CORS: Cross-origin resource sharing

📊 Model Performance

The models are evaluated using:

  • Accuracy: Overall prediction correctness
  • Precision: Ratio of true phishing detections to all phishing predictions
  • Recall: Ratio of detected phishing URLs to all actual phishing URLs
  • F1-Score: Harmonic mean balancing precision and recall
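These four metrics can be computed directly with sklearn.metrics; a toy example with made-up labels (1 = phishing, 0 = legitimate), not real evaluation results:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative labels only: 3 TP, 1 FN, 1 FP, 3 TN.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # 0.75 (6 of 8 correct)
print(precision_score(y_true, y_pred))  # 0.75 (3 of 4 flagged URLs were phishing)
print(recall_score(y_true, y_pred))     # 0.75 (3 of 4 phishing URLs were caught)
print(f1_score(y_true, y_pred))         # 0.75 (harmonic mean of the two above)
```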

Typical performance metrics:

  • Random Forest: F1-Score ~0.95+
  • SVM: F1-Score ~0.92+
  • Logistic Regression: F1-Score ~0.88+

🔄 Retraining Models

To retrain models with new data:

  1. Prepare CSV file with columns: URL, Label (good/bad)
  2. Update file path in phishing_detector.py
  3. Run training script:
python phishing_detector.py

For spam detection:

python train_spam_model.py

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

πŸ“ License

This project is created for educational purposes as part of a Cyber Security course.

πŸ‘¨β€πŸ’» Author

Shubham

πŸ™ Acknowledgments

  • Dataset sources for phishing URL training
  • scikit-learn documentation and community
  • Flask framework developers
  • Chrome Extension API documentation

Note: This system is designed for educational and research purposes. Always use multiple layers of security when protecting against phishing attacks.
