
Phishing URL Detection using Machine Learning

A comprehensive cybersecurity solution that uses machine learning algorithms to detect phishing URLs and spam messages in real-time. The project includes a Flask web application, trained ML models, and a browser extension for proactive protection against phishing attacks.

🎯 Overview

This project implements a multi-layered phishing detection system that combines:

  • Machine Learning Models: Random Forest, SVM, and Logistic Regression classifiers
  • Feature Extraction: 17+ URL-based features for comprehensive analysis
  • Web Application: Flask-based API for real-time URL scanning
  • Browser Extension: Chrome extension for proactive protection while browsing
  • Spam Detection: Text-based spam detection using TF-IDF vectorization

✨ Features

  • Real-time URL Analysis: Instant phishing detection for any URL
  • Multiple ML Models: Ensemble approach using three different algorithms
  • Batch Processing: Analyze multiple URLs simultaneously
  • Browser Integration: Chrome extension with Gmail integration
  • Spam Detection: Email and message spam classification
  • Confidence Scoring: Probability-based prediction confidence
  • Visual Warnings: In-browser alerts for suspicious links
  • Model Persistence: Pre-trained models for quick deployment

πŸ“ Project Structure

Detection-of-Phishing-URLs-using-Machine-Learning/
├── app.py                          # Flask web application
├── phishing_detector.py            # Core ML model training script
├── phishing_detector_v2.py         # Enhanced version with additional features
├── phishing_detector_using_gpu.py  # GPU-accelerated training
├── spam_detector.py                # Spam detection module
├── train_spam_model.py             # Spam model training script
├── retrain_model.py                # Model retraining utility
├── phishing_detector.pkl           # Trained phishing detection model
├── spam_detector.pkl               # Trained spam detection model
├── requirements.txt                # Python dependencies
├── extension/                      # Browser extension files
│   ├── manifest.json              # Extension configuration
│   ├── background.js              # Background service worker
│   ├── content.js                 # Content script for all pages
│   ├── gmail-content.js           # Gmail-specific integration
│   ├── popup.html                 # Extension popup interface
│   ├── popup.js                   # Popup logic
│   └── options.js                 # Extension settings
└── templates/                      # HTML templates for Flask app

🤖 Machine Learning Algorithms

1. Random Forest Classifier

  • Algorithm Type: Ensemble learning method
  • Configuration: 100 decision trees (n_estimators=100)
  • Advantages:
    • Handles non-linear relationships well
    • Resistant to overfitting
    • Provides feature importance rankings
  • Use Case: Primary model for phishing detection

2. Support Vector Machine (SVM)

  • Algorithm Type: Kernel-based classifier
  • Configuration: RBF (Radial Basis Function) kernel
  • Preprocessing: Requires feature scaling using StandardScaler
  • Advantages:
    • Effective in high-dimensional spaces
    • Memory efficient
    • Robust against outliers
  • Use Case: Alternative model for complex decision boundaries

3. Logistic Regression

  • Algorithm Type: Linear classification model
  • Configuration: Maximum 1000 iterations for convergence
  • Preprocessing: Feature scaling applied
  • Advantages:
    • Fast training and prediction
    • Provides probability estimates
    • Interpretable coefficients
  • Use Case: Baseline model and probability calibration

Model Selection Strategy

The system trains all three models and selects the best performer based on F1-Score, which balances precision and recall. This ensures optimal detection of phishing URLs while minimizing false positives.

πŸ” Feature Engineering

The system extracts 21 features from each URL, grouped into five categories:

URL Structure Features

  1. url_length: Total character count of the URL
  2. domain_length: Length of the domain name
  3. path_length: Length of the URL path
  4. query_length: Length of query parameters
  5. path_depth: Number of directory levels in path

Domain-Based Features

  6. subdomain_count: Number of subdomains
  7. has_ip: Binary flag for IP address instead of domain name
  8. digits_in_domain: Count of numeric characters in domain
  9. letters_in_domain: Count of alphabetic characters in domain

Character Analysis Features

  10. dash_count: Number of hyphens (-)
  11. dot_count: Number of dots (.)
  12. underscore_count: Number of underscores (_)
  13. question_count: Number of question marks (?)
  14. equal_count: Number of equals signs (=)
  15. and_count: Number of ampersands (&)
  16. at_count: Number of @ symbols

Security Features

  17. is_https: Binary flag for HTTPS protocol
  18. has_port: Binary flag for custom port usage
  19. has_suspicious_words: Detects keywords like 'secure', 'account', 'login', 'verify', 'bank'

Entropy Features

  20. domain_entropy: Shannon entropy of domain (measures randomness)
  21. path_entropy: Shannon entropy of path

Feature Extraction Logic

import numpy as np

def calculate_entropy(s):
    """Calculate Shannon entropy to measure randomness"""
    if not s:
        return 0
    entropy = 0
    for x in set(s):
        p_x = s.count(x) / len(s)
        entropy += -p_x * np.log2(p_x)
    return entropy

Higher entropy values indicate more random character distributions, a pattern common in auto-generated phishing domains.
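To see why entropy separates natural-looking domains from random-looking ones, compare a familiar word with a random string. The helper below recomputes the same quantity as calculate_entropy above, rewritten with Counter only so the snippet runs standalone (the name shannon_entropy is ours, not the project's):

```python
from collections import Counter
from math import log2

def shannon_entropy(s):
    """Same computation as calculate_entropy above, via Counter (standalone demo)."""
    if not s:
        return 0.0
    counts = Counter(s)
    return -sum((c / len(s)) * log2(c / len(s)) for c in counts.values())

print(round(shannon_entropy("google"), 2))    # 1.92 -- skewed distribution, word-like
print(round(shannon_entropy("xk9q2vb7"), 2))  # 3.0  -- uniform distribution, random-looking
```

Eight distinct characters used once each give exactly log2(8) = 3 bits, the maximum for a string of that length.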

🔑 Key Functions

PhishingURLDetector Class

extract_features(url)

Purpose: Extracts all features from a given URL for ML prediction

Parameters:

  • url (str): The URL to analyze

Returns: Dictionary of feature names and values

Logic:

  1. Parses URL using urlparse and tldextract
  2. Calculates structural metrics (lengths, counts)
  3. Identifies suspicious patterns
  4. Computes entropy for randomness detection
  5. Returns feature dictionary

Example:

features = detector.extract_features("https://secure-paypal-update.com/signin")
# Returns: {'url_length': 45, 'has_suspicious_words': 1, ...}
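A minimal sketch of how several of these features can be computed with the standard library (the function name and the crude IP/keyword checks below are illustrative assumptions, not the project's actual implementation):

```python
import re
from urllib.parse import urlparse

# Keywords taken from the has_suspicious_words description above.
SUSPICIOUS_WORDS = ['secure', 'account', 'login', 'verify', 'bank']

def extract_features_sketch(url):
    """Illustrative subset of the URL features described above."""
    parsed = urlparse(url)
    domain = parsed.netloc
    return {
        'url_length': len(url),
        'domain_length': len(domain),
        'path_length': len(parsed.path),
        'path_depth': parsed.path.count('/'),
        'dash_count': url.count('-'),
        'dot_count': url.count('.'),
        'at_count': url.count('@'),
        'is_https': 1 if parsed.scheme == 'https' else 0,
        # Crude IPv4 check, optionally with a port.
        'has_ip': 1 if re.fullmatch(r'\d{1,3}(\.\d{1,3}){3}(:\d+)?', domain) else 0,
        'has_suspicious_words': 1 if any(w in url.lower() for w in SUSPICIOUS_WORDS) else 0,
    }

f = extract_features_sketch("https://secure-paypal-update.com/signin")
print(f['has_suspicious_words'])  # 1 ('secure' appears in the domain)
```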

train_models(X, y)

Purpose: Trains multiple ML models and compares their performance

Parameters:

  • X (array): Feature matrix (n_samples × n_features)
  • y (array): Labels (0=legitimate, 1=phishing)

Returns: Dictionary containing model performance metrics

Logic:

  1. Splits data into 80% training, 20% testing
  2. Applies StandardScaler for SVM and Logistic Regression
  3. Trains Random Forest, SVM, and Logistic Regression
  4. Evaluates using accuracy, precision, recall, and F1-score
  5. Selects best model based on F1-score
  6. Stores all models for ensemble predictions

Metrics Calculated:

  • Accuracy: Overall correctness
  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • F1-Score: Harmonic mean of precision and recall
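The training flow described above can be sketched with scikit-learn as follows. This is a simplified stand-in, assuming the hyperparameters stated earlier (100 trees, RBF kernel, max_iter=1000); the function name and return shape are ours:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_models_sketch(X, y):
    """Train the three classifiers described above and pick the best by F1-score."""
    # 80/20 split, as described in the Logic steps.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler().fit(X_train)

    # (model, needs_scaling): scaling is applied only to SVM and Logistic Regression.
    models = {
        'Random Forest': (RandomForestClassifier(n_estimators=100, random_state=42), False),
        'SVM': (SVC(kernel='rbf', probability=True), True),
        'Logistic Regression': (LogisticRegression(max_iter=1000), True),
    }
    scores = {}
    for name, (model, needs_scaling) in models.items():
        Xtr = scaler.transform(X_train) if needs_scaling else X_train
        Xte = scaler.transform(X_test) if needs_scaling else X_test
        model.fit(Xtr, y_train)
        scores[name] = f1_score(y_test, model.predict(Xte))
    best = max(scores, key=scores.get)
    return best, scores
```

Selecting by F1-score rather than raw accuracy matters when phishing and legitimate URLs are imbalanced in the dataset.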

predict(url, model_name=None)

Purpose: Predicts whether a URL is phishing or legitimate

Parameters:

  • url (str): URL to classify
  • model_name (str, optional): Specific model to use (defaults to best model)

Returns: Dictionary with prediction, confidence, and model used

Logic:

  1. Extracts features from the URL
  2. Applies feature scaling if needed (for SVM/Logistic Regression)
  3. Makes prediction using specified or best model
  4. Calculates confidence score using predict_proba()
  5. Returns structured result

Example:

result = detector.predict("http://192.168.1.1/bank-login")
# Returns: {
#   'prediction': 'Phishing',
#   'confidence': 0.95,
#   'model_used': 'Random Forest',
#   'features': {...}
# }

save_model(filename) & load_model(filename)

Purpose: Persist and load trained models

Saved Components:

  • All trained models (Random Forest, SVM, Logistic Regression)
  • StandardScaler object (for consistent feature scaling)
  • Feature names list
  • Best model identifier

Logic: Uses joblib for efficient serialization of scikit-learn objects


parse_csv_file(file_path)

Purpose: Loads and preprocesses URL datasets from CSV files

Parameters:

  • file_path (str): Path to CSV file with 'URL' and 'Label' columns

Returns: Tuple of (urls, labels)

Logic:

  1. Reads CSV using pandas
  2. Validates required columns exist
  3. Removes rows with missing values
  4. Cleans URLs using clean_url()
  5. Validates URLs using is_valid_url()
  6. Converts labels to numeric (0/1) using process_label()
  7. Handles errors gracefully, skipping invalid entries

handle_feature_nan(X)

Purpose: Handles missing or invalid feature values

Logic:

  1. Detects NaN values in feature matrix
  2. Uses SimpleImputer with mean strategy to fill missing values
  3. Falls back to zero-filling if imputation fails
  4. Ensures no NaN values remain before model training
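That imputation-with-fallback logic can be sketched as (the function name is ours; SimpleImputer with strategy='mean' is standard scikit-learn):

```python
import numpy as np
from sklearn.impute import SimpleImputer

def handle_feature_nan_sketch(X):
    """Mean-impute NaNs; fall back to zero-filling if imputation fails."""
    X = np.asarray(X, dtype=float)
    if not np.isnan(X).any():
        return X                                   # fast path: nothing to fix
    try:
        return SimpleImputer(strategy='mean').fit_transform(X)
    except Exception:
        return np.nan_to_num(X, nan=0.0)           # fallback: zero-fill
```

One caveat worth noting: SimpleImputer silently drops columns that are entirely NaN, which would change the feature count, so the zero-fill fallback also guards against that.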

SpamDetector Class

preprocess_text(text)

Purpose: Cleans and normalizes text for spam detection

Logic:

  1. Converts to lowercase
  2. Removes URLs using regex
  3. Removes email addresses
  4. Removes numbers
  5. Removes punctuation
  6. Normalizes whitespace
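A minimal sketch applying those six steps in order (the regexes are plausible choices, not necessarily the project's exact patterns):

```python
import re
import string

def preprocess_text_sketch(text):
    """Apply the six cleaning steps above, in order."""
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)   # 2. strip URLs
    text = re.sub(r'\S+@\S+', ' ', text)                 # 3. strip email addresses
    text = re.sub(r'\d+', ' ', text)                     # 4. strip numbers
    text = text.translate(                               # 5. strip punctuation
        str.maketrans('', '', string.punctuation))
    return ' '.join(text.split())                        # 6. normalize whitespace

print(preprocess_text_sketch("WIN $1000 now! Visit http://spam.tk or mail win@spam.tk"))
# -> "win now visit or mail"
```

The ordering matters: URLs and email addresses must be removed before punctuation stripping, or "http://spam.tk" would degrade into ordinary-looking tokens.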

extract_features(text)

Purpose: Extracts features from text messages

Features Extracted:

  • TF-IDF features: Term frequency-inverse document frequency vectors
  • Length features: Total character count, word count
  • Stylistic features: Capital letters, exclamation marks, question marks
  • Capital ratio: Proportion of uppercase characters

Returns: Combined feature vector for classification


predict(text, model_name=None)

Purpose: Classifies text as spam or ham (legitimate)

Returns: Dictionary with prediction ('spam'/'ham'), confidence, and model used


Flask Application Functions (app.py)

/api/detect (POST)

Purpose: API endpoint for single URL phishing detection

Request Body:

{
  "url": "https://example.com",
  "model": "Random Forest"  // optional
}

Response:

{
  "url": "https://example.com",
  "prediction": "Legitimate",
  "prediction_numeric": 0,
  "confidence": 0.92,
  "model_used": "Random Forest"
}
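A sketch of how such an endpoint can be wired up in Flask. The StubDetector stands in for the loaded PhishingURLDetector so the snippet runs standalone; the real app.py presumably loads phishing_detector.pkl instead:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

class StubDetector:
    """Stand-in for the trained PhishingURLDetector (illustrative only)."""
    def predict(self, url, model_name=None):
        return {'prediction': 'Legitimate', 'confidence': 0.92,
                'model_used': model_name or 'Random Forest'}

detector = StubDetector()

@app.route('/api/detect', methods=['POST'])
def api_detect():
    data = request.get_json(silent=True) or {}
    url = data.get('url')
    if not url:
        return jsonify({'error': "missing 'url' field"}), 400
    result = detector.predict(url, model_name=data.get('model'))
    return jsonify({
        'url': url,
        'prediction': result['prediction'],
        'prediction_numeric': 1 if result['prediction'] == 'Phishing' else 0,
        'confidence': result['confidence'],
        'model_used': result['model_used'],
    })
```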

/api/batch_detect (POST)

Purpose: Batch processing of multiple URLs

Request Body:

{
  "urls": ["url1", "url2", "url3"],
  "model": "Random Forest"
}

Returns: Array of prediction results


/api/detect_spam (POST)

Purpose: Spam detection for text messages

Request Body:

{
  "text": "Congratulations! You've won $1000...",
  "model": "Random Forest"
}

💻 Installation

Prerequisites

  • Python 3.8+
  • pip package manager

Setup

  1. Clone the repository:
git clone https://github.com/yourusername/Detection-of-Phishing-URLs-using-Machine-Learning.git
cd Detection-of-Phishing-URLs-using-Machine-Learning
  2. Create a virtual environment (recommended):
python -m venv .venv
.venv\Scripts\activate  # Windows
# source .venv/bin/activate  # Linux/Mac
  3. Install dependencies:
pip install -r requirements.txt
  4. Train models (if not using pre-trained):
python phishing_detector.py
python train_spam_model.py
  5. Run the Flask application:
python app.py

The application will be available at http://localhost:5000

🚀 Usage

Web Interface

  1. Navigate to http://localhost:5000
  2. Enter a URL in the input field
  3. Click "Check URL" to get instant results
  4. View prediction, confidence score, and model used

API Usage

Python Example:

import requests

url = "http://localhost:5000/api/detect"
data = {"url": "https://suspicious-site.com"}
response = requests.post(url, json=data)
print(response.json())

cURL Example:

curl -X POST http://localhost:5000/api/detect \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}'

Batch Processing

import requests

url = "http://localhost:5000/api/batch_detect"
data = {
    "urls": [
        "https://google.com",
        "http://suspicious-paypal.tk",
        "https://github.com"
    ]
}
response = requests.post(url, json=data)
results = response.json()

🌐 API Endpoints

| Endpoint          | Method   | Description            | Request Body                    |
|-------------------|----------|------------------------|---------------------------------|
| /                 | GET      | Main web interface     | -                               |
| /api/detect       | POST     | Single URL detection   | {"url": "...", "model": "..."}  |
| /api/batch_detect | POST     | Multiple URL detection | {"urls": [...], "model": "..."} |
| /api/detect_spam  | POST     | Spam text detection    | {"text": "...", "model": "..."} |
| /api/models       | GET      | List available models  | -                               |
| /detect           | GET/POST | Form-based detection   | Form data                       |
| /batch            | GET      | Batch detection page   | -                               |

🔌 Browser Extension

Features

  • Real-time Link Scanning: Automatically checks links on web pages
  • Visual Indicators: Color-coded badges (green=safe, red=phishing, yellow=suspicious)
  • Gmail Integration: Scans links in Gmail messages
  • Click Protection: Warns users before visiting phishing sites
  • Customizable Settings: Configure API endpoint and sensitivity

Installation

  1. Load Extension in Chrome:

    • Open chrome://extensions/
    • Enable "Developer mode"
    • Click "Load unpacked"
    • Select the extension folder
  2. Configure Settings:

    • Click extension icon
    • Go to Options
    • Set API endpoint (default: http://localhost:5000)

Extension Components

  • background.js: Service worker for API communication
  • content.js: Scans and marks links on all web pages
  • gmail-content.js: Specialized Gmail integration
  • popup.js: Extension popup interface
  • options.js: Settings management

πŸ› οΈ Technologies Used

Backend

  • Flask: Web framework for API
  • scikit-learn: Machine learning library
  • pandas: Data manipulation
  • NumPy: Numerical computing
  • joblib: Model serialization

Machine Learning

  • Random Forest: Ensemble classifier
  • SVM: Support Vector Machine
  • Logistic Regression: Linear classifier
  • TF-IDF: Text vectorization for spam detection

Frontend

  • HTML/CSS/JavaScript: Web interface
  • Chrome Extension API: Browser integration

Utilities

  • tldextract: Domain extraction
  • BeautifulSoup: HTML parsing
  • urllib: URL parsing
  • Flask-CORS: Cross-origin resource sharing

📊 Model Performance

The models are evaluated using:

  • Accuracy: Overall prediction correctness
  • Precision: Ratio of true phishing detections to all phishing predictions
  • Recall: Ratio of detected phishing URLs to all actual phishing URLs
  • F1-Score: Harmonic mean balancing precision and recall
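These four metrics can be computed directly with sklearn.metrics; a toy example with made-up labels (1 = phishing, 0 = legitimate), not real evaluation results:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative labels only: 3 TP, 1 FN, 1 FP, 3 TN.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # 0.75 (6 of 8 correct)
print(precision_score(y_true, y_pred))  # 0.75 (3 of 4 flagged URLs were phishing)
print(recall_score(y_true, y_pred))     # 0.75 (3 of 4 phishing URLs were caught)
print(f1_score(y_true, y_pred))         # 0.75 (harmonic mean of the two above)
```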

Typical performance metrics:

  • Random Forest: F1-Score ~0.95+
  • SVM: F1-Score ~0.92+
  • Logistic Regression: F1-Score ~0.88+

🔄 Retraining Models

To retrain models with new data:

  1. Prepare CSV file with columns: URL, Label (good/bad)
  2. Update file path in phishing_detector.py
  3. Run training script:
python phishing_detector.py

For spam detection:

python train_spam_model.py

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

πŸ“ License

This project is created for educational purposes as part of a Cyber Security course.

πŸ‘¨β€πŸ’» Author

Shubham

πŸ™ Acknowledgments

  • Dataset sources for phishing URL training
  • scikit-learn documentation and community
  • Flask framework developers
  • Chrome Extension API documentation

Note: This system is designed for educational and research purposes. Always use multiple layers of security when protecting against phishing attacks.
