This project is designed for analyzing and recognizing cloned (deepfake) and real voice samples. It provides tools for data preparation, audio processing, and waveform analysis to help distinguish between real and cloned audio.
This project (Voice Clone Recognition) is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). See the LICENSE file or https://creativecommons.org/licenses/by/4.0/ for full terms. You must provide proper attribution when using or redistributing code, models, or datasets from this repository.
Suggested plain citation (replace the URL with the repository or release URL/DOI you used): Voice Clone Recognition (2025). GitHub repository: https://github.com/ibeuler/VCR
Suggested BibTeX entry:
```bibtex
@misc{Voice_Clone_Recognition2025,
  title = {Voice\_Clone\_Recognition},
  author = {Ibrahim H.I. Abushawish and Yusuf Aykut and Arda Usta},
  year = {2025},
  howpublished = {\url{https://github.com/ibeuler/VCR}},
  note = {Prepared by: Arda Usta, Yusuf Aykut, Ibrahim H.I. Abushawish; CC BY 4.0}
}
```

Replace the URL above with the exact repository or release URL/DOI you used, and retain the CC BY 4.0 attribution line in any redistributed content.
```
├── analysis.ipynb          # Jupyter notebook for waveform analysis and visualization
├── app.py                  # Flask web application for deepfake detection
├── batch_test.py           # Batch testing script for rule-based detection
├── train_ml_models.py      # ML model training script (Logistic Regression & SVM)
├── ml_detector.py          # ML-based detection module
├── hybrid_detector.py      # Hybrid detection (rule-based + ML)
├── clone_real_data.py      # Script to generate cloned versions of real recordings
├── record_sentences.py     # Script to record sentences for the dataset
├── analyze_scores.py       # Score distribution analysis
├── optimize_simple.py      # Parameter optimization script
├── requirements.txt        # Python dependencies
├── templates/              # HTML templates for web interface
│   └── index.html
├── static/                 # Static files (CSS, JS)
│   ├── css/
│   └── js/
├── models/                 # Trained ML models (created after training)
│   ├── logistic_regression.pkl
│   ├── svm.pkl
│   └── scaler.pkl
└── data/                   # Main data directory
    ├── cloned/             # Cloned (deepfake) audio samples
    │   └── [speaker_name]/ # Cloned samples organized by speaker
    └── real/               # Real audio samples
        └── [speaker_name]/ # Real samples organized by speaker
            └── meta.json   # Metadata for speaker
```
- `data/real/`: Contains real audio samples, organized by speaker. Each speaker folder may include a `meta.json` file with metadata.
- `data/cloned/`: Contains cloned (deepfake) audio samples, organized similarly by speaker.
- `data/manifests/`: Intended for manifest or metadata files describing datasets.
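For illustration, a speaker's `meta.json` might look like the sketch below. The exact fields are an assumption made for this example; check the metadata actually written by `record_sentences.py` for the authoritative schema.

```json
{
  "speaker": "speaker_name",
  "language": "en",
  "sample_rate": 22050,
  "sentences": [
    {"file": "sentence_01.wav", "text": "Example recorded sentence."}
  ]
}
```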
All dependencies are listed in `requirements.txt`. Key packages include:

- `librosa`: Audio processing and feature extraction
- `matplotlib`: Plotting and visualization
- `numpy`: Numerical operations
- `scikit-learn`: Machine learning models (Logistic Regression, SVM)
- `flask`: Web framework for the interface
- `jupyter`: For running notebooks
To install all dependencies, first create and activate a virtual environment:

```bash
python -m venv .venv310
```

```bash
# On Windows (PowerShell)
.\.venv310\Scripts\Activate.ps1
```

or

```bash
# On Unix/macOS
source ./.venv310/bin/activate
```

Then install the requirements:

```bash
pip install -r requirements.txt
```

- Prepare your data in the `data/real` and `data/cloned` directories, following the structure above.
- Use `record_sentences.py` to record real voice samples.
- Use `clone_real_data.py` to generate cloned versions of the recordings (example run below).
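For example, a typical preparation run could look like the following, assuming both scripts run with their default settings; check each script's help output for the actual options:

```bash
python record_sentences.py   # record real samples into data/real/<speaker_name>/
python clone_real_data.py    # generate cloned counterparts into data/cloned/<speaker_name>/
```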
Important: Before using ML-based detection, you need to train the models using your dataset.
```bash
python train_ml_models.py --real-dir data/real --cloned-dir data/cloned --output models/
```

This script will:

- Extract features from all real and cloned audio files
- Train Logistic Regression and SVM classifiers
- Save trained models to the `models/` directory
- Display training results and accuracy metrics
Parameters:
- `--real-dir`: Directory containing real audio samples (default: `data/real`)
- `--cloned-dir`: Directory containing cloned audio samples (default: `data/cloned`)
- `--output`: Output directory for saved models (default: `models`)
- `--test-size`: Proportion of the test set (default: 0.2)
Note: Make sure you have both real and cloned audio files in the specified directories before training.
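For orientation, the training stage follows the standard scikit-learn pattern sketched below. This is a minimal illustration rather than the script itself: the `real_features`/`cloned_features` arrays are assumed to come from the project's feature extractor, and the hyperparameters shown are assumptions.

```python
import pickle
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train(real_features, cloned_features, output_dir='models', test_size=0.2):
    # Label real samples 0 and cloned samples 1.
    X = np.vstack([real_features, cloned_features])
    y = np.concatenate([np.zeros(len(real_features)), np.ones(len(cloned_features))])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42)

    # Scale features; the same scaler must be reused at detection time.
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    models = {
        'logistic_regression': LogisticRegression(max_iter=1000),
        'svm': SVC(probability=True),
    }
    out = Path(output_dir)
    out.mkdir(exist_ok=True)
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(f'{name} test accuracy: {model.score(X_test, y_test):.3f}')
        with open(out / f'{name}.pkl', 'wb') as f:
            pickle.dump(model, f)
    with open(out / 'scaler.pkl', 'wb') as f:
        pickle.dump(scaler, f)
```

Saving the scaler alongside the models matches the `models/` layout shown in the project structure (`logistic_regression.pkl`, `svm.pkl`, `scaler.pkl`).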
The system supports three detection methods:
- Rule-Based Detection (no training required):

  ```bash
  python batch_test.py --threshold 0.34
  ```

- ML-Based Detection (requires trained models):

  ```python
  from ml_detector import detect_with_ml

  result = detect_with_ml('path/to/audio.wav', models_dir='models')
  ```

- Hybrid Detection (combines rule-based + ML):

  ```python
  from hybrid_detector import detect_hybrid

  result = detect_hybrid('path/to/audio.wav', real_dir='data/real', models_dir='models')
  ```
Start the web application:
```bash
python app.py
```

Then open your browser and navigate to http://localhost:5000.
The web interface allows you to:
- Upload audio files (WAV, MP3, FLAC, OGG, M4A)
- Choose detection method (Hybrid, Rule-Based, or ML-Based)
- View detailed detection results with scores and confidence
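If you prefer the command line, the same upload flow can be exercised with curl. The route and form-field names below are hypothetical, chosen only for illustration; check `app.py` for the actual endpoint and parameter names:

```bash
# Hypothetical endpoint and field names -- verify against app.py.
curl -X POST http://localhost:5000/detect \
  -F "audio=@path/to/audio.wav" \
  -F "method=hybrid"
```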
Run `analysis.ipynb` to visualize and compare real vs. cloned audio waveforms.
- Ensure your audio files are in a supported format (e.g., WAV).
- Update paths in scripts/notebooks as needed for your data organization.
For more details, see the comments in each script or notebook.
This section explains the detection approach and system design used in this repository.
- Overview
- Key concepts and features
- How the system works
- What we changed and results
- Settings and parameters
The core detection approach in this project identifies cloned (deepfake) voice samples using rule-based comparisons rather than relying solely on a learned ML classifier; the hybrid detector described earlier combines this rule-based score with the ML models.
Core idea:
- Extract features from real (reference) audio
- Extract same features from test audio
- Compare features and score differences
- Large differences → likely cloned; small differences → likely real
- MFCC (Mel-Frequency Cepstral Coefficients): 13 coefficients representing perceptual spectral shape.
- Delta and Delta-Delta: first and second derivatives of MFCCs to capture dynamics.
- Fourier / spectral features: spectral centroid, spectral rolloff, zero-crossing rate, spectral bandwidth.
- Statistical summaries: mean, std, skewness, kurtosis computed per feature.
- Distance metric: Euclidean distance between feature vectors.
- Threshold: decision cutoff applied to the combined score.
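A condensed sketch of this feature set, using `librosa` and `scipy`; the project's actual extractor may differ in framing details and feature ordering:

```python
import librosa
import numpy as np
from scipy.stats import kurtosis, skew

def extract_features(path, n_mfcc=13, hop_length=512, n_fft=2048):
    y, sr = librosa.load(path, sr=None)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=hop_length, n_fft=n_fft)
    feats = [
        mfcc,
        librosa.feature.delta(mfcc),            # first derivative (Delta)
        librosa.feature.delta(mfcc, order=2),   # second derivative (Delta-Delta)
        librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop_length),
        librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop_length),
        librosa.feature.zero_crossing_rate(y, hop_length=hop_length),
        librosa.feature.spectral_bandwidth(y=y, sr=sr, hop_length=hop_length),
    ]
    # Summarize each feature row over time by mean, std, skewness, and kurtosis.
    stats = []
    for f in feats:
        stats.extend([f.mean(axis=1), f.std(axis=1),
                      skew(f, axis=1), kurtosis(f, axis=1)])
    return np.concatenate(stats)
```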
- Data preparation: record real voices and generate cloned versions (see `record_sentences.py` and `clone_real_data.py`).
- Feature extraction per file: MFCCs, deltas, spectral features, and stats.
- Build a reference distribution from real samples.
- For each test file, compute feature distances and threshold deviations vs. the reference.
- Combine distance, threshold-exceed counts, and statistical measures into a hybrid score (range 0–1).
- Decide: score ≥ threshold → Fake, else → Real.
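The scoring step can be pictured with the sketch below. It is an illustrative reconstruction, not the exact formula in `batch_test.py`: the per-component definitions are assumptions, while the weights `(0.3, 0.4, 0.3)` and distance scale `10.0` match the defaults listed under settings and parameters.

```python
import numpy as np

def hybrid_score(test_vec, ref_mean, ref_std,
                 weights=(0.3, 0.4, 0.3), scale=10.0):
    # 1. Distance component: Euclidean distance to the reference mean,
    #    squashed into [0, 1] with the scale factor.
    dist = np.linalg.norm(test_vec - ref_mean)
    dist_score = min(dist / scale, 1.0)

    # 2. Threshold component: fraction of features deviating more than
    #    2 standard deviations from the reference distribution (assumed cutoff).
    z = np.abs(test_vec - ref_mean) / (ref_std + 1e-8)
    thresh_score = np.mean(z > 2.0)

    # 3. Statistical component: average z-score, squashed into [0, 1].
    stat_score = min(np.mean(z) / 3.0, 1.0)

    w_d, w_t, w_s = weights
    return w_d * dist_score + w_t * thresh_score + w_s * stat_score

# Decision rule: score >= threshold (default 0.34) -> Fake, else -> Real.
```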
- Initial threshold (0.5) produced poor cloned detection (only ~5% detected).
- After analyzing score distributions we selected an optimal threshold of 0.34.
- With threshold 0.34: Real accuracy ≈ 85% (17/20), Cloned accuracy = 100% (20/20), Overall ≈ 92.5%.
- Threshold: default 0.34 (changeable via `batch_test.py --threshold`).
- Hybrid weights: (distance, threshold, statistical) default `(0.3, 0.4, 0.3)`.
- Distance scale: default `10.0` (normalization factor).
- MFCC parameters: `n_mfcc=13`, `hop_length=512`, `n_fft=2048`.
- Features used: MFCC (13), Delta, Delta-Delta, spectral centroid, rolloff, zero-crossing rate, spectral bandwidth; each summarized by mean/std/skew/kurtosis (~200+ features total).
Single file test (Python):

```python
from batch_test import detect_deepfake

result = detect_deepfake('path/to/audio.wav', real_dir='data/real', threshold=0.34)
print('Is Fake:', result['is_fake'])
print('Score: {:.4f}'.format(result['score']))
```

Batch test (CLI):

```bash
python batch_test.py --threshold 0.34
```

Analyze/optimize parameters:

```bash
python analyze_scores.py
python optimize_simple.py
python quick_optimize.py
```

- The threshold can be tuned (e.g., 0.36–0.37) if you want higher real accuracy at the expense of cloned recall (see the sweep sketch below).
- Consider adding further features or improving feature normalization for better robustness.
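To explore that trade-off yourself, a simple sweep over candidate thresholds looks like this; gathering the per-file scores (e.g., via `detect_deepfake`) is assumed to have been done beforehand:

```python
import numpy as np

def sweep_thresholds(real_scores, cloned_scores,
                     candidates=np.arange(0.20, 0.50, 0.01)):
    # For each candidate threshold, report real accuracy (scores below the
    # threshold count as Real) and cloned recall (scores at or above count as Fake).
    for t in candidates:
        real_acc = np.mean(np.asarray(real_scores) < t)
        cloned_acc = np.mean(np.asarray(cloned_scores) >= t)
        overall = (real_acc + cloned_acc) / 2
        print(f'threshold={t:.2f}  real={real_acc:.2%}  '
              f'cloned={cloned_acc:.2%}  overall={overall:.2%}')
```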