This project is designed for analyzing and recognizing cloned (deepfake) and real voice samples. It provides tools for data preparation, audio processing, and waveform analysis to help distinguish between real and cloned audio.
This project (Voice Clone Recognition) is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). See the LICENSE file or https://creativecommons.org/licenses/by/4.0/ for full terms. You must provide proper attribution when using or redistributing code, models, or datasets from this repository.
Suggested plain citation (replace the URL with the repository or release URL/DOI you used): Voice Clone Recognition (2025). GitHub repository: https://github.com/ibeuler/VCR
Suggested BibTeX entry:
```bibtex
@misc{Voice_Clone_Recognition2025,
  title = {Voice\_Clone\_Recognition},
  author = {Ibrahim H.I. Abushawish and Yusuf Aykut and Arda Usta},
  year = {2025},
  howpublished = {\url{https://github.com/ibeuler/VCR}},
  note = {Prepared by: Arda Usta, Yusuf Aykut, Ibrahim H.I. Abushawish; CC BY 4.0}
}
```

Replace the URL above with the exact repository or release URL/DOI you used, and retain the CC BY 4.0 attribution line in any redistributed content.
```
├── analysis.ipynb          # Jupyter notebook for waveform analysis and visualization
├── app.py                  # Flask web application for deepfake detection
├── batch_test.py           # Batch testing script for rule-based detection
├── train_ml_models.py      # ML model training script (Logistic Regression & SVM)
├── ml_detector.py          # ML-based detection module
├── hybrid_detector.py      # Hybrid detection (rule-based + ML)
├── clone_real_data.py      # Script to generate cloned versions of real recordings
├── record_sentences.py     # Script to record sentences for the dataset
├── analyze_scores.py       # Score distribution analysis
├── optimize_simple.py      # Parameter optimization script
├── requirements.txt        # Python dependencies
├── templates/              # HTML templates for web interface
│   └── index.html
├── static/                 # Static files (CSS, JS)
│   ├── css/
│   └── js/
├── models/                 # Trained ML models (created after training)
│   ├── logistic_regression.pkl
│   ├── svm.pkl
│   └── scaler.pkl
└── data/                   # Main data directory
    ├── cloned/             # Cloned (deepfake) audio samples
    │   └── [speaker_name]/ # Cloned samples organized by speaker
    └── real/               # Real audio samples
        └── [speaker_name]/ # Real samples organized by speaker
            └── meta.json   # Metadata for speaker
```
- `data/real/`: Contains real audio samples, organized by speaker. Each speaker folder may include a `meta.json` file with metadata.
- `data/cloned/`: Contains cloned (deepfake) audio samples, organized similarly by speaker.
- `data/manifests/`: Intended for manifest or metadata files describing datasets.
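For illustration, a speaker's `meta.json` might look like the sketch below. The exact fields are an assumption made for this example; check the metadata actually written by `record_sentences.py` for the authoritative schema.

```json
{
  "speaker": "speaker_name",
  "language": "en",
  "sample_rate": 22050,
  "sentences": [
    {"file": "sentence_01.wav", "text": "Example recorded sentence."}
  ]
}
```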
All dependencies are listed in `requirements.txt`. Key packages include:

- `librosa`: Audio processing and feature extraction
- `matplotlib`: Plotting and visualization
- `numpy`: Numerical operations
- `scikit-learn`: Machine learning models (Logistic Regression, SVM)
- `flask`: Web framework for the interface
- `jupyter`: For running notebooks
To install all dependencies, first create and activate a virtual environment:

```bash
python -m venv .venv310
```

```bash
# On Windows (PowerShell)
.\.venv310\Scripts\Activate.ps1
```

or

```bash
# On Unix/macOS
source ./.venv310/bin/activate
```

Then install the requirements:

```bash
pip install -r requirements.txt
```

- Prepare your data in the `data/real` and `data/cloned` directories, following the structure above.
- Use `record_sentences.py` to record real voice samples.
- Use `clone_real_data.py` to generate cloned versions of the recordings (example run below).
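For example, a typical preparation run could look like the following, assuming both scripts run with their default settings; check each script's help output for the actual options:

```bash
python record_sentences.py   # record real samples into data/real/<speaker_name>/
python clone_real_data.py    # generate cloned counterparts into data/cloned/<speaker_name>/
```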
Important: Before using ML-based detection, you need to train the models using your dataset.
```bash
python train_ml_models.py --real-dir data/real --cloned-dir data/cloned --output models/
```

This script will:

- Extract features from all real and cloned audio files
- Train Logistic Regression and SVM classifiers
- Save trained models to the `models/` directory
- Display training results and accuracy metrics
Parameters:
- `--real-dir`: Directory containing real audio samples (default: `data/real`)
- `--cloned-dir`: Directory containing cloned audio samples (default: `data/cloned`)
- `--output`: Output directory for saved models (default: `models`)
- `--test-size`: Proportion of the test set (default: 0.2)
Note: Make sure you have both real and cloned audio files in the specified directories before training.
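For orientation, the training stage follows the standard scikit-learn pattern sketched below. This is a minimal illustration rather than the script itself: the `real_features`/`cloned_features` arrays are assumed to come from the project's feature extractor, and the hyperparameters shown are assumptions.

```python
import pickle
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train(real_features, cloned_features, output_dir='models', test_size=0.2):
    # Label real samples 0 and cloned samples 1.
    X = np.vstack([real_features, cloned_features])
    y = np.concatenate([np.zeros(len(real_features)), np.ones(len(cloned_features))])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42)

    # Scale features; the same scaler must be reused at detection time.
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    models = {
        'logistic_regression': LogisticRegression(max_iter=1000),
        'svm': SVC(probability=True),
    }
    out = Path(output_dir)
    out.mkdir(exist_ok=True)
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(f'{name} test accuracy: {model.score(X_test, y_test):.3f}')
        with open(out / f'{name}.pkl', 'wb') as f:
            pickle.dump(model, f)
    with open(out / 'scaler.pkl', 'wb') as f:
        pickle.dump(scaler, f)
```

Saving the scaler alongside the models matches the `models/` layout shown in the project structure (`logistic_regression.pkl`, `svm.pkl`, `scaler.pkl`).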
The system supports three detection methods:
- Rule-Based Detection (no training required):

  ```bash
  python batch_test.py --threshold 0.34
  ```

- ML-Based Detection (requires trained models):

  ```python
  from ml_detector import detect_with_ml

  result = detect_with_ml('path/to/audio.wav', models_dir='models')
  ```

- Hybrid Detection (combines rule-based + ML):

  ```python
  from hybrid_detector import detect_hybrid

  result = detect_hybrid('path/to/audio.wav', real_dir='data/real', models_dir='models')
  ```
Start the web application:
```bash
python app.py
```

Then open your browser and navigate to http://localhost:5000.
The web interface allows you to:
- Upload audio files (WAV, MP3, FLAC, OGG, M4A)
- Choose detection method (Hybrid, Rule-Based, or ML-Based)
- View detailed detection results with scores and confidence
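If you prefer the command line, the same upload flow can be exercised with curl. The route and form-field names below are hypothetical, chosen only for illustration; check `app.py` for the actual endpoint and parameter names:

```bash
# Hypothetical endpoint and field names -- verify against app.py.
curl -X POST http://localhost:5000/detect \
  -F "audio=@path/to/audio.wav" \
  -F "method=hybrid"
```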
Run `analysis.ipynb` to visualize and compare real vs. cloned audio waveforms.
- Ensure your audio files are in a supported format (e.g., WAV).
- Update paths in scripts/notebooks as needed for your data organization.
For more details, see the comments in each script or notebook.
This section explains the detection approach and system design used in this repository.
- Overview
- Key concepts and features
- How the system works
- What we changed and results
- Settings and parameters
The core detection approach in this project identifies cloned (deepfake) voice samples using rule-based comparisons rather than relying solely on a learned ML classifier; the hybrid detector described earlier combines this rule-based score with the ML models.
Core idea:
- Extract features from real (reference) audio
- Extract same features from test audio
- Compare features and score differences
- Large differences → likely cloned; small differences → likely real
- MFCC (Mel-Frequency Cepstral Coefficients): 13 coefficients representing perceptual spectral shape.
- Delta and Delta-Delta: first and second derivatives of MFCCs to capture dynamics.
- Fourier / spectral features: spectral centroid, spectral rolloff, zero-crossing rate, spectral bandwidth.
- Statistical summaries: mean, std, skewness, kurtosis computed per feature.
- Distance metric: Euclidean distance between feature vectors.
- Threshold: decision cutoff applied to the combined score.
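A condensed sketch of this feature set, using `librosa` and `scipy`; the project's actual extractor may differ in framing details and feature ordering:

```python
import librosa
import numpy as np
from scipy.stats import kurtosis, skew

def extract_features(path, n_mfcc=13, hop_length=512, n_fft=2048):
    y, sr = librosa.load(path, sr=None)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=hop_length, n_fft=n_fft)
    feats = [
        mfcc,
        librosa.feature.delta(mfcc),            # first derivative (Delta)
        librosa.feature.delta(mfcc, order=2),   # second derivative (Delta-Delta)
        librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop_length),
        librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop_length),
        librosa.feature.zero_crossing_rate(y, hop_length=hop_length),
        librosa.feature.spectral_bandwidth(y=y, sr=sr, hop_length=hop_length),
    ]
    # Summarize each feature row over time by mean, std, skewness, and kurtosis.
    stats = []
    for f in feats:
        stats.extend([f.mean(axis=1), f.std(axis=1),
                      skew(f, axis=1), kurtosis(f, axis=1)])
    return np.concatenate(stats)
```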
- Data preparation: record real voices and generate cloned versions (see `record_sentences.py` and `clone_real_data.py`).
- Feature extraction per file: MFCCs, deltas, spectral features, and stats.
- Build a reference distribution from real samples.
- For each test file, compute feature distances and threshold deviations vs. the reference.
- Combine distance, threshold-exceed counts, and statistical measures into a hybrid score (range 0–1).
- Decide: score ≥ threshold → Fake, else → Real.
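The scoring step can be pictured with the sketch below. It is an illustrative reconstruction, not the exact formula in `batch_test.py`: the per-component definitions are assumptions, while the weights `(0.3, 0.4, 0.3)` and distance scale `10.0` match the defaults listed under settings and parameters.

```python
import numpy as np

def hybrid_score(test_vec, ref_mean, ref_std,
                 weights=(0.3, 0.4, 0.3), scale=10.0):
    # 1. Distance component: Euclidean distance to the reference mean,
    #    squashed into [0, 1] with the scale factor.
    dist = np.linalg.norm(test_vec - ref_mean)
    dist_score = min(dist / scale, 1.0)

    # 2. Threshold component: fraction of features deviating more than
    #    2 standard deviations from the reference distribution (assumed cutoff).
    z = np.abs(test_vec - ref_mean) / (ref_std + 1e-8)
    thresh_score = np.mean(z > 2.0)

    # 3. Statistical component: average z-score, squashed into [0, 1].
    stat_score = min(np.mean(z) / 3.0, 1.0)

    w_d, w_t, w_s = weights
    return w_d * dist_score + w_t * thresh_score + w_s * stat_score

# Decision rule: score >= threshold (default 0.34) -> Fake, else -> Real.
```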
- Initial threshold (0.5) produced poor cloned detection (only ~5% detected).
- After analyzing score distributions we selected an optimal threshold of 0.34.
- With threshold 0.34: Real accuracy ≈ 85% (17/20), Cloned accuracy = 100% (20/20), Overall ≈ 92.5%.
- Threshold: default 0.34 (changeable via `batch_test.py --threshold`).
- Hybrid weights: (distance, threshold, statistical) default `(0.3, 0.4, 0.3)`.
- Distance scale: default `10.0` (normalization factor).
- MFCC parameters: `n_mfcc=13`, `hop_length=512`, `n_fft=2048`.
- Features used: MFCC (13), Delta, Delta-Delta, spectral centroid, rolloff, zero-crossing rate, spectral bandwidth; each summarized by mean/std/skew/kurtosis (~200+ features total).
Single file test (Python):

```python
from batch_test import detect_deepfake

result = detect_deepfake('path/to/audio.wav', real_dir='data/real', threshold=0.34)
print('Is Fake:', result['is_fake'])
print('Score: {:.4f}'.format(result['score']))
```

Batch test (CLI):

```bash
python batch_test.py --threshold 0.34
```

Analyze/optimize parameters:

```bash
python analyze_scores.py
python optimize_simple.py
python quick_optimize.py
```

- The threshold can be tuned (e.g., 0.36–0.37) if you want higher real accuracy at the expense of cloned recall (see the sweep sketch below).
- Consider adding further features or improving feature normalization for better robustness.
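To explore that trade-off yourself, a simple sweep over candidate thresholds looks like this; gathering the per-file scores (e.g., via `detect_deepfake`) is assumed to have been done beforehand:

```python
import numpy as np

def sweep_thresholds(real_scores, cloned_scores,
                     candidates=np.arange(0.20, 0.50, 0.01)):
    # For each candidate threshold, report real accuracy (scores below the
    # threshold count as Real) and cloned recall (scores at or above count as Fake).
    for t in candidates:
        real_acc = np.mean(np.asarray(real_scores) < t)
        cloned_acc = np.mean(np.asarray(cloned_scores) >= t)
        overall = (real_acc + cloned_acc) / 2
        print(f'threshold={t:.2f}  real={real_acc:.2%}  '
              f'cloned={cloned_acc:.2%}  overall={overall:.2%}')
```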