Evaluating Small-Scale Code Models for Code Clone Detection

Official repository for the paper: Evaluating Small-Scale Code Models for Code Clone Detection.

📖 Abstract

Code clone detection is a critical task for software maintenance, plagiarism detection, and refactoring. While Large Language Models (LLMs) have shown promise, their computational cost is prohibitive for many real-time or resource-constrained environments.

This work rigorously evaluates small-scale transformer-based code models (up to 220M parameters) to determine their efficacy in distinguishing clone pairs. We provide a unified evaluation framework across five benchmark datasets, offering insights into the trade-offs between model size, architecture (Encoder-only vs. Encoder-Decoder), and detection accuracy.

🚀 Key Features

Unified Framework: A single pipeline to evaluate six different models.
Diverse Benchmarks: Pre-processed loaders for BigCloneBench, CodeJam, Karnalim, POJ104, and PoolC.
Reproducibility: Docker-ready scripts to replicate the exact numbers reported in the paper.
Extensibility: Easily add new models or datasets to compare against our baselines.
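
As a rough illustration of what a single pipeline over several models and datasets could look like, the sketch below enumerates every model/dataset combination and hands each one to a scoring callable. The names `run_grid` and `evaluate_fn` are hypothetical and are not taken from this repository; only the model and dataset names come from this README.

```python
from itertools import product
from typing import Callable, Dict, Tuple

# Model and dataset names as listed in this README.
MODELS = ["CodeBERT", "GraphCodeBERT", "PLBART", "PolyCoder", "UniXCoder", "Salesforce T5"]
DATASETS = ["BigCloneBench", "CodeJam", "Karnalim", "POJ104", "PoolC"]


def run_grid(evaluate_fn: Callable[[str, str], float]) -> Dict[Tuple[str, str], float]:
    """Call evaluate_fn(model, dataset) for every combination and collect the scores.

    evaluate_fn is a hypothetical hook: in a real run it would load or fine-tune
    the model, score the dataset's test pairs, and return a single metric.
    """
    return {(m, d): evaluate_fn(m, d) for m, d in product(MODELS, DATASETS)}


if __name__ == "__main__":
    # Placeholder scorer so the sketch runs without any model weights.
    results = run_grid(lambda model, dataset: 0.0)
    print(len(results), "model/dataset combinations")
```

Keeping the scorer as a plain callable means a new model or dataset only requires a new list entry, not a change to the loop.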

📊 Models & Datasets

🧠 Code Models Evaluated

We focus on efficiency-oriented models suitable for standard GPUs:

Model	Parameters	Architecture
CodeBERT	125M	Encoder-only
GraphCodeBERT	125M	Encoder-only (Data Flow)
PLBART	140M	Encoder-Decoder
PolyCoder	160M	Decoder-only
UniXCoder	~200M	Unified (Enc-Dec)
Salesforce T5	220M	Encoder-Decoder

📂 Datasets

BigCloneBench: Validated clone pairs from real-world open-source projects (Java).
CodeJam: Google Code Jam competition submissions.
Karnalim: Academic exercise-based code pairs.
POJ104: Peking University student submissions (C++).
PoolC: Diverse clone types from open-source projects.
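
Each dataset above supplies pairs of code fragments labeled as clone or non-clone, so detection can be treated as binary classification over pairs. The sketch below assumes predictions are scored with precision, recall, and F1 on the clone class; this README does not state which metrics the paper reports, and the function name `precision_recall_f1` is illustrative.

```python
from typing import List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (clone) class.

    Labels are 1 for a clone pair and 0 for a non-clone pair.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # clones correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-clones flagged as clones
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # clones that were missed

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up labels for five pairs, for illustration only.
    print(precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```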

Prerequisites

OS: Linux (Recommended) or Windows
Hardware: CUDA-enabled GPU (8GB+ VRAM recommended)
Python: Version 3.8, 3.9, or 3.10
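
A quick way to check a machine against these prerequisites is sketched below. It assumes PyTorch is the deep learning backend, which this README does not state, and the helper names `python_version_supported` and `cuda_status` are illustrative.

```python
import sys
from typing import Tuple

# Minor versions of Python 3 listed in the prerequisites (3.8, 3.9, 3.10).
SUPPORTED_MINORS = {8, 9, 10}


def python_version_supported(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if the (major, minor) version is one of 3.8, 3.9, or 3.10."""
    major, minor = version
    return major == 3 and minor in SUPPORTED_MINORS


def cuda_status() -> str:
    """Describe the first CUDA device, assuming PyTorch is the backend in use."""
    try:
        import torch
    except ImportError:
        return "PyTorch not installed; cannot check for a CUDA GPU"
    if not torch.cuda.is_available():
        return "No CUDA-enabled GPU detected"
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return f"{torch.cuda.get_device_name(0)} with {total_gb:.1f} GB VRAM"


if __name__ == "__main__":
    print("Python version supported:", python_version_supported())
    print("GPU:", cuda_status())
```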

🖊️ Citation

If you use this code or our findings in your research, please cite the following paper.

BibTeX:

@article{martinezgil2025smallscale,
  author       = {Jorge Martinez-Gil},
  title        = {Evaluating Small-Scale Code Models for Code Clone Detection},
  journal      = {CoRR},
  volume       = {abs/2506.10995},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2506.10995},
  eprint       = {2506.10995},
  archivePrefix = {arXiv},
  primaryClass = {cs.SE}
}

📖 Research that has already cited this work

AI Assisted System for Automated Evaluation of Entity-Relationship Diagram and Schema Diagram Using Large Language Models
- Authors: R. Ramachandran, P. Vijayan, A. Anilkumar, …
- Journal: Big Data and Cognitive Computing, 2025 (MDPI)
- Abstract: Describes an automated marking system for database design exercises. An LLM compares student ER and schema diagrams with instructor references, aiming to reduce manual review time and improve scoring consistency.

Nuanced Code Clone Detection Through LLM-Based Code Revision and AST Graph Modeling
- Authors: C. Li, J. Konpang, A. Sirikham, Y. Wang
- Journal: IEEE Access, 2025
- Abstract: Focuses on detecting Type-4 code clones where behavior matches even if structure differs. The method mixes LLM-guided code rewriting with graph representations of syntax trees to better identify deep similarity across code fragments.

📜 License

This project is licensed under the MIT License.

