Official repository for the paper: Evaluating Small-Scale Code Models for Code Clone Detection.
Code clone detection is a critical task for software maintenance, plagiarism detection, and refactoring. While Large Language Models (LLMs) have shown promise, their computational cost is prohibitive for many real-time or resource-constrained environments.
This work rigorously evaluates small-scale transformer-based code models (up to 220M parameters) to determine their efficacy in distinguishing clone pairs. We provide a unified evaluation framework across five benchmark datasets, offering insights into the trade-offs between model size, architecture (encoder-only, decoder-only, and encoder-decoder), and detection accuracy.
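As a rough illustration of what "distinguishing clone pairs" can look like in practice, the sketch below scores pairs of embedding vectors with cosine similarity, thresholds the score, and computes precision, recall, and F1. It is a minimal sketch under stated assumptions, not the pipeline used in the paper: the function names, the `0.9` threshold, and the toy vectors are all hypothetical, and in a real run the vectors would come from one of the evaluated models.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_clones(pairs: Sequence[Tuple[Vector, Vector]],
                   threshold: float = 0.9) -> List[bool]:
    """Label a pair as a clone when its similarity reaches the (hypothetical) threshold."""
    return [cosine_similarity(a, b) >= threshold for a, b in pairs]


def precision_recall_f1(predicted: Sequence[bool],
                        actual: Sequence[bool]) -> Tuple[float, float, float]:
    """Binary precision, recall, and F1 with 'clone' as the positive class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy embeddings standing in for model outputs (illustrative only).
    toy_pairs = [([1.0, 0.0], [1.0, 0.0]),
                 ([1.0, 0.0], [0.0, 1.0]),
                 ([1.0, 1.0], [1.0, 0.9])]
    toy_labels = [True, False, True]
    print(precision_recall_f1(predict_clones(toy_pairs), toy_labels))
```

A fixed threshold is the simplest choice; it would normally be tuned on a validation split, or replaced by a classification head trained on the pair representation.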
- Unified Framework: A single pipeline to evaluate six different models.
- Diverse Benchmarks: Pre-processed loaders for BigCloneBench, CodeJam, Karnalim, POJ104, and PoolC.
- Reproducibility: Docker-ready scripts to replicate the exact numbers reported in the paper.
- Extensibility: Easily add new models or datasets to compare against our baselines.
We focus on efficiency-oriented models suitable for standard GPUs:
| Model | Parameters | Architecture |
|---|---|---|
| CodeBERT | 125M | Encoder-only |
| GraphCodeBERT | 125M | Encoder-only (Data Flow) |
| PLBART | 140M | Encoder-Decoder |
| PolyCoder | 160M | Decoder-only |
| UniXCoder | ~200M | Unified (Enc-Dec) |
| Salesforce T5 | 220M | Encoder-Decoder |
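To put "suitable for standard GPUs" in perspective, the following back-of-the-envelope sketch estimates the memory needed to hold each model's weights. It assumes 4 bytes per parameter (fp32) and counts weights only; activations, gradients, and optimizer state during fine-tuning add to this, so treat the figures as a lower bound. The parameter counts are the rounded values from the table above, and the helper name is hypothetical.

```python
# Rounded parameter counts taken from the table above.
MODELS = {
    "CodeBERT": 125_000_000,
    "GraphCodeBERT": 125_000_000,
    "PLBART": 140_000_000,
    "PolyCoder": 160_000_000,
    "UniXCoder": 200_000_000,  # listed as ~200M, so this is approximate
    "Salesforce T5": 220_000_000,
}


def weight_memory_mib(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory for the weights alone, in MiB (fp32 by default, 2 bytes for fp16)."""
    return num_params * bytes_per_param / (1024 ** 2)


if __name__ == "__main__":
    for name, params in MODELS.items():
        fp32 = weight_memory_mib(params)
        fp16 = weight_memory_mib(params, bytes_per_param=2)
        print(f"{name:<14} fp32: {fp32:7.1f} MiB   fp16: {fp16:7.1f} MiB")
```

Under these assumptions, even the largest listed model needs under 1 GiB for its fp32 weights, which leaves most of an 8GB card free for activations and batch data.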
- BigCloneBench: Validated clone pairs from real-world open-source projects (Java).
- CodeJam: Google Code Jam competition submissions.
- Karnalim: Academic exercise-based code pairs.
- POJ104: Peking University student submissions (C++).
- PoolC: Diverse clone types from open-source projects.
- OS: Linux (Recommended) or Windows
- Hardware: CUDA-enabled GPU (8GB+ VRAM recommended)
- Python: Version 3.8, 3.9, or 3.10
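The requirements above can be checked before running anything heavy. The sketch below is one possible pre-flight check, not a script shipped with this repository: the function names are hypothetical, and the CUDA check assumes PyTorch is the deep learning framework in use, which this README does not state explicitly.

```python
import sys
from typing import Sequence

# Minor versions of Python 3 listed in the requirements above.
SUPPORTED_MINORS = (8, 9, 10)


def python_supported(version_info: Sequence[int] = sys.version_info) -> bool:
    """True when the interpreter is Python 3.8, 3.9, or 3.10."""
    return version_info[0] == 3 and version_info[1] in SUPPORTED_MINORS


def cuda_available() -> bool:
    """True when PyTorch is installed and reports a usable CUDA device.

    Assumes PyTorch; returns False if it is not installed.
    """
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("Python version OK:", python_supported())
    print("CUDA GPU detected:", cuda_available())
```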
If you use this code or our findings in your research, please cite the following paper.
BibTeX:
@article{martinezgil2025smallscale,
author = {Jorge Martinez-Gil},
title = {Evaluating Small-Scale Code Models for Code Clone Detection},
journal = {CoRR},
volume = {abs/2506.10995},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2506.10995},
eprint = {2506.10995},
archivePrefix = {arXiv},
primaryClass = {cs.SE}
}
- Authors: R. Ramachandran, P. Vijayan, A. Anilkumar, …
- Journal: Big Data and Cognitive Computing, 2025 (MDPI)
- Abstract: Describes an automated marking system for database design exercises. An LLM compares student ER and schema diagrams with instructor references, aiming to reduce manual review time and improve scoring consistency.
- Nuanced Code Clone Detection Through LLM-Based Code Revision and AST Graph Modeling
- Authors: C. Li, J. Konpang, A. Sirikham, Y. Wang
- Journal: IEEE Access, 2025
- Abstract: Focuses on detecting Type-4 code clones where behavior matches even if structure differs. The method mixes LLM-guided code rewriting with graph representations of syntax trees to better identify deep similarity across code fragments.
This project is licensed under the MIT License.