
# Code Generation Model Evaluation Framework

## 1. Overview

This framework evaluates code generation models along **three axes** in an integrated way:

- **Coding style (LPcodedec)**
- **Code structure (StructCoder)**
- **Code semantics (CodeBLEU)**

Together these enable multi-dimensional evaluation of a program's surface form, internal logical structure, and semantic behavior, making the framework usable in both research and industrial settings.

---

## 2. Evaluation Components

### 2.1 LPcodedec (Style Analysis)

- **Source**: Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features (arXiv:2502.17749v1)
- **Role**: Quantitatively evaluates the stylistic characteristics of code
- **Key features (10 in total)**
  1. Naming consistency: function names, variable names, class names, constant names
  2. Code structure: indentation consistency, average function length, nesting depth
  3. Readability: comment ratio (the key feature), average function name length, average variable name length
- **Strengths**: Very fast to compute (over 1,000× faster than LLM-based structural analysis) and yields interpretable features

---

### 2.2 StructCoder (Structural Analysis)

- **Role**: Analyzes the structural semantics of code to capture its internal logic and organization
- **Five analysis axes**
  1. **AST structure**: node distribution, maximum depth, leaf ratio, branching factor
  2. **Data flow (DFG)**: variable definition/use ratios, dependency chains
  3. **Control flow (CFG)**: counts of if/for/while/try blocks, nested loops, cyclomatic complexity
  4. **Function call graph**: call counts, unique functions, built-in function ratio
  5. **Variable dependencies**: local/global variable ratio, scope size
- **Strengths**: Precisely reflects the execution structure of code

---

### 2.3 CodeBLEU (Semantic Analysis)

- **Role**: A fine-grained metric for comparing the behavioral semantics of code
- **Four components**
  1. n-gram matching
  2. Syntax tree (AST) matching
  3. Data flow matching
  4. Keyword (weighted n-gram) matching
- **Strengths**: Goes beyond surface similarity (BLEU) to assess structure and semantics, which better reflects execution-level correctness

---

## 3. Evaluation Procedure

1. **Load model**: load a fine-tuned code generation model
2. **Generate code**: produce N candidate programs from an instruction
3. **Compute metrics**: score each candidate on style, structure, and semantics
4. **Combine scores**: weighted sum or adaptive weights (default 1:1:1), as sketched below
5. **Best-of-N selection**: pick the highest-scoring candidate
6. **Statistical analysis**: mean, standard deviation, score diversity, style-structure correlation
7. **Save results**: write the results to JSON
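
A minimal sketch of steps 4 and 5 under the default 1:1:1 weights (the function and variable names here are illustrative, not the repository's exact API):

```python
import numpy as np

def hybrid_score(style: float, structure: float, semantics: float,
                 weights=(1.0, 1.0, 1.0)) -> float:
    """Combine the three axis scores with normalized weights (default 1:1:1)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the combined score stays in [0, 1]
    return float(w @ np.array([style, structure, semantics]))

# Best-of-N (step 5): keep the candidate with the highest combined score.
candidate_scores = [(0.82, 0.71, 0.65), (0.64, 0.88, 0.79), (0.75, 0.69, 0.90)]
combined = [hybrid_score(*c) for c in candidate_scores]
best_index = int(np.argmax(combined))
```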

---

## 4. Interpreting the Results

- **High style score**: similar formatting, naming, and comment patterns
- **High structure score**: similar control/data flow and program organization
- **High semantic score**: similar behavior and output; matching logic
- **Inter-metric correlations**: help diagnose which properties drive code quality

---

## 5. Applications

- **Research**: comparing model performance, assessing dataset quality
- **Industry**: automated code review, security pattern checks
- **Education**: grading student code, feedback systems

---

# Hybrid Code Generation Evaluation Framework

*Integrating LPcodedec Stylistic Analysis and StructCoder Structural Analysis*


## Executive Summary

This repository presents a novel evaluation framework for code generation models that combines two methodologies, LPcodedec for coding style analysis and StructCoder for structural code analysis, and complements them with CodeBLEU for semantic evaluation. The framework advances code quality assessment by integrating statistical style features, deep structural understanding, and semantic matching, providing comprehensive evaluation capabilities for modern code generation systems.


## Theoretical Foundations and Background

### 1. LPcodedec: Coding Style Feature Analysis

LPcodedec (LLM-Paraphrased Code Detection) is a groundbreaking approach introduced in "Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features" [arXiv:2502.17749v1]. This methodology addresses the critical challenge of distinguishing between human-written and LLM-generated code through sophisticated stylistic analysis.

**Core Principles of LPcodedec:**

**Statistical Foundation:** The method is built upon rigorous ANOVA (Analysis of Variance) statistical analysis that identified significant differences between human-written and LLM-generated code across multiple programming languages (C, C++, Java, Python).

**Feature Design Philosophy:** LPcodedec extracts 10 quantitative features organized into three conceptual groups:

- **Naming Consistency** (4 features): function, variable, class, and constant naming patterns
- **Code Structure** (3 features): indentation consistency, function length, nesting depth
- **Readability** (3 features): comment ratio, function name length, variable name length

**Empirical Validation:** The paper's ANOVA analysis revealed that comment ratio consistently exhibits the highest F-statistic across all programming languages, making it the most discriminative feature for detecting LLM-generated code.

**Language-Specific Insights:**

- **Python and C**: readability features (especially comment ratio) are most effective
- **C++ and Java**: code structure features dominate due to object-oriented complexity
- **Cross-language consistency**: comment ratio remains universally significant

**Performance Characteristics:** LPcodedec achieves remarkable efficiency, with a 1,343× speedup over Tree Edit Distance while maintaining superior F1 scores (87-93% across languages).

### 2. StructCoder: Structure-Aware Code Analysis

StructCoder represents a paradigm shift in code analysis by moving beyond surface-level token sequences to capture deep structural semantics. This approach is inspired by structure-aware Transformer architectures that simultaneously model Abstract Syntax Trees (ASTs) and data flow graphs.

**Fundamental Concepts:**

**Multi-Dimensional Structural Analysis:** StructCoder analyzes code through five complementary perspectives:

- **AST Structural Features**: node type distributions, tree depth, branching factors
- **Data Flow Graph Features**: variable dependencies, def-use chains, scope analysis
- **Control Flow Features**: conditional structures, loop patterns, cyclomatic complexity
- **Call Graph Features**: function invocation patterns, built-in usage analysis
- **Dependency Features**: variable scope relationships, local vs. global dependencies

**Architectural Innovation:** Unlike traditional approaches that treat code as linear text, StructCoder maintains explicit representations of program structure, enabling more accurate semantic similarity assessment.

### 3. CodeBLEU: Comprehensive Code Similarity Metric

CodeBLEU extends traditional BLEU metrics to address the unique characteristics of programming languages. It combines four complementary components:

- **N-gram Match**: surface-level token similarity
- **Weighted N-gram Match**: emphasizes programming keywords
- **AST Match**: syntactic structure comparison
- **Data Flow Match**: semantic correctness assessment

The final score is computed as: CodeBLEU = α·BLEU + β·Weighted_BLEU + γ·AST_Match + θ·DataFlow_Match
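
A minimal sketch of this combination (the 0.25 defaults follow the original CodeBLEU paper; the component values below are placeholders):

```python
def code_bleu(ngram: float, weighted_ngram: float, ast_match: float,
              dataflow_match: float, alpha: float = 0.25, beta: float = 0.25,
              gamma: float = 0.25, theta: float = 0.25) -> float:
    """Weighted sum of the four CodeBLEU components."""
    return (alpha * ngram + beta * weighted_ngram
            + gamma * ast_match + theta * dataflow_match)

# Placeholder component scores for illustration:
score = code_bleu(0.42, 0.47, 0.61, 0.55)  # -> 0.5125
```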


## System Architecture and Implementation

### Core Components Analysis

#### 1. LPcodedecAnalyzer Class

The `LPcodedecAnalyzer` class implements the complete feature-extraction pipeline based on the original paper's methodology.

**Key Implementation Details:**

```python
import re

class LPcodedecAnalyzer:
    def __init__(self):
        # Regexes classifying the naming convention of each identifier
        self.naming_patterns = {
            'camelCase': re.compile(r'^[a-z][a-zA-Z0-9]*$'),
            'PascalCase': re.compile(r'^[A-Z][a-zA-Z0-9]*$'),
            'snake_case': re.compile(r'^[a-z_][a-z0-9_]*$'),
            'UPPER_SNAKE_CASE': re.compile(r'^[A-Z_][A-Z0-9_]*$'),
        }
```

**Feature Extraction Process:**

1. **AST Parsing**: uses Python's `ast` module for robust code analysis
2. **Naming Analysis**: regex-based pattern matching for consistent style detection
3. **Structural Metrics**: statistical analysis of code organization patterns
4. **Readability Assessment**: comment density and identifier length analysis

**Critical Implementation Insight:** The code prioritizes comment ratio (feature #8) as the most important feature, aligning with the paper's empirical findings.
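
For illustration, a comment ratio feature can be computed with the standard library alone; this helper is a hypothetical sketch, not the repository's exact implementation:

```python
import tokenize
from io import StringIO

def comment_ratio(code: str) -> float:
    """Fraction of lexical tokens that are comments (illustrative sketch)."""
    tokens = list(tokenize.generate_tokens(StringIO(code).readline))
    if not tokens:
        return 0.0
    n_comments = sum(1 for tok in tokens if tok.type == tokenize.COMMENT)
    return n_comments / len(tokens)
```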

#### 2. StructCoderAnalyzer Class

The `StructCoderAnalyzer` class implements a structural analysis engine that captures multi-dimensional code characteristics.

**Architectural Highlights:**

```python
import ast
from typing import Any, Dict

def extract_structural_features(self, code: str) -> Dict[str, Any]:
    tree = ast.parse(code)  # parse the source once, reuse for all analyses
    # Five-dimensional analysis
    ast_features = self._extract_ast_structural_features(tree)
    dfg_features = self._extract_dataflow_features(tree)
    cfg_features = self._extract_control_flow_features(tree)
    call_graph_features = self._extract_call_graph_features(tree)
    dependency_features = self._extract_dependency_features(tree)
    return {**ast_features, **dfg_features, **cfg_features,
            **call_graph_features, **dependency_features}
```

**Advanced Analysis Techniques:**

- **AST Node Distribution**: 38 distinct node types with frequency analysis
- **Data Flow Tracking**: variable lifecycle and dependency-chain analysis
- **Control Flow Complexity**: cyclomatic complexity with nested-structure detection
- **Call Graph Construction**: function invocation pattern analysis
- **Scope-Aware Dependency Analysis**: multi-level variable relationship tracking
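
For intuition, two of these signals can be sketched with Python's `ast` module alone; the complexity figure below is a simplified decision-point count, not the repository's exact metric:

```python
import ast
from collections import Counter

def ast_node_distribution(code: str) -> Counter:
    """Frequency of AST node types, e.g. Counter({'Name': 5, 'If': 2, ...})."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(code)))

def cyclomatic_complexity(code: str) -> int:
    """1 + number of decision points (a simplified McCabe-style proxy)."""
    decision_nodes = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)
    return 1 + sum(isinstance(n, decision_nodes)
                   for n in ast.walk(ast.parse(code)))
```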

#### 3. LPStructHybridEvaluator Class

The hybrid evaluator is the core innovation of this framework, implementing a weighted combination of the stylistic and structural analyses.

**Weighting Strategy:**

```python
def __init__(self, style_weight: float = 0.5, structural_weight: float = 0.5):
    # Configurable weighting allows domain-specific tuning;
    # weights are normalized so they always sum to 1.
    total_weight = style_weight + structural_weight
    self.style_weight = style_weight / total_weight
    self.structural_weight = structural_weight / total_weight
```

**Multi-Level Similarity Computation:**

- **Style Similarity**: cosine similarity of the 10-dimensional LPcodedec feature vectors
- **Structural Similarity**: weighted combination of the 5 structural aspects:
  - AST similarity: 40% weight (most important)
  - Data flow similarity: 30% weight (StructCoder's core innovation)
  - Control flow similarity: 20% weight
  - Call graph similarity: 7% weight
  - Dependency similarity: 3% weight
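
A minimal sketch of the cosine-similarity step over style feature vectors (a hypothetical helper with placeholder vector values, not the repository's exact code):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two feature vectors; 0.0 for zero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Two hypothetical 10-dimensional LPcodedec style feature vectors
style_similarity = cosine_similarity(
    [0.9, 0.1, 0.8, 0.0, 0.95, 7.2, 2.0, 0.12, 11.0, 6.5],
    [0.8, 0.2, 0.7, 0.0, 0.90, 6.8, 3.0, 0.05, 10.0, 7.0],
)
```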

#### 4. ModelCodeGenerator Class

The code generation component integrates with Hugging Face Transformers for practical evaluation scenarios.

**Generation Strategy:**

- **Sampling Parameters**: `temperature=0.8` and `top_p=0.95` for balanced creativity/quality
- **Diversity Mechanisms**: `no_repeat_ngram_size=3` prevents repetitive outputs
- **Post-processing**: cleaning and formatting of the raw generated code
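
A minimal sketch of this setup with Hugging Face Transformers (the checkpoint name is a placeholder; the sampling parameters are the ones listed above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/finetuned-code-model"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Write a function that reverses a string.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.8,         # balanced creativity/quality
    top_p=0.95,
    no_repeat_ngram_size=3,  # discourage repetitive outputs
    num_return_sequences=3,  # N candidates for Best-of-N
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```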

## Evaluation Methodology and Metrics

### Best-of-N Evaluation Protocol

The framework implements a Best-of-N evaluation strategy that:

1. Generates multiple code candidates (typically N=3)
2. Evaluates each candidate against the reference using hybrid metrics
3. Selects the highest-scoring candidate
4. Provides comprehensive statistical analysis

### Statistical Analysis Framework

**Primary Metrics:**

- **Hybrid Score**: weighted combination of style and structural similarities
- **Component Scores**: individual style and structural similarity scores
- **Diversity Metrics**: score variance and improvement quantification
- **Correlation Analysis**: style-structure relationship assessment

**Advanced Analytics:**

```python
# Excerpt of the result dictionary (numpy imported as np; scores,
# style_scores, structural_scores are per-candidate arrays and
# best_result is the Best-of-N winner):
'diversity_metrics': {
    'score_diversity': np.std(scores) / np.mean(scores),
    'best_improvement': best_result['hybrid_score'] - np.mean(scores),
    'style_structural_correlation': np.corrcoef(style_scores, structural_scores)[0, 1]
}
```
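
Here `score_diversity` is the coefficient of variation of the candidate scores, so a value near zero means the N candidates scored almost identically, while `best_improvement` quantifies how much Best-of-N selection gains over the average candidate.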

## Practical Applications and Use Cases

### Academic Research Applications

1. **Model Comparison Studies**: comprehensive evaluation of different code generation architectures
2. **Feature Importance Analysis**: understanding which aspects of code quality matter most
3. **Cross-Language Evaluation**: extending LPcodedec's findings to other programming languages
4. **Human-AI Code Comparison**: quantitative analysis of human vs. AI coding patterns

### Industry Applications

1. **Code Review Automation**: multi-dimensional quality assessment for generated code
2. **Educational Technology**: automated feedback systems for programming education
3. **Software Development Tools**: IDE integration for real-time code quality assessment
4. **Documentation Generation**: quality assessment for auto-generated code documentation

### Research Extensions

1. **Multi-Language Support**: extension beyond Python to Java, C++, JavaScript, etc.
2. **Domain-Specific Evaluation**: specialized metrics for different programming domains
3. **Temporal Analysis**: tracking code quality evolution over time
4. **Interactive Evaluation**: real-time feedback during code generation

## Performance Optimization Strategies

1. **Vectorized Operations**: NumPy-based computations for efficiency
2. **Caching**: feature-vector caching for repeated evaluations
3. **Parallel Processing**: multi-threaded evaluation capabilities
4. **Memory Management**: efficient handling of large-scale evaluations

## Validation and Quality Assurance

### Statistical Validation

The framework implements rigorous statistical validation:

- **Confidence Intervals**: 95% confidence bounds for all metrics
- **Significance Testing**: statistical significance of improvements
- **Cross-Validation**: robust evaluation across multiple data splits
- **Effect Size Analysis**: practical significance assessment

### Comparative Analysis

Comprehensive comparison with existing approaches:

- **Traditional Metrics**: BLEU, ROUGE, exact match
- **Structural Metrics**: tree edit distance, AST similarity
- **Embedding-Based**: code embedding similarity analysis
- **Human Evaluation**: correlation with human judgment studies

## Limitations and Future Directions

### Current Limitations

1. **Language Specificity**: primary validation is on Python code
2. **Computational Cost**: high-dimensional analysis requires significant resources
3. **Model Dependency**: requires specific model architectures for optimal performance
4. **Feature Coverage**: limited to predefined feature sets

### Future Research Directions

1. **Deep Learning Integration**: neural-network-based feature learning
2. **Cross-Modal Analysis**: integration of code comments and documentation
3. **Dynamic Analysis**: runtime behavior evaluation
4. **Collaborative Filtering**: learning from developer preferences
5. **Multimodal Evaluation**: integration of visual code representations

## Conclusion

This comprehensive three-tier evaluation framework represents a significant advancement in automated code quality assessment. By integrating LPcodedec's empirically-validated stylistic analysis, StructCoder's sophisticated structural understanding, and CodeBLEU's semantic evaluation capabilities, the framework provides unprecedented depth and accuracy in code generation evaluation.

The adaptive weighting mechanisms and multi-strategy evaluation protocols support robust performance across diverse code characteristics and use cases. The framework's modular architecture, comprehensive error handling, and extensive statistical validation make it suitable for both research applications and practical deployment scenarios.

Key contributions include:

- **Empirical Integration**: direct incorporation of peer-reviewed research findings
- **Multi-Dimensional Analysis**: comprehensive evaluation across style, structure, and semantics
- **Adaptive Intelligence**: context-aware evaluation strategies
- **Practical Implementation**: production-ready code with extensive error handling
- **Statistical Rigor**: comprehensive validation and significance testing

This work establishes a new standard for code generation evaluation and provides a foundation for future research in automated code quality assessment.


## References and Citations

1. Park, S., Jin, H., Cha, J., & Han, Y.-S. (2025). Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features. arXiv:2502.17749.
2. Tipirneni, S., Zhu, M., & Reddy, C. K. (2022). StructCoder: Structure-Aware Transformer for Code Generation. arXiv:2205.08704.
3. Ren, S., Guo, D., Lu, S., Zhou, L., Liu, S., Tang, D., ... & Zhou, M. (2020). CodeBLEU: A Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297.
4. Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374.
5. Austin, J., et al. (2021). Program Synthesis with Large Language Models. arXiv:2108.07732.

