
# Code Generation Model Evaluation Framework

## 1. Overview

This framework evaluates code generation models along **three axes** in an integrated way:

- **Coding style (LPcodedec)**
- **Code structure (StructCoder)**
- **Code semantics (CodeBLEU)**

Together these enable multi-dimensional evaluation of a program's surface form, internal logical structure, and semantic behavior, making the framework usable in both research and industrial settings.

---

## 2. Evaluation Components

### 2.1 LPcodedec (Style Analysis)

- **Source**: Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features (arXiv:2502.17749v1)
- **Role**: Quantitatively evaluates the stylistic characteristics of code
- **Key features (10 in total)**
  1. Naming consistency: function names, variable names, class names, constant names
  2. Code structure: indentation consistency, average function length, nesting depth
  3. Readability: comment ratio (the key feature), average function name length, average variable name length
- **Strengths**: Very fast to compute (over 1,000× faster than LLM-based structural analysis) and yields interpretable features

---

### 2.2 StructCoder (Structural Analysis)

- **Role**: Analyzes the structural semantics of code to capture its internal logic and organization
- **Five analysis axes**
  1. **AST structure**: node distribution, maximum depth, leaf ratio, branching factor
  2. **Data flow (DFG)**: variable definition/use ratios, dependency chains
  3. **Control flow (CFG)**: counts of if/for/while/try blocks, nested loops, cyclomatic complexity
  4. **Function call graph**: call counts, unique functions, built-in function ratio
  5. **Variable dependencies**: local/global variable ratio, scope size
- **Strengths**: Precisely reflects the execution structure of code

---

### 2.3 CodeBLEU (Semantic Analysis)

- **Role**: A fine-grained metric for comparing the behavioral semantics of code
- **Four components**
  1. n-gram matching
  2. Syntax tree (AST) matching
  3. Data flow matching
  4. Keyword (weighted n-gram) matching
- **Strengths**: Goes beyond surface similarity (BLEU) to assess structure and semantics, which better reflects execution-level correctness

---

## 3. Evaluation Procedure

1. **Load model**: load a fine-tuned code generation model
2. **Generate code**: produce N candidate programs from an instruction
3. **Compute metrics**: score each candidate on style, structure, and semantics
4. **Combine scores**: weighted sum or adaptive weights (default 1:1:1), as sketched below
5. **Best-of-N selection**: pick the highest-scoring candidate
6. **Statistical analysis**: mean, standard deviation, score diversity, style-structure correlation
7. **Save results**: write the results to JSON
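
A minimal sketch of steps 4 and 5 under the default 1:1:1 weights (the function and variable names here are illustrative, not the repository's exact API):

```python
import numpy as np

def hybrid_score(style: float, structure: float, semantics: float,
                 weights=(1.0, 1.0, 1.0)) -> float:
    """Combine the three axis scores with normalized weights (default 1:1:1)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the combined score stays in [0, 1]
    return float(w @ np.array([style, structure, semantics]))

# Best-of-N (step 5): keep the candidate with the highest combined score.
candidate_scores = [(0.82, 0.71, 0.65), (0.64, 0.88, 0.79), (0.75, 0.69, 0.90)]
combined = [hybrid_score(*c) for c in candidate_scores]
best_index = int(np.argmax(combined))
```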

---

## 4. Interpreting the Results

- **High style score**: similar formatting, naming, and comment patterns
- **High structure score**: similar control/data flow and program organization
- **High semantic score**: similar behavior and output; matching logic
- **Inter-metric correlations**: help diagnose which properties drive code quality

---

## 5. Applications

- **Research**: comparing model performance, assessing dataset quality
- **Industry**: automated code review, security pattern checks
- **Education**: grading student code, feedback systems

---

# Hybrid Code Generation Evaluation Framework

*Integrating LPcodedec Stylistic Analysis and StructCoder Structural Analysis*


## Executive Summary

This repository presents a novel evaluation framework for code generation models that combines two methodologies, LPcodedec for coding style analysis and StructCoder for structural code analysis, and complements them with CodeBLEU for semantic evaluation. The framework advances code quality assessment by integrating statistical style features, deep structural understanding, and semantic matching, providing comprehensive evaluation capabilities for modern code generation systems.


## Theoretical Foundations and Background

### 1. LPcodedec: Coding Style Feature Analysis

LPcodedec (LLM-Paraphrased Code Detection) is a groundbreaking approach introduced in "Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features" [arXiv:2502.17749v1]. This methodology addresses the critical challenge of distinguishing between human-written and LLM-generated code through sophisticated stylistic analysis.

**Core Principles of LPcodedec:**

**Statistical Foundation:** The method is built upon rigorous ANOVA (Analysis of Variance) statistical analysis that identified significant differences between human-written and LLM-generated code across multiple programming languages (C, C++, Java, Python).

**Feature Design Philosophy:** LPcodedec extracts 10 quantitative features organized into three conceptual groups:

- **Naming Consistency** (4 features): function, variable, class, and constant naming patterns
- **Code Structure** (3 features): indentation consistency, function length, nesting depth
- **Readability** (3 features): comment ratio, function name length, variable name length

**Empirical Validation:** The paper's ANOVA analysis revealed that comment ratio consistently exhibits the highest F-statistic across all programming languages, making it the most discriminative feature for detecting LLM-generated code.

**Language-Specific Insights:**

- **Python and C**: readability features (especially comment ratio) are most effective
- **C++ and Java**: code structure features dominate due to object-oriented complexity
- **Cross-language consistency**: comment ratio remains universally significant

**Performance Characteristics:** LPcodedec achieves remarkable efficiency, with a 1,343× speedup over Tree Edit Distance while maintaining superior F1 scores (87-93% across languages).

### 2. StructCoder: Structure-Aware Code Analysis

StructCoder represents a paradigm shift in code analysis by moving beyond surface-level token sequences to capture deep structural semantics. This approach is inspired by structure-aware Transformer architectures that simultaneously model Abstract Syntax Trees (ASTs) and data flow graphs.

**Fundamental Concepts:**

**Multi-Dimensional Structural Analysis:** StructCoder analyzes code through five complementary perspectives:

- **AST Structural Features**: node type distributions, tree depth, branching factors
- **Data Flow Graph Features**: variable dependencies, def-use chains, scope analysis
- **Control Flow Features**: conditional structures, loop patterns, cyclomatic complexity
- **Call Graph Features**: function invocation patterns, built-in usage analysis
- **Dependency Features**: variable scope relationships, local vs. global dependencies

**Architectural Innovation:** Unlike traditional approaches that treat code as linear text, StructCoder maintains explicit representations of program structure, enabling more accurate semantic similarity assessment.

### 3. CodeBLEU: Comprehensive Code Similarity Metric

CodeBLEU extends traditional BLEU metrics to address the unique characteristics of programming languages. It combines four complementary components:

- **N-gram Match**: surface-level token similarity
- **Weighted N-gram Match**: emphasizes programming keywords
- **AST Match**: syntactic structure comparison
- **Data Flow Match**: semantic correctness assessment

The final score is computed as: CodeBLEU = α·BLEU + β·Weighted_BLEU + γ·AST_Match + θ·DataFlow_Match
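
A minimal sketch of this combination (the 0.25 defaults follow the original CodeBLEU paper; the component values below are placeholders):

```python
def code_bleu(ngram: float, weighted_ngram: float, ast_match: float,
              dataflow_match: float, alpha: float = 0.25, beta: float = 0.25,
              gamma: float = 0.25, theta: float = 0.25) -> float:
    """Weighted sum of the four CodeBLEU components."""
    return (alpha * ngram + beta * weighted_ngram
            + gamma * ast_match + theta * dataflow_match)

# Placeholder component scores for illustration:
score = code_bleu(0.42, 0.47, 0.61, 0.55)  # -> 0.5125
```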


## System Architecture and Implementation

### Core Components Analysis

#### 1. LPcodedecAnalyzer Class

The `LPcodedecAnalyzer` class implements the complete feature-extraction pipeline based on the original paper's methodology.

**Key Implementation Details:**

```python
import re

class LPcodedecAnalyzer:
    def __init__(self):
        # Regexes classifying the naming convention of each identifier
        self.naming_patterns = {
            'camelCase': re.compile(r'^[a-z][a-zA-Z0-9]*$'),
            'PascalCase': re.compile(r'^[A-Z][a-zA-Z0-9]*$'),
            'snake_case': re.compile(r'^[a-z_][a-z0-9_]*$'),
            'UPPER_SNAKE_CASE': re.compile(r'^[A-Z_][A-Z0-9_]*$'),
        }
```

**Feature Extraction Process:**

1. **AST Parsing**: uses Python's `ast` module for robust code analysis
2. **Naming Analysis**: regex-based pattern matching for consistent style detection
3. **Structural Metrics**: statistical analysis of code organization patterns
4. **Readability Assessment**: comment density and identifier length analysis

**Critical Implementation Insight:** The code prioritizes comment ratio (feature #8) as the most important feature, aligning with the paper's empirical findings.
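
For illustration, a comment ratio feature can be computed with the standard library alone; this helper is a hypothetical sketch, not the repository's exact implementation:

```python
import tokenize
from io import StringIO

def comment_ratio(code: str) -> float:
    """Fraction of lexical tokens that are comments (illustrative sketch)."""
    tokens = list(tokenize.generate_tokens(StringIO(code).readline))
    if not tokens:
        return 0.0
    n_comments = sum(1 for tok in tokens if tok.type == tokenize.COMMENT)
    return n_comments / len(tokens)
```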

#### 2. StructCoderAnalyzer Class

The `StructCoderAnalyzer` class implements a structural analysis engine that captures multi-dimensional code characteristics.

**Architectural Highlights:**

```python
import ast
from typing import Any, Dict

def extract_structural_features(self, code: str) -> Dict[str, Any]:
    tree = ast.parse(code)  # parse the source once, reuse for all analyses
    # Five-dimensional analysis
    ast_features = self._extract_ast_structural_features(tree)
    dfg_features = self._extract_dataflow_features(tree)
    cfg_features = self._extract_control_flow_features(tree)
    call_graph_features = self._extract_call_graph_features(tree)
    dependency_features = self._extract_dependency_features(tree)
    return {**ast_features, **dfg_features, **cfg_features,
            **call_graph_features, **dependency_features}
```

**Advanced Analysis Techniques:**

- **AST Node Distribution**: 38 distinct node types with frequency analysis
- **Data Flow Tracking**: variable lifecycle and dependency-chain analysis
- **Control Flow Complexity**: cyclomatic complexity with nested-structure detection
- **Call Graph Construction**: function invocation pattern analysis
- **Scope-Aware Dependency Analysis**: multi-level variable relationship tracking
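
For intuition, two of these signals can be sketched with Python's `ast` module alone; the complexity figure below is a simplified decision-point count, not the repository's exact metric:

```python
import ast
from collections import Counter

def ast_node_distribution(code: str) -> Counter:
    """Frequency of AST node types, e.g. Counter({'Name': 5, 'If': 2, ...})."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(code)))

def cyclomatic_complexity(code: str) -> int:
    """1 + number of decision points (a simplified McCabe-style proxy)."""
    decision_nodes = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)
    return 1 + sum(isinstance(n, decision_nodes)
                   for n in ast.walk(ast.parse(code)))
```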

#### 3. LPStructHybridEvaluator Class

The hybrid evaluator is the core innovation of this framework, implementing a weighted combination of the stylistic and structural analyses.

**Weighting Strategy:**

```python
def __init__(self, style_weight: float = 0.5, structural_weight: float = 0.5):
    # Configurable weighting allows domain-specific tuning;
    # weights are normalized so they always sum to 1.
    total_weight = style_weight + structural_weight
    self.style_weight = style_weight / total_weight
    self.structural_weight = structural_weight / total_weight
```

**Multi-Level Similarity Computation:**

- **Style Similarity**: cosine similarity of the 10-dimensional LPcodedec feature vectors
- **Structural Similarity**: weighted combination of the 5 structural aspects:
  - AST similarity: 40% weight (most important)
  - Data flow similarity: 30% weight (StructCoder's core innovation)
  - Control flow similarity: 20% weight
  - Call graph similarity: 7% weight
  - Dependency similarity: 3% weight
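
A minimal sketch of the cosine-similarity step over style feature vectors (a hypothetical helper with placeholder vector values, not the repository's exact code):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two feature vectors; 0.0 for zero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Two hypothetical 10-dimensional LPcodedec style feature vectors
style_similarity = cosine_similarity(
    [0.9, 0.1, 0.8, 0.0, 0.95, 7.2, 2.0, 0.12, 11.0, 6.5],
    [0.8, 0.2, 0.7, 0.0, 0.90, 6.8, 3.0, 0.05, 10.0, 7.0],
)
```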

#### 4. ModelCodeGenerator Class

The code generation component integrates with Hugging Face Transformers for practical evaluation scenarios.

**Generation Strategy:**

- **Sampling Parameters**: `temperature=0.8` and `top_p=0.95` for balanced creativity/quality
- **Diversity Mechanisms**: `no_repeat_ngram_size=3` prevents repetitive outputs
- **Post-processing**: cleaning and formatting of the raw generated code
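
A minimal sketch of this setup with Hugging Face Transformers (the checkpoint name is a placeholder; the sampling parameters are the ones listed above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/finetuned-code-model"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Write a function that reverses a string.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.8,         # balanced creativity/quality
    top_p=0.95,
    no_repeat_ngram_size=3,  # discourage repetitive outputs
    num_return_sequences=3,  # N candidates for Best-of-N
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```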

## Evaluation Methodology and Metrics

### Best-of-N Evaluation Protocol

The framework implements a Best-of-N evaluation strategy that:

1. Generates multiple code candidates (typically N=3)
2. Evaluates each candidate against the reference using hybrid metrics
3. Selects the highest-scoring candidate
4. Provides comprehensive statistical analysis

### Statistical Analysis Framework

**Primary Metrics:**

- **Hybrid Score**: weighted combination of style and structural similarities
- **Component Scores**: individual style and structural similarity scores
- **Diversity Metrics**: score variance and improvement quantification
- **Correlation Analysis**: style-structure relationship assessment

**Advanced Analytics:**

```python
# Excerpt of the result dictionary (numpy imported as np; scores,
# style_scores, structural_scores are per-candidate arrays and
# best_result is the Best-of-N winner):
'diversity_metrics': {
    'score_diversity': np.std(scores) / np.mean(scores),
    'best_improvement': best_result['hybrid_score'] - np.mean(scores),
    'style_structural_correlation': np.corrcoef(style_scores, structural_scores)[0, 1]
}
```
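
Here `score_diversity` is the coefficient of variation of the candidate scores, so a value near zero means the N candidates scored almost identically, while `best_improvement` quantifies how much Best-of-N selection gains over the average candidate.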

## Practical Applications and Use Cases

### Academic Research Applications

1. **Model Comparison Studies**: comprehensive evaluation of different code generation architectures
2. **Feature Importance Analysis**: understanding which aspects of code quality matter most
3. **Cross-Language Evaluation**: extending LPcodedec's findings to other programming languages
4. **Human-AI Code Comparison**: quantitative analysis of human vs. AI coding patterns

### Industry Applications

1. **Code Review Automation**: multi-dimensional quality assessment for generated code
2. **Educational Technology**: automated feedback systems for programming education
3. **Software Development Tools**: IDE integration for real-time code quality assessment
4. **Documentation Generation**: quality assessment for auto-generated code documentation

### Research Extensions

1. **Multi-Language Support**: extension beyond Python to Java, C++, JavaScript, etc.
2. **Domain-Specific Evaluation**: specialized metrics for different programming domains
3. **Temporal Analysis**: tracking code quality evolution over time
4. **Interactive Evaluation**: real-time feedback during code generation

## Performance Optimization Strategies

1. **Vectorized Operations**: NumPy-based computations for efficiency
2. **Caching**: feature-vector caching for repeated evaluations
3. **Parallel Processing**: multi-threaded evaluation capabilities
4. **Memory Management**: efficient handling of large-scale evaluations

## Validation and Quality Assurance

### Statistical Validation

The framework implements rigorous statistical validation:

- **Confidence Intervals**: 95% confidence bounds for all metrics
- **Significance Testing**: statistical significance of improvements
- **Cross-Validation**: robust evaluation across multiple data splits
- **Effect Size Analysis**: practical significance assessment

### Comparative Analysis

Comprehensive comparison with existing approaches:

- **Traditional Metrics**: BLEU, ROUGE, exact match
- **Structural Metrics**: tree edit distance, AST similarity
- **Embedding-Based**: code embedding similarity analysis
- **Human Evaluation**: correlation with human judgment studies

## Limitations and Future Directions

### Current Limitations

1. **Language Specificity**: primary validation is on Python code
2. **Computational Cost**: high-dimensional analysis requires significant resources
3. **Model Dependency**: requires specific model architectures for optimal performance
4. **Feature Coverage**: limited to predefined feature sets

### Future Research Directions

1. **Deep Learning Integration**: neural-network-based feature learning
2. **Cross-Modal Analysis**: integration of code comments and documentation
3. **Dynamic Analysis**: runtime behavior evaluation
4. **Collaborative Filtering**: learning from developer preferences
5. **Multimodal Evaluation**: integration of visual code representations

## Conclusion

This comprehensive three-tier evaluation framework represents a significant advancement in automated code quality assessment. By integrating LPcodedec's empirically-validated stylistic analysis, StructCoder's sophisticated structural understanding, and CodeBLEU's semantic evaluation capabilities, the framework provides unprecedented depth and accuracy in code generation evaluation.

The adaptive weighting mechanisms and multi-strategy evaluation protocols support robust performance across diverse code characteristics and use cases. The framework's modular architecture, comprehensive error handling, and extensive statistical validation make it suitable for both research applications and practical deployment scenarios.

Key contributions include:

- **Empirical Integration**: direct incorporation of peer-reviewed research findings
- **Multi-Dimensional Analysis**: comprehensive evaluation across style, structure, and semantics
- **Adaptive Intelligence**: context-aware evaluation strategies
- **Practical Implementation**: production-ready code with extensive error handling
- **Statistical Rigor**: comprehensive validation and significance testing

This work establishes a new standard for code generation evaluation and provides a foundation for future research in automated code quality assessment.


## References and Citations

1. Park, S., Jin, H., Cha, J., & Han, Y.-S. (2025). Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features. arXiv:2502.17749.
2. Tipirneni, S., Zhu, M., & Reddy, C. K. (2022). StructCoder: Structure-Aware Transformer for Code Generation. arXiv:2205.08704.
3. Ren, S., Guo, D., Lu, S., Zhou, L., Liu, S., Tang, D., ... & Zhou, M. (2020). CodeBLEU: A Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297.
4. Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374.
5. Austin, J., et al. (2021). Program Synthesis with Large Language Models. arXiv:2108.07732.

