Authors: Nicolò Donati*, Giuseppe Savino, Paolo Torroni
Affiliations: University of Bologna, Zanichelli Editore S.p.A.
Contact: [email protected]
License: [CC BY-NC-ND 4.0]
This repository accompanies the paper "Do LLMs Understand How to Be Judges?", which investigates the capacity of Large Language Models (LLMs) to act as evaluators for open-ended text generation tasks, such as summarization. The study introduces a structured rubric grounded in editorial best practices and evaluates whether LLMs can apply it reliably and interpretively.
Traditional automatic metrics like ROUGE and BLEU fail to capture the nuanced qualities of generated text. Human evaluations, while richer, are costly and inconsistent. This work explores whether LLMs can serve as scalable, rubric-based judges by evaluating summaries across five editorial dimensions:
- Coherence
- Consistency
- Fluency
- Relevance
- Ordering
Using a purpose-built dataset of Italian news summaries first generated by GPT-4o then revised by humans, we assess LLMs' ability to assign scores and provide rationales aligned with expert human judgments. Results show moderate alignment (Spearman’s ρ = 0.6–0.7) on some criteria, but also reveal systematic biases and rubric interpretation challenges.
.
├── code/ # Python code to run the experiments on the 24 different models using the Annotated Dataset
├── data/ # Annotated Dataset of Italian news articles and summaries, and Few-shot examples used in prompting
├── prompts/ # Jinja Prompt templates for each evaluation criterion
├── notebooks/ # Notebooks for results processing, computing metrics and visualisation
├── figures/ # Visualisations from the paper
├── annotation_guidelines/ # Bilingual rubric and annotation instructions
├── evaluation_results/ # Raw and Processed Model outputs
├── requirements.txt # Software dependencies to reproduce the environment
├── LICENCE # CC BY-NC-ND 4.0 Licence description
└── README.md # This file

Each summary is evaluated along five dimensions, each with a 1–5 scale and detailed sub-criteria:
| Criterion | Description |
|---|---|
| Coherence | Logical flow and structural organisation of the summary |
| Consistency | Factual alignment with the source text |
| Fluency | Grammaticality, spelling, and stylistic quality |
| Relevance | Inclusion of essential content and exclusion of irrelevant details |
| Ordering | Preservation of the original narrative or logical sequence |
See prompts/ or annotation_guidelines/ for full definitions in English and Italian.
- Models Evaluated: 24 LLMs (e.g., GPT-4o, LLaMA 3, Phi-4, Qwen 3, Gemma 3, DeepSeek distil)
- Evaluation Method: Few-shot prompting with rubric-based scoring (see the sketch after this list)
- Metrics: Spearman’s ρ (ranking alignment), Mean Absolute Error (score accuracy)
- Dataset: 50 summaries across 10 Italian news articles, annotated by an expert editor
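The scoring pipeline renders a rubric prompt from the Jinja templates in prompts/ and sends it to the model under evaluation. The snippet below is a minimal sketch of that pattern for one criterion; the template filename (`coherence.jinja`), the variables passed to `render()`, and the use of the OpenAI chat-completions client are illustrative assumptions, and the actual scripts in code/ita/ may differ.

```python
# Minimal sketch of rubric-based, few-shot scoring for a single criterion.
# Template name, render variables, and client choice are illustrative only.
from jinja2 import Environment, FileSystemLoader
from openai import OpenAI

article_text = "..."        # source news article (Italian)
summary_text = "..."        # summary to be judged
few_shot_examples = []      # few-shot examples loaded from data/

env = Environment(loader=FileSystemLoader("prompts"))
template = env.get_template("coherence.jinja")  # hypothetical template name

prompt = template.render(
    article=article_text,
    summary=summary_text,
    examples=few_shot_examples,
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)  # expected: a 1-5 score plus a rationale
```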
- Clone the repo: `git clone https://github.com/ZanichelliEditore/llm-summarization-evaluation.git`
- Move into the repo: `cd llm-summarization-evaluation`
- Create a virtual environment: `python3 -m venv .venv`
- Activate the virtual environment: `source .venv/bin/activate`
- Install the dependencies: `pip install -r requirements.txt`
- Run the experiments: `python3 code/ita/ita_meta_eval_qwen_3_14B.py`. There are scripts for 24 LLMs to choose from in the `code/ita` folder (e.g., GPT-4o, LLaMA 3, Phi-4, Qwen 3, Gemma 3, DeepSeek distil).
- Run the meta-evaluation: the results of the experiments are processed, and the metrics are computed, in the notebooks (a minimal sketch of the metric computation follows this list).
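The two meta-evaluation metrics can be computed as below, assuming the expert and model ratings for a criterion are available as aligned lists; the score values here are illustrative placeholders, not data from the paper.

```python
# Minimal sketch of the two meta-evaluation metrics, assuming the expert and
# model ratings for one criterion are aligned lists of 1-5 scores.
# The values below are illustrative placeholders, not data from the paper.
from scipy.stats import spearmanr

human_scores = [4, 3, 5, 2, 4]
model_scores = [5, 4, 5, 3, 4]

rho, p_value = spearmanr(human_scores, model_scores)  # ranking alignment
mae = sum(abs(h - m) for h, m in zip(human_scores, model_scores)) / len(human_scores)  # score accuracy

print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f}), MAE = {mae:.2f}")
```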
The study reveals a fundamental disconnect: while LLMs can approximate human scores in absolute terms (MAE), their ability to preserve the correct ordinal relationships between items (Spearman's ρ) remains inconsistent and model-dependent (Fig. 1), with significant implications for their reliability as evaluation tools.
- Scaling Effects (Figures 1, 2, 3):
- No consistent scaling benefit: While larger models generally show reduced Mean Absolute Error (MAE) as size increases (Fig. 2), this doesn't translate to better ranking ability as measured by Spearman's ρ (Fig. 3)
- Family-specific patterns:
- DeepSeek models show decreasing MAE but ρ fluctuates near zero across sizes (Fig. 1)
- Gemma 3 exhibits a U-shaped MAE trend with size, with the 4B model showing a negative correlation (ρ = -0.179)
- GPT "mini" variants outperform larger counterparts in ranking ability (GPT-4o-mini achieves ρ = 0.277, Fig. 3)
- Raw model scale improves scoring precision but doesn't guarantee human-like ranking ability (Fig. 1 shows poor correlation between MAE and ρ)
Figure 6: Human vs. LLM ratings for the Fluency and Coherence criteria (rating on the x-axis, count on the y-axis).
- Systematic Positive Bias (Figures 4, 5, 6; see the sketch after this list):
- Consistent over-scoring: All models systematically assign higher scores than human annotators, particularly on subjective dimensions
- Fluency: Fig. 6 shows models disproportionately assigning 4-5 ratings
- Coherence: Fig. 6 reveals similar positive skew
- Not size-dependent: The bias appears across both small and large models, suggesting it stems from shared training dynamics rather than model capacity
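A quick way to see this over-scoring pattern is to compare the rating distributions and means directly, as in the histograms of Figure 6. The sketch below is illustrative only: the score lists are made up, not taken from data/ or evaluation_results/.

```python
# Minimal sketch of the positive-bias check behind Figures 4-6: compare the
# distribution (and mean) of human vs. LLM ratings for one criterion.
# The score lists are illustrative placeholders.
from collections import Counter

human = [3, 4, 3, 2, 4, 3, 5, 4]   # expert Fluency ratings (illustrative)
llm   = [4, 5, 4, 4, 5, 4, 5, 5]   # model Fluency ratings (illustrative)

print("human counts:", dict(sorted(Counter(human).items())))
print("llm counts:  ", dict(sorted(Counter(llm).items())))
print("mean shift:  ", sum(llm) / len(llm) - sum(human) / len(human))  # > 0 means over-scoring
```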
- Model-to-Model Agreement (see the sketch after this list):
- Size matters for consensus: Small models (e.g., DeepSeek 1.5B) show negligible or negative alignment with other models (ρ ≈ 0.127)
- Strong intra-family alignment: GPT models show exceptional consensus (GPT-4.1 and GPT-4o: ρ = 0.810), Qwen 3 4B and Qwen 3 14B achieve strong alignment (ρ = 0.671)
- Cross-family patterns: Larger models show stronger cross-family alignment (Qwen 3 14B and GPT-4.1: ρ = 0.725)
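These agreement figures come from correlating the scores that different models assign to the same summaries. The sketch below shows the general pairwise computation; the model names and score vectors are illustrative placeholders, not results from the paper.

```python
# Minimal sketch of pairwise model-to-model agreement: Spearman's rho over the
# scores each model assigns to the same set of summaries (placeholder values).
from itertools import combinations
from scipy.stats import spearmanr

model_scores = {
    "gpt-4.1":   [4, 5, 3, 4, 2, 5],
    "gpt-4o":    [4, 5, 3, 4, 3, 5],
    "qwen3-14b": [3, 5, 3, 4, 2, 4],
}

for a, b in combinations(model_scores, 2):
    rho, _ = spearmanr(model_scores[a], model_scores[b])
    print(f"{a} vs {b}: rho = {rho:.3f}")
```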
- Dataset limited to the Italian news domain
- Single expert annotator per summary
- No fine-tuning applied to models
- Extend to multilingual and multi-genre datasets
- Explore instruction-tuning and calibration techniques
- Develop ensemble or multi-agent evaluation strategies
- Investigate why smaller models sometimes outperform larger ones in ranking tasks
- Evaluate rationales for quality
If you use this work, please cite:
@misc{donati2025judgingllms,
title={Do Large Language Models Understand How to Be Judges?},
author={Nicolò Donati and Giuseppe Savino and Paolo Torroni},
year={2025},
note={TODO insert acl proceedings link once available},
url={https://github.com/your-repo-url}
}

We acknowledge Zanichelli Editore for their support in enabling this research. Their provision of access to digital infrastructure and expertise significantly facilitated the curation of the dataset and the human evaluation processes. Special thanks to Dr. Isabella Nenci for her dedicated contribution to dataset annotation and for sharing her expertise. We extend our sincere appreciation to the anonymous reviewers for their insightful feedback and constructive suggestions. This work was partially supported by the project FAIR: Future Artificial Intelligence Research (European Commission NextGeneration EU programme, PNRR-M4C2-Investimento 1.3, PE00000013-"FAIR" - Spoke 8).
For questions or collaborations, please contact Nicolò Donati at [email protected].





