Authors: Nicolò Donati*, Giuseppe Savino, Paolo Torroni
Affiliations: University of Bologna, Zanichelli Editore S.p.A.
Contact: [email protected]
License: [CC BY-NC-ND 4.0]
This repository accompanies the paper "Do LLMs Understand How to Be Judges?", which investigates the capacity of Large Language Models (LLMs) to act as evaluators for open-ended text generation tasks, such as summarization. The study introduces a structured rubric grounded in editorial best practices and evaluates whether LLMs can apply it reliably and interpretively.
Traditional automatic metrics like ROUGE and BLEU fail to capture the nuanced qualities of generated text. Human evaluations, while richer, are costly and inconsistent. This work explores whether LLMs can serve as scalable, rubric-based judges by evaluating summaries across five editorial dimensions:
- Coherence
- Consistency
- Fluency
- Relevance
- Ordering
Using a purpose-built dataset of Italian news summaries first generated by GPT-4o then revised by humans, we assess LLMs' ability to assign scores and provide rationales aligned with expert human judgments. Results show moderate alignment (Spearman’s ρ = 0.6–0.7) on some criteria, but also reveal systematic biases and rubric interpretation challenges.
.
├── code/ # Python code to run the experiments on the 24 different models using the Annotated Dataset
├── data/ # Annotated Dataset of Italian news articles and summaries, and Few-shot examples used in prompting
├── prompts/ # Jinja Prompt templates for each evaluation criterion
├── notebooks/ # Notebooks for results processing, computing metrics and visualisation
├── figures/ # Visualisations from the paper
├── annotation_guidelines/ # Bilingual rubric and annotation instructions
├── evaluation_results/ # Raw and Processed Model outputs
├── requirements.txt # Software dependencies to reproduce the environment
├── LICENCE # CC BY-NC-ND 4.0 Licence description
└── README.md # This file

Each summary is evaluated along five dimensions, each with a 1–5 scale and detailed sub-criteria:
| Criterion | Description |
|---|---|
| Coherence | Logical flow and structural organisation of the summary |
| Consistency | Factual alignment with the source text |
| Fluency | Grammaticality, spelling, and stylistic quality |
| Relevance | Inclusion of essential content and exclusion of irrelevant details |
| Ordering | Preservation of the original narrative or logical sequence |
See prompts/ or annotation_guidelines/ for full definitions in English and Italian.
- Models Evaluated: 24 LLMs (e.g., GPT-4o, LLaMA 3, Phi-4, Qwen 3, Gemma 3, DeepSeek distil)
- Evaluation Method: Few-shot prompting with rubric-based scoring (see the sketch after this list)
- Metrics: Spearman’s ρ (ranking alignment), Mean Absolute Error (score accuracy)
- Dataset: 50 summaries across 10 Italian news articles, annotated by an expert editor
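The scoring pipeline renders a rubric prompt from the Jinja templates in prompts/ and sends it to the model under evaluation. The snippet below is a minimal sketch of that pattern for one criterion; the template filename (`coherence.jinja`), the variables passed to `render()`, and the use of the OpenAI chat-completions client are illustrative assumptions, and the actual scripts in code/ita/ may differ.

```python
# Minimal sketch of rubric-based, few-shot scoring for a single criterion.
# Template name, render variables, and client choice are illustrative only.
from jinja2 import Environment, FileSystemLoader
from openai import OpenAI

article_text = "..."        # source news article (Italian)
summary_text = "..."        # summary to be judged
few_shot_examples = []      # few-shot examples loaded from data/

env = Environment(loader=FileSystemLoader("prompts"))
template = env.get_template("coherence.jinja")  # hypothetical template name

prompt = template.render(
    article=article_text,
    summary=summary_text,
    examples=few_shot_examples,
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)  # expected: a 1-5 score plus a rationale
```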
- Clone the repo: `git clone https://github.com/ZanichelliEditore/llm-summarization-evaluation.git`
- Move into the repo: `cd llm-summarization-evaluation`
- Create a virtual environment: `python3 -m venv .venv`
- Activate the virtual environment: `source .venv/bin/activate`
- Install the dependencies: `pip install -r requirements.txt`
- Run the experiments: `python3 code/ita/ita_meta_eval_qwen_3_14B.py`. There are scripts for 24 LLMs to choose from in the `code/ita` folder (e.g., GPT-4o, LLaMA 3, Phi-4, Qwen 3, Gemma 3, DeepSeek distil).
- Run the meta-evaluation: the results of the experiments are processed, and the metrics are computed, in the notebooks (a minimal sketch of the metric computation follows this list).
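The two meta-evaluation metrics can be computed as below, assuming the expert and model ratings for a criterion are available as aligned lists; the score values here are illustrative placeholders, not data from the paper.

```python
# Minimal sketch of the two meta-evaluation metrics, assuming the expert and
# model ratings for one criterion are aligned lists of 1-5 scores.
# The values below are illustrative placeholders, not data from the paper.
from scipy.stats import spearmanr

human_scores = [4, 3, 5, 2, 4]
model_scores = [5, 4, 5, 3, 4]

rho, p_value = spearmanr(human_scores, model_scores)  # ranking alignment
mae = sum(abs(h - m) for h, m in zip(human_scores, model_scores)) / len(human_scores)  # score accuracy

print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f}), MAE = {mae:.2f}")
```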
The study reveals a fundamental disconnect: while LLMs can approximate human scores in absolute terms (MAE), their ability to preserve the correct ordinal relationships between items (Spearman's ρ) remains inconsistent and model-dependent (Fig. 1), with significant implications for their reliability as evaluation tools.
- Scaling Effects (Figures 1, 2, 3):
- No consistent scaling benefit: While larger models generally show reduced Mean Absolute Error (MAE) as size increases (Fig. 2), this doesn't translate to better ranking ability as measured by Spearman's ρ (Fig. 3)
- Family-specific patterns:
- DeepSeek models show decreasing MAE but ρ fluctuates near zero across sizes (Fig. 1)
- Gemma 3 exhibits a U-shaped MAE trend with size, with the 4B model showing a negative correlation (ρ = -0.179)
- GPT "mini" variants outperform larger counterparts in ranking ability (GPT-4o-mini achieves ρ = 0.277, Fig. 3)
- Raw model scale improves scoring precision but doesn't guarantee human-like ranking ability (Fig. 1 shows poor correlation between MAE and ρ)
Figure 6: Human vs. LLM ratings for the Fluency and Coherence criteria (rating on the x-axis, count on the y-axis).
- Systematic Positive Bias (Figures 4, 5, 6; see the sketch after this list):
- Consistent over-scoring: All models systematically assign higher scores than human annotators, particularly on subjective dimensions
- Fluency: Fig. 6 shows models disproportionately assigning 4-5 ratings
- Coherence: Fig. 6 reveals similar positive skew
- Not size-dependent: The bias appears across both small and large models, suggesting it stems from shared training dynamics rather than model capacity
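A quick way to see this over-scoring pattern is to compare the rating distributions and means directly, as in the histograms of Figure 6. The sketch below is illustrative only: the score lists are made up, not taken from data/ or evaluation_results/.

```python
# Minimal sketch of the positive-bias check behind Figures 4-6: compare the
# distribution (and mean) of human vs. LLM ratings for one criterion.
# The score lists are illustrative placeholders.
from collections import Counter

human = [3, 4, 3, 2, 4, 3, 5, 4]   # expert Fluency ratings (illustrative)
llm   = [4, 5, 4, 4, 5, 4, 5, 5]   # model Fluency ratings (illustrative)

print("human counts:", dict(sorted(Counter(human).items())))
print("llm counts:  ", dict(sorted(Counter(llm).items())))
print("mean shift:  ", sum(llm) / len(llm) - sum(human) / len(human))  # > 0 means over-scoring
```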
- Model-to-Model Agreement (see the sketch after this list):
- Size matters for consensus: Small models (e.g., DeepSeek 1.5B) show negligible or negative alignment with other models (ρ ≈ 0.127)
- Strong intra-family alignment: GPT models show exceptional consensus (GPT-4.1 and GPT-4o: ρ = 0.810), Qwen 3 4B and Qwen 3 14B achieve strong alignment (ρ = 0.671)
- Cross-family patterns: Larger models show stronger cross-family alignment (Qwen 3 14B and GPT-4.1: ρ = 0.725)
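These agreement figures come from correlating the scores that different models assign to the same summaries. The sketch below shows the general pairwise computation; the model names and score vectors are illustrative placeholders, not results from the paper.

```python
# Minimal sketch of pairwise model-to-model agreement: Spearman's rho over the
# scores each model assigns to the same set of summaries (placeholder values).
from itertools import combinations
from scipy.stats import spearmanr

model_scores = {
    "gpt-4.1":   [4, 5, 3, 4, 2, 5],
    "gpt-4o":    [4, 5, 3, 4, 3, 5],
    "qwen3-14b": [3, 5, 3, 4, 2, 4],
}

for a, b in combinations(model_scores, 2):
    rho, _ = spearmanr(model_scores[a], model_scores[b])
    print(f"{a} vs {b}: rho = {rho:.3f}")
```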
- Dataset limited to the Italian news domain
- Single expert annotator per summary
- No fine-tuning applied to models
- Extend to multilingual and multi-genre datasets
- Explore instruction-tuning and calibration techniques
- Develop ensemble or multi-agent evaluation strategies
- Investigate why smaller models sometimes outperform larger ones in ranking tasks
- Evaluate rationales for quality
If you use this work, please cite:
@misc{donati2025judgingllms,
title={Do Large Language Models Understand How to Be Judges?},
author={Nicolò Donati and Giuseppe Savino and Paolo Torroni},
year={2025},
note={TODO insert acl proceedings link once available},
url={https://github.com/your-repo-url}
}

We acknowledge Zanichelli Editore for their support in enabling this research. Their provision of access to digital infrastructure and expertise significantly facilitated the curation of the dataset and the human evaluation processes. Special thanks to Dr. Isabella Nenci for her dedicated contribution to dataset annotation and for sharing her expertise. We extend our sincere appreciation to the anonymous reviewers for their insightful feedback and constructive suggestions. This work was partially supported by the project FAIR: Future Artificial Intelligence Research (European Commission NextGeneration EU programme, PNRR-M4C2-Investimento 1.3, PE00000013-"FAIR" - Spoke 8).
For questions or collaborations, please contact Nicolò Donati at [email protected].





