Collaborative Science in Action: Pan-European ELLIS Collaboration Leads to Influential NLP Publication

A collaborative study led by numerous ELLIS Members has culminated in a highly influential paper on the use of Large Language Models (LLMs) in evaluation tasks. Originating from a 2024 ELLIS research program workshop, this joint effort by 20 researchers from 11 European institutions showcases the power of pan-European initiatives in AI research; the resulting paper has been accepted at ACL 2025, one of the premier conferences in Natural Language Processing.
Scientific collaboration across European borders is essential to driving forward cutting-edge open and trustworthy research in artificial intelligence, enabling Europe to stand as a united and stronger player on the global stage. The ELLIS network provides a platform for excellence recognition and collaboration that makes this possible. A striking recent example of this effort is the joint publication "LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks", authored by 20 researchers, 12 of whom are affiliated with the ELLIS (European Laboratory for Learning and Intelligent Systems) network. The paper, which has already attracted substantial academic attention, was accepted to the proceedings of ACL 2025, one of the leading international conferences in the field of Natural Language Processing (NLP).
About the Research
The study examines whether Large Language Models (LLMs) can reliably replace human judges in evaluating NLP outputs. To do this, the authors introduce JUDGE-BENCH, a large and diverse benchmark of 20 human-annotated evaluation datasets spanning a wide range of tasks, domains, and quality dimensions. They systematically assess 11 LLMs—both open and proprietary—by comparing their evaluation judgments against those of human annotators.

Members of the author team presenting their research at ACL 2025 in Vienna.
Unlike earlier studies that typically focused on a few tasks or assumed LLM reliability, this work shows that alignment with human judgments varies substantially with the model, the task, and the type of annotation. On some tasks, such as instruction following and the generation of mathematical reasoning traces, models can serve as reliable evaluators. Overall, however, agreement with human judgments differs widely across datasets, evaluated properties, and data sources, and also depends on the level of expertise of the human judges. The authors therefore recommend careful, task-specific validation and calibration of LLMs against human judgments before deploying them as stand-ins for human evaluators.
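The underlying protocol is straightforward to illustrate. The sketch below shows, under simplifying assumptions, how agreement between an LLM judge and human annotators might be quantified on a single dataset; the toy dataset and the placeholder query_llm_judge function are hypothetical stand-ins for a real model call, and this is not the authors' released code (JUDGE-BENCH spans 20 datasets and 11 models).

```python
"""Illustrative sketch (not the authors' code): measuring how well an
LLM-as-judge agrees with human annotators on one evaluation dataset."""
import random

from scipy.stats import spearmanr               # rank correlation for graded scores
from sklearn.metrics import cohen_kappa_score   # chance-corrected categorical agreement


def query_llm_judge(output: str) -> int:
    """Stand-in for prompting an LLM to rate an output on a 1-5 scale.
    Replace with a real model or API call; here it returns a random score."""
    return random.randint(1, 5)


# Hypothetical dataset: each instance pairs a system output with a human judgment.
dataset = [
    {"output": "Answer A ...", "human_score": 4},
    {"output": "Answer B ...", "human_score": 2},
    {"output": "Answer C ...", "human_score": 5},
    {"output": "Answer D ...", "human_score": 1},
]

human_scores = [ex["human_score"] for ex in dataset]
llm_scores = [query_llm_judge(ex["output"]) for ex in dataset]

# Graded judgments: Spearman correlation between model and human scores.
rho, _ = spearmanr(human_scores, llm_scores)

# Categorical judgments: Cohen's kappa between model and human labels.
kappa = cohen_kappa_score(human_scores, llm_scores)

print(f"Spearman rho = {rho:.2f}, Cohen's kappa = {kappa:.2f}")
```

In the study, how high such agreement scores turn out to be, and therefore whether a given model is safe to use as an evaluator, depends heavily on the dataset and the property being judged, which is precisely the finding described above.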
The paper is a joint effort by researchers from 11 institutions across Europe, connecting expertise from multiple ELLIS Units. The authors are:
Anna Bavaresco - ELLIS PhD Student (University of Amsterdam)
Raffaella Bernardi - ELLIS Unit Trento & ELLIS Fellow (University of Trento)
Leonardo Bertolazzi (University of Trento)
Desmond Elliott - ELLIS Unit Copenhagen & ELLIS Member (University of Copenhagen)
Raquel Fernández - ELLIS Unit Amsterdam & ELLIS Fellow (University of Amsterdam)
Albert Gatt (Utrecht University)
Esam Ghaleb (Max Planck Institute for Psycholinguistics)
Mario Giulianelli - ELLIS Member (ETH Zurich)
Michael Hanna - ELLIS PhD Student (University of Amsterdam)
Alexander Koller - ELLIS Unit Saarbrücken & ELLIS Member (Saarland University)
André F. T. Martins - ELLIS Unit Lisbon & ELLIS Fellow (Universidade de Lisboa & Unbabel)
Philipp Mondorf (LMU Munich & MCML)
Vera Neplenbroek (University of Amsterdam)
Barbara Plank - ELLIS Unit Munich & ELLIS Fellow (LMU Munich & MCML)
Sandro Pezzelle - ELLIS Unit Amsterdam & ELLIS Member (University of Amsterdam)
David Schlangen - ELLIS Member (University of Potsdam)
Alessandro Suglia - ELLIS Member (Heriot-Watt University)
Aditya K. Surikuchi (University of Amsterdam)
Ece Takmaz (Utrecht University)
Alberto Testoni (Amsterdam UMC)
Origins at an ELLIS Research Program Workshop
The first ideas and discussions behind this collaborative research emerged during a workshop held in March 2024 at the Oberwolfach Research Institute for Mathematics (MFO) in Germany’s Black Forest. Organised as part of the ELLIS NLP Program, the event was led by ELLIS Unit Amsterdam members Prof. Dr. Raquel Fernández (ELLIS Fellow) and Dr. Sandro Pezzelle (ELLIS Member). The workshop focused on future directions in open LLMs and multimodal language technologies.
The ELLIS workshop provided a focused environment for scientific exchange, where intensive discussions and collaborative brainstorming laid the groundwork for the research that followed. An idea sketched out over just a few days quickly evolved into a large-scale empirical study.
A Model for European Research Collaboration
The paper is rapidly accumulating citations and is emerging as a reference point in the ongoing discourse on the role of LLMs in NLP evaluation. This early impact reflects both the topical relevance of the research and the rigour of its empirical methodology.
The success of this project stands as a testament to the effectiveness of the ELLIS network in nurturing high-impact, interdisciplinary, and international collaborations. As Sandro Pezzelle noted:
"It’s exciting to see a research idea that took shape in just a few days, during a workshop in the woods of Germany's Black Forest, grow into a paper that’s already making an impact, published at one of the most prestigious venues in Natural Language Processing. This highlights the importance of fostering scientific collaboration and networking among researchers across Europe, something ELLIS actively supports and strengthens."
A Strong Case for the ELLIS Network’s Collaborative Model
This case study exemplifies how ELLIS, as a strategic academic network with focused scientific gatherings, can drive transformative research. As AI and NLP continue to advance at a rapid pace, the network’s commitment to collaboration, excellence, and pan-European integration will remain key to keeping Europe at the forefront of open and trustworthy AI research.