RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation
Abstract
A heterogeneous ensemble of seven large language models with dual prompting strategies, combined via judge-based candidate selection, achieved top performance in the SemEval-2026 MTRAGEval task and demonstrated the importance of model diversity.
We present our winning system for Task B (generation with reference passages) in SemEval-2026 Task 8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno-Lite-0.1, a 7B domain-adapted model with a strong cost-performance trade-off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval
Community
Why this paper is worth your time
It’s a practical masterclass in squeezing top-tier RAG performance from a diverse ensemble without betting everything on a single giant. An ensemble of 7 LLMs — from a 7B specialist to frontier systems like Gemini-3-Pro-Preview and Claude 4.5 Haiku — orchestrated by GPT-4o-mini as a lightweight faithfulness judge, won Task B of SemEval-2026 Task 8 Multi-Turn RAG Eval, outperforming even much larger standalone models. The secret? Heterogeneity in model families, scales, and prompts, not raw size alone.
What you can steal
- Meno-Lite-0.1 — a tiny 7B specialist for retrieval‑augmented generation that can rival far larger alternatives in many practical scenarios.
- The ensemble blueprint: how to mix weak and strong models with a cheap judge to surpass a single giant (a sketch follows this list).
- In‑context learning delivers: few-shot prompting consistently improved the tested large models (GLM-4.6, Llama-70B), especially on edge cases, and proved more effective than iterative prompt refinement without concrete examples.
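A minimal sketch of that blueprint in Python, assuming a generic `Model = Callable[[str], str]` interface in place of real API clients; the prompt templates, few-shot example, and fallback logic are illustrative assumptions, not the team's actual prompts.

```python
from typing import Callable, List

# Hypothetical interface: a "model" is any function mapping a prompt to a response.
Model = Callable[[str], str]

# Two prompting variants per model (zero-shot and few-shot), as in the paper's setup;
# the wording below is illustrative, not the team's actual templates.
ZERO_SHOT = (
    "Answer the user's last turn using ONLY the passages below.\n\n"
    "Passages:\n{passages}\n\nConversation:\n{dialogue}\nAnswer:"
)
FEW_SHOT = (
    "Example -- Passages: The Nile is about 6,650 km long.\n"
    "Q: How long is the Nile? A: About 6,650 km, according to the passage.\n\n"
    "Now answer the user's last turn using ONLY the passages below.\n\n"
    "Passages:\n{passages}\n\nConversation:\n{dialogue}\nAnswer:"
)

def generate_candidates(models: List[Model], passages: str, dialogue: str) -> List[str]:
    """Every model answers under both variants: 7 models x 2 prompts = 14 candidates."""
    return [
        model(template.format(passages=passages, dialogue=dialogue))
        for model in models
        for template in (ZERO_SHOT, FEW_SHOT)
    ]

def judge_select(judge: Model, passages: str, dialogue: str, candidates: List[str]) -> str:
    """A cheap judge (GPT-4o-mini in the paper) picks the candidate most faithful
    to the retrieved passages; a malformed verdict falls back to the first candidate."""
    listing = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    verdict = judge(
        "Reply with only the index of the candidate most faithful to the passages.\n\n"
        f"Passages:\n{passages}\n\nConversation:\n{dialogue}\n\nCandidates:\n{listing}\nIndex:"
    )
    try:
        return candidates[int(verdict.strip())]
    except (ValueError, IndexError):
        return candidates[0]
```

The point of per-instance selection is that a small specialist can win individual turns even when a frontier model is stronger on average, which is how heterogeneity ends up beating sheer scale.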
The most interesting bit here is the judge-guided per-instance selection from a diverse pool of seven grounded LLMs. I wonder how the lightweight GPT-4o-mini judge handles turns where two candidates commit to different but equally faithful readings of ambiguous passages, and whether faithfulness can be maintained when no candidate cleanly aligns with the retrieved evidence. This gating approach seems powerful, but it could be brittle if the judge overfits to the prompts or model IDs in the pool; an ablation where the judge also sees a passage-local reliability score might help.

The arxivlens breakdown helped me parse the method details (also worth checking here: https://arxivlens.com/PaperView/Details/raguteam-at-semeval-2026-task-8-meno-and-friends-in-a-judge-orchestrated-llm-ensemble-for-faithful-multi-turn-response-generation-8162-a20ac434), and it's nice to see the claim that diversity across model families and prompts beats sheer scale. Would adding a tiny cross-check between judge verdicts and local evidence signals help reduce edge-case hallucinations, especially on unanswerable turns?
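One cheap way to realise that cross-check: score each candidate by lexical overlap with the retrieved passages and abstain when even the best-supported candidate scores low. A rough sketch, where the tokenisation and the 0.2 threshold are illustrative assumptions rather than anything from the paper:

```python
import re

def _content_tokens(text: str) -> set:
    """Lowercased alphanumeric tokens; a crude stand-in for a real grounding signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def support_score(candidate: str, passages: str) -> float:
    """Fraction of a candidate's tokens that also appear in the retrieved passages."""
    cand = _content_tokens(candidate)
    return len(cand & _content_tokens(passages)) / max(len(cand), 1)

def cross_check(judge_pick: str, candidates: list, passages: str,
                threshold: float = 0.2) -> str:
    """Veto the judge when its pick is poorly grounded: prefer the best-supported
    candidate, or abstain entirely if nothing in the pool clears the threshold."""
    if support_score(judge_pick, passages) >= threshold:
        return judge_pick
    best = max(candidates, key=lambda c: support_score(c, passages))
    if support_score(best, passages) >= threshold:
        return best
    return "The retrieved passages do not answer this question."
```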
This is a very interesting paper on building smart, computationally efficient RAG.
We've tested your Meno model on a laptop; it runs very fast.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations (2026)
- Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents (2026)
- H-RAG at SemEval-2026 Task 8: Hierarchical Parent-Child Retrieval for Multi-Turn RAG Conversations (2026)
- S2G-RAG: Structured Sufficiency and Gap Judging for Iterative Retrieval-Augmented QA (2026)
- EviMem: Evidence-Gap-Driven Iterative Retrieval for Long-Term Conversational Memory (2026)
- Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering (2026)
- OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs (2026)