RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation
Abstract
A heterogeneous ensemble of seven large language models with dual prompting strategies, combined via judge-based candidate selection, achieved top performance in the SemEval-2026 MTRAGEval task and demonstrated the importance of model diversity.
We present our winning system for Task B (generation with reference passages) in SemEval-2026 Task 8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno-Lite-0.1, a 7B domain-adapted model with a strong cost-performance trade-off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval
Community
Why this paper is worth your time
It’s a practical masterclass in squeezing top-tier RAG performance from a diverse ensemble without betting everything on a single giant. An ensemble of 7 LLMs — from a 7B specialist to frontier systems like Gemini-3-Pro-Preview and Claude 4.5 Haiku — orchestrated by GPT-4o-mini as a lightweight faithfulness judge, won Task B of SemEval-2026 Task 8 Multi-Turn RAG Eval, outperforming even much larger standalone models. The secret? Heterogeneity in model families, scales, and prompts, not raw size alone.
What you can steal
- Meno-Lite-0.1 — a tiny 7B specialist for retrieval‑augmented generation that can rival far larger alternatives in many practical scenarios.
- The ensemble blueprint: how to mix weak and strong models with a cheap judge to surpass a single giant (a sketch follows this list).
- In‑context learning delivers: few-shot prompting consistently improved the tested large models (GLM-4.6, Llama-70B), especially on edge cases, and proved more effective than iterative prompt refinement without concrete examples.
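A minimal sketch of that blueprint in Python, assuming a generic `Model = Callable[[str], str]` interface in place of real API clients; the prompt templates, few-shot example, and fallback logic are illustrative assumptions, not the team's actual prompts.

```python
from typing import Callable, List

# Hypothetical interface: a "model" is any function mapping a prompt to a response.
Model = Callable[[str], str]

# Two prompting variants per model (zero-shot and few-shot), as in the paper's setup;
# the wording below is illustrative, not the team's actual templates.
ZERO_SHOT = (
    "Answer the user's last turn using ONLY the passages below.\n\n"
    "Passages:\n{passages}\n\nConversation:\n{dialogue}\nAnswer:"
)
FEW_SHOT = (
    "Example -- Passages: The Nile is about 6,650 km long.\n"
    "Q: How long is the Nile? A: About 6,650 km, according to the passage.\n\n"
    "Now answer the user's last turn using ONLY the passages below.\n\n"
    "Passages:\n{passages}\n\nConversation:\n{dialogue}\nAnswer:"
)

def generate_candidates(models: List[Model], passages: str, dialogue: str) -> List[str]:
    """Every model answers under both variants: 7 models x 2 prompts = 14 candidates."""
    return [
        model(template.format(passages=passages, dialogue=dialogue))
        for model in models
        for template in (ZERO_SHOT, FEW_SHOT)
    ]

def judge_select(judge: Model, passages: str, dialogue: str, candidates: List[str]) -> str:
    """A cheap judge (GPT-4o-mini in the paper) picks the candidate most faithful
    to the retrieved passages; a malformed verdict falls back to the first candidate."""
    listing = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    verdict = judge(
        "Reply with only the index of the candidate most faithful to the passages.\n\n"
        f"Passages:\n{passages}\n\nConversation:\n{dialogue}\n\nCandidates:\n{listing}\nIndex:"
    )
    try:
        return candidates[int(verdict.strip())]
    except (ValueError, IndexError):
        return candidates[0]
```

The point of per-instance selection is that a small specialist can win individual turns even when a frontier model is stronger on average, which is how heterogeneity ends up beating sheer scale.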
The most interesting bit here is the judge-guided per-instance selection from a diverse pool of seven grounded LLMs. I wonder how the lightweight GPT-4o-mini judge handles turns where two candidates commit to different but equally faithful readings of ambiguous passages, and whether faithfulness can be maintained when no candidate cleanly aligns with the retrieved evidence. This gating approach seems powerful, but it could be brittle if the judge overfits to the prompts or model IDs in the pool; an ablation where the judge also sees a passage-local reliability score might help.

The arxivlens breakdown helped me parse the method details (also worth checking here: https://arxivlens.com/PaperView/Details/raguteam-at-semeval-2026-task-8-meno-and-friends-in-a-judge-orchestrated-llm-ensemble-for-faithful-multi-turn-response-generation-8162-a20ac434), and it's nice to see the claim that diversity across model families and prompts beats sheer scale. Would adding a tiny cross-check between judge verdicts and local evidence signals help reduce edge-case hallucinations, especially on unanswerable turns?
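One cheap way to realise that cross-check: score each candidate by lexical overlap with the retrieved passages and abstain when even the best-supported candidate scores low. A rough sketch, where the tokenisation and the 0.2 threshold are illustrative assumptions rather than anything from the paper:

```python
import re

def _content_tokens(text: str) -> set:
    """Lowercased alphanumeric tokens; a crude stand-in for a real grounding signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def support_score(candidate: str, passages: str) -> float:
    """Fraction of a candidate's tokens that also appear in the retrieved passages."""
    cand = _content_tokens(candidate)
    return len(cand & _content_tokens(passages)) / max(len(cand), 1)

def cross_check(judge_pick: str, candidates: list, passages: str,
                threshold: float = 0.2) -> str:
    """Veto the judge when its pick is poorly grounded: prefer the best-supported
    candidate, or abstain entirely if nothing in the pool clears the threshold."""
    if support_score(judge_pick, passages) >= threshold:
        return judge_pick
    best = max(candidates, key=lambda c: support_score(c, passages))
    if support_score(best, passages) >= threshold:
        return best
    return "The retrieved passages do not answer this question."
```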
This is a very interesting paper on building smart, computationally efficient RAG.
We've tested your Meno model on a laptop; it runs very fast.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations (2026)
- Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents (2026)
- H-RAG at SemEval-2026 Task 8: Hierarchical Parent-Child Retrieval for Multi-Turn RAG Conversations (2026)
- S2G-RAG: Structured Sufficiency and Gap Judging for Iterative Retrieval-Augmented QA (2026)
- EviMem: Evidence-Gap-Driven Iterative Retrieval for Long-Term Conversational Memory (2026)
- Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering (2026)
- OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs (2026)