arXiv:2601.08118

MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness

Published on Jan 13 · Submitted by Ashutosh Hathidara on Jan 23

Abstract

AI-generated summary: Large language models are evaluated as user simulators through a reproducible benchmark framework that assesses their ability to generate human-like conversational responses across diverse tasks.

Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, underscoring the need for principled evaluation of so-called user proxy agents. We present MirrorBench, a reproducible, extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational tasks, explicitly decoupled from downstream task success. MirrorBench features a modular execution engine with typed interfaces, metadata-driven registries, multi-backend support, caching, and robust observability. The system supports pluggable user proxies, datasets, tasks, and metrics, enabling researchers to evaluate arbitrary simulators under a uniform, variance-aware harness. We include three lexical-diversity metrics (MATTR, Yule's K, and HD-D) and three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason). Across four open datasets, MirrorBench yields variance-aware results and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a simple command-line interface for running experiments, managing configurations and caching, and generating reports; it is available at https://github.com/SAP/mirrorbench.
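The abstract names its three lexical-diversity metrics without defining them, but all three have standard definitions in the corpus-linguistics literature: MATTR averages the type-token ratio over a sliding window, Yule's K measures repeat rate from the frequency spectrum (lower is more diverse), and HD-D computes the expected type-token ratio of a random fixed-size sample via the hypergeometric distribution. The sketch below implements those textbook formulas; the window and sample sizes are conventional defaults, and MirrorBench's own implementations may tokenize and parameterize these differently.

```python
# Textbook implementations of the three lexical-diversity metrics named in
# the abstract. A sketch for intuition, not MirrorBench's actual code.
from collections import Counter
from math import comb


def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-Average Type-Token Ratio: mean TTR over a sliding window."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)  # fall back to plain TTR
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def yules_k(tokens: list[str]) -> float:
    """Yule's K: frequency-spectrum repeat rate; lower means more diverse."""
    n = len(tokens)
    spectrum = Counter(Counter(tokens).values())  # frequency -> #types with it
    return 1e4 * (sum(m * m * v for m, v in spectrum.items()) - n) / (n * n)


def hdd(tokens: list[str], sample: int = 42) -> float:
    """HD-D: expected TTR of a random `sample`-token draw (hypergeometric)."""
    n = len(tokens)
    sample = min(sample, n)  # guard for utterances shorter than the sample
    total = 0.0
    for f in Counter(tokens).values():
        p_absent = comb(n - f, sample) / comb(n, sample)  # P(type not drawn)
        total += (1.0 - p_absent) / sample
    return total
```

The judge-based metrics are likewise given only by name. Of the three, Pairwise Indistinguishability has the most self-explanatory design; a plausible reading (our assumption, not the paper's specification) is a discrimination test: show a judge model a real human utterance and a proxy utterance for the same dialogue context, ask which is human, and score the proxy by how close the judge stays to the 50% chance level.

```python
# Hedged sketch of a pairwise-indistinguishability score under the reading
# above. `judge` stands in for an LLM prompt: given a context and two
# candidate utterances, it returns the index (0 or 1) it believes is human.
import random
from typing import Callable


def pairwise_indistinguishability(
    pairs: list[tuple[str, str, str]],      # (context, human, proxy)
    judge: Callable[[str, str, str], int],  # returns 0 or 1
) -> float:
    correct = 0
    for context, human, proxy in pairs:
        human_first = random.random() < 0.5  # randomize away position bias
        a, b = (human, proxy) if human_first else (proxy, human)
        correct += int(judge(context, a, b) == (0 if human_first else 1))
    accuracy = correct / len(pairs)
    # Chance-level accuracy (0.5) means the proxy is indistinguishable.
    return 1.0 - abs(accuracy - 0.5) * 2.0   # 1.0 = fully indistinguishable


# Toy judge that bets the shorter utterance is the human one:
score = pairwise_indistinguishability(
    [("Shall I book the 7pm slot?",
      "yeah 7 works",
      "Certainly! I would be delighted for you to proceed with the booking.")],
    judge=lambda ctx, a, b: 0 if len(a) < len(b) else 1,
)
```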

Community

Paper author · Paper submitter

The framework is open-sourced at https://github.com/SAP/mirrorbench

Paper author · Paper submitter

MirrorBench is an automatic, extensible framework to evaluate user-proxy agents for human-likeness. It provides a modular architecture to benchmark different user-proxy agents against a variety of realism metrics. MirrorBench is designed to be extensible, allowing researchers and developers to bring their own agents and metrics into the framework (a hypothetical sketch of such a plug-in follows below).

GitHub Repo: https://github.com/SAP/mirrorbench
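
One common way to realize this "bring your own agents" design in Python is a decorator-based registry over a small typed interface, sketched below. Every name here (UserProxy, PROXY_REGISTRY, register_proxy) is hypothetical and chosen for illustration; see the GitHub repo above for MirrorBench's actual typed interfaces and metadata-driven registries.

```python
# Hypothetical plug-in registry for user proxies. Not MirrorBench's API;
# an illustration of the registry pattern the framework's description implies.
from abc import ABC, abstractmethod

PROXY_REGISTRY: dict[str, type] = {}


def register_proxy(name: str):
    """Class decorator that records a proxy implementation under a key."""
    def wrap(cls):
        PROXY_REGISTRY[name] = cls
        return cls
    return wrap


class UserProxy(ABC):
    """Typed interface: map a dialogue history to the next user utterance."""

    @abstractmethod
    def respond(self, history: list[dict[str, str]]) -> str: ...


@register_proxy("terse-user")
class TerseUserProxy(UserProxy):
    """Toy baseline that answers with short, human-length utterances."""

    def respond(self, history: list[dict[str, str]]) -> str:
        last = history[-1]["content"] if history else ""
        return "yeah, that works" if "?" in last else "ok"


# A harness can then instantiate any registered proxy by its metadata key:
proxy = PROXY_REGISTRY["terse-user"]()
print(proxy.respond([{"role": "assistant", "content": "Shall I book it?"}]))
```

A metric plug-in would follow the same shape: a registered class exposing a scoring method over a dialogue, which the harness discovers by name from its configuration.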

