SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents Paper • 2505.20411 • Published May 26
CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards Paper • 2510.08529 • Published Oct 9
TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments Paper • 2510.01179 • Published Oct 1
Do You Need Proprioceptive States in Visuomotor Policies? Paper • 2509.18644 • Published Sep 23
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? Paper • 2509.16941 • Published Sep 21
DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery Paper • 2508.06960 • Published Aug 9
PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play? Paper • 2508.10014 • Published Aug 6