InnoGym: Benchmarking the Innovation Potential of AI Agents Paper • 2512.01822 • Published Dec 1, 2025 • 36
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models Paper • 2506.01062 • Published Jun 1, 2025 • 5
LightMem: Lightweight and Efficient Memory-Augmented Generation Paper • 2510.18866 • Published Oct 21, 2025 • 113
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research Paper • 2507.13300 • Published Jul 17, 2025 • 20
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21, 2025 • 84
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21, 2025 • 84