black-yt
posted an update 1 day ago
Hey all, our ResearchClawBench leaderboard just updated 🔥

We let AI do real science: 40 tasks across 10 disciplines, compared to human papers. Hard example? 🏔️ Glacier mass change: AI must integrate 233 datasets from 35 teams, 4 methods, reproduce 6542±387 Gt ice loss vs IPCC. No toy problems.

Latest leaderboard (2026-06-09) 📊:
Agents: 🥇 Claude Code 21.5 (50 = match human), $5.3; 🥈 EvoScientist 18.8, $4.1; 🥉 Codex CLI 18.4, just $2.0
LLMs+Harness: 🥇 Claude-Opus-4.8 21.1, $4.0; 🥈 Claude-Opus-4.7 20.7; 🥉 MiniMax-M3 19.8, only $0.45; Qwen3.7-Max 18.7, $0.42, 11min 💥

Claude still king, but MiniMax/Qwen/DeepSeek are crazy cheap and competitive. Expensive isn't always better.
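
If you want to put a rough number on the cost point, here is a minimal sketch that ranks the entries above by score per dollar. It uses only the figures quoted in this post and skips entries with no listed cost (Claude-Opus-4.7, DeepSeek). Two assumptions: the dollar amounts are in USD and comparable across rows, and "score per dollar" is just an illustrative ratio, not a metric the benchmark itself defines. The names `ENTRIES` and `score_per_dollar` are made up for this example.

```python
# Scores and costs copied from the leaderboard lines in this post.
# Assumption: costs are USD and measured the same way for every row
# (the post does not say whether they are per task or per full run).
ENTRIES = [
    ("Claude Code", 21.5, 5.3),
    ("EvoScientist", 18.8, 4.1),
    ("Codex CLI", 18.4, 2.0),
    ("Claude-Opus-4.8", 21.1, 4.0),
    ("MiniMax-M3", 19.8, 0.45),
    ("Qwen3.7-Max", 18.7, 0.42),
]


def score_per_dollar(entries):
    """Return (name, score / cost) pairs, most cost-efficient first."""
    ranked = [(name, score / cost) for name, score, cost in entries]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in score_per_dollar(ENTRIES):
        print(f"{name:<16} {ratio:6.1f} points per dollar")
```

On these numbers the two cheapest models land at the top of the ratio ranking and Claude Code at the bottom, even though Claude Code has the highest raw score.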

📎 Code & star: https://github.com/InternScience/ResearchClawBench
🏠 Website: https://internscience.github.io/ResearchClawBench-Home/
🤗 Upvote paper: ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research (2606.07591)

First time hearing about this … I will steer the SRT on this soon. Catching up.

Thanks! Looking forward to seeing your results.