AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems
Abstract
Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving their performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning on Space Planning Problems (SPP), a family of high-stakes tasks with heterogeneous objectives, strict physical constraints, and long-horizon decision-making. AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, under a unified agent-oriented interaction protocol. Evaluating a range of state-of-the-art open- and closed-source agentic LLM systems, we find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints. AstroReason-Bench offers a challenging and diagnostic testbed for future agentic research.
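To make the problem structure in the abstract concrete, here is a minimal sketch of the kind of scheduling task the SPP family involves: tasks with visibility windows, a non-overlap constraint on a single spacecraft resource, and a reward objective. Everything below (the `Task` fields, the `greedy_schedule` baseline) is an illustrative assumption for exposition, not AstroReason-Bench's actual protocol or API.

```python
# Illustrative sketch only: Task fields and the greedy baseline are
# assumptions for exposition, not AstroReason-Bench's real interface.
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    window_start: float  # earliest feasible start (seconds, mission time)
    window_end: float    # latest feasible end
    duration: float      # required contact/observation time
    reward: float        # objective contribution if scheduled


@dataclass
class Scheduled:
    task_id: str
    start: float
    end: float


def greedy_schedule(tasks: list[Task]) -> list[Scheduled]:
    """Reward-greedy toy baseline: place each task at its earliest
    conflict-free start on a single shared resource, skipping tasks
    whose visibility window cannot fit the required duration."""
    plan: list[Scheduled] = []
    for t in sorted(tasks, key=lambda x: -x.reward):
        start = t.window_start
        # Push the candidate start past every conflicting interval,
        # scanning already-scheduled slots in chronological order.
        for p in sorted(plan, key=lambda p: p.start):
            if start < p.end and p.start < start + t.duration:
                start = p.end
        if start + t.duration <= t.window_end:
            plan.append(Scheduled(t.task_id, start, start + t.duration))
    return plan


if __name__ == "__main__":
    demo = [
        Task("eo-obs-1", 0.0, 100.0, 30.0, reward=5.0),
        Task("gs-pass-1", 20.0, 60.0, 25.0, reward=8.0),
    ]
    for slot in greedy_schedule(demo):
        print(slot)
```

A heuristic like this is roughly the simplest kind of specialized baseline one could compare agents against; real SPP solvers additionally handle coupling constraints such as slewing, energy, and downlink capacity that this toy omits.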
Community
Introduces AstroReason-Bench, a benchmark for evaluating unified agentic planning in space planning problems with physics constraints, heterogeneous objectives, and long-horizon decisions.
Similar papers recommended by the Semantic Scholar API (via Librarian Bot):
- NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents (2025)
- Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol (2025)
- The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments (2026)
- DeliveryBench: Can Agents Earn Profit in Real World? (2025)
- Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction (2025)
- AI Agent Systems: Architectures, Applications, and Evaluation (2026)
- Beyond Entangled Planning: Task-Decoupled Planning for Long-Horizon Agents (2026)