LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models
Abstract
Vision-Language-Action models show significant performance drops when handling paraphrased instructions due to surface-level matching rather than semantic understanding, highlighting the need for better linguistic generalization metrics.
Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para
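The abstract describes PRIDE as combining semantic and syntactic factors into a single paraphrase-difficulty score. The paper's exact formulation is not given here, so the sketch below is only a toy illustration of the idea, using lexical overlap as a stand-in for the semantic factor and character-sequence similarity as a stand-in for the syntactic factor; the function name and equal weighting are assumptions, not the paper's definition.

```python
from difflib import SequenceMatcher


def paraphrase_difficulty(original: str, paraphrase: str) -> float:
    """Toy difficulty score in [0, 1]; higher means a more divergent paraphrase.

    Illustrative proxy only -- the paper's PRIDE metric is defined differently.
    """
    orig_tokens = set(original.lower().split())
    para_tokens = set(paraphrase.lower().split())
    # Semantic factor (proxy): 1 - Jaccard overlap of the instruction tokens.
    jaccard = len(orig_tokens & para_tokens) / len(orig_tokens | para_tokens)
    semantic_dist = 1.0 - jaccard
    # Syntactic factor (proxy): 1 - normalized character-sequence similarity.
    syntactic_dist = 1.0 - SequenceMatcher(None, original, paraphrase).ratio()
    # Equal-weight combination; the real metric's weighting is not specified here.
    return 0.5 * semantic_dist + 0.5 * syntactic_dist


# A simple synonym substitution scores lower than a full rephrasing,
# matching the easy-vs-hard distinction the benchmark controls for.
easy = paraphrase_difficulty("pick up the bowl", "pick up the dish")
hard = paraphrase_difficulty("pick up the bowl", "grasp the round container and lift it")
```

Scoring paraphrases this way lets a benchmark report success rates stratified by difficulty instead of one binary average, which is the failure mode of plain success rate that the abstract points out.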
Community
We introduce LIBERO-Para, a controlled benchmark that evaluates paraphrase robustness in VLA models by independently varying action expressions and object references. Dataset: https://huggingface.co/datasets/HAI-Lab/LIBERO-Para
An interesting breakdown of this paper is available on arXivLens: https://arxivlens.com/PaperView/Details/libero-para-a-diagnostic-benchmark-and-metrics-for-paraphrase-robustness-in-vla-models-5196-f79d33f2
It covers the executive summary, detailed methodology, and practical applications.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration (2026)
- LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models (2026)
- TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models (2026)
- RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation (2026)
- RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies (2026)
- When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs (2026)
- AtomVLA: Scalable Post-Training for Robotic Manipulation via Predictive Latent World Models (2026)
This paper caught my eye because of the benchmark angle. I found a good summary here, which helped me get through it quicker: https://arxivexplained.com/paper/libero-para-a-diagnostic-benchmark-and-metrics-for-paraphrase-robustness-in-vla-models