Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
Abstract
The SANTA framework addresses hallucinations in multimodal LLMs by using self-augmented contrastive alignment to enhance object and action faithfulness in video caption generation.
Recent advances in multimodal LLMs (MLLMs) have demonstrated a remarkable capability to generate descriptive captions for input videos. However, these models often introduce factual inaccuracies into the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that promotes object and action faithfulness by suppressing spurious correlations and enforcing emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify potential hallucinations latent in the MLLM and transform the original captions into contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on hallucination examination benchmarks.
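The abstract does not provide implementation details, but the tracklet-phrase contrastive alignment it describes can be illustrated with a minimal InfoNCE-style sketch: visual tracklet (or relation-guided action) features are pulled toward their matching phrase embeddings, while embeddings of the self-augmented hallucinated captions act as hard negatives. All function and tensor names below (e.g. contrastive_alignment_loss, tracklet_feats, hallucinated_feats) are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(tracklet_feats, phrase_feats, hallucinated_feats, tau=0.07):
    """
    Hypothetical sketch of a tracklet-phrase contrastive alignment loss.

    tracklet_feats:     (N, D) features of object tracklets or relation-guided actions
    phrase_feats:       (N, D) embeddings of the matching visual/temporal phrases
    hallucinated_feats: (N, K, D) embeddings of self-augmented hallucinated phrases,
                        used as contrasted (hard negative) samples
    """
    # L2-normalise so dot products are cosine similarities
    v = F.normalize(tracklet_feats, dim=-1)        # (N, D)
    t = F.normalize(phrase_feats, dim=-1)          # (N, D)
    n = F.normalize(hallucinated_feats, dim=-1)    # (N, K, D)

    pos = (v * t).sum(-1, keepdim=True) / tau      # (N, 1) positive similarities
    neg = torch.einsum("nd,nkd->nk", v, n) / tau   # (N, K) hallucinated negatives

    logits = torch.cat([pos, neg], dim=1)          # (N, 1 + K)
    labels = torch.zeros(len(v), dtype=torch.long, device=v.device)  # positive at index 0
    return F.cross_entropy(logits, labels)
```

Under these assumptions, minimizing the loss increases similarity between each visual unit and its faithful phrase while pushing it away from the hallucinated variants generated by the self-augmentation scheme.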
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings (2025)
- Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding (2025)
- Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention (2025)
- SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense (2025)
- Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats (2025)
- V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention (2025)
- Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context (2025)