Unlike environmental sounds that mainly indicate event occurrence (e.g., dog barking), human speech carries rich semantics and temporal structure. Despite the advancement of audio-visual large language models (LLMs) in video understanding, it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs, leading models to generate inaccurate or misleading outputs. To study this systematically, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech–vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech–vision hallucinations from two complementary perspectives: semantic and temporal. Experimental results demonstrate that even the most advanced audio-visual LLMs struggle to align speech content with corresponding visual signals. Our work uncovers a fundamental limitation of current audio-visual LLMs and highlights the need for speech-aware, grounded speech–video perception and comprehension.
We introduce SVHalluc, the first comprehensive benchmark specifically designed to evaluate speech–vision hallucination in audio-visual LLMs. SVHalluc evaluates whether audio-visual LLMs can accurately integrate speech and visual information along two complementary dimensions: semantic and temporal. The semantic hallucination tasks evaluate whether models can correctly identify the correspondence between speech content and the visual scene without hallucinating non-existent entities (e.g., objects or events). The temporal hallucination tasks evaluate whether models can identify when narrated events occur visually relative to the moment the person is speaking. We design three complementary coarse-to-fine tasks for each hallucination type, resulting in six tasks in total. SVHalluc systematically exposes failure modes in both semantic and temporal understanding, providing a rigorous and comprehensive evaluation of speech–vision hallucination in current audio-visual LLMs.
@InProceedings{Zhang_2026_CVPR,
author = {Zhang, Chenshuang and Kim, Kyeong Seon and Liu, Chengxin and Oh, Tae-Hyun},
title = {SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2026},
}