Accepted Paper: CVPR 2026, Main Conference
Also selected for Oral Presentation

SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models

KAIST
TL;DR: SVHalluc is the first benchmark for speech-induced hallucinations in audio-visual LLMs, revealing failures in grounding speech content in visual evidence.

Motivation: Speech Content ≠ Visual Evidence

Existing audio-visual hallucination studies mainly test whether environmental sound events are visually present. SVHalluc studies a more challenging setting: speech can mention objects, actions, and temporal references that are not supported by the current video.

Motivation figure comparing environmental sound hallucination with speech-induced hallucination.

Unlike environmental sounds, speech carries rich semantic and temporal content. Models should not treat what is said as what is seen.

Prior benchmarks

Sound event hallucination

Prior work often asks whether a heard environmental sound, such as a cat meowing, corresponds to a visible event.

Sound Event ≠ Seeing

SVHalluc

Speech-induced hallucination

Speech can mention non-visible content. For example, a model may hear “egg” and hallucinate that an egg is visible.

Speech Content ≠ Seeing

New challenge

Ground speech in vision

SVHalluc evaluates whether AV-LLMs can verify spoken content against visual evidence, rather than simply relying on the transcript.

Abstract

Unlike environmental sounds that mainly indicate event occurrence, human speech carries rich semantics and temporal structures. Despite the advancement of audio-visual large language models in video understanding, it remains underexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs, where models generate inaccurate or misleading outputs by confusing what is said with what is visually supported.

We introduce SVHalluc, a comprehensive benchmark for evaluating speech–vision hallucination in audio-visual LLMs. SVHalluc diagnoses hallucinations from two complementary perspectives: semantic grounding, which asks what speech content is visible, and temporal grounding, which asks when narrated actions visually occur. Experimental results show that current audio-visual LLMs still struggle to ground speech in vision, revealing a fundamental limitation of cross-modal speech-video understanding.

SVHalluc Benchmark

SVHalluc contains six diagnostic tasks organized around two questions: what speech content is visually supported, and when narrated events occur visually.

Semantic Tasks

What speech content is visible?

GSA (Global Semantic Alignment): Video-level semantic matching.
FGSA (Fine-Grained Semantic Alignment): Object-level semantic grounding.
CMSB (Cross-Modal Semantic Binding): Event-level semantic matching.

Temporal Tasks

When does the narrated action visually occur?

TA (Temporal Alignment): Same-time speech-video matching.
TF (Temporal Forecasting): Past / present / future reasoning.
CMTB (Cross-Modal Temporal Binding): Distinguishing speech from the current video.
Task design: semantic tasks evaluate whether speech content is visually supported, while temporal tasks evaluate whether narrated events are aligned with the current visual moment.
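The two-group task layout above can be written down as a small lookup table. The sketch below is illustrative only: the task abbreviations come from this page, but the function name `group_averages`, the unweighted mean over the three tasks in each group, and the example accuracies are assumptions, not the authors' evaluation code or reported results.

```python
from statistics import mean

# Task abbreviations and their groups, as listed on this page.
TASK_GROUPS = {
    "semantic": ["GSA", "FGSA", "CMSB"],
    "temporal": ["TA", "TF", "CMTB"],
}


def group_averages(task_accuracy):
    """Return the unweighted mean accuracy of each task group.

    Assumption: every task counts equally within its group. A benchmark
    could instead weight tasks by their number of samples.
    """
    return {
        group: mean(task_accuracy[task] for task in tasks)
        for group, tasks in TASK_GROUPS.items()
    }


# Placeholder accuracies for illustration; these are NOT results from the paper.
example_accuracy = {
    "GSA": 0.50, "FGSA": 0.70, "CMSB": 0.60,
    "TA": 0.40, "TF": 0.30, "CMTB": 0.50,
}
averages = group_averages(example_accuracy)
print(averages)
```

Keeping the grouping in one table makes it easy to report a semantic average and a temporal average side by side for each model.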

Task Examples

SVHalluc evaluates speech–vision hallucination through six diagnostic tasks, covering semantic grounding and temporal grounding.

SVHalluc task examples covering semantic and temporal hallucination tasks.

Semantic tasks ask what speech content is visually supported; temporal tasks ask when narrated events visually occur.

Dataset Statistics

SVHalluc provides balanced and human-verified samples for diagnosing speech–vision hallucination.

2,405 video-question pairs
6 evaluation tasks
1,422 semantic samples
983 temporal samples
Semantic tasks test whether spoken content is visually supported, while temporal tasks test when narrated events visually occur.

Main Results

SVHalluc reveals substantial speech–vision hallucination in current audio-visual LLMs. We report both overall model performance and task-wise difficulty.

Audio-visual LLMs perform poorly on SVHalluc

Semantic and temporal average accuracy across evaluated models, with random baselines shown as dashed lines.

Overall Performance
Overall semantic and temporal accuracy of audio-visual LLMs on SVHalluc.
Open-source models: Perform near random on multiple tasks.
Commercial models: Perform better, but still hallucinate.
Best model: Gemini 2.5 Pro achieves the strongest average performance.

Which tasks are most challenging?

Accuracy gain over random choice highlights where models struggle most across the six diagnostic tasks.

Task-wise Analysis
Task-wise accuracy gain over random choice on SVHalluc.
Global alignment is hard: GSA is more challenging than fine-grained object grounding for most models.
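Accuracy gain over random choice can be read as the model's accuracy minus the accuracy expected from uniform guessing. The sketch below assumes multiple-choice questions with a known number of answer options. This page does not state the number of options per task, so `num_choices` and the example values are hypothetical.

```python
def gain_over_random(accuracy, num_choices):
    """Accuracy minus the expected accuracy of uniform random guessing.

    Assumption: each question has `num_choices` equally likely options,
    so random guessing is correct with probability 1 / num_choices.
    """
    if num_choices < 2:
        raise ValueError("a multiple-choice question needs at least two options")
    return accuracy - 1.0 / num_choices


# Hypothetical numbers: the same 55% accuracy on a two-way and a three-way task.
two_way_gain = gain_over_random(0.55, num_choices=2)
three_way_gain = gain_over_random(0.55, num_choices=3)
print(f"two-way gain:   {two_way_gain:.3f}")
print(f"three-way gain: {three_way_gain:.3f}")
```

Subtracting the chance level puts tasks with different numbers of options on a comparable scale: the same raw accuracy is a larger gain when there are more options to choose from.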

Why Do Models Fail?

The failure is not simply caused by poor speech recognition. Models can perceive speech, but still struggle to verify whether the spoken content is grounded in the video.

Speech recognition is reliable: Most speech content is recognized, with a word error rate (WER) of 0.09.
Speech timing is localized: The speech timestamp can be localized with a mean intersection over union (mIoU) of 0.88.
Transcript is not enough: Adding transcripts helps on some tasks, but does not solve speech–vision alignment.
Strong speech perception does not guarantee reliable speech–vision integration.
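For readers unfamiliar with the two speech-perception metrics above, the sketch below shows their standard textbook definitions: WER as word-level edit distance divided by the number of reference words, and temporal IoU as the overlap of two time intervals divided by their union. The function names, the whitespace tokenization, and the example inputs are assumptions for illustration. The paper's exact text normalization and evaluation protocol are not described on this page.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def temporal_iou(pred, gold):
    """Intersection over union of two (start, end) intervals in seconds."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average temporal IoU over (predicted, ground-truth) interval pairs."""
    return sum(temporal_iou(pred, gold) for pred, gold in pairs) / len(pairs)


# Made-up transcript and timestamps for illustration only.
wer = word_error_rate("crack the egg into the bowl", "crack the eggs into bowl")
miou = mean_iou([((1.0, 5.0), (2.0, 6.0)), ((0.0, 2.0), (0.0, 2.0))])
print(f"WER = {wer:.2f}, mIoU = {miou:.2f}")
```

Both metrics measure perception of the speech signal only. Neither checks whether the spoken content is supported by the video, which is the gap the benchmark targets.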

Resources

Paper, code, dataset, poster, and slides will be released here.

Citation

@InProceedings{Zhang_2026_SVHalluc,
  author    = {Chenshuang Zhang and Kyeong Seon Kim and Chengxin Liu and Tae-Hyun Oh},
  title     = {SVHalluc: Benchmarking Speech--Vision Hallucination in Audio-Visual Large Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026}
}