Accepted Paper: CVPR 2026 Main Conference

Also selected for Oral Presentation at the DataMFM Workshop

SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models

Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu, Tae-Hyun Oh

KAIST

TL;DR: SVHalluc is the first benchmark for speech-induced hallucinations in audio-visual LLMs, revealing failures in grounding speech content in visual evidence.


Motivation: Speech Content ≠ Visual Evidence

Existing audio-visual hallucination studies mainly test whether environmental sound events are visually present. SVHalluc studies a more challenging setting: speech can mention objects, actions, and temporal references that are not supported by the current video.

Motivation figure comparing environmental sound hallucination with speech-induced hallucination.

Unlike environmental sounds, speech carries rich semantic and temporal content. Models should not treat what is said as what is seen.

Prior benchmarks

Sound event hallucination

Prior work often asks whether a heard environmental sound, such as a cat meowing, corresponds to a visible event.

Sound Event ≠ Seeing

SVHalluc

Speech-induced hallucination

Speech can mention non-visible content. For example, a model may hear “egg” and hallucinate that an egg is visible.

Speech Content ≠ Seeing

New challenge

Ground speech in vision

SVHalluc evaluates whether AV-LLMs can verify spoken content against visual evidence, rather than simply relying on the transcript.

Abstract

Unlike environmental sounds that mainly indicate event occurrence, human speech carries rich semantics and temporal structures. Despite the advancement of audio-visual large language models in video understanding, it remains underexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs, where models generate inaccurate or misleading outputs by confusing what is said with what is visually supported.

We introduce SVHalluc, a comprehensive benchmark for evaluating speech–vision hallucination in audio-visual LLMs. SVHalluc diagnoses hallucinations from two complementary perspectives: semantic grounding, which asks what speech content is visible, and temporal grounding, which asks when narrated actions visually occur. Experimental results show that current audio-visual LLMs still struggle to ground speech in vision, revealing a fundamental limitation of cross-modal speech-video understanding.

SVHalluc Benchmark

SVHalluc contains six diagnostic tasks organized around two questions: what speech content is visually supported, and when narrated events occur visually.

Semantic Tasks

What speech content is visible?

GSA: Global Semantic Alignment. Video-level semantic matching.

FGSA: Fine-Grained Semantic Alignment. Object-level semantic grounding.

CMSB: Cross-Modal Semantic Binding. Event-level semantic matching.

Temporal Tasks

When does the narrated action visually occur?

Temporal Alignment: Same-time speech-video matching.

Temporal Forecasting: Past / present / future reasoning.

CMTB: Cross-Modal Temporal Binding. Distinguishing speech from the current video.

Task design: semantic tasks evaluate whether speech content is visually supported, while temporal tasks evaluate whether narrated events are aligned with the current visual moment.

Task Examples

SVHalluc evaluates speech–vision hallucination through six diagnostic tasks, covering semantic grounding and temporal grounding.

Semantic tasks ask what speech content is visually supported; temporal tasks ask when narrated events visually occur.

Dataset Statistics

SVHalluc provides balanced and human-verified samples for diagnosing speech–vision hallucination.

2,405 video-question pairs

6 evaluation tasks

1,422 semantic samples

983 temporal samples

Semantic tasks test whether spoken content is visually supported, while temporal tasks test when narrated events visually occur.

Main Results

SVHalluc reveals substantial speech–vision hallucination in current audio-visual LLMs. We report both overall model performance and task-wise difficulty.

Audio-visual LLMs perform poorly on SVHalluc

Semantic and temporal average accuracy across evaluated models, with random baselines shown as dashed lines.

Overall Performance

Overall semantic and temporal accuracy of audio-visual LLMs on SVHalluc.

Open-source models: Perform near random on multiple tasks.

Commercial models: Perform better, but still hallucinate.

Best model: Gemini 2.5 Pro achieves the strongest average performance.
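
As a rough illustration of how semantic and temporal average accuracy could be derived from per-sample predictions, the sketch below groups results by task and averages within each group. It is not the official evaluation script. The record layout, the function names, the shorthand "TA" and "TF" for Temporal Alignment and Temporal Forecasting, and the unweighted mean over tasks are all assumptions made for illustration. The toy records are made up and are not benchmark data.

```python
from collections import defaultdict

# GSA, FGSA, CMSB, and CMTB are the abbreviations used on this page.
# "TA" and "TF" are hypothetical shorthand introduced only for this sketch.
SEMANTIC_TASKS = ("GSA", "FGSA", "CMSB")
TEMPORAL_TASKS = ("TA", "TF", "CMTB")


def per_task_accuracy(records):
    """Compute accuracy per task from (task, predicted, gold) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, pred, gold in records:
        total[task] += 1
        correct[task] += int(pred == gold)
    return {task: correct[task] / total[task] for task in total}


def group_average(task_acc, tasks):
    """Unweighted mean over the tasks of one group (assumed averaging scheme)."""
    present = [task_acc[t] for t in tasks if t in task_acc]
    return sum(present) / len(present) if present else float("nan")


# Toy records, invented for illustration only.
records = [
    ("GSA", "A", "A"), ("GSA", "B", "A"),
    ("FGSA", "yes", "yes"), ("CMSB", "no", "yes"),
    ("TA", "A", "A"), ("TF", "past", "future"), ("CMTB", "B", "B"),
]
acc = per_task_accuracy(records)
semantic_avg = group_average(acc, SEMANTIC_TASKS)
temporal_avg = group_average(acc, TEMPORAL_TASKS)
```

A sample-weighted mean would be an equally plausible choice. The page does not state which averaging scheme is used.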

Which tasks are most challenging?

Accuracy gain over random choice highlights where models struggle most across the six diagnostic tasks.

Task-wise Analysis

Task-wise accuracy gain over random choice on SVHalluc.

Global alignment is hard: GSA is more challenging than fine-grained object grounding (FGSA) for most models.
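
One simple reading of "accuracy gain over random choice" is the difference between a model's accuracy and the chance level of a multiple-choice task. The sketch below assumes that reading. The function names, the number of answer options, and the accuracy values are hypothetical and are not taken from the paper.

```python
def random_baseline(num_options):
    """Chance accuracy when guessing uniformly among num_options answers."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


def gain_over_random(accuracy, num_options):
    """Plain difference between accuracy and chance (assumed definition)."""
    return accuracy - random_baseline(num_options)


# Hypothetical tasks: (model accuracy, number of answer options).
example = {
    "binary_task": (0.55, 2),
    "four_way_task": (0.40, 4),
}
gains = {name: gain_over_random(a, k) for name, (a, k) in example.items()}
```

Under this definition, a model with 55% accuracy on a binary task is only 5 points above chance, while 40% on a four-way task is 15 points above chance. This is why raw accuracy alone can mislead when tasks differ in the number of options.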

Why Do Models Fail?

The failure is not simply caused by poor speech recognition. Models can perceive speech, but still struggle to verify whether the spoken content is grounded in the video.

Speech recognition is reliable: Most speech content is recognized, with a word error rate (WER) of 0.09.

Speech timing is localized: Speech timestamps can be localized with mIoU = 0.88.

Transcript is not enough: Adding transcripts helps on some tasks, but does not solve speech–vision alignment.

Strong speech perception does not guarantee reliable speech–vision integration.
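
For readers unfamiliar with the two perception metrics quoted above, the sketch below shows their standard definitions: WER as word-level edit distance divided by the reference length, and mIoU as the mean temporal intersection-over-union of predicted and reference time spans. The text normalization (lowercasing and whitespace splitting) and the function names are assumptions. The paper's exact protocol may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def temporal_iou(pred, gold):
    """IoU of two (start, end) intervals given in seconds."""
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average temporal IoU over (predicted, gold) interval pairs."""
    pairs = list(pairs)
    return sum(temporal_iou(p, g) for p, g in pairs) / len(pairs)


# Invented example: one substituted word out of six reference words.
wer = word_error_rate("crack the egg into the bowl", "crack the egg in the bowl")
miou = mean_iou([((0.0, 4.0), (0.0, 4.0)), ((2.0, 6.0), (3.0, 7.0))])
```

Both metrics measure only whether speech is heard and localized correctly. Neither checks whether the spoken content is visible in the video, which is the gap the benchmark targets.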

Resources

Paper, code, dataset, poster, and slides will be released here.

Paper
PDF / arXiv

Code
Evaluation scripts

Data
Benchmark samples

Poster
CVPR 2026 project poster

Slides
Oral slides for DataMFM Workshop

Citation

@InProceedings{Zhang_2026_SVHalluc,
  author    = {Chenshuang Zhang and Kyeong Seon Kim and Chengxin Liu and Tae-Hyun Oh},
  title     = {SVHalluc: Benchmarking Speech--Vision Hallucination in Audio-Visual Large Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026}
}