OmniVideo-100K

A Dataset for Audio-Visual Reasoning through
Structured Scripts and Evidence Chains

Xinyue Cai¹  Chaoyou Fu¹*  Yi-Fan Zhang²  Ran He²  Caifeng Shan¹

¹Nanjing University    ²CASIA

* Corresponding author

Motivation

Why is video-caption-QA not enough?

Existing pipelines often split videos into short clips, caption audio and vision separately, and generate QA directly from long descriptions.

While highly scalable, this paradigm weakens the deep temporal connections and multimodal interactions that complex audio-visual comprehension demands, leading to three major problems:

Severed Audio & Vision

Sound-source associations break.

Incoherent Narratives

Entity references drift across clips.

Localized QA

QA is often limited to local events.

Step 01

Script with Main Entity List

Use the main entity list as a global prior to integrate fragmented audio-visual information into a coherent, structured script.

Step 02

QA from Evidence Clues

Instead of directly generating QA from the long description, the model is first asked to mine the audio-visual cues.

OmniVideo-100K

Method

Entity-Anchored Scripts & Clue-Guided QA

The pipeline formalizes the raw video into a structured text representation, enabling LLMs to mine evidence chains before QA synthesis.

Overview of the OmniVideo-100K data generation pipeline

Entity-Anchored Video Scripting

Script: a summary, a main entity list, and structured segment-wise descriptions that integrate speech, sounds, and visual information.

Use consistent identifiers from the main entity list (shown as orange and green text in the overview figure) to keep references coherent across clips and to associate speech with visual entities.
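
The script layout described above can be pictured as a small data structure. The following is a minimal sketch, not the authors' schema: the class names, field names, and the `segments_mentioning` helper are all assumptions made for illustration, and the sample content is paraphrased from the evidence chain example on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    # Stable identifier reused verbatim in every segment (assumed convention).
    name: str
    description: str = ""


@dataclass
class Segment:
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    speech: str     # transcribed speech, attributed to entities by name
    sounds: str     # non-speech audio events
    visual: str     # visual description


@dataclass
class Script:
    summary: str
    entities: list[Entity]
    segments: list[Segment] = field(default_factory=list)

    def segments_mentioning(self, name: str) -> list[Segment]:
        """Return segments whose text refers to the given entity identifier."""
        return [
            seg for seg in self.segments
            if name in seg.speech or name in seg.sounds or name in seg.visual
        ]


# Illustrative content only; the summary string is a placeholder.
script = Script(
    summary="Illustrative blind-box unboxing video.",
    entities=[Entity("Brunette Girl")],
    segments=[
        Segment(214.0, 224.0,
                speech='Brunette Girl: "I hope I get one with a coke bottle '
                       'instead of a can though."',
                sounds="",
                visual="Another person holds a doll that carries a red soda can."),
        Segment(437.0, 451.0,
                speech='Brunette Girl: "I love it."',
                sounds="",
                visual="Brunette Girl holds her newly unboxed doll."),
    ],
)

mentions = script.segments_mentioning("Brunette Girl")
print(len(mentions))
```

Because the same identifier string appears in both segments, a plain substring lookup is enough to link them, which is the practical benefit of anchoring descriptions to a shared entity list.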

Clue-Guided QA Generation

Mine cross-segment and cross-modal clues from the script.

Produce QA pairs based on the clues and relevant segments.
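
The two stages above can be sketched as two chained model calls. This is a hedged illustration under stated assumptions: the `LLM` callable type, the prompt wording, and the function names are hypothetical and do not come from the paper, and a stub stands in for a real model so the example runs offline.

```python
from typing import Callable

# Hypothetical text-in, text-out model interface (an assumption).
LLM = Callable[[str], str]


def mine_clues(script_text: str, llm: LLM) -> str:
    """Stage 1: ask the model for cross-segment and cross-modal clues."""
    prompt = (
        "List audio-visual clues that connect different segments or "
        "modalities in the script below.\n\n" + script_text
    )
    return llm(prompt)


def generate_qa(clues: str, relevant_segments: str, llm: LLM) -> str:
    """Stage 2: write a QA pair grounded in the mined clues."""
    prompt = (
        "Write a question and answer that require combining these clues.\n\n"
        "Clues:\n" + clues + "\n\nSegments:\n" + relevant_segments
    )
    return llm(prompt)


def clue_guided_qa(script_text: str, relevant_segments: str, llm: LLM) -> str:
    clues = mine_clues(script_text, llm)
    return generate_qa(clues, relevant_segments, llm)


# Stub model that records prompts, used only to make the sketch runnable.
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    if len(calls) == 1:
        return "CLUE: she hoped for a bottle, then unboxed a bottle"
    return "Q: Why is she happy? A: Her doll holds a bottle."


qa = clue_guided_qa("full script text", "segments 3 and 7", fake_llm)
print(qa)
```

The point of the split is that the second prompt sees the mined clues rather than only the long description, so the question is tied to explicit evidence.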

Evidence Chain Example
Question

Why is Brunette Girl happy with her blind box?

Visual clue

Earlier [03:34 - 03:44], another person is shown holding a Pop Mart doll that carries a red soda can. Later [07:17 - 07:31], the scene shows the Brunette Girl holding her own newly unboxed doll, which carries a brown soda bottle labeled Coca-Cola.

Audio clue

At [03:34 - 03:39], upon seeing another doll being shown, Brunette Girl expresses her expectation: "I hope I get one with a coke bottle instead of a can though." Later, at [07:17 - 07:25], she opens her own blind box and happily exclaims: "I love it."

Answer

It is holding a bottle instead of a can.

Dataset

OmniVideo-100K & OmniVideo-Test

OmniVideo-100K provides automatically generated instruction-tuning samples; OmniVideo-Test features manually verified test samples.

10 Task types
100K Training QA
5,214 Training videos
7:3 OE / MCQ ratio
505 Verified test QA
264 Test videos
38.14% Acceptance rate
Video Curation
  • 1. Collect online videos and iteratively expand the search keyword pool from seven core labels.
  • 2. Exclude videos with resolutions below 480p and retain only English videos.
  • 3. Apply visual dynamics and word-density filters to keep videos rich in audio-visual information.
  • 4. Remove videos with hard-coded subtitles to reduce shortcut learning from on-screen text.
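
The curation filters above could be combined into a single predicate along the following lines. This is only a sketch: the `VideoMeta` fields, the way visual dynamics and word density are scored, and both numeric thresholds are placeholders, since the page does not state how those filters are computed. Only the 480p and English-only rules come from the text.

```python
from dataclasses import dataclass


@dataclass
class VideoMeta:
    height: int               # vertical resolution in pixels
    language: str             # detected spoken language code
    visual_dynamics: float    # hypothetical motion score in [0, 1]
    words_per_minute: float   # hypothetical transcript word density
    has_hard_subtitles: bool  # output of a hypothetical subtitle detector


MIN_HEIGHT = 480        # from the text: drop videos below 480p
MIN_DYNAMICS = 0.2      # placeholder threshold (assumption)
MIN_WORDS_PER_MIN = 60  # placeholder threshold (assumption)


def keep_video(v: VideoMeta) -> bool:
    """Return True only if the video passes every curation filter."""
    return (
        v.height >= MIN_HEIGHT
        and v.language == "en"
        and v.visual_dynamics >= MIN_DYNAMICS
        and v.words_per_minute >= MIN_WORDS_PER_MIN
        and not v.has_hard_subtitles
    )


candidates = [
    VideoMeta(720, "en", 0.5, 120.0, False),  # passes all filters
    VideoMeta(360, "en", 0.5, 120.0, False),  # resolution too low
    VideoMeta(720, "en", 0.5, 120.0, True),   # hard-coded subtitles
]
kept = [v for v in candidates if keep_video(v)]
print(len(kept))
```
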
OmniVideo-Test Quality control
  • 1. Remove QA pairs that contain factual errors or ambiguous multiple-choice options.
  • 2. Discard QA pairs that can be answered or easily guessed using only a single modality.
  • 3. Confirm answer uniqueness to guarantee that only one option is unambiguously correct.
  • 4. Conclude the manual screening with a final acceptance rate of approximately 38.14%.
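
As a rough sanity check on the screening numbers, the size of the candidate pool can be back-computed. This assumes the 38.14% acceptance rate is measured over QA pairs and that the 505 verified test QA pairs are exactly the accepted ones; neither detail is stated explicitly on this page.

```python
verified_qa = 505        # verified test QA pairs reported above
acceptance_rate = 0.3814 # reported acceptance rate (approximate)

# Estimated number of candidate QA pairs before manual screening.
estimated_candidates = round(verified_qa / acceptance_rate)

# Rate implied by that estimate, as a percentage with two decimals.
implied_rate = round(100 * verified_qa / estimated_candidates, 2)

print(estimated_candidates, implied_rate)
```

Under those assumptions, roughly 1,324 candidate QA pairs were screened to obtain the 505 that were kept.
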
Task taxonomy

Alignment

Focus on perceiving and synchronizing audio and vision along the timeline.

Fine-Grained Perception: Perceive synchronized cross-modal details from given unimodal cues.
Scene Transformation Detection: Identify synchronized visual shifts using audio cues.

Understanding

Emphasize semantic analysis, linking cross-modal information to grasp narratives.

Context Understanding: Identify semantic associations between audio and visual elements.
Comparison: Compare target states or attributes across different timestamps.
Sentiment Analysis: Analyze character inner states via audio-visual cues.
Event Sequence Ordering: Restore the correct temporal order of jumbled events.
Summarization: Generate an audio-visual summary for specific events.

Reasoning

Advanced cognitive abilities requiring logical deduction and abstract thinking.

Causal Reasoning: Infer underlying causes of specific events.
Future Prediction: Predict upcoming plot developments from current content.
Hypothetical Reasoning: Propose what-if scenarios to reinterpret video content.
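
For anyone scoring results per category, the taxonomy above maps naturally onto a lookup table. The sketch below copies the category and task names from this page; the variable names and the reverse mapping are illustrative choices, not part of any released tooling.

```python
# Category -> task types, as listed in the taxonomy above.
TAXONOMY = {
    "Alignment": [
        "Fine-Grained Perception",
        "Scene Transformation Detection",
    ],
    "Understanding": [
        "Context Understanding",
        "Comparison",
        "Sentiment Analysis",
        "Event Sequence Ordering",
        "Summarization",
    ],
    "Reasoning": [
        "Causal Reasoning",
        "Future Prediction",
        "Hypothetical Reasoning",
    ],
}

# Reverse lookup: task type -> category, useful when grouping scores.
TASK_TO_CATEGORY = {
    task: category
    for category, tasks in TAXONOMY.items()
    for task in tasks
}

print(len(TASK_TO_CATEGORY))
print(TASK_TO_CATEGORY["Comparison"])
```

The reverse mapping contains ten entries, which matches the ten task types reported in the dataset summary.
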
OmniVideo-100K statistics

Video category distribution across diverse real-world domains.

Video duration distribution

Video length statistics, with the majority spanning 1-3 minutes.

Word count statistics

Word count statistics (mean±std), highlighting detailed OE answers and balanced MCQ options.

Comparison with existing datasets
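| Dataset | # Samples | Domain | Avg. Length | Annotation | Complex Temporal | Evidence Chain | Structured Narratives |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AVSD | 8K | Open | 30s | Manual | | | |
| Pano-AVQA | 42.8K | Panoramic | 5s | Manual | | | |
| Music-AVQA | 32K | Music | 60s | Manual | | | |
| AVQA | 40K | Open | 10s | Manual | | | |
| JavisInst-Und | 110K | Open | 10s | Automatic* | | | |
| EgoAVU-Instruct | 3M | Ego | 1-6min | Automatic | | | |
| OmniVideo-100K | 100K | Open | 103s | Automatic | | | |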

* The automated pipeline leverages annotations from other datasets rather than starting from raw video.

Results

Experimental Findings

Fine-tuning on OmniVideo-100K improves audio-visual comprehension on OmniVideo-Test and transfers to established benchmarks.

On OmniVideo-Test, fine-tuning on OmniVideo-100K brings overall gains of 20.59, 17.82, and 13.86 percentage points for VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B-A3B-Instruct, respectively.

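| Model | Size | Overall | Alignment | Understanding | Reasoning | (0, 2] min | (2, 5] min |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Human | - | 100.00 | - | - | - | - | - |
| Gemini-3.1-Pro | - | 83.96 | 83.62 | 84.50 | 83.21 | 82.61 | 84.59 |
| VITA-1.5 | 8B | 40.99 | 28.45 | 43.02 | 48.09 | 44.72 | 39.24 |
| Qwen2.5-Omni | 7B | 42.77 | 39.66 | 46.51 | 38.17 | 38.51 | 44.77 |
| video-SALMONN 2+ | 7B | 45.15 | 37.93 | 49.22 | 43.51 | 46.58 | 44.48 |
| OmniVinci | 9B | 47.13 | 43.97 | 51.55 | 41.22 | 47.20 | 47.09 |
| MiniCPM-o 4.5 | 9B | 55.25 | 56.90 | 58.14 | 48.09 | 53.42 | 56.10 |
| Qwen3-Omni | 30B | 49.70 | 43.10 | 55.04 | 45.04 | 49.07 | 50.00 |
| Uni-MoE-2.0-Omni | 30B | 46.93 | 39.66 | 52.71 | 41.98 | 46.58 | 47.09 |
| Ours (VITA-1.5) | 8B | 61.58 (+20.59) | 59.48 (+31.03) | 63.18 (+20.16) | 60.31 (+12.22) | 59.01 (+14.29) | 62.79 (+23.55) |
| Ours (Qwen2.5-Omni) | 7B | 60.59 (+17.82) | 62.93 (+23.27) | 62.40 (+15.89) | 54.96 (+16.79) | 54.66 (+16.15) | 63.37 (+18.60) |
| Ours (Qwen3-Omni) | 30B | 63.56 (+13.86) | 60.34 (+17.24) | 67.05 (+12.01) | 59.54 (+14.50) | 62.11 (+13.04) | 64.24 (+14.24) |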
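
The reported overall gains are absolute differences between the fine-tuned and baseline scores. A short check, using only the Overall column of the table above (the dictionary names are illustrative), reproduces them:

```python
# Overall OmniVideo-Test scores copied from the table above.
baseline = {"VITA-1.5": 40.99, "Qwen2.5-Omni": 42.77, "Qwen3-Omni": 49.70}
finetuned = {"VITA-1.5": 61.58, "Qwen2.5-Omni": 60.59, "Qwen3-Omni": 63.56}

# Absolute gain in percentage points for each base model.
gains = {
    model: round(finetuned[model] - baseline[model], 2)
    for model in baseline
}

for model, gain in gains.items():
    print(f"{model}: +{gain:.2f} points")
```

The gain shrinks as the base model gets stronger, which is consistent with the larger Qwen3-Omni starting from a higher baseline.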

Models fine-tuned on OmniVideo-100K exhibit strong generalization, achieving consistent improvements across audio-visual benchmarks (Daily-Omni, OmniVideoBench, JointAVBench, and FutureOmni) while maintaining comparable performance on general video benchmarks (Video-MME and Video-MME-v2).

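| Model | Size | Video-MME (short) | Video-MME-v2 | Daily-Omni | OmniVideoBench | JointAVBench | FutureOmni |
| --- | --- | --- | --- | --- | --- | --- | --- |
| # Samples | - | 900 | 328 | 1197 | 509 | 2153 | 960 |
| VITA-1.5 | 8B | 70.63 | 5.95 | 52.63 | 36.35 | 44.77 | 48.65 |
| Qwen2.5-Omni | 7B | 75.56 | 10.28 | 62.41 | 36.54 | 54.44 | 48.85 |
| video-SALMONN 2+ | 7B | 74.22 | 11.50 | 62.57 | 35.36 | 56.39 | 51.88 |
| OmniVinci | 9B | 77.56 | 11.59 | 61.32 | 36.74 | 57.55 | 52.81 |
| MiniCPM-o 4.5 | 9B | 83.22 | 14.48 | 80.20 | 41.06 | 55.18 | 52.29 |
| Qwen3-Omni | 30B | 82.00 | 14.31 | 74.27 | 43.84 | 63.17 | 53.44 |
| Uni-MoE-2.0-Omni | 30B | 78.89 | 8.82 | 64.33 | 38.55 | 57.55 | 52.81 |
| Ours (VITA-1.5) | 8B | 67.67 (-2.96) | 7.42 (+1.47) | 55.39 (+2.76) | 36.94 (+0.59) | 57.41 (+12.64) | 56.35 (+7.70) |
| Ours (Qwen2.5-Omni) | 7B | 76.33 (+0.77) | 8.50 (-1.78) | 69.84 (+7.43) | 39.88 (+3.34) | 60.75 (+6.31) | 55.00 (+6.15) |
| Ours (Qwen3-Omni) | 30B | 83.56 (+1.56) | 15.33 (+1.02) | 76.61 (+2.34) | 44.81 (+0.97) | 66.37 (+3.20) | 57.60 (+4.16) |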

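Single-modality ablations reduce overall performance for both models (by 2.57 to 9.51 points), although visual-only input slightly raises the Reasoning score, confirming that OmniVideo-Test requires audio-visual synergy rather than isolated audio or visual perception.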

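| Model | Size | Overall | Alignment | Understanding | Reasoning | (0, 2] min | (2, 5] min |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Audio-Visual | | | | | | | |
| MiniCPM-o 4.5 | 9B | 55.25 | 56.90 | 58.14 | 48.09 | 53.42 | 56.10 |
| Qwen3-Omni | 30B | 49.70 | 43.10 | 55.04 | 45.04 | 49.07 | 50.00 |
| Audio-Only | | | | | | | |
| MiniCPM-o 4.5 | 9B | 45.74 (-9.51) | 36.21 (-20.69) | 51.94 (-6.20) | 41.98 (-6.11) | 45.34 (-8.08) | 45.93 (-10.17) |
| Qwen3-Omni | 30B | 46.14 (-3.56) | 37.07 (-6.03) | 52.71 (-2.33) | 41.22 (-3.82) | 48.45 (-0.62) | 45.06 (-4.94) |
| Visual-Only | | | | | | | |
| MiniCPM-o 4.5 | 9B | 47.92 (-7.33) | 42.24 (-14.66) | 50.00 (-8.14) | 48.85 (+0.76) | 51.55 (-1.87) | 46.22 (-9.88) |
| Qwen3-Omni | 30B | 47.13 (-2.57) | 42.24 (-0.86) | 48.84 (-6.20) | 48.09 (+3.05) | 45.96 (-3.11) | 47.67 (-2.33) |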

Under matched Qwen2.5-Omni fine-tuning settings, OmniVideo-100K improves most benchmarks while AVQA and JavisInst-Und often reduce performance.

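| Data | Video-MME (short) | Video-MME-v2 | OmniVideo-Test | Daily-Omni | OmniVideoBench | JointAVBench | FutureOmni |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Omni | 75.56 | 10.28 | 42.77 | 62.41 | 36.54 | 54.44 | 48.85 |
| w. AVQA | 68.11 | 9.69 | 6.28 | 55.14 | 34.38 | 50.16 | 43.65 |
| w. JavisInst-Und | 59.44 | 3.22 | 38.22 | 48.96 | 32.42 | 44.36 | 58.54 |
| w. OmniVideo-100K | 76.33 | 8.50 | 60.59 | 69.84 | 39.88 | 60.75 | 55.00 |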

The average score rises sharply from 47.26 to 52.64 with only 10K samples and peaks at 54.32 with 75K, then dips to 52.98 at 100K, suggesting useful scaling with mild saturation at the full 100K setting.

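| Data | Avg. | Video-MME (short) | Video-MME-v2 | Daily-Omni | OmniVideoBench | JointAVBench | FutureOmni | OmniVideo-Test |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o SFT | 47.26 | 75.56 | 10.28 | 62.41 | 36.54 | 54.44 | 48.85 | 42.77 |
| OmniVideo-10K | 52.64 | 77.22 | 13.84 | 69.92 | 39.10 | 60.57 | 52.60 | 55.25 |
| OmniVideo-25K | 53.98 | 76.56 | 13.74 | 70.68 | 42.63 | 60.15 | 54.27 | 59.80 |
| OmniVideo-50K | 54.14 | 76.33 | 11.38 | 69.84 | 42.83 | 59.45 | 57.60 | 61.58 |
| OmniVideo-75K | 54.32 | 76.89 | 10.47 | 72.26 | 41.26 | 61.03 | 55.73 | 62.57 |
| OmniVideo-100K | 52.98 | 76.33 | 8.50 | 69.84 | 39.88 | 60.75 | 55.00 | 60.59 |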
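
The Avg. column appears to be the unweighted mean of the seven benchmark scores in each row. That is an assumption rather than something stated on this page, but the short check below reproduces every reported average and identifies the best data scale.

```python
# Benchmark scores per training-set size, copied from the table above
# (order: Video-MME short, Video-MME-v2, Daily-Omni, OmniVideoBench,
#  JointAVBench, FutureOmni, OmniVideo-Test).
rows = {
    "w/o SFT":        [75.56, 10.28, 62.41, 36.54, 54.44, 48.85, 42.77],
    "OmniVideo-10K":  [77.22, 13.84, 69.92, 39.10, 60.57, 52.60, 55.25],
    "OmniVideo-25K":  [76.56, 13.74, 70.68, 42.63, 60.15, 54.27, 59.80],
    "OmniVideo-50K":  [76.33, 11.38, 69.84, 42.83, 59.45, 57.60, 61.58],
    "OmniVideo-75K":  [76.89, 10.47, 72.26, 41.26, 61.03, 55.73, 62.57],
    "OmniVideo-100K": [76.33, 8.50, 69.84, 39.88, 60.75, 55.00, 60.59],
}

# Unweighted mean over the seven benchmarks, rounded like the table.
averages = {name: round(sum(s) / len(s), 2) for name, s in rows.items()}

# Data scale with the highest average score.
best = max(averages, key=averages.get)

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
print("best:", best)
```

Much of the drop from 75K to 100K in the average comes from Video-MME-v2, Daily-Omni, and OmniVideo-Test, each of which falls by about two points or more.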

Case studies

Qualitative Analysis on OmniVideo-Test

Cases demonstrating improvements in fine-grained temporal alignment and joint cross-modality understanding.

Qualitative comparison for case 1
Case 1: grounding a visual gesture in the simultaneous spoken discussion.

Visual evidence

The bald pundit performs a visual gesture, reaching up to remove the glasses resting on his forehead.

Audio evidence

Simultaneously, the speech mentions that the club may receive a decision in the coming weeks because they are complaining.

Baseline failure

Qwen2.5-Omni relies solely on the visual action described in the question, offering a speculative guess that he wanted to make a point or see better, without grounding the answer in the simultaneous audio.

Qualitative comparison for case 2
Case 2: linking a speaker gesture to the specific narrated helmet detail.

Visual evidence

The speaker raises his right hand to gesture near his head, mimicking the location of a dent on a helmet.

Audio evidence

Simultaneously, the speech recounts the specific detail of a man returning with a deep dent in his military helmet.

Baseline failure

Qwen2.5-Omni fails in fine-grained temporal alignment, capturing only a nearby audio phrase ("father was a soldier") to offer a speculative emotional interpretation instead of linking the gesture to the exact concurrent words.

Qualitative comparison for case 3
Case 3: interpreting advice through both interview dynamics and visual emphasis.

Visual evidence

The man in blue nods and gestures emphatically.

Audio evidence

The player in black admits that he was terrible and then the man in blue says players must not let their guard down because predators are watching.

Baseline failure

Qwen2.5-Omni fails to associate the preceding segment, missing the player's admission ("been terrible") and isolating the pundit's subsequent advice about predators.

Citation

Cite OmniVideo-100K

Use the following BibTeX for the OmniVideo-100K project and dataset.

@article{cai2026omnivideo100k,
  title={OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains},
  author={Cai, Xinyue and Fu, Chaoyou and Zhang, Yi-Fan and He, Ran and Shan, Caifeng},
  journal={arXiv preprint arXiv:2606.14702},
  year={2026}
}