VISD: Enhancing Video Reasoning via Structured Self-Distillation

Hao Lin1,*, Kunyang Lv2,*, Xu Jiang3, Jingqi Tian4, Zhongjing Du3, Jiayu Ding3,
Qiaoman Zhang3, Hongbo Jin3,†

1HUST · 2Wuhan University · 3Peking University · 4Tsinghua University

TL;DR

VISD is a structured self-distillation framework that enhances video reasoning by providing diagnostically grounded token-level supervision during training. It leverages a video-aware judge to decompose errors and uses this structured feedback to guide a teacher policy for fine-grained credit assignment. Combining a set of stable self-distillation training paradigms tailored for video reasoning, VISD achieves strong performance and high training efficiency.

Abstract

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we adopt a direction–magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2× faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.

Motivation

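To alleviate the sparse-reward limitation of RLVR, self-distillation offers a natural way to provide dense token-level supervision. However, existing methods usually treat additional supervision as unstructured or modality-agnostic signals. When directly applied to VideoLLMs, this leads to two key challenges: the token-level supervision lacks diagnostic specificity about spatio-temporal and logical errors, and self-distillation interacts unstably with reward-driven optimization.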

VISD introduces a video-aware judge that produces structured, diagnostic feedback to guide token-level credit assignment. It further incorporates direction-magnitude decoupling, curriculum scheduling, top-K local support, and EMA stabilization, forming a stable self-distillation framework that better captures spatio-temporal evidence in video reasoning.

Motivation figure placeholder

Method Overview

To address the lack of diagnostic specificity in token-level supervision, we introduce a video-aware judge that generates diagnostically grounded feedback based on the student's response and privileged information. This feedback explicitly identifies spatio-temporal and logical errors in the reasoning process, enabling the teacher to provide more accurate token-level supervision and better capture key reasoning cues during self-distillation.
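As a rough illustration of the structured feedback described above, the sketch below packs one rollout's diagnosis into a small record and assembles the privileged context a teacher could be conditioned on. The names `JudgeFeedback` and `build_teacher_prompt`, the field layout, and the use of a reference answer as privileged information are all assumptions made for illustration; the actual prompt format used by VISD is not specified here.

```python
from dataclasses import dataclass


@dataclass
class JudgeFeedback:
    """Hypothetical container for the judge's diagnosis of one student rollout."""
    answer_correct: bool             # answer correctness dimension
    logical_consistency: str         # free-text diagnosis of reasoning errors
    spatiotemporal_grounding: str    # free-text diagnosis of when/where errors


def build_teacher_prompt(question, student_response, reference_answer, feedback):
    """Assemble the privileged context shown to the teacher (assumed layout).

    The student never sees this text; only the teacher is conditioned on it
    when it re-scores the student's tokens.
    """
    verdict = "correct" if feedback.answer_correct else "incorrect"
    return "\n".join([
        f"Question: {question}",
        f"Reference answer (privileged): {reference_answer}",
        f"Student response: {student_response}",
        f"Judge verdict: {verdict}",
        f"Logical consistency: {feedback.logical_consistency}",
        f"Spatio-temporal grounding: {feedback.spatiotemporal_grounding}",
    ])


if __name__ == "__main__":
    fb = JudgeFeedback(
        answer_correct=False,
        logical_consistency="Conclusion does not follow from the cited frames.",
        spatiotemporal_grounding="Referenced time span does not contain the event.",
    )
    print(build_teacher_prompt("What is above the child?", "A lamp.", "A kite.", fb))
```

Keeping the three dimensions as separate fields, rather than a single score, is what lets two rollouts for the same question receive different diagnoses.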

To address the instability between self-distillation and reward-driven optimization, we adopt a direction-magnitude decoupled optimization strategy. The rollout-level reward determines the update direction, while the discrepancy between teacher and student token distributions controls the update magnitude. To further stabilize training, we compute token-level differences on a top-K local support, apply curriculum scheduling that gradually anneals to standard GRPO, and maintain the teacher using an EMA update. Together, these designs ensure stable and efficient self-distillation for video reasoning.
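The pieces above can be sketched in a few lines of plain Python. This is one plausible instantiation, not the authors' implementation: the total-variation discrepancy, the multiplicative form `1 + lam * d`, the linear curriculum, and every function name are assumptions chosen so that the sign of each token's advantage always equals the sign of the rollout advantage and the scheme reduces to standard GRPO when the coefficient reaches zero.

```python
import math


def group_advantages(rewards):
    """GRPO-style rollout advantages: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-6) for r in rewards]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def topk_discrepancy(student_logits, teacher_logits, k):
    """Teacher-student gap at one position, restricted to the teacher's top-k tokens.

    Assumed measure: total-variation distance between the two distributions
    after renormalizing over the top-k support, so the value lies in [0, 1].
    """
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    support = sorted(range(len(p_t)), key=lambda i: p_t[i], reverse=True)[:k]
    z_s = sum(p_s[i] for i in support)
    z_t = sum(p_t[i] for i in support)
    return 0.5 * sum(abs(p_t[i] / z_t - p_s[i] / z_s) for i in support)


def curriculum_coeff(step, total_steps, lam0=1.0):
    """Assumed linear schedule: lam0 at step 0, decaying to 0 (plain GRPO)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam0 * (1.0 - frac)


def token_advantages(rollout_adv, discrepancies, lam):
    """Direction from the rollout advantage, magnitude from the discrepancy.

    The factor (1 + lam * d) is non-negative, so it never flips the sign.
    """
    return [rollout_adv * (1.0 + lam * d) for d in discrepancies]


def ema_update(teacher_params, student_params, decay=0.99):
    """Move teacher parameters slowly toward the student's."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


if __name__ == "__main__":
    advs = group_advantages([1.0, 0.0])          # two rollouts, one rewarded
    d = [topk_discrepancy([0.0, 2.0, -1.0], [2.0, 0.0, -1.0], k=2),
         topk_discrepancy([2.0, 0.0, -1.0], [2.0, 0.0, -1.0], k=2)]
    lam = curriculum_coeff(step=10, total_steps=100)
    print(token_advantages(advs[1], d, lam))     # both entries stay negative
```

Because the magnitude term only rescales a shared rollout advantage, tokens where the teacher and student already agree fall back to the ordinary GRPO update, while tokens the feedback-conditioned teacher disagrees with receive a larger step in the same direction.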

Method overview placeholder

Experimental Results

Main results placeholder
Table 1. V-STAR results. Chain1 denotes what-when-where, while Chain2 denotes what-where-when. * indicates rows evaluated by us. VISD improves answer accuracy over Qwen2.5-VL-7B by +28.4 points and obtains the best overall V-STAR scores, improving mAM/mLGM over VisionCoach from 34.3/47.5 to 35.1/48.9.
Main results placeholder
Table 2. General benchmark results. "LRR" refers to LongVideo-Reason-eval. The Average score is computed over the bold-faced dataset-level metrics. * indicates rows evaluated by us; for rows marked with , the WorldSense scores are evaluated by us.
Main results placeholder
Table 3. Video-MME-v2 results. We report the official non-linear score, level-wise score, consistency, coherence, and average accuracy using 64 sampled frames without subtitles. Beyond V-STAR, VISD also improves the average score across broader video reasoning benchmarks.

Analysis

Ablation placeholder
Trajectory-dependent judge feedback. For the same video question, the judge provides different evaluations and diagnoses for different student rollouts. The feedback is used as privileged information for teacher replay rather than as a replacement for the reinforcement reward.
Case study placeholder
Visualization. For spatial relation reasoning, VISD accurately localizes the queried child and identifies the object positioned above him, providing precise visual evidence while avoiding confusion with nearby objects. In contrast, related video reasoning models either give incorrect answers or rely on incomplete spatial grounding.
Case study placeholder
Visualization. For temporal action reasoning, VISD grounds the panda and bucket across relevant frames and correctly infers that the panda is putting itself in the bucket. Competing models miss this action transition and produce incorrect answers.
Case study placeholder
Answer-only vs. feedback-conditioned token credit on the same rollout. Feedback selectively adjusts token-level evidence at reasoning and grounding positions, rather than uniformly shifting all tokens, illustrating how VISD redistributes update magnitudes while preserving the reward-driven update direction.

Citation

@article{lin2026visd,
  title={VISD: Enhancing Video Reasoning via Structured Self-Distillation},
  author={Lin, Hao and Lv, Kunyang and Jiang, Xu and Tian, Jingqi and Du, Zhongjing and Ding, Jiayu and Zhang, Qiaoman and Jin, Hongbo},
  journal={arXiv preprint arXiv:2605.06094},
  year={2026}
}