VISD: Enhancing Video Reasoning via Structured Self-Distillation

Hao Lin^1,*, Kunyang Lv^2,*, Xu Jiang³, Jingqi Tian⁴, Zhongjing Du³, Jiayu Ding³,
Qiaoman Zhang³, Hongbo Jin^3,†

¹HUST · ²Wuhan University · ³Peking University · ⁴Tsinghua University

TL;DR

VISD is a structured self-distillation framework that enhances video reasoning by providing diagnostically grounded token-level supervision during training. It leverages a video-aware judge to decompose errors and uses this structured feedback to guide a teacher policy for fine-grained credit assignment. Combining a set of stable self-distillation training paradigms tailored for video reasoning, VISD achieves strong performance and high training efficiency.

Abstract

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with ver ifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically mean ingful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, includ ing answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we adopt a direction–magnitude decoupling mechanism, where rollout-level advantages computed from rewards de termine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experi ments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. No tably, VISD reaches these gains with nearly 2× faster convergence in optimization steps, highlighting the effectiveness of structured self supervision in improving both performance and sample efficiency for VideoLLMs

Motivation

To alleviate the sparse-reward limitation of RLVR, self-distillation offers a natural way to provide dense token-level supervision. However, existing methods usually treat additional supervision as unstructured or modality-agnostic signals. When directly applied to VideoLLMs, this leads to two key challenges:

the supervision lacks diagnostic specificity, making it hard to distinguish logical inconsistency from spatio-temporal grounding failures
the auxiliary teacher signal may interact unstably with reward-driven optimization.

VISD introduces a video-aware judge that produces structured, diagnostic feedback to guide token-level credit assignment. It further incorporates direction-magnitude decoupling, curriculum scheduling, top-K local support, and EMA stabilization, forming a stable self-distillation framework that better captures spatio-temporal evidence in video reasoning.

Method Overview

To address the lack of diagnostic specificity in token-level supervision, we introduce a video-aware judge that generates diagnostically grounded feedback based on the student's response and privileged information. This feedback explicitly identifies spatio-temporal and logical errors in the reasoning process, enabling the teacher to provide more accurate token-level supervision and better capture key reasoning cues during self-distillation.

To address the instability between self-distillation and reward-driven optimization, we adopt a direction-magnitude decoupled optimization strategy. The rollout-level reward determines the update direction, while the discrepancy between teacher and student token distributions controls the update magnitude. To further stabilize training, we compute token-level differences on a top-K local support, apply curriculum scheduling that gradually anneals to standard GRPO, and maintain the teacher using an EMA update. Together, these designs ensure stable and efficient self-distillation for video reasoning.

Experimental Results

Main results placeholder — **Table1.** V-STAR results. Chain1 denotes what-when-where, while Chain2 denotes what-where-when. ^* indicates rows evaluated by us. VISD improves answer accuracy over Qwen2.5-VL-7B by +28.4 points and obtains the best overall V-STAR scores, improving mAM/mLGM over VisionCoach from 34.3/47.5 to 35.1/48.9.

Analysis

Ablation placeholder — **Trajectory-dependent judge feedback.** For the same video question, the judge provides different evaluations and diagnoses for different student rollouts. The feedback is used as privileged information for teacher replay rather than as a replacement for the reinforcement reward.

Case study placeholder — **Visualization.** For spatial relation reasoning, VISD accurately localizes the queried child and identifies the object positioned above him, providing precise visual evidence while avoiding confusion with nearby objects. In contrast, related video reasoning models either give incorrect answers or rely on incomplete spatial grounding.

Citation

@article{lin2026visd,
  title={VISD: Enhancing Video Reasoning via Structured Self-Distillation},
  author={Lin, Hao and Lv, Kunyang and Jiang, Xu and Tian, Jingqi and Du, Zhongjing and Ding, Jiayu and Zhang, Qiaoman and Jin, Hongbo},
  journal={arXiv preprint arXiv:2605.06094},
  year={2026}
}