Computer Science > Computer Vision and Pattern Recognition

[Submitted on 22 Jan 2026]

Title: A Mechanistic View on Video Generation as World Models: State and Dynamics

Authors: Luozhou Wang, Zhifei Chen, Yihua Du, Dongyu Yan, Wenhang Ge, Guibao Shen, Xinli Xu, Leyi Wu, Man Chen, Tianshuo Xu, Peiran Ren, Xin Tao, Pengfei Wan, Ying-Cong Chen
Abstract: Large-scale video generation models have demonstrated emergent physical coherence, positioning them as potential world models. However, a gap remains between contemporary "stateless" video architectures and classic state-centric world model theories. This work bridges this gap by proposing a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling. We categorize state construction into implicit paradigms (context management) and explicit paradigms (latent compression), while dynamics modeling is analyzed through knowledge integration and architectural reformulation. Furthermore, we advocate for a transition in evaluation from visual fidelity to functional benchmarks, testing physical persistence and causal reasoning. We conclude by identifying two critical frontiers: enhancing persistence via data-driven memory and compressed fidelity, and advancing causality through latent factor decoupling and reasoning-prior integration. By addressing these challenges, the field can evolve from generating visually plausible videos to building robust, general-purpose world simulators.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2601.17067 [cs.CV]
  (or arXiv:2601.17067v1 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2601.17067
arXiv-issued DOI via DataCite

Submission history

From: Luozhou Wang
[v1] Thu, 22 Jan 2026 19:00:18 UTC (6,932 KB)