DeepVerse

4D Autoregressive Video Generation as a World Model

Junyi Chen1,2, Haoyi Zhu2,3, Xianglong He4, Yifan Wang1,2*, Jianjun Zhou2,5*,
Wenzheng Chang2,3*, Yang Zhou2,6*, Zizun Li2,3*, Zhoujie Fu2,7,
Jiangmiao Pang2, Tong He2✉️

1Shanghai Jiao Tong University, 2Shanghai AI Lab, 3University of Science and Technology of China, 4Tsinghua University, 5Zhejiang University, 6Fudan University, 7Nanyang Technological University
*equal contribution, ✉️Corresponding Author

Key Capabilities

General Control

Employs textual representations as a more general control modality

Dynamic & Physical

Formulates dynamics-aware and physics-constrained probabilistic distributions

4D Representation

Achieves temporal coherence and long-term memory by modeling the 4D representation distribution

General Control

DeepVerse diverges from previous methodologies by eschewing controller-derived control signals; instead, it employs textual input as a universal control modality. This design offers two principal advantages: it makes maximal use of the conditional control priors inherent in the base video generation model, and textual representation serves as a more general control mechanism that extends readily across diverse controller architectures.

The character ran down the road, moving steadily forward.

physical collision

The character ran across the road, moving steadily forward.

The character ran down the road, passing by a car.

The character walked down the village path, passing by a dog and villagers.

The car is driving forward on an empty road at night.

The character walked along the dirt path towards the field.

The character rode a horse along the railway tracks through a grassy landscape.

The character walked along the tram tracks, passing by vintage cars and buildings.

The character walked along the tram tracks, moving forward through the street.

hair simulation

The character rode a horse through a dense forest, moving steadily forward between tall trees.

hair simulation

The character rode a horse along a narrow path through a forested area, moving steadily forward.

hair simulation

The character rode a horse along a narrow path through a forested area.

light simulation

The character ran along the dirt path with a flashlight, illuminating the way ahead.

physical collision

The car is driving through a futuristic city street.

The car is driving through a wet city street.

The car is driving on a road through a desert landscape.

The perspective moved forward along the sandy path, passing by wooden posts and grassy patches towards the beach.

physical collision

The car is driving down a sunny street lined with colorful buildings and festive banners.

physical collision

The car is driving down a sunny street lined with colorful buildings and festive banners.

NPCs

The character ran up the slope.

NPCs

The character ran up the slope.

Projected onto Action Control

The control signals from a controller can be mapped into textual representations, enabling DeepVerse to steer content generation through controller manipulation. This framework demonstrates robust control consistency across diverse narrative perspectives, including third-person character views, multi-avatar scenes, and first-person experiential modes. Let us begin with Wukong!
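The projection described above can be read as a lookup from raw controller inputs to text prompts that condition generation. A minimal sketch follows; the key bindings, prompt strings, and function names are illustrative assumptions, not DeepVerse's released interface.

```python
# Hypothetical sketch: projecting controller inputs onto the textual
# control modality. Bindings and prompts are assumptions for illustration.

ACTION_PROMPTS = {
    "forward": "The character ran down the road, moving steadily forward.",
    "left": "The character turned left and continued along the path.",
    "right": "The character turned right and continued along the path.",
}

KEY_BINDINGS = {"w": "forward", "a": "left", "d": "right"}


def project_to_text(key: str) -> str:
    """Map a raw controller keypress to a textual control signal."""
    action = KEY_BINDINGS.get(key.lower())
    if action is None:
        raise ValueError(f"unbound key: {key!r}")
    return ACTION_PROMPTS[action]
```

Because the generator only ever sees text, the same prompt table can be rebound to a gamepad, keyboard, or scripted agent without retraining.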

light simulation
light simulation
light simulation

4D Representation

DeepVerse enhances the model's scene comprehension by constructing a 4D representation of the environment, and our findings reveal that the 3D modality contributes significantly to preserving temporal consistency in future predictions. The comparison below uses identical observational inputs and identical action sequences.

w/o depth

w/o depth

w/o depth

w/ depth

w/ depth

w/ depth
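One concrete way to read the "w/ depth" setting above: each observation carries a depth map alongside RGB, and the autoregressive model conditions on the stacked RGB-D sequence. A minimal NumPy sketch, with all shapes and names assumed for illustration rather than taken from the DeepVerse codebase:

```python
import numpy as np


def pack_rgbd(frames_rgb: np.ndarray, frames_depth: np.ndarray) -> np.ndarray:
    """Stack RGB frames (T, H, W, 3) with depth maps (T, H, W) into a
    4-channel RGB-D sequence (T, H, W, 4) for depth-aware conditioning.
    Illustrative only; the actual model's input format may differ."""
    if frames_rgb.shape[:3] != frames_depth.shape:
        raise ValueError("RGB and depth sequences must align in (T, H, W)")
    return np.concatenate([frames_rgb, frames_depth[..., None]], axis=-1)


# Toy sequence: two 4x4 frames with a constant depth plane.
rgb = np.zeros((2, 4, 4, 3), dtype=np.float32)
depth = np.ones((2, 4, 4), dtype=np.float32)
rgbd = pack_rgbd(rgb, depth)
assert rgbd.shape == (2, 4, 4, 4)
```

The point of the ablation is that dropping the fourth channel (the "w/o depth" rows) degrades temporal consistency even under identical observations and actions.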

Generalization

Although DeepVerse is trained solely on synthetic data, it generalizes to both real-world and AI-generated scenarios.