4D Autoregressive Video Generation as a World Model
Employs textual representations as a more general control modality
Formulates dynamics-aware and physics-constrained probabilistic distributions
Achieves temporal coherence and long-term memory by modeling the 4D representation distribution
DeepVerse diverges from previous methodologies by eschewing controller-derived control signals; instead, we employ textual input as a universal control modality. This design choice offers two principal advantages. First, it makes maximal use of the conditional control priors inherent in the base video generation model. Second, textual representation serves as a more general control mechanism whose applicability extends across diverse controller architectures.
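As a hedged sketch of this design point, consider rendering raw keyboard input as a text prompt before it ever reaches the generator; the function name, key bindings, and wording below are illustrative assumptions, not the DeepVerse interface:

```python
# Sketch only: any controller's raw signal is first rendered as text,
# so the generator consumes a single conditioning modality.
# Key bindings and phrasing are hypothetical.

def keyboard_to_text(keys):
    """Map a set of pressed keys to a motion description (illustrative)."""
    parts = []
    if "W" in keys:
        parts.append("moving steadily forward")
    if "A" in keys:
        parts.append("turning left")
    if "D" in keys:
        parts.append("turning right")
    if not parts:
        return "The character stands still."
    return "The character ran down the road, " + " while ".join(parts) + "."
```

Because the output is plain text, swapping the keyboard for a gamepad or scripted agent only changes this front-end mapping, not the generative model.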
The character ran down the road, moving steadily forward.
The character ran across the road, moving steadily forward.
The character ran down the road, passing by a car.
The character walked down the village path, passing by a dog and villagers.
The car is driving forward on an empty road at night.
The character walked along the dirt path towards the field.
The character rode a horse along the railway tracks through a grassy landscape.
The character walked along the tram tracks, passing by vintage cars and buildings.
The character walked along the tram tracks, moving forward through the street.
The character rode a horse through a dense forest, moving steadily forward between tall trees.
The character rode a horse along a narrow path through a forested area, moving steadily forward.
The character rode a horse along a narrow path through a forested area.
The character ran along the dirt path with a flashlight, illuminating the way ahead.
The car is driving through a futuristic city street.
The car is driving through a wet city street.
The car is driving on a road through a desert landscape.
The perspective moved forward along the sandy path, passing by wooden posts and grassy patches towards the beach.
The car is driving down a sunny street lined with colorful buildings and festive banners.
The character ran up the slope.
The control signals from the controller can be mapped into textual representations, enabling DeepVerse to regulate content generation through controller manipulation. This framework maintains robust control consistency across diverse narrative perspectives, including third-person character depictions, multi-avatar integrations, and first-person experiential modes. Let us begin this demonstration with Wukong!
DeepVerse enhances the model's scene comprehension by constructing a 4D representation of environments, and our findings reveal that the 3D modality contributes significantly to preserving temporal consistency in future predictions. The comparative results below are generated from identical observational inputs and equivalent action sequences.
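The comparison can be pictured as two autoregressive rollouts that differ only in whether depth is carried in the state. The sketch below is a toy under stated assumptions: `toy_predictor` is a stand-in for the learned model, and the conditioning-by-concatenation is illustrative, not DeepVerse's architecture:

```python
import numpy as np

def toy_predictor(cond, action):
    """Stand-in for the learned model: returns the next (rgb, depth).
    Purely illustrative; the real model predicts these jointly."""
    rgb = cond[..., :3] * 0.9 + action
    depth = np.full(cond.shape[:-1] + (1,), 1.0)
    return rgb, depth

def rollout(frame, depth, actions, use_depth=True):
    """Autoregressive rollout whose state optionally carries depth.
    With use_depth=True, each step is conditioned on geometry as well
    as appearance, which is what constrains future predictions."""
    outs = []
    for a in actions:
        cond = np.concatenate([frame, depth], axis=-1) if use_depth else frame
        frame, depth = toy_predictor(cond, a)
        outs.append(frame)
    return outs
```

The "w/o depth" and "w/ depth" rows above correspond to running such a loop with `use_depth=False` and `use_depth=True` from the same observation and action sequence.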
w/o depth
w/o depth
w/o depth
w/ depth
w/ depth
w/ depth
Although DeepVerse is trained on synthetic data, it generalizes to both real-world and AI-generated scenarios.