Accepted to ICLR 2025
Human spatial reasoning lets people effortlessly imagine novel views of a scene together with the viewpoints from which those views would be seen. Exploring the intrinsic connection between these two modalities, views and viewpoints, is therefore a vital step toward advancing spatial intelligence.
 
                 
We introduce the Generative Spatial Transformer (GST), the first model capable of concurrently performing both novel view synthesis and relative camera pose estimation within a unified framework. Drawing inspiration from human spatial reasoning, we design GST to model the joint distribution of images and camera poses, which lets it integrate the training objectives of both tasks.
Previous methods construct unimodal target distributions for novel view synthesis and for camera pose estimation separately. GST instead models a joint distribution over the image and its corresponding camera pose. This lets us start from a single image and sample novel view images together with their corresponding camera poses.
 
Diverging from prior research, we focus on uncovering the inherent consistency between the two tasks rather than alternating between their objectives during training. We first tokenize the image and the camera's spatial position, then merge the two codebooks so that the model treats both modalities equally. We then train a generative network to model the joint distribution over these tokens. Our pipeline is shown in the figure below.
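One way to picture the merged codebook is as a single shared vocabulary in which camera token ids are offset past the image token ids. The sketch below is only an illustration under that assumption: the class `JointVocab`, the function `build_sequence`, the codebook sizes, and the observation, camera, novel view token ordering are all hypothetical and are not taken from the GST implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class JointVocab:
    """Hypothetical shared vocabulary built from two separate codebooks."""
    image_codebook_size: int
    camera_codebook_size: int

    @property
    def size(self) -> int:
        # Total number of token ids the generative model would predict over.
        return self.image_codebook_size + self.camera_codebook_size

    def image_token(self, index: int) -> int:
        if not 0 <= index < self.image_codebook_size:
            raise ValueError("image index out of range")
        return index  # image ids occupy the low range unchanged

    def camera_token(self, index: int) -> int:
        if not 0 <= index < self.camera_codebook_size:
            raise ValueError("camera index out of range")
        return self.image_codebook_size + index  # camera ids are offset

    def decode(self, token: int) -> Tuple[str, int]:
        # Map a shared-vocabulary id back to (modality, codebook index).
        if not 0 <= token < self.size:
            raise ValueError("token out of range")
        if token < self.image_codebook_size:
            return ("image", token)
        return ("camera", token - self.image_codebook_size)


def build_sequence(vocab: JointVocab, observation: List[int],
                   camera: List[int], novel_view: List[int]) -> List[int]:
    """Concatenate observation, camera, and novel-view tokens (assumed order)."""
    return ([vocab.image_token(i) for i in observation]
            + [vocab.camera_token(c) for c in camera]
            + [vocab.image_token(i) for i in novel_view])
```

With a single id space, one output layer can score image and camera tokens alike, which is one plausible reading of treating both modalities equally.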
 
Given an observed image, GST first samples multiple plausible camera poses automatically ($p(c | o)$), which are then used as conditions to generate the corresponding novel view images ($p(i | o, c)$).
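The two sampling steps follow from the chain rule applied to the joint distribution over the novel view $i$ and the camera pose $c$ given the observation $o$. This is a generic identity written in the page's notation, not a formula quoted from the paper:

$$p(i, c \mid o) = p(c \mid o)\, p(i \mid o, c)$$

Sampling $c$ from the first factor and then $i$ from the second therefore yields a sample from the joint distribution.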
We selected several highly challenging examples to test the spatial localization capabilities of GST. The selected image pairs include real-world images, images of the same subject taken under different shooting conditions, and images of the same object rendered in various artistic styles. GST demonstrated outstanding performance across all these examples.
We captured a real-world object from various angles and positions and let GST sample valid camera poses from $p(c|o)$ for each scenario. For images captured from a top-down perspective (a), GST predominantly sampled cameras with a top-down viewpoint. Similarly, for objects viewed from a frontal angle (b), GST preferentially sampled cameras with a frontal perspective. Notably, in scenarios involving obstacles (c), GST avoided the obstructions and sampled reasonable camera positions. These results, achieved without any manual intervention, further demonstrate GST's ability to comprehend the spatial layout of an observed image.
 
@misc{chen2024iiseeautoregressive,
    title={Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction},
    author={Junyi Chen and Di Huang and Weicai Ye and Wanli Ouyang and Tong He},
    year={2024},
    eprint={2410.18962},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.18962},
}