GVGEN: Text-to-3D Generation with Volumetric Representation



1Shanghai AI Lab, 2Tsinghua Shenzhen International Graduate School, 3Shanghai Jiao Tong University, 4Zhejiang University, 5VAST, 6The Chinese University of Hong Kong
*Equal Contributions. Corresponding Authors.
GVGEN achieves a fast generation speed (~7 seconds), effectively striking a balance between quality and efficiency.

Abstract

In recent years, 3D Gaussian splatting has emerged as a powerful technique for 3D reconstruction and generation, known for its fast and high-quality rendering. Nevertheless, existing methods often come with limitations, either lacking the ability to produce diverse samples or requiring prolonged inference times. To address these shortcomings, this paper introduces GVGEN, a novel diffusion-based framework designed to efficiently generate 3D Gaussian representations from text input. We propose two innovative techniques: (1) Structured Volumetric Representation. We first arrange disorganized 3D Gaussian points into a structured form, GaussianVolume. This transformation allows intricate texture details to be captured within a volume composed of a fixed number of Gaussians. To better optimize these details, we propose a unique pruning and densification method, the Candidate Pool Strategy, which enhances detail fidelity through selective optimization. (2) Coarse-to-Fine Generation Pipeline. To simplify the generation of GaussianVolumes and empower the model to produce instances with detailed 3D geometry, we propose a coarse-to-fine pipeline: it first constructs a basic geometric structure and then predicts the complete Gaussian attributes. Our framework, GVGEN, demonstrates superior performance in qualitative and quantitative assessments compared with existing 3D generation methods, while maintaining a fast generation speed (~7 seconds), effectively striking a balance between quality and efficiency.
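To make the GaussianVolume idea concrete, the sketch below stores all Gaussian attributes in one dense tensor of shape (C, R, R, R) with a fixed number of Gaussians (N = R³), and shows a candidate-pool style prune/densify step in which low-opacity Gaussians are deactivated rather than deleted, so they can be reactivated later. This is a minimal illustration, not the released implementation; the channel layout (14 channels), thresholds, and function names are assumptions.

```python
# Minimal sketch of a structured GaussianVolume plus a candidate-pool update.
# The 14-channel layout (offset 3, scale 3, rotation 4, opacity 1, color 3),
# thresholds, and names are illustrative assumptions, not the official code.
import torch

R = 32                                    # volume resolution per axis
C = 14                                    # assumed attribute channels per Gaussian
volume = torch.zeros(C, R, R, R)          # all Gaussian attributes as a structured volume

def split_attributes(vol):
    """Unpack the dense volume into per-Gaussian attribute tensors (N = R^3)."""
    flat = vol.reshape(C, -1).T                                  # (N, C)
    offset, scale, rot, opacity, color = flat.split([3, 3, 4, 1, 3], dim=1)
    return offset, scale, rot, opacity, color

def candidate_pool_step(opacity, active_mask, prune_thresh=0.005, grow_thresh=0.1):
    """Candidate-pool style update: low-opacity Gaussians are deactivated
    (kept in the pool) instead of deleted, and pool members whose opacity
    rises above a threshold are reactivated during densification."""
    opacity = opacity.squeeze(-1)
    pruned      = active_mask & (opacity < prune_thresh)         # move to candidate pool
    reactivated = (~active_mask) & (opacity > grow_thresh)       # densify from the pool
    return (active_mask & ~pruned) | reactivated

offset, scale, rot, opacity, color = split_attributes(volume)
active = torch.ones(R ** 3, dtype=torch.bool)
active = candidate_pool_step(torch.sigmoid(opacity), active)
```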


Fig. 1: Overview of GVGEN. Our framework comprises two stages. In the data pre-processing phase, we fit GaussianVolumes and extract a coarse-geometry Gaussian Distance Field (GDF) as training data. In the generation stage, we first generate the GDF via a diffusion model and then feed it into a 3D U-Net to predict the attributes of the GaussianVolume.
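The following sketch mirrors the coarse-to-fine inference path in Fig. 1: a diffusion model denoises a coarse GDF volume, which a 3D U-Net then maps to full GaussianVolume attributes. The tiny networks, naive denoising loop, step count, and channel sizes are placeholders chosen only to make the flow runnable; they do not reflect the actual architecture or sampler.

```python
# Illustrative two-stage inference sketch (not the released code):
# stage 1 denoises a coarse GDF volume, stage 2 predicts Gaussian attributes.
import torch
import torch.nn as nn

R, C_ATTR, T = 32, 14, 50       # volume resolution, assumed attribute channels, denoising steps

class TinyDenoiser3D(nn.Module):
    """Stand-in for the text-conditioned GDF diffusion backbone."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.SiLU(),
            nn.Conv3d(16, 1, 3, padding=1),
        )
    def forward(self, x, t, text_emb):
        return self.net(x)      # predicts noise; text/timestep conditioning omitted for brevity

class TinyAttrUNet3D(nn.Module):
    """Stand-in for the 3D U-Net mapping a GDF to GaussianVolume attributes."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv3d(32, C_ATTR, 3, padding=1),
        )
    def forward(self, gdf):
        return self.net(gdf)

@torch.no_grad()
def generate(text_emb, denoiser, attr_net):
    gdf = torch.randn(1, 1, R, R, R)            # start from Gaussian noise
    for t in reversed(range(T)):                # naive DDPM-style loop (placeholder update rule)
        noise_pred = denoiser(gdf, t, text_emb)
        gdf = gdf - noise_pred / T
    return attr_net(gdf)                        # (1, C_ATTR, R, R, R) GaussianVolume

volume = generate(text_emb=None, denoiser=TinyDenoiser3D(), attr_net=TinyAttrUNet3D())
```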


Text-to-3D Results

Comparisons

Generation Diversity

Integration with Existing Methods

Our generated assets can be further refined by recent optimization-based text-to-3D methods such as GSGEN; after refinement, they align more closely with the text descriptions, in both texture and geometry, than previous works. The left column shows rendering results initialized with different methods, and the right column shows rendering results after optimization with GSGEN.

GaussianVolume Fitting Results

To balance training cost against performance, we fit GaussianVolumes at a resolution of 32 as our training data.
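For a rough sense of scale, a resolution-32 volume contains 32³ = 32,768 Gaussians; assuming the same 14-channel attribute layout as in the sketch above (offset 3, scale 3, rotation 4, opacity 1, color 3), that is about 459K attribute values per asset. The channel count is an assumption; the arithmetic below is just the back-of-the-envelope check.

```python
# Back-of-the-envelope parameter count for a resolution-32 GaussianVolume,
# assuming 14 attribute channels per Gaussian (the exact layout may differ).
R, C = 32, 14
num_gaussians = R ** 3                 # 32,768 Gaussians in the volume
num_params = num_gaussians * C         # 458,752 attribute values per asset
print(num_gaussians, num_params)
```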

Visual comparisons among different GaussianVolume resolution settings and original 3D Gaussian Splatting (3DGS).

BibTeX

@article{he2024gvgen,
  title={GVGEN: Text-to-3D Generation with Volumetric Representation},
  author={He, Xianglong and Chen, Junyi and Peng, Sida and Huang, Di and Li, Yangguang and Huang, Xiaoshui and Yuan, Chun and Ouyang, Wanli and He, Tong},
  journal={arXiv preprint arXiv:2403.12957},
  year={2024}
}