Lifting 2D observations from pre-trained diffusion models into a 3D world for text-to-3D
is inherently ambiguous. 2D diffusion models learn only view-agnostic
priors and thus lack 3D knowledge during the lifting, leading to the multi-view
inconsistency problem. Our key finding is that this problem primarily stems
from geometric inconsistency, and that resolving ambiguously placed geometries
substantially mitigates the issue in the final outcomes. We therefore focus on
improving geometric consistency by enforcing that the 2D geometric priors in diffusion
models act in alignment with well-defined 3D geometries during the
lifting, which addresses the vast majority of the problem. This is realized by fine-tuning
the 2D diffusion model to be viewpoint-aware and to produce view-specific geometric
maps of canonically oriented objects, as in 3D datasets. Notably, only coarse
3D geometries are used for alignment. This “coarse” alignment not only enables
the generation of geometries free of multi-view inconsistency but also retains the
ability of 2D diffusion models to generate high-quality geometries of arbitrary objects
unseen in 3D datasets. Furthermore, our Aligned Geometric Priors (AGP) are
generic and can be seamlessly integrated into various state-of-the-art pipelines,
achieving high generalizability to unseen geometric structures and visual
appearances while greatly alleviating multi-view inconsistency, and hence
establishing a new state of the art in text-to-3D.
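The view-specific geometric maps used for alignment are canonical coordinate maps rendered from canonically oriented 3D assets: each pixel stores the object-space XYZ position of the surface point it sees, so the same object yields different, viewpoint-conditioned maps from different cameras. As a toy illustration of what such a supervision target looks like, the sketch below splats a canonically oriented point set into such a map. The point-splatting renderer, camera parameterization (azimuth/elevation on a sphere), and resolution are illustrative assumptions, not the paper's actual rendering pipeline.

```python
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    # World-to-camera rotation for a camera at `eye` looking at `target`.
    f = target - eye; f = f / np.linalg.norm(f)
    r = np.cross(f, up); r = r / np.linalg.norm(r)
    u = np.cross(r, f)
    return np.stack([r, u, -f])  # rows: right, up, backward

def canonical_coordinate_map(points, azimuth, elevation, radius=2.5,
                             res=64, focal=64.0):
    """Splat a canonically oriented point set into a viewpoint-conditioned
    canonical coordinate map: each pixel stores the normalized object-space
    XYZ of the nearest point projecting onto it."""
    # Camera position on a sphere around the object (the viewpoint condition).
    eye = radius * np.array([np.cos(elevation) * np.cos(azimuth),
                             np.cos(elevation) * np.sin(azimuth),
                             np.sin(elevation)])
    R = look_at(eye)
    cam = (points - eye) @ R.T          # points in the camera frame
    z = -cam[:, 2]                      # depth along the viewing axis
    valid = z > 1e-6                    # keep points in front of the camera
    cam, z, pts = cam[valid], z[valid], points[valid]

    # Pinhole projection to integer pixel coordinates.
    u = (focal * cam[:, 0] / z + res / 2).astype(int)
    v = (focal * -cam[:, 1] / z + res / 2).astype(int)
    inside = (u >= 0) & (u < res) & (v >= 0) & (v < res)
    u, v, z, pts = u[inside], v[inside], z[inside], pts[inside]

    # Normalize canonical XYZ to [0, 1] so the map can be stored as RGB.
    lo, hi = points.min(0), points.max(0)
    colors = (pts - lo) / np.maximum(hi - lo, 1e-8)

    # Painter's algorithm: write far-to-near so the nearest point wins.
    ccm = np.zeros((res, res, 3))
    order = np.argsort(-z)
    ccm[v[order], u[order]] = colors[order]
    return ccm

# Example: render the same canonical point set from a chosen viewpoint.
rng = np.random.default_rng(0)
pts = rng.uniform(-0.5, 0.5, size=(2000, 3))
ccm = canonical_coordinate_map(pts, azimuth=0.3, elevation=0.4)
```

Because the object is canonically oriented, a given pixel color (normalized XYZ) is tied to a fixed part of the object regardless of viewpoint, which is what makes such maps a 3D-consistent target for viewpoint-aware fine-tuning.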
Example text prompts from the gallery of generated results:
A dragon-cat hybrid
Albert Einstein with grey suit is riding a bicycle
Mini Paris, highly detailed, 8K, HD
A 3D model of mini China town, highly detailed, 8K, HD, blender 3d
A boy in mohawk hairstyle, head only, 4K, HD, raw
Fire-breathing Phoenix, mythical bird, engulfed in flames, rebirth and renewal, 3D render, 8K, HD
A bulldog wearing a black pirate hat, highly detailed
We fine-tune the 2D diffusion model (middle) to generate viewpoint-conditioned canonical coordinate maps, which are rendered from canonically oriented 3D assets (left), thereby aligning the geometric priors in the 2D diffusion model. The aligned geometric priors can then be seamlessly integrated into existing text-to-3D pipelines to confer 3D consistency (right), while retaining their generalizability to produce high-fidelity and highly varied 3D content.
@article{sweetdreamer,
  author  = {Weiyu Li and Rui Chen and Xuelin Chen and Ping Tan},
  title   = {SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D},
  journal = {arXiv preprint arXiv:2310.02596},
  year    = {2023},
}