Lifting 2D observations from pre-trained diffusion models into a 3D world for text-to-3D
is inherently ambiguous. 2D diffusion models learn only view-agnostic
priors and thus lack 3D knowledge during the lifting, leading to the multi-view
inconsistency problem. Our key finding is that this problem primarily stems
from geometric inconsistency, and that resolving ambiguously placed geometries
substantially mitigates the issue in the final outcomes. We therefore focus on
improving geometric consistency by enforcing that the 2D geometric priors in diffusion
models act in alignment with well-defined 3D geometries during the
lifting, which addresses the vast majority of the problem. This is realized by fine-tuning
the 2D diffusion model to be viewpoint-aware and to produce view-specific geometric
maps of canonically oriented objects, as in 3D datasets. Notably, only coarse
3D geometries are used for alignment. This “coarse” alignment not only enables
the generation of geometries free of multi-view inconsistency but also retains the
ability of 2D diffusion models to generate high-quality geometries of arbitrary objects
unseen in 3D datasets. Furthermore, our Aligned Geometric Priors (AGP) are
generic and can be seamlessly integrated into various state-of-the-art pipelines,
achieving high generalizability to unseen geometric structures and visual
appearances while greatly alleviating multi-view inconsistency, and hence
establishing a new state of the art in text-to-3D.
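The view-specific geometric maps used for alignment are canonical coordinate maps rendered from canonically oriented 3D assets: each pixel stores the object-space XYZ position of the surface point it sees, so the same object yields different, viewpoint-conditioned maps from different cameras. As a toy illustration of what such a supervision target looks like, the sketch below splats a canonically oriented point set into such a map. The point-splatting renderer, camera parameterization (azimuth/elevation on a sphere), and resolution are illustrative assumptions, not the paper's actual rendering pipeline.

```python
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    # World-to-camera rotation for a camera at `eye` looking at `target`.
    f = target - eye; f = f / np.linalg.norm(f)
    r = np.cross(f, up); r = r / np.linalg.norm(r)
    u = np.cross(r, f)
    return np.stack([r, u, -f])  # rows: right, up, backward

def canonical_coordinate_map(points, azimuth, elevation, radius=2.5,
                             res=64, focal=64.0):
    """Splat a canonically oriented point set into a viewpoint-conditioned
    canonical coordinate map: each pixel stores the normalized object-space
    XYZ of the nearest point projecting onto it."""
    # Camera position on a sphere around the object (the viewpoint condition).
    eye = radius * np.array([np.cos(elevation) * np.cos(azimuth),
                             np.cos(elevation) * np.sin(azimuth),
                             np.sin(elevation)])
    R = look_at(eye)
    cam = (points - eye) @ R.T          # points in the camera frame
    z = -cam[:, 2]                      # depth along the viewing axis
    valid = z > 1e-6                    # keep points in front of the camera
    cam, z, pts = cam[valid], z[valid], points[valid]

    # Pinhole projection to integer pixel coordinates.
    u = (focal * cam[:, 0] / z + res / 2).astype(int)
    v = (focal * -cam[:, 1] / z + res / 2).astype(int)
    inside = (u >= 0) & (u < res) & (v >= 0) & (v < res)
    u, v, z, pts = u[inside], v[inside], z[inside], pts[inside]

    # Normalize canonical XYZ to [0, 1] so the map can be stored as RGB.
    lo, hi = points.min(0), points.max(0)
    colors = (pts - lo) / np.maximum(hi - lo, 1e-8)

    # Painter's algorithm: write far-to-near so the nearest point wins.
    ccm = np.zeros((res, res, 3))
    order = np.argsort(-z)
    ccm[v[order], u[order]] = colors[order]
    return ccm

# Example: render the same canonical point set from a chosen viewpoint.
rng = np.random.default_rng(0)
pts = rng.uniform(-0.5, 0.5, size=(2000, 3))
ccm = canonical_coordinate_map(pts, azimuth=0.3, elevation=0.4)
```

Because the object is canonically oriented, a given pixel color (normalized XYZ) is tied to a fixed part of the object regardless of viewpoint, which is what makes such maps a 3D-consistent target for viewpoint-aware fine-tuning.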
Example text prompts from the gallery of generated results:
A dragon-cat hybrid
Albert Einstein with grey suit is riding a bicycle
Mini Paris, highly detailed, 8K, HD
A 3D model of mini China town, highly detailed, 8K, HD, blender 3d
A boy in mohawk hairstyle, head only, 4K, HD, raw
Fire-breathing Phoenix, mythical bird, engulfed in flames, rebirth and renewal, 3D render, 8K, HD
A bulldog wearing a black pirate hat, highly detailed
We fine-tune the 2D diffusion model (middle) to generate viewpoint-conditioned canonical coordinate maps, which are rendered from canonically oriented 3D assets (left), thereby aligning the geometric priors in the 2D diffusion model. The aligned geometric priors can then be seamlessly integrated into existing text-to-3D pipelines to confer 3D consistency (right), while retaining their generalizability to produce high-fidelity and highly varied 3D content.
@article{sweetdreamer,
  author  = {Weiyu Li and Rui Chen and Xuelin Chen and Ping Tan},
  title   = {SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D},
  journal = {arXiv preprint arXiv:2310.02596},
  year    = {2023},
}