
Text-to-3D

Text-to-3D generates 3D assets — meshes, NeRFs, or Gaussian splats — from text descriptions alone, a capability that barely existed before DreamFusion (2022) showed score distillation sampling could lift 2D diffusion priors into 3D. The field moves at breakneck speed: Magic3D added coarse-to-fine generation, Instant3D achieved single-shot inference, and Meshy and Tripo brought commercial quality. Multi-view consistency remains the core challenge — the "Janus problem" where different viewpoints produce contradictory details. The promise of democratizing 3D content creation for games, VR, and e-commerce is driving massive investment.


Text-to-3D generates 3D assets (meshes, NeRFs, Gaussian splats) from text descriptions. It went from impossible to demo-worthy in 2022-2024 via score distillation (DreamFusion), and practical tools are emerging for game dev and product design. Quality lags 2-3 years behind text-to-image, but the trajectory is steep.

History

2020

NeRF (Mildenhall et al.) enables photorealistic novel view synthesis from multi-view images — provides the 3D representation that text-to-3D methods will later target

2022

DreamFields (Jain et al.) first optimizes NeRF from CLIP guidance, producing basic text-aligned 3D objects

2022

DreamFusion (Poole et al.) introduces Score Distillation Sampling (SDS) — using a 2D diffusion model as a 3D optimization prior — achieving dramatically better text-to-3D results

2023

Magic3D (Lin et al.) adds coarse-to-fine optimization with DMTet meshes, producing higher-resolution textured 3D assets

2023

ProlificDreamer introduces Variational Score Distillation (VSD), fixing SDS's over-saturation and over-smoothing problems

2023

3D Gaussian Splatting (Kerbl et al.) provides a faster, more flexible 3D representation that renders 100× faster than NeRF — adopted widely for text-to-3D

2024

Instant3D and LRM (Large Reconstruction Model) generate 3D from text in seconds via feed-forward prediction instead of per-asset optimization

2024

TripoSR, InstantMesh, and SF3D produce textured meshes from text (via intermediate images) in under 10 seconds

2025

Trellis, CLAY, and Meshy v4 produce production-quality textured meshes with PBR materials; feed-forward models close the quality gap with optimization-based methods

How Text-to-3D Works

Text-to-3D Pipeline
1

Score Distillation (Optimization-based)

A 3D representation (NeRF, Gaussian splats, mesh) is rendered from random viewpoints. A pretrained 2D diffusion model scores each rendering against the text prompt, and gradients flow back to update the 3D representation. Requires 15-60 minutes per asset.
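The SDS update above can be sketched in miniature. This is a toy, stdlib-only stand-in: a real pipeline renders a NeRF or splat from a random camera and queries a frozen 2D diffusion U-Net, whereas here the "asset" is a single scalar and the "prior" simply pulls renderings toward a fixed value. The functions `render` and `prior_eps` are hypothetical placeholders, not any real API.

```python
import math, random

TARGET = 0.8  # value the frozen toy prior pulls renderings toward (stands in for the prompt)

def render(theta, viewpoint):
    # Real renderer: an image of the 3D asset from this camera.
    # Toy: a viewpoint-independent scalar "pixel".
    return theta

def prior_eps(noisy_pixel, t):
    # Frozen diffusion prior's noise prediction; its toy "score"
    # points from the noisy rendering toward TARGET.
    return noisy_pixel - TARGET

def sds_step(theta, lr=0.05):
    viewpoint = random.uniform(0.0, 2.0 * math.pi)  # random camera
    x = render(theta, viewpoint)
    t = random.uniform(0.02, 0.98)                  # random diffusion timestep
    eps = random.gauss(0.0, 1.0)                    # injected noise
    noisy = x + t * eps                             # toy forward diffusion
    grad = prior_eps(noisy, t) - eps                # SDS gradient: eps_hat - eps
    return theta - lr * grad

random.seed(0)
theta = 0.0
for _ in range(500):
    theta = sds_step(theta)
# theta has drifted toward the prior's preferred value
```

The key structural point survives the simplification: the gradient is the gap between the prior's noise prediction and the injected noise, so the 3D parameters are nudged toward whatever the 2D prior considers likely for the prompt, one random view at a time.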

2

Feed-Forward Generation

A trained model directly predicts 3D representations (triplane features, point clouds, Gaussian parameters) from text or images in a single forward pass. LRM and TripoSR achieve this in 1-10 seconds by training on large 3D datasets (Objaverse).
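One of the representations mentioned above, triplane features, can be illustrated with a small sketch. Assumptions: real models predict learned planes with a transformer, sample them bilinearly, and decode features through an MLP; this toy uses random planes and nearest-neighbor lookup just to show how a 3D query decomposes into three 2D lookups.

```python
import random

R, C = 8, 4  # plane resolution and feature channels (toy sizes)
random.seed(0)
# Three axis-aligned feature planes: XY, XZ, YZ.
planes = {ax: [[[random.random() for _ in range(C)]
                for _ in range(R)] for _ in range(R)]
          for ax in ("xy", "xz", "yz")}

def sample(plane, u, v):
    # Nearest-neighbor lookup; u, v in [0, 1). Real models interpolate bilinearly.
    return plane[min(int(u * R), R - 1)][min(int(v * R), R - 1)]

def features(x, y, z):
    # A 3D point's feature is the sum of its three axis-aligned projections.
    f_xy = sample(planes["xy"], x, y)
    f_xz = sample(planes["xz"], x, z)
    f_yz = sample(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]

feat = features(0.5, 0.25, 0.75)  # C feature channels for one query point
```

This factorization is why triplanes are cheap: storage grows with plane area rather than volume, and one forward pass of the generator yields a queryable field.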

3

Multi-View Diffusion

Models like MVDream and Zero123++ generate consistent multi-view images from text, then reconstruct 3D from these views. This bypasses the Janus problem (different faces on front/back) that plagues single-view SDS.

4

Mesh Extraction + Texturing

Marching cubes extracts a mesh from the density field. PBR textures (albedo, roughness, metallic) are baked either via the same diffusion prior or a dedicated texture generation model. The result is a game/rendering-ready asset.
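A heavily simplified stand-in for the extraction step: threshold a density field and collect the voxel faces on the iso-boundary. Real pipelines run marching cubes (e.g. `skimage.measure.marching_cubes`) to get a smooth triangle mesh; this toy only shows where a surface comes from in a density field.

```python
N = 16       # voxel grid resolution
LEVEL = 0.5  # density threshold (the iso-level)

def density(i, j, k):
    # Toy density field: a solid sphere centered in the grid.
    c = (N - 1) / 2
    r2 = (i - c) ** 2 + (j - c) ** 2 + (k - c) ** 2
    return 1.0 if r2 <= (N / 3) ** 2 else 0.0

def inside(i, j, k):
    return 0 <= i < N and 0 <= j < N and 0 <= k < N and density(i, j, k) > LEVEL

faces = []
for i in range(N):
    for j in range(N):
        for k in range(N):
            if not inside(i, j, k):
                continue
            for di, dj, dk in [(1,0,0), (-1,0,0), (0,1,0),
                               (0,-1,0), (0,0,1), (0,0,-1)]:
                if not inside(i + di, j + dj, k + dk):
                    # This voxel face sits on the density boundary.
                    faces.append(((i, j, k), (di, dj, dk)))
```

Marching cubes improves on this blocky surface by interpolating where the iso-level crosses each cube edge, which is why its meshes are smooth; the texture-baking stage then operates on that mesh's UV layout.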

5

Evaluation

CLIP similarity between renderings and text prompt is the automated metric, but human preference is the real measure. Geometry quality is assessed by mesh smoothness, watertightness, and triangle count. There's no universally accepted benchmark yet.
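The CLIP-score computation reduces to cosine similarity between embeddings, averaged over rendered views. A minimal sketch, assuming the embeddings have already been produced by a CLIP model (e.g. via the open_clip library); the vectors below are made-up stand-ins, not real CLIP outputs.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

text_emb = [0.2, 0.9, 0.1, 0.4]      # hypothetical CLIP text embedding
render_embs = [                       # embeddings of renders from several viewpoints
    [0.25, 0.85, 0.15, 0.35],
    [0.10, 0.70, 0.40, 0.30],
]
clip_score = sum(cosine(e, text_emb) for e in render_embs) / len(render_embs)
```

Averaging over many viewpoints matters: a Janus-afflicted asset can score well from its "good" view, so single-view CLIP similarity overstates quality.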

Current Landscape

Text-to-3D in 2025 is at the 'early useful' stage — roughly where text-to-image was in early 2022. Feed-forward models (Trellis, InstantMesh, LRM) have solved the speed problem but still produce lower-quality geometry than optimization-based methods. The Objaverse dataset (800K 3D objects) has been the critical training resource. The practical workflow is typically text→image→3D rather than text→3D directly, because image generation is more controllable. Game studios and e-commerce platforms are the earliest adopters, using AI-generated 3D as starting points that artists refine.

Key Challenges

The Janus problem — SDS-based methods tend to generate objects with multiple faces (e.g., a face on both front and back of a head) because the 2D prior doesn't enforce 3D consistency

Geometry quality — generated meshes are often blobby, lack sharp edges, and have noisy topology; producing clean, game-ready meshes still requires manual cleanup

Texture bleeding and seams — UV mapping and texture baking from 3D representations produce artifacts at mesh boundaries

Generation speed — optimization-based methods take 30-60 minutes per asset; feed-forward models are fast but lower quality

PBR materials — most methods produce only an albedo texture, not the physically based rendering maps (roughness, metallic, normal) that modern game engines require

Quick Recommendations

Best quality (patient)

ProlificDreamer or DreamCraft3D

VSD produces the highest-fidelity text-to-3D via hours of optimization; best for hero assets

Fast generation (seconds)

Trellis or TripoSR (via text-to-image first)

Generate a concept image with FLUX, then reconstruct 3D in 5-10 seconds; practical for iteration

Game-ready meshes

Meshy v4 or CLAY

Produce textured meshes with PBR materials optimized for game engines; closest to production quality

Batch asset generation

InstantMesh or LRM-based pipeline

Feed-forward inference at scale; generate hundreds of 3D assets per hour for virtual worlds

Research / customization

ThreeStudio framework

Modular implementation of DreamFusion, Magic3D, ProlificDreamer — swap representations, loss functions, and models easily

What's Next

The next 1-2 years will focus on: PBR material generation (not just albedo textures), rigging and animation (currently a separate manual step), scene generation (rooms, environments, not just objects), and multi-object compositional generation. The endgame is 'describe a game level in text, get a playable 3D environment' — we're 3-5 years from that being practical. 3D Gaussian splatting is increasingly the default representation, replacing NeRFs for generation tasks.
