Text-to-3D
Text-to-3D generates 3D assets (meshes, NeRFs, or Gaussian splats) from text descriptions alone, a capability that barely existed before DreamFusion (2022) showed that score distillation sampling could lift 2D diffusion priors into 3D. The field moves at breakneck speed: Magic3D added coarse-to-fine generation, Instant3D achieved feed-forward inference in seconds, and commercial tools such as Meshy and Tripo are now practical for game development and product design. Multi-view consistency remains the core challenge: the "Janus problem," where different viewpoints produce contradictory details. Quality still lags text-to-image by two to three years, but the trajectory is steep, and the promise of democratizing 3D content creation for games, VR, and e-commerce is driving massive investment.
History
NeRF (Mildenhall et al., 2020) enables photorealistic novel view synthesis from multi-view images, providing the 3D representation that text-to-3D methods would later target
DreamFields (Jain et al., 2022) first optimizes a NeRF under CLIP guidance, producing basic text-aligned 3D objects
DreamFusion (Poole et al., 2022) introduces Score Distillation Sampling (SDS), which uses a pretrained 2D diffusion model as a 3D optimization prior and achieves dramatically better text-to-3D results
Magic3D (Lin et al., 2023) adds coarse-to-fine optimization with DMTet meshes, producing higher-resolution textured 3D assets
ProlificDreamer (Wang et al., 2023) introduces Variational Score Distillation (VSD), fixing SDS's over-saturation and over-smoothing problems
3D Gaussian Splatting (Kerbl et al., 2023) provides a faster, more flexible 3D representation that renders roughly 100× faster than NeRF; it is widely adopted for text-to-3D
Instant3D and LRM (Large Reconstruction Model), both 2023, generate 3D in seconds via feed-forward prediction instead of per-asset optimization
TripoSR, InstantMesh, and SF3D (2024) produce textured meshes from text (via intermediate generated images) in under 10 seconds
Trellis, CLAY, and Meshy v4 (2024) produce production-quality textured meshes with PBR materials; feed-forward models close the quality gap with optimization-based methods
How Text-to-3D Works
Score Distillation (Optimization-based)
A 3D representation (NeRF, Gaussian splats, or mesh) is rendered from random viewpoints. A pretrained 2D diffusion model scores each rendering against the text prompt, and gradients flow back to update the 3D representation. This typically takes 15-60 minutes per asset on a single GPU.
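The optimization loop described above can be sketched in a few lines. This is a toy numpy illustration, not a working pipeline: `render` stands in for a differentiable renderer and `diffusion_eps` for a frozen pretrained 2D diffusion model's noise prediction, both hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(theta, view):
    # Placeholder for a differentiable renderer: maps the 3D
    # parameters plus a viewpoint to a flat "image".
    return np.tanh(theta + 0.1 * view)

def diffusion_eps(noisy_image, t, prompt_embedding):
    # Placeholder for a frozen 2D diffusion model's noise prediction
    # conditioned on the text prompt; a real model (e.g. Stable
    # Diffusion) would be called here.
    return noisy_image * 0.1 + prompt_embedding * 0.05

theta = rng.normal(size=(8, 8))            # parameters of the 3D representation
prompt_embedding = rng.normal(size=(8, 8)) # stand-in for the encoded text prompt

for step in range(100):
    view = rng.normal(size=(8, 8))         # random camera viewpoint
    image = render(theta, view)

    t = rng.uniform(0.02, 0.98)            # random diffusion timestep
    eps = rng.normal(size=image.shape)     # sampled Gaussian noise
    noisy = np.sqrt(1 - t) * image + np.sqrt(t) * eps

    # SDS update: weighted residual between predicted and true noise.
    # In the SDS approximation the diffusion model's Jacobian is
    # dropped, so the residual is applied to theta directly.
    w = 1.0 - t
    grad = w * (diffusion_eps(noisy, t, prompt_embedding) - eps)
    theta -= 0.01 * grad

print(theta.shape)
```

With a real diffusion model the residual steadily pulls the rendered views toward images the model finds likely under the prompt, which is what lifts the 2D prior into 3D.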
Feed-Forward Generation
A trained model directly predicts 3D representations (triplane features, point clouds, Gaussian parameters) from text or images in a single forward pass. LRM and TripoSR achieve this in 1-10 seconds by training on large 3D datasets (Objaverse).
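The triplane decoding step used by LRM-style models can be illustrated with a minimal sketch: features for a 3D query point are looked up on three axis-aligned planes and summed. `sample_triplane` is a hypothetical helper, and nearest-neighbor lookup replaces the bilinear interpolation real models use, to keep the sketch short.

```python
import numpy as np

def sample_triplane(planes, point):
    """Look up features for one 3D point from three axis-aligned
    feature planes (XY, XZ, YZ) and sum them -- the decoding step of
    triplane-based feed-forward models.
    planes: (3, R, R, C) array; point: (x, y, z) in [-1, 1]."""
    _, R, _, _ = planes.shape
    # Map [-1, 1] coordinates to grid indices (nearest neighbor
    # instead of bilinear interpolation, for brevity).
    idx = np.clip(((np.asarray(point) + 1) / 2 * (R - 1)).round().astype(int), 0, R - 1)
    x, y, z = idx
    return planes[0, x, y] + planes[1, x, z] + planes[2, y, z]

rng = np.random.default_rng(0)
planes = rng.normal(size=(3, 32, 32, 16))  # predicted by the feed-forward network
features = sample_triplane(planes, (0.2, -0.5, 0.7))
print(features.shape)  # one feature vector per query point
```

In a full model, the feature vector is fed to a small MLP that outputs density and color for volume rendering, or occupancy for mesh extraction.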
Multi-View Diffusion
Models like MVDream and Zero123++ generate consistent multi-view images from text, then reconstruct 3D from these views. This bypasses the Janus problem (different faces on front/back) that plagues single-view SDS.
Mesh Extraction + Texturing
Marching cubes extracts a mesh from the density field. PBR textures (albedo, roughness, metallic) are baked either via the same diffusion prior or a dedicated texture generation model. The result is a game/rendering-ready asset.
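The first step of marching cubes is classifying each grid cell by whether the iso-surface passes through it, i.e. whether its eight corners straddle the iso-level. A numpy sketch on a sphere signed distance field, omitting the triangle lookup table that the full algorithm then applies per cell:

```python
import numpy as np

# Sample a signed distance field (a sphere of radius 0.6) on a grid.
N = 32
xs = np.linspace(-1, 1, N)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
sdf = np.sqrt(X**2 + Y**2 + Z**2) - 0.6

# A cell contains surface iff its 8 corners are not all on the same
# side of the iso-level (0 here). The full algorithm then emits
# triangles per cell via a 256-entry lookup table, omitted here.
corners = np.stack([
    sdf[i:N - 1 + i, j:N - 1 + j, k:N - 1 + k]
    for i in (0, 1) for j in (0, 1) for k in (0, 1)
])
inside = corners < 0
surface_cells = inside.any(axis=0) & ~inside.all(axis=0)
print(int(surface_cells.sum()), "cells intersect the surface")
```

In practice one would call an existing implementation such as `skimage.measure.marching_cubes` on the sampled density or SDF grid rather than reimplementing the cell classification.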
Evaluation
CLIP similarity between renderings and the text prompt is the standard automated metric, but human preference remains the real measure. Geometry quality is assessed by mesh smoothness, watertightness, and triangle count. There is no universally accepted benchmark yet.
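CLIP similarity is just the cosine between embeddings of each rendering and the prompt, averaged over views. In real evaluation both embeddings come from a pretrained CLIP model; in this sketch they are placeholder vectors, with the rendering embeddings artificially correlated with the prompt embedding.

```python
import numpy as np

def clip_score(render_embs, text_emb):
    """Average cosine similarity between per-view rendering embeddings
    and the text-prompt embedding. In practice both come from a
    pretrained CLIP model; here they are placeholders."""
    r = render_embs / np.linalg.norm(render_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return float((r @ t).mean())

rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)
# Embeddings of renders from several viewpoints, correlated with the
# prompt embedding so the toy score lands well above chance.
render_embs = 0.7 * text_emb + 0.3 * rng.normal(size=(8, 512))
print(round(clip_score(render_embs, text_emb), 3))
```

Averaging over many viewpoints matters: a Janus-afflicted asset can score well from one view while scoring poorly from others.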
Current Landscape
Text-to-3D in 2025 is at the 'early useful' stage — roughly where text-to-image was in early 2022. Feed-forward models (Trellis, InstantMesh, LRM) have solved the speed problem but still produce lower-quality geometry than optimization-based methods. The Objaverse dataset (800K 3D objects) has been the critical training resource. The practical workflow is typically text→image→3D rather than text→3D directly, because image generation is more controllable. Game studios and e-commerce platforms are the earliest adopters, using AI-generated 3D as starting points that artists refine.
Key Challenges
The Janus problem — SDS-based methods tend to generate objects with multiple faces (e.g., a face on both front and back of a head) because the 2D prior doesn't enforce 3D consistency
Geometry quality — generated meshes are often blobby, lack sharp edges, and have noisy topology; extracting clean, game-ready meshes remains manual-labor-intensive
Texture bleeding and seams — UV mapping and texture baking from 3D representations produce artifacts at mesh boundaries
Generation speed — optimization-based methods take 30-60 minutes per asset; feed-forward models are fast but lower quality
PBR materials — most methods produce only an albedo texture, not the physically-based rendering maps (roughness, metallic, normal) that modern game engines need
Quick Recommendations
Best quality (patient)
ProlificDreamer or DreamCraft3D
VSD produces the highest-fidelity text-to-3D via hours of optimization; best for hero assets
Fast generation (seconds)
Trellis or TripoSR (via text-to-image first)
Generate a concept image with FLUX, then reconstruct 3D in 5-10 seconds; practical for iteration
Game-ready meshes
Meshy v4 or CLAY
Produce textured meshes with PBR materials optimized for game engines; closest to production quality
Batch asset generation
InstantMesh or LRM-based pipeline
Feed-forward inference at scale; generate hundreds of 3D assets per hour for virtual worlds
Research / customization
ThreeStudio framework
Modular implementation of DreamFusion, Magic3D, ProlificDreamer — swap representations, loss functions, and models easily
What's Next
The next 1-2 years will focus on: PBR material generation (not just albedo textures), rigging and animation (currently a separate manual step), scene generation (rooms, environments, not just objects), and multi-object compositional generation. The endgame is 'describe a game level in text, get a playable 3D environment' — we're 3-5 years from that being practical. 3D Gaussian splatting is increasingly the default representation, replacing NeRFs for generation tasks.