Text-to-3D
Text-to-3D generates 3D assets (meshes, NeRFs, or Gaussian splats) from text descriptions alone, a capability that barely existed before DreamFusion (2022) showed that score distillation sampling could lift 2D diffusion priors into 3D. The field moves at breakneck speed: Magic3D added coarse-to-fine generation, Instant3D achieved feed-forward inference in seconds, and commercial tools such as Meshy and Tripo are now practical for game development and product design. Multi-view consistency remains the core challenge: the "Janus problem," where different viewpoints produce contradictory details. Quality still lags text-to-image by two to three years, but the trajectory is steep, and the promise of democratizing 3D content creation for games, VR, and e-commerce is driving massive investment.
History
NeRF (Mildenhall et al., 2020) enables photorealistic novel view synthesis from multi-view images, providing the 3D representation that text-to-3D methods would later target
DreamFields (Jain et al., 2022) first optimizes a NeRF under CLIP guidance, producing basic text-aligned 3D objects
DreamFusion (Poole et al., 2022) introduces Score Distillation Sampling (SDS), which uses a pretrained 2D diffusion model as a 3D optimization prior and achieves dramatically better text-to-3D results
Magic3D (Lin et al., 2023) adds coarse-to-fine optimization with DMTet meshes, producing higher-resolution textured 3D assets
ProlificDreamer (Wang et al., 2023) introduces Variational Score Distillation (VSD), fixing SDS's over-saturation and over-smoothing problems
3D Gaussian Splatting (Kerbl et al., 2023) provides a faster, more flexible 3D representation that renders roughly 100× faster than NeRF; it is widely adopted for text-to-3D
Instant3D and LRM (Large Reconstruction Model), both 2023, generate 3D in seconds via feed-forward prediction instead of per-asset optimization
TripoSR, InstantMesh, and SF3D (2024) produce textured meshes from text (via intermediate generated images) in under 10 seconds
Trellis, CLAY, and Meshy v4 (2024) produce production-quality textured meshes with PBR materials; feed-forward models close the quality gap with optimization-based methods
How Text-to-3D Works
Score Distillation (Optimization-based)
A 3D representation (NeRF, Gaussian splats, or mesh) is rendered from random viewpoints. A pretrained 2D diffusion model scores each rendering against the text prompt, and gradients flow back to update the 3D representation. This typically takes 15-60 minutes per asset on a single GPU.
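The optimization loop described above can be sketched in a few lines. This is a toy numpy illustration, not a working pipeline: `render` stands in for a differentiable renderer and `diffusion_eps` for a frozen pretrained 2D diffusion model's noise prediction, both hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(theta, view):
    # Placeholder for a differentiable renderer: maps the 3D
    # parameters plus a viewpoint to a flat "image".
    return np.tanh(theta + 0.1 * view)

def diffusion_eps(noisy_image, t, prompt_embedding):
    # Placeholder for a frozen 2D diffusion model's noise prediction
    # conditioned on the text prompt; a real model (e.g. Stable
    # Diffusion) would be called here.
    return noisy_image * 0.1 + prompt_embedding * 0.05

theta = rng.normal(size=(8, 8))            # parameters of the 3D representation
prompt_embedding = rng.normal(size=(8, 8)) # stand-in for the encoded text prompt

for step in range(100):
    view = rng.normal(size=(8, 8))         # random camera viewpoint
    image = render(theta, view)

    t = rng.uniform(0.02, 0.98)            # random diffusion timestep
    eps = rng.normal(size=image.shape)     # sampled Gaussian noise
    noisy = np.sqrt(1 - t) * image + np.sqrt(t) * eps

    # SDS update: weighted residual between predicted and true noise.
    # In the SDS approximation the diffusion model's Jacobian is
    # dropped, so the residual is applied to theta directly.
    w = 1.0 - t
    grad = w * (diffusion_eps(noisy, t, prompt_embedding) - eps)
    theta -= 0.01 * grad

print(theta.shape)
```

With a real diffusion model the residual steadily pulls the rendered views toward images the model finds likely under the prompt, which is what lifts the 2D prior into 3D.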
Feed-Forward Generation
A trained model directly predicts 3D representations (triplane features, point clouds, Gaussian parameters) from text or images in a single forward pass. LRM and TripoSR achieve this in 1-10 seconds by training on large 3D datasets (Objaverse).
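The triplane decoding step used by LRM-style models can be illustrated with a minimal sketch: features for a 3D query point are looked up on three axis-aligned planes and summed. `sample_triplane` is a hypothetical helper, and nearest-neighbor lookup replaces the bilinear interpolation real models use, to keep the sketch short.

```python
import numpy as np

def sample_triplane(planes, point):
    """Look up features for one 3D point from three axis-aligned
    feature planes (XY, XZ, YZ) and sum them -- the decoding step of
    triplane-based feed-forward models.
    planes: (3, R, R, C) array; point: (x, y, z) in [-1, 1]."""
    _, R, _, _ = planes.shape
    # Map [-1, 1] coordinates to grid indices (nearest neighbor
    # instead of bilinear interpolation, for brevity).
    idx = np.clip(((np.asarray(point) + 1) / 2 * (R - 1)).round().astype(int), 0, R - 1)
    x, y, z = idx
    return planes[0, x, y] + planes[1, x, z] + planes[2, y, z]

rng = np.random.default_rng(0)
planes = rng.normal(size=(3, 32, 32, 16))  # predicted by the feed-forward network
features = sample_triplane(planes, (0.2, -0.5, 0.7))
print(features.shape)  # one feature vector per query point
```

In a full model, the feature vector is fed to a small MLP that outputs density and color for volume rendering, or occupancy for mesh extraction.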
Multi-View Diffusion
Models like MVDream and Zero123++ generate consistent multi-view images from text, then reconstruct 3D from these views. This bypasses the Janus problem (different faces on front/back) that plagues single-view SDS.
Mesh Extraction + Texturing
Marching cubes extracts a mesh from the density field. PBR textures (albedo, roughness, metallic) are baked either via the same diffusion prior or a dedicated texture generation model. The result is a game/rendering-ready asset.
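The first step of marching cubes is classifying each grid cell by whether the iso-surface passes through it, i.e. whether its eight corners straddle the iso-level. A numpy sketch on a sphere signed distance field, omitting the triangle lookup table that the full algorithm then applies per cell:

```python
import numpy as np

# Sample a signed distance field (a sphere of radius 0.6) on a grid.
N = 32
xs = np.linspace(-1, 1, N)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
sdf = np.sqrt(X**2 + Y**2 + Z**2) - 0.6

# A cell contains surface iff its 8 corners are not all on the same
# side of the iso-level (0 here). The full algorithm then emits
# triangles per cell via a 256-entry lookup table, omitted here.
corners = np.stack([
    sdf[i:N - 1 + i, j:N - 1 + j, k:N - 1 + k]
    for i in (0, 1) for j in (0, 1) for k in (0, 1)
])
inside = corners < 0
surface_cells = inside.any(axis=0) & ~inside.all(axis=0)
print(int(surface_cells.sum()), "cells intersect the surface")
```

In practice one would call an existing implementation such as `skimage.measure.marching_cubes` on the sampled density or SDF grid rather than reimplementing the cell classification.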
Evaluation
CLIP similarity between renderings and the text prompt is the standard automated metric, but human preference remains the real measure. Geometry quality is assessed by mesh smoothness, watertightness, and triangle count. There is no universally accepted benchmark yet.
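CLIP similarity is just the cosine between embeddings of each rendering and the prompt, averaged over views. In real evaluation both embeddings come from a pretrained CLIP model; in this sketch they are placeholder vectors, with the rendering embeddings artificially correlated with the prompt embedding.

```python
import numpy as np

def clip_score(render_embs, text_emb):
    """Average cosine similarity between per-view rendering embeddings
    and the text-prompt embedding. In practice both come from a
    pretrained CLIP model; here they are placeholders."""
    r = render_embs / np.linalg.norm(render_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return float((r @ t).mean())

rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)
# Embeddings of renders from several viewpoints, correlated with the
# prompt embedding so the toy score lands well above chance.
render_embs = 0.7 * text_emb + 0.3 * rng.normal(size=(8, 512))
print(round(clip_score(render_embs, text_emb), 3))
```

Averaging over many viewpoints matters: a Janus-afflicted asset can score well from one view while scoring poorly from others.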
Current Landscape
Text-to-3D in 2025 is at the 'early useful' stage — roughly where text-to-image was in early 2022. Feed-forward models (Trellis, InstantMesh, LRM) have solved the speed problem but still produce lower-quality geometry than optimization-based methods. The Objaverse dataset (800K 3D objects) has been the critical training resource. The practical workflow is typically text→image→3D rather than text→3D directly, because image generation is more controllable. Game studios and e-commerce platforms are the earliest adopters, using AI-generated 3D as starting points that artists refine.
Key Challenges
The Janus problem — SDS-based methods tend to generate objects with multiple faces (e.g., a face on both front and back of a head) because the 2D prior doesn't enforce 3D consistency
Geometry quality — generated meshes are often blobby, lack sharp edges, and have noisy topology; extracting clean, game-ready meshes remains manual-labor-intensive
Texture bleeding and seams — UV mapping and texture baking from 3D representations produce artifacts at mesh boundaries
Generation speed — optimization-based methods take 30-60 minutes per asset; feed-forward models are fast but lower quality
PBR materials — most methods produce only an albedo texture, not the physically-based rendering maps (roughness, metallic, normal) that modern game engines need
Quick Recommendations
Best quality (patient)
ProlificDreamer or DreamCraft3D
VSD produces the highest-fidelity text-to-3D via hours of optimization; best for hero assets
Fast generation (seconds)
Trellis or TripoSR (via text-to-image first)
Generate a concept image with FLUX, then reconstruct 3D in 5-10 seconds; practical for iteration
Game-ready meshes
Meshy v4 or CLAY
Produce textured meshes with PBR materials optimized for game engines; closest to production quality
Batch asset generation
InstantMesh or LRM-based pipeline
Feed-forward inference at scale; generate hundreds of 3D assets per hour for virtual worlds
Research / customization
ThreeStudio framework
Modular implementation of DreamFusion, Magic3D, ProlificDreamer — swap representations, loss functions, and models easily
What's Next
The next 1-2 years will focus on: PBR material generation (not just albedo textures), rigging and animation (currently a separate manual step), scene generation (rooms, environments, not just objects), and multi-object compositional generation. The endgame is 'describe a game level in text, get a playable 3D environment' — we're 3-5 years from that being practical. 3D Gaussian splatting is increasingly the default representation, replacing NeRFs for generation tasks.