
Image-to-3D

Image-to-3D reconstruction infers full 3D geometry from one or a few images — a fundamentally ill-posed problem that recent models solve with learned geometric priors. Traditional multi-view stereo required dozens of calibrated views, but single-image methods like One-2-3-45 (2023) and TripoSR leverage large-scale 3D training data to hallucinate plausible geometry from a single photo. 3D Gaussian Splatting (2023) revolutionized the representation side, enabling real-time rendering of reconstructed scenes. The practical gap is clear: scanned objects still look better than generated ones, but the convenience of snap-and-reconstruct is reshaping e-commerce product visualization and AR content creation.


Image-to-3D reconstructs a 3D model from one or a few 2D photographs. It's the practical workhorse of 3D content creation — more controllable than text-to-3D because you can specify exactly what the object looks like. Large Reconstruction Models (LRMs) now produce textured meshes from a single image in under 10 seconds, making this viable for e-commerce, gaming, and AR.

History

2016

3D-R2N2 (Choy et al.) uses recurrent networks to reconstruct voxel grids from single- or multi-view images — among the first deep-learning methods for single-image 3D reconstruction

2018

Pixel2Mesh and AtlasNet deform template meshes to match input images, producing smoother but limited-topology outputs

2020

NeRF enables photorealistic reconstruction from dense multi-view captures; requires 50-100 input images per scene

2023

Zero-1-to-3 (Liu et al.) uses a diffusion model to synthesize novel viewpoints from a single image, bootstrapping multi-view reconstruction

2023

One-2-3-45 and Wonder3D combine single-image multi-view generation with 3D reconstruction, producing meshes from one photo in minutes

2023

LRM (Large Reconstruction Model) by Hong et al. trains a transformer on Objaverse to directly predict triplane NeRF features from a single image in about 5 seconds

2024

TripoSR (Stability AI + Tripo) and InstantMesh produce textured meshes in seconds; 3D Gaussian splatting replaces NeRF as the target representation

2024

DUSt3R (Naver) reconstructs 3D from uncalibrated image pairs without known camera poses — unifying SfM and MVS into a single neural network

2025

Trellis (Microsoft), SF3D, and Hunyuan3D-2 produce PBR-textured meshes from single images; quality approaches manual 3D modeling for simple objects

How Image-to-3D Works

1. Image Encoding

The input image is processed by a pretrained vision encoder (DINOv2, CLIP ViT) into dense feature tokens that capture both appearance and implicit 3D cues (shading, perspective, occlusion).
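As a concrete illustration of the tokenization step, the toy sketch below (a hypothetical `patchify` helper, NumPy only) splits an image into the non-overlapping patches that a ViT-style encoder turns into feature tokens; the learned linear projection and positional embeddings are omitted:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 14) -> np.ndarray:
    """Split an (H, W, 3) image into flattened non-overlapping patches,
    the token layout a ViT-style encoder such as DINOv2 consumes."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (tokens, dim)
    return (image.reshape(gh, patch, gw, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(gh * gw, patch * patch * c))

tokens = patchify(np.zeros((224, 224, 3), dtype=np.float32))
# 16x16 grid of 14x14x3 patches -> shape (256, 588)
```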

2. 3D Representation Prediction

A transformer decoder predicts a 3D representation — triplane features (3 axis-aligned feature planes), 3D Gaussian parameters, or point cloud coordinates — from the image features. LRM-style models do this in a single forward pass.

3. Multi-View Generation (Optional)

Some methods first generate 4-6 consistent views using a multi-view diffusion model (Zero123++, SV3D), then reconstruct 3D from the synthesized views. This improves unseen-side quality.
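Multi-view generators are typically trained against a fixed camera rig. A toy sketch of such a rig (hypothetical `ring_cameras` helper; positions only, rotation matrices omitted) places cameras evenly around the object at a fixed elevation:

```python
import numpy as np

def ring_cameras(n_views: int = 6, elevation_deg: float = 20.0,
                 radius: float = 2.0) -> np.ndarray:
    """Camera centers evenly spaced in azimuth at a fixed elevation (+Z up),
    the kind of fixed rig a multi-view diffusion model is trained on."""
    az = np.deg2rad(np.linspace(0.0, 360.0, n_views, endpoint=False))
    el = np.deg2rad(elevation_deg)
    xy = radius * np.cos(el)          # horizontal distance from the axis
    return np.stack([xy * np.cos(az), xy * np.sin(az),
                     np.full(n_views, radius * np.sin(el))], axis=1)

cams = ring_cameras()   # (6, 3) positions, all at distance 2 from the origin
```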

4. Surface Extraction

Marching cubes or differentiable iso-surface extraction (FlexiCubes, DMTet) converts the implicit representation into an explicit triangle mesh with vertex normals.
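Marching cubes triangulates exactly those grid cells whose corner values straddle the iso-surface. The NumPy sketch below (hypothetical `surface_cells` helper; the per-cell triangulation itself is omitted) finds those cells for a sphere signed-distance field:

```python
import numpy as np

def surface_cells(sdf: np.ndarray) -> np.ndarray:
    """Boolean mask of grid cells whose 8 corner SDF values straddle zero,
    i.e. exactly the cells marching cubes would go on to triangulate."""
    s = np.sign(sdf)
    nx, ny, nz = s.shape
    # Stack the 8 corner values of every cell: shape (8, nx-1, ny-1, nz-1)
    corners = np.stack([s[i:i + nx - 1, j:j + ny - 1, k:k + nz - 1]
                        for i in (0, 1) for j in (0, 1) for k in (0, 1)])
    return (corners.max(axis=0) > 0) & (corners.min(axis=0) < 0)

# Sphere of radius 0.5 sampled on a 32^3 grid over [-1, 1]^3
g = np.linspace(-1.0, 1.0, 32)
X, Y, Z = np.meshgrid(g, g, g, indexing="ij")
mask = surface_cells(np.sqrt(X**2 + Y**2 + Z**2) - 0.5)
```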

5. Texture Mapping

UV coordinates are computed, and textures are either directly predicted by the model, baked from the neural representation, or painted via texture diffusion. PBR pipelines additionally generate roughness, metallic, and normal maps.
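The simplest conceivable bake, sketched below for intuition with a hypothetical `bake_vertex_colors` helper: project each vertex through a pinhole camera and sample the input image directly (visible side only, no UV unwrapping, no occlusion handling):

```python
import numpy as np

def bake_vertex_colors(verts: np.ndarray, K: np.ndarray,
                       image: np.ndarray) -> np.ndarray:
    """Project (N, 3) camera-space vertices through pinhole intrinsics K
    and sample the input image: the crudest texture bake, visible side only."""
    uv = (K @ verts.T).T                 # (N, 3) homogeneous pixel coords
    uv = uv[:, :2] / uv[:, 2:3]          # perspective divide
    u = np.clip(uv[:, 0].astype(int), 0, image.shape[1] - 1)
    v = np.clip(uv[:, 1].astype(int), 0, image.shape[0] - 1)
    return image[v, u]                   # one RGB sample per vertex

# Toy camera (f=100, principal point at 32,32) and a 64x64 image with
# a single red pixel at its center
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
image = np.zeros((64, 64, 3))
image[32, 32] = [1.0, 0.0, 0.0]
colors = bake_vertex_colors(np.array([[0.0, 0.0, 1.0]]), K, image)
```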

Current Landscape

Image-to-3D in 2025 has reached practical utility for specific use cases — e-commerce product visualization, rapid prototyping, and game asset drafting. Feed-forward models (LRM, TripoSR, Trellis) are fast enough for interactive use, and multi-view diffusion methods (Zero123++, SV3D) handle complex objects better than single-view approaches. The quality gap between AI-generated and artist-created 3D assets has narrowed from 'laughable' (2022) to 'useful starting point' (2025). DUSt3R's emergence for multi-view reconstruction is significant — it can replace classical COLMAP/SfM pipelines with a single neural network.

Key Challenges

Back-side hallucination — with only a front view, the model must guess what the back looks like; errors are common for asymmetric or complex objects

Geometry detail — generated meshes lack the sharp edges, fine features, and clean topology that 3D artists produce; post-processing with remeshing tools is usually needed

Scale ambiguity — a single image doesn't contain absolute size information, making the output dimensionless without external calibration

Transparent and reflective objects — glass, chrome, and other non-Lambertian surfaces violate the assumptions of most reconstruction methods

PBR material estimation — predicting physically-based materials (not just albedo) from a single image is severely underconstrained
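To make the scale-ambiguity point concrete: a single external measurement is enough to fix the scale. A sketch with a hypothetical `metric_scale` helper that rescales a dimensionless mesh to one known dimension:

```python
import numpy as np

def metric_scale(verts: np.ndarray, known_height_m: float) -> np.ndarray:
    """Rescale dimensionless (N, 3) mesh vertices so the bounding-box
    height (Z extent) matches one externally measured dimension."""
    height = verts[:, 2].max() - verts[:, 2].min()
    return verts * (known_height_m / height)

# A unit-less reconstruction known to depict a 0.3 m tall product
scaled = metric_scale(np.array([[0.0, 0.0, 0.0], [0.1, 0.1, 2.0]]), 0.3)
```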

Quick Recommendations

Best single-image quality

Trellis (Microsoft) or Hunyuan3D-2

Highest mesh quality and texture fidelity from a single image; Trellis uses structured latent representation for cleaner geometry

Fastest generation

TripoSR or SF3D

Textured mesh in 1-5 seconds; good enough for rapid prototyping and e-commerce product shots

Multi-view reconstruction

DUSt3R / MASt3R + Gaussian splatting

Reconstructs from 2-20 uncalibrated photos without SfM preprocessing; best for scene-level reconstruction

E-commerce / product 3D

Tripo API or Meshy

Optimized for clean product shots with white backgrounds; both produce AR-ready meshes with materials

Game assets

InstantMesh + manual retopology

AI generates initial mesh, then tools like ZBrush/Blender clean up topology for game engines

What's Next

Near-term: PBR material estimation, automatic rigging for animation, and scene-level reconstruction from casual phone captures. Medium-term: generating articulated 3D models (hands, robots, furniture with moving parts) from single images. Long-term: integration into 3D-native generative models where the image is just one possible input alongside text, sketches, and partial geometry. The convergence of 3D Gaussian splatting with feed-forward prediction is the most active research direction.

