Unsupervised 2D to 3D reconstruction

witloof lunchbox
3 min read · Jun 11, 2021


Obtaining 3D models from 2D images is a formidable computer vision task even when datasets of paired 2D and 3D data are available. But what if it could be done without any 3D data? This article gives a brief summary of a paper that achieves just that.

Objective

The objective is to train a model that generates a 3D model from a single 2D image of a symmetric object, using only 2D images as training data.

Architecture

To accomplish 2D-to-3D mapping without 3D data, a neural network is first created that takes a single-view 2D image as input and maps it to several output components.

Next, it is assumed that these components describe the lighting, the viewpoint (as if the image were a photo taken of a 3D object), the depth, and the albedo for the image and its horizontal flip. The input is assumed to be symmetric about a vertical line. Since in practice images are not perfectly symmetric, one further assumed output of the network is a confidence map that estimates the probability that each pixel in the image has a symmetric counterpart.
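The role of the confidence map can be illustrated with a small sketch. The weighting scheme below (a confidence-weighted mean absolute difference between an image and its mirror) is an assumption made for illustration and is not necessarily the loss used in the paper; the function names are hypothetical, and plain Python lists stand in for image arrays to keep the example dependency-free.

```python
def flip_horizontal(image):
    """Mirror a 2D grid (a list of rows) about its vertical centre line."""
    return [list(reversed(row)) for row in image]


def symmetry_error(image, confidence):
    """Confidence-weighted mean absolute difference between an image and its mirror.

    confidence[y][x] lies in [0, 1] and says how likely it is that the pixel
    has a symmetric counterpart (illustrative weighting, an assumption).
    """
    mirrored = flip_horizontal(image)
    weighted_sum = 0.0
    weight_total = 0.0
    for row, mirrored_row, conf_row in zip(image, mirrored, confidence):
        for value, mirrored_value, conf in zip(row, mirrored_row, conf_row):
            weighted_sum += conf * abs(value - mirrored_value)
            weight_total += conf
    # With no confident pixels there is nothing to compare.
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Toy grayscale image: the first row is symmetric, the second is not.
image = [
    [0.2, 0.5, 0.2],
    [0.9, 0.4, 0.1],
]
full_confidence = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
# Low confidence on the asymmetric pixels removes their penalty.
masked_confidence = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

error_full = symmetry_error(image, full_confidence)
error_masked = symmetry_error(image, masked_confidence)
```

With full confidence the asymmetric row is penalized; once the confidence of those pixels drops to zero, they no longer contribute to the error, which is how a confidence map lets a model tolerate imperfect symmetry.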

Given that all this information can be obtained from the image, computer vision methods can be used to reconstruct the 3D model. Under supervised learning, where a 3D model of each 2D training image is available, the network could be trained by comparing the 3D reconstruction built from its outputs with the target 3D model. In the method summarized here, however, no 3D data is required to train the network to extract useful information from the 2D images.
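As one example of such a classical step, a per-pixel depth map can be back-projected into 3D points. The sketch below assumes a simple pinhole camera with a hypothetical `focal_length` (in pixels) and the principal point at the image centre; these camera parameters are assumptions for illustration, not values taken from the paper.

```python
def depth_to_points(depth, focal_length):
    """Back-project a depth map (a list of rows) into a list of (x, y, z) points.

    Assumes a pinhole camera whose principal point is the image centre.
    """
    height, width = len(depth), len(depth[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    points = []
    for v in range(height):
        for u in range(width):
            z = depth[v][u]
            # Similar triangles: a pixel offset scales with depth / focal length.
            x = (u - cx) * z / focal_length
            y = (v - cy) * z / focal_length
            points.append((x, y, z))
    return points


# A single row of three pixels, all two units from the camera.
points = depth_to_points([[2.0, 2.0, 2.0]], focal_length=1.0)
```

The resulting points can then be connected into a mesh and textured with the albedo to give a viewable 3D model.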

To train the neural network to generate the required outputs (depth, view, albedo, etc.), the depth, view, and albedo information is used to generate a 2D reconstruction of the input image using traditional computer vision methods. This reconstruction is compared with the input to obtain a reconstruction loss, which is then used to train the network. Once the network is trained, traditional computer vision can be used to reconstruct the 3D model from the output of the neural network.
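A heavily simplified sketch of this reconstruct-and-compare loop is shown below. It assumes Lambertian shading, a frontal viewpoint (so the reprojection step is skipped entirely), a grayscale image, and hypothetical `ambient` and `diffuse` lighting weights; none of these choices are confirmed by the summary above, and the real pipeline would also need to be differentiable so that the loss can train the network.

```python
import math


def normals_from_depth(depth):
    """Estimate a unit surface normal per pixel from depth by finite differences."""
    height, width = len(depth), len(depth[0])
    normals = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighbour indices are clamped at the border, so edge slopes are damped.
            dzdx = (depth[y][min(x + 1, width - 1)] - depth[y][max(x - 1, 0)]) / 2.0
            dzdy = (depth[min(y + 1, height - 1)][x] - depth[max(y - 1, 0)][x]) / 2.0
            length = math.sqrt(dzdx * dzdx + dzdy * dzdy + 1.0)
            # Assumed convention: a flat surface has normal (0, 0, 1).
            row.append((-dzdx / length, -dzdy / length, 1.0 / length))
        normals.append(row)
    return normals


def render(depth, albedo, light_dir, ambient, diffuse):
    """Shade each pixel as albedo * (ambient + diffuse * max(0, n . l))."""
    image = []
    for normal_row, albedo_row in zip(normals_from_depth(depth), albedo):
        row = []
        for normal, a in zip(normal_row, albedo_row):
            n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
            row.append(a * (ambient + diffuse * n_dot_l))
        image.append(row)
    return image


def reconstruction_loss(reconstruction, target):
    """Mean absolute difference between two images of the same size."""
    differences = [
        abs(r - t)
        for rec_row, tgt_row in zip(reconstruction, target)
        for r, t in zip(rec_row, tgt_row)
    ]
    return sum(differences) / len(differences)


# Toy "network outputs": a flat surface lit head-on.
depth = [[1.0, 1.0], [1.0, 1.0]]
albedo = [[0.5, 0.5], [0.5, 0.5]]
reconstruction = render(depth, albedo, light_dir=(0.0, 0.0, 1.0), ambient=0.2, diffuse=0.8)

matching_input = [[0.5, 0.5], [0.5, 0.5]]
brighter_input = [[0.7, 0.7], [0.7, 0.7]]
loss_match = reconstruction_loss(reconstruction, matching_input)
loss_mismatch = reconstruction_loss(reconstruction, brighter_input)
```

When the predicted components explain the input image, the loss is close to zero; a mismatch yields a positive loss, which is the signal that pushes the network towards better depth, albedo, and lighting estimates.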

Results and conclusion

Reconstruction of faces, cats and cars
Reconstruction of paintings

Results show that the model obtains high-fidelity monocular 3D reconstructions of individual object instances, as shown in the figures above.

Failure cases

While the model performs well in typical cases, it fails in extreme ones, such as images with extreme lighting or pose and objects with noisy textures.
