Multi-view 3D Reconstruction with Transformers

¹University of British Columbia; ²University of Science and Technology of China; ³University of Michigan, Ann Arbor; ⁴NetEase Fuxi AI Lab

ICCV (Oral), 2021

Model visualization

We introduce a Transformer-based framework for multi-view 3D object reconstruction that unifies feature extraction and view fusion in a single network. By reformulating 3D reconstruction as a sequence-to-sequence prediction problem, our encoder-decoder structure explores multi-level correspondences and associations between the 2D input views and the 3D output volume. We identify a "divergence decay" phenomenon and address it with a view-divergence enhancing operation inside the self-attention layers. On the ShapeNet dataset, this approach achieves state-of-the-art accuracy while using only 30% of the parameters required by contemporary CNN-based methods, and it scales well as the number of input views increases.

Abstract

Deep CNN-based methods have so far achieved state-of-the-art results in multi-view 3D object reconstruction. Despite considerable progress, the two core modules of these methods, view feature extraction and multi-view fusion, are usually investigated separately, and the relations among multiple input views are rarely explored. Inspired by the recent success of Transformer models, we reformulate multi-view 3D reconstruction as a sequence-to-sequence prediction problem and propose a framework named 3D Volume Transformer. Unlike previous CNN-based methods that use a separate design, we unify feature extraction and view fusion in a single Transformer network. A natural advantage of our design lies in the exploration of view-to-view relationships using self-attention among multiple unordered inputs. On ShapeNet, a large-scale 3D reconstruction benchmark, our method achieves new state-of-the-art accuracy in multi-view reconstruction with 70% fewer parameters than CNN-based methods. Experimental results also suggest the strong scaling capability of our method. Our code will be made publicly available.
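
To illustrate why self-attention suits multiple unordered inputs, here is a minimal sketch of single-head scaled dot-product self-attention over per-view feature vectors. It is not the 3D Volume Transformer itself: the function names, the toy 3-dimensional view tokens, and the use of identity query/key/value projections are assumptions made for brevity. The sketch shows that reordering the input views only reorders the fused outputs, so no view ordering needs to be imposed.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    tokens: list of equal-length feature vectors, one per input view.
    Assumption: query, key and value projections are the identity,
    so each token attends to all tokens (including itself) directly.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in tokens:
        # Similarity of this view to every view.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Weighted sum of all view features.
        fused.append([sum(w * v[j] for w, v in zip(weights, tokens))
                      for j in range(dim)])
    return fused

# Hypothetical toy features for three input views.
view_tokens = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 1.0]]
fused = self_attention(view_tokens)

# Feed the same views in a different order.
perm = [2, 0, 1]
fused_perm = self_attention([view_tokens[i] for i in perm])

# Each view receives the same fused feature regardless of input order.
for k, i in enumerate(perm):
    assert all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
               for a, b in zip(fused_perm[k], fused[i]))
```

A real model would add learned projections, multiple heads, and a decoder that attends from 3D volume queries to the view tokens; those parts are omitted here.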

BibTeX

@InProceedings{Wang_2021_ICCV,
    author    = {Wang, Dan and Cui, Xinrui and Chen, Xun and Zou, Zhengxia and Shi, Tianyang and Salcudean, Septimiu and Wang, Z. Jane and Ward, Rabab},
    title     = {Multi-View 3D Reconstruction With Transformers},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {5722-5731}
}