ICCV (Oral), 2021
We introduce a Transformer-based framework for multi-view 3D object reconstruction that unifies feature extraction and view fusion in a single network. By reframing 3D reconstruction as a sequence-to-sequence prediction problem, the encoder-decoder structure explores multi-level correspondences between the 2D input views and the 3D output volume. We identify a "divergence decay" phenomenon and address it with a view-divergence enhancing operation inside the self-attention layers. The approach achieves state-of-the-art accuracy on the ShapeNet dataset with only 30% of the parameters required by contemporary CNN-based methods, and it scales well as the number of input views increases.
Deep CNN-based methods have so far achieved state-of-the-art results in multi-view 3D object reconstruction. Despite this considerable progress, the two core modules of these methods, view feature extraction and multi-view fusion, are usually investigated separately, and the relations among multiple input views are rarely explored. Inspired by the recent success of Transformer models, we reformulate multi-view 3D reconstruction as a sequence-to-sequence prediction problem and propose a framework named 3D Volume Transformer. Unlike previous CNN-based methods, which use a separate design, we unify feature extraction and view fusion in a single Transformer network. A natural advantage of our design is that self-attention explores view-to-view relationships among multiple unordered inputs. On ShapeNet, a large-scale 3D reconstruction benchmark, our method achieves new state-of-the-art accuracy in multi-view reconstruction with 70% fewer parameters than CNN-based methods. Experimental results also suggest that our method scales well. Our code will be made publicly available.
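To illustrate why self-attention suits multiple unordered inputs, the following is a minimal sketch of generic single-head scaled dot-product attention over a set of view tokens. It is not the 3D Volume Transformer itself: the function name `view_self_attention`, the token dimension, and the random projection matrices are assumptions made for illustration, and the view-divergence enhancing operation is not modeled. The sketch only shows that, without positional encoding, reordering the input views reorders the outputs in the same way and changes nothing else.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def view_self_attention(views, w_q, w_k, w_v):
    """Generic single-head self-attention over view tokens.

    views: array of shape (n_views, d), one token per input view
           (hypothetical; a real model would use many tokens per view).
    Returns the attended tokens and the (n_views, n_views) attention weights.
    """
    q, k, v = views @ w_q, views @ w_k, views @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # view-to-view affinities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
n_views, d = 4, 8  # illustrative sizes, not taken from the paper
views = rng.normal(size=(n_views, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

fused, weights = view_self_attention(views, w_q, w_k, w_v)

# Shuffle the views: the outputs are shuffled identically.
perm = rng.permutation(n_views)
fused_perm, _ = view_self_attention(views[perm], w_q, w_k, w_v)
print(np.allclose(fused_perm, fused[perm]))
```

Because every view attends to every other view through the same learned projections, no fixed view order has to be imposed on the input set.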
@InProceedings{Wang_2021_ICCV,
author = {Wang, Dan and Cui, Xinrui and Chen, Xun and Zou, Zhengxia and Shi, Tianyang and Salcudean, Septimiu and Wang, Z. Jane and Ward, Rabab},
title = {Multi-View 3D Reconstruction With Transformers},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021},
pages = {5722-5731}
}