In this work, we address the challenging task of
3D object recognition without relying on real-world 3D
labeled data. Our goal is to predict the 3D shape, size, and 6D
pose of objects within a single RGB-D image, operating at the
category level and eliminating the need for CAD models during
inference.
While existing self-supervised methods have made
strides in this field, they often suffer from inefficiencies arising
from non-end-to-end processing, reliance on separate models
for different object categories, and slow surface extraction
during the training of implicit reconstruction models, all of
which hinder both the speed and real-world applicability of the
3D recognition process. Our proposed method leverages a
multi-stage training pipeline designed to efficiently transfer
performance from the synthetic to the real-world domain. This approach
is achieved through a combination of 2D and 3D supervised
losses during synthetic-domain training, followed by the
incorporation of 2D supervised and 3D self-supervised losses on
real-world data in two additional learning stages.
By adopting
this comprehensive strategy, our method successfully overcomes
the aforementioned limitations and outperforms existing self-
supervised 6D pose and size estimation baselines on the NOCS
test set, achieving a 16.4% absolute improvement in mAP for 6D
pose estimation while running in near real-time at 5 Hz.