ABSTRACT

Semantic scene completion (SSC) aims to predict the semantic occupancy and geometry of 3D scenes. Recently, most studies have focused on camera-based approaches because of the rich visual cues in images and the cost-effectiveness of cameras. However, these methods usually lack efficient fusion and fine-grained processing of cross-modal semantic information, resulting in sub-optimal performance. To address these issues, we propose a novel cross-modal semantic deep fusion framework. Unlike previous approaches, our method effectively integrates 2D textural, 2D spatial, and 3D geometric knowledge from three different modalities to reconstruct complete 3D scenes. Specifically, we employ two encoders to extract 2D textural and 2D spatial features from RGB images and depth maps, which are then fused and lifted into 3D space by our tailored cross-modal semantic fusion module. In contrast to previous methods, which retain large numbers of redundant 3D voxel features, we design a lightweight voxel feature filter that eliminates these redundancies efficiently. Furthermore, 3D geometric features are extracted from the point cloud derived from the depth map. The 3D features from the multiple modalities are deeply fused and further refined by a sparse-to-dense voxel completion module, which effectively enriches the semantic information. In addition, we propose a new evaluation metric better suited to assessing the SSC task under the class imbalance present in the dataset. Extensive experiments show that our method achieves state-of-the-art performance in camera-based semantic scene completion. We will release the source code publicly.
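As a concrete illustration of the lifting step, the following minimal sketch back-projects per-pixel 2D features into a 3D voxel grid using the depth map and a pinhole camera model. The function name, grid parameters, and overall structure are assumptions for exposition, not the paper's actual cross-modal semantic fusion module.

```python
import torch

def lift_features_to_voxels(feat_2d, depth, intrinsics, grid_min, voxel_size, grid_dims):
    """Back-project per-pixel 2D features into a sparse 3D voxel grid.

    feat_2d:    (C, H, W) fused image/depth features
    depth:      (H, W) metric depth map
    intrinsics: (3, 3) camera matrix K
    grid_min:   (3,) coordinates of the voxel-grid origin
    returns:    voxel indices (N, 3) and their features (N, C)
    """
    C, H, W = feat_2d.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.flatten()
    valid = z > 0                                    # skip holes in the depth map
    u, v, z = u.flatten()[valid], v.flatten()[valid], z[valid]

    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    x = (u - cx) * z / fx                            # pinhole back-projection
    y = (v - cy) * z / fy
    pts = torch.stack([x, y, z], dim=-1)             # (N, 3) camera-frame points

    coords = ((pts - grid_min) / voxel_size).long()  # quantize to voxel indices
    in_bounds = ((coords >= 0) & (coords < torch.tensor(grid_dims))).all(dim=-1)
    feats = feat_2d.flatten(1).t()[valid][in_bounds]
    return coords[in_bounds], feats
```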
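The voxel feature filter is described only at a high level in the abstract; one plausible realization, sketched below, scores each lifted voxel with a small MLP and retains only the most confident fraction before the expensive 3D fusion. The keep ratio and architecture here are hypothetical, not the authors' design.

```python
import torch
import torch.nn as nn

class VoxelFeatureFilter(nn.Module):
    """Drop redundant voxels by keeping the top-k highest-scoring ones."""

    def __init__(self, channels: int, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Sequential(                  # per-voxel confidence head
            nn.Linear(channels, channels // 2),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 2, 1),
        )

    def forward(self, coords, feats):
        """coords: (N, 3) voxel indices, feats: (N, C) voxel features."""
        s = self.score(feats).squeeze(-1)            # (N,) score per voxel
        k = max(1, int(self.keep_ratio * feats.shape[0]))
        idx = torch.topk(s, k).indices               # indices of retained voxels
        gate = torch.sigmoid(s[idx]).unsqueeze(-1)   # soft gate keeps gradients flowing
        return coords[idx], feats[idx] * gate
```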