ABSTRACT

Semantic scene completion (SSC) aims to predict the semantic occupancy and geometry of 3D scenes. Recently, most studies have focused on camera-based approaches because of the rich visual cues in images and the cost-effectiveness of cameras. However, these methods usually lack efficient fusion and fine-grained processing of cross-modal semantic information, resulting in sub-optimal performance. To address these issues, we propose a novel cross-modal semantic deep fusion framework. Unlike previous approaches, our method effectively integrates 2D textural, 2D spatial, and 3D geometric knowledge from three different modalities to reconstruct complete 3D scenes. Specifically, we employ two encoders to extract 2D textural and 2D spatial features from RGB images and depth maps, which are then fused and lifted into 3D space by our tailored cross-modal semantic fusion module. In contrast to previous methods, which retain large numbers of redundant 3D voxel features, we design a lightweight voxel feature filter that eliminates these redundancies efficiently. Furthermore, 3D geometric features are extracted from the point cloud derived from the depth map. The 3D features from the multiple modalities are deeply fused and further refined by a sparse-to-dense voxel completion module, which effectively enriches the semantic information. In addition, we propose a new evaluation metric better suited to assessing the SSC task under the class imbalance present in the dataset. Extensive experiments show that our method achieves state-of-the-art performance in camera-based semantic scene completion. We will release the source code publicly.
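As a concrete illustration of the lifting step, the following minimal sketch back-projects per-pixel 2D features into a 3D voxel grid using the depth map and a pinhole camera model. The function name, grid parameters, and overall structure are assumptions for exposition, not the paper's actual cross-modal semantic fusion module.

```python
import torch

def lift_features_to_voxels(feat_2d, depth, intrinsics, grid_min, voxel_size, grid_dims):
    """Back-project per-pixel 2D features into a sparse 3D voxel grid.

    feat_2d:    (C, H, W) fused image/depth features
    depth:      (H, W) metric depth map
    intrinsics: (3, 3) camera matrix K
    grid_min:   (3,) coordinates of the voxel-grid origin
    returns:    voxel indices (N, 3) and their features (N, C)
    """
    C, H, W = feat_2d.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.flatten()
    valid = z > 0                                    # skip holes in the depth map
    u, v, z = u.flatten()[valid], v.flatten()[valid], z[valid]

    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    x = (u - cx) * z / fx                            # pinhole back-projection
    y = (v - cy) * z / fy
    pts = torch.stack([x, y, z], dim=-1)             # (N, 3) camera-frame points

    coords = ((pts - grid_min) / voxel_size).long()  # quantize to voxel indices
    in_bounds = ((coords >= 0) & (coords < torch.tensor(grid_dims))).all(dim=-1)
    feats = feat_2d.flatten(1).t()[valid][in_bounds]
    return coords[in_bounds], feats
```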
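The voxel feature filter is described only at a high level in the abstract; one plausible realization, sketched below, scores each lifted voxel with a small MLP and retains only the most confident fraction before the expensive 3D fusion. The keep ratio and architecture here are hypothetical, not the authors' design.

```python
import torch
import torch.nn as nn

class VoxelFeatureFilter(nn.Module):
    """Drop redundant voxels by keeping the top-k highest-scoring ones."""

    def __init__(self, channels: int, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Sequential(                  # per-voxel confidence head
            nn.Linear(channels, channels // 2),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 2, 1),
        )

    def forward(self, coords, feats):
        """coords: (N, 3) voxel indices, feats: (N, C) voxel features."""
        s = self.score(feats).squeeze(-1)            # (N,) score per voxel
        k = max(1, int(self.keep_ratio * feats.shape[0]))
        idx = torch.topk(s, k).indices               # indices of retained voxels
        gate = torch.sigmoid(s[idx]).unsqueeze(-1)   # soft gate keeps gradients flowing
        return coords[idx], feats[idx] * gate
```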