Numerous diffusion models have been developed for 2D image synthesis and editing, and they have recently been extended to 3D scene editing. However, 3D scene editing remains at an early stage, with open challenges in scene representation and multi-view consistency. A notable limitation of existing approaches is their reliance on edit-specific modules and per-scene retraining. To tackle these issues, we propose a novel and versatile text-driven 3D scene editing method, termed DN2N, which produces editing results directly, without retraining. Our method employs off-the-shelf text-based 2D image editing models to modify the multi-view images of a 3D scene. A content filtering process then discards poorly edited images that disrupt 3D consistency. We cast the remaining inconsistency as a problem of removing noise perturbations and address it by generating training data with similar perturbation characteristics. We develop a versatile NeRF model structure and propose two novel cross-view regularization terms to help DN2N mitigate these perturbations. Empirical results show that our method achieves multiple editing types from text prompts alone, including but not limited to appearance editing, weather transition, object replacement, and style transfer. Most importantly, DN2N generalizes across scenes and editing types, eliminating the need to customize or retrain editing models for each. Moreover, its total editing time is comparable to that of 3DGS-based editing methods, enhancing its practical value.
Published in: IEEE Transactions on Visualization and Computer Graphics
vol. PP, pp. 1–15
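To make the content-filtering step described in the abstract concrete, below is a minimal, hypothetical sketch: it scores each edited view by its mean absolute per-pixel change from the original and discards statistical outliers across the view set. The function names, the scoring rule, and the z-score threshold are illustrative assumptions, not the paper's actual filter.

```python
import numpy as np

def edit_magnitude(original: np.ndarray, edited: np.ndarray) -> float:
    """Mean absolute per-pixel change between an original view and its edit."""
    return float(np.mean(np.abs(edited.astype(np.float32) - original.astype(np.float32))))

def filter_edited_views(originals, editeds, z_thresh=2.0):
    """Keep edited views whose edit magnitude lies within z_thresh standard
    deviations of the mean magnitude over all views (a stand-in for the
    paper's content filtering)."""
    scores = np.array([edit_magnitude(o, e) for o, e in zip(originals, editeds)])
    mu, sigma = scores.mean(), scores.std() + 1e-8
    keep = np.abs(scores - mu) / sigma < z_thresh
    return [e for e, k in zip(editeds, keep) if k], keep

# Usage with random stand-in images (H x W x 3, uint8):
rng = np.random.default_rng(0)
originals = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(8)]
editeds = [img.copy() for img in originals]
editeds[3] = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)  # a badly edited outlier
kept, mask = filter_edited_views(originals, editeds)
print(mask)  # view 3 should be flagged False and dropped
```

A per-view outlier test like this only catches edits that deviate in magnitude; a faithful implementation would also need a cross-view criterion, since the goal in the paper is to remove views that break 3D consistency.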