Abstract

Diffusion models have achieved state-of-the-art results in image synthesis, yet unlike GANs they lack a well-structured latent space for intuitive image editing. Existing diffusion-based editing methods often rely on supervised fine-tuning or text-based guidance, while recent unsupervised techniques that leverage the model’s bottleneck layer suffer from one or more key limitations: (i) they address only global attributes, (ii) they fail to disentangle local and global semantics, or (iii) they require extensive human intervention. To fill this gap, we first propose an unsupervised method for localized image editing in pre-trained unconditional diffusion models that disentangles local and global semantics in the model’s latent space. Given an input image and a user-specified region of interest, our approach uses the denoising network’s Jacobian to map that region to a corresponding latent subspace. We then separate this subspace into shared (global) and region-specific components to uncover latent directions that control local attributes. These directions generalize across images, enabling semantically consistent edits without retraining. We go one step further, extending our method to minimize manual supervision by automatically inferring edit directions from a single reference image and generating region masks without human input. Experiments on multiple datasets show that our method yields more localized, high-fidelity edits than state-of-the-art approaches.
Published in: International Journal of Computer Vision
Volume 134, Issue 4
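The abstract’s central step, mapping a user-specified region to a latent subspace through the denoising network’s Jacobian, can be sketched concretely. The snippet below is a minimal illustration, not the paper’s implementation: `decode` (a hypothetical helper that runs a pre-trained UNet from its bottleneck activation `h` to the denoised prediction), the binary region `mask`, and the deflated power iteration are all assumptions made for the example. It looks for bottleneck directions whose effect on the output is concentrated inside the masked region by iterating v ← JᵀMJv, which (for a binary mask, where MᵀM = M) converges to the leading right singular vectors of the masked Jacobian MJ.

```python
# Minimal sketch: find bottleneck ("h-space") directions whose Jacobian
# effect is concentrated in a region of interest, via power iteration on
# the masked Jacobian. All names here are illustrative assumptions.

import torch
from torch.func import jvp, vjp


def masked_top_directions(decode, h, mask, n_dirs=3, n_iters=30):
    """Approximate the top right singular vectors of M @ J, J = d decode / d h.

    decode: pure function mapping a bottleneck activation `h` to the
            denoiser's output (hypothetical; stands in for the second
            half of a pre-trained unconditional diffusion UNet).
    h:      bottleneck activation tensor at some timestep.
    mask:   binary region-of-interest mask, broadcastable to decode(h).
    """
    found = []
    for _ in range(n_dirs):
        v = torch.randn_like(h)
        v = v / v.norm()
        for _ in range(n_iters):
            # Forward mode: u = J v, restricted to the region of interest.
            _, u = jvp(decode, (h,), (v,))
            u = u * mask
            # Reverse mode: v <- J^T u = J^T M J v.
            _, pullback = vjp(decode, h)
            (v,) = pullback(u)
            # Deflate against directions already found, then renormalize.
            for w in found:
                v = v - (v.flatten() @ w.flatten()) * w
            v = v / v.norm()
        found.append(v)
    return found
```

Under the same assumptions, the shared (global) component could be estimated by running the procedure with a full-image mask and projecting those directions out of the region’s subspace, leaving region-specific directions; an edit would then add a scaled direction to `h` during denoising. This mirrors the abstract’s description only at a sketch level, not the authors’ exact algorithm.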