To address eL-TGIM, we propose a framework called SeMani (Semantic Manipulation of real-world images), which consists of two phases: semantic alignment and image manipulation.
In the semantic alignment phase, SeMani utilizes a semantic alignment module to identify the regions of the image that need to be manipulated.
In the image manipulation phase, SeMani employs a generative model to create new images based on the entity-irrelevant regions and target descriptions.
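The two-phase flow described above can be sketched as follows. This is a toy illustration only, not the SeMani implementation: the names `semantic_alignment`, `image_manipulation`, and `semani_sketch` are hypothetical, the "image" is a small grid of integers, the alignment step simply matches a pixel value instead of grounding a text description, and the generative model is replaced by a caller-supplied function.

```python
from typing import Callable, List

Image = List[List[int]]   # toy image: rows of integer pixel values
Mask = List[List[bool]]   # True marks a pixel that should be manipulated


def semantic_alignment(image: Image, entity_value: int) -> Mask:
    """Hypothetical stand-in for the semantic alignment module.

    Assumption: the entity to edit is every pixel equal to entity_value.
    The real module would locate the entity from the text description.
    """
    return [[px == entity_value for px in row] for row in image]


def image_manipulation(
    image: Image, mask: Mask, generate: Callable[[int, int], int]
) -> Image:
    """Hypothetical stand-in for the generative model.

    Masked pixels are produced by generate(row, col); entity-irrelevant
    pixels are copied from the input unchanged.
    """
    return [
        [generate(r, c) if mask[r][c] else px for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def semani_sketch(
    image: Image, entity_value: int, generate: Callable[[int, int], int]
) -> Image:
    """Phase 1 (alignment) followed by phase 2 (manipulation)."""
    mask = semantic_alignment(image, entity_value)
    return image_manipulation(image, mask, generate)


if __name__ == "__main__":
    toy_image = [[0, 1], [1, 0]]
    # Replace the "entity" (pixels equal to 1) with a constant value 9.
    print(semani_sketch(toy_image, 1, lambda r, c: 9))
```

The point of the sketch is the separation of concerns: the first phase decides where to edit, and the second phase decides what to put there while leaving everything else untouched.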
Below is an illustration of SeMani.
To implement SeMani, we consider two common views of an image: discrete and continuous. The discrete view draws inspiration from auto-regressive transformers, while the continuous view is inspired by denoising diffusion probabilistic models. These views give rise to two variants of SeMani: SeMani-Trans and SeMani-Diff. Each variant has its own architecture and generation process. SeMani can manipulate multiple objects either simultaneously or sequentially, as shown below.
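The difference between simultaneous and sequential manipulation of several objects can be sketched in a few lines. This is a toy illustration under stated assumptions, not the procedure used by SeMani-Trans or SeMani-Diff: every name below is hypothetical, each edit is a pair of a region mask and a constant replacement value, and no generative model is involved.

```python
from typing import List, Tuple

Image = List[List[int]]
Mask = List[List[bool]]
Edit = Tuple[Mask, int]   # (region to manipulate, toy "generated" value)


def apply_edit(image: Image, mask: Mask, value: int) -> Image:
    """Return a copy of image with the masked pixels replaced by value."""
    return [
        [value if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def manipulate_sequentially(image: Image, edits: List[Edit]) -> Image:
    """Apply edits one after another; each edit sees the previous result."""
    for mask, value in edits:
        image = apply_edit(image, mask, value)
    return image


def manipulate_simultaneously(image: Image, edits: List[Edit]) -> Image:
    """Apply all edits in a single pass over the original image.

    Assumption: if masks overlap, the first edit covering a pixel wins.
    """
    out = [row[:] for row in image]
    for r, row in enumerate(image):
        for c in range(len(row)):
            for mask, value in edits:
                if mask[r][c]:
                    out[r][c] = value
                    break
    return out


if __name__ == "__main__":
    toy_image = [[1, 2, 3]]
    edits = [
        ([[True, False, False]], 7),   # edit the first object
        ([[False, False, True]], 8),   # edit the second object
    ]
    print(manipulate_sequentially(toy_image, edits))
    print(manipulate_simultaneously(toy_image, edits))
```

In this toy setting the two modes agree whenever the edited regions do not overlap. With a real generative model they can differ, since a sequential edit is conditioned on the output of the previous one.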
For details about SeMani-Trans and SeMani-Diff, please refer to our paper.