Our method takes in a stack of generated images and produces a final image based on sparse user strokes. (a) The images in the stack are generated with ControlNet using one or more prompts. Because they are produced from the same input condition (e.g., an edge map or a depth map), the generated images share common spatial structure. (b) While browsing the image stack, the user selects desired objects and regions via broad brush strokes on the images. In the example below, the user wishes to remove the rock at the apple's bite in the first image and add the red leaf from the third image. To do so, the user draws strokes on the base rock in the first image, the patch of grass in the second image, and the red leaf in the third image. Our system takes this user input and performs a multi-label graph-cut optimization in self-attention feature space (K features) to find a segmentation of image regions across the stack that minimizes visible seams. (c) The graph-cut result is then used to form composite Q, K, V features, which are injected into the self-attention layers. The final image is a harmonious composite of the user-selected regions.
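To make the compositing step concrete, here is a minimal PyTorch sketch of how composited Q, K, V features could be formed from a graph-cut label map and used in a single self-attention pass. It assumes the label map has already been resized to the attention resolution and ignores batching and multiple heads; the function names (composite_attention_features, injected_self_attention) are illustrative and not the authors' implementation.

import torch

def composite_attention_features(feats, labels):
    # feats:  (N, HW, C) per-image Q, K, or V features at one self-attention layer
    # labels: (HW,) long tensor; labels[j] is the stack image assigned to location j
    hw, c = labels.shape[0], feats.shape[-1]
    idx = labels.view(1, hw, 1).expand(1, hw, c)   # (1, HW, C) gather index
    return torch.gather(feats, 0, idx).squeeze(0)  # (HW, C) composite features

def injected_self_attention(q_stack, k_stack, v_stack, labels, scale):
    # Composite Q, K, V according to the graph-cut labels, then run a plain
    # single-head self-attention pass with the composited features.
    q = composite_attention_features(q_stack, labels)
    k = composite_attention_features(k_stack, labels)
    v = composite_attention_features(v_stack, labels)
    attn = torch.softmax(q @ k.transpose(0, 1) * scale, dim=-1)  # (HW, HW)
    return attn @ v                                              # (HW, C)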
By supporting the ability to combine generated images, Generative Photomontage allows users to achieve a wider range of results. Our method applies to a variety of use cases; below, we highlight several of them and show compelling results for each.
Below, we show qualitative comparisons between our method and related work. In Interactive Digital Photomontage [Agarwala et al. 2004], the pixel-space graph cut may place seams on undesired edges, and its gradient-domain blending does not preserve color in general; e.g., the bird's yellow beak is not preserved in (f). Blended Latent Diffusion [Avrahami et al. 2023] and MasaCtrl+ControlNet [Cao et al. 2023] may also introduce color changes (c, f) and structural changes (a, b, d, e).
Our method assumes some spatial consistency among the images in the stack. In cases where the images differ significantly in scene structure, our method may produce semantically incorrect outputs. (a) The two images have different horizons in the background, and naively combining halves of the two images leads to an inconsistent horizon (bottom left, circled in red). Users can manually designate a consistent horizon by selecting the background of the second image (bottom middle). (b) Alternatively, users can add a horizon line to the input sketch given to ControlNet to make it consistent across both images.
Second, our current graph-cut parameters are chosen empirically to encourage congruous regions, which penalizes seam circumference. While this works well in many cases, if the target object has a highly curved outline, additional user strokes may be needed to obtain a finer boundary (see the example below). Since the graph cut solves in near real time (~1 s), users can quickly visualize the result in image space and iterate as needed.
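As a rough illustration of how such an energy could be set up (not the paper's exact costs or weights; the parameter names stroke_cost and seam_weight are hypothetical), the NumPy sketch below builds a data term that pins user-stroked pixels to their chosen image and a smoothness term that charges the K-feature difference plus a constant across every label change, so the constant acts as the seam-length (circumference) penalty. Minimizing this energy with an off-the-shelf alpha-expansion solver would yield the label map; the solver call itself is omitted.

import numpy as np

def build_energy(k_feats, strokes, stroke_cost=1e6, seam_weight=1.0):
    # k_feats: (N, H, W, C) self-attention K features for the N stack images
    # strokes: (H, W) int map; -1 where unmarked, else the index of the image
    #          the user's brush stroke selects at that pixel
    n, h, w, _ = k_feats.shape

    # Data term: stroked pixels strongly prefer the user-chosen image.
    unary = np.zeros((h, w, n))
    marked = strokes >= 0
    unary[marked] = stroke_cost              # high cost for every label...
    unary[marked, strokes[marked]] = 0.0     # ...except the one the user picked

    # Smoothness term: a label change across an edge costs the K-feature
    # difference plus a constant, so the constant penalizes total seam length.
    def pairwise(p, q, label_p, label_q):
        if label_p == label_q:
            return 0.0
        d = np.linalg.norm(k_feats[label_p, p[0], p[1]] - k_feats[label_q, q[0], q[1]])
        return d + seam_weight

    return unary, pairwise

In this sketch, lowering seam_weight would relax the circumference penalty and allow longer, curvier seams, at the cost of noisier boundaries.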
@article{generativephotomontage,
  author        = {Sean J. Liu and Nupur Kumari and Ariel Shamir and Jun-Yan Zhu},
  title         = {Generative Photomontage},
  journal       = {arXiv preprint arXiv:2408.07116},
  year          = {2024},
  eprint        = {2408.07116},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
}
We are grateful to Kangle Deng for his help with setting up the user survey. We also thank Maxwell Jones, Gaurav Parmar, and Sheng-Yu Wang for helpful comments and suggestions and Or Patashnik for initial discussions. This project is partly supported by the Amazon Faculty Research Award, DARPA ECOLE, the Packard Fellowship, and a joint NSFC-ISF Research Grant no. 3077/23. The website template is taken from CustomDiffusion.