Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention

1Purdue University, 2Toyota Research Institute

Visual and textual conceptual blending results of IT-Blender based on FLUX.1-dev.

Abstract

Blending visual and textual concepts into a new visual concept is a unique and powerful trait of human beings that can fuel creativity. In practice, however, cross-modal conceptual blending for humans is prone to cognitive biases, such as design fixation, which leads to local minima in the design space. In this paper, we propose a T2I diffusion adapter, "IT-Blender," that can automate the blending process to enhance human creativity. Prior works related to cross-modal conceptual blending are limited either in encoding a real image without loss of detail or in disentangling the image and text inputs. To address these gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with our novel blended attention, IT-Blender encodes the real reference image without loss of detail and blends the visual concept with the object specified by the text in a disentangled way. Our experimental results show that IT-Blender outperforms the baselines by a large margin in blending visual and textual concepts, shedding light on a new application of image generative models to augment human creativity.

Key Contributions

IT-Blender is a T2I diffusion adapter that can automate the blending process of visual and textual concepts to enhance human creativity.

  1. Cognitively inspired creativity-supporting AI: IT-Blender is inspired by the selective projection process in human cognition, which compares similarities and selectively applies relevant features to blend multiple concepts. The automated cross-modal conceptual blending can assist exploration of the design space in creative domains such as product, character, fashion design, or advertising, where generating diverse, inspiring, and unconventional concepts is crucial.
  2. Model-agnostic native image encoding to preserve detailed visual concepts from a reference image: We leverage the denoising network (both UNet-based and DiT-based) as an image encoder to maintain the details of visual concepts.
  3. Superior performance in applying visual concepts in a disentangled way from the textual concept: We design a novel Blended Attention that enables disentangled blending of textual semantics and detailed visual features, such as texture, material, color, and local shape.

Method

[Figure: method overview, panels (a)-(d)]

(a) Image Cross-Attention (imCA) approaches with an inversion method. These methods excel at applying the details of visual concepts in a disentangled manner, but their performance degrades when real images are given as input due to the distribution shift of the inversion chain.
(b) Naïve imCA-based approach without an inversion method. It can potentially encode the details of the clean image better than the noisy images used by inversion-based methods, but its performance would be poor because of a significant distribution shift: the reference latents come from a clean image with t = 0, while the noisy latents come from noisy images with t ≥ 0.
(c) IT-Blender. It bridges the distributional gap between the noisy stream and the clean reference stream by learning how to map the clean latents (Z_ref) to the noisy latents (Z_noisy) in the projection space. The standard denoising objective is used for both Stable Diffusion and FLUX. A comparison between (b) and (c) is provided below.
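The distribution gap described in (b) can be made concrete with a standard DDPM-style forward process: the reference latents sit at t = 0 (no noise), while the generation stream operates on latents at t ≥ 0. The following is a minimal NumPy sketch of that forward process; the function name `forward_noise` and the linear toy schedule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def forward_noise(z0, t, alpha_bar, rng):
    """DDPM-style forward process q(z_t | z_0).

    At t = 0 (alpha_bar = 1) this returns the clean latents unchanged,
    which is the regime of the reference stream; at t >> 0 the latents
    are heavily noised, which is the regime of the generation stream.
    """
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
alpha_bar = np.linspace(1.0, 0.01, 1000)  # toy noise schedule (assumption)
z0 = rng.standard_normal((4, 4))          # clean reference latents, t = 0
z_noisy = forward_noise(z0, 500, alpha_bar, rng)  # generation stream, t >> 0
```

Because the two streams live at different noise levels, naively sharing features between them (as in (b)) mixes mismatched distributions; IT-Blender instead learns the mapping between them in a projection space.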

[Equation: IT-Blender training objective]

(d) Blended Attention. The left branch is the original pretrained self-attention module, which keeps the estimation on the original trajectory. The right branch, connected via imCA, is the key to blended attention: it dynamically aligns SA(Z_noisy) and SA(Z_ref) in the output space of self-attention to fetch useful visual concepts from the reference stream, driven by the query from the noisy stream. The equation is as follows:

[Equation: blended attention formulation]
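The description in (d) can be sketched as follows: the self-attention output of the noisy stream is kept intact, and an imCA branch (query projected from the noisy stream, key/value projected from the reference stream) adds reference features on top. This is a minimal NumPy sketch under stated assumptions; the projection matrices Wq/Wk/Wv and the `scale` blending weight are hypothetical names, and the actual formulation is given by the equation above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def blended_attention(sa_noisy, sa_ref, Wq, Wk, Wv, scale=1.0):
    """Sketch of blended attention operating on self-attention outputs.

    sa_noisy: SA output of the noisy generation stream, shape (N, d)
    sa_ref:   SA output of the clean reference stream, shape (M, d)
    Wq/Wk/Wv: learned imCA projections (hypothetical names)
    """
    q = sa_noisy @ Wq  # query driven by the noisy stream
    k = sa_ref @ Wk    # key/value fetched from the reference stream
    v = sa_ref @ Wv
    # original branch kept intact + imCA branch adding visual concepts
    return sa_noisy + scale * attention(q, k, v)
```

Note the design choice this sketch mirrors: because the pretrained self-attention output is passed through unchanged, setting the blending weight to zero recovers the original model's trajectory exactly.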

Baseline Comparisons

Baseline comparisons with StableDiffusion 1.5

[Figure: baseline comparison with StableDiffusion 1.5]

Baseline comparisons with FLUX.1-dev

[Figure: baseline comparison with FLUX.1-dev]

Feasible Design

When a reference image is semantically close to a text prompt.

Imaginative Design

When a reference image is semantically far from a text prompt.

Societal Impact


Positive societal impact. IT-Blender can augment human creativity, especially for people in creative industries, e.g., design and marketing. With IT-Blender, designers may be able to reach better final design outcomes by exploring a wider design space in the ideation stage.
Negative societal impact. IT-Blender can be used to apply the design of an existing product to new products. Users must be aware that they may infringe on a company's intellectual property if a specific texture pattern or material combination is registered. We encourage users to use IT-Blender to augment creativity in the ideation stage, rather than to directly produce a final design outcome.

BibTeX

@article{cho2025imagineforme,
        title={Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention},
        author={Wonwoong Cho and Yanxia Zhang and Yan-Ying Chen and David I. Inouye},
        journal={arXiv preprint arXiv:2506.24085},
        year={2025}
      }