# DIFFUSION MODELS FOR MULTI-MODAL GENERATIVE MODELING

Changyou Chen (1,2), Han Ding (2), Bunyamin Sisman (2), Yi Xu (2), Ouye Xie (2), Benjamin Yao (2), Son Tran (2), Belinda Zeng (2)

(1) University at Buffalo, (2) Amazon

Diffusion-based generative modeling has been achieving state-of-the-art results on various generation tasks. Most diffusion models, however, are limited to modeling a single generation task. Can we generalize diffusion models with the ability of multi-modal generative training for more generalizable modeling? In this paper, we propose a principled way to do so by constructing a unified multi-modal diffusion model in a common diffusion space. We define the forward diffusion process to be driven by an information aggregation from multiple types of task data, e.g., images for a generation task and labels for a classification task. In the reverse process, we enforce information sharing by parameterizing a shared backbone denoising network with additional modality-specific decoder heads. Such a structure can simultaneously learn to generate different types of multi-modal data with a multi-task loss, which is derived from a new multi-modal variational lower bound that generalizes the standard diffusion model. We propose several multi-modal generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling. Extensive experimental results on ImageNet indicate the effectiveness of our framework for various multi-modal generative modeling tasks, which we believe is an important research direction worthy of further exploration.

1 INTRODUCTION

The field of artificial intelligence (AI) has witnessed significant advancements in generative modeling, leading to remarkable progress such as DALL-E (Ramesh et al., 2022) and GPT-4 (OpenAI, 2023). The generative AI paradigm enables the learning of transitions from simple to complex distributions, such as from a standard Gaussian distribution to a high-dimensional image distribution. Compared to discriminative learning, generative mechanisms can arguably prioritize the overall structure of the data, offering better data fitting and potential robustness to data noise.

However, while real-world applications often involve data of multiple types (multi-modal), including images, video, text, and labels, most existing generative models primarily focus on generating a single data type or modality. Notably, the diffusion model (Sohl-Dickstein et al., 2015; Ho et al., 2020), a state-of-the-art generative model, has been independently developed for generating image, text, audio, and label data (Dhariwal & Nichol, 2021; Li et al., 2022b; Liu et al., 2023; Han et al., 2022). Can we design a principled way to jointly model and generate multi-modal data within the diffusion-model framework?

Furthermore, leveraging multi-modal information by learning from multiple tasks and data sources has proven highly effective for learning generalized representations. Prominent examples include the ALBEF and BLIP models, which jointly learn from multi-modal data to match images and text (Li et al., 2021; 2022a; 2023), and the BERT model, which benefits from multi-task training such as masked-token prediction and next-sentence prediction (Devlin et al., 2019b).
Can we adopt a similar setting to bring multi-modal data and losses into the diffusion-model framework, so as to better integrate shared information among tasks for better generative modeling? In this paper, we present our initial endeavor towards this goal by introducing the multi-modal diffusion model with multi-task learning, referred to as MT-Diffusion. MT-Diffusion enables simultaneous modeling and generation of multi-modal data with a unified diffusion model.

Figure 1: Illustration of the proposed MT-Diffusion on two modalities. The diffusion process is defined in a shared diffusion space for all modality data, which are transformed by modality-specific encoders. The forward noising process includes a forward aggregation step that integrates information from the multi-modal data, and the reverse denoising component transforms the diffusion space back to the task-specific data spaces with learnable decoders through a multi-task loss.

By multi-task, we emphasize that MT-Diffusion is designed to 1) simultaneously generate multi-modal data (potentially heterogeneous, such as images and labels) within a unified model; and 2) seamlessly integrate multi-task learning losses into the diffusion framework in a principled manner, supported by theoretical foundations. Our multi-task setting is versatile and applicable to numerous practical scenarios. It is worth noting that, as an initial investigation into multi-modal diffusion models, we only focus on two modalities to demonstrate the promise of this research direction, while leaving training with more modalities as interesting future work. In particular, we construct several practical multi-task generative learning scenarios in the experiments to demonstrate the effectiveness of our framework:

Image transition: We consider jointly modeling multi-modal data, such as images and the corresponding semantic segmentation masks, by learning to generate both in the reverse process within our MT-Diffusion framework. We design this task as a synthetic experiment to qualitatively demonstrate the ability of our model on small-scale datasets.

Masked-image training*: Motivated by the previous success of masked-language pretraining such as BERT (Devlin et al., 2019a) in language modeling, we propose to combine a pure generation task with a masked-image generation task for generative training. We demonstrate on the ImageNet dataset that our model can be more efficient in training a generative model, and can converge to a point comparable to (if not better than) the heavily tuned single-task diffusion model in terms of generated image quality. Furthermore, it simultaneously obtains, for free, a strong image-restoration ability for masked-image recovery.

Joint image-label generation: We jointly model images and the corresponding labels by learning to generate both with MT-Diffusion. We demonstrate on the ImageNet dataset that one can achieve better classification accuracy compared to pure supervised training.

Joint image-representation generation: We also investigate simultaneously learning to generate images and representations (e.g., CLIP representations (Radford et al., 2021)) with MT-Diffusion.
As this is a larger-scale setting based on Stable Diffusion, we only provide qualitative results to demonstrate the ability of our model to generate high-quality images from text, while leaving more detailed investigations as interesting future work.

Our solution for these multi-modal generation problems is a novel generalization of the standard diffusion model, designed to handle data from multiple modalities through innovative algorithm and architecture designs in both the diffusion forward and reverse processes. Our general idea is illustrated in Figure 1. In the forward process, multi-modal/multi-task data are first aggregated through well-designed mechanisms (details in Section 2.2.2) so that the aggregated information can be conveniently applied to the forward noising operation of a diffusion model. To deal with potentially heterogeneous data, an effective encoder architecture design is proposed to encode multi-modal data into a shared diffusion space. In the reverse process, we propose to extend the original U-Net architecture in diffusion models to simultaneously reconstruct the multi-modal data from the different tasks. To this end, modality-specific decoder heads are attached to the U-Net architecture to decode the diffusion latent code back to the multi-modal data spaces. The forward and reverse processes are then integrated within the diffusion mechanism, leading to a multi-task loss derived from a new multi-modal evidence lower bound (ELBO). Extensive experiments on the aforementioned problems are conducted to verify the effectiveness of our framework, demonstrating that our model can achieve simultaneous generation without hurting individual task performance, a promising generalization of the standard diffusion model for multi-modal generative learning.

*The recent DiffMAE (Wei et al., 2023) is fundamentally different from ours: it is a standard conditional diffusion model that denoises the pre-masked region, whereas ours models both image and mask generation. The default U-Net architecture is adopted, though the Transformer (Peebles & Xie, 2022) can also be used.

2 MULTI-MODAL DIFFUSION MODELS

2.1 PRELIMINARIES ON DENOISING DIFFUSION PROBABILISTIC MODELS (DDPM)

DDPM is a probabilistic generative model that consists of a forward noising process and a reverse denoising process operating on a diffusion space. The forward process gradually adds Gaussian noise to the data, which ultimately becomes indistinguishable from standard Gaussian samples; the reverse process parameterizes a neural network model to reverse the forward process. Specifically, given a data sample $z_0$ from the data distribution, the forward transition from time $t-1$ to time $t$ is defined as $q(z_t \mid z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I)$, where $z_t$ represents a noisy version of the original data sample $z_0$ at time $t$, and $\{\beta_t\}$ is an increasing noise schedule chosen so that $z_T$ converges to a standard Gaussian sample. The reverse process is modeled by a neural network (we consider a U-Net for image data) parameterized by $\theta$ as $p_\theta(z_{t-1} \mid z_t) = \mathcal{N}(z_{t-1}; \mu_\theta(z_t, t), \Sigma_\theta(z_t, t))$. Considering all time steps $t = 1, \ldots, T$, the forward and reverse processes define two joint distributions over the same set of random variables $\{z_0, \ldots, z_T\}$. By the variational principle, a loss corresponding to the negative evidence lower bound (ELBO) can be derived to optimize the parameterized generative model $\theta$:

$$\mathcal{L} = \mathbb{E}_{q(z_0, \ldots, z_T)}\left[-\log p(z_T) - \sum_{t \geq 1} \log \frac{p_\theta(z_{t-1} \mid z_t)}{q(z_t \mid z_{t-1})}\right].$$
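To make these preliminaries concrete, the following is a minimal, self-contained sketch (not the authors' code) of DDPM forward sampling from $q(z_t \mid z_0)$ and the noise-prediction "simple" loss that the ELBO above is commonly reduced to; the schedule values and the placeholder `eps_model` are illustrative assumptions.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)       # \bar{alpha}_t = prod_s alpha_s

def q_sample(z0, t, eps):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(abar_t) z0, (1 - abar_t) I)."""
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps

def ddpm_simple_loss(eps_model, z0):
    """E_{t, eps} || eps_theta(z_t, t) - eps ||^2, the standard simple DDPM objective."""
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    zt = q_sample(z0, t, eps)
    return ((eps_model(zt, t) - eps) ** 2).mean()

# toy usage with a stand-in for the U-Net noise predictor
eps_model = lambda zt, t: torch.zeros_like(zt)
print(ddpm_simple_loss(eps_model, torch.randn(4, 3, 32, 32)))
```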
2.2 MULTI-MODAL DIFFUSION MODELS

We propose the MT-Diffusion model to jointly model multi-modal data with multi-task learning by generalizing the DDPM framework. We assume each task is associated with one data modality (the data can be the same for different tasks). For example, an unconditional image-generation task is associated with image data, and an image-classification task with image-label paired data. Suppose there are $N$ modalities, where modality $i$ is associated with task data from space $\mathcal{X}_i$. Let $x_i$ denote one data sample from the $i$-th modality space, and let $X \triangleq \{x_1, \ldots, x_N\}$ be the union of data from the $N$ modalities. We note that our setting is quite general in the sense that the data spaces $\{\mathcal{X}_i\}$ can be heterogeneous, e.g., the image space versus the image-label space in our previous example. To deal with the potential heterogeneity of the modality-data spaces, we propose to define MT-Diffusion in a shared latent space, called the diffusion space and denoted as $\mathcal{Z}$. To this end, we apply a mapping to project each of the original modality-data spaces onto the shared diffusion space. We define the mapping with an encoder $E_i$ for task $i$, i.e., $E_i: \mathcal{X}_i \rightarrow \mathcal{Z}$, as illustrated in Figure 1. For simplicity, we consider non-parametric or fixed-parameter encoders. The specific encoder designs are detailed in Section 2.2.4. In the following, we formally define the proposed MT-Diffusion by specifying the forward and reverse processes, as well as deriving the corresponding variational lower bound, by extending the DDPM framework to handle multiple data sources and multi-task losses.

2.2.1 FORWARD-REVERSE PROCESSES AND THE VARIATIONAL LOWER BOUND

Figure 2: The forward (left) and reverse (right) processes of the proposed MT-Diffusion, jointly modeling a set of task data.

In our design, the forward and reverse processes are responsible for integrating multi-modal data information and multi-task losses within the DDPM framework, respectively. This is implemented by first defining joint distributions over the data modalities and the diffusion latent variables in both the forward and reverse processes. Specifically, in the forward process, the noising transition from time $t-1$ to time $t$ is defined to be conditioned on the modality data. To this end, we define a joint distribution at time $t$ over the data $X = \{x_1, \ldots, x_N\}$ and the diffusion latent variable $z_t$, conditioned on information from time $t-1$, with the following decomposed form:

$$q(z_t, X \mid z_{t-1}) = q(z_t \mid z_{t-1}, x_1, \ldots, x_N) \prod_{i=1}^N q_i(x_i), \quad (1)$$

where $q(z_t \mid z_{t-1}, x_1, \ldots, x_N)$ represents the transition distribution of $z_t$ from time $t-1$ to time $t$, and $\{q_i(x_i)\}$ denotes prior distributions of the modality data that we assume to be mutually independent for simplicity. (We also assume the modality data are time-independent, although it is feasible to introduce time dependency.) We denote this process as forward aggregation, and the specific probability distributions will be defined in the next section. Furthermore, the reverse process is defined by simply reversing the forward distributions, resulting in a joint distribution $p_\theta(z_{t-1}, X \mid z_t)$ at time $t$, where $\theta$ denotes the reverse model parameters. Specifically, starting by sampling $z_T$ from $p(z_T)$, we decompose the reverse transition at time $t$ into the following conditional distributions:

$$p_\theta(z_{t-1}, X \mid z_t) = p_\theta(z_{t-1} \mid z_t) \prod_{i=1}^N p_\theta(x_i \mid z_t).$$

The random-variable dependencies and the general forward-reverse processes are illustrated in Figure 2.
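As a hedged illustration of the shared diffusion space introduced above, the sketch below defines two hypothetical encoders $E_i$: an identity map for a modality that already lives in $\mathcal{Z}$ (images) and a fixed, non-trainable label embedding broadcast to the latent shape. The paper's actual encoder choices (identity map, pretrained generator, or U-Net cross-attention) are described in Section 2.2.4; all shapes here are assumptions.

```python
import torch

LATENT_SHAPE = (3, 32, 32)   # assumed shared diffusion space Z
NUM_CLASSES = 10             # hypothetical label space

class IdentityEncoder:
    """E_i for a modality that already lives in Z (e.g., images)."""
    def __call__(self, x):
        return x

class LabelEncoder:
    """E_i for labels: a fixed random embedding table, broadcast over Z (toy choice)."""
    def __init__(self, num_classes, latent_shape, seed=0):
        g = torch.Generator().manual_seed(seed)
        c, h, w = latent_shape
        self.table = torch.randn(num_classes, c, generator=g)  # one fixed vector per class
        self.latent_shape = latent_shape

    def __call__(self, y):
        c, h, w = self.latent_shape
        return self.table[y].view(-1, c, 1, 1).expand(-1, c, h, w)

encoders = {"image": IdentityEncoder(), "label": LabelEncoder(NUM_CLASSES, LATENT_SHAPE)}
x_img = torch.randn(4, *LATENT_SHAPE)
y = torch.randint(0, NUM_CLASSES, (4,))
e_img, e_lbl = encoders["image"](x_img), encoders["label"](y)
assert e_img.shape == e_lbl.shape   # both encodings live in the shared diffusion space
```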
Before specifying these distributions, we first derive an objective by matching the joint distributions of the forward and reverse processes. This results in a multi-task ELBO for the proposed MT-Diffusion, based on which a final loss is defined in Section 2.2.5.

Theorem 1. The negative ELBO of MT-Diffusion is $\mathcal{L} = \mathbb{E}_q[L_0 + L_1 + L_2 + L_3]$, where

$$L_0 \triangleq \mathrm{KL}\left(q(z_T \mid z_0, X) \,\|\, p(z_T)\right), \qquad L_1 \triangleq \sum_{t>1} \mathrm{KL}\left(q(z_{t-1} \mid z_0, z_t, X) \,\|\, p_\theta(z_{t-1} \mid z_t)\right), \quad (2)$$
$$L_2 \triangleq \sum_{t \geq 1} \sum_{i=1}^N \mathrm{KL}\left(q_i(x_i) \,\|\, p_\theta(x_i \mid z_t)\right), \qquad L_3 \triangleq -\log p_\theta(z_0 \mid z_1).$$

Remark 1. The prior multi-modal data distributions appear in the loss term $L_2$. If only a single generation task is considered, the sub-loss $L_2$ disappears, reducing the objective to the standard DDPM loss. Our multi-modal diffusion objective defines the posterior of the transition probability $q(z_{t-1} \mid z_t, X)$ by conditioning on all modality data (in $L_1$) and, additionally, as formulated in $L_2$, parameterizes the reverse process to regularize the predicted modality-data distribution $p_\theta(x_i \mid z_t)$ so that it matches the prior modality-data distribution $q_i(x_i)$.

2.2.2 FORWARD AGGREGATION

The forward aggregation mainly deals with the posterior transition probability $q(z_{t-1} \mid z_0, z_t, X)$ in $L_1$ of equation 2. To derive an explicit form, we start by specifying the forward transition probability $q(z_t \mid z_{t-1}, X)$, which consequently induces the marginal distribution $q(z_t \mid z_0, X)$ as well as the posterior transition probability. To integrate the different task information, we define the forward transition distribution as a Gaussian that aggregates the task information into its mean parameter. Specifically, we define

$$q(z_t \mid z_{t-1}, X) = \mathcal{N}\!\left(z_t;\; \sqrt{\tfrac{\alpha_t'}{N+1}}\left(z_{t-1} + \sum_{i=1}^N w_t^{(i)} E_i(x_i)\right),\; \left(1 - \tfrac{\alpha_t'}{N+1}\right) I\right), \quad (3)$$

where $w_t^{(i)}$ denotes the weight of the $i$-th modality representation at time $t$, and $\{\alpha_t'\}$ are weights that scale the mean and covariance of the Gaussian transition, similar to DDPM. With the change of notation $\alpha_t \triangleq \alpha_t'/(N+1)$, the transition distribution can be rewritten as $q(z_t \mid z_{t-1}, X) = \mathcal{N}\left(z_t;\; \sqrt{\alpha_t}\,(z_{t-1} + w_t E(x)),\; (1-\alpha_t) I\right)$, where $w_t E(x)$ is shorthand for $\sum_{i=1}^N w_t^{(i)} E_i(x_i)$ (see Appendix B); we use this form in the following derivations and implementation. With these transition distributions, multi-task information can be seamlessly incorporated into the diffusion process, which effectively translates to the reverse process with a parametric model to be defined in Section 2.2.3. We can now derive the marginal and posterior transition distributions, which turn out to also be simple Gaussian distributions, as stated in Theorem 2.

Theorem 2. Given the transition distribution in equation 3, the marginal transition distribution follows

$$q(z_t \mid z_0, X) = \mathcal{N}\!\left(z_t;\; \sqrt{\bar\alpha_t}\, z_0 + \sum_{i=1}^N \bar\alpha_t^{(i)} E_i(x_i),\; (1-\bar\alpha_t) I\right), \quad (4)$$

where $\bar\alpha_t \triangleq \prod_{s=1}^t \alpha_s$, and $\bar\alpha_t^{(i)}$ is recursively defined as $\bar\alpha_t^{(i)} = \sqrt{\alpha_t}\,\big(w_t^{(i)} + \bar\alpha_{t-1}^{(i)}\big)$ with $\bar\alpha_0^{(i)} \triangleq 0$. Furthermore, the posterior transition follows $q(z_{t-1} \mid z_0, z_t, X) = \mathcal{N}\big(z_{t-1};\; \tilde\mu_t(z_t, X),\; \tilde\beta_t I\big)$, where

$$\tilde\mu_t(z_t, X) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, z_t + (1-\alpha_t)\sqrt{\bar\alpha_{t-1}}\, z_0 + \sum_{i=1}^N \big((1-\alpha_t)\bar\alpha_{t-1}^{(i)} - \alpha_t(1-\bar\alpha_{t-1}) w_t^{(i)}\big) E_i(x_i)}{1-\bar\alpha_t} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\right) - \sum_{i=1}^N w_t^{(i)} E_i(x_i), \quad (5)$$

and $\tilde\beta_t = \frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}$.

Remark 2. The posterior transition distribution in equation 5 shares a similar form with that of DDPM, with an extra term $\sum_{i=1}^N w_t^{(i)} E_i(x_i)$ representing the information aggregated from all tasks (hence aggregation). Note the aggregation is defined in the forward process, enabling a closed-form posterior without losing too much modeling expressiveness compared to more complex aggregation schemes.
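A minimal sketch (with assumed schedules and modality weights, not the authors' implementation) of the forward aggregation: it builds $\bar\alpha_t$ and the recursion $\bar\alpha_t^{(i)} = \sqrt{\alpha_t}(w_t^{(i)} + \bar\alpha_{t-1}^{(i)})$ from Theorem 2 and samples $z_t$ from the marginal $q(z_t \mid z_0, X)$ in equation 4.

```python
import torch

T, N = 1000, 2                                  # timesteps and number of modalities (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)        # \bar{alpha}_t

w = torch.full((T, N), 0.1)                     # assumed modality weights w_t^{(i)}

# recursion from Theorem 2: \bar{alpha}_t^{(i)} = sqrt(alpha_t) (w_t^{(i)} + \bar{alpha}_{t-1}^{(i)})
alpha_bar_i = torch.zeros(T, N)
prev = torch.zeros(N)
for t in range(T):
    prev = alphas[t].sqrt() * (w[t] + prev)
    alpha_bar_i[t] = prev

def q_sample_multimodal(z0, enc, t):
    """Sample z_t ~ N(sqrt(abar_t) z0 + sum_i abar_t^{(i)} E_i(x_i), (1 - abar_t) I).

    enc: list of N tensors E_i(x_i), each with the same shape as z0."""
    ab = alpha_bar[t]
    mean = ab.sqrt() * z0 + sum(alpha_bar_i[t, i] * enc[i] for i in range(N))
    return mean + (1.0 - ab).sqrt() * torch.randn_like(z0)

z0 = torch.randn(3, 32, 32)
enc = [torch.randn_like(z0) for _ in range(N)]  # stand-ins for the modality encodings
zt = q_sample_multimodal(z0, enc, t=500)
```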
Figure 3: Training pipeline and encoder-decoder design choices. ①, ② and ③ indicate three possible choices for the encoder E(·); gray shaded boxes indicate stopped gradients; and black dashed lines indicate possible connections to the encoder and decoder. The aggregation is implemented through equation 4.

2.2.3 REVERSE PARAMETRIZATION

Based on the ELBO in Theorem 1, the reverse model is responsible for defining two sets of distributions: $p_\theta(z_{t-1} \mid z_t)$ and $p_\theta(x_i \mid z_t)$. The first distribution is similar to that in DDPM, and the second is induced by decoding the diffusion latent code back to the modality-data spaces. To leverage these distributions within a unified architecture for task-information sharing, we propose to parameterize the reverse model with a shared backbone network followed by $N$ extra modality heads, each corresponding to one modality. The basic structure is illustrated in Figure 1.

Specifically, for $p_\theta(z_{t-1} \mid z_t)$, we follow DDPM and define it as a Gaussian distribution with mean and covariance denoted as $\mu_\theta(z_t, X)$ and $\sigma_t^2 I$, respectively. Consequently, the KL divergence in $L_1$ of equation 2 reduces to matching the means of the two Gaussians with a proper weighting scheme depending on $t$. Based on the form of the mean of $q(z_{t-1} \mid z_0, z_t, X)$ in equation 5, instead of parameterizing the mean of $p_\theta(z_{t-1} \mid z_t)$ directly, we follow DDPM and parameterize the U-Net to predict the intrinsic noise in $z_t$ (denoted as $\epsilon$). Specifically, the parameterized U-Net model $\epsilon_\theta(z_t, t)$ is formulated as

$$\epsilon_\theta(z_t, t) = \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, z_0 + \sum_{i=1}^N \bar\alpha_t^{(i)} E_i(x_i) + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\right) \approx \epsilon.$$

For the decoding distributions $p_\theta(x_i \mid z_t)$ in the $L_2$ term of equation 2, the distribution forms are modality specific. We consider the following two cases in our experiments: 1) When the modality data are represented as probability vectors, e.g., labels in the classification task, we define $p_\theta(x_i \mid z_t)$ as a discrete distribution parameterized by the output of the modality head; the KL term in $L_2$ is then equivalent to the cross-entropy loss. 2) When the modality data take continuous values, we define $p_\theta(x_i \mid z_t)$ as a Gaussian distribution with its mean parameterized by the output of the modality head; the KL divergence in $L_2$ then reduces to an MSE loss, similar to the case of $p_\theta(z_{t-1} \mid z_t)$.

It is worth noting that, different from $p_\theta(z_{t-1} \mid z_t)$, the variables $x_i$ and $z_t$ in $p_\theta(x_i \mid z_t)$ can live in different feature spaces. Thus, a decoder $D_i(\cdot)$, in the form of one of the modality heads specified above, is applied to project the latent code $z_t$ back to the modality-data space, based on which a proper $p_\theta(x_i \mid z_t)$ is defined, as illustrated in Figure 1. Specifically, the decoding process at time $t$ can be written as $z_t \xrightarrow{\text{diffusion denoising}} c_t \triangleq \text{U-Net}(z_t, t; \theta) \xrightarrow{\text{task-}i\text{ decoding}} \hat{x}_i \triangleq D_i(c_t; \theta)$, where U-Net$(\cdot)$ denotes the output of one particular component of the U-Net, serving as the input to the decoder head (see Section 2.2.4 for details). In other words, the reverse parameterized model consists of two parts: $\epsilon_\theta(z_t, x, t)$ and $D_i(c_t; \theta)$. Detailed structure designs for integrating the decoders (together with the encoder E(·) in the forward process) into the shared U-Net backbone are discussed in the next section.
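The reverse parameterization can be sketched as follows, with a toy convolutional backbone standing in for the time-conditioned U-Net and two hypothetical modality heads (class logits for a discrete modality, a convolutional head for a continuous one). Module names and shapes are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class MTReverseModel(nn.Module):
    """Shared backbone + modality-specific decoder heads (toy stand-in for the U-Net)."""
    def __init__(self, channels=3, hidden=64, num_classes=10):
        super().__init__()
        # shared "backbone" (a real implementation would be a time-conditioned U-Net)
        self.backbone = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
        )
        self.eps_head = nn.Conv2d(hidden, channels, 3, padding=1)        # predicts epsilon
        # modality heads D_i attached to the shared feature c_t
        self.label_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(hidden, num_classes))  # class logits
        self.image_head = nn.Conv2d(hidden, channels, 3, padding=1)      # continuous decode

    def forward(self, zt, t):
        # t would condition the real U-Net; it is unused in this toy backbone
        c_t = self.backbone(zt)                 # shared intermediate feature c_t
        eps_pred = self.eps_head(c_t)           # for p_theta(z_{t-1} | z_t)
        logits = self.label_head(c_t)           # for p_theta(x_label | z_t)
        img_rec = self.image_head(c_t)          # for p_theta(x_image | z_t)
        return eps_pred, {"label": logits, "image": img_rec}

model = MTReverseModel()
eps_pred, decoded = model(torch.randn(4, 3, 32, 32), t=torch.full((4,), 500))
```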
2.2.4 ENCODER-DECODER DESIGNS

The encoders map the different modality data onto the diffusion space, and the decoders project the diffusion latent code from the shared U-Net backbone back to the modality-data spaces. As the encoders are associated with the forward process, we propose to avoid introducing extra trainable parameters into the encoders for simplicity. By contrast, we introduce trainable parameters in the decoders, as they are part of the parameterized reverse model. There are many possible design choices for the encoders and decoders. Our guideline is to choose architectures that reuse existing components or pretrained models as much as possible. Based on this principle, we recommend the following designs, with the detailed training pipeline and architectures illustrated in Figure 3.

Encoder Design We consider three scenarios, indicated by ①, ② and ③ in Figure 3: 1) A modality-data space is the same as the diffusion space, e.g., both are image spaces. In this case, we can define the encoder as a simple mapping such as the identity mapping (① in Figure 3). 2) A modality-data space is inhomogeneous with the diffusion space, e.g., a label space versus an image space. In this case, we propose to use either a pretrained generator (② in Figure 3) or the shared U-Net backbone (③ in Figure 3) to transfer modality-data information to the diffusion space. In particular, for choice ③, we use the cross-attention mechanism in the U-Net architecture to map modality-data information to the diffusion space.

Decoder Design The decoders are modality and task specific. They accept outputs from one of the U-Net blocks (indicated by the black dashed connections in Figure 3) and learn to generate the original modality data. For example, in a classification task, the decoder is designed as a classifier that outputs a class label, associated with a cross-entropy loss.

2.2.5 TRAINING AND INFERENCE

Training We propose a simple training loss for MT-Diffusion, based on the ELBO in equation 2 and the specific forward and reverse parameterizations described above. In the ELBO, $L_0$ is independent of the model parameters $\theta$ and can thus be omitted in training. By substituting the specific distributions into the ELBO and adopting the simple-loss idea of DDPM (Ho et al., 2020) that ignores the weights for different timesteps, the ELBO in equation 2 reduces to the following training loss for the proposed MT-Diffusion:

$$\mathcal{L} \triangleq \mathcal{L}_{\text{mse}} + \lambda \sum_{i=1}^N \mathrm{KL}\left(q_i(x_i) \,\|\, p_\theta(x_i \mid z_t)\right), \quad \text{where } z_t \sim q(z_t \mid z_0, X), \quad (6)$$

and $\mathcal{L}_{\text{mse}} \triangleq \mathbb{E}_q\left[\sum_{t>1} \|\epsilon_\theta(z_t, t) - \epsilon\|^2 - \log p_\theta(z_0 \mid z_1)\right]$ has the same form as the simple loss in DDPM; $\lambda$ is a weight scalar (set to 0.1 in our experiments). The KL terms above can take closed forms depending on the modality-specific $q_i(x_i)$. For example, in a classification task with $x_i$ representing labels, both $q_i(x_i)$ and $p_\theta(x_i \mid z_t)$ are defined as discrete distributions, making the KL divergence equivalent to the cross entropy. We apply stochastic optimization for model learning. At each iteration, a random timestep $t$ is first sampled; the corresponding $z_t$ is then sampled from the forward process with task-information aggregation. We feed $z_t$ to the reverse model to predict the forward noise and the modality data. Finally, gradient descent is applied to update the model parameters based on the loss in equation 6.
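A hedged sketch of one stochastic training step for the loss in equation 6: the MSE term instantiates $\mathcal{L}_{\text{mse}}$, a cross-entropy term instantiates the KL for a label modality, and an MSE reconstruction term instantiates the KL for a continuous modality. The forward aggregation is simplified to constant weights (the exact per-timestep weights are given in Theorem 2), and the model/encoder interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas, alpha_bar = 1.0 - betas, torch.cumprod(1.0 - betas, dim=0)
lam = 0.1   # weight on the modality KL terms (the paper sets lambda = 0.1)

def train_step(model, encoders, x_img, y, w=0.1):
    """One stochastic step of L = L_mse + lambda * sum_i KL(q_i || p_theta(x_i | z_t))."""
    b = x_img.shape[0]
    t = torch.randint(0, T, (b,))
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x_img)

    # forward aggregation, simplified: constant weight w in place of \bar{alpha}_t^{(i)}
    agg = sum(w * e(x) for e, x in zip(encoders, (x_img, y)))
    zt = ab.sqrt() * x_img + agg + (1.0 - ab).sqrt() * eps

    eps_pred, decoded = model(zt, t)
    loss_mse = F.mse_loss(eps_pred, eps)                       # L_mse term
    loss_label = F.cross_entropy(decoded["label"], y)          # KL for the label modality
    loss_image = F.mse_loss(decoded["image"], x_img)           # KL for a continuous modality
    return loss_mse + lam * (loss_label + loss_image)

# toy usage with stand-ins for the encoders and the reverse model
enc_img = lambda x: x                                          # identity image encoder
enc_lbl = lambda y: torch.zeros(y.shape[0], 3, 32, 32)         # dummy label encoder
toy_model = lambda zt, t: (torch.zeros_like(zt),
                           {"label": torch.zeros(zt.shape[0], 10),
                            "image": torch.zeros_like(zt)})
loss = train_step(toy_model, [enc_img, enc_lbl],
                  torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,)))
```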
Inference A distinguishing feature of our model is its ability to simultaneously generate multi-modal data. We propose a generic inference procedure that can achieve both unconditional generation (with all modality data X initially missing) and conditional generation (with parts of X initially known). The basic idea is to estimate the potentially missing modality data from the corresponding heads of the reverse-model outputs. The specific algorithm is summarized in Algorithm 1 in Appendix C.

3 RELATED WORK

Diffusion-based Models Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) have achieved state-of-the-art generative performance on a variety of applications, including image synthesis (Dhariwal & Nichol, 2021), text-to-image generation (Ramesh et al., 2021; Saharia et al., 2022; Yu et al., 2022; Rombach et al., 2022), audio generation (Kong et al., 2021; Liu et al., 2023), video generation (Ho et al., 2022; Harvey et al., 2022; Singer et al., 2023) and text generation (Austin et al., 2021; Li et al., 2022b; He et al., 2022; Gong et al., 2023). All these models, however, focus on a single generation task, in contrast to our multi-task generation.

Manipulating Diffusion Latent Spaces There have been significant efforts on manipulating the latent spaces of pretrained diffusion models for a variety of downstream tasks, including text-driven image editing, inpainting, and completion (Gal et al., 2022; Ruiz et al., 2022b; Cohen et al., 2022; Kawar et al., 2022; Meng et al., 2021b; Bau et al., 2021; Avrahami et al., 2022b;a; Bar-Tal et al., 2022; Lugmayr et al., 2022). There are also related works on learning a more discriminative latent space (Zhang et al., 2022; Preechakul et al., 2022) or manipulating a latent space for better representations (Kwon et al., 2023). These methods, however, are task-driven and do not provide a theoretical foundation for their working principles, while all the tasks considered in this literature can be accomplished within our MT-Diffusion framework by incorporating the corresponding task data. Our model is also related to guided diffusion, which is discussed in detail in Appendix D.

4 EXPERIMENTS

As a first work on multi-modal diffusion models, we focus on evaluating MT-Diffusion on the four multi-task learning settings mentioned in the Introduction, while leaving more complicated settings, such as incorporating more tasks into training, as interesting future work. We describe the detailed hyperparameter settings, the four tasks and the encoder-decoder designs in Appendix E. Among the four experiments, MT-Diffusion for image transition and joint image-representation generative modeling are mostly for illustration purposes; thus their results are mostly qualitative, e.g., Figure 4 shows some visualizations for image transition. More details are deferred to Appendix E.

Figure 4: Random samples on the night2day (top, unconditional generation) and cityscape (bottom, conditional generation) datasets. The 3 pictures in each block of the cityscape dataset (bottom) correspond to the conditional image (source), the ground truth and the inferred target image, respectively.

Figure 5: Randomly generated examples of MT-Diffusion with masked-image training. First row: image restoration from random masks; the images in each block are the original, masked and restored images. Second row: image restoration from half masking; each block contains two restored images to illustrate generation variance. Third row: image generation from scratch with complete masks.

4.1 MT-DIFFUSION FOR MASKED-IMAGE TRAINING

Table 1: LPIPS score for masked-image recovery. Mask-m means masking an image with m patches.

| Model | Mask-5 | Mask-10 | Mask-15 | Mask-20 |
| --- | --- | --- | --- | --- |
| Clean-Masked | 0.311 | 0.414 | 0.461 | 0.491 |
| SDEdit (Meng et al., 2022) | 0.400 | 0.466 | 0.497 | 0.513 |
| MT-Diffusion | 0.035 | 0.068 | 0.099 | 0.133 |
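Table 1 reports LPIPS distances between original and restored images (the comparison is discussed later in this section). As an illustration only, and assuming the publicly available lpips package rather than the authors' evaluation code, such a score can be computed as follows:

```python
import torch
import lpips  # pip install lpips

# AlexNet-based LPIPS, a common choice for perceptual similarity
loss_fn = lpips.LPIPS(net='alex')

def lpips_score(originals, restored):
    """Mean LPIPS over a batch; inputs are NCHW tensors scaled to [-1, 1]."""
    with torch.no_grad():
        d = loss_fn(originals, restored)   # per-image distances, shape (N, 1, 1, 1)
    return d.mean().item()

# toy usage with random images (a real evaluation would use clean/restored pairs)
x = torch.rand(4, 3, 128, 128) * 2 - 1
x_restored = (x + 0.05 * torch.randn_like(x)).clamp(-1, 1)
print(lpips_score(x, x_restored))
```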
Task description and encoder-decoder design We propose masked-image training, a new training strategy designed to improve image generation with MT-Diffusion. The task is motivated by the masked-language pretraining paradigm in NLP models such as BERT (Devlin et al., 2019a), which achieves significant success with multi-task training. Specifically, in addition to the standard image-generation task that generates images from noise, we define an additional task, called random inpainting, which learns to recover images from randomly masked ones. Consequently, the forward process is defined to start from a clean-masked image pair and gradually add noise to the pair. To create the masked images, we randomly sample m ~ Uniform(0, 10) patches of size 16×16 from the original image and mask them out with zero pixel values. The patches are placed randomly, so they might overlap with each other. We adopt two choices for the noise-prediction network $\epsilon_\theta(\cdot)$: the first only takes $(z_t, t)$ as input, denoted MT-Diffusion-U; the other takes $(z_t, X, t)$, denoted MT-Diffusion-X. The former is designed for unconditional generation, whereas the latter is more suitable for constrained image restoration, a task we define as inpainting a randomly masked image while keeping the unmasked region unchanged. Note that current state-of-the-art diffusion models do not directly deal with this problem: existing work on image editing and inpainting, such as SDEdit (Meng et al., 2021a) and DreamBooth (Ruiz et al., 2022a), does not explicitly enforce consistency of the unmasked region, so their inpainting results can change it. By simultaneously generating the clean-masked image pairs with MT-Diffusion, our model can maximally learn to maintain the consistency of the unmasked regions. Similar to the image-transition experiment, we use the identity map as the encoder, and replicate the output block of the original U-Net as the additional masked-image decoder, which consists of a normalization layer and a convolution layer.
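A minimal sketch of the random masking described above: m drawn uniformly from {0, ..., 10} patches of size 16×16 are placed at random, possibly overlapping, positions and zeroed out; boundary handling and exact sampling details are assumptions.

```python
import torch

def random_mask(img, max_patches=10, patch=16):
    """Zero out m ~ Uniform{0,...,max_patches} randomly placed patch x patch squares.

    img: (C, H, W) tensor with H, W >= patch; patches may overlap."""
    c, h, w = img.shape
    masked = img.clone()
    m = int(torch.randint(0, max_patches + 1, (1,)))
    for _ in range(m):
        top = int(torch.randint(0, h - patch + 1, (1,)))
        left = int(torch.randint(0, w - patch + 1, (1,)))
        masked[:, top:top + patch, left:left + patch] = 0.0
    return masked

x = torch.rand(3, 64, 64)
x_masked = random_mask(x)   # paired (clean, masked) inputs for the two generation tasks
```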
Table 2: Training efficiency comparisons on ImageNet-64.

| Model | IS | FID | sFID | Precision | Recall |
| --- | --- | --- | --- | --- | --- |
| ADM (un-cond) | 15.64 | 23.22 | 16.53 | 0.57 | 0.60 |
| ADM (class cond) | 22.69 | 16.35 | 17.49 | 0.59 | 0.62 |
| Generation with Masked-Image Training (Section 4.1) | | | | | |
| MT-Diffusion-U | 23.77 | 26.00 | 25.22 | 0.57 | 0.55 |
| MT-Diffusion-X | 34.53 | 9.85 | 15.78 | 0.68 | 0.64 |
| Generation with Joint Image-Label Modeling (Section 4.2) | | | | | |
| MT-Diffusion-M | 15.66 | 33.92 | 21.63 | 0.54 | 0.55 |
| MT-Diffusion-M (finetuned) | 26.31 | 13.48 | 16.64 | 0.66 | 0.61 |
| MT-Diffusion-E | 11.86 | 41.00 | 22.37 | 0.48 | 0.45 |
| MT-Diffusion-E (finetuned) | 24.78 | 11.28 | 16.08 | 0.74 | 0.56 |

Figure 6: Classification accuracies vs. supervised finetuning iterations on the ImageNet validation data.

Table 3: Generated image comparisons on ImageNet-128. The subscript g denotes classifier guidance; the superscript * on ADM indicates results from the released checkpoint, while ADM without * denotes continued training from the released checkpoint. MT-Diffusion denotes the MT-Diffusion-X version; the superscript f indicates continued finetuning on single-task image generation; and the superscript * on MT-Diffusion indicates results from the constrained image-restoration task with 10% random masking (not directly comparable with the other settings).

| Model | IS | FID | sFID | Precision | Recall |
| --- | --- | --- | --- | --- | --- |
| ADM* | 79.95 | 8.46 | 4.92 | 0.67 | 0.66 |
| ADM*_g | 151.10 | 3.56 | 4.63 | 0.79 | 0.57 |
| ADM | 74.04 | 9.62 | 5.61 | 0.66 | 0.64 |
| ADM_g | 140.89 | 4.27 | 5.58 | 0.79 | 0.55 |
| MT-Diffusion | 78.26 | 8.42 | 5.89 | 0.67 | 0.65 |
| MT-Diffusion_g | 81.06 | 8.06 | 5.83 | 0.68 | 0.65 |
| MT-Diffusion^f | 84.22 | 7.01 | 5.99 | 0.69 | 0.64 |
| MT-Diffusion^f_g | 171.65 | 3.51 | 5.73 | 0.83 | 0.53 |
| MT-Diffusion* | 135.35 | 2.15 | 3.86 | 0.72 | 0.68 |
| MT-Diffusion*_g | 138.21 | 2.02 | 3.84 | 0.73 | 0.69 |

Results We implement our method based on the guided-diffusion codebase (dif) on ImageNet (Deng et al., 2009). We first demonstrate that our framework can help to improve the training efficiency of image generation. To this end, we compare our models and the ADM models (Dhariwal & Nichol, 2021) before convergence, at 1M iterations and image resolution 32. For a fair comparison, all models are trained from scratch with exactly the same hyperparameters (thus the numbers are not directly comparable to the reported values). We adopt the same evaluation metrics as ADM, computed over 5K samples. Quantitative results are shown in Table 2. Note that MT-Diffusion-U is designed to generate images from complete random noise, while MT-Diffusion-X needs a conditional masked input, which we simply define as a complete mask (all zeros) for this purpose. Both variants significantly outperform the unconditional ADM, indicating the training efficiency and modeling effectiveness of multi-task generative learning via masked-image training. In addition, MT-Diffusion-X performs better than MT-Diffusion-U. We hypothesize this is because the conditional masked-image information makes training the model easier and more effective.

In addition to pure image generation, one unique property of MT-Diffusion-X is its ability to simultaneously perform constrained image restoration. To this end, we randomly mask out some testing images with m = {5, 10, 15, 20} patches and learn to restore the masked patches. We expect MT-Diffusion-X to restore masked images without changing the unmasked regions. Example results are illustrated in Figure 5, which clearly demonstrate the strong ability of MT-Diffusion-X for constrained image restoration as well as generation from scratch. For quantitative evaluation, we adopt the LPIPS score (Zhang et al., 2018), which measures the semantic similarity between the original and the restored images, and compare our method with a simple baseline from Meng et al. (2022), denoted SDEdit. The results are shown in Table 1, where the Clean-Masked row denotes the LPIPS scores of clean-masked image pairs, included for reference. It is clear that MT-Diffusion-X obtains scores close to zero, indicating the close similarity between the restored and the original images. SDEdit, on the other hand, obtains very high LPIPS scores, even higher than the Clean-Masked baseline. This is expected, since SDEdit is not specifically designed for such a task.

Finally, to demonstrate the ability of our model to generate high-quality images at convergence, we compare our method with ADM for pure image generation at a resolution of 128. We train our MT-Diffusion-X from scratch. We evaluate two versions of ADM, the released checkpoint and a version continually trained from that checkpoint with the same hyperparameters as our model, for a fairer comparison. The results, evaluated on 50K samples, are shown in Table 3.
Our method achieves performance comparable to ADM (if not better on some metrics such as the IS), while being able to perform additional tasks such as the constrained image restoration demonstrated above. It is also noted that the continually trained version of ADM (started from the released checkpoint) is slightly worse than the released checkpoint, indicating that the latter might have been tuned for best performance; it is therefore fairer to compare our method with the continually trained version of ADM.

4.2 MT-DIFFUSION FOR JOINT IMAGE-LABEL GENERATIVE MODELING

Task description and encoder-decoder design This is a more heterogeneous case, as labels and images live in different spaces. Following our design principle, we use the diffusion U-Net as the encoder for labels via cross-attention in the U-Net. The label decoder is an additional head attached to one layer of the U-Net. We consider two designs: 1) Add one additional MLP layer on the output of the middle block of the U-Net to map the diffusion latent space onto the label space. This introduces minimal extra parameters into the original reverse model, but might enforce discriminative information that is not helpful for pure image generation. We denote this variant MT-Diffusion-M. 2) Add a pre-defined classifier at the end of the U-Net output. Since the U-Net output can be used to reconstruct the original image, this structure essentially makes the learning of image generation and classification sequential. In the experiments, we use the classifier provided in the guided-diffusion codebase (dif) as the label decoder. We denote this variant MT-Diffusion-E.

Results We adopt the same experimental setting as in the previous section. In addition to measuring generated image quality, we also measure classification performance using the classifier in MT-Diffusion-E. We find that, for such a heterogeneous setting, continuing to finetune the generator and the classifier on the single generation and classification tasks, respectively, can significantly improve single-task performance. Consequently, we adopt an idea similar to classifier-free guidance to simultaneously learn a pure-generation model along with the multi-task training with the shared U-Net. We denote the finetuned models as MT-Diffusion-M (finetuned) and MT-Diffusion-E (finetuned). The results are reported in Table 2. It is observed that MT-Diffusion-M performs better than MT-Diffusion-E, and both slightly under-perform the single-task ADM. This is expected, and it indicates that learning to generate heterogeneous data can trade off single-task performance. We also believe the performance gap is partly due to the untuned, sub-optimal hyperparameter setting. With single-task finetuning, both variants outperform ADM in all metrics. To continue finetuning the classifier, we use the training and evaluation scripts from the codebase (dif), and compare with the pretrained classifier from classifier-guidance ADM (dif); we continue finetuning that classifier from the released checkpoint as a baseline. Top-1 and top-5 accuracies are plotted in Figure 6. Our classifier consistently outperforms the baseline, although the gap becomes smaller with more finetuning iterations.
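As a schematic illustration of the MT-Diffusion-M design described above (hypothetical feature shapes; the actual middle-block dimensions depend on the guided-diffusion U-Net), the label decoder can be a small MLP on the pooled middle-block feature, trained with the cross-entropy instantiation of the $L_2$ term:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiddleBlockLabelHead(nn.Module):
    """MLP label decoder on a pooled U-Net middle-block feature (MT-Diffusion-M style)."""
    def __init__(self, mid_channels=512, num_classes=1000):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(mid_channels, mid_channels), nn.SiLU(),
                                 nn.Linear(mid_channels, num_classes))

    def forward(self, mid_feat):
        # mid_feat: (B, C, H, W) activation from the U-Net middle block (assumed shape)
        pooled = mid_feat.mean(dim=(2, 3))      # global average pooling
        return self.mlp(pooled)                 # class logits

head = MiddleBlockLabelHead()
mid_feat = torch.randn(4, 512, 8, 8)            # stand-in for the middle-block activation
labels = torch.randint(0, 1000, (4,))
ce = F.cross_entropy(head(mid_feat), labels)    # cross-entropy KL term for the label modality
```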
5 CONCLUSION

We propose the multi-modal diffusion model with multi-task training, a generalization of the standard diffusion model for multi-modal generation. Our model is general and flexible, and can incorporate potentially heterogeneous modality information into a unified diffusion model, compared to training in a single-task setting. We define several multi-task generative problems and test them with our proposed MT-Diffusion. Extensive experiments are performed to verify the effectiveness of the proposed framework. Interesting future work includes improving the framework with better network-architecture designs and applying the method to more diverse multi-modal and multi-task settings.

REFERENCES

Guided diffusion. https://github.com/openai/guided-diffusion. Accessed: 2023-05-11.

Latent Diffusion Models. https://github.com/Jack000/glid-3-xl. Accessed: 2023-05-11.

Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=h7-XixPCAL.

Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022a.

Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18208-18218, 2022b.

Zhipeng Bao, Martial Hebert, and Yu-Xiong Wang. Generative modeling for multi-task visual learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 1537-1554. PMLR, 17-23 Jul 2022. URL https://proceedings.mlr.press/v162/bao22c.html.

Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision, pp. 707-723. Springer, 2022.

David Bau, Alex Andonian, Audrey Cui, Yeon Hwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word. arXiv preprint arXiv:2103.10951, 2021.

Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Semi-parametric neural image synthesis. In Advances in Neural Information Processing Systems, 2022.

Zitian Chen, Yikang Shen, Mingyu Ding, Zhenfang Chen, Hengshuang Zhao, Erik G. Learned-Miller, and Chuang Gan. Mod-squad: Designing mixture of experts as modular multi-task learners. arXiv, abs/2212.08066, 2022. URL https://api.semanticscholar.org/CorpusID:263793067.

Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo J. Kim, and Sung-Hoon Yoon. Perception prioritized training of diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11462-11471, 2022. URL https://api.semanticscholar.org/CorpusID:247922317.

Niv Cohen, Rinon Gal, Eli A Meirom, Gal Chechik, and Yuval Atzmon. "This is my unicorn, Fluffy": Personalizing frozen vision-language representations. arXiv preprint arXiv:2204.01694, 2022.

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
Bert: Pre-training of deep bidirectional transformers for language understanding. Ar Xiv, abs/1810.04805, 2019a. Published as a conference paper at ICLR 2024 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171 4186, Minneapolis, Minnesota, June 2019b. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https: //aclanthology.org/N19-1423. Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780 8794, 2021. Heshan Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, and Tianyi Chen. Mitigating gradient bias in multi-objective learning: A provably convergent approach. In International Conference on Learning Representations, 2023. URL https: //api.semanticscholar.org/Corpus ID:259268083. Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. ar Xiv preprint ar Xiv:2208.01618, 2022. Hyojun Go, Jinyoung Kim, Yunsung Lee, Seunghyun Lee, Shinhyeok Oh, Hyeongdon Moon, and Seungtaek Choi. Addressing negative transfer in diffusion models. Ar Xiv, abs/2306.00354, 2023. URL https://api.semanticscholar.org/Corpus ID:258999775. Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=j Qj-_ r LVXsj. Xizewen Han, Huangjie Zheng, and Mingyuan Zhou. CARD: Classification and regression diffusion models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum? id=4L2z YEJ9d_. Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7407 7417, 2023. URL https: //api.semanticscholar.org/Corpus ID:257557255. William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Dietrich Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=0RTJcuv Ht Iu. Zhengfu He, Tianxiang Sun, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. Diffusionbert: Improving generative masked language models with diffusion models. ar Xiv preprint ar Xiv:2211.15029, 2022. Falk Heuer, Sven Mantowsky, Syed Saqib Bukhari, and Georg Schneider. Multitask-centernet (mcn): Efficient and diverse multitask learning using an anchor free approach. 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 997 1005, 2021. URL https://api.semanticscholar.org/Corpus ID:236976352. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. ar Xiv preprint ar Xiv:2207.12598, 2022. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. 
Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6840 6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/ file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf. Jonathan Ho, Tim Salimans, Alexey A. Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022. URL https://openreview.net/forum?id=BBel R2Nd DZ5. Published as a conference paper at ICLR 2024 Yi Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wen Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17853 17862, 2022. URL https://api.semanticscholar. org/Corpus ID:257687420. Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. Ar Xiv, abs/2212.04089, 2022. URL https://api.semanticscholar.org/Corpus ID:254408495. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017. Junguang Jiang, Baixu Chen, Junwei Pan, Ximei Wang, Liu Dapeng, Jie Jiang, and Mingsheng Long. Forkmerge: Overcoming negative transfer in multi-task learning. Ar Xiv, abs/2301.12618, 2023. URL https://api.semanticscholar.org/Corpus ID:256390285. Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. ar Xiv preprint ar Xiv:2210.09276, 2022. Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=a-x FK8Ymz5J. Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion models already have a semantic latent space. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=pd1P2e UBVfq. Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and James Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (proceedings of SIGGRAPH), 33(4), 2014. Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Neur IPS, 2021. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022a. Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. Diffusion LM improves controllable text generation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/forum?id=3s9Ir Esj Lyk. Haohe Liu, Zehua Chen, Yiitan Yuan, Xinhao Mei, Xubo Liu, Danilo P. Mandic, Wenwu Wang, and Mark D . Plumbley. Audioldm: Text-to-audio generation with latent diffusion models. Ar Xiv, abs/2301.12503, 2023. 
Yang Liu, Zhaowen Wang, Hailin Jin, and Ian James Wassell. Multi-task adversarial network for disentangled feature learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3743 3751, 2018. URL https://api.semanticscholar.org/Corpus ID: 52203209. Alexander Long, Wei Yin, Thalaiyasingam Ajanthan, Vu Nguyen, Pulak Purkait, Ravi Garg, Alan Blair, Chunhua Shen, and Anton van den Hengel. Retrieval augmented classification for long-tail visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6959 6969, 2022. Published as a conference paper at ICLR 2024 Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461 11471, 2022. Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021a. Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021b. Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. Kevin P. Murphy. Probabilistic Machine Learning: An introduction. MIT Press, 2022. URL probml.ai. Open AI. Gpt-4 technical report. Ar Xiv, abs/2303.08774, 2023. Byeongjun Park, Sangmin Woo, Hyojun Go, Jin-Young Kim, and Changick Kim. Denoising task routing for diffusion models. Ar Xiv, abs/2310.07138, 2023. URL https://api. semanticscholar.org/Corpus ID:263835004. William Peebles and Saining Xie. Scalable diffusion models with transformers. ar Xiv preprint ar Xiv:2212.09748, 2022. Hoang Phan, Ngoc Nguyen Tran, Trung Le, Toan Tran, Nhat Ho, and Dinh Q. Phung. Stochastic multiple target sampling gradient descent. Ar Xiv, abs/2206.01934, 2022. URL https://api. semanticscholar.org/Corpus ID:249395631. Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ar Xiv preprint ar Xiv:2102.12092, 2021. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical textconditional image generation with clip latents. Ar Xiv, abs/2204.06125, 2022. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. Highresolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684 10695, June 2022. 
Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. Latent multi-task architecture learning. In AAAI Conference on Artificial Intelligence, 2017. URL https://api. semanticscholar.org/Corpus ID:115985550. Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. Ar Xiv, abs/2208.12242, 2022a. Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. ar Xiv preprint ar Xiv:2208.12242, 2022b. Published as a conference paper at ICLR 2024 Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. Ar Xiv, abs/2205.11487, 2022. Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models. Ar Xiv, abs/2210.08402, 2022. Dmitry Senushkin, Nikolay Patakin, Arseny Kuznetsov, and Anton Konushin. Independent component alignment for multi-task learning. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20083 20093, 2023. URL https://api.semanticscholar. org/Corpus ID:260098848. Mohit Sharma, Claudio Fantacci, Yuxiang Zhou, Skanda Koppula, Nicolas Manfred Otto Heess, Jonathan Scholz, and Yusuf Aytar. Lossless adaptation of pretrained vision models for robotic manipulation. Ar Xiv, abs/2304.06600, 2023. URL https://api.semanticscholar.org/ Corpus ID:258108073. Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-tovideo generation without text-video data. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=n Jfyl Dvgzlq. Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2256 2265, Lille, France, 07 09 Jul 2015. PMLR. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/ forum?id=St1giar CHLP. Malik Tiomoko, Hafiz Tiomoko Ali, and Romain Couillet. Deciphering and optimizing multi-task learning: a random matrix approach. In International Conference on Learning Representations, 2021. URL https://api.semanticscholar.org/Corpus ID:235613471. Nilesh Tripuraneni, Michael I. Jordan, and Chi Jin. On the theory of transfer learning: The importance of task diversity. Ar Xiv, abs/2006.11650, 2020. URL https://api.semanticscholar. org/Corpus ID:219966003. Haoxiang Wang, Han Zhao, and Bo Li. Bridging multi-task learning and meta-learning: Towards efficient training and effective adaptation. Ar Xiv, abs/2106.09017, 2021. 
URL https://api.semanticscholar.org/CorpusID:235446399.

Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. InternImage: Exploring large-scale vision foundation models with deformable convolutions. arXiv preprint arXiv:2211.05778, 2022.

Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Loddon Yuille, and Christoph Feichtenhofer. Diffusion models as masked autoencoders. arXiv, abs/2304.03283, 2023.

Sen Wu, Hongyang Zhang, and Christopher Ré. Understanding and improving information transfer in multi-task learning. arXiv, abs/2005.00944, 2020. URL https://api.semanticscholar.org/CorpusID:213745217.

Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile diffusion: Text, images and variations all in one diffusion model. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7720-7731, 2022. URL https://api.semanticscholar.org/CorpusID:253523371.

Hanrong Ye and Dan Xu. TaskPrompter: Spatial-channel multi-task prompting for dense scene understanding. 2023. URL https://api.semanticscholar.org/CorpusID:259905659.

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Benton C. Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. arXiv, abs/2206.10789, 2022.

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

Zijian Zhang, Zhou Zhao, and Zhijie Lin. Unsupervised representation learning from pre-trained diffusion probabilistic models. arXiv preprint arXiv:2212.12990, 2022.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.

A ELBO OF THE PROPOSED MULTI-TASK DIFFUSION MODEL: THEOREM 1

We give a detailed derivation of our multi-task diffusion ELBO in the following:

$$
\begin{aligned}
\mathcal{L} &= \mathbb{E}_q\Bigg[-\log p(z_T) - \sum_{t\geq 1}\sum_{i=1}^N \log\frac{1}{q_i(x_i)} - \sum_{t\geq 1}\log\frac{p_\theta(z_{t-1}\mid z_t)\prod_{i=1}^N p_\theta(x_i\mid z_t)}{q(z_t\mid z_{t-1}, X)}\Bigg]\\
&= \mathbb{E}_q\Bigg[-\log p(z_T) - \sum_{t\geq 1}\sum_{i=1}^N \log\frac{1}{q_i(x_i)} - \sum_{t>1}\log\left(\frac{p_\theta(z_{t-1}\mid z_t)\prod_{i} p_\theta(x_i\mid z_t)}{q(z_{t-1}\mid z_t, z_0, X)}\cdot\frac{q(z_{t-1}\mid z_0, X)}{q(z_t\mid z_0, X)}\right) - \log\frac{p_\theta(z_0\mid z_1)\prod_{i} p_\theta(x_i\mid z_1)}{q(z_1\mid z_0, X)}\Bigg]\\
&= \mathbb{E}_q\Bigg[\log\frac{q(z_T\mid z_0, X)}{p(z_T)} + \sum_{t>1}\log\frac{q(z_{t-1}\mid z_t, z_0, X)}{p_\theta(z_{t-1}\mid z_t)} + \sum_{t\geq 1}\sum_{i=1}^N \log\frac{q_i(x_i)}{p_\theta(x_i\mid z_t)} - \log p_\theta(z_0\mid z_1)\Bigg],
\end{aligned}
$$

where we use the fact that $q(z_t \mid z_{t-1}, x) = q(z_t \mid z_{t-1}, x, z_0) = \frac{q(z_{t-1} \mid z_t, z_0, x)\, q(z_t \mid z_0, x)}{q(z_{t-1} \mid z_0, x)}$ in the second equality. Taking expectations of the terms in the last line yields the KL divergences $L_0$, $L_1$ and $L_2$ in Theorem 1, with $L_3$ given by the final term.

B CALCULATING THE POSTERIOR DISTRIBUTIONS: THEOREM 2

In our derivation, we will frequently use the following well-known property of Gaussian random variables. We first present a lemma on combining independent Gaussian random variables, based on which we derive the posterior distribution of our forward process.

Lemma 3. Let $\epsilon_1 \sim \mathcal{N}(\mu_1, \sigma_1^2 I)$ and $\epsilon_2 \sim \mathcal{N}(\mu_2, \sigma_2^2 I)$ be independent. Then, for $a \geq 0$ and $b \geq 0$, the random variable $\epsilon \triangleq a\epsilon_1 + b\epsilon_2$ follows $\epsilon \sim \mathcal{N}\big(a\mu_1 + b\mu_2,\; (a^2\sigma_1^2 + b^2\sigma_2^2) I\big)$.

Note that in the forward aggregation, $z_{t-1} + \sum_{i=1}^N w_t^{(i)} E_i(x_i) = z_{t-1} + w_t^{(1)} \sum_{i=1}^N \frac{w_t^{(i)}}{w_t^{(1)}} E_i(x_i) \triangleq z_{t-1} + w_t^{(1)} E(X)$, where we define $E(X) \triangleq \sum_{i=1}^N \frac{w_t^{(i)}}{w_t^{(1)}} E_i(x_i)$.
The aggregation above is equivalent to considering only one extra task with modality-data embedding $E(\mathbf{X})$; that is, it suffices to consider two tasks in the proof. Consequently, in the following we derive the forward posterior distribution with one additional task and drop the task index $i$ in $\mathbf{x}$. Generalizing to $N$ tasks is straightforward.

To derive the marginal distribution $q(\mathbf{z}_t\mid \mathbf{z}_0,\mathbf{x})$, let $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\boldsymbol{\epsilon}; \mathbf{0}, \mathbf{I})$ for all $t$. For the forward process, we have
$$
\begin{aligned}
\mathbf{z}_t &= \sqrt{\alpha_t}\,(\mathbf{z}_{t-1} + w_t E(\mathbf{x})) + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_t \\
&= \sqrt{\alpha_t}\!\left(\sqrt{\alpha_{t-1}}\,(\mathbf{z}_{t-2} + w_{t-1}E(\mathbf{x})) + \sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-1}\right) + \sqrt{\alpha_t}\,w_t E(\mathbf{x}) + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_t \\
&= \sqrt{\alpha_t\alpha_{t-1}}\,\mathbf{z}_{t-2} + \left(w_t\sqrt{\alpha_t} + w_{t-1}\sqrt{\alpha_t\alpha_{t-1}}\right)E(\mathbf{x}) + \sqrt{1-\alpha_t\alpha_{t-1}}\,\boldsymbol{\epsilon} \\
&= \sqrt{\alpha_t\alpha_{t-1}}\!\left(\sqrt{\alpha_{t-2}}\,(\mathbf{z}_{t-3} + w_{t-2}E(\mathbf{x})) + \sqrt{1-\alpha_{t-2}}\,\boldsymbol{\epsilon}_{t-2}\right) + \left(w_t\sqrt{\alpha_t} + w_{t-1}\sqrt{\alpha_t\alpha_{t-1}}\right)E(\mathbf{x}) + \sqrt{1-\alpha_t\alpha_{t-1}}\,\boldsymbol{\epsilon} \\
&= \sqrt{\alpha_t\alpha_{t-1}\alpha_{t-2}}\,\mathbf{z}_{t-3} + \left(w_t\sqrt{\alpha_t} + w_{t-1}\sqrt{\alpha_t\alpha_{t-1}} + w_{t-2}\sqrt{\alpha_t\alpha_{t-1}\alpha_{t-2}}\right)E(\mathbf{x}) + \sqrt{1-\alpha_t\alpha_{t-1}\alpha_{t-2}}\,\boldsymbol{\epsilon} \\
&= \cdots,
\end{aligned}
$$
where we apply Lemma 3 in the third equation to consolidate the two random Gaussian variables $\boldsymbol{\epsilon}_{t-1}$ and $\boldsymbol{\epsilon}_t$ into $\boldsymbol{\epsilon}$. Let $\bar{\alpha}_t \triangleq \prod_{i=1}^{t}\alpha_i$, $\tilde{\alpha}_t \triangleq \sum_{i=1}^{t} w_i\sqrt{\prod_{j=i}^{t}\alpha_j}$, and define $\tilde{\alpha}_0 = 0$. We have $\tilde{\alpha}_t = \sqrt{\alpha_t}\,(w_t + \tilde{\alpha}_{t-1})$, and
$$q(\mathbf{z}_t\mid \mathbf{z}_0,\mathbf{x}) = \mathcal{N}\!\left(\mathbf{z}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \tilde{\alpha}_t E(\mathbf{x}),\ (1-\bar{\alpha}_t)\mathbf{I}\right). \tag{7}$$
As a special case, if we define the forward transition distribution by letting $w_t = 1$, we have
$$\tilde{\alpha}_t = \sqrt{\alpha_t}\,(1 + \tilde{\alpha}_{t-1}). \tag{8}$$

Lemma 4 (Murphy (2022)). Define the following distributions for the prior and likelihood:
$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}), \qquad p(\mathbf{y}\mid \mathbf{x}) = \mathcal{N}(\mathbf{y}; \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1}).$$
Let $\boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^\top\mathbf{L}\mathbf{A})^{-1}$. Then the posterior follows
$$p(\mathbf{x}\mid \mathbf{y}) = \mathcal{N}\!\left(\mathbf{x};\ \boldsymbol{\Sigma}\left(\mathbf{A}^\top\mathbf{L}(\mathbf{y}-\mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\right),\ \boldsymbol{\Sigma}\right).$$

In our case in equation 7, we have $\boldsymbol{\mu} \triangleq \sqrt{\bar{\alpha}_{t-1}}\,\mathbf{z}_0 + \tilde{\alpha}_{t-1}E(\mathbf{x})$, $\boldsymbol{\Lambda}^{-1} \triangleq (1-\bar{\alpha}_{t-1})\mathbf{I}$, $\mathbf{A} \triangleq \sqrt{\alpha_t}\,\mathbf{I}$, $\mathbf{b} \triangleq \sqrt{\alpha_t}\,w_t E(\mathbf{x})$, $\mathbf{L}^{-1} \triangleq (1-\alpha_t)\mathbf{I}$, and $\boldsymbol{\Sigma} = \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{I}$. Thus, the posterior is $q(\mathbf{z}_{t-1}\mid \mathbf{z}_t,\mathbf{z}_0,\mathbf{x}) = \mathcal{N}(\mathbf{z}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{z}_t,\mathbf{z}_0,\mathbf{x}), \tilde{\beta}_t\mathbf{I})$, where
$$
\begin{aligned}
\tilde{\boldsymbol{\mu}}_t(\mathbf{z}_t,\mathbf{z}_0,\mathbf{x}) &= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})\,\mathbf{z}_t + (1-\alpha_t)\sqrt{\bar{\alpha}_{t-1}}\,\mathbf{z}_0 + \left((1-\alpha_t)\tilde{\alpha}_{t-1} - \alpha_t(1-\bar{\alpha}_{t-1})w_t\right)E(\mathbf{x})}{1-\bar{\alpha}_t} \\
&= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})\,\mathbf{z}_t + (1-\alpha_t)\sqrt{\bar{\alpha}_{t-1}}\,\mathbf{z}_0 + \left((1-\alpha_t)\tilde{\alpha}_t/\sqrt{\alpha_t} - (1-\bar{\alpha}_t)w_t\right)E(\mathbf{x})}{1-\bar{\alpha}_t},
\end{aligned} \tag{9}
$$
and $\tilde{\beta}_t = \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}$. From the marginal distribution $q(\mathbf{z}_t\mid \mathbf{z}_0,\mathbf{x})$, we have
$$\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \tilde{\alpha}_t E(\mathbf{x}) + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}. \tag{10}$$
Substituting equation 10 into equation 9, we have
$$
\begin{aligned}
\tilde{\boldsymbol{\mu}}_t(\mathbf{z}_t,\mathbf{x},t) &= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})\,\mathbf{z}_t + (1-\alpha_t)\sqrt{\bar{\alpha}_{t-1}}\cdot\frac{1}{\sqrt{\bar{\alpha}_t}}\!\left(\mathbf{z}_t - \tilde{\alpha}_t E(\mathbf{x}) - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}\right) + \left((1-\alpha_t)\tilde{\alpha}_t/\sqrt{\alpha_t} - (1-\bar{\alpha}_t)w_t\right)E(\mathbf{x})}{1-\bar{\alpha}_t} \\
&= \frac{\left(\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1}) + (1-\alpha_t)/\sqrt{\alpha_t}\right)\mathbf{z}_t - (1-\bar{\alpha}_t)w_t E(\mathbf{x}) - \frac{(1-\alpha_t)\sqrt{1-\bar{\alpha}_t}}{\sqrt{\alpha_t}}\,\boldsymbol{\epsilon}}{1-\bar{\alpha}_t} \\
&= \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{z}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}\right) - w_t E(\mathbf{x}).
\end{aligned}
$$
Thus, similar to the standard DDPM, we can parameterize the reverse denoising process with a neural network that predicts the added noise $\boldsymbol{\epsilon}$, except that in our case the network $\boldsymbol{\epsilon}_\theta$ takes $(\mathbf{z}_t, \mathbf{x}, t)$ as input, i.e., $\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, \mathbf{x}, t) = \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \tilde{\alpha}_t E(\mathbf{x}) + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\ \mathbf{x},\ t) \approx \boldsymbol{\epsilon}$.

C INFERENCE

The inference algorithm is shown in Algorithm 1. If not explicitly stated, the number of timesteps is set to T = 1000. A Python-style sketch of this sampling loop is given after the algorithm.

Algorithm 1 MT-Diffusion Inference
1: z_T ~ N(0, I)
2: if modality data x_i (for all i) not initially available then
3:   Randomly initialize modality data x_i
4: end if
5: for t = T, ..., 1 do
6:   eps ~ N(0, I) if t > 1, else eps = 0
7:   e_i = E_i(x_i), for all i    ▷ Get modality data encoding
8:   eps_theta(z_t, t), X~ = U-Net(z_t, t)    ▷ Get estimated noise and predicted modality data
9:   if x_i (for all i) not initially available then
10:    Update x_i from the new X~, for all i
11:  end if
12:  z_{t-1} = (1/sqrt(alpha_t)) (z_t - ((1 - alpha_t)/sqrt(1 - bar_alpha_t)) eps_theta(z_t, x, t)) - sum_{i=1}^N w_t^{(i)} e_i + sqrt(beta_tilde_t) eps    ▷ Update diffusion latent
13: end for
14: return z_0, X~
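As a companion to Algorithm 1, here is a minimal NumPy sketch of the sampling loop for a single extra modality (our own illustration, not the authors' code; the U-Net stub, encoder stub, noise schedule, and weight schedule are stand-in assumptions, and shapes are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 1000, 16                               # timesteps and latent dimension (illustrative)
alphas = 1.0 - np.linspace(1e-4, 0.02, T)     # a simple linear schedule (assumption)
alpha_bars = np.cumprod(alphas)
w = np.linspace(0.0, 1.0, T)                  # per-step modality weight w_t (stand-in)

def unet(z, t):
    """Stub for the shared backbone: returns (predicted noise, predicted modality data)."""
    return 0.0 * z, 0.0 * z                   # placeholder outputs

def encode(x):
    """Stub modality encoder E(.)."""
    return 0.1 * x

z = rng.normal(size=d)                        # z_T ~ N(0, I)
x = rng.normal(size=d)                        # randomly initialized modality data

for t in range(T - 1, -1, -1):
    eps = rng.normal(size=d) if t > 0 else np.zeros(d)
    e = encode(x)                                         # modality encoding e = E(x)
    eps_hat, x_pred = unet(z, t)                          # reverse-model predictions
    x = x_pred                                            # update modality data if not given
    a_t, ab_t = alphas[t], alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    beta_tilde = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)   # posterior variance
    # Algorithm 1, line 12: posterior mean minus the weighted modality encoding, plus noise
    z = (z - (1 - a_t) / np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(a_t) \
        - w[t] * e + np.sqrt(beta_tilde) * eps

print(z.shape, x.shape)
```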
D RELATED WORKS

Connections to Classifier and Classifier-Free Guidance. Guided diffusion models aim to leverage prior knowledge from various forms of guidance information for better controllable generation. For example, the classifier-guidance method uses the gradient of a pretrained classifier to perturb the reverse process so as to generate from a class-conditional distribution (Dhariwal & Nichol, 2021). Classifier-free guidance instead learns a guidance model using the same generation network as the diffusion model (Ho & Salimans, 2022). Works that utilize external data, such as retrieval-augmented methods (Blattmann et al., 2022; Long et al., 2022), can also be considered a special type of guidance. Although the final formulation has some connections with our method (see Appendix D), guided diffusion models essentially handle only a single generation task. Our method, on the other hand, can model multiple tasks within a unified diffusion model.

Our MT-Diffusion formulation has a close connection with the classifier-guidance and classifier-free-guidance mechanisms. Specifically, from equation 5, if one defines the encoder E(·) as the gradient of a pretrained classifier, the posterior mean recovers that of classifier guidance. By contrast, if one defines the encoder with the reverse U-Net, the posterior-mean calculation in equation 5 recovers the classifier-free-guidance mechanism. An important difference, however, lies in the forward process: our framework is designed to aggregate information from different encoders for multi-task learning, whereas neither classifier guidance nor classifier-free guidance does so. Overall, our method constitutes a broader framework that can be applied to different scenarios, including the image transition, masked-image pretraining, and joint image-label and image-representation generation settings investigated in the experiments. (A schematic sketch of this correspondence is given at the end of this section.)

Multi-Task Learning. Multi-task learning (MTL) is a machine-learning paradigm in which a model is trained to perform multiple tasks simultaneously, with the idea that knowledge gained from one task can improve performance on related tasks. Recent development has mainly focused on multi-task learning for predictive rather than generative models. Apart from work investigating the theory of multi-task learning (Wang et al., 2021; Tiomoko et al., 2021; Tripuraneni et al., 2020; Wu et al., 2020), many existing works explore different techniques to boost model performance with multi-task learning, including but not limited to architecture design (Heuer et al., 2021; Ruder et al., 2017; Ye & Xu, 2023; Sharma et al., 2023; Chen et al., 2022), optimization algorithms (Senushkin et al., 2023; Fernando et al., 2023; Jiang et al., 2023; Phan et al., 2022), and task-relationship learning (Hu et al., 2022; Ilharco et al., 2022). Recent research interest has also expanded to applying multi-task learning in generative models (Bao et al., 2022; Liu et al., 2018); however, those generative models are limited to more traditional models such as VAEs and GANs, and there is limited work studying multi-task learning for diffusion models. Our work represents one of the first attempts to integrate multi-modal generation with multi-task learning in a diffusion model, aiming to further improve generative performance and expand the scope of state-of-the-art diffusion models.

More Recent Development on Diffusion Models. There has been some recent effort toward multi-task diffusion models. For example, Versatile Diffusion (Xu et al., 2022) focuses on developing new neural architectures that let different tasks interact with each other within the single-task diffusion framework. This differs from our work in that ours not only introduces a novel neural architecture but also generalizes single-task diffusion in terms of the loss function. We believe the Versatile Diffusion architecture can be incorporated into our framework for more flexible modeling. Previous efforts have also focused on efficient training of diffusion models, e.g., P2-Weighting (Choi et al., 2022), Min-SNR (Hang et al., 2023), ANT (Go et al., 2023), and Task Routing (Park et al., 2023). Our model is orthogonal to these works, with no technical overlap; consequently, we believe there is room to incorporate these techniques into our framework for further improvement, an interesting future direction to explore.
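The following is a schematic NumPy sketch (our own illustration, not code from the paper) contrasting a classifier-guided reverse-mean update (Dhariwal & Nichol, 2021) with the MT-Diffusion posterior mean derived in Appendix B, in which the weighted encoder term plays the role of the guidance perturbation. The guidance scale, classifier stub, and sign convention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
z_t = rng.normal(size=d)
eps_hat = rng.normal(size=d)                 # stand-in for eps_theta(z_t, t)
a_t, ab_t = 0.98, 0.5                        # alpha_t, alpha_bar_t (illustrative values)

def ddpm_mean(z, eps, a, ab):
    """Standard DDPM posterior mean given a noise prediction."""
    return (z - (1 - a) / np.sqrt(1 - ab) * eps) / np.sqrt(a)

# (a) Classifier guidance: shift the base mean along a classifier gradient.
def classifier_grad(z):                      # stub for grad_z log p(y | z)
    return -0.05 * z

scale = 1.0                                  # guidance scale (assumption)
var_t = 1 - a_t                              # variance used to scale the gradient (assumption)
mu_guided = ddpm_mean(z_t, eps_hat, a_t, ab_t) + scale * var_t * classifier_grad(z_t)

# (b) MT-Diffusion: the same base mean plus a weighted modality-encoder term (Appendix B).
def encoder(x):                              # stub modality encoder E(.)
    return 0.1 * x

w_t = 0.3
x = rng.normal(size=d)
mu_mt = ddpm_mean(z_t, eps_hat, a_t, ab_t) - w_t * encoder(x)

# If E(.) is chosen as (a scaled, sign-adjusted) classifier gradient, (b) has the same form as (a).
print(mu_guided.shape, mu_mt.shape)
```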
E DETAILED EXPERIMENTAL SETTINGS AND EXTRA RESULTS

In addition to evaluating on some other datasets described in the specific tasks, we mainly rely on the ImageNet-1K dataset (Deng et al., 2009) at resolutions of 64×64 and 128×128, where we adopt the pre-defined training and validation splits. Unless explicitly specified, all experiments are conducted on an A100 GPU server consisting of 8 GPUs, with a batch size of 64. When evaluating generation quality, we follow and adopt the popular Inception Score (IS), FID, sFID, Precision, and Recall metrics (dif), calculated on 10K or 50K samples; the former is used for computational efficiency and the latter for comparison with existing results. We note that, due to our different hyperparameter settings (specified in the Appendix), some of our results are not directly comparable to some results reported in previous works. For fair comparisons, we rerun some of the baselines under settings consistent with our method. One additional hyperparameter of our model is the set of task weights in equation 3, which we set to w_t^{(i)} = t/(1000 - t) to mitigate the potentially negative influence of heterogeneous tasks on generated image quality when t is small. We follow most of the parameter settings in the codebases. The training procedure is summarized in Algorithm 2; a schematic training step in code is sketched right after it.

Algorithm 2 MT-Diffusion Training
1: repeat
2:   z_0 ~ q(z_0), X ~ q(X)    ▷ Sample z_0 and modality data X
3:   t ~ Scheduler(1, ..., T)
4:   {e_i} = E(X)    ▷ Get modality data encoding
5:   eps ~ N(0, I)
6:   z_t ~ q(z_t | z_0, X)    ▷ Forward aggregation via equation 4
7:   e_i = E_i(x_i), for all i    ▷ Get modality data encoding
8:   eps_theta(z_t, t), X~ = U-Net(z_t, t)    ▷ Noise and modality data prediction via the reverse model
9:   Take a gradient-descent step based on the loss in equation 6
10: until converged
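As referenced above, here is a minimal PyTorch-style sketch of one training step of Algorithm 2 (our own illustration under assumed shapes, stub modules, and a toy weight schedule; the actual loss in equation 6 combines the denoising term with modality-specific decoder losses, which are represented here by simple placeholders).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, d, T = 8, 16, 1000                      # batch size, latent dim, timesteps (illustrative)
alphas = 1.0 - torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(alphas, dim=0)

# Stub modules standing in for the shared U-Net backbone and a modality decoder head.
backbone = torch.nn.Linear(d, d)           # predicts noise from z_t (toy stand-in)
mod_head = torch.nn.Linear(d, d)           # modality decoder head (toy stand-in)
encoder = lambda x: 0.1 * x                # modality encoder E(.)

opt = torch.optim.Adam(list(backbone.parameters()) + list(mod_head.parameters()), lr=1e-4)

# One training step (Algorithm 2, lines 2-9):
z0 = torch.randn(B, d)                     # z_0 ~ q(z_0)
x = torch.randn(B, d)                      # modality data X ~ q(X)
t = torch.randint(0, T, (B,))              # sampled timesteps
eps = torch.randn(B, d)

ab = alpha_bars[t].unsqueeze(1)            # alpha_bar_t per sample
w = (t.float() / T).unsqueeze(1)           # toy modality weight schedule (assumption)
# Schematic forward aggregation: z_t = sqrt(ab)*z0 + w*E(x) + sqrt(1-ab)*eps
zt = ab.sqrt() * z0 + w * encoder(x) + (1 - ab).sqrt() * eps

eps_hat = backbone(zt)                     # noise prediction
x_hat = mod_head(zt)                       # modality prediction
loss = F.mse_loss(eps_hat, eps) + F.mse_loss(x_hat, x)   # placeholder multi-task loss

opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```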
E.1 EXPERIMENT SETTINGS FOR IMAGE TRANSLATION WITH MT-DIFFUSION

For the Cityscapes dataset, the modality data corresponds to the semantic segmentation maps; for the night2day dataset, the modality data corresponds to daytime images. We adopt the latent-diffusion codebase from ldm and use the provided checkpoint of the VQ-VAE encoder-decoder (kl-f8.pt). We use the VQ-VAE encoder as the encoder for modality data, and construct an additional output head as the decoder to generate the modality data by duplicating the original output block of the U-Net, which consists of a normalization layer, a SiLU layer, and a convolution layer. We use the default hyper-parameters for training the models on the two datasets, summarized as:

Attention resolutions: (32, 16, 9)
Diffusion steps: 1000
Learn sigma: False
Noise schedule: Linear
#channels: 320
#heads: 8
#res blocks: 2
Resblock updown: False
Use scale shift norm: False
Learning rate: 1.0e-4
Batch size: 32

Table 4: Classification performance on the Cityscapes dataset.
Model | Per-pixel acc. | Per-class acc. | Class IOU
CycleGAN (Zhu et al., 2017) | 0.58 | 0.22 | 0.16
pix2pix (Isola et al., 2017) | 0.85 | 0.40 | 0.32
InternImage-H (Wang et al., 2022) | - | - | 0.86
Single-task diffusion | 0.72 | 0.54 | 0.32
MT-Diffusion | 0.95 | 0.85 | 0.70

Task description and encoder-decoder design. This is a more modality-homologous setting. We adopt two standard datasets: the Cityscapes dataset for semantic-label-to-photo translation (Cordts et al., 2016) and the night2day dataset for night-to-day photo translation (Laffont et al., 2014). We adopt the public codebase of the latent diffusion model (LDM) (ldm). For the translation problem, the task data (original and translated images) lie in the same data space, so we do not need to define separate encoders E_i(·) for the modality data; instead, we use the same pretrained image encoder in LDM to map all images to the diffusion latent space. We add another head at the end of the U-Net as the decoder for the target translated images.

Results. We perform image translation by generating target images conditioned on source images based on Algorithm 1. Some example generated images are illustrated in Figure 4. For quantitative evaluation, we follow Zhu et al. (2017) and measure performance in terms of per-pixel accuracy, per-class accuracy, and class IOU, comparing with existing methods (Zhu et al., 2017; Isola et al., 2017; Wang et al., 2022). The results are shown in Table 4. It is clear that our model obtains the best accuracy among the baselines, except for the state-of-the-art InternImage-H model, which is a much larger image foundation model pretrained on web-scale data and thus not directly comparable. We also calculate the IS and FID scores on the night2day dataset; note that prior work typically did not report these scores. We obtain an FID of 37.93 and an IS of 3.94, where the IS is even slightly better than that of the ground-truth data (3.65). As a comparison, the FID and IS with a single-task diffusion model are 40.73 and 3.84, respectively.

E.2 MT-DIFFUSION FOR MASKED-IMAGE TRAINING

In this task, the modality data is a randomly masked version of the original images. To create a randomly masked image, we randomly sample a coordinate (x, y) within the image and mask out the patch [x : min(x + 16, 64), y : min(y + 16, 64)] by setting the corresponding pixel values to zero. We repeat this process m times to control the ratio of masked regions (a small sketch of this masking procedure is given at the end of this subsection). We adopt the latent diffusion codebase from dif. We simply define the encoder as the identity map, and define the decoder for the masked images by replicating the output block of the original U-Net, similar to the image-translation experiment above. We adopt the default hyper-parameters for training the models, if not specified below:

Diffusion steps: 1000
Rescale learned sigmas: False
Rescale timesteps: False
Noise schedule: cosine
#channels: 192
#res blocks: 3
Learning rate: 7.0e-5
Batch size: 80

Figure 7: Random samples from scratch with MT-Diffusion by masked-image training on ImageNet-128×128.

More random samples from scratch, as well as image-restoration results for both random masking and half masking, are illustrated in Figures 7, 8, 9, and 10.
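As referenced in the masking description above, here is a minimal NumPy sketch of the random-masking procedure (our own illustration; the patch size of 16, image size of 64, and the number of repetitions m follow the description, while everything else is an assumption).

```python
import numpy as np

def random_mask(img, m, patch=16, rng=None):
    """Zero out m randomly placed patch x patch regions of an HxWxC image copy."""
    rng = rng or np.random.default_rng()
    out = img.copy()
    h, w = out.shape[:2]
    for _ in range(m):
        x, y = rng.integers(0, h), rng.integers(0, w)      # top-left corner inside the image
        out[x:min(x + patch, h), y:min(y + patch, w)] = 0  # mask out the patch
    return out

img = np.random.rand(64, 64, 3)          # a dummy 64x64 RGB image
masked = random_mask(img, m=4, rng=np.random.default_rng(0))
print((masked == 0).mean())              # fraction of masked pixels
```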
E.3 MT-DIFFUSION FOR JOINT IMAGE-LABEL GENERATION MODELING

In this task, the modality data are discrete labels. We use the original U-Net as the encoder, which takes a noisy image, a label, and a timestep as input. For simplicity, we set the noisy image and the timestep to zeros, although we believe better results could be obtained by jointly encoding this information. For the decoders, we propose two options: one branches off the middle block of the U-Net and the other off the output block, as described in the main text. For the middle-block option, we simply add one fully connected layer as the decoder; for the output-block option, we adopt the pre-defined classifier from the codebase dif as the decoder. We adopt the default hyper-parameters for training the models, if not specified below. First, for MT-Diffusion-M:

Diffusion steps: 1000
Rescale learned sigmas: False
Rescale timesteps: False
Noise schedule: cosine
#channels: 192
#res blocks: 3
Learning rate: 7.0e-5
Batch size: 75

The setting for MT-Diffusion-E is the same as for MT-Diffusion-M, except for some extra hyperparameters for the pre-defined classifier:

Classifier_attention_resolutions: (32, 16, 8)
Classifier_depth: 4
Classifier_pool: attention
Classifier_resblock_updown: True
Classifier_use_scale_shift_norm: True

E.4 MT-DIFFUSION FOR JOINT IMAGE-REPRESENTATION GENERATION MODELING

Task description and encoder-decoder design. Finally, we apply MT-Diffusion to joint image-representation generation. The setting is similar to the image-label generation setting in Section 4.2, replacing the label data with image representations from the CLIP model (Radford et al., 2021). Similarly, we use the original U-Net as the encoder for image representations via the cross-attention mechanism. For the decoder, we append a two-layer MLP to the output of the middle block of the U-Net, which is expected to output image representations. The MLP projects the middle-block tensor to dimension 1024, applies a ReLU layer, and finally outputs a 1024-dimensional tensor (a small sketch of this decoder head is given at the end of this subsection). For MT-Diffusion for joint image-representation generation:

Diffusion steps: 1000
Learning rate: 1.0e-5
Batch size: 2048

Results. We conduct large-scale experiments based on the pretrained stable diffusion model (Rombach et al., 2022), continuing to finetune the model on the LAION dataset (Schuhmann et al., 2022) with our MT-Diffusion. We adopt the default hyperparameter settings of the codebase. Due to the large scale of this setting, it is challenging to make fair quantitative comparisons with related methods; thus, we only show some generated examples from our method and leave more extensive comparisons as future work. Some randomly generated examples are shown in Figures 11 and 12, demonstrating impressive generation quality. We also provide visual comparisons between our MT-Diffusion and the stable diffusion baseline in Figures 13 and 14. From the generated images, it appears that our method can understand the semantic meaning of the images and generate better-looking images.

https://huggingface.co/stabilityai/stable-diffusion-2
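As referenced in the task description above, here is a minimal PyTorch sketch of the two-layer MLP decoder head for image representations (our own illustration; only the 1024-dimensional output and the ReLU follow the description, while the middle-block feature dimension and the pooling are assumptions).

```python
import torch
import torch.nn as nn

class RepresentationHead(nn.Module):
    """Two-layer MLP mapping pooled U-Net middle-block features to a 1024-d representation."""
    def __init__(self, mid_dim=1280, rep_dim=1024):    # mid_dim is an assumed feature size
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(mid_dim, rep_dim),
            nn.ReLU(),
            nn.Linear(rep_dim, rep_dim),
        )

    def forward(self, mid_feat):                       # mid_feat: (B, C, H, W)
        pooled = mid_feat.mean(dim=(2, 3))             # global-average pooling (assumption)
        return self.mlp(pooled)                        # (B, 1024)

head = RepresentationHead()
feat = torch.randn(4, 1280, 8, 8)                      # dummy middle-block features
print(head(feat).shape)                                # torch.Size([4, 1024])
```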
E.5 DISCUSSION

Computation Efficiency. Our multi-modal setting only adds small modality heads to decode back to the modality spaces. In our two-modality setting, the additional computational overhead is minimal, amounting to approximately 10% more training time per iteration than a pure single-task diffusion model. Importantly, our model remains significantly more efficient, in terms of both time and storage, than training two individual diffusion models for the two modalities separately.

Extra Experiments. During the rebuttal, we tried to design new experiments to demonstrate 1) that our model is better than a purely conditional model on two modalities, and 2) the negative-transfer phenomenon in our model. For 1), we compare our model with a purely conditional model that learns to recover images from randomly masked images. We run the experiments on the small CIFAR-10 dataset and observe that our model converges faster than the purely conditional baseline, while both converge to comparable final results in terms of both IS and FID. We emphasize, however, that our model can perform not only conditional generation but also joint generation over multiple modalities, and thus represents a more flexible generative framework. For 2), we plan to test our model on more tasks and modalities. Specifically, we plan to train our model on 5 tasks to simultaneously learn to generate original images, masked images, corresponding captions, random captions, and class labels. After spending significant effort on the implementation, we found that finishing this experiment requires substantially more work, and we currently lack the GPU resources; thus, unfortunately, we have to postpone this large-scale experiment. Nevertheless, we wish to point out that our current results already suggest that more similar tasks tend to yield more positive transfer. For example, compare the following two settings from Table 2: 1) simultaneously generating images and the corresponding masked images, and 2) simultaneously generating images and the corresponding labels. The former pair of tasks is closer, as both lie in the same data space, and from Table 2 we can clearly see that the former setting (Generation with Masked-Image Training, Section 4.1) outperforms the latter (Generation with Joint Image-Label Modeling, Section 4.2), indicating more positive transfer in the former setting.

Figure 8: Random samples from scratch with MT-Diffusion by masked-image training on ImageNet-64×64.

Figure 9: Random samples for image restoration from random masking on ImageNet-64×64. In each block, the three images are the original image, the masked image, and the restored image, respectively.

Figure 10: Random samples for image restoration from half masking on ImageNet-64×64. In each block, the first image is the masked image and the remaining three are different restored images.

Figure 11: Random samples for text-to-image generation finetuned on stable diffusion v2.
Figure 12: Random samples for text-to-image generation finetuned on stable diffusion v2.

Figure 13: Visual comparisons between our MT-Diffusion (left) and the stable diffusion baseline (right).

Figure 14: Visual comparisons between our MT-Diffusion (left) and the stable diffusion baseline (right).