# Semantic-Aware Data Augmentation for Text-to-Image Synthesis

Zhaorui Tan¹,², Xi Yang¹, Kaizhu Huang³*

¹Department of Intelligent Science, Xi'an Jiaotong-Liverpool University
²Department of Computer Science, University of Liverpool
³Data Science Research Center, Duke Kunshan University
Zhaorui.Tan21@student.xjtlu.edu.cn, Xi.Yang01@xjtlu.edu.cn, kaizhu.huang@dukekunshan.edu.cn

*Corresponding authors. Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract: Data augmentation has recently been leveraged as an effective regularizer in various vision-language deep neural networks. However, in text-to-image synthesis (T2Isyn), current augmentation wisdom still suffers from semantic mismatch between augmented paired data. Even worse, semantic collapse may occur when generated images are less semantically constrained. In this paper, we develop a novel Semantic-aware Data Augmentation (SADA) framework dedicated to T2Isyn. In particular, we propose to augment texts in the semantic space via an Implicit Textual Semantic Preserving Augmentation, in conjunction with a specifically designed Image Semantic Regularization Loss as Generated Image Semantic Conservation, to cope well with semantic mismatch and collapse. As one major contribution, we theoretically show that the Implicit Textual Semantic Preserving Augmentation certifies better text-image consistency, while the Image Semantic Regularization Loss, by regularizing the semantics of generated images, avoids semantic collapse and enhances image quality. Extensive experiments validate that SADA enhances text-image consistency and significantly improves image quality in T2Isyn models across various backbones. In particular, incorporating SADA during the tuning of Stable Diffusion models also yields performance improvements.

1 Introduction

Text-to-image synthesis (T2Isyn) is a mainstream task in the visual-language learning community that has yielded tremendous results. Image and text augmentations are two popular methods for regularizing visual-language models (Naveed 2021; Liu et al. 2020). As shown in Figure 2 (a), existing T2Isyn backbones (Xu et al. 2018; Tao et al. 2022; Wang et al. 2022) typically concatenate noise to textual embeddings as the primary text augmentation method (Reed et al. 2016), whilst employing only basic image augmentations (e.g., Crop, Flip) in the raw image space. Recent studies (Dong et al. 2017; Cheng et al. 2020) suggest that text augmentation is more critical and robust than image augmentation for T2Isyn, given that real texts and their augmentations are involved in the inference process.

Figure 1: (a) Current augmentations cause semantic mismatch and quality degradation in the T2Isyn task: examples generated on $e'_s$ of DF-GAN trained with different augmentations (DF-GAN baseline, +Mixup, +DiffAug, +Add Noise, +Random Mask, +SADA) for "a small bird with a red head, breast, and belly and black wings". (b)(c) Illustrations of semantic collapse (completely different or extremely similar outputs) wo/ GisC or w/ weak GisC ($L_{db}$). (d) Our method w/ strong GisC ($L_r$) prevents semantic collapse. See Supplementary Materials D for more.
Albeit their effectiveness, we argue that current popular augmentation methods exhibit two major limitations in the T2Isyn task. 1) Semantic mismatch exists between augmented texts/images and generated pairs; it triggers an accompanying disruption of the semantic distributions across both modalities, leaving augmented texts/images without corresponding visual/textual representations. As shown in Figure 1 (a), advanced image augmentations such as Mixup (Zhang et al. 2017a) and DiffAug (Zhao et al. 2020), along with text augmentations like Random Mask¹ or Add Noise², might weaken both the semantic and the visual supervision from real images. 2) Semantic collapse occurs in the generation process, i.e., when two slightly semantically distinct textual embeddings are given, the model may generate either completely different or extremely similar images. This indicates that the model may be underfitting or overfitting semantically (see Figure 1 (b)(c)). Both issues compromise semantic consistency and generation quality. While imposing semantic constraints on generated images can alleviate semantic collapse, the existing study (Wang et al. 2022) solely focuses on regulating the direction of the semantic shift, which may not be entirely adequate.

¹Randomly masking words in raw texts. ²Directly adding random noise to textual semantic embeddings.

Motivated by these findings, this paper proposes a novel Semantic-aware Data Augmentation (SADA) framework that offers semantic preservation of texts and images. SADA consists of an Implicit Textual Semantic Preserving Augmentation (ITA) and a Generated Image Semantic Conservation (GisC). ITA efficiently augments textual data and alleviates the semantic mismatch; GisC preserves the semantic distribution of generated images by constraining their semantic shifts. As one major contribution, we show that SADA can both certify better text-image consistency and avoid semantic collapse, with a theoretical guarantee.

Figure 2: $L(\theta, \cdot)$ is the optimization loss for $G$; $S(\theta, (\cdot,\cdot))$ measures semantic consistency. (a) Simplified training paradigm of previous methods. (b) Training paradigm of SADA. (c) Training of ITA_T, where generators are frozen.

Specifically, ITA preserves the semantics of augmented text by adding perturbations to semantic embeddings while constraining their distribution, without using extra models. It bypasses the risk of semantic mismatch and enforces corresponding visual representations for the augmented textual embeddings. Crucially, we provide a theoretical basis for ITA enhancing text-image consistency, a premise backed by the group theory for data augmentation (Chen, Dobriban, and Lee 2020). As illustrated in Figure 2 (b), the augmented text embeddings are engaged in the inference process, providing semantic supervision that enhances their regularization role. On the implementation front, we offer two variants of ITA: a closed-form, training-free calculation ITA_C, and its simple learnable equivalent ITA_T. It is further proved that, when the auxiliary models are well-trained, ITA_C arrives at the same solution as recent methods (Dong et al. 2017; Cheng et al. 2020) that employ such auxiliary models for textual augmentation.
This suggests that ITA_C offers an elegant and simplified alternative for preventing semantic mismatch. Meanwhile, we identify that an effective GisC diminishes semantic collapse and benefits generated image quality. Inspired by variance preservation (Bardes, Ponce, and LeCun 2021), we design an Image Semantic Regularization Loss ($L_r$) to serve as a GisC together with ITA_C, which constrains both the semantic shift direction and the shift distance of generated images (see Figure 3 (d)). Through Lipschitz continuity and semantic constraint tightness analysis (Propositions 4.3 and 4.4), we theoretically justify that $L_r$ prevents semantic collapse, consequently yielding superior image quality compared to methods that solely bound the semantic shift direction (Gal et al. 2022). Notably, SADA can serve as a theoretical framework for other empirical forms of ITA and GisC in the future. Our contributions can be summarized as follows:

- We propose a novel Semantic-aware Data Augmentation (SADA) framework that consists of an Implicit Textual Semantic Preserving Augmentation (ITA) and a Generated Image Semantic Conservation (GisC).
- Drawing upon the group theory for data augmentation (Chen, Dobriban, and Lee 2020), we prove that ITA certifies a text-image consistency improvement. As evidenced empirically, ITA bypasses semantic mismatch while ensuring visual representations for augmented textual embeddings.
- We make the first attempt to theoretically and empirically show that GisC can additionally affect the raw image space to improve image quality. We theoretically justify that using the Image Semantic Regularization Loss $L_r$ to achieve GisC prevents semantic collapse, through an analysis of Lipschitz continuity and semantic constraint tightness.
- Extensive experimental results show that SADA can be readily applied to typical T2Isyn frameworks, including diffusion-model-based frameworks, effectively improving text-image consistency and image quality.

The extended version with full Supplementary Materials is available at https://arxiv.org/abs/2312.07951.

2 Related Work

T2Isyn Frameworks and Encoders: Current T2Isyn models fall into four typical frameworks: attentional stacked GANs accompanied by a perceptual loss produced by pretrained encoders (Zhang et al. 2017b, 2018; Xu et al. 2018; Zhu et al. 2019; Ruan et al. 2021), one-way output fusion GANs (Tao et al. 2022), VAE-GANs with transformers (Gu et al. 2022), and diffusion models (DMs) (Dhariwal and Nichol 2021). Two encoders commonly used for T2Isyn are DAMSM (Xu et al. 2018; Tao et al. 2022) and CLIP (Radford et al. 2021). Our proposed SADA is readily applied to these frameworks with different encoders.

Augmentations for T2Isyn: Most T2Isyn models (Reed et al. 2016; Xu et al. 2018; Tao et al. 2022; Gu et al. 2022) only use basic augmentations such as image crop, flip, and noise concatenation to textual embeddings, without exploiting further augmentation facilities. To preserve textual semantics, I2T2I (Dong et al. 2017) and RiFeGAN (Cheng et al. 2020) employ an extra pre-trained captioning model and an attentional caption-matching model, respectively, to generate more captions for real images and to refine retrieved texts for T2Isyn. They still suffer from semantic conflicts between input and retrieved texts, and their costly retrieval process is infeasible on large datasets, prompting us to propose a more tractable augmentation method.
Variance Preservation: StyleGAN-NADA (Gal et al. 2022) presents semantic Direction Bounding ($L_{db}$) to constrain the semantic shift directions of texts and generated images, which may not guarantee the prevention of semantic collapse. Inspired by variance preservation in contrastive learning (Bardes, Ponce, and LeCun 2021), built on the principle of maximizing the information content of embeddings (Ermolov et al. 2021; Zbontar et al. 2021; Bardes, Ponce, and LeCun 2021), we constrain the variables of the generated image semantic embeddings to have a particular variance along with their semantic shift direction.

3 Implicit Textual Semantic Preserving Augmentation

Consider observations $\hat{X}_1, \ldots, \hat{X}_k \in \hat{\mathcal{X}}$ sampled i.i.d. from a probability distribution $P$ on the sample space $\hat{\mathcal{X}}$, where each $\hat{X}$ includes a real image $r$ and its paired text $s$. According to $\hat{X} \in \hat{\mathcal{X}}$, we then have $X_1, \ldots, X_k \in \mathcal{X}$, where each $X$ includes a real image embedding $e_r$ and a text embedding $e_s$. We take $G$ with parameters $\theta$ as a universal notation for the generators in different frameworks; $L(\theta, \cdot)$ represents the total loss for $G$ used in the framework. Following the Group-Theoretic Framework for Data Augmentation (Chen, Dobriban, and Lee 2020), we also assume that:

Assumption 3.1. If original and augmented data form a group that is exactly invariant (i.e., the distribution of the augmented data is equal to that of the original data), the semantic distributions of texts/images are exactly invariant.

Consider augmented samples $X' \in \mathcal{X}'$, where each $X'$ includes $e_r$ and an augmented textual embedding $e'_s$. According to Assumption 3.1, we have an equality in distribution:

$$X =_d X', \quad (1)$$

which implies that both $X$ and $X'$ are sampled from $\mathcal{X}$. Bringing this down to textual embeddings specifically, we further draw an assumption:

Assumption 3.2. If the semantic embedding $e_s$ of a given text follows a distribution $\mathbb{Q}_s$, then $e'_s$ sampled from $\mathbb{Q}_s$ also preserves the main semantics of $e_s$.

This assumption can be intuitively understood to mean that for a given text, there is usually a group of synonymous texts. Satisfying exact invariance, $e'_s$ sampled from $\mathbb{Q}_s$ preserves the main semantics of $e_s$; $e'_s$ is guaranteed to fall within the textual semantic distribution and to correspond to a visual representation that shares the same semantic distribution as the image generated on $e_s$. Thus, $e'_s$ can be used to generate a reasonable image. Under Assumption 3.2, we propose the Implicit Textual Semantic Preserving Augmentation (ITA) that can obtain $\mathbb{Q}_s$. As shown in Figure 3 (a)(b), ITA boosts the generalization of the model by augmenting implicit textual data under $\mathbb{Q}_s$.

3.1 Training Objectives for G with ITA

The general sample objective with ITA is defined as:

$$\min_{\theta} \hat{R}_k(\theta) := \frac{1}{k} \sum_{i=1}^{k} L(\theta, \mathrm{ITA}(X_i)). \quad (2)$$

Figure 3: Diagram of the augmentation effects of our proposed SADA on $e_s$ ($e_{s|r}$) and $e'_s$ ($e'_{s|r}$) under $\mathbb{Q}_s$ ($\mathbb{Q}_{s|r}$): +ITA; (c) +ITA + $L_{db}$ (direction bounding); (d) +ITA + $L_r$ (direction and distance bounding); compared with other methods (no bounding).

We then define the solution for $\theta$ based on Empirical Risk Minimization (ERM) (Naumovich 1998) as:

$$\text{ERM:}\quad \hat{\theta}_{ITA} \in \arg\min_{\theta \in \Theta} \frac{1}{k} \sum_{i=1}^{k} L(\theta, \mathrm{ITA}(X_i)), \quad (3)$$

where $\Theta$ is some parameter space. See the detailed derivation based on ERM in Supplementary Materials A.1.
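To make the objective in Eqs. (2)-(3) concrete, the sketch below shows one way an ITA-augmented generator update could be wired up in PyTorch. It is a minimal illustration under our own assumptions, not the released training code: `frame_loss` stands for whatever total loss $L(\theta, \cdot)$ the host framework uses, and `ita` is any callable implementing ITA (e.g., the closed-form variant derived in Section 3.2).

```python
import torch

def ita_training_step(generator, ita, frame_loss, optimizer, e_s: torch.Tensor, e_r):
    """One empirical-risk step on ITA-augmented samples (Eqs. (2)-(3)), sketched.

    e_s:        [B, d] textual semantic embeddings of the mini-batch
    e_r:        paired real images (or their embeddings), shared by e_s and e'_s
    ita:        callable implementing ITA, returning augmented embeddings e'_s
    frame_loss: the host framework's total generator loss L(theta, .)
    """
    e_s_aug = ita(e_s)                                  # ITA(X): only the text side is perturbed
    loss = frame_loss(generator, e_s, e_r)              # original pair, kept by the host framework
    loss = loss + frame_loss(generator, e_s_aug, e_r)   # augmented pair reuses the same real image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```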
Proposition 3.3 (ITA increases T2Isyn semantic consistency). Assume exact invariance holds. Consider an unaugmented text-image generator $\hat{\theta}(X)$ of $G$ and its augmented version $\hat{\theta}_{ITA}$. For any real-valued convex loss $S(\theta, \cdot)$ that measures semantic consistency, we have:

$$E[S(\theta, \hat{\theta}(X))] \ge E[S(\theta, \hat{\theta}_{ITA}(X))], \quad (4)$$

which means that with ITA, a model can attain a lower $E[S(\theta, \hat{\theta}_{ITA}(X))]$ and thus better text-image consistency.

Proof. We obtain the direct consequence that $\mathrm{Cov}[\hat{\theta}_{ITA}(X)] \preceq \mathrm{Cov}[\hat{\theta}(X)]$, i.e., the covariance matrix decreases in the Loewner order. Therefore, $G$ with ITA can obtain better text-image consistency. See proof details in Supplementary Materials A.2.

For a clear explanation, we specify the form $S(\theta, \cdot) := S(\theta, (\cdot, \cdot))$, where $(\cdot, \cdot)$ takes an $e_s$ and an $e_r$ for semantic consistency measurement, and $\theta$ denotes the set of training parameters. Since we preserve the semantics of $e'_s$, its generated images should also semantically match $e_s$. Thus, the total semantic loss of $G$ is defined as:

$$L_S = S(\theta, (e_s, G'(e_s))) + S(\theta, (e'_s, G'(e'_s))) + S(\theta, (e_s, G'(e'_s))) + S(\theta, (e'_s, G'(e_s))), \quad (5)$$

where $G' = h(G(\cdot))$, $G(\cdot)$ takes a textual embedding, and $h(\cdot)$ maps images into the semantic space. Since the first term is already included in the basic framework, it is typically omitted, while the other terms are added when applying SADA.

3.2 Obtaining Closed-form ITA_C

Theoretical Derivation of ITA_C: Assume that exact invariance holds. We treat each textual semantic embedding $e_s$ as a Gaussian-like distribution $\phi = N(e_s, \sigma)$, where each sample $e'_s \sim N(e_s, \sigma)$ maintains the main semantics $m_s$ of $e_s$. In other words, $\sigma$ is the variation range of $e_s$ conditioned on $m_s$, and $\phi$ derives into:

$$\phi = N(e_s, \sigma \mid m_s). \quad (6)$$

Figure 4: Network structure of ITA_C and ITA_T. Note that $e_s$ and $e'_s$ are equivalent to $e_{s|r}$ and $e'_{s|r}$, respectively.

By sampling $e'_s$ from $\phi$, we can efficiently obtain augmented textual embeddings for training. We need to draw support from real images to determine the semantics $m_s$ that need to be preserved. Empirically, real texts are created based on real images; $e_s$ thus naturally depends on $e_r$, leading to the inference: $e_{s|r} \equiv e_s$, $m_{s|r} \equiv m_s$, $\mathbb{Q}_{s|r} \equiv \mathbb{Q}_s$. Given a set of real images, $\sigma \mid m_{s|r}$ is assumed to represent the level of variation inherent in the text embeddings, conditioned on the real images. We can redefine $\phi$ in Eq. (6) for ITA_C augmentation as:

$$\phi \approx N(e_{s|r}, \sigma \mid m_{s|r}) = N(e_{s|r}, \beta \cdot C_{ss|r} \cdot I),$$

where $C$ denotes the covariance matrix of semantic embeddings; $r$ and $s$ stand for real images and real texts; $C_{ss|r}$ is the self-covariance of $e_s$ conditioned on the semantic embedding $e_r$ of real images; $I$ denotes the identity matrix; and $\beta$ is a positive hyper-parameter controlling the sampling range. As such, we define $\phi \approx \mathbb{Q}_{s|r}$. According to (Kay 1993), the conditional $C_{ss|r}$ is equivalent to:

$$C_{ss|r} = C_{ss} - C_{sr} C_{rr}^{-1} C_{rs}, \quad (7)$$

where all covariances can be directly calculated. Then $\phi$ is computed from the dataset using the semantic embeddings of texts and images for $s$ and $r$. In practice, $C_{ss|r}$ is calculated using real images and their given texts from the training set.
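For illustration, the following NumPy sketch estimates $C_{ss|r}$ via Eq. (7) from precomputed text/image embedding matrices and then draws augmented embeddings from $\phi \approx N(e_{s|r}, \beta \cdot C_{ss|r} \cdot I)$. The diagonal approximation, the ridge regularizer, and the value of $\beta$ are our assumptions for a runnable example, not the released implementation.

```python
import numpy as np

def conditional_cov(E_s: np.ndarray, E_r: np.ndarray, reg: float = 1e-6) -> np.ndarray:
    """C_{ss|r} = C_ss - C_sr C_rr^{-1} C_rs  (Eq. (7)).

    E_s: [N, d_s] text semantic embeddings of the training set
    E_r: [N, d_r] paired real-image semantic embeddings
    reg: small ridge term added before inverting C_rr (numerical safeguard, our choice)
    """
    E = np.concatenate([E_s, E_r], axis=1)
    C = np.cov(E, rowvar=False)
    d_s = E_s.shape[1]
    C_ss, C_sr = C[:d_s, :d_s], C[:d_s, d_s:]
    C_rs, C_rr = C[d_s:, :d_s], C[d_s:, d_s:]
    C_rr_inv = np.linalg.inv(C_rr + reg * np.eye(C_rr.shape[0]))
    return C_ss - C_sr @ C_rr_inv @ C_rs

def ita_c_sample(e_s: np.ndarray, C_ssr: np.ndarray, beta: float = 0.01) -> np.ndarray:
    """Draw e'_{s|r} ~ N(e_{s|r}, beta * C_{ss|r} * I), keeping only the diagonal
    of C_{ss|r} as the per-dimension variance (a simplification on our part)."""
    var = beta * np.clip(np.diag(C_ssr), 0.0, None)
    return e_s + np.random.randn(*e_s.shape) * np.sqrt(var)

# Toy usage with random stand-ins for DAMSM/CLIP features:
E_s, E_r = np.random.randn(1000, 64), np.random.randn(1000, 64)
C_ssr = conditional_cov(E_s, E_r)
e_aug = ita_c_sample(E_s[0], C_ssr, beta=0.01)
```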
Remarks on ITA_C: We explore the connections between ITA_C and previous methods (Dong et al. 2017; Cheng et al. 2020), assuming all models are well-trained.

Proposition 3.4. ITA_C can be considered a closed-form solution for general textual semantic preserving augmentation methods of T2Isyn.

Proof details can be seen in Supplementary Materials A.2. Therefore, training with bare ITA_C is equivalent to using other textual semantic preserving augmentation methods.

ITA_C Structure: Based on Eq. (7), we obtain $e'_{s|r}$ from the calculated ITA_C:

$$e'_{s|r} \sim \phi, \quad \text{i.e.,}\quad e'_{s|r} = e_{s|r} + z \approx e_{s|r} + \epsilon \cdot \beta \cdot C_{ss|r} \cdot I, \quad (8)$$

where $z \sim N(0, \beta \cdot C_{ss|r} \cdot I)$ and $\epsilon$ is sampled from a uniform distribution $U(-1, 1)$, as shown in Figure 4. ITA_C requires no training and can be used to train or tune a T2Isyn model.

3.3 Obtaining Learnable ITA_T

We also design a learnable ITA_T as a clever substitute. Proposition 3.4 certifies that a well-trained ITA_T is equivalent to ITA_C. To obtain ITA_T through training, we need to achieve the following objectives:

$$\max_{\alpha} L_d(\alpha, (e'_{s|r}, e_{s|r})), \quad \min_{\alpha} S(\alpha, (e_{s|r}, G'(e'_{s|r}))),$$

where $L_d(\alpha, (\cdot,\cdot))$ denotes a distance measurement, enforcing that the augmented $e'_{s|r}$ should be as far from $e_{s|r}$ as possible; $\alpha$ denotes the training parameters of ITA_T. $S(\alpha, (\cdot,\cdot))$ bounds the consistency between $e_{s|r}$ and images generated on $e'_{s|r}$, preserving the semantics of $e'_{s|r}$. The first objective can easily be reformed as minimizing the inverse distance:

$$\min_{\alpha} L_{id}(\alpha, (e'_{s|r}, e_{s|r})) := \min_{\alpha} -L_d(\alpha, (e'_{s|r}, e_{s|r})).$$

The final loss for training ITA_T is a weighted combination of $L_{id}$ and $S(\alpha, (\cdot,\cdot))$:

$$L_{ITA_T} = r \cdot L_{id}(\alpha, (e'_{s|r}, e_{s|r})) + (1 - r) \cdot S(\alpha, (e_{s|r}, G'(e'_{s|r}))), \quad (9)$$

where $r$ is a hyper-parameter controlling the augmentation strength. Note that $L_{ITA_T}$ is only used for optimizing $\alpha$ of ITA_T; the parameters of $G$ are frozen here (as in Figure 2 (c)).

ITA_T Structure: Since the augmented $e'_{s|r}$ should maintain the semantics in $e_{s|r}$, $\epsilon$ in Eq. (8) is maximized but must not disrupt the semantics in $e_{s|r}$. As such, $\epsilon$ is not pure noise but an $e_{s|r}$-conditioned variable. Hence, Eq. (8) can be reformed as $e'_{s|r} = e_{s|r} + f(e_{s|r})$ to achieve ITA_T, where $f(e_{s|r})$ denotes a series of transformations of $e_{s|r}$. The final ITA_T process can be formulated as $e'_{s|r} = \mathrm{ITA}_T(e_{s|r}) = e_{s|r} + f(e_{s|r})$. We deploy a recurrent-like structure, as shown in Figure 4, to learn the augmentation. ITA_T takes $e_{s|r}$ as input. For the $i$th of $n$ overall steps, a group of Multilayer Perceptrons learns the weights $w_i$ and bias $b_i$, conditioned on $e_{s|r}$, for the previous step's output $h_{i-1}$. Then $h_i = e_{s|r} + (h_{i-1} \cdot w_i + b_i)$ is passed to the following step. We empirically set $n = 2$ for all our experiments. ITA_T can be trained simultaneously with generative frameworks from scratch or used as a tuning trick.
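A minimal PyTorch sketch of the recurrent-like ITA_T module described above is given below. The per-step update follows $h_i = e_{s|r} + (h_{i-1} \cdot w_i + b_i)$ with $w_i$ and $b_i$ predicted from $e_{s|r}$ by small MLPs; reading the product as element-wise and the hidden width are our assumptions rather than a confirmed configuration.

```python
import torch
import torch.nn as nn

class ITAT(nn.Module):
    """Learnable ITA_T: e'_{s|r} = h_n with h_i = e_{s|r} + (h_{i-1} * w_i + b_i).

    n_steps = 2 follows the paper; the element-wise product and the MLP
    hidden width are our reading/choice, not a confirmed configuration.
    """
    def __init__(self, dim: int, n_steps: int = 2, hidden: int = 256):
        super().__init__()
        def mlp() -> nn.Sequential:
            return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.w_mlps = nn.ModuleList([mlp() for _ in range(n_steps)])
        self.b_mlps = nn.ModuleList([mlp() for _ in range(n_steps)])

    def forward(self, e_s: torch.Tensor) -> torch.Tensor:
        h = e_s
        for w_mlp, b_mlp in zip(self.w_mlps, self.b_mlps):
            w, b = w_mlp(e_s), b_mlp(e_s)   # step weights/bias conditioned on e_{s|r}
            h = e_s + (h * w + b)           # recurrent-like residual update
        return h                            # augmented embedding e'_{s|r}
```

During training, only the parameters of this module would be updated with $L_{ITA_T}$ in Eq. (9), while the generator $G$ stays frozen, as in Figure 2 (c).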
4 Generated Image Semantic Conservation

Enabled by ITA's provision of $e_{s|r}$ and $e'_{s|r}$, we show that using Generated Image Semantic Conservation (GisC) also affects the raw space of generated images. Consider a frozen pretrained image encoder $E_I$ that maps images into the same semantic space, and a feasible, trainable generator $G$ that learns how to generate text-consistent images: $G(\mathcal{X}) \to \mathcal{F}$, $E_I(\mathcal{F}) \to \mathcal{E}$, where $\mathcal{F}$ and $\mathcal{E}$ are the sets of generated images $f$ and their semantic embeddings $e_f$. Since images are generated on texts, we have $e_{f|s} \equiv e_f$. We show that semantically constraining generated images can additionally affect their raw space.

Proposition 4.1. Assume that $E_I$ is linear and well-trained. Constraining the distribution $\mathbb{Q}_E$ of $e_{f|s}$ can additionally constrain the distribution $\mathbb{Q}_F$ of $f$.

Proof. There are two scenarios: 1) If $E_I$ is invertible, Proposition 4.1 is obvious. 2) If $E_I$ is not invertible, it is impossible for $\mathcal{F}$ to lie entirely in $\mathrm{Null}(E_I)$ (the nullspace of $E_I$) for a well-trained $E_I$; thus constraining $\mathcal{E}$ can still affect $\mathcal{F}$. See more proof details in Supplementary Materials A.2.

We further assume that the positive effectiveness of a feasible GisC can pass to the raw generated image space. The nonlinear case is non-trivial to prove. Our results of using nonlinear encoders (DAMSM (Xu et al. 2018) and CLIP (Radford et al. 2021)) with different feasible GisC methods suggest that Proposition 4.1 holds for nonlinear $E_I$ and positively affects image quality.

4.1 Image Semantic Regularization Loss

We design an Image Semantic Regularization Loss $L_r$ to attain GisC, preventing semantic collapse and providing tighter semantic constraints than direction bounding $L_{db}$ (Gal et al. 2022).

Theoretical Derivation of $L_r$: To tackle semantic collapse empirically, we constrain the semantic distribution of generated images, drawing inspiration from the principle of maximizing the information content of embeddings through variance preservation (Bardes, Ponce, and LeCun 2021). Since semantic redundancies in real images that are not described by the texts are not compulsory to appear in generated images, the generated images are not required to be identical to real images. Therefore, conditioned on the texts, generated images should exhibit the semantic variation present in real images. For example, when the text changes from "orange" to "banana", the orange in real images should likewise shift to a banana despite the redundancies, and fake images should capture this variance (Tan et al. 2023). If exact invariance holds and the model is well-trained, the text-conditioned semantic distribution of its generated images $\mathbb{Q}_{f|s} = N(m_{f|s}, C_{ff|s} \cdot I)$ should have a semantic variance as close as possible to that of the real images $\mathbb{Q}_{r|s} = N(m_{r|s}, C_{rr|s} \cdot I)$:

$$\min_{e_f} \| C_{ff|s} \cdot I - C_{rr|s} \cdot I \|_2, \quad C_{rr|s} = C_{rr} - C_{rs} C_{ss}^{-1} C_{sr}, \quad (10)$$

where $C_{rr|s}$ is the self-covariance of $e_r$ conditioned on the real text embeddings. Aiming to maintain latent-space alignment, an existing GisC method, direction bounding (Gal et al. 2022), is defined as:

$$L_{db} = 1 - \frac{(e'_{s|r} - e_{s|r}) \cdot (e'_{f|s} - e_{f|s})}{\|e'_{s|r} - e_{s|r}\|_2 \, \|e'_{f|s} - e_{f|s}\|_2}. \quad (11)$$

$L_{db}$ follows from the observation that semantic features are usually linearized (Bengio et al. 2013; Upchurch et al. 2017; Wang et al. 2021). Given a pair of encoders that map texts and images into the same semantic space, and inspired by $L_{db}$, we assume that:

Assumption 4.2. If the paired encoders are well-trained and aligned, and their semantic features are linearized, then the semantic shifts of generated images are proportional to those of texts:

$$(e'_{f|s} - e_{f|s}) \propto (e'_{s|r} - e_{s|r}). \quad (12)$$

Assumption 4.2 holds for T2Isyn intuitively: when the given textual semantics change, the semantics of the generated image also change, with a shift direction and distance based on the textual semantic changes. Otherwise, semantic mismatch and collapse would happen. If Assumption 4.2 holds, then based on ITA_C, which gives $e'_{s|r} - e_{s|r} \approx \epsilon \cdot \beta \cdot d(C_{ss|r})$, we have:

$$e'_{f|s} - e_{f|s} \approx \epsilon \cdot \beta \cdot d(C_{ff|s}) \quad \text{s.t.} \quad e'_{s|r} - e_{s|r} \approx \epsilon \cdot \beta \cdot d(C_{ss|r}). \quad (13)$$

If we force each dimension of $\epsilon$ to lie in $\{-1, 1\}$, i.e., $\epsilon \in \{-1, 1\}^n$ where $n$ is the dimension of the semantic embedding, we have:

$$e'_{f|s} - e_{f|s} = \epsilon \cdot \beta \cdot d(C_{ff|s}) \quad \text{s.t.} \quad e'_{s|r} - e_{s|r} = \epsilon \cdot \beta \cdot d(C_{ss|r}). \quad (14)$$

Derived from Eqs. (10) and (14), we define our Image Semantic Regularization Loss $L_r$ as:

$$L_r = \varphi \cdot \| (e'_{f|s} - e_{f|s}) - \epsilon \cdot \beta \cdot d(C_{rr|s}) \|_2, \quad (15)$$

where $\beta \cdot d(C_{rr|s})$ can be considered a data-based regularization term substituting for $\beta \cdot d(C_{ff|s})$ under Eq. (10); $\epsilon$ constrains the shifting direction, as shown in Figure 3 (d); and $\varphi$ is a hyper-parameter for balancing $L_r$ with the other losses. Note that for ITA_T, the range of $e'_{s|r} - e_{s|r}$ is not closed-form. Thus, we cannot apply $L_r$ with ITA_T.
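To make the two GisC objectives concrete, the hedged PyTorch sketch below implements $L_{db}$ (Eq. (11)) and $L_r$ (Eq. (15)) for a batch of semantic shifts; the batching, the mean reduction, and the default hyper-parameter values are our choices rather than the released code.

```python
import torch
import torch.nn.functional as F

def l_db(d_text: torch.Tensor, d_img: torch.Tensor) -> torch.Tensor:
    """Direction bounding L_db (Eq. (11)): 1 - cosine similarity between the
    textual shift d_text = e'_{s|r} - e_{s|r} and the generated-image shift
    d_img = e'_{f|s} - e_{f|s}, averaged over the batch."""
    return 1.0 - F.cosine_similarity(d_text, d_img, dim=-1).mean()

def l_r(d_img: torch.Tensor, eps_sign: torch.Tensor, diag_c_rr_s: torch.Tensor,
        beta: float = 0.01, phi: float = 1.0) -> torch.Tensor:
    """Image Semantic Regularization Loss L_r (Eq. (15)), sketched:
    L_r = phi * || (e'_{f|s} - e_{f|s}) - eps * beta * d(C_{rr|s}) ||_2.

    eps_sign:     per-dimension signs (+1/-1) used when sampling e'_{s|r}
    diag_c_rr_s:  diagonal of the conditional covariance C_{rr|s}
    beta, phi:    hyper-parameters; the defaults here are placeholders
    """
    target = eps_sign * beta * diag_c_rr_s          # data-based regularization target
    return phi * torch.norm(d_img - target, p=2, dim=-1).mean()
```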
Remarks on $L_r$: We show the effect of $L_r$ on the semantic space of generated images.

Proposition 4.3 ($L_r$ prevents semantic collapse: completely different). $L_r$ forces $|e'_{f|s} - e_{f|s}|$ to be less than or equal to a sequence $\Lambda$ of positive constants, which further constrains the semantic manifold of generated embeddings to meet the Lipschitz condition.

Proof. From Eq. (15), we have the constraint $|e'_{f|s} - e_{f|s}| \le \Lambda$. Therefore, we have:

$$\frac{|e'_{f|s} - e_{f|s}|}{|e'_{s|r} - e_{s|r}|} \le K, \quad \text{s.t.} \quad e'_{s|r} \ne e_{s|r},$$

where $K$ is a Lipschitz constant. See more proof details in Supplementary Materials A.2.

Proposition 4.3 justifies why image quality can be improved with $L_r$. According to Proposition 4.1, we believe that the Lipschitz continuity can be passed to the visual feature distribution, leading to better continuity in the visual space as well. Our experiments verify that with $L_r$, T2Isyn models achieve the best image quality.

Proposition 4.4 ($L_r$ prevents semantic collapse: extremely similar). $L_r$ prevents $|e'_{f|s} - e_{f|s}| = 0$ and provides tighter image semantic constraints than direction bounding $L_{db}$.

Proof. For Eq. (11), assume $L_{db} = 0$ and substitute $e'_{s|r}$ for $e_{s|r}$; combining with Eq. (8), we have $|e'_{f|s} - e_{f|s}| \ge 0$. Prevention of semantic collapse is thus not guaranteed, since the distance between $e'_{f|s}$ and $e_{f|s}$ is not strictly constrained. Assume instead $L_r = 0$; then $|e'_{f|s} - e_{f|s}| > 0$, which provides tighter constraints than $L_{db}$. See the visual explanation in Figure 3 (c)(d) and proof details in Supplementary Materials A.2.

Propositions 4.3-4.4 show that $L_r$ prevents semantic collapse. See the SADA algorithms in Supplementary Materials B.

5 Experiments

Our experiments include three parts: 1) To demonstrate how ITA improves text-image consistency, we apply the ITA of SADA to text-image retrieval tasks. 2) To exhibit the feasibility of SADA, we conduct extensive experiments using different T2Isyn frameworks with GANs, Transformers, and Diffusion Models (DMs) as backbones on different datasets. 3) Detailed ablation studies are performed; we compare SADA with other typical augmentation methods to show that SADA certifies an improvement in text-image consistency and image quality in T2Isyn tasks. Particularly noteworthy is the observation that GisC can alleviate semantic collapse. Due to page limitations, key findings are presented in the main paper. For detailed application and training information, as well as more comprehensive results and visualizations, please refer to Supplementary Materials C and D. Codes are available at https://github.com/zhaorui-tan/SADA.

5.1 SADA on Text-Image Retrieval

Experimental setup: We compare tuning CLIP (Wang et al. 2022) (ViT-B/16) w/ ITA and wo/ ITA on the COCO (Lin et al. 2014) dataset. Evaluation is based on Top1 and Top5 retrieval accuracy under identical hyper-parameter settings.

| Setting | Image Retrieval Top1 | Image Retrieval Top5 | Text Retrieval Top1 | Text Retrieval Top5 |
|---|---|---|---|---|
| CLIP | 30.40 | 54.73 | 49.88 | 74.96 |
| Tuned | 44.43 | 72.38 | 61.20 | 85.16 |
| +ITA | 44.88 (+0.45) | 72.42 (+0.04) | 62.76 (+1.56) | 85.38 (+0.22) |

Table 1: Text-image retrieval results of CLIP tuned w/ and wo/ SADA ITA. Please refer to Supplementary Material D.1 for tuning CLIP with different numbers of samples.

Results: As exhibited in Table 1, using ITA boosts image-text retrieval accuracy in both the Top1 and Top5 rankings, reflecting its proficiency in enhancing the consistency between text and images. The increases of 0.45% and 1.56% in Top1 retrieval accuracy explicitly suggest a more precise semantic consistency achieved with SADA, providing empirical validation of our Proposition 3.3.
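For reference, the Top1/Top5 retrieval accuracies in Table 1 can be computed from a paired text-image similarity matrix as in the sketch below; the exact evaluation protocol (normalization, splits, prompt handling) is our assumption and may differ from the authors' script.

```python
import torch
import torch.nn.functional as F

def topk_retrieval_accuracy(img_emb: torch.Tensor, txt_emb: torch.Tensor, ks=(1, 5)):
    """Top-k image-to-text and text-to-image retrieval accuracy (in percent).

    img_emb, txt_emb: [N, d] paired embeddings; row i of each matrix is a true pair.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t()                       # [N, N] cosine similarity matrix
    gt = torch.arange(sim.size(0))
    results = {}
    for direction, s in (("i2t", sim), ("t2i", sim.t())):
        ranks = s.argsort(dim=-1, descending=True)
        for k in ks:
            hits = (ranks[:, :k] == gt[:, None]).any(dim=-1).float().mean()
            results[f"{direction}@top{k}"] = 100.0 * hits.item()
    return results
```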
5.2 SADA on Various T2Isyn Frameworks

Experimental setup: We test SADA on the GAN-based AttnGAN (Xu et al. 2018) and DF-GAN (Tao et al. 2022), the transformer-based VQ-GAN+CLIP (Wang et al. 2022), a vanilla DM-based conditional DDPM (Ho, Jain, and Abbeel 2020), and Stable Diffusion (SD) (Rombach et al. 2021), with different pretrained text-image encoders (CLIP and DAMSM (Xu et al. 2018)). Parameter settings follow the original models of each framework for all experiments unless specified. Datasets: CUB (Wah et al. 2011), COCO (Lin et al. 2014), MNIST (Deng 2012), and Pokémon BLIP are employed for training and tuning (see the settings column in Table 2). Supplementary Material D.2 offers additional SD-tuned results. For quantitative evaluation, we use CLIPScore (CS) (Hessel et al. 2021) to assess text-image consistency (scaled by 100) and Fréchet Inception Distance (FID) (Heusel et al. 2017) to evaluate image quality (computed over 30K generated images).

| Backbone | Encoder | Method | Settings, Dataset | CS | FID |
|---|---|---|---|---|---|
| Transformer | CLIP | VQ-GAN+CLIP | | 62.78 | 16.16 |
| | | +SADA | Tune, COCO | **62.81** | **15.56** |
| DM | CLIP | SD | | 72.72 | 55.98 |
| | | +SADA | Tune, Pokémon BLIP | **73.80** | **46.07** |
| DM | CLIP | DDPM | | 70.77 | 8.61 |
| | | +SADA | Train, MNIST | **70.91** | **7.78** |
| GANs | DAMSM | AttnGAN | | 68.00 | 23.98 |
| | | +SADA | Train, CUB | **68.20** | **13.17** |
| GANs | DAMSM | AttnGAN | | 62.59 | 29.60 |
| | | +SADA | Tune, COCO | **64.59** | **22.70** |
| GANs | DAMSM | DF-GAN | | 58.10 | 12.10 |
| | | +SADA | Train, CUB | **58.24** | **10.45** |
| GANs | DAMSM | DF-GAN | | 50.71 | 15.22 |
| | | +SADA | Train, COCO | **51.02** | **12.49** |

Table 2: Performance evaluation of SADA with different backbones on different datasets. Results better than the baseline are in bold.

Results: As shown in Table 2 and the corresponding Figure 6, the effectiveness of SADA is well supported by improvements across all backbones, datasets, and text-image encoders, which experimentally validates its efficacy in enhancing text-image consistency and image quality. Notably, facilitated by ITA_C + $L_r$, AttnGAN improves its FID from 23.98 to 13.17 on CUB. For tuning VQ-GAN+CLIP and SD, which have been pre-trained on large-scale data, SADA still guarantees improvements. These results support Propositions 3.3, 4.1, and 4.3. It is worth noting that the tuning results of models with DM backbones (SD) are influenced by the limited size of the Pokémon BLIP dataset, resulting in a relatively high FID score. Under these constraints, tuning with SADA still outperformed the baseline, improving the CS from 72.72 to 73.80 and lowering the FID from 55.98 to 46.07.

Figure 6: Generated examples of different backbones on different datasets wo/ SADA and w/ SADA. See more examples of different frameworks in Supplementary Materials D.

5.3 Ablation Studies

Experimental setup: Based on AttnGAN and DF-GAN, we compare Mixup (Zhang et al. 2017a), DiffAug (Zhao et al. 2020), Random Mask (RandMask), and Add Noise with the SADA components in terms of CS and FID. Refer to Supplementary Materials C and D.3 for more detailed settings and the impact of $r$ in ITA_T.

Quantitative results: Quantitative results are reported in Table 3.³ We discuss the results from different aspects.

³Note that for Task 2, we use the best results among current augmentations as the baseline since no released checkpoint is available.

1) Effect of other competitors: Mixup and DiffAug weaken visual supervision, resulting in worse FID than the baselines. They also weaken text-image consistency in most situations. Moreover, Random Mask and Add Noise are sensitive to frameworks and datasets, and thus cannot guarantee consistent improvements.
2) ITA improves text-image consistency: Regarding text-image consistency, using ITA wo/ or w/ GisC leads to improvements in semantics, supporting Proposition 3.3. However, ITA_T takes more time to converge due to its training, weakening its semantic enhancement at the early stage (as in Task 5). Once converged with a longer training time, ITA_T improves text-image consistency, as in Task 6. 3) GisC promotes image quality: For image quality, it can be observed that with bare ITA wo/ GisC, FID improves in most situations; using constraints such as $L_{db}$ and $L_r$ with ITA_T and ITA_C further improves image quality, except for ITA_T + $L_{db}$ in Task 1. These observations support Propositions 4.1 and 4.3. 4) $L_r$ provides a tighter generated-image semantic constraint than $L_{db}$: Specifically, compared with $L_{db}$, using our proposed $L_r$ with ITA_C provides the best FID and is usually accompanied by good text-image consistency, thus validating Proposition 4.4.

| Settings (CUB, Train) | Task 1 (AttnGAN) CS | Task 1 FID | Task 2 (DF-GAN) CS | Task 2 FID | Task 3 (DF-GAN) CS | Task 3 FID |
|---|---|---|---|---|---|---|
| Paper | 68.00 | 23.98 | - | 14.81 | - | - |
| RM | 68.00 | 23.98 | - | 14.81 | 58.10 | 12.10 |
| +Mixup | 65.82 | 41.47 | 57.29 | 28.73 | 57.36 | 25.77 |
| +DiffAug | 66.94 | 22.53 | 58.22 | 17.27 | 58.05 | 12.35 |
| +RandMask | 67.80 | 15.59 | 57.96 | 15.42 | 58.07 | 15.17 |
| +AddNoise | 67.79 | 17.29 | 57.46 | 48.23 | 57.58 | 42.07 |
| +ITA_T | 68.53 | 14.14 | 58.09 | 14.03 | 58.80 | 12.17 |
| +ITA_T+$L_{db}$ | 68.10 | 14.55 | 58.07 | 11.74 | 58.67 | 11.58 |
| +ITA_C | 68.42 | 13.68 | 58.25 | 12.70 | 58.23 | 11.81 |
| +ITA_C+$L_{db}$ | 68.18 | 13.74 | 58.30 | 12.93 | 58.23 | 10.77 |
| +ITA_C+$L_r$ | 68.20 | 13.17 | 58.27 | 11.70 | 58.24 | 10.45 |

| Settings (COCO, Tune) | Task 4 (AttnGAN) CS | Task 4 FID | Task 5 (DF-GAN) CS | Task 5 FID | Task 6 (DF-GAN) CS | Task 6 FID |
|---|---|---|---|---|---|---|
| Paper | 50.48 | 35.49 | - | 19.23 | - | - |
| RM | 50.48 | 35.49 | 50.94 | 15.41 | 50.94 | 15.41 |
| +Tuned | 62.59 | 29.60 | 50.63 | 15.67 | 50.71 | 15.22 |
| +Mixup | 62.30 | 33.41 | 50.38 | 23.80 | 50.83 | 22.86 |
| +DiffAug | 65.44 | 33.86 | 49.45 | 21.31 | 50.94 | 18.97 |
| +RandMask | 63.76 | 23.82 | 50.54 | 15.74 | 50.64 | 15.33 |
| +AddNoise | 64.77 | 35.47 | 50.94 | 34.90 | 50.80 | 33.84 |
| +ITA_T+$L_{db}$ | 63.31 | 26.65 | 50.60 | 15.05 | 50.77 | 13.67 |
| +ITA_C+$L_{db}$ | 63.97 | 25.82 | 50.92 | 14.71 | 50.98 | 13.28 |
| +ITA_C+$L_r$ | 64.59 | 22.70 | 50.81 | 13.71 | 51.02 | 12.49 |

Table 3: CS and FID for AttnGAN and DF-GAN with Mixup, DiffAug, Random Mask, Add Noise, and the proposed SADA components on CUB and COCO. RM: Released Model.

Figure 5: Generated examples of DF-GAN and DDPM trained with different augmentations, generated on $e_{s|r}$ as ascending Noise $\sim N(0, \beta \cdot C_{ss|r} \cdot I)$ is added (text: "This bird has a yellow crest and black beak"). Input noise is fixed for each column; examples with semantic collapse are highlighted. See full examples in Supplementary Materials Figures 18, 19 & 20.
Qualitative Results: As depicted in Figure 5 and further examples in Supplementary Materials D, we derive several key insights. 1) Semantic collapse happens in the absence of a sufficient GisC: As seen in Figure 5, neither the non-augmented models nor those with other augmentation methods prevent semantic collapse across different backbones. Applying GisC through SADA serves to alleviate this issue effectively. 2) ITA preserves textual semantics: Images generated on $e'_{s|r}$ by models trained wo/ ITA still maintain the main semantics of $e_{s|r}$, even though they have low quality, indicating the textual semantic preservation of ITA. 3) SADA enhances generated image diversity: SADA appears to improve image diversity significantly when the input noise is not fixed and the $e_{s|r}$ of testing texts is used. The greatest improvement in image diversity is achieved by ITA_C + $L_r$, as the detailed semantics of birds are more varied than under the other settings. Details unmentioned in the text, such as the skin colors shown in Figure 7, are more varied when using SADA. More unmentioned details can be observed in Supplementary Materials Figure 11 (highlighting wing bars, color, and background). 4) ITA with GisC improves model generalization by preventing semantic collapse: Using ITA_T + $L_{db}$ and ITA_C + $L_{db}$/$L_r$ leads to obvious image quality improvements when more Noise is given, corresponding to Propositions 4.1 and 4.3. However, with ITA_C + $L_{db}$, though the model can produce high-quality images, the images generated on $e_{s|r}$ and $e'_{s|r}$ are quite similar, while ITA_C + $L_r$ varies a lot, especially in the background, implying that semantic-collapse prevention is not guaranteed by $L_{db}$ and that $L_r$ imposes a tighter constraint, as proved in Proposition 4.4. Furthermore, ITA_C + $L_r$ provides the best image quality across all experiments.

Figure 7: Generated examples of SD tuned on the Emoji dataset wo/ and w/ SADA ("person playing handball", "delivery truck", "cloud", "pot of food"). A significant improvement in diversity with +ITA_C + $L_r$ can be observed, especially in terms of skin color and view perspective.

5.4 SADA on Complex Sentences and Simple Sentences

We explore the effect of SADA on complex and simple sentences. We use the textual embeddings of the sentences in Table 4 and illustrate interpolation examples at the inference stage between $e_{s|r}$ and $e'_{s|r}$, as shown on the right side of Figure 8, where Noise $\sim N(0, \beta \cdot C_{ss|r} \cdot I)$. It can be observed that models trained with SADA alleviate the semantic collapse that occurs in models without SADA, and their semantics can resist even larger Noise. Using $e'_{s|r}$ at the inference stage can cause image quality degradation, which reveals the robustness of the models.

| | Description |
|---|---|
| sent1 | this is a yellow bird with a tail. |
| sent2 | this is a small yellow bird with a tail and gray wings with white stripes. |
| sent3 | this is a small yellow bird with a grey long tail and gray wings with white stripes. |

Table 4: Rough, detailed, and in-between descriptions used for generation.

Figure 8: Left: Generated results of DF-GAN with different methods (Random Mask, SADA) on rough to detailed sentences (sent1-sent3). Right: Interpolation examples at the inference stage between $e_{s|r}$ and $e'_{s|r}$, under +Noise and +1.5 Noise, of DF-GAN and DF-GAN with SADA on rough to detailed sentences. $e'_{s|r}$, the input noise for generator $G$, and the textual conditions are the same across all rows. Examples of significant collapse are highlighted in red.
As shown on the left side of Figure 8, DF-GAN with SADA generates more text-consistent images with better quality, from rough to precise descriptions, compared to other augmentations. The right side indicates that DF-GAN without augmentations experiences semantic collapse when larger Noise is given. The semantic collapse is more severe when a complex description is given. Applying SADA alleviates the semantic collapse across all descriptions. The model with SADA can generate reasonably good and text-consistent images even when 1.5x Noise is given with a complex description. These visualizations further verify the effectiveness of our proposed SADA.

6 Conclusion

In this paper, we propose a Semantic-aware Data Augmentation (SADA) framework that consists of ITA (including ITA_T and ITA_C) and $L_r$. We theoretically prove that using ITA with T2Isyn models leads to a text-image consistency improvement. We also show that using GisC can improve generated image quality, and our proposed ITA_C + $L_r$ promotes image quality the most. ITA relies on estimating the covariance of semantic embeddings, which may, however, be unreliable in the case of unbalanced datasets. We will explore this topic in the future.

Acknowledgments

The work was partially supported by the following: National Natural Science Foundation of China under No. 92370119, No. 62376113, and No. 62206225; Jiangsu Science and Technology Program (Natural Science Foundation of Jiangsu Province) under No. BE2020006-4; Natural Science Foundation of the Jiangsu Higher Education Institutions of China under No. 22KJB520039.

References

Bardes, A.; Ponce, J.; and LeCun, Y. 2021. VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.

Bengio, Y.; Mesnil, G.; Dauphin, Y.; and Rifai, S. 2013. Better mixing via deep representations. In International Conference on Machine Learning, 552-560. PMLR.

Chen, S.; Dobriban, E.; and Lee, J. H. 2020. A group-theoretic framework for data augmentation. The Journal of Machine Learning Research, 21(1): 9885-9955.

Cheng, J.; Wu, F.; Tian, Y.; Wang, L.; and Tao, D. 2020. RiFeGAN: Rich feature generation for text-to-image synthesis from prior knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10911-10920.

Deng, L. 2012. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6): 141-142.

Dhariwal, P.; and Nichol, A. 2021. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34.

Dong, H.; Zhang, J.; McIlwraith, D.; and Guo, Y. 2017. I2T2I: Learning text to image synthesis with textual data augmentation. In 2017 IEEE International Conference on Image Processing (ICIP), 2015-2019. IEEE.

Ermolov, A.; Siarohin, A.; Sangineto, E.; and Sebe, N. 2021. Whitening for self-supervised representation learning. In International Conference on Machine Learning, 3015-3024. PMLR.

Gal, R.; Patashnik, O.; Maron, H.; Bermano, A. H.; Chechik, G.; and Cohen-Or, D. 2022. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4): 1-13.
Gu, S.; Chen, D.; Bao, J.; Wen, F.; Zhang, B.; Chen, D.; Yuan, L.; and Guo, B. 2022. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10696-10706.

Hessel, J.; Holtzman, A.; Forbes, M.; Bras, R. L.; and Choi, Y. 2021. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718.

Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30.

Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 6840-6851.

Kay, S. M. 1993. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice-Hall, Inc.

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740-755. Springer.

Liu, P.; Wang, X.; Xiang, C.; and Meng, W. 2020. A survey of text data augmentation. In 2020 International Conference on Computer Communication and Network Security (CCNS), 191-195. IEEE.

Naumovich, V. 1998. Statistical Learning Theory. John Wiley.

Naveed, H. 2021. Survey: Image mixing and deleting for data augmentation. arXiv preprint arXiv:2106.07085.

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748-8763. PMLR.

Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; and Lee, H. 2016. Generative adversarial text to image synthesis. In International Conference on Machine Learning, 1060-1069. PMLR.

Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2021. High-resolution image synthesis with latent diffusion models. arXiv:2112.10752.

Ruan, S.; Zhang, Y.; Zhang, K.; Fan, Y.; Tang, F.; Liu, Q.; and Chen, E. 2021. DAE-GAN: Dynamic aspect-aware GAN for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13960-13969.

Tan, Z.; Yang, X.; Ye, Z.; Wang, Q.; Yan, Y.; Nguyen, A.; and Huang, K. 2023. Semantic Similarity Distance: Towards better text-image consistency metric in text-to-image generation. Pattern Recognition, 144: 109883.

Tao, M.; Tang, H.; Wu, F.; Jing, X.-Y.; Bao, B.-K.; and Xu, C. 2022. DF-GAN: A simple and effective baseline for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16515-16525.

Upchurch, P.; Gardner, J.; Pleiss, G.; Pless, R.; Snavely, N.; Bala, K.; and Weinberger, K. 2017. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7064-7073.

Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 dataset.

Wang, Y.; Huang, G.; Song, S.; Pan, X.; Xia, Y.; and Wu, C. 2021. Regularizing deep networks with semantic data augmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Wang, Z.; Liu, W.; He, Q.; Wu, X.; and Yi, Z. 2022. CLIP-GEN: Language-free training of a text-to-image generator with CLIP. arXiv preprint arXiv:2203.00386.
Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; and He, X. 2018. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1316-1324.

Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; and Deny, S. 2021. Barlow Twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, 12310-12320. PMLR.

Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017a. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.

Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; and Metaxas, D. N. 2017b. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 5907-5915.

Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; and Metaxas, D. N. 2018. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8): 1947-1962.

Zhao, S.; Liu, Z.; Lin, J.; Zhu, J.-Y.; and Han, S. 2020. Differentiable augmentation for data-efficient GAN training. Advances in Neural Information Processing Systems, 33: 7559-7570.

Zhu, M.; Pan, P.; Chen, W.; and Yang, Y. 2019. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5802-5810.