# IF-Font: Ideographic Description Sequence-Following Font Generation

Xinping Chen, Xiao Ke, Wenzhong Guo
Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China
Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350116, China
{221027017, kex, guowenzhong}@fzu.edu.cn

Few-shot font generation (FFG) aims to learn the target style from a limited number of reference glyphs and generate the remaining glyphs in the target font. Previous works focus on disentangling the content and style features of glyphs, combining the content features of the source glyph with the style features of the reference glyph to generate new glyphs. However, the disentanglement is challenging due to the complexity of glyphs, often resulting in glyphs that are influenced by the style of the source glyph and prone to artifacts. We propose IF-Font, a novel paradigm which incorporates the Ideographic Description Sequence (IDS) instead of the source glyph to control the semantics of generated glyphs. To achieve this, we quantize the reference glyphs into tokens and model the token distribution of the target glyphs using the corresponding IDS and reference tokens. The proposed method excels in synthesizing glyphs with neat and correct strokes, and enables the creation of new glyphs based on a provided IDS. Extensive experiments demonstrate that our method greatly outperforms state-of-the-art methods in both one-shot and few-shot settings, particularly when the target styles differ significantly from the training font styles. The code is available at https://github.com/Stareven233/IF-Font.

Figure 1: Comparison of two font generation paradigms. Left: the style-content disentangling paradigm, which assumes that a glyph can be decomposed into two distinct attributes, content and style. Right: the proposed paradigm, which first autoregressively predicts the target tokens and then decodes them with a VQ decoder. Orange boxes show the main difference between the two paradigms.

## 1 Introduction

At the heart of font generation lies extracting the style from a few reference glyphs of a given font and generating the remaining glyphs of that font. Some languages, such as Chinese, Japanese, and Korean, have a large number of characters and intricate glyph structures. Font generation can significantly reduce the labor intensity of font designers and support tasks like handwriting imitation, ancient book restoration, and the internationalization of film and television productions.

EMD [58] and SA-VAE [44] are based on the belief that the target glyph can be generated by integrating the content features of the source glyph with the style features of the reference glyph, as illustrated in Fig. 1 (left). The majority of subsequent works [53, 38, 47, 25, 50] continue this paradigm, but it reduces font generation to a sub-task of image-to-image translation, where the source glyph is morphed to match the style of the reference glyphs rather than being truly generated. Due to the complex structure of glyphs, achieving a distinct boundary between style and content features requires considerable effort.
Consequently, glyphs produced through the disentangling strategy typically maintain a stroke thickness similar to the reference glyph but align more closely with the content font in terms of spatial structure, size, and inclination. To this end, DG-Font [54] incorporates deformable convolution. Diff-Font [15] integrates a diffusion process to improve the network's learning capability. CF-Font [50] proposes content fusion. Additionally, several approaches [38, 47, 25, 30] combine fine-grained prior information, such as strokes and components, to further enhance generation quality. These methods essentially follow the content-style disentanglement paradigm; in scenarios where the content font differs substantially from the target font, the resulting glyphs are susceptible to artifacts such as missing strokes, blur, and smudging.

We abandon the source glyphs in favor of the Ideographic Description Sequence (IDS) to convey content information. This is based on a simple fact: without disentangling features, there is no risk of incomplete disentanglement. Consequently, font generation is reframed as a sequence prediction task, where the objective is to generate the tokens of the target glyph based on the given IDS and reference glyphs. This approach mitigates the impact of source glyphs on the outputs and diminishes artifacts by leveraging the prior knowledge embedded in the quantized codebook. Users are allowed to formulate IDSs to create non-existent Chinese characters, such as kokuji (Japanese-invented Chinese characters; see https://www.sljfaq.org/afaq/kokuji-list.html), provided that the corresponding structures and components have been learned during training. This endows the model with certain cross-linguistic capabilities. We refer to this method as Ideographic Description Sequence-Following Font Generation, or IF-Font.

In summary, the key contributions of this paper are as follows:

- We propose IF-Font, which abandons the previous content-style disentanglement paradigm and generates glyphs through next-token prediction.
- We devise a novel IDS Hierarchical Analysis (IHA) module that analyzes the spatial structures and components of Chinese characters. It allows our decoder to flexibly control the generated content with the encoded semantic features.
- Leveraging the corresponding IDSs, we design the Structure-Style Aggregation (SSA) block to extract and efficiently aggregate the style features of reference glyphs.

## 2 Related Works

**Image-to-image translation.** Image-to-image translation (I2I) aims to learn a mapping from a source domain to a target domain, requiring the transformation of images in the source domain into the target domain of the reference style while preserving their content. Pix2Pix [21] is the first I2I method that trains GANs [13] using paired data. CycleGAN [60] achieves unsupervised training through a cycle consistency loss, although it is limited to transformations between two domains. UNIT [29] enforces the latent codes of images from two distinct domains to be identical, while employing separate generators for images in each domain. This process embodies the concept of disentanglement. MUNIT [19] further refines UNIT's latent code into content and style codes, so that multimodal image translation can be achieved by combining the content code with different style codes.

While applying an image-to-image translation framework to font generation is currently the mainstream approach, we believe this to be inappropriate.
Unlike ordinary images, the boundaries between content and style in glyphs are ambiguous. For example, although handwriting will certainly differ when the same characters are written by different writers, their semantic meanings remain unchanged. Given that glyph features are challenging to disentangle, we utilize style-neutral IDSs to determine characters, thus avoiding any influence on the style of the results caused by insufficient disentanglement of the content glyphs.

**Few-shot font generation.** Few-shot learning [12] represents the prevailing research focus in font generation, aiming to simulate the target style with just a handful of reference glyphs. Font generation methods fall into two categories based on their utilization of the implicit structural information within glyphs. Methods treating glyphs as general images possess broader applicability, enabling generation across various languages. Conversely, methods leveraging structural information typically yield higher-quality outputs but are confined to a specific language.

Among the methods that do not incorporate structural information, EMD [58] stands out as the earliest attempt to disentangle glyphs into content and style features. DG-Font [54] employs deformable convolution [7] to capture glyph deformations. FontRL [31] uses reinforcement learning [45] to draw the skeleton of Chinese characters. FontDiffuser [55] models the font generation task as a noise-to-denoise paradigm. Shamir et al. [42] explore a parametric representation of oriental alphabets, which can elegantly balance glyph quality and compression. In vector font generation, DeepVecFont [51] exhaustively exploits the dual-modality information (raster images and draw-command sequences) of fonts to synthesize vector glyphs. CLIPFont [43] controls the desired font style through text descriptions rather than relying on style reference images.

In terms of methods that incorporate structural information, SA-VAE [44] utilizes the radicals and spatial structures of Chinese characters. CalliGAN [53] adopts the zi2zi framework [48] and fully decomposes Chinese characters into sequences of components. SC-Font [23] further decomposes Chinese characters into stroke granularity. DM-Font [4] proposes a dual memory to update component features continuously. LF-Font [38] represents component-wise style through low-rank matrix factorization [3]. MX-Font [37] automatically extracts features through multiple localized experts. FS-Font [47] demands that reference glyphs include all components of the target glyph; otherwise, generation quality may degrade. CG-GAN [25] employs a GRU [6] and an attention mechanism to predict component sequences. XMP-Font [30] performs multimodal pre-training on Chinese character stroke and glyph data. Most of the above methods are constrained by the content-style disentanglement paradigm. They often neglect the presence of Ideographic Description Characters (IDCs), which describe the spatial structure of Chinese characters, and thus suffer from artifacts and inconsistent styles.

**Vector quantized generative models.** Vector Quantization (VQ) typically follows a two-stage training scheme. Initially, it employs a codebook to record and update vectors, converting them from a continuous feature space to a discrete latent space. Subsequently, it models the distribution of these quantized vectors with a decoder to predict tokens, which are the codebook indices, and then restores the tokens to an image.
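To make the first stage of this two-stage scheme concrete, the snippet below is a minimal sketch of nearest-neighbor codebook lookup, turning a continuous feature map into discrete token indices; the tensor shapes and names are illustrative assumptions rather than the interface of any specific VQ implementation.

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbor vector quantization (illustrative sketch).

    features: (B, H, W, C) continuous encoder outputs.
    codebook: (K, C) learnable code vectors.
    Returns the token indices (B, H, W) and the quantized features (B, H, W, C).
    """
    b, h, w, c = features.shape
    flat = features.reshape(-1, c)                     # (B*H*W, C)
    dist = torch.cdist(flat, codebook)                 # distance to every code, (B*H*W, K)
    tokens = dist.argmin(dim=-1)                       # codebook indices, i.e. the tokens
    quantized = codebook[tokens].reshape(b, h, w, c)   # snap features to their nearest codes
    return tokens.reshape(b, h, w), quantized
```

In the second stage, a generative model (e.g., an autoregressive Transformer) is trained over these token indices, and the VQ decoder maps the predicted tokens back to pixels.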
VQ-VAE [35] is the first to incorporate quantization into the variational autoencoder (VAE) [24] framework. VQ-VAE-2 [41] performs multi-scale quantization and adopts rejection sampling [1]. VQGAN [9] acquires the codebook with the help of a GAN [13] and employs a Transformer [49] to replace the PixelCNN [34] used by VQ-VAE [35]. RQ-VAE [26] proposes a residual quantizer. BEiT [2] performs masked image modeling (MIM) on the patch view with the supervision of visual tokens. MaskGIT [5] directly models visual tokens and proposes parallel decoding. MAGE [27] is similar to MaskGIT [5], but introduces ViT [8] and contrastive learning [14]. DQ-VAE [18] further encodes images with variable-length codes. Since quantization is equivalent to tokenizing images, many methods attempt to enable multimodal generation. The dVAE proposed by DALL-E [40] relaxes the discrete sampling problem using Gumbel-Softmax [33, 22], outputting a probability distribution over codebook codes. SEED [11] designs a Causal Q-Former to extract image embeddings and quantize them. LQAE [28] trains VQ-VAE [35] to quantize the image directly into the frozen LLM codebook space. SPAE [56] introduces multi-layer, coarse-to-fine pyramid quantization and semantic guidance with CLIP [39]. V2L [61] further proposes global and local quantizers. Given the absence of a pre-trained model tailored for IDS, we directly concatenate visual tokens with IDS tokens to perform autoregression.

## 3 Method

Figure 2: Overview of our proposed method. The overall framework mainly consists of three parts: the IDS Hierarchical Analysis module $E_\iota$, the Structure-Style Aggregation block $E_r$, and a decoder $D$.

As shown in Fig. 2, given a target character $c_y$, reference characters $C_r = \{c_r^i\}_{i=1}^{k}$ and reference glyphs $G_r = \{g_r^i\}_{i=1}^{k}$, the goal of our framework is to generate a glyph $\hat{g}_y$ that conforms to the semantics of $c_y$ and has a style consistent with $G_r$. To achieve this objective, we analyze $c_y$ with IHA to derive its associated IDS $\iota_y$, which is then encoded as a semantic feature $f_\iota$. Likewise, we obtain the IDSs $I_r = \{\iota_r^i\}_{i=1}^{k}$ corresponding to $C_r$. Following this, we employ the similarity module $E_{sim}$ to assess the relationship between $\iota_y$ and $I_r$. Combined with $f_\iota$ and the output of $E_{sim}$, the features $F_r$ corresponding to $G_r$ are fused into the final style feature $f_r$ in the SSA block. $\iota_y$ is reshaped as the initial tokens $t_{<0}$, which are fed into the decoder $D$ along with $f_r$ for autoregressive modeling. Finally, the predicted glyph tokens $\hat{t}_y$ are decoded with the pre-trained VQGAN to obtain the generated glyph $\hat{g}_y$.

### 3.1 IDS Hierarchical Analysis

A simple alternative to using a source glyph as input is to directly employ the character itself to control the semantics of the output. However, considering the vast number of Chinese characters, this approach proves impractical due to its high cost. Moreover, it overlooks the structural intricacies of characters. The Ideographic Description Sequence is a structural description grammar for Chinese characters defined by the Unicode Standard, which combines description characters and basic components (mainly Chinese characters) through a prefix notation. Decomposing Chinese characters into their corresponding IDSs can notably streamline the vocabulary, allowing characters with similar structures or components to share common features. However, a Chinese character may have multiple equivalent IDSs.
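To make the prefix-notation decomposition concrete, the sketch below expands a character into an IDS by recursively consulting a decomposition table; the tiny table and function are hypothetical toys used only to illustrate the grammar, not the actual decomposition data used by IF-Font.

```python
# A handful of Ideographic Description Characters (IDCs); the Unicode block
# contains more (e.g. surround structures), omitted here for brevity.
IDCS = {"⿰", "⿱", "⿲", "⿳"}

# Hypothetical, heavily truncated decomposition table: each character maps to
# an IDC followed by its components, written in prefix notation.
DECOMP = {
    "福": ["⿰", "礻", "畐"],
    "畐": ["⿳", "一", "口", "田"],
}

def to_ids(char: str) -> list[str]:
    """Recursively expand a character into its IDS."""
    if char not in DECOMP:            # basic component: leave it as-is
        return [char]
    ids = []
    for symbol in DECOMP[char]:
        if symbol in IDCS:            # IDCs describe spatial structure
            ids.append(symbol)
        else:                         # components may decompose further
            ids.extend(to_ids(symbol))
    return ids

print("".join(to_ids("福")))          # ⿰礻⿳一口田
```

As described next, the IHA module additionally rewrites left-middle-right (⿲) and top-middle-bottom (⿳) structures into nested two-part forms, so a single character can yield several equivalent IDSs.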
Since many Chinese characters have a top-bottom or left-right structure, the IDCs follow a long-tail distribution, which presents challenges for model training. Fortunately, the left-middle-right structure of Chinese characters can be equivalently represented by two left-right structures. Similarly, the top-middle-bottom structure equals two top-bottom structures. Examples can be found in Fig. 3. Based on the above observation, we employ an IDS Hierarchical Analysis (IHA) module. Instead of rigidly querying the decomposition table when determining the IDS of a character, we examine whether the character follows a left-middle-right or top-middle-bottom structure. Subsequently, we construct multiple equivalent IDSs for the same character through random selection. To summarize, $c_y$ and $C_r$ are initially decomposed into $\iota_y$ and $I_r$, respectively. In the IDS encoder, $\iota_y$ is padded to the maximum sequence length $l_I$ and encoded into the associated semantic feature $f_\iota \in \mathbb{R}^{l \times c}$.

Figure 3: Illustration of equivalent IDSs.

Figure 4: Illustration of the fusion module $E_{fuse}$ of the SSA block.

### 3.2 Structure-Style Aggregation

Many previous methods [55, 10, 59, 25, 30, 54] overlook interactions between the reference and target characters during the extraction of reference styles, resulting in a lack of relevance in the extracted features. The more closely the reference character resembles the target character, the more effortlessly the generation process can preserve the style. Ideally, employing the target glyph itself as the reference, known as self-reconstruction, should yield the most effective output. Although FS-Font [47] endeavors to ensure that the reference characters cover all components of the target character, its implementation hinges on a predefined content-reference mapping, which may limit its adaptability. To address this issue, we propose a Structure-Style Aggregation (SSA) block, as shown in Fig. 2.

We convert the reference glyphs $G_r$ into the latent space of VQGAN and encode them one by one into the corresponding intermediate features $F_r = \{f_r^i \in \mathbb{R}^{h \times w \times c}\}_{i=1}^{k}$. The similarity module $E_{sim}$ evaluates the resemblance between each reference IDS in $I_r$ and the target IDS $\iota_y$, considering whether they share identical structures or components. It produces a set of similarity weights $\mathrm{Sim} = \{sim_i \in \mathbb{R}^1\}_{i=1}^{k}$, which guides the subsequent feature fusion process.

The fusion module $E_{fuse}$ consists of two branches, global and local style feature aggregation, as shown in Fig. 4. The global features mainly focus on the glyph layout, stroke thickness, and inclination, and can be obtained by merging the coarse style features $F_r$ with the similarity weights $\mathrm{Sim}$ obtained in the previous step:

$$f_{rg} = \mathrm{softmax}(\mathrm{Sim}) \cdot F_r \in \mathbb{R}^{h \times w \times c}. \tag{1}$$

While local features are more concerned with strokes, such as stroke length, stroke edges, and other nuances, we adopt cross-attention to gather the required style feature according to the needs of the target character:

$$
\begin{aligned}
F_s &= \mathrm{flatten2}(F_r) \in \mathbb{R}^{(k \cdot h \cdot w) \times c}, \\
Q &= \mathrm{LayerNorm}(L_q(f_\iota)) \in \mathbb{R}^{l \times c}, \\
K &= \mathrm{LayerNorm}(L_k(F_s)) \in \mathbb{R}^{(k \cdot h \cdot w) \times c}, \\
V &= L_v(F_s) \in \mathbb{R}^{(k \cdot h \cdot w) \times c}, \\
A &= \mathrm{dropout}\!\left(\frac{QK^\top}{\sqrt{c}}\right) \in \mathbb{R}^{l \times (k \cdot h \cdot w)}, \\
f_{rl} &= \mathrm{softmax}(A)V \in \mathbb{R}^{l \times c},
\end{aligned} \tag{2}
$$

where $\mathrm{flatten2}(\cdot)$ denotes flattening the first two dimensions of the feature, $L_q$, $L_k$, $L_v$ are linear projections, and $\mathrm{LayerNorm}(\cdot)$ denotes layer normalization. In Eq. 3, we obtain the aggregated style feature based on Eq. 1 and Eq. 2, where $\oplus$ denotes the concatenation operator:

$$f_r = \mathrm{LayerNorm}(f_{rg} \oplus f_{rl}). \tag{3}$$
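A minimal PyTorch sketch of this fusion step (Eqs. 1-3) is given below. It assumes the $k$ reference features have already been stacked into one tensor and the similarity weights have been produced by $E_{sim}$; the module layout, the dropout rate, and in particular how the two branches are concatenated in Eq. 3 are our assumptions for illustration, not the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative global/local style aggregation following Eqs. 1-3."""

    def __init__(self, c: int, p_drop: float = 0.1):
        super().__init__()
        self.lq, self.lk, self.lv = (nn.Linear(c, c) for _ in range(3))    # L_q, L_k, L_v
        self.norm_q, self.norm_k, self.norm_out = (nn.LayerNorm(c) for _ in range(3))
        self.drop = nn.Dropout(p_drop)

    def forward(self, f_r: torch.Tensor, sim: torch.Tensor, f_iota: torch.Tensor):
        # f_r:    (k, h, w, c) reference features from the VQGAN encoder
        # sim:    (k,)         similarity weights from E_sim
        # f_iota: (l, c)       semantic feature of the target IDS
        k, h, w, c = f_r.shape

        # Eq. 1: global branch, similarity-weighted sum over the k references.
        f_rg = torch.einsum("k,khwc->hwc", torch.softmax(sim, dim=0), f_r)

        # Eq. 2: local branch, cross-attention from the IDS feature to all reference patches.
        f_s = f_r.reshape(k * h * w, c)                   # flatten2
        q = self.norm_q(self.lq(f_iota))                  # (l, c)
        key = self.norm_k(self.lk(f_s))                   # (k*h*w, c)
        val = self.lv(f_s)                                # (k*h*w, c)
        attn = self.drop(q @ key.T / math.sqrt(c))        # (l, k*h*w)
        f_rl = torch.softmax(attn, dim=-1) @ val          # (l, c)

        # Eq. 3: concatenate and normalize. The concatenation axis is not fully
        # specified in the text; here we assume concatenation along the token
        # dimension after flattening the global map.
        return self.norm_out(torch.cat([f_rg.reshape(h * w, c), f_rl], dim=0))
```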
### 3.3 Style Contrast Enhancement

There are several strategies for maintaining style consistency: integrating a consistency loss [59, 25], introducing a discriminator to judge the generated style [25, 47, 36], or treating the extracted style feature as a variable for further optimization [50]. These approaches are indeed beneficial for improving generation quality, but they may be inflexible or introduce additional parameters. In this paper, we propose a streamlined approach named the Style Contrast Enhancement (SCE) module, which promotes the proximity of representations of the same style and the distance between representations of different styles. We apply a linear projection to the style feature $f_r$, resulting in a contrastive feature $e = \mathrm{MLP}(f_r)$. In one batch, we denote the indices of the contrastive features corresponding to all samples as $E = \{i \in \mathbb{N} \mid 0 \le i < 2N\}$, where $N$ represents the batch size. The size of $E$ is double the batch size $N$ because we utilize a momentum encoder [16]. Each sample $x_a$ within the batch is processed by both the encoder and the momentum encoder, yielding two outputs that serve as a positive pair. The negative sample set is defined as $E^- = \{i \in E \mid s(x_i) \neq s(x_a)\}$, while the positive sample set is $E^+ = \{i \in E \mid i \neq a, s(x_i) = s(x_a)\}$, where $s(\cdot)$ denotes the operator used to retrieve the corresponding style. The contrastive loss can be calculated as follows:

$$\mathcal{L}_{con} = -\log \frac{\sum_{p \in E^+} \exp(e_a^\top e_p / \tau)}{\sum_{p \in E^+} \exp(e_a^\top e_p / \tau) + \sum_{n \in E^-} \exp(e_a^\top e_n / \tau)}. \tag{4}$$

### 3.4 Generation

The decoder $D$ is provided with both the semantic feature $f_\iota$ and the style feature $f_r$. It treats $f_\iota$ as the initial tokens $t_{<0} = f_\iota$, and then predicts the distribution of the next token autoregressively as $p(t_i \mid t_{<i}, f_r)$.
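As a rough sketch of this autoregressive step, the loop below greedily predicts one glyph token at a time conditioned on the IDS prefix, the previously generated tokens, and the style feature. The decoder interface (`decoder(prefix, tokens, style)` returning next-token logits over the VQGAN codebook) and the greedy argmax are assumptions for illustration, not the paper's exact sampling procedure.

```python
import torch

@torch.no_grad()
def generate_tokens(decoder, f_iota: torch.Tensor, f_r: torch.Tensor,
                    num_tokens: int) -> torch.Tensor:
    """Greedy autoregressive prediction of the target glyph tokens (sketch).

    f_iota:     encoded IDS, acting as the initial tokens t_{<0}.
    f_r:        aggregated style feature from the SSA block.
    num_tokens: number of visual tokens in one glyph (the h*w VQ grid).
    """
    tokens = torch.empty(0, dtype=torch.long)       # nothing generated yet
    for _ in range(num_tokens):
        logits = decoder(f_iota, tokens, f_r)       # models p(t_i | t_{<i}, f_r)
        next_tok = logits.argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok])      # append t_i and continue
    return tokens                                   # decoded to a glyph by the frozen VQGAN
```

In practice, temperature or top-k sampling could replace the greedy argmax; the predicted token grid is then reshaped and mapped back to pixels by the pre-trained VQGAN decoder, as described in the overview.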