# Harmonizing Visual Text Comprehension and Generation

Zhen Zhao1,2,*, Jingqun Tang2,†, Binghong Wu2, Chunhui Lin2, Shu Wei2, Hao Liu2, Xin Tan1, Zhizhong Zhang1,3, Can Huang2, Yuan Xie1,†

1 East China Normal University  2 ByteDance  3 Shanghai Key Laboratory of Computer Software Evaluating and Testing, Shanghai, China

51255901056@stu.ecnu.edu.cn, tangjingqun@bytedance.com, yxie@cs.ecnu.edu.cn

*Work done when Zhen Zhao was an intern at ByteDance. †Corresponding authors.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

## Abstract

In this work, we present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text. Simultaneously generating images and texts typically results in performance degradation due to the inherent inconsistency between the vision and language modalities. To overcome this challenge, existing approaches resort to modality-specific data for supervised fine-tuning, necessitating distinct model instances. We propose Slide-LoRA, which dynamically aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space. Slide-LoRA harmonizes the generation of vision and language within a single model instance, thereby facilitating a more unified generative process. Additionally, we develop a high-quality image caption dataset, DetailedTextCaps-100K, synthesized with a sophisticated closed-source MLLM, to further enhance visual text generation capabilities. Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach. Empowered by Slide-LoRA, TextHarmony achieves performance comparable to modality-specific fine-tuning with only a 2% increase in parameters, and shows an average improvement of 2.5% on visual text comprehension tasks and 4.0% on visual text generation tasks. Our work delineates the viability of an integrated approach to multimodal generation within the visual text domain, setting a foundation for subsequent inquiries. Code is available at https://github.com/bytedance/TextHarmony.

## 1 Introduction

Visual text comprehension and generation tasks, such as scene text detection and recognition [79, 63, 40, 75, 62, 61, 24, 58], document understanding [64, 23], visual question answering (VQA) [26, 15, 27, 39, 59], key information extraction (KIE) [64, 23], multi-modal retrieval [33, 34, 30, 31, 36, 29, 32, 35, 28], and visual text generation, editing, and erasure [66, 6, 5], are of consistently significant value for both academic research and practical applications. Recently, remarkable advancements have been achieved in visual text comprehension and generation, driven by the evolution of Multimodal Large Language Models (MLLMs) and diffusion models. Foremost text-centric MLLMs [73, 20, 27, 39] utilize a cohesive framework to comprehend text-rich images comprehensively, whereas diffusion-based approaches [66, 6, 5] introduce innovative modifications to enhance visual text generation capabilities. As depicted in Figure 1, text-centric MLLMs and diffusion models handle the language and vision modalities adeptly, with MLLMs generating texts and diffusion models producing images. However, integrating language and vision generation capabilities within a single large multimodal model for visual text scenarios remains unexplored. This paper focuses on
the simultaneous manipulation of language and vision generation to further streamline the processing of diverse text-centric multimodal tasks.

[Figure 1 graphic: (a) comparison of model pipelines, contrasting TextMonkey (visual text comprehension only), TextDiffuser (visual text generation only), and TextHarmony (comprehension and generation); (b) example prompts and outputs for visual text comprehension, perception, generation, and editing tasks.]

Figure 1: Figure (a) illustrates the different types of image-text generation models: visual text comprehension models can only generate text, visual text generation models can only generate images, and TextHarmony can generate both text and images. Figure (b) illustrates the versatility of TextHarmony in generating different modalities for various text-centric tasks.

In the general multimodal domain, some pioneering efforts [57, 16, 80, 65] empower MLLMs with the ability to generate images in addition to text, vastly extending the versatility of multimodal models. Such advancements inspire us to develop a text-centric multimodal generative model. Our foundational model follows these approaches, incorporating a ViT-based image encoder, a text tokenizer, an LLM, a text detokenizer, and a diffusion-based image decoder. Previous works [57, 80, 65] and our pilot experiments (Figure 2) have shown that multimodal generation often leads to a notable decline in performance due to the substantial inconsistency between the language and vision modalities in the generation space. Prior studies [57, 16, 80, 65] commonly rely on modality-specific supervised fine-tuning to bolster generative capacities. A nuanced challenge involves boosting generative capabilities across modalities within a single model instance.

The Mixture-of-Experts (MoE) [54] architecture is widely adopted in LLMs because it improves a model's scalability while keeping inference cost similar to that of smaller models. Inspired by MoE-based models [18, 9] that set up task-specific experts to handle different tasks efficiently, we propose adopting modality-specific experts to partially decouple the generation of images and texts. However, transforming a dense multimodal generative framework into an MoE-based sparse model poses significant challenges due to the high computational demand and extensive training data requirements. To tackle these challenges, we instead incorporate Low-Rank Adaptation (LoRA) [22] experts. Specifically, we integrate multiple LoRA experts into the vision encoder and LLM components, encompassing modality-agnostic experts, vision-generation experts, and language-generation experts. Modality-specific experts are instrumental in refining and integrating modality-specific generative representations, while modality-agnostic experts enhance general representations. A dynamic gating network then amalgamates these modality-specific and general generative representations, assigning precise expert weights. Consequently, with a minimal parameter increase, we enhance image comprehension and generation capabilities within a single model instance. Slide-LoRA achieves results comparable to those obtained through separate modality-specific fine-tuning, demonstrating the efficacy of our approach in bridging the gap between the language and vision modalities in multimodal generation.
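To illustrate how such routing can be wired, the sketch below shows a Slide-LoRA-style layer in PyTorch: a frozen base linear projection augmented with two modality-specific LoRA experts and one modality-agnostic expert, combined by a small gating network. This is a minimal sketch under stated assumptions, not the released implementation; the class names, the per-token softmax gating, the zero-initialized up-projections, and the always-active agnostic expert are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """A standard low-rank adapter: delta(x) = scale * B(A(x)), with rank << hidden size."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)   # A: project down to the low rank
        self.up = nn.Linear(rank, d_out, bias=False)    # B: project back up
        nn.init.zeros_(self.up.weight)                  # adapters start as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale


class SlideLoRALinear(nn.Module):
    """Illustrative Slide-LoRA-style layer: a frozen base projection plus two
    modality-specific LoRA experts (language generation, vision generation) and
    one always-active modality-agnostic expert, mixed by a learned gate."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # only the adapters and the gate train
        d_in, d_out = base.in_features, base.out_features
        self.specific = nn.ModuleList(
            [LoRAExpert(d_in, d_out, rank) for _ in range(2)]  # [language, vision]
        )
        self.agnostic = LoRAExpert(d_in, d_out, rank)
        self.gate = nn.Linear(d_in, 2)                  # scores the two specific experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); soft routing between the modality-specific experts
        weights = F.softmax(self.gate(x), dim=-1)       # (batch, seq, 2)
        specific = sum(
            w.unsqueeze(-1) * expert(x)
            for w, expert in zip(weights.unbind(dim=-1), self.specific)
        )
        return self.base(x) + specific + self.agnostic(x)


# Illustrative usage: wrap a projection inside the vision encoder or the LLM.
layer = SlideLoRALinear(nn.Linear(1024, 1024), rank=8)
out = layer(torch.randn(2, 77, 1024))                  # (2, 77, 1024)
```

Because each expert contributes only two rank-r matrices per wrapped projection, the trainable overhead stays a small fraction of the frozen backbone, which is consistent with the roughly 2% parameter increase reported for Slide-LoRA.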
[Figure 2 chart: text generation benchmarks (DocVQA, TabFact, TextVQA) and image generation benchmarks (AnyText-Bench, NED; MARIO-Eval, CLIP score) under Uni-Modal Output, Multi-Modal Output, and TextHarmony.]

Figure 2: Comparison of uni-modal and multi-modal output performance on text generation and image generation tasks. Uni-Modal Output represents the results achieved by modality-specific supervised fine-tuning. Multi-Modal Output represents the results achieved by modality-independent supervised fine-tuning. Compared to the uni-modal output, a major performance degradation is observed in the multi-modal output for both text generation and image generation tasks.

Despite the strides made with Slide-LoRA in harmonizing comprehension and generation, the image quality produced by TextHarmony still requires improvement. High-quality, detailed image caption data is crucial for visual text-centric image generation tasks. Thus, a detailed image caption dataset, DetailedTextCaps-100K, is created using an advanced closed-source MLLM with prompt engineering. By incorporating DetailedTextCaps-100K in the supervised fine-tuning stage, TextHarmony's image generation quality improves significantly over using the original simple captions alone.

Utilizing the above approaches, TextHarmony is versatile across visual text-centric tasks. In visual text perception, TextHarmony performs well in text detection and recognition, achieving state-of-the-art performance in text grounding. In visual text comprehension, TextHarmony achieves performance comparable to dedicated text comprehension models. In image generation, TextHarmony matches the performance of dedicated visual text generation models.

The contributions of this paper are threefold:

- We introduce TextHarmony, a versatile large multimodal model that unifies diverse visual text perception, comprehension, and generation tasks. TextHarmony performs comparably to specialized models in visual text perception, comprehension, generation, and editing.
- To mitigate the inconsistency between the vision and language modalities in the generative space, we propose Slide-LoRA, which dynamically aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generative space. With Slide-LoRA, TextHarmony achieves performance comparable to modality-specific fine-tuning by adding only 2% parameters within a single model instance.
- A high-quality dataset of detailed visual text image captions (DetailedTextCaps-100K) is constructed with a closed-source MLLM to enhance the performance of visual text generation.

## 2 Related Work

### 2.1 Visual Text Comprehension

Recent multi-modal large language models have increasingly focused on comprehending images with textual information [26, 39, 15, 14, 72, 69, 41, 76, 60, 53, 67]. Among them, UniDoc [15] and mPLUG-DocOwl [72] create novel text-oriented instruction-following datasets.
mPLUG-DocOwl 1.5 [21] builds a parsing system for visual text and employs large-scale pre-training. Monkey [26], DocPedia [14], and HRVDA [27] comprehend dense text by supporting higher image resolutions. TextMonkey [39] adopts shifted window attention and filters the image tokens to retain the significant ones.

### 2.2 Visual Text Generation

Diffusion-based text-to-image generation has recently achieved impressive progress [19, 50, 77, 52, 48, 7], while rendering accurate, coherent text in images remains an open problem. DiffUTE [4] designs a text editing model through fine-grained glyph and position information control. GlyphControl [71] generates visual text images by first rendering the glyphs and then performing the denoising process. TextDiffuser [6] builds a large-scale text image dataset with OCR annotations and generates visual text conditioned on character-level layouts. Further, TextDiffuser-2 [5] leverages a language model as the layout planner, relieving the need for character-level guidance. AnyText [66] integrates the text-control diffusion pipeline with an auxiliary latent module and a text embedding module, achieving remarkable success in multilingual visual text generation.

### 2.3 Unified Multi-Modal Comprehension and Generation

Current multi-modal large language models (MLLMs) [82, 2, 1, 11] largely use next-text-token prediction as the training objective but exert no supervision over visual data [57]. Recent studies [57, 56, 65, 80, 16, 17, 70] have attempted to empower MLLMs to generate visual elements. The Emu family [57, 56] learns with a predict-the-next-element objective across modalities and decodes the regressed visual embeddings with a visual decoder. MiniGPT-5 [80] introduces a fixed number of special tokens into the LLM's vocabulary as generative tokens for images. SEED-LLaMA [16] proposes the SEED tokenizer, which produces discrete visual codes with causal dependency and high-level semantics. MM-Interleaved [65] extracts fine-grained visual details from the multi-scale feature maps of multiple images, proving effective for generating interleaved image-text sequences. All of these works focus on generic multimodal generation, whereas no such work yet exists in the visual text domain.

## 3 Methodology

### 3.1 Model Architecture

Figure 3 presents an overview of TextHarmony. The backbone network follows the paradigm established by MM-Interleaved [65], in which a Vision Encoder, an LLM, and an Image Decoder are integrated to empower the model to generate both visual and textual content. Specifically, the image embedding extracted by the Vision Encoder is abstracted by a Q-Former [25] and aligned with the text tokens. Image and text tokens are then concatenated and forwarded through the LLM. Each output token of the LLM either predicts text content or serves as the conditional input for image detokenization. The Image Decoder perceives this conditional input from the LLM and generates images through the denoising diffusion process [19].

Given the multi-modal input X, the multi-modal generation task involves producing an interleaved token sequence Y, which can be detokenized into both image and text content. Token generation is performed by maximizing the conditional probability under the classic auto-regressive paradigm:

$$P(Y \mid X) = \prod_{l=1}^{L} p(Y_l \mid X, Y_{<l})$$
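To make the objective concrete, below is a minimal PyTorch sketch of the corresponding token-level loss for an interleaved sequence. The function name, the shift-by-one convention, and the ignore-index handling of padded positions are illustrative assumptions, not the paper's released training code; the diffusion-based supervision of the Image Decoder is omitted.

```python
import torch
import torch.nn.functional as F


def autoregressive_nll(logits: torch.Tensor, targets: torch.Tensor, ignore_id: int = -100) -> torch.Tensor:
    """Token-level NLL for P(Y|X) = prod_{l=1..L} p(Y_l | X, Y_<l).

    logits:  (batch, L, vocab) next-token distributions from the LLM, already
             conditioned on the multimodal prefix X and the preceding tokens.
    targets: (batch, L) ground-truth interleaved token ids Y; positions set to
             ignore_id (e.g. padding) do not contribute to the loss.
    """
    # Shift by one so the prediction at position l-1 is scored against Y_l.
    shifted_logits = logits[:, :-1, :].contiguous()
    shifted_targets = targets[:, 1:].contiguous()
    return F.cross_entropy(
        shifted_logits.view(-1, shifted_logits.size(-1)),
        shifted_targets.view(-1),
        ignore_index=ignore_id,
    )


# Illustrative usage with random tensors (shapes only, no real model):
if __name__ == "__main__":
    batch, seq_len, vocab = 2, 16, 32000
    logits = torch.randn(batch, seq_len, vocab)
    targets = torch.randint(0, vocab, (batch, seq_len))
    print(autoregressive_nll(logits, targets))
```

In the full model, the logits would come from the LLM conditioned on the Q-Former image tokens and the preceding text, while image positions in Y would additionally provide the conditional input for the diffusion-based Image Decoder.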