Published in Transactions on Machine Learning Research (05/2024)

Multimodal Chain-of-Thought Reasoning in Language Models

Zhuosheng Zhang zhangzs@sjtu.edu.cn
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University

Aston Zhang az@astonzhang.com
GenAI, Meta

Mu Li muli@cs.cmu.edu
Amazon Web Services

Hai Zhao zhaohai@cs.sjtu.edu.cn
Department of Computer Science and Engineering, Shanghai Jiao Tong University

George Karypis gkarypis@amazon.com
Amazon Web Services

Alex Smola alex@smola.org
Amazon Web Services

Reviewed on OpenReview: https://openreview.net/forum?id=y1pPWFVfvR

Abstract

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT, which incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on the ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.

1 Introduction

Imagine reading a textbook with no figures or tables. Our ability to acquire knowledge is greatly strengthened by jointly modeling diverse data modalities, such as vision, language, and audio.
Recently, large language models (LLMs) (Brown et al., 2020; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022) have shown impressive performance in complex reasoning by generating intermediate reasoning steps before inferring the answer. This intriguing technique is called chain-of-thought (CoT) reasoning (Wei et al., 2022b; Kojima et al., 2022; Zhang et al., 2023d).

Work done at Amazon Web Services. Correspondence to: Zhuosheng Zhang and Aston Zhang.

Figure 1: Example of the multimodal CoT task. Question: Will these magnets attract or repel each other? Context: Two magnets are placed as shown. Hint: Magnets that attract pull together. Magnets that repel push apart. Options: (A) attract (B) repel. Rationale: Will these magnets attract or repel? To find out, look at which poles are closest to each other. The north pole of one magnet is closest to the south pole of the other magnet. Poles that are different attract. So, these magnets will attract each other. Answer: The answer is (A).

However, existing studies related to CoT reasoning are largely isolated in the language modality (Wang et al., 2022c; Zhou et al., 2022; Lu et al., 2022b; Fu et al., 2022), with little consideration of multimodal scenarios. To elicit CoT reasoning in multimodality, we advocate a Multimodal-CoT paradigm. Given inputs in different modalities, Multimodal-CoT decomposes multi-step problems into intermediate reasoning steps (rationale) and then infers the answer. Since vision and language are the most popular modalities, we focus on these two modalities in this work. An example is shown in Figure 1.

In general, Multimodal-CoT reasoning can be elicited through two primary paradigms: (i) prompting LLMs and (ii) fine-tuning smaller models.1 We will delve into these paradigms and delineate their associated challenges as follows.
The most immediate way to perform Multimodal-CoT is to transform the input of different modalities into a unified modality and prompt LLMs to perform CoT (Zhang et al., 2023a; Lu et al., 2023; Liu et al., 2023; Alayrac et al., 2022; Hao et al., 2022; Yasunaga et al., 2022). For example, it is possible to generate a caption for an image with a captioning model and then concatenate the caption with the original language input to be fed into LLMs (Lu et al., 2022a). The development of large multimodal models such as GPT-4V (OpenAI, 2023) and Gemini (Reid et al., 2024) has notably enhanced the quality of generated captions, resulting in finer-grained and more detailed descriptions. However, the captioning process still incurs significant information loss when transforming vision signals into textual descriptions. Consequently, using image captions rather than vision features may suffer from a lack of mutual synergy in the representation space of different modalities. In addition, LLMs are either behind paywalls or resource-consuming to deploy locally.

To facilitate the interaction between modalities, another potential solution is to fine-tune smaller language models (LMs) by fusing multimodal features (Zhang et al., 2023c; Zhao et al., 2023). As this approach allows the flexibility of adjusting model architectures to incorporate multimodal features, we study fine-tuning models in this work instead of prompting LLMs. The key challenge is that language models under 100 billion parameters tend to generate hallucinated rationales that mislead the answer inference (Ho et al., 2022; Magister et al., 2022; Ji et al., 2022; Zhang et al., 2023b).

To mitigate the challenge of hallucination, we propose Multimodal-CoT, which incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference.2 In this way, answer inference can leverage better generated rationales that are based on multimodal information.
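As a sketch of the caption-based prompting pipeline described above: a captioning model first describes the image, and the caption is simply concatenated with the language input before prompting a text-only LLM. The template wording below (and the `caption` argument standing in for a captioning model's output) is illustrative, not the authors' exact format.

```python
# Minimal sketch of the caption-then-prompt Multimodal-CoT baseline:
# the image is replaced by a textual caption, so a text-only LLM can be
# prompted to perform CoT over the (now unified) language modality.

def build_cot_prompt(question: str, options: list[str], caption: str) -> str:
    """Concatenate an image caption with the language input for a text-only LLM."""
    option_text = " ".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        f"Image caption: {caption}\n"
        f"Question: {question}\n"
        f"Options: {option_text}\n"
        "Let's think step by step."  # Zero-Shot-CoT trigger (Kojima et al., 2022)
    )
```

Note that whatever visual detail the captioner omits (e.g., which poles of two magnets face each other) is lost to the LLM entirely, which is the information-loss problem motivating feature-level fusion.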
Our experiments were conducted on the ScienceQA (Lu et al., 2022a) and A-OKVQA (Schwenk et al., 2022) datasets, which are the latest multimodal reasoning benchmarks with annotated reasoning chains. Our method achieved state-of-the-art performance on the ScienceQA benchmark upon its release. We find that Multimodal-CoT is beneficial for mitigating hallucination and boosting convergence. Our contributions are summarized as follows:

(i) To the best of our knowledge, this work is the first to study CoT reasoning in different modalities in scientific peer-reviewed literature.

(ii) We propose a two-stage framework that fine-tunes language models to fuse vision and language representations to perform Multimodal-CoT. The model is able to generate informative rationales that facilitate inferring final answers.

(iii) We analyze why the naive way of employing CoT fails in this context and how incorporating vision features alleviates the problem. The approach has been shown to be generally effective across tasks and backbone models.

1 We refer to small models as models with less than 1 billion parameters (hereinafter dubbed 1B-models).
2 This work focuses on the language and vision modalities.

Table 1: Representative CoT techniques (FT: fine-tuning; KD: knowledge distillation). Segment 1: in-context learning techniques; Segment 2: fine-tuning techniques. To the best of our knowledge, our work is the first to study CoT reasoning in different modalities in scientific peer-reviewed literature. Besides, we focus on 1B-models, without relying on the outputs of LLMs.
| Models | Multimodal | Model / Engine | Training | CoT Role | CoT Source |
|---|---|---|---|---|---|
| Zero-Shot-CoT (Kojima et al., 2022) | ✗ | GPT-3.5 (175B) | ICL | Reasoning | Template |
| Few-Shot-CoT (Wei et al., 2022b) | ✗ | PaLM (540B) | ICL | Reasoning | Hand-crafted |
| Self-Consistency-CoT (Wang et al., 2022b) | ✗ | Codex (175B) | ICL | Reasoning | Hand-crafted |
| Least-to-Most Prompting (Zhou et al., 2022) | ✗ | Codex (175B) | ICL | Reasoning | Hand-crafted |
| Retrieval-CoT (Zhang et al., 2023d) | ✗ | GPT-3.5 (175B) | ICL | Reasoning | Auto-generated |
| PromptPG-CoT (Lu et al., 2022b) | ✗ | GPT-3.5 (175B) | ICL | Reasoning | Hand-crafted |
| Auto-CoT (Zhang et al., 2023d) | ✗ | Codex (175B) | ICL | Reasoning | Auto-generated |
| Complexity-CoT (Fu et al., 2022) | ✗ | GPT-3.5 (175B) | ICL | Reasoning | Hand-crafted |
| Few-Shot-PoT (Chen et al., 2022) | ✗ | GPT-3.5 (175B) | ICL | Reasoning | Hand-crafted |
| UnifiedQA (Lu et al., 2022a) | ✗ | T5 (770M) | FT | Explanation | Crawled |
| Fine-Tuned T5 XXL (Magister et al., 2022) | ✗ | T5 (11B) | KD | Reasoning | LLM-generated |
| Fine-Tune-CoT (Ho et al., 2022) | ✗ | GPT-3 (6.7B) | KD | Reasoning | LLM-generated |
| Multimodal-CoT (our work) | ✓ | T5 (770M) | FT | Reasoning | Crawled |

2 Background

This section reviews studies eliciting CoT reasoning by prompting and fine-tuning language models.

2.1 CoT Reasoning with LLMs

Recently, CoT has been widely used to elicit the multi-step reasoning abilities of LLMs (Wei et al., 2022b). Concretely, CoT techniques encourage the LLM to generate intermediate reasoning chains for solving a problem. Studies have shown that LLMs can perform CoT reasoning with two major paradigms of techniques: Zero-Shot-CoT (Kojima et al., 2022) and Few-Shot-CoT (Wei et al., 2022b; Zhang et al., 2023d). For Zero-Shot-CoT, Kojima et al. (2022) showed that LLMs are decent zero-shot reasoners when a prompt like "Let's think step by step" is added after the test question to invoke CoT reasoning. For Few-Shot-CoT, a few step-by-step reasoning demonstrations are used as conditions for inference. Each demonstration has a question and a reasoning chain that leads to the final answer.
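Such a demonstration pairs a question with a reasoning chain that ends in the final answer; the test question is appended last with its answer left open. A minimal sketch of assembling this kind of prompt (the Q/A template is illustrative, not a template from any of the cited papers):

```python
# Sketch of a Few-Shot-CoT prompt: each demonstration is a question plus a
# reasoning chain ending in the answer; the test question comes last so the
# LLM continues with its own reasoning chain.

def few_shot_cot_prompt(demonstrations: list[tuple[str, str]], test_question: str) -> str:
    """demonstrations: (question, reasoning_chain_with_final_answer) pairs."""
    blocks = [f"Q: {q}\nA: {chain}" for q, chain in demonstrations]
    blocks.append(f"Q: {test_question}\nA:")  # left open for the model to complete
    return "\n\n".join(blocks)
```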
The demonstrations are commonly obtained by hand-crafting or automatic generation; these two techniques are thus referred to as Manual-CoT (Wei et al., 2022b) and Auto-CoT (Zhang et al., 2023d), respectively. With effective demonstrations, Few-Shot-CoT often achieves stronger performance than Zero-Shot-CoT and has attracted more research interest. Therefore, most recent studies have focused on how to improve Few-Shot-CoT. These studies fall into two major research lines: (i) optimizing the demonstrations; (ii) optimizing the reasoning chains. Table 1 compares typical CoT techniques.

Optimizing Demonstrations. The performance of Few-Shot-CoT relies on the quality of demonstrations. As reported in Wei et al. (2022b), using demonstrations written by different annotators results in dramatic accuracy disparity in reasoning tasks. Beyond hand-crafting the demonstrations, recent studies have investigated ways to optimize the demonstration selection process. Notably, Rubin et al. (2022) retrieved demonstrations semantically similar to the test instance. However, this approach shows degraded performance when there are mistakes in the reasoning chains (Zhang et al., 2023d). To address this limitation, Zhang et al. (2023d) found that the key is the diversity of demonstration questions and proposed Auto-CoT: (i) partition the questions of a given dataset into a few clusters; (ii) sample a representative question from each cluster and generate its reasoning chain using Zero-Shot-CoT with simple heuristics. In addition, reinforcement learning (RL) and complexity-based selection strategies were proposed to obtain effective demonstrations. Fu et al. (2022) chose examples with complex reasoning chains (i.e., with more reasoning steps) as the demonstrations. Lu et al.
(2022b) trained an agent to find optimal in-context examples from a candidate pool, maximizing the prediction rewards on given training examples when interacting with GPT-3.5.

Optimizing Reasoning Chains. A notable way to optimize reasoning chains is problem decomposition. Zhou et al. (2022) proposed least-to-most prompting to decompose complex problems into sub-problems and then solve these sub-problems sequentially. As a result, solving a given sub-problem is facilitated by the answers to previously solved sub-problems. Similarly, Khot et al. (2022) used diverse decomposition structures and designed different prompts to answer each sub-question. In addition to prompting the reasoning chains as natural language texts, Chen et al. (2022) proposed program-of-thoughts (PoT), which models the reasoning process as a program and prompts LLMs to derive the answer by executing the generated programs. Another trend is to vote over multiple reasoning paths for a test question. Wang et al. (2022b) introduced a self-consistency decoding strategy that samples multiple outputs of LLMs and then takes a majority vote over the final answers. Wang et al. (2022c) and Li et al. (2022c) introduced randomness in the input space to produce more diverse outputs for voting.

2.2 Eliciting CoT Reasoning by Fine-Tuning Models

A recent line of interest is eliciting CoT reasoning by fine-tuning language models. Lu et al. (2022a) fine-tuned the encoder-decoder T5 model on a large-scale dataset with CoT annotations. However, a dramatic performance decline is observed when CoT is used to infer the answer, i.e., generating the reasoning chain before the answer (reasoning). Instead, CoT is only used as an explanation after the answer. Magister et al. (2022) and Ho et al. (2022) employed knowledge distillation by fine-tuning a student model on chain-of-thought outputs generated by a larger teacher model. Wang et al.
(2022a) proposed an iterative context-aware prompting approach to dynamically synthesize prompts conditioned on the current step's contexts.

There is a key challenge in training 1B-models to be CoT reasoners. As observed by Wei et al. (2022b), models under 100 billion parameters tend to produce illogical CoT that leads to wrong answers. In other words, it might be harder for 1B-models to generate effective CoT than to directly generate the answer. It becomes even more challenging in a multimodal setting, where answering the question also requires understanding the multimodal inputs. In the following, we explore the challenge of Multimodal-CoT and investigate how to perform effective multi-step reasoning.

3 Challenge of Multimodal-CoT

Existing studies have suggested that CoT reasoning ability may emerge in language models at a certain scale, e.g., over 100 billion parameters (Wei et al., 2022a). However, it remains an unresolved challenge to elicit such reasoning abilities in 1B-models, let alone in the multimodal scenario. This work focuses on 1B-models as they can be fine-tuned and deployed with consumer-grade GPUs (e.g., 32GB memory). In this section, we investigate why 1B-models fail at CoT reasoning and study how to design an effective approach to overcome the challenge.

3.1 Towards the Role of CoT

To begin with, we fine-tune a text-only baseline for CoT reasoning on the ScienceQA benchmark (Lu et al., 2022a). We adopt FLAN-Alpaca-Base as the backbone language model.3 Our task is modeled as a text generation problem, where the model takes the textual information as the input and generates an output sequence that consists of the rationale and the answer.

Table 2: Effects of CoT in the one-stage setting.
| Method | Format | Accuracy |
|---|---|---|
| No-CoT | QCM→A | 81.63 |
| Reasoning | QCM→RA | 69.32 |
| Explanation | QCM→AR | 69.68 |

As in the example shown in Figure 1, the model takes the concatenation of tokens of the question text (Q), the context text (C), and multiple options (M) as the input. To study the effect of CoT, we compare the performance of three variants: (i) No-CoT, which predicts the answer directly (QCM→A); (ii) Reasoning, where answer inference is conditioned on the rationale (QCM→RA); (iii) Explanation, where the rationale is used to explain the answer inference (QCM→AR).

3 https://github.com/declare-lab/flan-alpaca. It is a 200M T5 model (Raffel et al., 2020) fine-tuned on Stanford Alpaca data (Taori et al., 2023). Implementation details are presented in Section 5.2.

Question: Will these magnets attract or repel each other? Context: Two magnets are placed as shown. Hint: Magnets that attract pull together. Magnets that repel push apart. Options: (A) attract (B) repel

Gold rationale: Will these magnets attract or repel? To find out, look at which poles are closest to each other. The north pole of one magnet is closest to the south pole of the other magnet. Poles that are different attract. So, these magnets will attract each other. Answer: The answer is (A).

Generated rationale (baseline): Will these magnets attract or repel? To find out, look at which poles are closest to each other. The south pole of one magnet is closest to the south pole of the other magnet. Poles that are the same repel. So, these magnets will repel each other. Answer: The answer is (B).

Generated rationale (with vision features): identical to the gold rationale. Answer: The answer is (A).
Figure 2: Example of the two-stage framework without vision features (baseline) and with vision features (ours) for generating rationales and predicting answers. The upper part presents the problem details with a gold rationale, and the lower part shows the outputs of the baseline and our method incorporating vision features. We observe that the baseline fails to predict the right answer because it is misled by hallucinated rationales. More examples are shown in Appendix A.1.

Surprisingly, as shown in Table 2, we observe a 12.31% accuracy decrease (81.63% → 69.32%) if the model predicts rationales before answers (QCM→RA). The results imply that the rationales might not necessarily contribute to predicting the right answer. According to Lu et al. (2022a), a plausible reason might be that the model exceeds the maximum token limit before producing the required answer, or stops generating the prediction early. However, we find that the maximum length of the generated outputs (RA) is always less than 400 tokens, which is below the length limit of language models (i.e., 512 in T5 models). Therefore, why the rationales harm answer inference deserves a more in-depth investigation.

3.2 Misleading by Hallucinated Rationales

Table 3: Two-stage setting of (i) rationale generation (ROUGE-L) and (ii) answer inference (Accuracy).

| Method | (i) QCM→R | (ii) QCMR→A |
|---|---|---|
| Two-Stage Framework | 90.73 | 78.57 |
| w/ Captions | 90.88 | 79.37 |
| w/ Vision Features | 93.46 | 85.31 |

To dive into how the rationales affect the answer prediction, we separate the CoT problem into two stages: rationale generation and answer inference.4 We report the ROUGE-L score and accuracy for rationale generation and answer inference, respectively. Table 3 shows the results based on the two-stage framework. Although the two-stage baseline model achieves a 90.73 ROUGE-L score for rationale generation, the answer inference accuracy is only 78.57%.
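The input/output formats compared in Table 2 differ only in how the target sequence is composed from the rationale (R) and the answer (A). A minimal sketch of the three variants; the field labels and separators are illustrative, not the authors' exact templates:

```python
# Sketch of the three one-stage formats from Table 2: the source text is always
# the question (Q), context (C), and options (M); only the target differs.

def format_example(q, c, m, rationale, answer, variant="QCM-A"):
    """Return (input_text, target_text) for one training example."""
    source = f"Question: {q}\nContext: {c}\nOptions: {m}"
    if variant == "QCM-A":      # No-CoT: predict the answer directly
        target = answer
    elif variant == "QCM-RA":   # Reasoning: rationale before the answer
        target = f"{rationale} Answer: {answer}"
    elif variant == "QCM-AR":   # Explanation: answer before the rationale
        target = f"Answer: {answer} {rationale}"
    else:
        raise ValueError(variant)
    return source, target
```

Under QCM→RA, any hallucination in the generated rationale sits in the decoding prefix of the answer, which is how a flawed rationale can drag the answer down, as the error analysis below shows.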
Compared with the QCM→A variant (81.63%) in Table 2, this result shows that the rationale generated in the two-stage framework does not improve answer accuracy. We then randomly sample 50 error cases and find that the model tends to generate hallucinated rationales that mislead the answer inference. In the example shown in Figure 2, the model (left part) hallucinates that "The south pole of one magnet is closest to the south pole of the other magnet", due to the lack of reference to the vision content. We find that such mistakes occur at a ratio of 56% among the error cases (Figure 3(a)).

3.3 Multimodality Contributes to Effective Rationales

We speculate that this phenomenon of hallucination is due to a lack of the vision contexts necessary for performing effective Multimodal-CoT. To inject vision information, a simple way is to transform the image into a caption (Lu et al., 2022a) and then append the caption to the input of both stages.

4 The details will be presented in Section 4.

Figure 3: The ratio of (a) hallucination mistakes and (b) correction rate w/ vision features.

However, as shown in Table 3, using captions only yields marginal performance gains (+0.80%). We then explore an advanced technique that incorporates vision features into the language model. Concretely, we feed the image to the ViT model (Dosovitskiy et al., 2021b) to extract vision features. We then fuse the vision features with the encoded language representations before feeding them to the decoder (more details will be presented in Section 4). Interestingly, with vision features, the ROUGE-L score of rationale generation is boosted to 93.46 (QCM→R), which correspondingly contributes to a better answer accuracy of 85.31% (QCMR→A).
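One plausible instantiation of this fusion step is gated cross-attention: the encoded language states attend over projected ViT patch features, and a learned sigmoid gate mixes the attended vision context back into the language representation before decoding. The sketch below is illustrative (the gating form and weight shapes are assumptions made here, not necessarily the paper's exact design):

```python
# Sketch of fusing patch-level vision features with encoder states before the
# decoder. h_lang: (n, d) encoded language states; h_vision: (m, d) projected
# ViT patch features; w_l, w_v: (d, d) gate projections (illustrative shapes).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def fuse(h_lang, h_vision, w_l, w_v):
    """Language states attend over vision patches; a sigmoid gate then mixes
    the attended vision context back into the language representation."""
    attn = softmax(h_lang @ h_vision.T / np.sqrt(h_lang.shape[-1]))  # (n, m)
    h_attn = attn @ h_vision                                         # (n, d)
    gate = 1 / (1 + np.exp(-(h_lang @ w_l + h_attn @ w_v)))          # (n, d)
    return (1 - gate) * h_lang + gate * h_attn  # fused states fed to the decoder
```

The gate lets the model fall back to pure language features for tokens where the image is uninformative, which matters since many ScienceQA questions are text-only.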
With those effective rationales, the phenomenon of hallucination is mitigated: 60.7% of the hallucination mistakes in Section 3.2 have been corrected (Figure 3(b)), as shown in the example in Figure 2 (right part).5 This analysis compellingly shows that vision features are indeed beneficial for generating effective rationales and contribute to accurate answer inference. As the two-stage method achieves better performance than one-stage methods, we choose the two-stage method for our Multimodal-CoT framework.

4 Multimodal-CoT

In light of the discussions in Section 3, we propose Multimodal-CoT. The key motivation is the anticipation that answer inference can leverage better generated rationales that are based on multimodal information. In this section, we overview the procedure of the framework and elaborate on the technical design of the model architecture.

Figure 4: Overview of our Multimodal-CoT framework. Multimodal-CoT consists of two stages: (i) rationale generation and (ii) answer inference. Both stages share the same model structure but differ in the input and output. In the first stage, we feed the model with language and vision inputs to generate rationales. In the second stage, we append the original language input with the rationale generated in the first stage. Then, we feed the updated language input with the original vision input to the model to infer the answer. (The figure uses the magnets example: rationale generation produces "...Poles that are different attract. So, these magnets will attract each other."; answer inference produces "The answer is (A).")
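The two-stage procedure in Figure 4 can be sketched end to end as follows; `rationale_model` and `answer_model` are hypothetical callables standing in for the two independently trained seq2seq models, and the space-joined concatenation is illustrative.

```python
# Sketch of two-stage Multimodal-CoT inference: stage (i) generates a rationale
# from the language and vision inputs; the rationale is appended to the language
# input before stage (ii) infers the answer from the same vision input.

def multimodal_cot(x_language, x_vision, rationale_model, answer_model):
    rationale = rationale_model(x_language, x_vision)  # stage (i): R = F(X)
    x_language_2 = f"{x_language} {rationale}"         # append R to the language input
    answer = answer_model(x_language_2, x_vision)      # stage (ii): A = F(X')
    return rationale, answer
```

Both stages see the vision input, so the answer model can still ground its prediction in the image even if the generated rationale is imperfect.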
4.1 Framework Overview

Multimodal-CoT consists of two operation stages: (i) rationale generation and (ii) answer inference. Both stages share the same model structure but differ in the input $X$ and output $Y$. The overall procedure is illustrated in Figure 4. We take vision-language as an example to show how Multimodal-CoT works.

In the rationale generation stage, we feed the model with $X = \{X^1_{\text{language}}, X_{\text{vision}}\}$, where $X^1_{\text{language}}$ represents the language input in the first stage and $X_{\text{vision}}$ represents the vision input, i.e., the image. For example, $X$ can be instantiated as a concatenation of the question, context, and options of a multiple-choice reasoning problem (Lu et al., 2022a), as shown in Figure 4. The goal is to learn a rationale generation model $R = F(X)$, where $R$ is the rationale.

In the answer inference stage, the rationale $R$ is appended to the original language input $X^1_{\text{language}}$ to construct the language input of the second stage, $X^2_{\text{language}} = X^1_{\text{language}} \circ R$, where $\circ$ denotes concatenation. Then, we feed the updated input $X' = \{X^2_{\text{language}}, X_{\text{vision}}\}$ to the answer inference model to infer the final answer $A = F(X')$.

In both stages, we train two models with the same architecture independently. They take the annotated elements (e.g., $X \rightarrow R$ and $XR \rightarrow A$, respectively) from the training set for supervised learning. During inference, given $X$, the rationales for the test sets are generated using the model trained in the first stage; they are used in the second stage for answer inference.

5 The remaining mistakes are mainly about map understanding, which requires extra commonsense signals (Section 6.7).

4.2 Model Architecture

Given the language input $X_{\text{language}} \in \{X^1_{\text{language}}, X^2_{\text{language}}\}$ and vision input $X_{\text{vision}}$, we compute the probability of generating the target text $Y$ (either the rationale or the answer in Figure 4) of length $N$ by

$$p(Y \mid X_{\text{language}}, X_{\text{vision}}) = \prod_{i=1}^{N} p_\theta\left(Y_i \mid X_{\text{language}}, X_{\text{vision}}, Y_{<i}\right)$$
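As a toy numerical illustration of this factorized likelihood: the sequence probability is the product of per-token conditional probabilities, which is usually accumulated in log space for numerical stability. The helper below is illustrative only.

```python
# Toy illustration of the autoregressive factorization above: multiply the
# per-step conditional probabilities (in log space) to get the sequence
# probability p(Y | X_language, X_vision).
import numpy as np

def sequence_log_prob(step_probs: list[float]) -> float:
    """step_probs[i] = p_theta(Y_i | X_language, X_vision, Y_<i)."""
    return float(np.sum(np.log(step_probs)))
```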