Published as a conference paper at ICLR 2023

PALI: A JOINTLY-SCALED MULTILINGUAL LANGUAGE-IMAGE MODEL

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
Google Research
Correspondence: chillxichen@google.com

ABSTRACT

Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion-parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pre-training tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question-answering, and scene-text understanding), while retaining a simple, modular, and scalable design.

1 INTRODUCTION

Increasing neural network capacity has been a successful trend in the modeling of language and vision tasks. On the language side, models such as T5 (Raffel et al., 2020), GPT-3 (Brown et al., 2020), Megatron-Turing (Shoeybi et al., 2019), GLaM (Du et al., 2022), Chinchilla (Hoffmann et al., 2022), and PaLM (Chowdhery et al., 2022) have shown significant advantages from training large Transformers on large amounts of text data. On the vision side, CNNs (Mahajan et al., 2018; Huang et al., 2019; Kolesnikov et al., 2020), Vision Transformers (Dosovitskiy et al., 2021), and other models (Tolstikhin et al., 2021; Riquelme et al., 2021) have seen similar benefits from scale (Zhai et al., 2022a), albeit to a lesser extent than in language. Language-and-vision modeling has followed a similar trend, e.g., SimVLM (Wang et al., 2021), Florence (Yuan et al., 2021), CoCa (Yu et al., 2022), GIT (Wang et al., 2022a), BEiT-3 (Wang et al., 2022c), and Flamingo (Alayrac et al., 2022).

We introduce PaLI, a model that performs image-only, language-only, and image+language tasks across many languages, using a single "image-and-text to text" interface. A key characteristic of PaLI is a more balanced parameter share between the language and vision components, with the additional capacity allocated to the vision backbone yielding large gains in performance. Another key ingredient of PaLI is the reuse of large unimodal backbones for language and vision modeling, in order to transfer existing capabilities and reduce training cost.
On the language side, we reuse the 13B-parameter model mT5-XXL (Xue et al., 2021), which already packages language understanding and generation capabilities. We show that these capabilities are maintained and extended into a multimodal setting. On the vision side, in addition to reusing the 2B-parameter ViT-G model (Zhai et al., 2022a), we train a 4B-parameter model, which we call ViT-e ("enormous"). ViT-e achieves good performance on image-only tasks, such as 90.9% on ImageNet fine-tuning and 84.9% on ObjectNet (Barbu et al., 2019).

We find benefits from jointly scaling both the vision and the language components, with vision providing a better return on investment (accuracy improvement per parameter/FLOP). As a result, the capacity of our largest PaLI model, PaLI-17B, is distributed relatively equitably between the two modalities, with the ViT-e component accounting for about 25% of the total parameter count. This is not always the case for prior work in large-capacity vision-and-language modeling (Wang et al., 2022a; Alayrac et al., 2022), due to the prior scale mismatch between vision and language backbones.

We enable knowledge-sharing between multiple image and/or language tasks by casting them into a generalized VQA-like task. We frame all tasks using an "image+query to answer" modeling interface, in which both the query and answer are expressed as text tokens. This allows PaLI to capitalize on transfer learning across tasks, and enhances language-and-image understanding capabilities across a wide range of vision and language problems: image captioning, visual question-answering, scene-text understanding, and others (Figure 1).

To train PaLI-17B, we build a new high-volume image-and-language dataset, WebLI, which consists of 10 billion images and tens of billions of image-text pairs. Importantly, the WebLI dataset contains text in over 100 languages. By training the model to perform multimodal tasks in many languages, we greatly increase the task diversity, and test the model's ability to effectively scale both across tasks and across languages. As a reference for future usage, we provide a data card to report information about WebLI and its construction.

PaLI-17B achieves state-of-the-art (SOTA) results on multiple benchmarks, outperforming some strong models. Specifically, PaLI outperforms recent and concurrent models on the long-standing COCO Captioning benchmark (Chen et al., 2015), with a 149.1 CIDEr score on the Karpathy split (Karpathy & Fei-Fei, 2015). PaLI also achieves a new SOTA of 84.3% on VQAv2 (Goyal et al., 2017) while using an open-vocabulary text-generative setting that is similar to Flamingo (Alayrac et al., 2022). This result outperforms even models evaluated in a fixed-vocabulary classification setting, e.g., CoCa (Yu et al., 2022), SimVLM (Wang et al., 2021), and BEiT-3 (Wang et al., 2022c).

Last but not least, our work provides a scaling roadmap for future multimodal models. Our results support the conclusion that scaling the components of each modality yields better performance compared to more skewed alternatives. Model scaling is also important for language-image understanding in multiple languages. In summary, our contributions are the following:

- We design a simple, modular, and scalable sequence-to-sequence learning architecture that can be efficiently trained by reusing existing Transformer-based unimodal checkpoints.
- We perform joint scaling of both the language and the vision components across a wide range of parameter counts, and show no saturation of performance for either component at the largest model size we consider, PaLI-17B. More importantly, we show that multimodal performance greatly benefits from scaling the vision component beyond the previously largest ViT, which provides a scaling roadmap for future vision & language models.
- We empirically validate that a mixture of objectives benefits the performance of large vision & language models.
- We scale up the pre-training data to include over 100 languages, and train a large-capacity multilingual multimodal model. We show that a properly-scaled model can handle a large number of languages well, while still achieving SOTA performance on English-only tasks.

2 RELATED WORK

Pre-trained models have proven effective in both vision (Dosovitskiy et al., 2021; Zhai et al., 2022a) and language (Raffel et al., 2020; Brown et al., 2020) tasks. Image-text pre-training has also become the default approach to tackle V&L tasks (Tan & Bansal, 2019; Chen et al., 2020; Zhang et al., 2021; Cho et al., 2021; Hu et al., 2022). While benefiting from the text representation and generation capabilities of the Transformer architecture, some of these vision-language models rely on external systems (such as Fast(er) R-CNN (Ren et al., 2015)) to provide detected object names and the related precomputed dense features. Such reliance limits the ability to scale up the models and their performance. With the introduction of Vision Transformers (Dosovitskiy et al., 2021), vision and language modalities can be jointly modeled by Transformers in a more scalable fashion (Yuan et al., 2021; Yu et al., 2022; Wang et al., 2022a; Alayrac et al., 2022).

One approach to image-text pre-training is contrastive learning (Radford et al., 2021; Jia et al., 2021). Zhai et al. (2022b) show that with a pre-trained and locked vision model, one needs to train only a paired text encoder model to obtain good language embeddings. Yuan et al. (2021) extend contrastively pre-trained models to more downstream tasks with task-specific adaptations. Besides image and language, MERLOT (Zellers et al., 2021) has found success in video understanding and reasoning through video-language pre-training. Another approach is to train vision-language models to generate text autoregressively (Donahue et al., 2015; Vinyals et al., 2015). This approach has the advantage of a unified formulation of vision-language tasks as a text-generation problem (Cho et al., 2021; Wang et al., 2022b; Piergiovanni et al., 2022b). In Cho et al. (2021), the vision-language model is trained to recover masked text. SimVLM (Wang et al., 2021) proposes an image-language pre-training approach leveraging a prefix language modeling objective. The unified framework OFA (Wang et al., 2022b) extends the generation capability to include text-to-image generation. Concurrent with our work, Unified-IO (Lu et al., 2022) further scaled up the number of objectives and tasks and demonstrated decent performance across the board through multi-task pre-training alone, without task-specific fine-tuning.

Recent works explore joint vision and language modeling with increased model capacity. CoCa (Yu et al., 2022) pre-trains a 2.1B-parameter image-text encoder-decoder model jointly with a contrastive loss and a generative loss.
GIT (Wang et al., 2022a) trains a model consisting of a single image encoder and a text decoder with a captioning (generative) loss, where the image encoder is pre-trained with a contrastive loss. In their latest version, GIT2, the model size is scaled up to 5.1B parameters, with the majority of parameters on the vision side (4.8B). BEiT-3 (Wang et al., 2022c) presents an architecture with vision, language, and vision-language experts, operating with shared multi-head self-attention followed by a switch for expert modules, resulting in a 1.9B-parameter model trained from scratch on a variety of public image, text, and image-text datasets. Flamingo (Alayrac et al., 2022) is built upon a 70B-parameter language model (Hoffmann et al., 2022), used as a decoder-only model with the majority of its parameters frozen in order to preserve language-generation capabilities, along with a 435M-parameter vision encoder.

Vision-language pre-training also benefits from automatically mined and filtered large-scale datasets such as Conceptual Captions (CC3M) and CC12M (Sharma et al., 2018; Changpinyo et al., 2021), with 3 and 12 million image-text pairs, respectively. With more relaxed filtering, LEMON (Hu et al., 2022) collected a larger dataset with 200M examples, which was further expanded to 800M examples in GIT (Wang et al., 2022a). To better scale models, larger, noisier datasets such as the ALIGN dataset (1.8B) (Jia et al., 2021) have been constructed, which has benefited SimVLM (Wang et al., 2021) and CoCa (Yu et al., 2022). While these image-text datasets have fueled foundational V&L models with state-of-the-art performance, they are English-only, and there have been limited attempts to create non-English-centric datasets and unlock the multilingual capabilities of such models.

3 THE PALI MODEL

3.1 ARCHITECTURE

With PaLI, we aim to perform both unimodal (language, vision) and multimodal (language and vision) tasks. Typically, many of these tasks are best handled by different models. For instance, image classification and many formulations of VQA require predicting elements from a fixed set, while language-only tasks and image captioning require open-vocabulary text generation. Similar to the recent work OFA (Wang et al., 2022b) and the concurrent work of Lu et al. (2022), we resolve this by using a sufficiently general interface for all tasks considered: the model accepts an image and a text string as input, and generates text as output. The same interface is used both during pre-training and fine-tuning. Since all tasks are performed with the same model, i.e., we have no task-specific parameters or heads, we use text-based prompts to indicate to the model which task to perform.

Figure 1 shows a high-level schematic of the model architecture. At its core, PaLI has a text encoder-decoder Transformer (Vaswani et al., 2017). To include vision as input, the text encoder is fed with a sequence of visual tokens: output patch features of a Vision Transformer which takes an image as input. No pooling is applied to the output of the Vision Transformer before passing the visual tokens to the encoder-decoder model via cross-attention. We reuse previously trained unimodal checkpoints.

Figure 1: The PaLI main architecture is simple and scalable. It uses an encoder-decoder Transformer model, with a large-capacity ViT component for image processing.
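To make this interface concrete, the following is a minimal, framework-agnostic sketch of the image+text-to-text flow described above. ToyViT and ToySeq2Seq are illustrative stand-ins, and the prompt string, patch size, and embedding width are arbitrary choices for the example; none of this reproduces the actual ViT-e or mT5 components.

```python
# Sketch only: an image becomes a sequence of un-pooled patch features
# ("visual tokens") that condition a text encoder-decoder, which generates
# the answer as text for every task.
import numpy as np

class ToyViT:
    """Maps an image to a sequence of patch features (no pooling)."""
    def __init__(self, patch_size=16, d_model=64, seed=0):
        self.patch_size = patch_size
        rng = np.random.default_rng(seed)
        self.proj = rng.normal(size=(patch_size * patch_size * 3, d_model))

    def __call__(self, image):  # image: [H, W, 3]
        p = self.patch_size
        h, w, c = image.shape
        patches = (image.reshape(h // p, p, w // p, p, c)
                        .transpose(0, 2, 1, 3, 4)
                        .reshape(-1, p * p * c))
        return patches @ self.proj  # visual tokens: [num_patches, d_model]

class ToySeq2Seq:
    """Stand-in for an mT5-style encoder-decoder conditioned on visual tokens."""
    def generate(self, prompt_tokens, visual_tokens):
        # The real model encodes the text prompt together with the visual
        # tokens and decodes the answer autoregressively.
        return (f"<answer conditioned on {len(prompt_tokens)} text tokens "
                f"and {visual_tokens.shape[0]} visual tokens>")

vit, seq2seq = ToyViT(), ToySeq2Seq()
image = np.zeros((224, 224, 3))                            # 224x224 pre-training resolution
prompt = "Answer in EN: what is in the picture?".split()   # hypothetical task prompt
print(seq2seq.generate(prompt, vit(image)))
```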
For the text encoder-decoder, we reuse pre-trained mT5 (Xue et al., 2021) models, while for the image encoder, we reuse large vanilla ViT models (Dosovitskiy et al., 2021; Zhai et al., 2022a).

The visual component. We introduce and train the largest vanilla ViT architecture to date, named ViT-e. ViT-e has the same architecture and uses the same training recipe as the 1.8B-parameter ViT-G model (Zhai et al., 2022a), while scaling to 4B parameters. The only other difference is that we apply learning-rate cool-down twice, once with and once without inception crop augmentation, and average ("soup") the weights of the two models as in Wortsman et al. (2022). While scaling laws have been studied in both the vision domain and the language domain, scaling behaviour is less explored in combined vision-and-language models. Scaling up vision backbones leads to saturating gains on classification tasks such as ImageNet (Zhai et al., 2022a). We further confirm this, observing that ViT-e is only marginally better than ViT-G on ImageNet (Table 16). However, we observe substantial performance improvements from ViT-e on vision-language tasks in PaLI (Section 4). For example, ViT-e yields almost three additional CIDEr points over ViT-G on the COCO captioning task. This hints at future headroom for vision-language tasks with even larger ViT backbones.

The language component. We adopt the mT5 (Xue et al., 2021) backbone as our language component. We experiment with the pre-trained mT5-Large (1B parameters) and mT5-XXL (13B parameters), from which we initialize the language encoder-decoder of PaLI. We train on a mix of many tasks, including pure language understanding tasks (Section A.2). This helps avoid catastrophic forgetting of mT5's language understanding and generation abilities. As a result, PaLI-17B continues to achieve similar levels of language-understanding accuracy both on English benchmarks (Wang et al., 2019a) and across languages, as measured by the XTREME benchmark (Hu et al., 2020) (Section 4).

The overall model. Three model sizes are considered (Table 7): 1) PaLI-3B, where the language component is initialized from mT5-Large (Xue et al., 2021) (1B parameters), and the vision component is ViT-G (Zhai et al., 2022a) (1.8B parameters); 2) PaLI-15B, where the language component is initialized from mT5-XXL (Xue et al., 2021) (13B parameters), and the vision component is ViT-G (1.8B parameters); 3) PaLI-17B, where the language component is initialized from mT5-XXL, and the vision component is the newly-trained ViT-e model (4B parameters).

3.2 DATA

WebLI Dataset. Scaling studies for deep learning show that larger models require larger datasets to train effectively (Hoffmann et al., 2022; Kaplan et al., 2020; Zhai et al., 2022a). To unlock the potential of multilingual image-language pre-training, we introduce WebLI, a multilingual image-language dataset built from images and texts available on the public web. WebLI scales up the image-language data collection from English-only datasets to 109 languages, which enables us to pre-train PaLI multilingually and perform downstream tasks across many languages. The data collection process is similar to those reported in Jia et al. (2021) and Zhai et al. (2022b). Due to the abundance of multilingual content on the internet, the collection process for the WebLI dataset can be scaled to cover 10 billion images and 12 billion alt-texts.
In addition to annotation with web text, we use a publicly available automatic service to extract OCR annotations on all images, resulting in 29 billion image-OCR pairs. To balance quality against scale, we filter the dataset down to the highest-quality subset, retaining only the top 10% scoring of the original WebLI image-text pairs (about 1B examples), which we use to train PaLI. Examples and statistics for the WebLI corpus, along with a complete datasheet (Pushkarna et al., 2022), are shown in Appendix B (Figure 4) and Appendix G.

Training mixture. To accommodate diverse tasks in the image-language space, we train PaLI using a mixture of eight pre-training tasks. This mixture is designed to span a range of general capabilities useful for downstream tasks:

- Span corruption on text-only data uses the same technique described by Xue et al. (2021) on text-only examples.
- Split-captioning on WebLI alt-text data is inspired by the pre-training objective of Wang et al. (2021), and works by splitting each alt-text string randomly into two parts, cap1 and cap2, used as input and target, respectively.
- Captioning on CC3M-35L, with the alt-text string in language lang as the target, is based on the Conceptual Captions (Sharma et al., 2018) training data and machine-translated alt-texts.
- OCR on WebLI OCR-text data uses as target the concatenation of the annotated OCR texts in language lang (Kil et al., 2022), produced by a publicly available automatic service for the input image.
- English and cross-lingual VQA uses VQ2A-CC3M (Changpinyo et al., 2022a), translated in the same way as CC3M-35L. Note that we use English answers in all instances here, as the English-native answers for VQA are often short and too prone to errors to allow out-of-context automatic translation.
- English and cross-lingual visual question generation (VQG) is also based on native and translated VQ2A-CC3M-35L VQA triplets. Similarly, we use only English answers here.
- English-only object-aware (OA) VQA is based on VQA triplets derived from automatically-produced, non-exhaustive object labels, inspired by Piergiovanni et al. (2022a). The QA pairs include listing all the objects in the image and asking whether a subset of objects is in the image. To create these examples, we require object-level annotations, for which we use Open Images (Kuznetsova et al., 2020).
- Object detection is a generative object-detection task inspired by Chen et al. (2021; 2022).

We specify each task using a training data source and a template-based prompt (see the illustrative sketch below), and train the model using language-model-style teacher forcing (Goodfellow et al., 2016) with a standard softmax cross-entropy loss. The coefficients for the training mixture are empirically determined, with 1.6B total examples in the mixture (Appendix A.2). The whole mixture is slightly smaller, and designed to be cleaner, than the datasets used in SimVLM (1.8B), CoCa (1.8B), and Flamingo (2.3B). However, unlike those datasets, examples in our 1.6B-example dataset follow a long-tailed distribution over the 100+ languages covered. To prevent leakage between the pre-training examples and the downstream benchmarks, WebLI has undergone near-deduplication (Jia et al., 2021) of its images against the train, validation, and test splits of 68 common vision/vision-language datasets. For the other datasets in the mixture, we performed the same de-duplication against all the downstream tasks.
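As referenced above, the sketch below illustrates how mixture examples can be cast into the shared (image, input text) -> target text format via template-based prompts. The prompt strings, split heuristic, and function names are hypothetical placeholders for exposition, not the exact templates used to train PaLI.

```python
# Sketch only: each pre-training task becomes a (prompt, target) text pair
# attached to an image; `lang` marks the target language placeholder.
import random

def split_captioning(alt_text, lang, rng=random.Random(0)):
    """Split an alt-text into cap1 (input prefix) and cap2 (generation target)."""
    words = alt_text.split()
    cut = rng.randint(1, max(1, len(words) - 1))
    cap1, cap2 = " ".join(words[:cut]), " ".join(words[cut:])
    return f"Complete the caption in {lang}: {cap1}", cap2

def captioning_prompt(lang):
    return f"Generate the alt-text in {lang}:"

def ocr_prompt(lang):
    return f"Generate the OCR text in {lang}:"

def vqa_prompt(question):
    # Answers are kept in English for the (cross-lingual) VQA objectives.
    return f"Answer in EN: {question}"

print(split_captioning("a brown dog catching a frisbee in the park", "en"))
print(vqa_prompt("what is the dog catching?"))
```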
3.3 MODEL TRAINING

All PaLI variants are trained for one epoch over the entire pre-training dataset (1.6B examples) at 224×224 image resolution. Only the parameters of the language component are updated; the vision component is frozen, which is beneficial (Section 4.6). For the largest model, PaLI-17B, we perform an additional high-resolution (588×588) phase similar to previous works (Radford et al., 2021; Yuan et al., 2021; Yu et al., 2022). This phase lasts only 10k steps, covering 10M examples in total, with all the parameters of PaLI updated. More details on training PaLI and the ViT-e backbone are in Appendix A.1.

4 EXPERIMENTS

We fine-tune and evaluate the PaLI-3B and PaLI-15B checkpoints at 490×490 resolution. For PaLI-17B, unless otherwise stated, the checkpoint produced by the two-phase pre-training is fine-tuned and evaluated at 588×588 resolution. For all benchmarks, cross-entropy loss is used for fine-tuning.

4.1 IMAGE CAPTIONING

We fine-tune on COCO Captions (Chen et al., 2015) using the widely adopted Karpathy split (Karpathy & Fei-Fei, 2015). PaLI outperforms the latest SOTA trained with cross-entropy loss (Wang et al., 2022c), and establishes a new high CIDEr score (Vedantam et al., 2015) of 149.1 (Table 1) among models without CIDEr optimization. NoCaps (Agrawal et al., 2019) is an evaluation benchmark for image captioning with a style similar to COCO, but targeting many more visual concepts than those included in COCO. Following previous works, we evaluate NoCaps using a model fine-tuned on COCO. PaLI-17B achieves a 124.4 CIDEr score on the test set, comparable to the recent result of 124.8 from GIT2 (Wang et al., 2022a). GIT2 achieves 124.2, 125.5, and 122.3 on the in-domain, near-domain, and out-of-domain splits of the NoCaps test set, respectively; PaLI-17B achieves 121.1, 124.4, and 126.7, respectively. This suggests that for PaLI-17B, the domain transfer from COCO to NoCaps is slightly sub-optimal compared with models pre-trained on English only. Nevertheless, PaLI-17B outperforms all prior models on recognizing and describing long-tail objects outside of COCO's domain. TextCaps (Sidorov et al., 2020) focuses on captioning for images containing text. VizWiz-Cap (Gurari et al., 2020) contains images taken by people who are blind, and also involves scene-text understanding. We fine-tune on TextCaps and VizWiz-Cap using OCR strings generated by a publicly available automatic service, similar to the protocol used in Yang et al. (2021). Further details, including results evaluating PaLI-17B without OCR as input, are provided in Appendix C.5.

Table 1: CIDEr results for image captioning over the English benchmarks COCO Captions (Karpathy split), NoCaps, TextCaps, and VizWiz-Cap.

| Model | COCO Karpathy-test | NoCaps val | NoCaps test | TextCaps val | TextCaps test | VizWiz-Cap test-dev | VizWiz-Cap test-std |
|---|---|---|---|---|---|---|---|
| LEMON (0.7B) | 139.1 | 117.3 | 114.3 | - | - | - | - |
| SimVLM | 143.3 | 112.2 | 110.3 | - | - | - | - |
| CoCa (2.1B) | 143.6 | 122.4 | 120.6 | - | - | - | - |
| GIT (0.7B) | 144.8 | 125.5 | 123.4 | 143.7 | 138.2 | 113.1 | 114.4 |
| GIT2 (5.1B) | 145.0 | 126.9 | 124.8 | 148.6 | 145.0 | 119.4 | 120.8 |
| OFA (0.9B) | 145.3 | - | - | - | - | - | - |
| Flamingo (80B) | 138.1 | - | - | - | - | - | - |
| BEiT-3 (1.9B) | 147.6 | - | - | - | - | - | - |
| PaLI-3B | 145.4 | 121.1 | - | 143.6 | - | 117.2 | - |
| PaLI-15B | 146.2 | 121.2 | - | 150.1 | - | 121.7 | - |
| PaLI-17B | 149.1 | 127.0 | 124.4 | 160.0 | 160.4 | 123.0 | 124.7 |

Multilingual captioning on Crossmodal-3600. Following Thapliyal et al. (2022), we fine-tune PaLI models on COCO-35L, which is COCO Captions translated into 35 languages in the same way as CC3M-35L, before evaluating on Crossmodal-3600.
We used the checkpoints pre-trained at 224×224 resolution and fine-tuned on COCO-35L at the same resolution. We normalize the unicode, tokenize, and remove all punctuation before calculating CIDEr scores. For languages without word boundaries, such as Chinese, Japanese, Korean, and Thai, a neural model is used to segment the text. To illustrate the range of improvements over a variety of language families with different scripts and different resource levels, we report in Table 2 the exact CIDEr scores for seven languages, in addition to the 35-language average score. PaLI outperforms the previous SOTA by large margins. Note that, due to different linguistic structures, the variance of CIDEr scores across languages does not indicate lower prediction quality for certain languages. In Appendix C.2, we back-translate the non-English predictions to English, and demonstrate that the capability of PaLI on English and on other languages is rather consistent.

Table 2: CIDEr scores on image captioning for the Crossmodal-3600 benchmark for seven diverse languages (English, French, Hindi, Hebrew, Romanian, Thai, and Chinese), as well as the average of the 35 languages covered by the benchmark.

| Model | en | fr | hi | iw | ro | th | zh | 35-lang avg. |
|---|---|---|---|---|---|---|---|---|
| Thapliyal et al. (2022) (0.8B) | 57.6 | 40.9 | 20.6 | 16.1 | 13.9 | 35.5 | 19.8 | 28.9 |
| PaLI-3B | 92.8 | 68.6 | 30.3 | 39.2 | 30.3 | 65.9 | 32.2 | 47.0 |
| PaLI-17B | 98.1 | 75.5 | 31.3 | 46.8 | 35.8 | 72.1 | 36.5 | 53.6 |

4.2 VISUAL QUESTION ANSWERING

All the VQA fine-tuning experiments in this paper are performed in the open-vocabulary setting using the 250k-token mT5 (Xue et al., 2021) vocabulary (Table 3). Most prior works, e.g., SimVLM (Wang et al., 2021), CoCa (Yu et al., 2022), and BEiT-3 (Wang et al., 2022c), use the VQA-as-classification setting, where the best answer among a predefined set (usually of size 3k) needs to be selected. Note that the VQA-as-open-generation setting is challenging because: (1) the generated text is directly compared to the desired answer and only an exact match is counted as accurate; (2) the PaLI vocabulary covers 100+ languages and is significantly larger than both those used in the classification setting and those used by previous single-language open-generation models (Alayrac et al., 2022; Wang et al., 2022a).
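For illustration, the small sketch below scores such open-vocabulary predictions with exact match. The lower-casing and whitespace normalization shown here is an assumption made for exposition; benchmark-specific answer processing may differ.

```python
# Sketch only: a generated answer counts as correct only if it exactly
# matches the reference after a simple normalization.
def normalize(text):
    return " ".join(text.lower().strip().split())

def exact_match_accuracy(predictions, references):
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / max(1, len(references))

preds = ["two", "a red bus", "paris"]
refs = ["2", "a red bus", "Paris"]
print(exact_match_accuracy(preds, refs))  # 2/3: "two" != "2" under exact match
```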
Table 3: VQA accuracy results on VQAv2, OKVQA, TextVQA, and VizWiz-QA, and ANLS results on ST-VQA. PaLI models are evaluated in the open-vocabulary generation setting, and still outperform previous models that use closed-vocabulary classification (SimVLM, CoCa, BEiT-3, OFA). The result on OKVQA by Flamingo (with *) is obtained in a 32-shot learning setup. Mia (Qiao et al., 2021) is the winning model of the TextVQA Challenge 2021, based on fine-tuning T5-XL (Raffel et al., 2020).

| Method | VQAv2 test-dev | VQAv2 test-std | OKVQA val | TextVQA val | TextVQA test | VizWiz-QA test-dev | VizWiz-QA test | ST-VQA val | ST-VQA test |
|---|---|---|---|---|---|---|---|---|---|
| SimVLM | 80.03 | 80.34 | - | - | - | - | - | - | - |
| CoCa (2.1B) | 82.3 | 82.3 | - | - | - | - | - | - | - |
| GIT (0.7B) | 78.56 | 78.81 | - | 59.93 | 59.75 | 68.0 | 67.5 | 69.1 | 69.6 |
| GIT2 (5.1B) | 81.74 | 81.92 | - | 68.38 | 67.27 | 70.97 | 70.1 | 75.1 | 75.8 |
| OFA (0.9B) | 82.0 | 82.0 | - | - | - | - | - | - | - |
| Flamingo (80B) | 82.0 | 82.1 | 57.8* | 57.1 | 54.1 | 65.7 | 65.4 | - | - |
| BEiT-3 (1.9B) | 84.2 | 84.0 | - | - | - | - | - | - | - |
| KAT | - | - | 54.4 | - | - | - | - | - | - |
| Mia | - | - | - | - | 73.67 | - | - | - | - |
| PaLI-3B | 81.4 | - | 52.4 | 60.12 | - | 67.5 | - | 67.5 | 69.7 |
| PaLI-15B | 82.9 | - | 56.5 | 65.49 | - | 71.1 | - | 73.2 | 76.5 |
| PaLI-17B | 84.3 | 84.3 | 64.5 | 71.81 | 73.06 | 74.4 | 73.3 | 77.1 | 79.9 |

On VQAv2, PaLI achieves 84.3 accuracy and outperforms the previous SOTA as follows: (1) by +2.2 accuracy points in the open-vocabulary generation setting, compared to Flamingo (Alayrac et al., 2022); (2) by +0.3 accuracy points when compared against the best result in the closed-vocabulary classification setting, BEiT-3 (Wang et al., 2022c).

OKVQA requires external knowledge to answer its questions, that is, knowledge not directly present in the image input that instead needs to be inferred by the model. PaLI-17B achieves 64.5 accuracy, pushing the SOTA for the pretrain-finetune setup higher by 10.1 accuracy points, compared to KAT (Gui et al., 2021) at 54.4 accuracy. The best result for the 32-shot learning setup is from Flamingo (Alayrac et al., 2022) at 57.8 accuracy. The results from Flamingo and PaLI-17B suggest that leveraging external knowledge does not necessarily require specific training, and instead can be achieved with generic large-capacity models trained on large amounts of data.

TextVQA (Singh et al., 2019), VizWiz-QA (Gurari et al., 2018), and ST-VQA (Biten et al., 2019) require the ability to perform question answering in the presence of text in the input image. We fine-tune using OCR strings generated by a publicly available automatic service, similar to the protocol in TAP (Yang et al., 2021) and Mia (Qiao et al., 2021). Evaluation on TextVQA and VizWiz-QA without OCR as input is provided in Appendix C.5.

Cross-lingual and multilingual VQA on xGQA and MaXM. Both xGQA (Pfeiffer et al., 2022) and MaXM (Changpinyo et al., 2022b) are test-only VQA benchmarks that require multilingual understanding of visual questions. The setting in xGQA is cross-lingual (English answers only), whereas for MaXM it is multilingual (answers in the same language as the question). We evaluate PaLI-17B pre-trained at 224×224 image resolution and fine-tuned at 378×378 resolution on native and translated VQAv2 (Goyal et al., 2017) (the Karpathy train split) in the 13 languages covered by xGQA and MaXM (VQAv2-13L). Table 4 shows significant gains on both benchmarks across all languages.

Table 4: Cross-lingual VQA results on xGQA (Pfeiffer et al., 2022) (left) and multilingual VQA results on MaXM (Changpinyo et al., 2022b) (right). All models are fine-tuned on translated VQAv2 in 13 languages. Exact-match accuracy is reported. Referenced MPT results are from Changpinyo et al. (2022b).
xGQA:

| Model | en | bn | de | id | ko | pt | ru | zh |
|---|---|---|---|---|---|---|---|---|
| MPT | 41.5 | 38.6 | 40.5 | 39.5 | 38.7 | 39.8 | 39.5 | 39.5 |
| PaLI-17B | 54.2 | 50.0 | 52.2 | 50.6 | 50.4 | 51.3 | 50.3 | 50.6 |

MaXM:

| Model | en | fr | hi | iw | ro | th | zh |
|---|---|---|---|---|---|---|---|
| MPT | 36.6 | 36.2 | 55.1 | 40.6 | 42.3 | 50.0 | 30.3 |
| PaLI-17B | 56.4 | 46.4 | 67.3 | 60.0 | 57.4 | 65.6 | 46.9 |

Figure 2: PaLI scaling for a number of tasks (COCO-Cap, NoCaps, TextCaps, VQAv2, TextVQA, OKVQA, VizWiz-QA, and the 7-task average; absolute score differences for PaLI-3B (L, ViT-G), PaLI-15B (XXL, ViT-G), PaLI-17B (XXL, ViT-e), and PaLI-17B with the high-res phase). We report CIDEr scores for captioning tasks, and accuracy scores for VQA tasks. Both scaling the language side of the model (from 1B to 13B parameters) and the vision side (from 2B to 4B parameters) yield improvements across all tasks. The results represented by solid bars are from the standard 224×224-resolution pre-training. The empty orange bars correspond to PaLI-17B checkpoints with the high-resolution pre-training phase.

4.3 LANGUAGE-UNDERSTANDING CAPABILITIES

Since PaLI is pre-trained with a diverse mixture of multimodal tasks with image and text data, this raises the question of whether it would forget its language-modeling capability, and therefore exhibit inferior performance on language-understanding tasks compared to its unimodal starting checkpoint (mT5-XXL in the case of PaLI-17B). We therefore compare mT5-XXL and PaLI-17B on a range of language-understanding benchmarks, including the English-only SuperGLUE benchmark (Wang et al., 2019a), as well as three multilingual benchmarks from XTREME (Hu et al., 2020): XNLI (Conneau et al., 2018), a textual-entailment task covering 14 languages, and XQuAD (Artetxe et al., 2020) and TyDiQA-GoldP (Clark et al., 2020), which are question-answering tasks covering 10 and 11 languages, respectively. For the three XTREME benchmarks, we evaluate in the zero-shot (ZS) transfer setting, whereas for SuperGLUE the models are fine-tuned (FT). Table 11 in Appendix C.1 summarizes the results. Despite the pre-training mixture heavily favoring the V&L tasks, PaLI-17B is able to maintain a high level of language-understanding capability for English, and is on par with the state-of-the-art mT5-XXL checkpoint on the XTREME benchmarks.

4.4 ZERO-SHOT IMAGE CLASSIFICATION

We evaluate the PaLI checkpoints (without the high-res phase) at 224×224 resolution on ImageNet and ImageNet out-of-distribution (OOD) evaluation sets: ImageNet (Deng et al., 2009), ImageNet-R (Hendrycks et al., 2021a), ImageNet-A (Hendrycks et al., 2021b), ImageNet-Sketch (Wang et al., 2019b), ImageNet-v2 (Recht et al., 2019), and ObjectNet (Barbu et al., 2019). We use the same interface as for all other tasks: instead of training a classifier on top of PaLI, we condition on the image and use PaLI's decoder to score strings corresponding to each class directly (see Appendix C.8 for details, and the sketch below). The top-1 accuracies are presented in Table 5, which clearly shows that PaLI-17B is significantly better than the smaller variants. We are not aware of any previous work on large-scale zero-shot evaluation on ImageNet with a generative model. However, PaLI in a zero-shot setting outperforms the 1-shot learning result from Flamingo (Alayrac et al., 2022).
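A minimal sketch of this scoring protocol is shown below, assuming a hypothetical log_likelihood(image, text) function standing in for the decoder's sequence score; the actual prompts, scoring details, and any length normalization are described in Appendix C.8, not here.

```python
# Sketch only: score each candidate class name as a text continuation and
# pick the highest-scoring one; no classification head is trained.
import random

def log_likelihood(image, class_text):
    """Hypothetical stand-in: the real model returns the decoder
    log-probability of `class_text` given the image (and a task prompt)."""
    rng = random.Random(f"{id(image)}-{class_text}")
    return -10.0 * rng.random()

def zero_shot_classify(image, class_names):
    scores = {name: log_likelihood(image, name) for name in class_names}
    return max(scores, key=scores.get), scores

image = object()  # stand-in for an actual image tensor
prediction, scores = zero_shot_classify(image, ["tabby cat", "golden retriever", "teapot"])
print(prediction)
```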
Table 5: Top-1 accuracy results of 0-shot image classification on ImageNet, ImageNet-R, ImageNet-A, ImageNet-Sketch, ImageNet-v2, and ObjectNet. Top-5 results are in the Appendix (Table 21).

| Model (ImageNet data) | INet | INet-R | INet-A | INet-Sketch | INet-v2 | ObjNet |
|---|---|---|---|---|---|---|
| Flamingo-80B (1-shot) | 71.9 | - | - | - | - | - |
| Flamingo-80B (5-shot) | 77.3 | - | - | - | - | - |
| PaLI-3B (0-shot) | 70.06 | 80.15 | 37.92 | 61.11 | 62.55 | 38.87 |
| PaLI-15B (0-shot) | 70.27 | 81.21 | 41.16 | 61.03 | 62.81 | 39.51 |
| PaLI-17B (0-shot) | 72.11 | 81.97 | 44.70 | 63.83 | 64.46 | 42.62 |

4.5 MODEL SCALING

Due to the modular architecture, the image and language components of PaLI can be scaled independently. We demonstrate that jointly scaling the capacity of both components leads to performance improvements. Figure 2 quantifies this improvement across seven V&L benchmarks, where we have also evaluated the PaLI-17B checkpoint without the high-resolution pre-training phase for a fair comparison. The improvements are noticeable both when scaling the language-model capacity (from L to XXL) and when scaling the vision-model capacity (from ViT-G to ViT-e). Figure 2 also shows that scaling the visual component is important: when scaling from a ViT-G to a ViT-e model, although the overall model size increases by only about 13% (+2B parameters), the average performance improvement over all seven benchmarks (+3.2) is larger than the one obtained from a much larger increase in the capacity of the language model (+3.1), which requires more parameters (+12B). The high-resolution pre-training phase at 588×588 resolution brings an additional +2.0 points, which also indicates the potential of scaling up the vision component of the model. This observation also resonates with the significant improvement from PaLI-15B to PaLI-17B on generative ImageNet zero-shot classification (Table 5). Table 12 shows the results of a 5B version of PaLI with mT5-L and ViT-e on two benchmarks, which further supports the benefit of joint scaling. For context, in prior work, V&L scaling is usually conducted at lower model capacity: for instance, CoCa (Yu et al., 2022) scales up to 2.1B parameters; or scaling is done primarily via the language-modeling backbone, e.g., Flamingo (Alayrac et al., 2022) scales the text backbone to 80B parameters while the image backbone remains at 435M. Finally, on the Crossmodal-3600 benchmark, we show that scale has a large impact on multilingual performance as well (Figure 5 in the Appendix).

4.6 ABLATION STUDIES

We examine the composition of the task mixture and demonstrate the effectiveness of our multiple-objective mixture design. To this end, we pre-train a PaLI-3B model with a 200M-example data coverage for each setting, before fine-tuning on a combination of English and multilingual V&L tasks (Table 6). Aside from the four tasks from our main evaluation of PaLI, we also add a VQAv2-based VQG benchmark (Akula et al., 2021). The relative weight of each component remains the same as in the full mixture (Table 9). As a first observation, the split-cap objective on WebLI appears to be the most critical, across all benchmarks. Second, the object-related components also boost performance on all benchmarks. Third, the captioning objective on CC3M-35L helps on COCO; on XM-3600, its positive contribution for non-EN languages and the slight degradation for English reflect CC3M-35L having a much higher non-EN example ratio (34/35) compared to WebLI alt-text (60% English, Figure 4). Fourth, adding VQA helps TextVQA; in addition, the VQG objective improves the model's VQG capability without impacting the performance on other benchmarks.
Last but not least, the OCR objective positively impacts OCR-related tasks such as TextVQA, at the cost of a slight negative impact on captioning performance. We also note that VQAv2, due to its large training-set size, is much less sensitive to changes in the pre-training mixture. In addition, we perform ablations to quantify: the positive impact of initializing from unimodal checkpoints, as opposed to training from scratch (Table 13); the minor accuracy improvement from freezing the ViT backbone during pre-training (Table 14); and the effect of pre-training with non-English WebLI examples on multi-(cross-)lingual performance (Table 15).

Table 6: Mixture of objectives (PaLI-3B). TextVQA is fine-tuned at 490×490 resolution, while all other benchmarks are fine-tuned at 224×224. Results for VQAv2 are on the Karpathy validation set. XM-3600 denotes Crossmodal-3600, and 6L is the average of the six non-English languages in Table 2. The order in which the components are ablated follows the order presented in Sec. 3.2, and "object-related" refers to the object-aware QA and generative object detection components together. TextVQA is fine-tuned without detected OCR strings to better showcase the model's OCR capability.

| Component | COCO | TextVQA | VQAv2 | XM-3600 (EN / 6L) | VQG (ZS / FT) |
|---|---|---|---|---|---|
| Full mixture | 141.4 | 41.6 | 76.0 | 93.8 / 42.5 | 96.7 / 194.0 |
| w/o split-cap | 140.4 (-1.0) | 38.8 (-2.8) | 75.5 (-0.5) | 87.5 (-6.3) / 41.5 (-1.0) | 86.3 (-10.4) / 190.5 (-3.5) |
| w/o captioning | 140.5 (-0.9) | 41.2 (-0.4) | 75.9 (-0.1) | 94.9 (+1.1) / 39.9 (-2.6) | 101.3 (+4.6) / 193.3 (-0.7) |
| w/o OCR | 142.3 (+0.9) | 39.9 (-1.7) | 75.9 (-0.1) | 95.4 (+1.6) / 43.6 (+1.1) | 92.5 (-4.2) / 193.7 (-0.3) |
| w/o VQA | 140.9 (-0.5) | 40.0 (-1.6) | 75.9 (-0.1) | 93.9 (+0.1) / 42.7 (+0.2) | 94.1 (-2.6) / 193.2 (-0.8) |
| w/o VQG | 141.4 (+0.0) | 41.3 (-0.3) | 75.8 (-0.2) | 95.1 (+1.3) / 42.0 (-0.5) | 17.9 (-78.8) / 188.2 (-5.8) |
| w/o object-related | 140.9 (-0.5) | 40.2 (-1.4) | 75.4 (-0.6) | 90.9 (-2.9) / 41.8 (-0.7) | 81.7 (-15.0) / 189.1 (-4.9) |

ETHICS STATEMENT AND BROADER IMPACTS

Large models may have broader societal impact. While such models have demonstrated strong performance on public benchmarks, they might contain unknown biases or stereotypes, or propagate inaccurate or otherwise distorted information. While we have made efforts to measure some of these issues, such models need to be re-assessed carefully before being used for specific purposes. The dataset used for pre-training is automatically harvested, and the filtering of the data is automatic. That process may leave undesirable images or text annotations, descriptions, or concepts incorporated into the model. We have also attempted to train the model to operate in more than 100 languages, which we believe is an important step forward for image-language models. However, languages have varying levels of data presence and coverage, so the quality of the generated text varies by language and might contain inaccurate or undesirable outputs.

REPRODUCIBILITY STATEMENTS

Our model is based on open-sourced components, ViT and mT5 (Dosovitskiy et al., 2021; Xue et al., 2021). Model architecture details for each component are given in Section 3.1. The configuration of ViT-e used for scaling is provided in Table 7 and Section A.1. We have provided training and fine-tuning details in Section 3.3 and in Section A of the Appendix. Data and model cards are also provided in the Appendix.
ACKNOWLEDGEMENTS We would like to thank Erica Moreira, Victor Gomes, Tom Small, Sarah Laszlo, Kathy Meier Hellstern, Susanna Ricco, Emily Denton, Bo Pang, Wei Li, Jihyung Kil, Tomer Levinboim, Julien Amelot, Zhenhai Zhu, Xiangning Chen, Liang Chen, Filip Pavetic, Daniel Keysers, Matthias Minderer, Josip Djolonga, Ibrahim Alabdulmohsin, Mostafa Dehghani, Yi Tay, Rich Lee, Austin Tarango, Elizabeth Adkison, James Cockerille, Eric Ni, Anna Davies, Maysam Moussalem, Jeremiah Harmsen, Claire Cui, Slav Petrov, Tania Bedrax-Weiss, Joelle Barral, Tom Duerig, Paul Natsev, Fernando Pereira, Jeff Dean, and Zoubin Ghahramani for helpful discussions, feedback, and support. Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8948 8957, 2019. Arjun Akula, Soravit Changpinyo, Boqing Gong, Piyush Sharma, Song-Chun Zhu, and Radu Soricut. Crossvqa: Scalably generating benchmarks for systematically testing vqa generalization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2148 2166, 2021. Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. ar Xiv preprint ar Xiv:2204.14198, 2022. Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4623 4637, 2020. Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Joshua Tenenbaum, and Boris Katz. Object Net: a large-scale bias-controlled dataset for pushing the limits of object recognition models. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 9453 9463, 2019. Lucas Beyer, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with Image Net? ar Xiv preprint ar Xiv:2006.07159, 2020. Published as a conference paper at ICLR 2023 Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Big Vision. https://github.com/ google-research/big_vision, 2022. Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Lluis Gomez, Marçal Rusiñol, C.V. Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Scene text visual question answering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4290 4300, 2019. doi: 10.1109/ICCV. 2019.00439. James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. URL http://github.com/google/jax. Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 1877 1901, 2020. Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3558 3568, 2021. 
Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, and Radu Soricut. All you may need for VQA are image captions. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1947 1963, Jul 2022a. Soravit Changpinyo, Linting Xue, Idan Szpektor, Ashish V. Thapliyal, Julien Amelot, Michal Yarom, Xi Chen, and Radu Soricut. Ma XM: Towards multilingual visual question answering. ar Xiv preprint ar Xiv:2209.05401, 2022b. Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. ar Xiv preprint ar Xiv:2109.10852, 2021. Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J Fleet, and Geoffrey Hinton. A unified sequence interface for vision tasks. ar Xiv preprint ar Xiv:2206.07669, 2022. Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO Captions: Data collection and evaluation server. ar Xiv preprint ar Xiv:1504.00325, 2015. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European conference on computer vision, pp. 104 120, 2020. Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning, pp. 1931 1942, 2021. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. ar Xiv preprint ar Xiv:2204.02311, 2022. Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. Ty Di QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8: 454 470, 2020. Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2475 2485, 2018. Published as a conference paper at ICLR 2023 Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Image Net: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248 255, 2009. Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625 2634, 2015. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021,, 2021. Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLa M: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp. 5547 5569, 2022. 
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12): 86 92, 2021. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http: //www.deeplearningbook.org. Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904 6913, 2017. Liangke Gui, Borui Wang, Qiuyuan Huang, Alex Hauptmann, Yonatan Bisk, and Jianfeng Gao. KAT: A knowledge augmented transformer for vision-and-language. ar Xiv preprint ar Xiv:2112.08614, 2021. Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Viz Wiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3608 3617, 2018. Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. In European Conference on Computer Vision, pp. 417 434. Springer, 2020. Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2020. URL http://github.com/google/flax. Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340 8349, 2021a. Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15262 15271, 2021b. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. ar Xiv preprint ar Xiv:2203.15556, 2022. Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pp. 4411 4421, 2020. Published as a conference paper at ICLR 2023 Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17980 17989, 2022. Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, Hyouk Joong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe: efficient training of giant neural networks using pipeline parallelism. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 103 112, 2019. Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904 4916, 2021. 
Norman P Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson. A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63(7):67 78, 2020. Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew Tomkins, and Sujith Ravi. Graph-rise: Graph-regularized image semantic embedding. ar Xiv preprint ar Xiv:1902.10814, 2019. Jared Kaplan, Sam Mc Candlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. ar Xiv preprint ar Xiv:2001.08361, 2020. Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128 3137, 2015. Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, and Radu Soricut. Pre STU: Pre-training for scene-text understanding. ar Xiv preprint ar Xiv:2209.05534, 2022. Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (Bi T): General visual representation learning. Lecture Notes in Computer Science, pp. 491 507, 2020. ISSN 1611-3349. Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32 73, 2017. Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images dataset v4. International Journal of Computer Vision, 128(7):1956 1981, 2020. Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unifiedio: A unified model for vision, language, and multi-modal tasks. ar Xiv preprint ar Xiv:2206.08916, 2022. Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), pp. 181 196, 2018. Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pp. 220 229, 2019. Jonas Pfeiffer, Gregor Geigle, Aishwarya Kamath, Jan-Martin Steitz, Stefan Roth, Ivan Vuli c, and Iryna Gurevych. x GQA: Cross-lingual visual question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 2497 2511, 2022. Published as a conference paper at ICLR 2023 Hieu Pham, Zihang Dai, Golnaz Ghiasi, Hanxiao Liu, Adams Wei Yu, Minh-Thang Luong, Mingxing Tan, and Quoc V Le. Combined scaling for zero-shot transfer learning. ar Xiv preprint ar Xiv:2111.10050, 2021. AJ Piergiovanni, Weicheng Kuo, and Anelia Angelova. Pre-training image-language transformers for open-vocabulary tasks. In T4V: Transformers for Vision Workshop, Conference on Computer Vision and Pattern Recognition, 2022a. AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, and Anelia Angelova. 
Answer-Me: Multi-task learning for generalization to many question-answering tasks. ar Xiv preprint ar Xiv:2205.00949, 2022b. Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards: Purposeful and transparent dataset documentation for responsible AI. In FAcc T 22: 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 1776 1826, 2022. Yixuan Qiao, Hao Chen, Jun Wang, Yihao Chen, Xianbin Ye, Ziliang Li, Xianbiao Qi, Peng Gao, and Guotong Xie. Winner team Mia at Text VQA challenge 2021: Vision-and-language representation learning with pre-trained sequence-to-sequence model. ar Xiv preprint ar Xiv:2106.15332, 2021. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748 8763, 2021. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1 67, 2020. Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do Image Net classifiers generalize to Image Net? In International Conference on Machine Learning, pp. 5389 5400, 2019. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015. Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583 8595, 2021. Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, et al. Scaling up models and data with t5x and seqio. ar Xiv preprint ar Xiv:2203.17189, 2022. Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, and Caroline Pantofaru. A step toward more inclusive people annotations for fairness. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES 21, pp. 916 925, 2021. ISBN 9781450384735. Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 8430 8439, 2019. Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 2556 2565, 2018. Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pp. 4596 4604, 2018. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick Le Gresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. ar Xiv preprint ar Xiv:1909.08053, 2019. Published as a conference paper at ICLR 2023 Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Text Caps: a dataset for image captioning with reading comprehension. In European conference on computer vision, pp. 
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8317–8326, 2019.
Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
Ashish V. Thapliyal, Jordi Pont-Tuset, Xi Chen, and Radu Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. arXiv preprint arXiv:2205.12522, 2022.
Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34:24261–24272, 2021.
Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 8252–8262, 2019.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010, 2017.
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575, 2015.
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164, 2015.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3266–3280, 2019a.
Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P. Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pp. 10506–10518, 2019b.
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022a.
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052, 2022b.
Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022c.
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021.
Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965–23998, 2022.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Jun 2021.
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022.
Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. TAP: Text-aware pre-training for Text-VQA and Text-Caption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8751–8761, 2021.
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. MERLOT: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634–23651, 2021.
Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019.
Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12104–12113, 2022a.
Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18123–18133, 2022b.
Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5579–5588, 2021.

A PALI MODEL ADDITIONAL INFORMATION

A.1 PALI MODEL DETAILS

Figure 3 visualizes examples of PaLI on several tasks, such as image captioning, visual question answering, and OCR-oriented captioning and question answering. Examples in multiple languages are shown as well. Below, we give more specifics about the PaLI model and its components.

Model variants  Table 7 lists the main PaLI models used; the largest is PaLI-17B, with 17B parameters.
Model | Components | Image Encoder | Multimodal Encoder-Decoder | Total
PaLI-3B | ViT-G, mT5-L | 1.8B | 1.2B | 3.0B
PaLI-15B | ViT-G, mT5-XXL | 1.8B | 13B | 14.8B
PaLI-17B | ViT-e, mT5-XXL | 3.9B | 13B | 16.9B
Table 7: The size, in number of parameters, of the trained PaLI model versions.

ViT-e Backbone  We show ViT-e's configuration in Table 8 alongside ViT-g and ViT-G for reference. Width, depth and MLP dimensions are all further scaled up in ViT-e, resulting in a model with 4B parameters. The training setup is copied from the ViT-G model (Zhai et al., 2022a), on the JFT-3B dataset (Zhai et al., 2022a), with a batch size of 16,384 and a 224×224 resolution. We train the model for 1M steps using a 0.0008 initial learning rate, with an inverse square-root learning rate decay and a linear cool-down to zero for the final 100k steps. The only additional technique is model souping (Wortsman et al., 2022): we run the 900k-to-1M cool-down twice, once with inception cropping and once with resizing only. The final ViT-e model thus consists of the average weights of these two cool-downs. ViT-e is pretrained using the big_vision codebase (Beyer et al., 2022).

Name | Width | Depth | MLP | Heads | Params (M) | GFLOPs
g/14 | 1408 | 40 | 6144 | 16 | 1011 | 533.1 / 1596.4
G/14 | 1664 | 48 | 8192 | 16 | 1843 | 965.3 / 2859.9
e/14 | 1792 | 56 | 15360 | 16 | 3926 | 1980 / 5777
Table 8: ViT-e architecture details.

The overall model  The overall PaLI models are implemented in JAX/Flax (Bradbury et al., 2018) using the open-source T5X (Roberts et al., 2022) and Flaxformer (Heek et al., 2020) frameworks. For the learning rate, we use a 1k-step linear warmup, followed by inverse square-root decay. For PaLI-3B, we use a peak learning rate of 1e-2; for the larger models, PaLI-15B and PaLI-17B, we use a peak learning rate of 5e-3. We use the Adafactor optimizer (Shazeer & Stern, 2018) with β1 = 0 and the second-moment exponential decay set to 0.8. The largest model, PaLI-17B, is pretrained using 1,024 GCP-TPUv4 chips for 7 days. It uses four-way model partitioning (Roberts et al., 2022) and a batch size of 4,096. This is slightly fewer TPU resources than used to train other large vision-and-language models on TPUs: SimVLM used 2,048 GCP-TPUv3 chips for 5 days (Wang et al., 2021), CoCa used 2,048 GCP-TPUv4 chips for 5 days (Yu et al., 2022), and Flamingo used 1,536 GCP-TPUv4 chips for 15 days (Alayrac et al., 2022). During training, the model passes over 1.6B images, one epoch over the entire pretraining dataset, at an image resolution of 224×224. During training, only the parameters of the language component are updated and the vision component is frozen, which provides a boost in performance (Sec. 4.6).
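As a rough, illustrative sketch of the optimization setup described above (1k-step linear warmup followed by inverse square-root decay, combined with Adafactor), the snippet below is our own reconstruction rather than the actual T5X configuration; the function name `pali_learning_rate` is ours:

```python
import jax.numpy as jnp
import optax  # gradient-transformation library commonly used with JAX/Flax

def pali_learning_rate(step, peak_lr=5e-3, warmup_steps=1_000):
    """Illustrative PaLI-style schedule: linear warmup, then inverse sqrt decay."""
    step = jnp.maximum(step, 1)
    warmup = peak_lr * step / warmup_steps
    decay = peak_lr * jnp.sqrt(warmup_steps / step)
    return jnp.where(step < warmup_steps, warmup, decay)

# Adafactor with no first-moment accumulation and a 0.8 second-moment decay
# rate, matching the settings described in the text.
optimizer = optax.adafactor(
    learning_rate=pali_learning_rate,
    decay_rate=0.8,
    momentum=None,  # no momentum corresponds to beta1 = 0
)
```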
[Figure 3: PaLI addresses a variety of vision and language tasks across many languages, for example image captioning, visual question answering, and scene-text understanding. Images from the publicly available TextVQA (Singh et al., 2019) and TextCaps (Sidorov et al., 2020) datasets are shown, together with PaLI inputs and outputs, e.g. captioning prompts such as "Generate the alt_text in FR" (output: "Un arbre debout dans un champ avec un ciel violet", i.e. "A tree standing in a field with a purple sky") and VQA prompts such as "Answer in EN: what is the brand of this watch" (output: "seiko").]

Continuation of pretraining at higher image resolution  For the largest model, PaLI-17B, we perform a further high-resolution (588×588) pre-finetuning for the multilingual tasks. When scaling up the image resolution, the patch size is kept the same and the number of patches increases with the higher resolution. We perform a 2D bilinear upsampling of the positional embeddings to match the increased number of patches. This second stage of training runs for only 10k steps at batch size 1,024 (10M examples in total) and is performed on a subset of the full training mix. We simplify the mixture of data in this stage to focus on VQA, captioning and OCR capabilities, by including only the OCR, CC3M-35L and VQ2A tasks in the training mixture and weighting them equally. In this high-resolution finetuning phase, all of the parameters of PaLI are updated. This phase was performed using 512 GCP-TPUv4 chips for an additional 3 days.
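A minimal sketch of the 2D bilinear upsampling of positional embeddings mentioned above, assuming a ViT-style layout where the learned position embeddings form a square grid of patch positions (the function and variable names are ours, not the actual codebase API):

```python
import jax
import jax.numpy as jnp

def resize_posemb(posemb, old_grid, new_grid):
    """Bilinearly upsample ViT positional embeddings to a larger patch grid.

    posemb: [old_grid * old_grid, dim] learned position embeddings.
    Returns: [new_grid * new_grid, dim] embeddings for the higher resolution.
    """
    dim = posemb.shape[-1]
    grid = posemb.reshape(old_grid, old_grid, dim)
    resized = jax.image.resize(grid, (new_grid, new_grid, dim), method="bilinear")
    return resized.reshape(new_grid * new_grid, dim)

# Example: going from 224x224 to 588x588 with patch size 14 changes the
# patch grid from 16x16 to 42x42.
posemb_224 = jnp.zeros((16 * 16, 1792))          # ViT-e width is 1792
posemb_588 = resize_posemb(posemb_224, 16, 42)   # shape (1764, 1792)
```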
A.2 THE PRETRAINING TASK MIXTURE

Below are detailed descriptions of each component of our task mixture; a sketch of how the object-detection targets are constructed follows the list.

Span corruption on text-only data uses the same technique described by Xue et al. (2021), corrupting 15% of the tokens from a given text-only example and using sentinels of the form <extra_id_k> for each corrupted span; the text-only data is a sample of 100M text-only examples.

Split-captioning (SplitCap) on WebLI alt-text data is inspired by the pretraining objective of Wang et al. (2021), and works by splitting each alt-text string randomly into two parts, cap1 and cap2. It uses the prompt "Generate the alt_text in <lang> at <pos>: <cap1> <extra_id_0>" (where <lang> is the language code of the alt-text string, and <pos> is the number of words in cap1), with cap2 as the target.

Captioning (Cap) on CC3M-35L, on native and translated alt-text data, uses the prompt "Generate the alt_text in <lang> at 0: <extra_id_0>", with the alt-text string in language <lang> as the target. CC3M-35L is the Conceptual Captions (Sharma et al., 2018) training data, translated into an additional 34 languages (the same as the non-English ones covered by Crossmodal-3600 (Thapliyal et al., 2022), except for Cusco-Quechua), for a total of 100M examples.

OCR on WebLI OCR-text data uses the prompt "Generate the ocr_text in <lang>: <extra_id_0>", with <OCR_text> as the target, where <OCR_text> is the concatenation of the annotated OCR texts in language <lang> (Kil et al., 2022) produced by the publicly available automatic service for the input image.

English and Cross-Lingual VQA on native and translated VQ2A-CC3M-35L-100M VQA triplets uses, for a given <image, question, answer> VQA triple, the prompt "Answer in EN: <question> <extra_id_0>", with <answer> as the target. VQ2A-CC3M-35L-100M is a 100M random subset of VQ2A-CC3M (Changpinyo et al., 2022a), translated into the same additional 34 languages as mentioned above. Note that we use English answers in all instances here, as the English-native answers for VQA are often short and too prone to errors to perform out-of-context automatic translation.

English and Cross-Lingual visual question generation (VQG) on native and translated VQ2A-CC3M-35L-100M VQA triplets uses, for a given <image, question, answer> VQA triple, the prompt "Generate a question in <lang> for <answer>: <extra_id_0>", with <question> in language <lang> as the target. Similarly, we use only English answers here.

English-only Object-Aware (OA) VQA is based on VQA triplets derived from automatically-produced, non-exhaustive object labels, inspired by Piergiovanni et al. (2022a). We automatically generate 4 different prompt types, based on the available object labels, as follows. (1) Prompt: "Answer in EN: List the objects present: <extra_id_0>", with the target "object_1, ..., object_N". (2) Prompt: "Answer in EN: Is <object_k> in the image? <extra_id_0>", with the target Yes or No. (3) Prompt: "Answer in EN: Is <object_1>, ..., <object_N> in the image? <extra_id_0>", with the target Yes or No. (4) Prompt: "Answer in EN: Which of <object_1>, ..., <object_N> are in the image? <extra_id_0>", with the target being the list of object labels present. To create these examples, we require object-level annotations, for which we use Open Images (Kuznetsova et al., 2020), from which we create 50M examples.

Object detection is a generative object-detection task inspired by Chen et al. (2021; 2022). The target sequence describes bounding-box coordinates and object labels, e.g. "10 20 90 100 cat 20 30 100 100 dog". The coordinates are in the ymin xmin ymax xmax order, and range between 0 and 999. Unlike Chen et al. (2021), the prompt contains a set of positive and negative class labels, i.e. object classes that are present and not present in the image (e.g. "detect cat and dog and leopard"). The prompt is prefixed with the word "detect". For datasets that do not have negative class labels explicitly defined, we randomly sample non-positive class labels. Since WebLI does not contain bounding-box annotations, we train on a mixture of public datasets totalling 16M images: Open Images (Kuznetsova et al., 2020), Visual Genome (Krishna et al., 2017), and Objects365 (Shao et al., 2019). The datasets are de-duplicated against evaluation tasks. These examples are included to increase the object-awareness capabilities of the model.
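To make the generative detection format concrete, here is a minimal sketch (our own illustrative code, not the actual pipeline) that converts normalized boxes into the quantized "ymin xmin ymax xmax label" target string and builds the "detect ..." prompt from positive classes and sampled negative classes:

```python
import numpy as np

def detection_target(boxes, labels, num_bins=1000):
    """boxes: [N, 4] normalized (ymin, xmin, ymax, xmax) coordinates in [0, 1]."""
    tokens = []
    for box, label in zip(boxes, labels):
        # Quantize coordinates into the integer range [0, 999].
        coords = np.clip((np.asarray(box) * (num_bins - 1)).round().astype(int),
                         0, num_bins - 1)
        tokens.append(" ".join(map(str, coords)) + f" {label}")
    return " ".join(tokens)

def detection_prompt(positive, all_classes, num_negatives=1, rng=np.random):
    """Prompt listing positive classes plus randomly sampled negative classes."""
    negatives = rng.choice(
        [c for c in all_classes if c not in positive],
        size=num_negatives, replace=False)
    return "detect " + " and ".join(list(positive) + list(negatives))

boxes = [[0.01, 0.02, 0.09, 0.10], [0.02, 0.03, 0.10, 0.10]]
print(detection_prompt(["cat", "dog"], ["cat", "dog", "leopard", "car"]))
print(detection_target(boxes, ["cat", "dog"]))  # "10 20 90 100 cat 20 30 100 100 dog"
```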
Dataset mixing ratio for pretraining  Table 9 provides the data mixing ratio for pretraining all PaLI variants.

Task | Text-only | WebLI alt-text | OCR | CC3M-35L | VQA | VQG | OA | Detection | Total
Amount (M) | 100 | 1000 | 100 | 100 | 100 | 100 | 50 | 16 | 1566
Table 9: Mixing ratio of each task for pretraining.

A.3 FINE-TUNING DETAILS

Hyperparameters for finetuning the V&L tasks  We performed a limited hyperparameter search for finetuning. The number of training steps is mostly selected based on dataset size. The batch size is selected among {128, 256, 512}, and the initial learning rate among {1e-5, 3e-5, 1e-4}. The optimizer setting for finetuning is the same as for pretraining. Note that we did not sweep over all possible hyperparameter combinations. Table 10 summarizes the hyperparameters corresponding to the main results.

Hyper-parameter | COCO & NoCaps | TextCaps | VizWiz-Cap | VQAv2 | TextVQA | VizWiz-QA | OKVQA | ST-VQA
Dropout | 0.1 (all tasks)
LR decay schedule | linear decay to zero (all tasks)
Train steps | 20k | 10k | 5k | 20k | 5k | 5k | 5k | 5k
Batch size | 256 (all tasks)
Initial (peak) LR | 3e-5 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 3e-5 | 1e-4
Table 10: Hyper-parameters used in fine-tuning experiments.

Setup for zero-shot image classification  For each image, each class is scored using the prompt "Generate alt_text in EN at 2: Photo of <extra_id_0>", scoring against all 1,000 classes with the target "<en_class_name>", where "<en_class_name>" stands for a classification label in English, such as "goldfish", "great white shark", etc.

B WEBLI DATASET DETAILS

The WebLI dataset covers about 10 billion images and 12 billion alt-texts in 109 languages. We further apply a publicly available automatic service to extract OCR annotations on all images, producing an additional 29 billion image-OCR pairs. Examples and statistics for the WebLI corpus are shown in Figure 4.

[Figure 4: The WebLI dataset. Top: sampled images1 associated with multilingual alt-text (where available) and OCR (computed using a publicly available API), with examples in English, French, Thai, and Chinese. Bottom left/middle: statistics of recognized languages from alt-text/OCR. Bottom right: image-text pair counts, compared against other large-scale vision-language datasets.]

1 The second image is by jopradier (original), used under the CC BY-NC-SA 2.0 license. Remaining images are also used with permissions.

Due to the scale of WebLI, to mitigate train-to-test leakage, we perform near de-duplication of the images against the train, validation, and test splits of 68 common vision/vision-language datasets. Eliminating these images from the WebLI dataset does not result in any significant shrinkage (0.36%), and avoids any potential leakage of examples from the pretraining setup into the downstream evaluation tasks. To improve the data quality in terms of image-text alignment, we score image and alt-text pairs based on their cross-modal similarity. This score is measured with the cosine similarity between embedding representations from each modality, computed as follows. The image embeddings are trained with a graph-based, semi-supervised representation learning approach, as described in Juan et al. (2019). The text embeddings are then learned using the frozen image embeddings, based on a contrastive approach using a Transformer encoder for the text, which forces both modality representations into the same embedding space. We tune a threshold on the image/alt-text pair scores, and retain only the top 10% best-scoring of the original WebLI image-text pairs (about 1B examples), which we use to train PaLI.
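A minimal sketch of the alt-text filtering step described above, assuming precomputed, L2-normalized image and text embeddings (the function name and the use of jax.numpy are our own illustration):

```python
import jax.numpy as jnp

def filter_top_pairs(img_emb, txt_emb, keep_fraction=0.10):
    """Keep the best-aligned image/alt-text pairs by cosine similarity.

    img_emb, txt_emb: [N, D] embeddings for N paired examples, assumed to be
    L2-normalized so the dot product equals the cosine similarity.
    Returns the indices of the retained pairs.
    """
    scores = jnp.sum(img_emb * txt_emb, axis=-1)           # per-pair cosine similarity
    threshold = jnp.quantile(scores, 1.0 - keep_fraction)  # e.g. the 90th percentile
    return jnp.where(scores >= threshold)[0]
```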
C ADDITIONAL EXPERIMENTAL RESULTS

C.1 LANGUAGE-ONLY EVALUATION

In Table 11, we evaluate the performance of PaLI on a range of language understanding benchmarks, in order to verify that the language-only capabilities of the model have been preserved. More specifically, we compare mT5-XXL and PaLI-17B, evaluating on the English-only SuperGLUE benchmark (Wang et al., 2019a), and on three multilingual benchmarks from XTREME (Hu et al., 2020): XNLI (Conneau et al., 2018), a textual entailment task covering 14 languages, and XQuAD (Artetxe et al., 2020) and TyDiQA-GoldP (Clark et al., 2020), which are question-answering tasks covering 10 and 11 languages, respectively.

Model | SuperGLUE (FT, Avg. Score) | XNLI (ZS, Accuracy) | XQuAD (ZS, F1/EM) | TyDiQA-GoldP (ZS, F1/EM)
mT5-XXL (Xue et al., 2021) | 89.2 | 85.0 | 82.5 / 66.8 | 80.8 / 65.9
mT5-XXL (our setting) | 89.3 | 84.5 | 82.6 / 66.6 | 81.6 / 66.3
PaLI-17B | 88.2 | 84.9 | 81.8 / 66.0 | 81.2 / 66.5
Table 11: Results on SuperGLUE and three XTREME tasks. The first row is the result reported in the mT5 (Xue et al., 2021) and ByT5 (Xue et al., 2022) papers. The second row is our reproduction using the publicly available mT5-XXL checkpoint, which is also the starting point for PaLI-17B. The third row uses the trained PaLI-17B model.

C.2 ADDITIONAL SCALING RESULTS

Figure 5 shows that model scaling significantly impacts performance across multiple languages. We can see that PaLI-17B improves substantially over PaLI-3B across languages. We also include a plot where, for a subset of 600 examples, we back-translate the predictions from six languages (French, Hindi, Hebrew, Romanian, Thai and Chinese) into English and compute the CIDEr score against the English references, for a better comparison to the English quality. The result shows that the captioning quality across languages is fairly consistent.

[Figure 5: PaLI scaling performance across multiple languages (see Table 2), using the Crossmodal-3600 benchmark, comparing the baseline (Thapliyal et al.), PaLI-3B (L, ViT-G) and PaLI-17B (XXL, ViT-e) on English, French, Hindi, Hebrew, Romanian, Thai, Chinese and the 35-language average. Larger-scale models are important for better performance in these languages, especially low-resource ones. (Top) CIDEr scores computed using predictions in each language. (Bottom) For the six languages French, Hindi, Hebrew, Romanian, Thai and Chinese, we sample a 600-example subset, back-translate the non-English predictions to English, and compute the CIDEr score vs. the same English references.]

We also trained a 5B PaLI model consisting of mT5-Large and ViT-e for additional datapoints. We evaluated this 5B model on two representative captioning and VQA benchmarks, COCO-Cap and OKVQA, and the results are shown in Table 12. We note that the training mixture and hyperparameters of this PaLI-5B checkpoint are slightly different from the other PaLI sizes, but the results are still indicative and supportive of our conclusions regarding the value of joint scaling.
On COCO, the improvement from PaLI-3B to PaLI-5B (+2.1 CIDEr points) is slightly smaller than the improvement from PaLI-15B to PaLI-17B (+2.8). On OKVQA, it is likely that the mT5-Large encoder-decoder cannot exploit the benefit of ViT-e as much as mT5-XXL can, since VQA tasks require stronger language-understanding capabilities than image captioning tasks. In general, it is clear that scaling ViT still has a much better return on investment (see the last column in Table 12), even for PaLI-5B where the ViT model is much larger than the encoder-decoder backbone. Note that we computed RoI as "improvement per 1B parameters", using the COCO and OKVQA numbers as performance indicators.

Model | Components | COCO-Cap @490 res | OKVQA @490 res | Improvement per 1B more params
PaLI-3B | mT5-Large & ViT-G | 145.4 | 52.4 | -
PaLI-5B | mT5-Large & ViT-e | 147.5 | 53.8 | +0.9 per 1B more ViT params (vs PaLI-3B)
PaLI-15B | mT5-XXL & ViT-G | 146.2 | 56.5 | +0.2 per 1B more mT5 params (vs PaLI-3B)
PaLI-17B | mT5-XXL & ViT-e | 149.0 | 62.4 | +2.2 per 1B more ViT params (vs PaLI-15B); +0.4 per 1B more mT5 params (vs PaLI-5B)
Table 12: Results for a 5B version of PaLI consisting of mT5-Large and ViT-e. Results on COCO-Cap and OKVQA at 490×490 are shown together with the other sizes.

C.3 ADDITIONAL ABLATIONS

Table 13 shows that initializing from unimodal checkpoints plays a critical role in PaLI's quality. Table 14 shows that freezing ViT during pretraining leads to an improvement in downstream finetuning on COCO. Table 15 shows the effect of the non-English part of the WebLI data. The table shows two comparisons for the pretraining data: 1) using only the English subset of WebLI vs. using the whole WebLI; 2) taking the non-EN part of WebLI out of the full mix vs. using the full mix. This set of comparisons is performed with a 1.5B version of the PaLI model, consisting of mT5-Large and ViT-L (with 300M parameters). This model has a similar parameter ratio (20% for ViT) compared with PaLI-17B (23%). Each model is pretrained to cover 200M examples of the data. All downstream benchmarks are fine-tuned and evaluated at 224×224 image resolution. The six non-English languages (6L) for XM-3600 are fr, hi, iw, ro, th and zh, and the seven languages (7L) for xGQA are bn, de, id, ko, pt, ru and zh, the same as those included in Table 2 and Table 4. The takeaways are as follows:

(Comparison 1, row #1 vs row #2) With only the English portion of WebLI, the model's multilingual captioning capability remains very low (as measured on XM-3600), even with further finetuning on COCO-35L. There is also a clear drop in cross-lingual VQA performance on xGQA.

(Comparison 2, row #3 vs row #4) Taking away the multilingual part of WebLI from the full mixture, which still contains other translated multilingual/cross-lingual datasets (CC3M-35L, VQ2A-CC3M-35L, VQG-CC3M-35L), still has a significant impact on XM-3600 performance. On xGQA, because of the cross-lingual training source VQ2A-CC3M-35L, the impact of removing the non-EN WebLI data is reduced but still apparent. With the non-EN WebLI data in the full mix, xGQA performance improves by +0.4 overall and is better than or equal to using only WebLI-EN in every language.

Last but definitely not least, there is an interesting result: when training with all the languages of WebLI, the model performs better on (English) COCO captions compared to training with English-only WebLI (by about +2 CIDEr points).
This suggests that 1) the multilingual WebLI may contain extra images with richer objects and descriptions compared with the English-only subset, and 2) the model may be able to exploit the shared linguistic structure across languages, benefiting from cross-lingual transfer learning.

Model | Initialization | COCO (Karp. test) | XM-3600 | TextVQA
PaLI-3B | From mT5-Large and ViT-G | 141.4 | 93.8 (EN) / 42.5 (6L) | 41.6
PaLI-3B | From scratch | 72.8 | 22.1 (EN) / 10.1 (6L) | 12.8
Table 13: Comparison between PaLI initialized from existing unimodal checkpoints and PaLI initialized from scratch. The setup is the same as for the main ablation results in Table 6.

Model | ViT during finetuning | ViT during pretraining | COCO (Karp. test)
PaLI-3B | Fine-tuned | Frozen | 139.3
PaLI-3B | Fine-tuned | Fine-tuned | 138.8
PaLI-15B | Fine-tuned | Frozen | 141.4
PaLI-15B | Fine-tuned | Fine-tuned | 140.1
Table 14: Comparison of COCO performance for frozen versus fine-tuned ViT during a short period of pretraining. In this comparison, finetuning on COCO is performed at resolution 224×224.

Pretraining Data | XM-3600 (FT on COCO-35L) | COCO-Cap | xGQA (FT on VQAv2-13L)
only WebLI-en | 86.0 (en) / 8.2 (6L) | 132.2 | 40.6 (en) / 34.0 (7L)
only WebLI | 87.2 (en) / 30.0 (6L) | 134.3 | 42.8 (en) / 38.6 (7L)
WebLI-en & rest of PaLI mix | 91.2 (en) / 39.0 (6L) | 135.3 | 44.9 (en) / 40.9 (7L)
Full PaLI mix | 92.2 (en) / 41.9 (6L) | 135.4 | 45.1 (en) / 41.3 (7L)
Table 15: Ablation studies on the effect of including the multilingual examples of WebLI on the multi-/cross-lingual benchmarks XM-3600 and xGQA. We also include the English benchmark COCO-Captions in the comparison. These comparisons are performed with a 1.5B version of the PaLI model, consisting of mT5-Large and ViT-L (with 300M parameters).

C.4 EVALUATION OF PALI'S VISUAL COMPONENT: VIT-E

Table 16 compares the ViT-e architecture with the smaller ViT-G and ViT-g architectures on vision-only and vision-language tasks. The results suggest that V&L tasks could benefit more from scaling up the vision backbone, even at the high end. In Table 17, we fine-tune the pretrained ViT-e model on the ImageNet dataset, and then report the evaluation scores on several out-of-distribution test variants: ImageNet-v2, ObjectNet, and ReaL (Beyer et al., 2020). We follow the finetuning protocol of Zhai et al. (2022a), but use a 560×560 resolution. We evaluate the fine-tuned model at 644×644 (Touvron et al., 2019) (chosen according to a held-out 2% of the training set); the results are reported in Table 17. ViT-e achieves 90.9% top-1 accuracy on ImageNet and shows clear benefits on the OOD benchmarks.

Model | INet-10 | INet-25 | COCO | VQAv2
ViT-g | 84.5 | 85.4 | - | -
ViT-G | 84.9 | 85.6 | 146.2 | 82.9
ViT-e | 85.2 | 85.8 | 149 | 83.4
Table 16: Impact of scaling ViT. For vision-only tasks, we report 10-shot and 25-shot accuracy on ImageNet. For vision-language tasks, ViT models are paired with the mT5-XXL model in PaLI and we report captioning (COCO) and VQA (VQAv2). For direct comparison, results with ViT-e on COCO and VQAv2 do not include the high-resolution phase of pretraining.

Since ViT-e is new and has not been evaluated in prior work, we evaluate its standalone performance. For this, we perform supervised fine-tuning on standard classification tasks. Additionally, we perform LiT transfer (Zhai et al., 2022b) to evaluate the frozen representation quality in a zero-shot setup. We follow LiT (Zhai et al., 2022b) to add zero-shot transfer capabilities to the (frozen) ViT-e model, the visual component of PaLI. More specifically, we tune a text encoder while the ViT image encoder is kept frozen. We use the English subset of the WebLI dataset for the text encoder training, since all evaluation tasks in Table 18 are in English.
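As an illustration of the LiT-style setup just described (a contrastive image-text loss in which only the text tower receives gradients), here is a minimal sketch; the function names, the temperature value, and the simple symmetric softmax loss are our assumptions rather than the exact recipe:

```python
import jax
import jax.numpy as jnp
import optax

def lit_contrastive_loss(text_params, frozen_image_emb, text_batch,
                         text_apply_fn, temperature=0.01):
    """Symmetric contrastive loss against a frozen (precomputed) image tower.

    frozen_image_emb: [B, D] image embeddings from the locked ViT.
    text_apply_fn:    maps (params, batch) -> [B, D] text embeddings.
    """
    img = frozen_image_emb / jnp.linalg.norm(frozen_image_emb, axis=-1, keepdims=True)
    txt = text_apply_fn(text_params, text_batch)
    txt = txt / jnp.linalg.norm(txt, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature            # [B, B] similarity matrix
    labels = jnp.arange(logits.shape[0])          # matching pairs lie on the diagonal
    loss_i2t = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    loss_t2i = optax.softmax_cross_entropy_with_integer_labels(logits.T, labels)
    return jnp.mean(loss_i2t + loss_t2i) / 2.0

# Gradients flow only into text_params; the image tower stays locked.
grad_fn = jax.grad(lit_contrastive_loss)
```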
Model | INet | INet-v2 | ObjNet | ReaL
ViT-G | 90.5 | 83.3 | 70.5 | 90.8
CoCa | 91.0 | - | - | -
ViT-e | 90.9 | 84.3 | 72.0 | 91.1
Table 17: ViT-e on ImageNet and OOD test sets.

Model | INet | INet-v2 | INet-R | INet-A | ObjNet | ReaL | VTAB-N
CLIP (Radford et al., 2021) | 76.2 | 70.1 | 88.9 | 77.2 | 72.3 | - | 73.9
ALIGN (Jia et al., 2021) | 76.4 | 70.1 | 92.2 | 75.8 | 72.2 | - | -
BASIC (Pham et al., 2021) | 85.7 | 80.6 | 95.7 | 85.6 | 78.9 | - | -
CoCa (Yu et al., 2022) | 86.3 | 80.7 | 96.5 | 90.2 | 82.7 | - | -
LiT ViT-g (Zhai et al., 2022b) | 85.2 | 79.8 | 94.9 | 81.8 | 82.5 | 88.6 | 74.7
LiT ViT-e (ours) | 85.4 | 80.6 | 96.1 | 88.0 | 84.9 | 88.4 | 76.9
Table 18: Zero-shot transfer results of ViT-e on ImageNet, OOD test sets and the VTAB-Natural datasets.

These results highlight that going from ViT-g to ViT-e provides consistently better results. Notably, LiT with ViT-e achieves 84.9% zero-shot accuracy on the challenging out-of-distribution ObjectNet test set, setting a new state-of-the-art. The VTAB-Natural benchmark (Zhai et al., 2019) consists of seven diverse natural image datasets, for which LiT also benefits from ViT-e over ViT-g. Detailed results on each VTAB-Natural task are in Appendix C.6.

We also test multilingual performance using WebLI in this setting. We further perform LiT transfer using the same multilingual WebLI dataset as used to train PaLI, and use Crossmodal-3600 to evaluate the cross-lingual image-text retrieval performance. Figure 6 shows that, for English (en), LiT ViT-e pretrained on the English subset substantially outperforms the same model pretrained on the multilingual dataset. The same observation applies to a few languages that are similar to English, e.g. Spanish (es), French (fr), Italian (it). However, the multilingual model performs much better on most other languages, especially those with a non-Latin script such as Chinese (zh), Japanese (ja), Korean (ko), and Hebrew (iw). On average (avg), the multilingual LiT ViT-e outperforms the English-only model by a large margin. More results can be found in Table 22. These results highlight the importance of having good multilingual benchmarks to measure the benefits of training models on diverse datasets such as WebLI.

[Figure 6: Zero-shot image-text retrieval results on all 36 languages of Crossmodal-3600 for LiT ViT-g, LiT ViT-e, and LiT ViT-e (multilingual). Top: image-to-text retrieval accuracy; bottom: text-to-image retrieval accuracy.]

C.5 RESULTS ON TEXTCAPS, TEXTVQA AND VIZWIZ-QA WITHOUT DETECTED OCR AS INPUT

In the main text, we presented results on TextCaps, TextVQA, VizWiz-Cap, VizWiz-QA and ST-VQA with detected OCR strings as input. Following Kil et al. (2022), we order the OCR items based on their locations in the image, from top left to bottom right. We only include the OCR strings themselves, without the OCR-item locations provided by the API. GIT2 (Wang et al., 2022a) has demonstrated strong performance without the OCR input, while PaLI-17B shows the benefit of leveraging a specialized OCR system as a recipe for solving these tasks.
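A minimal sketch of the OCR-input ordering just described, assuming each detected OCR item comes with a bounding box; sorting by the top-left corner (row first, then column) is our simple approximation of the top-left-to-bottom-right ordering:

```python
def ocr_input_string(ocr_items):
    """ocr_items: list of dicts like {"text": "seiko", "box": (ymin, xmin, ymax, xmax)}.

    Returns the concatenated OCR strings ordered from top-left to bottom-right;
    only the strings are kept, the box coordinates are dropped from the input.
    """
    ordered = sorted(ocr_items, key=lambda item: (item["box"][0], item["box"][1]))
    return " ".join(item["text"] for item in ordered)

example = [
    {"text": "arsenaldirect.com", "box": (0.60, 0.10, 0.65, 0.40)},
    {"text": "ARSENAL", "box": (0.20, 0.30, 0.30, 0.70)},
]
print(ocr_input_string(example))  # "ARSENAL arsenaldirect.com"
```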
Table 19 shows the results on TextCaps, TextVQA and VizWiz-QA without the detected OCR strings as input. PaLI's performance drops slightly without OCR input, while remaining close to the first version of GIT. This result may suggest that the significantly larger vocabulary of PaLI adds further difficulty to OCR string generation. However, for VizWiz-QA, PaLI establishes SOTA performance even without OCR input.

Method | OCR input? | TextCaps (test) | TextVQA (test) | VizWiz-QA (test-dev) | VizWiz-QA (test-std)
TAP (Yang et al., 2021) | Yes | 103.2 | 53.97 | - | -
GIT | No | 138.2 | 59.75 | 68.0 | 67.5
GIT2 | No | 145.0 | 67.27 | 71.0 | 70.1
PaLI | No | 135.4 | 58.80 | 71.6 | 70.7
PaLI | Yes | 160.4 | 73.06 | 74.4 | 73.3
Table 19: Results on TextCaps, TextVQA and VizWiz-QA with and without detected OCR as input for PaLI.

C.6 DETAILED VTAB RESULTS

For the VTAB benchmark (Zhai et al., 2019), we follow the methodology outlined in Zhai et al. (2022b). PaLI sets a new state-of-the-art zero-shot performance for the natural subset (see Table 20).

Model | Caltech101 | CIFAR-100 | DTD | Flowers102 | Pets | Sun397 | SVHN | Mean
CLIP | 92.8 | 77.5 | 55.7 | 78.3 | 93.5 | 68.4 | 51.0 | 73.9
LiT ViT-g | 79.2 | 83.6 | 66.6 | 92.3 | 97.7 | 76.0 | 27.5 | 74.7
LiT ViT-e | 79.8 | 90.4 | 68.8 | 91.2 | 98.1 | 76.3 | 33.8 | 76.9
Table 20: Accuracies for zero-shot evaluation on the different VTAB natural tasks, and the average over these tasks. Note that CLIP uses OCR for the SVHN task (as opposed to LiT and PaLI, which do not use OCR).

C.7 TOP-5 ACCURACY ON ZERO-SHOT IMAGENET DATASETS

Table 21 shows the top-5 accuracy results for zero-shot evaluation on the ImageNet datasets.

Model | INet | INet-R | INet-A | INet-sketch | INet-v2 | ObjNet
PaLI-3B | 84.31 | 90.05 | 55.04 | 76.47 | 78.49 | 53.71
PaLI-15B | 84.78 | 90.91 | 59.00 | 76.81 | 79.54 | 55.29
PaLI-17B | 86.18 | 91.51 | 62.72 | 79.30 | 80.71 | 58.35
Table 21: Top-5 accuracy results of zero-shot image classification on ImageNet (Deng et al., 2009), ImageNet-R (Hendrycks et al., 2021a), ImageNet-A (Hendrycks et al., 2021b), ImageNet-Sketch (Wang et al., 2019b), ImageNet-v2 (Recht et al., 2019) and ObjectNet (Barbu et al., 2019).

C.8 MORE ZERO-SHOT IMAGE-TEXT RETRIEVAL RESULTS ON CROSSMODAL-3600

Table 22 shows more zero-shot image-text retrieval results on Crossmodal-3600.

D MODEL FAIRNESS, BIASES, AND OTHER POTENTIAL ISSUES

Models trained on web data are at risk of being biased or unfair due to biases in that data. A first step towards addressing those risks is being transparent about their existence, and then measuring them.
Language | I2T LiT ViT-g | I2T LiT ViT-e | I2T LiT ViT-e (multi) | T2I LiT ViT-g | T2I LiT ViT-e | T2I LiT ViT-e (multi)
ar | 5.28 | 26.58 | 39.69 | 2.80 | 18.46 | 32.60
bn | 0.00 | 0.11 | 5.67 | 0.00 | 0.06 | 3.31
cs | 18.19 | 39.25 | 44.03 | 11.24 | 27.35 | 35.24
da | 26.44 | 48.92 | 50.75 | 14.07 | 34.43 | 38.48
de | 37.83 | 58.42 | 58.53 | 23.61 | 43.25 | 46.50
el | 1.56 | 13.47 | 29.03 | 0.39 | 5.46 | 20.92
en | 51.22 | 51.78 | 42.11 | 46.24 | 47.07 | 40.63
es | 41.81 | 57.50 | 55.22 | 30.29 | 47.71 | 46.55
fa | 3.78 | 18.39 | 44.50 | 1.57 | 10.74 | 35.58
fi | 14.14 | 29.42 | 32.64 | 6.59 | 16.91 | 21.80
fil | 10.94 | 16.39 | 15.53 | 4.18 | 8.66 | 10.04
fr | 38.28 | 57.06 | 52.61 | 28.02 | 45.20 | 43.47
hi | 0.47 | 7.33 | 13.14 | 0.08 | 2.90 | 7.42
hr | 15.86 | 34.47 | 38.31 | 8.80 | 22.72 | 29.55
hu | 15.11 | 31.17 | 44.67 | 8.45 | 20.52 | 35.49
id | 24.11 | 43.72 | 46.33 | 12.99 | 32.08 | 36.75
it | 39.69 | 57.47 | 54.53 | 27.07 | 46.79 | 44.76
iw | 1.75 | 9.11 | 38.67 | 0.86 | 3.99 | 29.39
ja | 3.61 | 11.67 | 35.47 | 1.20 | 4.91 | 27.24
ko | 1.78 | 6.00 | 36.11 | 0.35 | 3.14 | 25.95
mi | 0.58 | 0.92 | 0.33 | 0.19 | 0.30 | 0.22
nl | 37.47 | 51.67 | 52.14 | 27.26 | 44.08 | 43.79
no | 26.53 | 49.69 | 49.17 | 14.61 | 35.59 | 37.35
pl | 19.67 | 42.03 | 51.42 | 12.00 | 31.13 | 43.72
pt | 33.92 | 50.81 | 49.19 | 23.58 | 42.97 | 42.73
quz | 5.08 | 6.83 | 4.31 | 1.85 | 1.89 | 1.90
ro | 17.94 | 30.08 | 37.75 | 10.15 | 20.06 | 28.82
ru | 12.00 | 26.22 | 50.64 | 5.76 | 17.19 | 41.11
sv | 25.50 | 51.00 | 53.22 | 15.11 | 38.80 | 40.66
sw | 4.47 | 7.75 | 6.42 | 1.58 | 4.17 | 3.41
te | 0.06 | 0.03 | 1.92 | 0.03 | 0.03 | 1.42
th | 1.89 | 7.22 | 22.00 | 0.79 | 3.71 | 16.06
tr | 10.72 | 31.28 | 39.50 | 4.73 | 20.42 | 31.47
uk | 7.67 | 19.94 | 39.53 | 3.38 | 10.40 | 30.81
vi | 3.08 | 11.44 | 27.08 | 0.98 | 6.22 | 21.28
zh | 4.53 | 11.11 | 33.61 | 1.67 | 5.60 | 28.24
avg | 15.64 | 28.23 | 35.99 | 9.79 | 20.14 | 28.46
Table 22: Image-to-text (I2T) and text-to-image (T2I) zero-shot retrieval results on all 36 languages of Crossmodal-3600. Models are trained following the LiT (Zhai et al., 2022b) method with different visual backbones (ViT-g or ViT-e) and datasets (English or multilingual).

To this end, we add a data card (Pushkarna et al., 2022) for WebLI and a model card (Mitchell et al., 2019) for PaLI in Appendix G and F, respectively. To understand the demographic properties of the data, we sample 112,782 examples (0.001% of the full dataset, randomly sampled due to the limitations of the labeling tool, described next) and analyze both the images and texts of the sampled data with the Know Your Data (KYD) tool. We use KYD to analyze the perceived gender presentation of image subjects (Schumann et al., 2021) along with gender expressed through pronouns in the text. In the sampled images, 54% of people appear feminine-presenting and 46% masculine-presenting. In the sampled text, female pronouns (e.g., she, her) are used 30% of the time, male pronouns (e.g., he, him) 38% of the time, and they or them (either singular or plural) 31% of the time. We also analyze the perceived age of individuals appearing in the sampled images, resulting in the distribution displayed in Figure 7. We consider all the effort above a first step, and know that it will be important to continue to measure and mitigate bias as we apply our model to new tasks. Deeper analysis will include the study of the model's recognition capabilities and potential biases observed towards specific attributes, e.g. related to gender, age, etc., and how scaling affects these observations.

[Figure 7: The distribution of perceived ages (<13, 13-18, 18-35, 35-50, 50-65, 65+) recognized from the sampled images of WebLI.]

E LIMITATIONS

Despite good performance, our model has a number of limitations.
For example, the model might not describe a complex scene with many objects very thoroughly, because most of the source data does not have such detailed annotations. We have tried to mitigate this with the object-aware and localization-aware queries added to the data. We also noticed that some of the multilingual capabilities are lost when fine-tuning on English-only data, which is consistent with the fine-tuning behavior of other models. Ideally, these models should be fine-tuned on a mix of multiple datasets, including multilingual ones.

There are also limitations related to the evaluation procedures of the benchmarks. Since we are evaluating in the open-vocabulary generative setting, for example in VQA, the model might generate a correct response which is a synonym or a paraphrase of the target response and does not match the target exactly. In these cases the answer is counted as incorrect. Fixed-vocabulary approaches do not suffer from these issues, but are limited in generalization beyond the answers of a specific dataset. Further, in terms of evaluation, some benchmarks might need more comprehensive strategies to avoid evaluations with Western-centric bias. Multilingual models and benchmarks are a first step in that direction.

F PALI MODEL CARD

Following Mitchell et al. (2019), we present the PaLI model card in Table 23.

Model Summary
Model Architecture: PaLI is a multimodal sequence-to-sequence Transformer (Vaswani et al., 2017) model derived from the T5 (Raffel et al., 2020) encoder-decoder architecture. It takes text tokens and ViT (Dosovitskiy et al., 2021) dense image embeddings as inputs to an encoder and autoregressively predicts discrete text tokens with a decoder.
Input(s): A pair of image and text.
Output(s): Generated text.

Usage
Application: The model is a research prototype and the current version is not available to the public.
Known Caveats: No.

System Type
System Description: This is a standalone model.
Upstream Dependencies: No.
Downstream Dependencies: No.

Implementation Frameworks
Hardware & Software: Hardware: TPU v4 (Jouppi et al., 2020). Software: T5X (Roberts et al., 2022), JAX (Bradbury et al., 2018), Flaxformer (Heek et al., 2020). Details are reported in Section A.1.
Compute Requirements: Reported in Section A.1.

Model Characteristics
Model Initialization: The model is initialized from pre-trained language (mT5) (Xue et al., 2021) and Vision Transformer (ViT) (Zhai et al., 2022a; Dosovitskiy et al., 2021) checkpoints.
Model Status: This is a static model trained on an offline dataset.
Model Stats: The largest PaLI model has 17B parameters, consisting of a 13B-parameter mT5-XXL model and a 4B-parameter ViT-e model. We have also trained 3B and 15B parameter models.

Data Overview
Training Dataset: The model is pre-trained on the following mixture of datasets: WebLI (Table 24), CC3M-35L (Sharma et al., 2018), VQ2A-CC3M-35L (Changpinyo et al., 2022a), Open Images (Kuznetsova et al., 2020), Visual Genome (Krishna et al., 2017) and Objects365 (Shao et al., 2019). Details are reported in Section A.2.
Evaluation and Fine-tuning Datasets:
Vision + language tasks:
Image captioning (English): COCO (Chen et al., 2015), NoCaps (Agrawal et al., 2019), TextCaps (Sidorov et al., 2020)
Image captioning (multilingual): Crossmodal-3600 (Thapliyal et al., 2022)
Visual question answering (English): VQAv2 (Goyal et al., 2017), OKVQA (Gui et al., 2021), TextVQA (Singh et al., 2019), VizWiz-QA (Gurari et al., 2018)
Visual question answering (multilingual): xGQA (Pfeiffer et al., 2022), MaXM (Changpinyo et al., 2022b)
Vision-only tasks:
Image classification (fine-tuning): ImageNet (Deng et al., 2009), ImageNet-V2 (Recht et al., 2019), ObjectNet (Barbu et al., 2019), ReaL (Beyer et al., 2020)
Image classification (zero-shot): ImageNet (Deng et al., 2009), ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2021a), ImageNet-A (Hendrycks et al., 2021b), ImageNet-Sketch (Wang et al., 2019b), ObjectNet (Barbu et al., 2019), ReaL (Beyer et al., 2020), VTAB (Zhai et al., 2019)
Language-only tasks:
Natural language inference (English): SuperGLUE (Wang et al., 2019a)
Natural language inference (multilingual): XNLI (Conneau et al., 2018)
Question answering (multilingual): XQuAD (Artetxe et al., 2020), TyDiQA (Clark et al., 2020)

Evaluation Results
Evaluation Results: Reported in Section 4.

Model Usage & Limitations
Sensitive Use: The model is capable of open-ended text generation. This model should not be used for any of the unacceptable language model use cases, e.g., generation of toxic speech.
Known Limitations: Reported in Section E.
Ethical Considerations & Risks: Reported in Section D.
Table 23: PaLI model card.

G WEBLI DATASHEET

Following Gebru et al. (2021), we present the WebLI datasheet in Table 24.

Motivation
For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset? The dataset was created to support vision-language research, such as large-scale pre-training for image understanding, image captioning, visual question answering, object detection, etc.
Any other comments? No user data is included in the data source. Personally identifiable and privileged data are filtered out during the dataset construction.

Composition
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Each instance is presented as an image and associated texts (alt-text, page title and OCR) collected from the web.
How many instances are there in total (of each type, if appropriate)? There are 9,624,017,440 instances in total (about 260 TB in size).
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? The dataset is built from public web pages. It is not a complete set but rather a subset of the publicly available image-text pairs.
What data does each instance consist of? Each instance consists of 20+ features. Most features are from public web pages; a few are from publicly available automatic services. The primary features are the image pixels and the associated texts, including alt-text, page title and OCR. Other features include rich image and page meta information (e.g. URL, MIME type) and filter signals (attached to alt-text only).
Is there a label or target associated with each instance? Is any information missing from individual instances? Are relationships between individual instances made explicit?
There are no relationships between individual instances.
Are there recommended data splits? There is only one split, containing all the instances of the dataset.
Are there any errors, sources of noise, or redundancies in the dataset? The dataset is built from the web with only a few filters applied. The data is noisy, and redundant images or texts may exist.
Is the dataset self-contained, or does it link to or otherwise rely on external resources? The dataset is self-contained.
Does the dataset contain data that might be considered confidential?
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? The dataset likely contains data that might be considered offensive, insulting or threatening, as the data is collected from the web. We use algorithmic methods and classifiers to remove sensitive / personally identifiable information (PII) / pornographic images.

Collection Process
How was the data associated with each instance acquired? Images, alt-text and meta information are from the public web. Text language identification and OCR annotation are done via publicly available automatic services.
What mechanisms or procedures were used to collect the data? The data was collected using a variety of pipelines, software programs and publicly available automatic services to extract and filter images and texts.
If the dataset is a sample from a larger set, what was the sampling strategy? The dataset is built from a subset of public web pages.
Over what timeframe was the data collected?
Were any ethical review processes conducted?

Preprocessing, cleaning, and labeling
Was any preprocessing, cleaning, or labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? The dataset is not annotated. Empty texts and texts (alt-text, page title and OCR) which are identified as PII are excluded. Images identified as having adult content, with improper shape, or with too many paired texts are excluded.
Is the software used to preprocess, clean, or label the instances available?

Uses
Has the dataset been used for any tasks already? Yes, we use the dataset for pre-training the PaLI models.
Is there a repository that links to any or all papers or systems that use the dataset?
What (other) tasks could the dataset be used for? Vision-only tasks (image classification, object detection, etc.), language-only tasks (question answering, natural language inference, etc.) and vision+language tasks (image captioning, visual question answering, image-text retrieval, etc.).
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? The dataset is in a stable version and will be refreshed in the future to follow data policies.
Are there tasks for which the dataset should not be used? The dataset should not be used for training any of the unacceptable vision, language or vision-language model use cases, e.g., generation of toxic captions or inappropriate images.

Distribution
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
Table 24: WebLI datasheet.