Published as a conference paper at ICLR 2024

JOINTLY TRAINING LARGE AUTOREGRESSIVE MULTIMODAL MODELS

Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oguz
Politecnico di Torino, Meta AI
Work done as an intern at Meta AI.

ABSTRACT

In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.

1 INTRODUCTION

Autoregressive text-to-image models, as exemplified by works such as Yu et al. (2023; 2022), have made remarkable strides in generating highly detailed images, paralleling the achievements of diffusion models (Nichol et al., 2022; Ramesh et al., 2022; Rombach et al., 2022). These models bear architectural resemblance to Large Language Models (LLMs), yet their training regimen is tailored to paired image-text data. LLMs, on the other hand (Brown et al., 2020; Zhang et al., 2022; Touvron et al., 2023), are limited to text-based output, thus lacking multimodal generative capabilities despite their proficiency in textual tasks.

The subfield of multimodal large models has emerged in recent years (Tsimpoukelli et al., 2021; Alayrac et al., 2022; Li et al., 2022a) in the quest to bring together the disparate strengths of vision and language models. Despite important advances in this direction, these models still predominantly generate a single modality, thereby constraining their expressiveness. This study aspires to break this limitation by developing a multimodal model capable of generating integrated text and image outputs.

To achieve this objective, we conduct a comprehensive empirical investigation into the fusion of two specialized autoregressive, decoder-only, large transformer models, each designed for a distinct task (one text-to-image model and one text-only model). We introduce a set of methods under the umbrella of the Joint Autoregressive Mixture (JAM) framework. In building this framework, we take advantage of the inherent architectural compatibility of autoregressive text-to-image models with LLMs, allowing us to perform deep model fusion and joint training in ways that would otherwise not be possible. Our modular and data-efficient solution allows for deep, rapid, and effective integration of continually evolving large models, using less than 1% of the original pretraining data of the two parent models.

Our contributions are twofold. First, we establish the feasibility of blending autoregressive text-to-image models with LLMs into a unified architecture that retains the core strengths of each while revealing new, emergent capabilities. Second, we present innovative strategies for multimodal instruction tuning, utilizing text-based instructions and a custom-curated dataset designed explicitly for image generation.
The result is a first-of-its-kind large multimodal model which can coherently generate long-form content with interleaved text and images.

[Figure 1: Selected sample generated by our instruction-tuned JAM-Cross model (a how-to answer to "How can I make homemade jam?", with an interleaved image). The model generates complex mixed-modal outputs with coherent alignment between generated text and images.]

To tackle the challenge of creating a unified model that excels at vision-language generative tasks, we propose to combine two autoregressive decoder-only architectures. Our primary image-text model is CM3leon (Yu et al., 2023), trained on 2.4T image-text caption tokens. In contrast, our LLM (Molybog et al., 2023), which uses the same architecture, has been trained on 1.4T text tokens. Both models have 7B parameters; we provide additional architectural details in Section 3.1. Our overall methodology develops in two stages. In the first stage (Sect. 2.1), we combine and align the models. In the second stage (Sect. 2.2), we explore new directions for instruction tuning focused on interleaved image-text generation.

2.1 CONTINUED PRETRAINING

We combine the two pretrained models into a singular, cohesive structure in our proposed framework.
This composite model is fine-tuned on a hybrid dataset comprising both text-only and image-text samples during our continued pretraining phase. The central motivation behind this approach is to seamlessly merge the capabilities of the two pretrained models, capitalizing on the unique strengths of each. The training procedure is data-efficient: whereas the original models were pretrained on trillions of tokens, our procedure uses only 50 billion tokens, corresponding to 1.3% of the total data used during pretraining.

2.1.1 MODEL MERGING

The concept of model merging has previously been used to combine models that share identical optimization trajectories (Kaddour et al., 2022), or models that are trained on identical datasets but optimized independently (for instance, Matena & Raffel (2022); Wortsman et al. (2022); Ainsworth et al. (2022)). A consistent approach across these studies is to combine models without any training. Our approach diverges from this convention: we view the merged model as a powerful initialization for subsequent training on mixed-modal data. The weights of the averaged model are defined as:

$$\theta_{\text{average}} = \tfrac{1}{2}\left(\theta_{\text{llm}} + \theta_{\text{img}}\right) \qquad (1)$$

where $\theta_{\text{llm}}$ and $\theta_{\text{img}}$ represent the weights of the LLM and the text-to-image model, respectively. In this study, we explore weight merging specifically for multimodal decoder-only large transformer models and, notably, at an unprecedented scale, involving models trained on trillions of tokens from diverse datasets. In the following sections, we refer to our averaged model as JAM-Uniform.

2.1.2 WIDTH CONCATENATION

Our second approach employs the pretrained weights to initialize a wider architecture. The new model has hidden dimension $d_{\text{joint}} = 8192$, double that of the two original models ($d_{\text{llm}} = d_{\text{img}} = 4096$). We keep the same number of layers as the original architectures. The resulting architecture has 26B parameters, initialized from the pretrained weights of our backbones. The token embedding input/output projections and the learned positional embeddings of the two initial models are concatenated along the hidden dimension. The attention weights (e.g., the query projection) $W_{q,\text{combined}} \in \mathbb{R}^{d_{\text{joint}} \times d_{\text{joint}}}$ are initialized as:

$$W_{q,\text{combined}} = \begin{bmatrix} W_{q,\text{llm}} & W_{q,\text{llm}} \\ W_{q,\text{img}} & W_{q,\text{img}} \end{bmatrix} \qquad (2)$$

where $W_{q,\text{llm}}, W_{q,\text{img}} \in \mathbb{R}^{d_{\text{llm}} \times d_{\text{llm}}}$ represent the weights of the query projection of a generic attention layer. All the other weights (FFNs and output projections) are initialized following the same logic. We also experiment with a slight variation of the approach:

$$W_{q,\text{combined}} = \begin{bmatrix} W_{q,\text{llm}} & W_{q,\text{average}} \\ W_{q,\text{img}} & W_{q,\text{average}} \end{bmatrix} \qquad (3)$$

Instead of copying the two models' parameters, we use their average to initialize half of the new parameters. We name the resulting model JAM-Width.
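The two initialization schemes can be made concrete with a short sketch. The snippet below is a minimal illustration only, assuming the parent models expose PyTorch state dicts with matching parameter names and square 4096-dimensional projection weights; the helper names (`jam_uniform_init`, `widen_square`) are ours and not part of any released code.

```python
import torch

def jam_uniform_init(llm_sd, img_sd):
    """JAM-Uniform: element-wise average of the two parent state dicts (Eq. 1)."""
    return {k: 0.5 * (llm_sd[k] + img_sd[k]) for k in llm_sd}

def widen_square(w_llm, w_img, use_average=False):
    """JAM-Width: build a (2d x 2d) block matrix from two (d x d) weights (Eqs. 2-3).

    With use_average=False the parent weights are copied into both column blocks;
    with use_average=True the second column block holds their average instead.
    """
    right_llm = 0.5 * (w_llm + w_img) if use_average else w_llm
    right_img = 0.5 * (w_llm + w_img) if use_average else w_img
    top = torch.cat([w_llm, right_llm], dim=1)      # (d, 2d)
    bottom = torch.cat([w_img, right_img], dim=1)   # (d, 2d)
    return torch.cat([top, bottom], dim=0)          # (2d, 2d)

# Example with the paper's dimensions: d_llm = d_img = 4096 -> d_joint = 8192.
d = 4096
w_llm, w_img = torch.randn(d, d), torch.randn(d, d)
w_joint = widen_square(w_llm, w_img)
assert w_joint.shape == (2 * d, 2 * d)
```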
2.1.3 CROSS MODEL FUSION

We propose to embed cross-attention layers between the foundational models to facilitate seamless information interchange while preserving the original models' knowledge. Given two decoder-only transformer models $T_{\text{llm}}$ and $T_{\text{img}}$, we introduce a bidirectional cross-attention mechanism that enables the layers of one model to attend to the output of the corresponding layer of the other model. This approach allows for a progressive exchange of information at different representation levels. For a specific layer $l$, let the models produce sequences of hidden states $H_{\text{llm},l}$ for $T_{\text{llm}}$ and $H_{\text{img},l}$ for $T_{\text{img}}$, where these hidden states are the outputs of layer $l$. The output of the cross-attention mechanism $H_{\text{cross},l}$ in the direction $T_{\text{img}} \rightarrow T_{\text{llm}}$ for a given layer is evaluated as:

$$Q_{\text{cross},l} = W_{q,l} H_{\text{llm},l-1}, \qquad K_{\text{cross},l} = W_{k,l} H_{\text{img},l-1}, \qquad V_{\text{cross},l} = W_{v,l} H_{\text{img},l-1} \qquad (4)$$

$$H_{\text{cross},l} = \text{Softmax}\!\left(\frac{Q_{\text{cross},l} K_{\text{cross},l}^{T}}{\sqrt{d_k}}\right) V_{\text{cross},l} \qquad (5)$$

where $W_q$, $W_k$, $W_v$ represent the query, key, and value projection weights of the newly inserted cross-attention layers. A symmetric process is applied for the reverse direction $T_{\text{llm}} \rightarrow T_{\text{img}}$. We use a shared input-output projection layer, initializing the weights of the text tokens from the LLM input embedding and the weights of the image tokens from the image-text model. We insert a new linear projection layer that takes the concatenation of the two models' output embeddings as input.

[Figure 2: JAM-Cross architecture overview. The cross-attention blocks are interleaved between the original LLM blocks and the Text-Image blocks, and the output embeddings of the two branches are concatenated and then projected to the output embedding dimension.]

Figure 2 illustrates a schematic of our model configuration. We refer to the model resulting from this approach as JAM-Cross. Additional architectural details and the underlying design choices can be found in Sect. 3.1. The ablation study for the optimal frequency of inserting new layers is presented in Sect. 3.3.
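A minimal sketch of one direction of this fusion (the $T_{\text{img}} \rightarrow T_{\text{llm}}$ cross-attention of Eqs. 4-5) is given below. It uses a stock PyTorch attention module and omits details left to Sect. 3.1 (causal masking, residual connections, and how the cross output is combined with the branch's own self-attention); the class and argument names are ours, not the paper's.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One direction of the JAM-Cross fusion (Eqs. 4-5): the LLM branch attends
    to the image-text branch's hidden states from the previous layer. A symmetric
    copy of this block handles the reverse direction."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h_llm_prev, h_img_prev):
        # Queries come from the LLM branch, keys/values from the image-text branch.
        h_cross, _ = self.attn(query=h_llm_prev, key=h_img_prev, value=h_img_prev)
        return h_cross

# Toy usage with small dimensions (the real branches use 4096-dimensional states).
block = CrossAttentionBlock(d_model=64, n_heads=4)
h_llm = torch.randn(2, 10, 64)   # (batch, seq, hidden) from the LLM branch
h_img = torch.randn(2, 10, 64)   # (batch, seq, hidden) from the image-text branch
out = block(h_llm, h_img)        # cross-attended states fed into the next LLM layer
```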
2.2 MULTIMODAL CONVERSATIONAL INSTRUCT TUNING

Supervised fine-tuning is a fundamental tool for leveraging the abilities of large pretrained models. Recently, instruction tuning has been extended to the multimodal setting (Liu et al., 2023; Dai et al., 2023); however, all existing approaches focus on visual understanding abilities. In this work, we study instruction tuning tailored to interleaved image-text generation. We collect a small, curated mixed-modal dataset to teach our JAM model to support textual explanations with coherent images. Since in the first stage the model has been trained on image-text captions and text-only data, we train on interleaved image-text data during this phase. Our approach is inspired by the Superficial Alignment Hypothesis from LIMA (Zhou et al., 2023), which posits that a model's foundational knowledge and skills are learnt almost entirely during pretraining, and that instruction tuning then guides the model toward the appropriate subdistribution of formats used when interacting with users. Our results demonstrate that the model can quickly learn the style of images and text from a small curated dataset, suggesting that the LIMA hypothesis holds not only for learning text style but also for images. In our experiments, we consider two slightly different instruction-tuning settings: in one of them, we additionally introduce a small portion of the image-text Shutterstock data with retrieval augmentation, and we find this beneficial for preserving generated image quality when generating with retrieval augmentation. Sect. 3 presents a comparison between these two strategies. We train using a standard supervised procedure, without leveraging any reinforcement learning or human-preference strategy. In this instruction-tuning phase we leverage interleaved image-text data, in contrast to previous methods (Koh et al., 2023a) that rely only on image-text captions and no instruction tuning; our experimental results confirm the benefits of training with interleaved image-text data.

3 EXPERIMENTS

3.1 EXPERIMENTAL DETAILS

Tokenizers. For images, we use the VQ-VAE tokenizer from Gafni et al. (2022). The image resolution is set to 256 × 256, each image is represented by 1024 tokens, and the vocabulary has a size of 8192. Our text tokenizer is the same one used to train the two parent models, trained over the text data of Zhang et al. (2022). We add the extra token used by CM3leon to mark a modality break.

Image-Text Autoregressive Model. We adopt CM3leon as the image-text autoregressive backbone. The model has a standard decoder-only architecture with some peculiarities: no bias terms, dropout, or learnable parameters for layer norms. It has been trained on 2.4T image-text tokens and uses a sequence length of 4096.

LLM. As an LLM backbone, we select a model with the same architecture as CM3leon, trained in Molybog et al. (2023); this allows us to experiment with a broader range of approaches, such as weight averaging and width concatenation. The model is trained on 1.4T text tokens with a 2048 context length, and we further fine-tune it with a 4096 context length using only 30B text tokens.

Objective. In all our experiments, we employ the CM3 objective introduced in Aghajanyan et al. (2022); this objective either accepts the original sequence as input or transforms it into an infilling instance by masking specific spans and relocating them to the end of the document. The model is then optimized to minimize the standard autoregressive loss $-\log p(x_{\text{input}})$. This objective allows for optional bidirectionality and increases the versatility of the model, which can be used both for infilling and for standard autoregressive generation. We prevent the objective from masking across the modality tokens.
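The infilling transformation can be illustrated with a toy sketch. The snippet below is a simplified, assumption-laden rendering of the CM3-style rearrangement: the span-sampling distribution, sentinel format, and the modality-boundary constraint mentioned above are not faithful to the actual implementation, and the function and sentinel names are ours.

```python
import random

def cm3_infilling(tokens, mask_token="<mask:{}>", n_spans=1, max_span=5):
    """Toy version of the CM3 transformation: cut out random spans, replace each
    with a sentinel, and append the removed spans (prefixed by their sentinel)
    to the end of the sequence. The model is then trained with the ordinary
    left-to-right loss on the rearranged sequence."""
    tokens = list(tokens)
    removed = []
    for i in range(n_spans):
        span_len = random.randint(1, max_span)
        start = random.randint(0, max(0, len(tokens) - span_len))
        sentinel = mask_token.format(i)
        removed.append([sentinel] + tokens[start:start + span_len])
        tokens[start:start + span_len] = [sentinel]
    for span in removed:
        tokens += span
    return tokens

print(cm3_infilling(["a", "photo", "of", "a", "dog", "on", "grass"]))
# e.g. ['a', 'photo', '<mask:0>', 'on', 'grass', '<mask:0>', 'of', 'a', 'dog']
```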
Retrieval Augmentation. We employ the multimodal retrieval augmentation introduced in Yasunaga et al. (2022) for our training procedure. The retrieval procedure employs a dense retriever $r$, a memory bank $\mathcal{M}$, and a specifically selected retrieval strategy. The retriever takes an input query $x$ and returns a relevance score $r(x, m)$ for each candidate document $m \in \mathcal{M}$. Each multimodal document is split between text and images, which are fed to the corresponding modality-specific ViT-B-32 CLIP encoders (Radford et al., 2021). The two embeddings are then averaged to form the document's vector representation. We then use Maximum Inner Product Search (MIPS) over the memory bank to obtain a list of candidates. When sampling retrieved documents, we prioritize the diversity of the sampled documents by skipping candidates whose score $r(x, m)$ exceeds 0.9. We apply query dropout to regularize the training, dropping 20% of the tokens from the input sequence $x$.

[Figure 3: Samples generated by our JAM-Cross instruct-tuned model (left: generated without retrieval augmentation; right: generated with retrieval augmentation). The prompts are "What is the best way to start practicing yoga?" and "How can I plan my trip to Hawaii?", answered with interleaved text and images.]
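The scoring and diversity-filtering step can be sketched as follows, assuming document and query embeddings have already been produced (and L2-normalized) by the CLIP encoders. The function names, the NumPy-based search, and the toy dimensions are ours; only the 0.9 skip threshold is taken from the text.

```python
import numpy as np

def document_embedding(text_emb, image_emb):
    """Mixed-modal document vector: average of the CLIP text and image embeddings."""
    v = (text_emb + image_emb) / 2.0
    return v / np.linalg.norm(v)

def retrieve(query_emb, memory, k=2, skip_threshold=0.9):
    """Maximum Inner Product Search over the memory bank, skipping near-duplicate
    candidates whose relevance score r(x, m) exceeds the threshold."""
    scores = memory @ query_emb              # inner products with every document
    order = np.argsort(-scores)              # best candidates first
    picked = []
    for idx in order:
        if scores[idx] >= skip_threshold:    # too similar to the query: skip for diversity
            continue
        picked.append(int(idx))
        if len(picked) == k:
            break
    return picked

# Toy memory bank of 5 documents with 4-dim embeddings (CLIP embeddings are larger).
rng = np.random.default_rng(0)
memory = rng.normal(size=(5, 4))
memory /= np.linalg.norm(memory, axis=1, keepdims=True)
query = memory[0]                            # a query identical to document 0
print(retrieve(query, memory))               # document 0 is skipped (score 1.0 > 0.9)
```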
Training - Alignment Phase. During the continued pretraining, we train for approximately 50B multimodal tokens. Our initial learning rate is $3 \times 10^{-5}$ and we use 500 warm-up steps. We set our batch size to 8M tokens; this hyperparameter is borrowed from the mixed-modal scaling laws introduced in Aghajanyan et al. (2023). The total number of training steps is 5960. This training procedure takes approximately one day on 256 80GB A100s for all models. We select the last checkpoint for all the different JAM models, which is always the one with the lowest average validation perplexity (PPL).

Training - Instruct Tuning. Our instruct-tuning procedure is data-efficient: we train on our instruction-tuning mixed corpora with an initial learning rate of $1 \times 10^{-5}$, 300 warm-up steps, and a batch size of 1M tokens. The instruction-tuning procedure takes less than 2 hours on 64 80GB A100s; we train for 15 epochs over our mixture of datasets and manually select the best checkpoint, corresponding to the 9th epoch. In line with Zhou et al. (2023), we notice that the validation PPL does not correlate with the quality of the responses.

Decoding Strategies. We implement a mixed-modal decoding strategy for our interleaved generation. The model generates text tokens until a modality token is detected, at which point an image is sampled; the generation process continues iteratively, alternating the two modalities, until an end-of-sequence token is sampled. As a result, our model is able to generate free-form multimodal documents. We employ temperature sampling, a common technique in autoregressive models (e.g., Ramesh et al. (2022)), to control the randomness of the prediction by modifying the softmax temperature $\tau$. We pair this technique with top-p sampling, introduced in Holtzman et al. (2019), which samples from the top-ranked tokens whose cumulative probability exceeds a predefined threshold $\tau_P$. We also employ classifier-free guidance (CFG; Gafni et al. (2022)) for sampling images. This technique conditions the sampling procedure by blending the logits from an unconditional sample with the logits from a conditional sample:

$$\text{logits}_{\text{cf}} = \text{logits}_{\text{uncond}} + \alpha_c\,(\text{logits}_{\text{cond}} - \text{logits}_{\text{uncond}}) \qquad (6)$$

where $\text{logits}_{\text{cond}} = T(t_y \mid t_x)$ and $\text{logits}_{\text{uncond}} = T(t_y \mid \langle\text{mask}\rangle)$; $T$ represents the transformer model, $\langle\text{mask}\rangle$ represents the absence of the input text, $t_x$ are the conditional input tokens, $t_y$ are the output tokens, and $\alpha_c$ is the scaling factor for CFG. Thanks to the CM3 objective, our training procedure allows our models to sample with CFG without further fine-tuning. Inspired by Yu et al. (2023), we use this technique to boost the generation quality. Our samples are generated with a temperature $\tau = 1$, $\tau_P$ set between 0.8 and 1, and classifier-free guidance values of 3.5 and 4. In contrast to other approaches, we do not make use of computationally expensive CLIP re-ranking (Ramesh et al., 2021; Yu et al., 2022; Gafni et al., 2022) or contrastive decoding (Li et al., 2022b; Yu et al., 2023).
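The three sampling knobs can be combined in a short sketch. The snippet below blends conditional and unconditional logits as in Eq. 6, then applies temperature scaling and nucleus filtering; it is a minimal illustration with names of our choosing, not the model's actual decoder, which additionally alternates between text and image generation as described above.

```python
import torch

def cfg_top_p_sample(logits_cond, logits_uncond, alpha_c=3.5, temperature=1.0, top_p=0.9):
    """Sample one token id using classifier-free guidance (Eq. 6) followed by
    temperature scaling and nucleus (top-p) filtering."""
    # Eq. 6: blend conditional and unconditional logits.
    logits = logits_uncond + alpha_c * (logits_cond - logits_uncond)
    probs = torch.softmax(logits / temperature, dim=-1)

    # Nucleus filtering: keep the smallest set of tokens whose mass reaches top_p.
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p          # always keeps at least the top token
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids[choice].item()

# Toy vocabulary of 8 tokens; in the model these are two forward passes,
# one with the text prompt and one with the prompt replaced by <mask>.
logits_cond = torch.randn(8)
logits_uncond = torch.randn(8)
print(cfg_top_p_sample(logits_cond, logits_uncond))
```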
3.1.1 DATASETS

Shutterstock. We randomly sample a subset of 30B tokens from the CM3leon (Yu et al., 2023) pretraining data. The data consists of legally acquired image-caption pairs from Shutterstock, a commercial online platform offering images with ownership attribution and clear licensing terms.

Text corpora. We use 30B text tokens sampled from a mixture of several publicly available datasets, reusing the data used to train other common open-source LLMs and following the same preprocessing as Touvron et al. (2023). The datasets are: English Common Crawl (Touvron et al., 2023), C4 (Raffel et al., 2020), Wikipedia, Books3 from The Pile (Gao et al., 2020), and arXiv.

LIMA. We use the 1k-sample dataset from Zhou et al. (2023), which features various curated prompts and responses.

wikiHow. We collect an interleaved image-text dataset by sampling 3000 articles from wikiHow, an online wiki publication that usually curates apposite images for each article. We sample a balanced number of articles from each category to ensure diversity; moreover, we leverage the platform's community ratings to filter for article quality, sampling only articles with a score greater than 90/100. For each article, we use the title (e.g., "How to make ..?") as the prompt, and we replace the phrase "This article..." with "The following answer...". Furthermore, we restrict the number of images to 3 per sample to fit our 4096 context length.

3.2 CONTINUED PRETRAINING RESULTS

In the initial stage of continued pretraining, we evaluate the performance of the various JAM models. Our primary objective is to ensure minimal performance degradation post-merging, relative to the parent models. Managing both image and text processing within a single model poses significant challenges. This evaluation seeks to quantify how much of the original performance is retained by our different JAM models, benchmarked against the two parent models specialized in individual modalities.

3.2.1 TEXT MODALITY

For the text modality, we compare zero-shot performance on common sense reasoning tasks: PIQA (Bisk et al., 2020), ARC-Challenge, ARC-Easy (Clark et al., 2018), Story Cloze (Mostafazadeh et al., 2016), Winograd, and Winogrande (Sakaguchi et al., 2021). We also report some recent influential LLMs (Brown et al., 2020; Touvron et al., 2023) and our LLM (Molybog et al., 2023) fine-tuned with 4k context as references. Results are presented in Table 1. JAM-Uniform reaches slightly better text-only performance than JAM-Width; however, it is crucial to remark that this approach consolidates the functionalities of both parent models within a constrained 7B parameter space. Our findings reveal that the intrinsic knowledge of the parent models can largely be recovered from the parameter average using only a minimal portion of the original pretraining data. The JAM-Cross model yields the best results, aligning with our primary LLM. This highlights the strength of our bidirectional cross-attention mechanism against the other baselines.

Table 1: Zero-shot text comparison on common sense reasoning tasks.

| Model | Size | PIQA | ARC-C | ARC-E | Story Cloze | Winograd | Winogrande |
|---|---|---|---|---|---|---|---|
| GPT-3 | 175B | 81.0 | 51.4 | 68.8 | - | - | 70.1 |
| LLaMA | 7B | 79.8 | 47.6 | 72.8 | - | - | 70.1 |
| LLM-4k | 7B | 76.7 | 45.9 | 67.7 | 79.3 | 83.9 | 66.2 |
| JAM-Uniform | 7B | 62.4 | 28.5 | 42.6 | 63.5 | 47.8 | 49.7 |
| JAM-Width | 26B | 57.8 | 31.4 | 31.6 | 54.7 | 50.2 | 51.9 |
| JAM-Cross | 19B | 75.4 | 41.6 | 67.2 | 79.8 | 81.0 | 66.0 |

3.2.2 IMAGE-TEXT MODALITY

To assess the performance of our different baselines on the image-text modality, we compare them using the validation perplexity (PPL) on the MS-COCO dataset (Lin et al., 2014). We believe this metric robustly correlates with performance on downstream tasks such as image generation and captioning. Furthermore, it provides a reliable reference point for comparing different autoregressive models that share an identical tokenizer. Results are reported in Table 2.

Table 2: Image-text comparison.

| Model | Size | MS-COCO PPL |
|---|---|---|
| CM3 | 2.7B | 200.1 |
| RA-CM3 | 2.7B | 193.1 |
| CM3leon | 760M | 168.8 |
| CM3leon | 7B | 149.0 |
| JAM-Uniform | 7B | 177.5 |
| JAM-Width | 26B | 159.5 |
| JAM-Cross | 19B | 147.6 |

Table 3: Ablations - JAM-Width initialization.

| Init. | Wikipedia PPL | MS-COCO PPL |
|---|---|---|
| Copy | 7.34 | 159.5 |
| Average | 9.0 | 175.4 |

Table 4: Ablations - JAM-Cross cross-attention frequency.

| C-Attn | Size | Wikipedia PPL | MS-COCO PPL |
|---|---|---|---|
| none | 13B | 7.86 | 153.2 |
| 1 | 26B | 7.53 | 152.4 |
| 2 | 19B | 7.18 | 149.0 |
| 4 | 16B | 8.55 | 151.7 |

Table 5: Ablations - Instruction tuning, with and without Shutterstock image-text data mixed into the instruct-tuning corpus.

| Shutterstock data | MS-COCO PPL |
|---|---|
| w/o | 190.2 |
| w/ | 164.5 |

Diverging from the results on the text-only modality, the JAM-Width model exhibits enhanced performance over the JAM-Uniform model in the image-text domain. Specifically, the JAM-Width model demonstrates superior efficacy in retaining image-text performance relative to text-only performance.
Conversely, despite a decline in performance, the JAM-Uniform model remains a good parameter-performance trade-off. Interestingly, our JAM-Cross model not only reaches the best PPL among the JAM strategies but also surpasses our foundational image-text model, CM3leon. We hypothesize that this advancement can be attributed to the integration of novel textual capabilities coupled with the augmented parameter count of the combined architecture. Based on the empirical evidence, JAM-Cross emerges as the best strategy for combining two pretrained autoregressive models.

3.2.3 INTERLEAVED GENERATION

Our instruct-tuned JAM-Cross model reaches a high quality level of interleaved image-text output. To demonstrate its ability to generate coherent, modality-interleaved responses, we show an extensive set of generated samples in Figure 3 and Section B. The samples generated with retrieval are obtained from the model instruct-tuned on a mixture of pretraining image-text Shutterstock data and our corpora of instruct-tuning datasets, while the samples generated without retrieval are obtained from the model instruct-tuned only on our instruct-tuning set. The generated samples show coherent image and text integration, demonstrating unprecedented abilities at this novel task. Overall, we find our retrieval-augmented solution to be more effective than standard image sampling, boosting image quality. We further report several qualitative comparisons with the most relevant previous work featuring mixed-modal generation, GILL (Koh et al., 2023a). We use our retrieval-augmented JAM-Cross model and source generations for the GILL model from the original paper. From this comparison (Figure 4), it is immediately evident that our model produces responses of better overall quality: the generated text is more complete and exhaustive, while the generated images are more relevant to the textual context. We remark that our method is the first capable of such coherent, interleaved generation with a focus on instruction tuning, and that our fine-tuning procedure is effective in efficiently learning the style of the dataset, not only for text but also for images. Our model paves the way toward a broader adoption of mixed-modal generation in real-world use cases.

[Figure 4: Qualitative comparison with previous interleaved-generation models. Compared to GILL, our model generates more complete and precise answers (prompts include a customized birthday-cake design, bridal hairstyles for a garden wedding, inspiration for a mountain-and-river landscape painting, and a pizza with bacon). Results for GILL are sourced from Koh et al. (2023a).]
3.3 ABLATION STUDY

We compare the two approaches for the width-concatenation model: copying the original models' weights or using their average to initialize the new parameters. Results (Table 3) show that copying the weights is more effective than averaging them for retaining the original models' capabilities. The ablation study for the cross-attention model is presented in Table 4. We ablate the frequency of inserting cross-attention layers and the impact of not using any cross-attention layers. These experiments are performed training with 25B tokens; all the other parameters are the same as reported in Sect. 3.1. We remark that this is an even shorter training setting with respect to our 50B-token total training, and that the difference in performance increases as training progresses. We further ablate the contribution of image-text pretraining data in the instruction-tuning procedure in Table 5. The results indicate the importance of mixing pretraining data into the instruction-tuning procedure to preserve the MS-COCO PPL. We do not report wikiHow PPL since analysis of the models shows that it does not correlate with generation quality, similarly to Zhou et al. (2023).

4 RELATED WORKS

Generative Text-to-Image Models. The field of generative text-to-image models has recently been dominated by diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020).
Recent enhancements have used pretrained text representations (Ramesh et al., 2022; Nichol et al., 2022) such as CLIP (Radford et al., 2021) to improve generation quality. Concurrently with the development of diffusion-based generative models, significant steps have been made by autoregressive token models (Esser et al., 2021; Gafni et al., 2022). These models encode images into a discrete latent space (Van Den Oord et al., 2017), after which generation can be treated as a standard sequence-to-sequence modeling task, enabling the borrowing of techniques from Large Language Models. A critical element that has been found beneficial in boosting text-to-image generative models is retrieval augmentation (Chen et al., 2022; Yasunaga et al., 2022). Yasunaga et al. (2022) propose to prefix decoder-only models, such as Aghajanyan et al. (2022), with retrieved images during training, resulting in a huge efficiency gain for the training procedure. Yu et al. (2023) scale this strategy to reach state-of-the-art performance in image generation using 5x less training compute. In this work, we adopt their model as our text-to-image autoregressive backbone.

Multimodal Language Models. The multimodal language model field has recently seen considerable development. Several prior works have focused on connecting language models to visual encoders (Tsimpoukelli et al., 2021; Mokady et al., 2021; Najdenkoska et al., 2023; Li et al., 2023). These methods typically train a mapping network between a pretrained image encoder and a language model. Flamingo (Alayrac et al., 2022) introduces cross-attention into a frozen LLM to inject visual features and trains on a large corpus of image-text pairs. In this work, we similarly use cross-attention to bridge the two models; however, our mechanism is bidirectional between the vision and language models, whereas for Flamingo the visual knowledge is injected into the language model and not vice versa. CM3 (Aghajanyan et al., 2022) is trained on a large corpus of structured HTML; it introduces the Causally Masked Language Modeling objective that we adopt to train our models. Koh et al. (2023b) propose a multimodal language model capable of processing arbitrarily interleaved image and text inputs and generating interleaved outputs of text and retrieved images. Subsequently, in the same line of work, GILL (Koh et al., 2023a) proposes to ground an LLM to a text-to-image model, using a mapping network and freezing the pretrained models, introducing the possibility of generating or retrieving images as output. Similarly to GILL, Sun et al. (2023b) propose to model different modalities in an autoregressive way with a single model they call Emu. Differently from our work, they employ the EVA-CLIP (Sun et al., 2023a) encoder to generate visual embeddings and Stable Diffusion (Rombach et al., 2022), conditioned on the generated image tokens, to decode images.

Instruction Tuning. Instruction tuning aims to teach language models to follow natural language instructions. Several methods have been proposed for instruction tuning, using existing NLP datasets converted into instruction formats (Wei et al., 2021; Chung et al., 2022), or using LLMs like GPT-4 to generate instruction data with better diversity (Wang et al., 2022; Honovich et al., 2022). Recently, LIMA (Zhou et al., 2023) demonstrated that 1,000 carefully curated samples are enough to reach competitive results compared to bigger instruction-tuning datasets.
The authors hypothesize that most of the knowledge is learned during pretraining, and that instruction tuning teaches the style with which to interact with users. In this work, we explore using a small set of multimodal instruction-tuning data to fine-tune our model, verifying the effectiveness of a small dataset in this multimodal setting tailored to image generation. Several vision-language works adopt instruction tuning for multimodal, task-focused user interactions optimized for visual content understanding (Liu et al., 2023; Dai et al., 2023; Ye et al., 2023; Zhu et al., 2023). Unlike previous works, we explore instruction tuning focused on mixed-modal generation, paving the way for broader adoption of multimodal models that can generate interleaved image-text output.

5 CONCLUSIONS

In this work, we have presented novel methodologies for combining pretrained autoregressive models, demonstrating the viability of synthesizing the knowledge of two distinct models into a cohesive structure with extended capabilities. Our exploration validates that the integrated model can be adeptly fine-tuned using our tailored instruction-tuning procedure for interleaved image-text generation. To this end, we pioneered the creation of a specialized dataset centered on instruction tuning for this particular task. Nevertheless, the proposed study is limited to 7B-parameter models with the same architecture. Future work may consider scaling the model size and asymmetrically applying our cross-fusion method to bridge models of varying sizes. Increasing the context length and delving into multi-turn conversations could further represent an interesting exploration direction. In conclusion, our study sets the foundation for substantial advancements in the realm of multimodal autoregressive models. The fusion of text-to-image generation with large language models paves the way for sophisticated systems capable of interleaved image-text interactions, enriching the landscape of conversational AI.

REFERENCES

Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, et al. CM3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520, 2022.

Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. arXiv preprint arXiv:2301.03728, 2023.

Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. In The Eleventh International Conference on Learning Representations, 2022.

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716-23736, 2022.

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432-7439, 2020.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.
Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-Imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873-12883, 2021.

Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pp. 89-106. Springer, 2022.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019.

Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022.

Jean Kaddour, Linqing Liu, Ricardo Silva, and Matt J Kusner. Questions for flat-minima optimization of modern neural networks. arXiv preprint arXiv:2202.00661, 2022.

Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023a.

Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023b.

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pretraining for unified vision-language understanding and generation. In International Conference on Machine Learning, pp. 12888-12900. PMLR, 2022a.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pretraining with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.

Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097, 2022b.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740-755. Springer, 2014.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.

Michael S Matena and Colin A Raffel. Merging models with Fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703-17716, 2022.

Ron Mokady, Amir Hertz, and Amit H Bermano. ClipCap: CLIP prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.

Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, et al. A theory on Adam instability in large-scale machine learning. arXiv preprint arXiv:2304.09871, 2023.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696, 2016.

Ivona Najdenkoska, Xiantong Zhen, and Marcel Worring. Meta learning to bridge vision and language models for multimodal few-shot learning. arXiv preprint arXiv:2302.14794, 2023.

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pp. 16784-16804. PMLR, 2022.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485-5551, 2020.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821-8831. PMLR, 2021.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv e-prints, arXiv:2204, 2022.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695, 2022.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265. PMLR, 2015.

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023a.

Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023b.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023.

Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200-212, 2021.

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965-23998. PMLR, 2022.

Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Retrieval-augmented multimodal language modeling. arXiv preprint arXiv:2211.12561, 2022.

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research, 2022.

Lili Yu, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, et al. Scaling autoregressive multi-modal models: Pre-training and instruction tuning. arXiv preprint arXiv:2201.07520, 2023.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

A LIMITATIONS

The generation quality of our proposed model presents some limitations. Our JAM framework relies on LLMs and text-to-image autoregressive models, inheriting their strengths and limitations, such as the potential for hallucinations and biases in image generation. These limitations may be addressed by improving and leveraging better autoregressive backbones.
Moreover, our instruct-tuning procedure focuses on a specific wiki-style, single-turn question-answering format. Most of the time, the model generates a single image or, at most, two images, while a longstanding goal would be to generate entire multimodal documents with more than two images. This is partially bottlenecked by the training data and by the 4k-token context length; given that each image takes 1024 tokens, generating longer outputs is still challenging. Furthermore, our model is more effective at generating interleaved image-text output when the question style aligns with the dataset style and when the images refer to concrete objects, while it usually struggles to represent more abstract concepts. We showcase some examples of this behavior in Figure 5. When the subject of the dialogue is a more abstract concept such as "time travel", "music", or "research", the generated images fail to capture the textual context and to faithfully enrich the model output.

[Figure 5: Generated samples - failure cases. Prompts touching abstract concepts ("You are a scientist who just invented a Time Machine. Where do you travel first?", "How can I start playing guitar and become a better musician?", "How can I write a research paper?") yield coherent text, but the generated images fail to capture the textual context.]

B GENERATED EXAMPLES

This section shows several samples generated by our instruction-tuned models, using both retrieval augmentation and standard image generation.
If you are writing about homelessness, your research will either help homeless people get shelter and basic necessities or it will show that the current system of homelessness in your area is unsustainable and needs to be altered. What will the outcomes of your research be? If you are researching obesity in children, you might find that children are not getting enough exercise, which can lead to health problems later in life, or you might find that parents are not doing a good enough job at watching their children's eating habits, which may lead to lifelong health problems and/or dependence on others. Figure 5: Generated samples - Failure Cases Published as a conference paper at ICLR 2024 How can I improve my physical fitness? It is important to understand that fitness is not just about how much you can lift or how fast you can run, there is a holistic approach to health and fitness that takes into account your habits, stress levels and diet.To improve your physical fitness, you should first assess your current level of fitness and identify areas where you can improve. There are a number of ways to do this, such as assessing your heart rate during exercise or doing a BMI test. After you have done an assessment, you should develop a specific plan to improve your fitness. This should involve not only specific exercises and workouts, but also changing your overall lifestyle, such as by getting more active and adopting a healthier diet. Coordinating your workouts with your regular schedule. Be sure to take time for rest and relaxation. Give yourself a few days off between workouts to prevent injury and help your body repair and grow. Work out at a time of day when you will be energetic. You will be able to exert yourself at your best and achieve the best results. Many people find that morning is the best time for them, while others prefer to work out at night. It is important to experiment with different working out schedules to see what works best for you.] Try new workouts and exercise techniques to keep your workout routine exciting and fun. This will help you stay motivated and prevent you from getting bored with your routine. Always talk to your doctor before trying a new workout or technique, especially if you have any medical conditions. You may want to seek advice from a personal trainer, who can develop a specific workout program for you. Be patient with yourself and your progress. Remember that it takes time to improve physical fitness, and you should be patient with yourself and avoid beating yourself up over what you perceive are your shortcomings. Focus on the positives and use your failures as learning opportunities. You may have to adapt your goals over time as you progress. I want to visit Yosemite National Park, can you give me some suggestions for my trip? Yosemite National Park is one of the most beautiful parks in the world. There are so many different things to do and see in the park, that it's really hard to make suggestions. The park is huge, and travels long distances. You could visit Venice Beach, and then drive up to Mammoth Lakes. From there, you can go back down through Glacier Point and Bear Yosemite to Mariposa Grove. You can explore the grove, then head back to Yahoo Dome. From there, you can just keep driving and drive back into Oakland. It's a long journey, and there are many areas you can explore. One of the best ways to explore the park is to take a shuttle bus. The park provides free shuttle buses between Yosemite Village and several locations in the valley. 
The shuttles run every day, but they are not available on Thanksgiving Day, Christmas Day, or New Year's Day. The bus system is fantastic. It's easy to use, and it allows you to explore different parts of the valley that you may not have otherwise visited. In the summer time the seasons are not as distinct as they are in other areas of California. It can be warm most of the time, and it rains fairly often. The water level in the springs and lakes can vary a lot. Sometimes, it's low enough to walk across. In other cases, it's deep enough to swim in. The weather can change quickly in Yosemite. It's important to pack clothing for all sorts of weather, and to be prepared for rainy, cooler weather. On the other hand, you may not get to see a lot of snow. If you want to snow bike, then you may be able to find someone to rent you equipment. Yosemite is amazing, and it will be a fantastic trip. Just remember to bring water, clothe, and sunblock Figure 6: Generated samples - without retrieval (left), retrieval augmentation (right). Published as a conference paper at ICLR 2024 https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg I need to train my dog. How can I teach him how to sit and other basic instructions? Start with easy commands like "Sit" and "Stay". These are the first steps for someone to learn how to control their dog. They are also the first steps for your dog to learn how to associate his body with certain commands. To get your dog to sit, raise your hand in the air and make a demanding motion with your hand. Often called a "hands command", your dog will associate the command with your hand motion and with the positioning of your arm. Start using the word "sit" as you are making the hand motion. After a few attempts, your dog should understand that the command "sit" corresponds to the hand motion you are making. In addition to using the hand motion, be consistent with your tone of voice. Positive reinforcement will help him learn the hand motion, but your tone of voice will let him know whether or not he has succeeded. Start with a gentle voice and gradually raise your tone of voice when he follows the command. If you are happy with his performance, tell him "Good dog!" in a cheerful tone of voice. If he needs more work, use a firmer tone of voice and let him know that he has not succeeded yet. However, always stay positive. Beginning commands with the hand motion will help you to establish the movement as the command. https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How can I improve my mathematical knowledge? There are several things you can do to improve your mathematical knowledge. Here are some suggestions. Read books. A good way to learn mathematical concepts is to read about them. You can find books in your local library. Some popular books that might be helpful are: Introduction to Algebra Introduction to Geometry Mathematics: A Very Basic Guide Play with numbers. You can find some math games online. Playing these games can help you practice calculations and recognize patterns. Some games you might try are: Arithmetic Bee Hyperbole and Parabola Pathfinder Solving Equations Take classes. You can take classes in math at your local community college or university. Most colleges and universities offer classes in algebra, geometry, and calculus. You can also take classes in advanced math, such as abstract algebra, harmonic analysis, and number theory. Speak with a tutor. 
If you are having trouble understanding a concept, you can hire a math tutor. A tutor can help you explain the material in a simpler way. You can find a tutor online or by asking around at Figure 7: Generated samples - retrieval augmentation Published as a conference paper at ICLR 2024 https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How can I prepare for college admission? There are a lot of things to consider when preparing for college admission. Here are a few suggestions: Take college preparation classes, visit college campuses, and talk to counselors and admissions officers. Doing these things will help you choose the right college and understand what you need to do to get admitted. Set aside time to think about what you want to study and why. Don't settle on a major just because your parents or teachers told you to. Choose a major that you're interested in and that will help you find a job that you'll like. Join clubs, organizations, and sports teams. These can help you make friends, get to know other people, and have fun. They're also a good way to build skills and get immediate experience that colleges will value. Work hard and get good grades. A grade point average of 3.0 or better is usually needed to get into a good college. Some schools may require a higher GPA, so ask them when you visit. If you can't achieve a 3.0 grade point average, talk to your teachers and counselors, and see if there are any special programs you can join to help you get good grades. Take notes in class https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How to prepare a tasty hot dog? Hot dogs are a classic comfort food that you can enjoy at home or at a festival. They're easy to make at home with common ingredients, and your friends and family may be surprised at how delicious they can be. You can mix up your own filler to save money and add healthy ingredients; you can also sausage or use turkey bacon to make a veggie-friendly hot dog. Get started by choosing a topping that suits your preferences. Choose the types of hot dog 1. Get a homemade hot dog for a more nutritious option. Homemade hot dogs are healthier than store-bought versions since they don't contain primarily processed ingredients. They're also fresher, so you're less likely to get sick from eating them. While you can buy a kit that contains all the ingredients you need to make a homemade hot dog, you can also make your own using common ingredients. The hot dog using the basics of a homemade turkey sausage 2. Try a veggie hot dog for a plant based option. Many people eat vegetarian or vegan hot dogs for various reasons. Kosher and soy-based hot dogs are available in bulk or at most grocery stores, and they can be substituted for meat in most recipes. Purchase Veggie hot dogs cook just as well as bacon-filled hot dogs, and they generally aren't filled with a lot of processed meat (or at all). For meat lovers, you can cook your veggie hot dog using delicious mushrooms, seitan, or tempeh. 3. Choose a hot dog seasoning mix for a quick fix. Seasoning a hot dog at home is simple and fast, and it's less expensive than buying a premade seasoning mix. While you can buy hot dog seasoning mixes at most grocery stores, it's easy to make your own using a few spices. Taste your hot dogs as they come out of the oven and add more seasoning if you prefer. Use your finger or a spoon to spread the seasoning evenly over the hot dog. 
For example, you can use 1 tablespoon (15 ml) each of garlic powder, onion powder, cayenne pepper, lemon pepper seasoning, and herb seasoning. You can try flavoring your hot dogs with different spices, or use a combination of seasonings to create your own unique taste. https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How can I start to snowboard? Snowboarding is a lot of fun. Here are some steps to help you learn how to snowboard. 1. Get up on the chair lift to go to the top of the mountain. You'll need to buy a ski or snowboard lift ticket. 2. Begin riding the chair lift to the top of the mountain. Once you've arrived at the top, you can turn around and go back the way you came. You didn't actually have to ride the chair lift all the way to the top --you could have walked, but it's much faster to ride the lift. 3. Now that you've arrived at the top of the mountain, you can ride your snowboard down the mountain. To go downhill, you need to keep your weight over the front of your snowboard and release the grip that you have on the back of the board. Engage your brakes so that you can stop when you need to. To go faster, you can push off the side of the snowboard. If you want to turn, you can turn your feet the same way that you would on a skateboard. Releases the pressure on your back foot to start the turn and then catches it with your front foot Figure 8: Generated samples - without retrieval augmentation Published as a conference paper at ICLR 2024 https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How can I order a large pizza from Domino's? Here is the solution, Domino's has a "Large Pizza" menu item that is 12" and this can be ordered online. 1. Visit the Domino's website. 2. Click on the "Pizza" menu and select "Large Pizza". 3. Enter the desired quantity in the "Quantity" box. For example, 2. 4. Check the "Grab a Package" box to get a box of 12 slices. 5. Click on the "Order Now" button. 6. Provide your name, phone number, address, and the payment method. 7. Confirm the order and wait for the pizza to be delivered. https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How can I build a wood table for my kitchen? Here are some options to consider for your kitchen table: Build a table out of panels and a top (this is probably the most common style, and probably the cheapest) Build a table out of whole logs (good for a rustic look, and you get to choose the color of the wood) Build a table out of distressed wood (my personal favorite, can be a little more expensive) Build a table with a tile top (not sure why, but there are people that love it) Build a pool table (not really a kitchen table, but can be used in the kitchen) Build a buffet table (these are usually made of brass and wood, and can be quite heavy. Good if you have a lot of guests) Remember, when building any table, the most important part is the frame. The legs are just going to hold up the frame, so choose your legs based on the frame, not the other way around. https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How can I prepare an avocado toast? Avocado toast is delicious and relatively easy to make. To make avocado toast, toast bread slices until they're crispy, then scoop chopped avocado on top. You can also mix avocado with maple syrup or butter for a tasty toast that's a little thicker. 
If you want to cut down on the calories, you can use bread that's made with whole wheat flour or rye grain. If you want to go vegan, you can use vegan butter or mousse instead. Toast and scoop avocado 1. Preheat the oven to 250 F (120 C). Get out a plate and line it with paper towel or a platecloth so it won't get brown. Set the plate aside while you toast the bread. 2. Heat a pan on the stove until it's hot. Place a pan on the stove and heat it over medium heat. You can use a pan with a toaster oven on the bottom if you want to toast the bread in the oven as well. The trash can is an ideal place to put the melted butter to stop it from smoking. 3. Toast the bread until it's crispy Figure 9: Generated samples - without retrieval augmentation Published as a conference paper at ICLR 2024 https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How to make shakshuka? Shakshuka is a tasty and healthy breakfast dish made with eggs, beans, mushrooms, onions, lettuce and tomato sauce. The shaken eggs are scrambled in a pan before being poured over anything else. The dish can be served warm or cold. Read on for some tips on how to make your own shakshuka. Make basic shakshuka 1. Heat olive oil or a tasty oil (like coconut or pistachio) in a pan on medium-high heat. A cast iron pan is best, but you can also use a non-stick pan. You'll also need a medium-sized saucepan. 2. Add onion, garlic and/or mushrooms to the pan and saute them for 5-10 minutes, depending on your taste. Onion is the most common addition, but you can experiment with other vegetables, like za'atar (middle eastern spice mix), chopped chili pepper, or garlic powder. You can also mix in some spices like cumin and coriander. 3. Crush the tomatoes (you can crull them up, or simply chop them into small pieces) https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg I am going with my friend for an hiking. How should I prepare my backpack? Here are some tips to get you started preparing your backpack for a trek. 1. Pack your essentials in a lightweight, compact bag. 2. Place your care items-including snacks, clothes, and toiletries-in a separate bag. 3. Pack your gear in a strap bag, or a pack pouch attached to the main bag. 4. Use breathable, lightweight materials in your gear bags. 5. Opt for removable straps, bags, or other accessories so you can lighten your load while hiking. 6. Pack your clothes in a tarp or a collapsible bag so they'll take up minimal space. 7. Use a pack that allows you to easily adjust your carrying capacity according to the terrain and distance of your trek. 8. Pack your gear in plastic bags to keep it clean and preserve its durability. 9. Use a waterproof bag that can hold everything you need, or create a waterproof bag calculated for your trek. 10. Start your trek with fresh clothes so you don't get sweaty or dirty while hiking. Figure 10: Generated samples - without retrieval augmentation Published as a conference paper at ICLR 2024 https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg What can I visit in Torino? Here are some suggestions for your visit to Torino, Italy: The Palazzo Savona, formerly the palace of the Savoys, is a spectacular building located in the city center, constructed in the 17th century. The Muse Piemonte, located in the Old Palazzo, houses an impressive collection of art and antiquities, and is one of the largest museums in Italy. 
Church of the Gesù, designed by Brunelleschi and completed in 1558, is a Gothic-style church located in the church district of the city. Piazza San Marco, or St. Mark's Square, is a beautiful square located in the heart of the city, facing the Basilica. The Torino Dome, constructed in 1878, is an extraordinary example of Italian architecture, created by architects Guglielmo and Achille Villa. The Biblioteca Albertina, located in the University of Torino, houses one of the largest collections of rare books and manuscripts in Italy. The Botanical Garden, located in the heart of the city, is one of the largest and most beautiful gardens https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How can I learn skateboarding? Skateboarding is fun, involves a lot of movement and is a good exercise. There are a lot of ways to learn how to skateboard. You can take lessons from a professional skater, watch instructional videos, read how-to articles, etc. Try to find a friend that skates and asks to ride with you, or visit a skatepark and practice on the bumpy terrain with others. Learn from a pro skater 1. Find a professional skater to teach you. Look online for skater camps or try asking around at your local skatepark. You might even know a skater that's willing to teach you how to skate. Organize a lesson with the person and make sure you know what you expect to get out of it. Some skaters may be willing to help you for free if they like you, or if you help them with something. 2. Organize a lesson. Make sure you know exactly what you need to learn. For example, if you want to learn how to kick turn, you need to know how to move your foot from the front to the back, and you need to know how to rotate the deck. You probably also need to know how to balance on a skateboard Figure 11: Generated samples - retrieval augmentation Published as a conference paper at ICLR 2024 https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How to fly to Niagara Falls? Niagara Falls is one of the most popular tourist destinations in the world. You can easily travel to this region from the Canadian or American side, although you may want to explore the whole area. The best way to get to Niagara Falls is by plane, as there are several airports in the area. Once you're at Niagara Falls, you can take a number of different activities to make the most of your visit.## Fly to niagara falls 1. Use the Google Flight Search website to book your flight. Google Flight Search is an online flight booking website. You can browse flight prices from a number of different airlines. You can also input vacation dates and see what flight prices come up. You don't have to log in or register in any way. Within the flight search box, you can enter the following information: Year of travel Month of travel Day of travel. You can also input where you want to go, and the website will find flights that meet your budget. https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How to make mango lassi? Mango lassi is a tasty drink that's made with mango juice and yogurt. It's a good choice for a hot summer day, as it's cold and creamy. You can enjoy it plain, or you can add some spices to it. If you want to know how to make it, just follow these steps. Make plain mango lassi 1. Wash the mango. Use your hands to scrub the mango to remove any dirt. Rinse the mango and let it dry. 2. Prepare the yogurt. 
Heat the yogurt in a microwave-safe bowl. Place the bowl in the microwave and heat it on medium heat for 1 to 2 minutes. You can use any kind of yogurt. Plain yogurt will have less sugar and will be sweetened nicely by the mango juice. You can also use plain cottage cheese instead of yogurt. 3. Mix the yogurt and mango juice. Use a spoon to mix the yogurt and juice together until the mixture is smooth. You may need to stop and scrape down the sides of the bowl occasionally. 4. Pour the mixture into a glass. You can use any kind of glass or mug. https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How can I get better at chess? I would like to improve my skills. There are several ways you can improve your chess skills. Play more chess. Practice several times a week and increase your play time as you get better. You could join a chess club to get better practiced, or play online chess with friends or other players. Analyze your games. After you have played a game, observe it and think about how you could have done things differently. This will help you grow as a player and evaluate your moves in a more objective way. Study the chess tactics. There are some interesting strategies and concepts in chess that you can learn about by studying them. Some good books to reference are The Art of Chess Strategy by Alexander Navolynets, The Chess Improvement Book by Nikola Kodrina, The Chess Triple Threat by Adrian Menendez. Understand the chess openings. The way you begin a game of chess depends on whether you want to counteract your opponent's moves, or try to establish your control over the chessboard early on. There are several kinds of openings, such as the Anglicane, Russian, Sicilian, Zusovsky, and Germans. Learn about the strategies behind them and understand which kind of opening you should Figure 12: Generated samples - retrieval augmentation Published as a conference paper at ICLR 2024 https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How can I start playing basketball? Basketball is one of the most popular sports in the world. There are more kids playing basketball now than ever before. There are a lot of ways to start playing basketball, and a lot of resources available to help you learn. Checking out some new basketball shoes and adding some new basketball pants to your wardrobe might be a good place to start. Next, you might want to consider joining a basketball class or league. Finally, to start playing basketball, you'll need to learn the game by playing it. Basketball is a one-on-one sport, but it's also a team sport, and learning how to work as a team and play individual games will help you become a better basketball player. Get into basketball 1. Try basketball activities for kids. There are a lot of different activities you can do to get into basketball. Most activities are geared towards kids ages 8-12, but many of them might be appropriate for younger children as well. Many of these activities involve playing basketball in a tested environment, under the supervision of trained coaches. Some of the more popular activities include: * Youth Basketball Program (YBP): YBP is one of the most popular youth https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How can I improve my bouldering technique? Pushing too hard is a common problem for beginners in bouldering. 
Here are a few things to keep in mind as you develop your technique: Start with small, shallow moves. You need to get your technique down before you start trying big, difficult moves. Start with some simple strategies like just getting your foot on the rock, or pushing a grip onto a particular hold. These small strategies will help you build strength and confidence before you tackle bigger moves. Focus on getting through one grip at a time. When you're just getting started, you'll probably be working on just getting your grip on the rock. Focus on just working on a single grip at a time, and get good at that before moving on. For example, you might start by working on a double overhand grip, and only work on that grip for the first few sessions. Once you feel more comfortable with the grip, you can start looking at trying a different one. Take your time. You really can't learn proper technique in a single session. You need to practice consistently in order to develop your skills. Set aside 30 minutes or so to practice every day, and make sure you take breaks occasionally. You also need to practice multiple times https://t3.ftcdn.net/jpg/05/17/79/88/360_F_517798849_Wu Xh HTpg2dj Tbf Nf0FQAjz FEolu Hpnct.jpg How can I improve my rock climbing technique? There are a few fundamental concepts that you need to get better at first, before you start getting better at climbing. So, in no particular order: Form an essential training routine. If you want to improve your climbing technique, you have to practice it. A lot. And you have to do it consistently. There are a lot of different exercises you can do, but they need to be done daily, and they need to be done for a long period of time (at least 3 months, preferably 6 months) before you start seeing improvements. If you're just starting out, 3x weekly is good. If you're already more experienced, 2x weekly is good. Practice good technique. This includes things like using your arms in synchrony with your legs, using your whole body weight when you can, not pushing too hard, and avoiding bad habits (like bouncing). A good way to ensure you're using good technique is to do your exercises in front of a mirror. That way you can see whether or not you're actually using your body in the most efficient way. Learn to fail safely. Rock climbing is a dangerous activity, so you need to learn how to fail safely. Figure 13: Generated samples - retrieval augmentation