# ICAR: Image-Based Complementary Auto Reasoning

Xijun Wang¹,², Anqi Liang², Junbang Liang², Ming Lin¹,², Yu Lou², Shan Yang²
¹ University of Maryland, College Park, USA
² Amazon, USA
{xijun, lin}@umd.edu, {lianganq, junbangl, ylou, ssyang}@amazon.com

## Abstract

Scene-aware Complementary Item Retrieval (CIR) is a challenging task that requires generating a set of compatible items across domains. Due to its subjectivity, it is difficult to set up a rigorous standard for both data collection and learning objectives. To address this task, we propose a visual compatibility concept composed of similarity (resemblance in color, geometry, texture, etc.) and complementarity (different items, such as a table and a chair, completing a group). Based on this notion, we propose a compatibility learning framework, a category-aware Flexible Bidirectional Transformer (FBT), for visual scene-based set compatibility reasoning with cross-domain visual similarity input and auto-regressive complementary item generation. The FBT consists of an encoder with flexible masking, a category prediction arm, and an auto-regressive visual embedding prediction arm. Its inputs are cross-domain visual-similarity-invariant embeddings, which makes the framework highly generalizable. Furthermore, the proposed FBT model learns inter-object compatibility from a large set of scene images in a self-supervised way. Compared with state-of-the-art methods, this approach achieves up to 5.3% and 9.6% improvement in FITB score and 22.3% and 31.8% improvement in SFID on fashion and furniture, respectively.

## Introduction

Online shopping catalogs provide great convenience, such as searching for and comparing similar items. However, while customers can easily compare similar items, they often miss out on browsing complementary items during the online shopping experience. Millions of online images offer a new opportunity to shop from inspirational home-decoration ideas or outfit matches, but retrieving stylistically compatible products from these images for set matching can be an overwhelming process. The ability to recommend visually complementary items becomes especially important when shopping for home furniture and clothing.

In this work, we aim to address the visual scene-aware Complementary Item Retrieval (CIR) task (Sarkar et al. 2022; Kang et al. 2019). In this task (as shown in Figure 2), we attempt to model humans' ability to select a set of objects from cross-domain pools, given a scene image, the objects in the scene, and object categories. We therefore propose a visual compatibility concept consisting of two key elements: similarity and complementarity. Visual similarity and complementarity, however, can sometimes contradict each other. Items that look similar (in color, geometry, texture, etc.) may not be complementary when put together in a set (e.g., a dinner table vs. a sofa), and items that complement each other do not necessarily look similar (e.g., an outfit set in contrasting colors). The ambiguous definition of visual complementarity is a major challenge: it makes it difficult to rigorously define a learning objective and creates extra challenges for collecting such datasets when designing a data-driven method.
To address these issues, we first propose a compatibility learning framework that models both visual similarity and complementarity. To the best of our knowledge, we are among the first to show qualitatively that a model based on this framework can generalize to unseen domains (domains the model was not trained on, as shown in Figure 1 and Figure 6). For the scene-based CIR task, it is complex to learn cross-domain similarity and complementarity jointly; we therefore use cross-domain visual-similarity-invariant embeddings in our framework. Many previous CIR works (Han et al. 2017; Kang et al. 2019) also start from some type of learned embedding, but failing to model visual similarity creates extra complexity for complementarity learning. Second, we propose to use self-supervised learning for visual complementarity reasoning by introducing an auto-regressive, transformer-based architecture. Given the difficulty of defining style complementarity mathematically, we propose a solution based on the assumption that the items present in an inspirational scene image are compatible with each other.

Built upon these premises, we present a novel self-supervised, transformer-based learning framework (overview shown in Fig. 3). Our model effectively learns both the similarity and the complementarity between a set of items and does not require extra complementary labels. In addition, compared to prior work that models complementary items as pairs or as a sequence, we model them as unordered sets. We carefully design our compatibility learning model: first, we ensure that the learned embedding both contains and extracts all the information necessary for compatibility learning; second, we make full use of the Transformer's ability to reason about the interactions between these learned embeddings.

Figure 1: Cover image. We present a new self-supervised model for scene-aware, visually compatible object retrieval tasks. In this example, given an inspirational home scene image (sampled from STL-home (Kang et al. 2019); columns 1, 3, 5) with a pool of objects (3D-FRONT (Fu et al. 2021)) from an unseen domain, our model auto-regressively retrieves a set of stylistically compatible items (columns 2, 4, 6).

Figure 2: Scene-aware Complementary Item Retrieval task illustration. Given a query scene image, (optional) scene objects, and item categories, the goal is to generate a cross-domain set of stylistically compatible items.

To model flexible-length, unordered set generation with cross-domain retrieval, we propose a new Flexible Bidirectional Transformer (FBT). In this model, we handle unordered set generation with a random shuffle and masking technique. In addition, we introduce a category prediction arm and a cross-domain retrieval arm to the transformer encoder. The added category prediction branch helps the model reason about complementary item types.

More importantly, we notice that most prior CIR work is evaluated via the Fill-In-The-Blank (FITB) metric (Han et al. 2017) or with humans in the loop. The FITB metric reflects a model's ability in cross-domain retrieval, but it does not measure complementarity at the set level. Human-in-the-loop evaluation, in turn, is limited in scale and prone to bias if not conducted thoughtfully. To address these issues, we propose a new CIR evaluation metric: the Style Fréchet Inception Distance (SFID) (see supplementary for details).
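The full SFID protocol is deferred to the supplementary; purely as an illustration of the idea, the sketch below computes a Fréchet-distance-style score between two sets of style embeddings, in the spirit of an FID computed on style features rather than Inception classification features. The array names and the choice of feature extractor are assumptions for this example, not the paper's exact implementation.

```python
# Illustrative sketch only: a Frechet-distance-style score between two sets of
# style embeddings, in the spirit of SFID (the paper's exact protocol is in the
# supplementary). `feats_real` / `feats_gen` are hypothetical (N, D) arrays of
# style embeddings for ground-truth scene sets and retrieved sets.
import numpy as np
from scipy import linalg


def frechet_style_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the covariance product; discard tiny imaginary
    # parts introduced by numerical error.
    cov_sqrt, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(cov_sqrt):
        cov_sqrt = cov_sqrt.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * cov_sqrt))
```

Lower scores would indicate that the retrieved sets and the ground-truth scene sets occupy similar regions of the style-embedding space.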
In summary, the key contributions of this work include:

- Visual compatibility is defined in terms of similarity and complementarity for the scene-aware Complementary Item Retrieval (CIR) task, and a new compatibility learning framework is designed to solve this task.
- Within this framework, a category-aware Flexible Bidirectional Transformer (FBT) is introduced for visual scene-based set compatibility reasoning, with cross-domain visual similarity input and auto-regressive complementary item generation.

## Related Work

### Visual Similarity Learning

Visual similarity learning has long been a central computer vision topic. The goal is to mimic humans' ability to find visually similar objects or scenes. It is particularly studied in image retrieval (El-Nouby et al. 2021; Radenović, Tolias, and Chum 2018; Teh, De Vries, and Taylor 2020; Cheng et al. 2021), i.e., finding images under some definition of similarity. Prior to retrieving similar clothing, researchers also studied how to detect and segment clothing in real-life images (Yamaguchi et al. 2012; Yang and Yu 2011; Gallagher and Chen 2008). With clothing detection or segmentation in place, similar-clothing retrieval has been explored via style analysis (Hsiao and Grauman 2017; Kiapour et al. 2014; Simo-Serra and Ishikawa 2016; Yu et al. 2012; Yamaguchi, Hadi Kiapour, and Berg 2013; Di et al. 2013). Recent fashion retrieval tasks can be further categorized by their input information, such as images (Liu et al. 2016; Simo-Serra and Ishikawa 2016; Zhai et al. 2017; Tran et al. 2019), clothing attributes (Ak et al. 2018), and videos (Cheng et al. 2017).

### Visual Complementarity Learning

Visual complementarity learning, unlike visual similarity learning, is far more ambiguous. There are several research directions: pairwise and set complementarity prediction without cross-domain retrieval (Tangseng, Yamaguchi, and Okatani 2017; Li et al. 2017; Han et al. 2017; Hsiao and Grauman 2018; Shih et al. 2018; Li et al. 2020), set complementary item retrieval (Hu, Yi, and Davis 2015; Huang et al. 2015; Liu et al. 2012), personalized set complementary item prediction, which requires user input (Taraviya et al. 2021; Chen et al. 2019; Li et al. 2020; Su et al. 2021; Zheng et al. 2021; Guan et al. 2022b,a), and multi-modal complementary item prediction (Guan et al. 2021). All of these prior works focus on feature representation learning. Another line of work (Chen et al. 2015; Song et al. 2017b; Tan et al. 2019; Lin, Tran, and Davis 2020) focuses on learning multiple sub-embeddings based on different properties for both similarity and compatibility. More recently, Sarkar et al. (2022) use a Transformer with CNN-based image classification features for compatibility learning. Unlike the work above, we build a visual compatibility model that addresses both similarity and complementarity.

### Learning Framework

Many researchers have studied and explored cascaded learning frameworks. A cascaded method here means first learning how to encode the data and then modeling the statistics of that encoding. Many of the methods proposed for the CIR task can also be categorized as two-stage models, but almost all of them use an image classification training target for the first-stage feature extractor (Han et al. 2017; Sarkar et al. 2022; Kang et al. 2019; Chen et al. 2019).
Taraviya et al. (2021) propose a two-stage model for personalized pairwise complementary item recommendation, in which the first stage learns a feature embedding specifically designed for customer preferences. In our compatibility learning, we use cross-domain visual similarity embeddings as input and design the FBT for complementary set generation. We show empirically that visual similarity features, compared to features learned for image classification, are better suited for CIR. This design allows our model to surpass prior work on the scene-aware CIR task.

## Problem Statement

Given a scene image $I$, a set of unordered objects $O = \{o_i\}_{i=0}^{N}$, $o_i \in D_A$, in the scene, and a set of unordered object categories $C = \{c_i\}_{i=0}^{L}$, the problem is to retrieve, across domains, a set of complementary objects $X = \{x_i\}_{i=0}^{L}$, $x_i \in D_B$. This generated set of objects needs to be both visually compatible within itself and visually similar in style to the input scene image $I$. Here, $D_A$ and $D_B$ denote the two different visual domains, $L$ is the number of objects to retrieve during inference, and $N$ is the number of scene objects. The difference between the two domains $D_A$ and $D_B$ can be quantified as a Fréchet distance $F$ larger than a certain threshold $\theta$.

### Conditional Compatibility Auto Reasoning

We formulate the problem of generating a set of objects $X = \{x_i\}_{i=0}^{L}$, conditioned on the scene image $I$ and a specified set of categories, as computing the likelihood (Eq. 1) of the object set $X$ given the scene image $I$, the objects in the scene $O$, and the set of categories $C$. We model the probability of generating the unordered set $X$ as the sum over the probabilities of generating the set in any permutation $\hat{X}$:

$$p(X \mid I, O, C) = \sum_{\hat{X} \in \Phi(X)} \prod_{i \le L} p(x_i \mid x_0, \ldots, x_{i-1}, I, O, C), \tag{1}$$

where $\Phi(X)$ contains all permutations of the target object set $X$ given all permutations of the categories $C$, and $L$ is the maximum number of items composing a set. For each permutation of $X$, set generation becomes a sequence generation problem, which we model as an auto-regressive process: the next item in the set is generated conditioned on the prior items, so the probability of a permuted sequence factorizes as a product of conditionals,

$$p(x_0, \ldots, x_i) = \prod_{j \le i} p(x_j \mid x_0, \ldots, x_{j-1}). \tag{2}$$

To learn to conditionally generate the best set of objects, our model maximizes the log likelihood of $p(X \mid I, O, C)$, which for a given permutation decomposes as

$$\log p(X \mid I, O, C) = \sum_{j} \log p(x_j \mid x_0, \ldots, x_{j-1}, I, O, C).$$
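To make this objective concrete, the following is a minimal PyTorch sketch of how one permutation's negative log likelihood could be optimized under this formulation: the target set is randomly shuffled (one sampled permutation $\hat{X}$), a transformer encoder consumes the scene and already-placed item embeddings, and each next-item prediction is scored against a candidate pool with a softmax. All names, dimensions, and the simplified single prediction arm are our own assumptions for illustration; the actual FBT additionally includes flexible masking and a category prediction arm.

```python
# Illustrative sketch (not the authors' released code): auto-regressive set
# likelihood training in the spirit of Eq. (1)-(2). Names and dimensions are
# assumptions for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FBTSketch(nn.Module):
    """Simplified stand-in for the paper's FBT (flexible masking and the
    category prediction arm are omitted); predicts the next item's embedding."""

    def __init__(self, embed_dim: int = 256, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.pred_arm = nn.Linear(embed_dim, embed_dim)  # auto-regressive embedding arm

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (B, T, D) = [scene embedding, scene objects, items placed so far]
        hidden = self.encoder(context)
        return self.pred_arm(hidden[:, -1])  # predict the next item from the last state


def set_nll(model, scene_ctx, target_items, negatives, temperature=0.07):
    """Negative log likelihood of one randomly shuffled permutation (Eq. 1-2).

    scene_ctx:    (B, N, D) scene + in-scene object embeddings
    target_items: (B, L, D) similarity-invariant embeddings of the target set
    negatives:    (B, M, D) embeddings of other catalog items (negative candidates)
    """
    B, L, _ = target_items.shape
    perm = torch.randperm(L, device=target_items.device)
    shuffled = target_items[:, perm]  # one sampled permutation X_hat of the set
    pool = F.normalize(torch.cat([shuffled, negatives], dim=1), dim=-1)  # (B, L+M, D)

    loss = 0.0
    for i in range(L):
        context = torch.cat([scene_ctx, shuffled[:, :i]], dim=1)
        pred = F.normalize(model(context), dim=-1)                     # (B, D)
        logits = torch.einsum("bd,bcd->bc", pred, pool) / temperature  # scores vs pool
        # The ground-truth next item of this permutation sits at pool index i.
        target = torch.full((B,), i, dtype=torch.long, device=logits.device)
        loss = loss + F.cross_entropy(logits, target)
    return loss / L
```

At inference time, the same model would be applied auto-regressively: at each step, the catalog item nearest to the predicted embedding is retrieved and appended to the context until $L$ items are placed.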