A causal view of compositional zero-shot recognition

Yuval Atzmon¹, Felix Kreuk¹,², Uri Shalit³, Gal Chechik¹,²
¹NVIDIA Research, Tel Aviv, Israel; ²Bar-Ilan University, Ramat Gan, Israel; ³Technion - Israel Institute of Technology
yatzmon@nvidia.com, gchechik@nvidia.com
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

People easily recognize new visual categories that are new combinations of known components. This compositional generalization capacity is critical for learning in real-world domains like vision and language, because the long tail of new combinations dominates the distribution. Unfortunately, learning systems struggle with compositional generalization because they often build on features that are correlated with class labels even when those features are not essential for the class. This leads to consistent misclassification of samples from a new distribution, such as new combinations of known components. Here we describe an approach for compositional generalization that builds on causal ideas. First, we describe compositional zero-shot learning from a causal perspective, and propose to view zero-shot inference as finding "which intervention caused the image?". Second, we present a causal-inspired embedding model that learns disentangled representations of elementary components of visual objects from correlated (confounded) training data. We evaluate this approach on two datasets for predicting new combinations of attribute-object pairs: a well-controlled dataset of synthesized images and a real-world dataset of fine-grained shoe types. We show improvements compared to strong baselines. Code and data are provided at https://github.com/nv-research-israel/causal_comp

1 Introduction

Compositional zero-shot recognition is the problem of learning to recognize new combinations of known components. People seamlessly recognize and generate new compositions from known elements, and compositional reasoning is considered a hallmark of human intelligence [33, 34, 6, 4]. As a simple example, people can recognize a purple cauliflower even if they have never seen one, based on their familiarity with cauliflowers and with other purple objects (Figure 1b). Unfortunately, although feature compositionality is a key design consideration of deep networks, current deep models struggle when required to generalize to new label compositions. This limitation has grave implications for machine learning, because the heavy tail of unfamiliar compositions dominates the distribution of labels in perception, language, and decision-making problems.

Models trained from data tend to fail at compositional generalization for two fundamental reasons: distribution shift and entanglement. First, recognizing new combinations is an extreme case of distribution-shift inference, where label combinations at test time were never observed during training (zero-shot learning). As a result, models learn correlations during training that hurt inference at test time. For instance, if all cauliflowers in the training set are white, the correlation between the color and the class label is predictive and useful. A correlation-based model like (most) deep networks will learn to associate cauliflowers with the color white during training, and may fail when presented with a purple cauliflower at test time. For the current scope, we put aside the fundamental semantic question about what defines the class of an object (cauliflower), and assume that it is given or determined by human observers.
Figure 1: (a) The causal graph that generates an image. The solid arrows represent the real-world processes by which the two categorical variables Object and Attribute each generate core features [21, 17] φ_o and φ_a. The core features then jointly generate an image feature vector x. The core features are assumed to be stable for unseen combinations of objects and attributes. The dotted double-edged arrow between the Object and Attribute nodes indicates that there is a process confounding the two: they are not independent of each other. (b) An intervention that generates a test image of a purple cauliflower, by enforcing a = purple and o = cauliflower. It cuts the confounding link between the two nodes [49] and changes the joint distribution of the nodes to the interventional distribution. (c) Illustration of the learned mappings, detailed in Section 4.

The second challenge is that the training samples themselves are often labeled in a compositional way, and disentangling their elementary components from examples is often an ill-defined problem [38]. For example, for an image labeled as white cauliflower, it is hard to tell which visual features capture being a cauliflower and which capture being white. In models that learn from data, the representations of these terms may be inherently entangled, making it hard to separate the visual features that represent white from those that represent a cauliflower.

These two challenges are encountered when learning deep discriminative models from data. For example, consider a simple model that learns the concept "cauliflower" by training a deep model over all cauliflower images, and likewise for the concept "white" (VisProd [44]). At inference time, it simply selects the most likely attribute â = argmax_a p(a|x) and, independently, the most likely object ô = argmax_o p(o|x). Unfortunately, this model, while quite powerful, tends to be sensitive to training-specific correlations in its input.

Here we propose to address compositional recognition by modelling images as being generated, or caused, by real-world entities (labels) (Figure 1). This model recognizes that the distribution p(Image=x | Attr=a, Obj=o) is more likely to be stable across the train and test environments, p_test(x|a, o) = p_trn(x|a, o) [54, 49, 57]: unlike objects or attributes by themselves, combinations of objects and attributes generate the same distribution over images at train and test time. We propose to consider images of unseen combinations as generated by interventions on the attribute and object labels. In causal inference, an intervention means that the value of a random variable is forced to some value, without affecting its causes (but affecting other variables that depend on it; Figure 1b). We cast zero-shot inference as the problem of finding which intervention caused a given image. In the general case, the conditional distribution p(x|a, o) can have an arbitrarily complex structure and may be hard to learn. We explain how treating labels as causes, rather than as effects of the image, reveals an independence structure that makes it easier to learn p(x|a, o).
We propose conditional independence constraints applied to the structure of this distribution and show how the model can be learned effectively from data.

The paper makes the following novel contributions: First, we provide a new formulation of compositional zero-shot recognition using a causal perspective. Specifically, we formalize inference as a problem of finding the most likely intervention. Second, we describe a new embedding-based architecture that infers causally stable representations for compositional recognition. Finally, we demonstrate empirically on two challenging datasets that our architecture recognizes new unseen attribute-object compositions better than previous methods.

2 Related work

Attribute-object compositionality: [42] studied decomposing attribute-object combinations. They embedded attributes and object classes using deep networks pre-trained on other large datasets. [45] proposed to view attributes as linear operators in the embedding space of object word-embeddings. Operators are trained to keep transformed objects similar to the corresponding image representation. [46] proposed a method similar to [42] with an additional decoding loss. [60] used GANs to generate a feature vector from the label embedding. [50] trained a set of network modules, jointly with a gating network that rewires the modules according to embeddings of attribute and object labels. [37] is a recent framework inspired by group theory that incorporates symmetries in label space.

Compositional generalization: Several papers devised datasets to directly evaluate compositional generalization for vision problems by creating a test set with new combinations of train-set components. [25] introduced a synthetic dataset inherently built with compositionality splits. [1, 9] introduced new compositional splits of VQA datasets [2] and showed that the performance of existing models degrades under the new setting. [26] used a knowledge graph for composing classifiers for verb-noun pairs. [6] proposed an experimental framework for measuring compositional generalization and showed that structured prediction models outperform image-captioning models.

Zero-shot learning (ZSL): Compositional generalization can be viewed as a special case of zero-shot learning [62, 35, 8], where a classifier is trained to recognize (new) unseen classes based on their semantic description, which can include a natural-language textual description [48] or predefined attributes [35, 7]. To discourage attributes that belong to different groups from sharing low-level features, [24] proposed a group-sparse regularization term.

Causal inference for domain adaptation: Several recent papers take a causal approach to describe and address domain adaptation problems, including early work by [55] and [64]. Adapting these ideas to computer vision, [17] was one of the first papers to propose a causal DAG describing the generative process of an image as being generated by a "domain", which generates a label and an image. They use this graph for learning invariant components that transfer across domains. [36, 3] extended [17] with adversarial training [15], learning a single discriminative classifier, p(o|x), that is robust against domain shifts and accounts for the dependency of classes on domains. When viewing attributes as "domains", one of the independence terms in their model corresponds to one term (c) in Eq. (6). [28, 55] discuss image recognition as an anti-causal problem, inferring causes from effects.
[10, 27] studied learning causal structures under sparse distributional shifts. [40] proposed to learn causal relationships in images by detecting the causal direction of two random variables. [14] used targeted interventions to address distribution shifts in imitation learning. [32] learned a conditional-GAN model jointly with a causal model of the label distribution. [5] proposed a regularization term to improve robustness against distributional changes. Their view is complementary to this paper in that they model the labeling process, where images cause labels, while our work focuses on the data-generating process (labels cause images). [21] proposed a similar causal DAG for images, while adding auxiliary information, such as the fact that some images are of the same instance with a different "style". This allowed the model to identify core features. The approach described in this paper does not use such auxiliary information. It also views the object and attribute as playing mutual roles, making their inferred representations invariant to each other.

Unsupervised disentanglement of representations: Several works use a VAE [31] approach for unsupervised disentanglement of representations [39, 22, 11, 13, 52, 41]. This paper focuses on a different problem setup: (1) Our goal is to infer a joint attribute-object pair; disentangling the representation is a useful byproduct. (2) In our setup, attribute-object combinations are dependent in the training data, and new combinations may be observed at test time. (3) We do not use unsupervised learning. (4) We take a simpler embedding-based approach.

3 Method overview

We start with a descriptive overview of our approach. For simplicity, we skip here the causal motivation and describe the model informally from an embedding viewpoint.

Our model is designed to estimate p(x|a, o), the likelihood of an image feature vector x, conditioned on a tuple (a, o) of attribute-object labels. For inference, we iterate over all combinations of labels and select (â, ô) = argmax_{a,o} p(x|a, o).

To estimate the distribution p(x|a, o), our model learns two embedding spaces: Φ_A for attributes and Φ_O for objects (see Figure 1a). These spaces can be thought of as semantic embedding spaces, where an attribute a (say, "white") has some dense prototypical representation φ_a ∈ Φ_A, and an object o (say, a cauliflower) has a dense representation φ_o ∈ Φ_O.

Given a new image x, we learn a mapping to three spaces. First, an inferred attribute embedding φ̂_a ∈ Φ_A represents the attribute seen in the image (say, how white the object is). Second, an inferred object embedding φ̂_o ∈ Φ_O represents the object seen in the image (say, how cauliflower-like it is). Finally, we also represent the image in a general space of image features.

Learning the parameters of the model involves learning the three mappings above. In addition, we learn the representation of the attribute prototype ("white") φ_a ∈ Φ_A and the representation of the object prototype ("cauliflower") φ_o ∈ Φ_O. Naturally, we want a perceived attribute φ̂_a to be embedded close to its attribute prototype φ_a. Our loss captures this intuition. Finally, we also aim to make the representation spaces of attributes and objects statistically independent. The intuition is that we want to keep the representation of an object (cauliflower) independent of the attribute (white), so we can recognize that object when seen with new attributes (purple cauliflower).
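To make this overview concrete, the sketch below instantiates the components above as embedding tables and small MLPs, in PyTorch. It is a minimal sketch only: the class name, hidden width, and dimensions (e.g., `d_a`, `d_o`, `d_x`) are illustrative assumptions, not the paper's released code; Section 4 defines the mappings formally.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, d_hidden=300):
    # Small two-layer MLP; the hidden width is an illustrative assumption.
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))

class CausalEmbeddingModel(nn.Module):
    def __init__(self, n_attrs, n_objs, d_a=150, d_o=150, d_x=512):
        super().__init__()
        # Prototype embeddings: h_A maps an attribute label into Phi_A,
        # h_O maps an object label into Phi_O (centers of the class Gaussians).
        self.h_A = nn.Embedding(n_attrs, d_a)
        self.h_O = nn.Embedding(n_objs, d_o)
        # g: (phi_a, phi_o) -> image-feature space X.
        self.g = mlp(d_a + d_o, d_x)
        # g_A^{-1}, g_O^{-1}: image features -> inferred core features.
        self.g_A_inv = mlp(d_x, d_a)
        self.g_O_inv = mlp(d_x, d_o)

    def forward(self, x, a, o):
        h_a, h_o = self.h_A(a), self.h_O(o)            # label prototypes
        phi_a_hat = self.g_A_inv(x)                    # inferred attribute embedding
        phi_o_hat = self.g_O_inv(x)                    # inferred object embedding
        x_rec = self.g(torch.cat([h_a, h_o], dim=-1))  # image features from prototypes
        return h_a, h_o, phi_a_hat, phi_o_hat, x_rec
```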
At this point, the challenge remains to build a principled model that can be learned efficiently from data. We now turn to the formal and detailed description of the approach.

4 A causal formulation of compositional zero-shot recognition

We put forward a causal perspective that treats labels as causes of an image, rather than its effects. This direction of dependencies is consistent with the mechanism underlying natural image generation and, as we show below, allows us to recognize unseen label combinations.

Figure 1 presents our causal generative model. We consider two elementary factors, categorical variables called "Attribute" a ∈ A and "Object" o ∈ O, which are dependent (confounded) in the training data. As one example, European swans (Cygnus) are white but Australian ones are black. The data collection process may make white and swan confounded if the data is collected in Europe, even though this dependency does not apply in Australia.

The model also has two semantic representation spaces: one for attributes, Φ_A = R^{d_A}, and another for objects, Φ_O = R^{d_O}. An attribute a induces a distribution p(φ_a|a) over the representation space, which we model as a Gaussian distribution. We denote by h_A a function that maps a categorical attribute to the center of this distribution in the semantic space, h_A : A → Φ_A (Figure 1c). The conditional distribution is therefore φ_a ∼ N(h_a, σ_a²I), where h_a = h_A(a). We have a similar setup for objects: p(φ_o|o) = N(h_o, σ_o²I). Given the semantic embeddings of the attribute and object, the probability of an image feature vector x ∈ X is determined by the representations p(x|φ_a, φ_o), which we model as a Gaussian with respect to a mapping g: x ∼ N(g(φ_a, φ_o), σ_x²I).

φ_a and φ_o can be viewed as an encoding of "core features", namely a representation of attribute and object that is "stable" across the training set and test set, as proposed by [21, 17]. Namely, the conditional distributions p(φ_a|a) and p(φ_o|o) do not substantially change for unseen combinations of attributes and objects. We emphasize that our causal graph is premised on the belief that what we use as objects and attributes are truly distinct aspects of the world, giving rise to different core features. For attributes that have no physical meaning in the world, it may not be possible to postulate two distinct processes giving rise to separate core features.

4.1 Interventions on elementary factors

Causal inference provides a formal mechanism to address the confounding effect through a do-intervention¹. A do-intervention overrides the joint distribution p_trn(a, o), enforcing a, o to specific values and propagating them through the causal graph. With this propagation, an intervention changes the joint distribution of nodes in the graph. Therefore, a test image is generated according to a new joint distribution, denoted in causal language as the interventional distribution p_{do(A=a,O=o)}(x). Thus, for zero-shot learning, we postulate that inference about a test image is equivalent to asking: "Which intervention on attributes and objects caused the image?"

¹Our formalism and model can be extended to include other types of intervention on the joint distribution of attributes and objects. For simplicity, we focus here on the most common do-intervention.

5 Inference

We propose to infer the attribute and object by choosing the most likely interventional distribution:

(â, ô) = argmax_{(a,o) ∈ A×O} p_{do(A=a,O=o)}(x).   (1)
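For reference, under the graph of Figure 1 the interventional likelihood in Eq. (1) marginalizes over the latent core features, and the three factors are exactly the Gaussians defined above. The "hard" approximation described next evaluates this integrand at its most likely value; a sketch of the step (the full derivation is in the supplemental):

```latex
p_{do(A=a,\,O=o)}(x)
  \;=\; \iint p(x \mid \varphi_a, \varphi_o)\,
              p(\varphi_a \mid a)\, p(\varphi_o \mid o)\,
              d\varphi_a\, d\varphi_o .
```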
This inference procedure is more stable than the discriminative zero-shot approach, since the generative conditional distribution is equivalent to the interventional distribution [55]:

p_{do(A=a,O=o)}(x) = p(x|a, o).   (2)

This holds both for training and test, so we simply write p(x|a, o). This likelihood depends on the core features φ_a, φ_o, which are latent variables; computing the likelihood exactly requires marginalizing (integrating) over the latent variables. Since this integral is very hard to compute, we take a "hard" approach instead, evaluating the integrand at its most likely value. Since φ_a, φ_o are not known, we estimate them from the image x by learning mapping functions, e.g., φ̂_a = g_A⁻¹(x) (see Figure 1c). The supplemental describes these approximation steps in detail. It shows that the negative log-likelihood −log p(x|a, o) can be approximated by

L̂(a, o) = (1/σ_a²)‖φ̂_a − h_a‖² + (1/σ_o²)‖φ̂_o − h_o‖² + (1/σ_x²)‖x − g(h_a, h_o)‖².   (3)

Here, h_a, h_o and g(h_a, h_o) are the parameters of the Gaussian distributions of φ_a, φ_o and x. The factors a and o are inferred by taking argmin_{a,o} L̂(a, o) of Eq. (3).

Note that in the causal graph, φ_a, φ_o are parent nodes of the image x, but φ̂_a, φ̂_o are estimated from x and are therefore child nodes of x; as a result, they do not immediately follow the conditional independence relations that φ_a, φ_o obey. We elaborate on this point in Section 6.

Our model consists of five learned mappings: h_A, h_O, g, g_A⁻¹ and g_O⁻¹, illustrated in Figure 1c. All mappings are modelled using MLPs. We aim to learn the parameters of these mappings such that the (approximated) negative log-likelihood of Eq. (3) is minimized. In addition, we include in the objective several regularization terms designed to encourage properties that we want to induce on these mappings. Specifically, the model is trained with a linear combination of three losses:

L = L_data + λ_indep·L_indep + λ_invert·L_invert,   (4)

where λ_indep ≥ 0 and λ_invert ≥ 0 are hyperparameters. We now discuss these losses in detail.

(1) Data likelihood loss. The first component of the loss, L_data, corresponds to the (approximate) negative log-likelihood of the model, as described by Eq. (3):

L_data = ‖h_a − g_A⁻¹(x)‖² + ‖h_o − g_O⁻¹(x)‖² + λ_ao·L_triplet(x, (a, o), (a, o)_neg).   (5)

For easier comparison with [44], we replaced the rightmost term of Eq. (3), ‖x − g(h_a, h_o)‖², with the standard triplet loss L_triplet using the Euclidean distance. λ_ao ≥ 0 is a hyperparameter.

(2) Independence loss. The second component of the loss, L_indep, is designed to capture conditional-independence relations and apply them to the reconstructed core factors φ̂_a, φ̂_o. This encourages the following property: p_{do(O=o)}(φ̂_o) ≈ p_{do(A=a,O=o)}(φ̂_o) and p_{do(A=a)}(φ̂_a) ≈ p_{do(A=a,O=o)}(φ̂_a). Namely, it learns a representation of objects that is robust to attribute interventions, and vice versa.

In more detail, the causal graph (Figure 1a) dictates conditional-independence relations for the latent core factors φ_a, φ_o:

(a) φ_a ⊥ O | A = a,   (b) φ_a ⊥ φ_o | A = a,
(c) φ_o ⊥ A | O = o,   (d) φ_a ⊥ φ_o | O = o.   (6)

These relations reflect the independent mechanisms that generate the training data. Since the core factors φ_a, φ_o are latent and not observed, we wish their reconstructions φ̂_a and φ̂_o to maintain approximately the same independence relations. For example, we encourage (φ̂_o ⊥ A | O = o) to capture the independence in Eq. (6)(c).
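One way to penalize such dependence with a differentiable term is the Hilbert-Schmidt Independence Criterion, which the paper adopts and which is described next. Below is a minimal finite-sample sketch; the Gaussian kernel, bandwidth, and helper names are illustrative assumptions, not the paper's released code.

```python
import torch

def gaussian_kernel(z, sigma=1.0):
    # K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2)); the bandwidth is an assumption here.
    sq_dists = torch.cdist(z, z) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def hsic(z1, z2, sigma=1.0):
    """Biased finite-sample HSIC estimator, trace(K H L H) / (n-1)^2 [19, 20].
    The population quantity is zero iff z1, z2 are independent (universal kernels)."""
    n = z1.shape[0]
    K, L = gaussian_kernel(z1, sigma), gaussian_kernel(z2, sigma)
    # H is the centering matrix: I - (1/n) 11^T.
    H = torch.eye(n, device=z1.device) - torch.full((n, n), 1.0 / n, device=z1.device)
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# E.g., to encourage relation (6)(c) within a batch of samples sharing O = o,
# penalize dependence between inferred object embeddings and one-hot attribute labels:
# loss_indep = hsic(phi_o_hat, one_hot_attrs.float())
```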
To learn mappings that adhere to these statistical independences over φ̂_a and φ̂_o, we regularize the learned mappings using a differentiable measure of statistical dependence. Specifically, we use the Hilbert-Schmidt Independence Criterion (HSIC) [19, 20]. HSIC is a non-parametric method for estimating the statistical dependence between samples of two random variables, based on an implicit embedding into a universal reproducing kernel Hilbert space. In the infinite-sample limit, the HSIC between two random variables is 0 if and only if they are independent [20]. HSIC also has a simple finite-sample estimator, which is easily calculated and is differentiable w.r.t. the input variables. In supplemental Section B, we describe the details of L_indep, how it is optimized with HSIC, and why minimizing L_indep indeed encourages the property p_{do(O=o)}(φ̂_o) ≈ p_{do(A=a,O=o)}(φ̂_o). This minimization can be viewed as minimizing the Post-Interventional Disagreement (PIDA) metric of [58], a recently proposed measure of disentanglement of representations. We explain this perspective in more detail in the supplemental (B.2).

There exist alternative measures for encouraging statistical independence, such as adversarial training [15, 36, 3] or techniques based on mutual information [16, 53, 29]. HSIC has the benefit of being non-parametric, and therefore does not require training an additional network. It is easy to optimize and proved useful in previous literature [59, 56, 43, 18].

(3) Invertible embedding loss. The third component of the loss, L_invert, encourages the label-embedding mappings h_a, h_o, and g(h_a, h_o) to preserve information about their source labels when minimizing L_data. Without this term, minimizing ‖φ̂_a − h_a‖² may reach trivial solutions, because the model has no access to ground-truth values for φ_a or h_a (and similarly for φ_o, h_o). Similar to [44], we use a cross-entropy (CE) loss with a linear layer that classifies the labels that generated each embedding, and a hyperparameter λ_g:

L_invert = CE(a, f_a(h_a)) + CE(o, f_o(h_o)) + λ_g·[CE(a, f_ga(g(h_a, h_o))) + CE(o, f_go(g(h_a, h_o)))].

7 Experiments

Despite several studies of compositionality, current datasets used for evaluation are quite limited. Two main benchmarks were used in previous literature: MIT-states [23] and UT-Zappos50K [63].

The MIT-states dataset was labeled automatically using early image-search-engine technology based on the text surrounding images. As a result, labels are often incorrect. We quantified label quality using human raters, and found that the labels are too noisy to be useful for proper evaluations. In more detail, we presented images to human raters, along with candidate attribute labels from the dataset. Raters were asked to select the best and second-best attributes that describe the noun (multiple-choice setup). Only 32% of raters selected the correct attribute as their first choice (top-1 accuracy), and only 47% of raters had the correct attribute in one of their choices (top-2 accuracy). The top-2 accuracy was only slightly higher than adding a random label on top of the top-1 label (yielding 42%). To verify that raters were attentive, we also injected 30 sanity questions that had two "easy" attributes, yielding a top-2 accuracy of 100%. See the supplemental (H) for further details. We conclude that this level of label noise (roughly 70%) is too high for evaluating noun-attribute compositionality.
7.1 Datasets

Zappos: We evaluate our approach on the Zappos dataset, which consists of fine-grained types of shoes, like "leather sandal" or "rubber sneaker". It has 33K images, 16 attribute classes, and 12 object classes. We use the split of [50] and the provided pretrained ResNet18 features. The split contains both seen and unseen pairs for validation and test. It uses 23K images for training, covering 83 seen pairs; a validation set with 3K images from 15 seen and 15 unseen pairs; and a test set with 3K images from 18 seen and 18 unseen pairs. All the metrics we report for our approach and the compared baselines are averaged over 5 random initializations of the model.

AO-CLEVr: To evaluate compositional methods on a well-controlled clean dataset, we generated a synthetic-images dataset containing images of easy attribute-object categories. We used the CLEVr framework [25], hence we name the dataset AO-CLEVr. AO-CLEVr has attribute-object pairs created from 8 attributes: {red, purple, yellow, blue, green, cyan, gray, brown} and 3 objects {sphere, cube, cylinder}, yielding 24 attribute-object pairs. Each pair consists of 7500 images. Each image contains a single object exhibiting the attribute-object pair. The object is randomly assigned one of two sizes (small/large), one of two materials (rubber/metallic), a random position, and random lighting according to CLEVr defaults. See Figure 2 for examples.

Figure 2: Example images from the AO-CLEVr dataset and their labels.

For cross-validation, we used two types of splits. The first uses the same unseen pairs for validation and test. This split allows us to quantify the potential generalization capability of each method. In the second split, which is harder, unseen validation pairs do not overlap with the unseen test pairs. Importantly, we vary the unseen:seen pair ratio over the range (2:8, 3:7, ..., 7:3), and for each ratio we draw 3 random seen-unseen splits. We report the average and the standard error of the mean (S.E.M.) over the three random splits and three random model initializations for each split. We provide more details about the splits in the supplemental.

7.2 Compared methods

(1) Causal: Our approach as described in Section 5. For Zappos it also learns a single-layer network to project the pretrained image features to the feature space X. We also evaluate a variant named Causal λ_indep=0, which nulls the loss terms that encourage the conditional independence relations.

(2) VisProd: A common discriminatively-trained baseline [45, 42]. It uses two classifiers over image features to predict the attribute and object independently, and approximates p(a, o|x) ≈ p(a|x)p(o|x) (see the sketch after this list).

(3) VisProd&CI: A discriminatively-trained variant of our model. We use VisProd as a vanilla model, regularized by the conditional independence loss L_indep, where we use the top network-layer activations of attributes and objects as proxies for φ̂_a, φ̂_o.

(4) LE: Label embedding [45] trains a neural network to embed images and attribute-object labels into a joint feature space. LE is a vanilla baseline because it approximately models p(x|a, o), but without modelling the core features.

(5) ATTOP: Attributes-as-operators [45] views attributes as operators over the embedding space of object label-embeddings. We use the code of [45] to train ATTOP and LE.

(6) TMN: Task-modular-networks [50] trains a set of network modules jointly with a gating network. The gate rewires the modules according to embeddings of attributes and objects. We used the implementation provided by the authors and followed their grid for hyperparameter search (details in the supplemental). Our results differ on Zappos because we report an average over 5 random initializations rather than the single initialization reported in [50]; some random initializations reproduce their reported AUC metric well.
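As noted in item (2), here is a minimal sketch of a VisProd-style baseline; the two linear heads and all helper names are illustrative assumptions, not the code of [45, 42].

```python
import torch
import torch.nn as nn

class VisProd(nn.Module):
    """Two independent classifier heads over image features: p(a,o|x) ~ p(a|x) p(o|x)."""
    def __init__(self, d_x, n_attrs, n_objs):
        super().__init__()
        self.attr_head = nn.Linear(d_x, n_attrs)
        self.obj_head = nn.Linear(d_x, n_objs)

    def forward(self, x):
        # Log-probabilities for each factor, predicted independently.
        log_pa = self.attr_head(x).log_softmax(dim=-1)
        log_po = self.obj_head(x).log_softmax(dim=-1)
        return log_pa, log_po

def predict_pair(model, x, candidate_pairs):
    """Score each candidate (a, o) pair by log p(a|x) + log p(o|x); return the argmax."""
    log_pa, log_po = model(x)
    scores = torch.stack([log_pa[:, a] + log_po[:, o] for a, o in candidate_pairs], dim=-1)
    return scores.argmax(dim=-1)  # index into candidate_pairs, per sample
```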
We explicitly avoid using prior knowledge in the form of pretrained label embeddings, because we are interested in quantifying the effectiveness of our approach at naturally avoiding unreliable correlations. Yet most of the methods we compare with rely on pretrained embeddings. Thus, we provide additional results using random initialization for the compared methods, denoted by an asterisk (e.g., LE*). Implementation details of our approach and the reproduced baselines are given in the supplemental.

7.3 Evaluation

In zero-shot (ZS) attribute-object recognition, a training set D has N labeled samples of images: D = {(x_i, a_i, o_i), i = 1...N}, where each x_i is a feature vector, a_i ∈ A is an attribute label, o_i ∈ O is an object label, and each pair (a_i, o_i) is from a set of (source) seen pairs S ⊂ A×O. At test time, a new set of samples D′ = {x_i, i = N+1...N+M} is given from a set of target pairs. The target set is the union of the set of seen pairs S with new unseen pairs U ⊂ A×O, U ∩ S = ∅. Our goal is to predict the correct pair for each sample.

Evaluation metrics: We evaluate methods by the accuracy of their top-1 prediction for recognizing seen and unseen attribute-object pairs. In AO-CLEVr, we compute the balanced accuracy across pairs, namely, the average of per-class accuracy. This is the common metric for evaluating zero-shot benchmarks [61, 62]. In Zappos, however, we used the standard (imbalanced) accuracy, to be consistent with the protocol of [50].

We compute metrics in two main evaluation setups, which differ in their output label space, namely, which classes can be predicted. (1) Closed: Predictions can be from unseen class-pairs only. (2) Open: Predictions can be from all pairs in the dataset, seen or unseen. This setup is also known as the generalized zero-shot learning setup [62, 12]. Specifically, we compute: Seen: accuracy computed on samples from seen class-pairs. Unseen: accuracy computed on samples from unseen class-pairs. Harmonic mean: a metric that quantifies the overall performance on both Open-Seen and Open-Unseen accuracy, defined as H = 2·(Acc_seen·Acc_unseen)/(Acc_seen + Acc_unseen). For the harmonic metric, we follow the protocol of [61, 62], which does not take an additional post-processing step. We note that some papers [50, 37, 12] used a different protocol, averaging between seen and unseen. Finally, we report the Area Under the Seen-Unseen Curve (AUSUC), which uses a post-processing step [12] to balance the seen-unseen accuracy. Inspired by the area-under-the-curve procedure, it adjusts the level of confidence of seen pairs by adding (or subtracting) a constant (see [12] for further details). To compute AUSUC, we sweep over a range of constants and compute the area under the seen-unseen curve.

For early stopping, we use (i) the Harmonic metric for the open setup and (ii) the closed accuracy for the closed setup. In Zappos, we followed [50] and used the AUSUC for early stopping in the closed setup.

All experiments were performed on a cluster of DGX-V100 machines. Training a single model for 1000 epochs on the 5:5 AO-CLEVr split (with 80K samples) takes 2-3 hours.
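To make the metrics above concrete, here is a short sketch of the harmonic mean and of the AUSUC sweep. The sweep grid, helper names, and the assumed `accuracy_fn` interface are illustrative assumptions; see [12] for the original procedure.

```python
import numpy as np

def harmonic_mean(acc_seen, acc_unseen):
    """H = 2 * Acc_seen * Acc_unseen / (Acc_seen + Acc_unseen)."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

def ausuc(scores, is_seen_pair, accuracy_fn, grid=np.linspace(-2.0, 2.0, 101)):
    """AUSUC sketch: shift the confidence of seen pairs by a constant c, recompute
    the (seen, unseen) accuracies for each c, and integrate the resulting curve.
    `scores` is an (n_samples, n_pairs) array; `is_seen_pair` is a 0/1 vector
    flagging seen-pair columns; `accuracy_fn(shifted_scores)` is assumed to
    return the tuple (acc_seen, acc_unseen)."""
    curve = sorted(accuracy_fn(scores + c * is_seen_pair) for c in grid)
    xs = np.array([p[0] for p in curve])
    ys = np.array([p[1] for p in curve])
    # Trapezoidal rule over the seen-unseen curve.
    return float(np.sum((xs[1:] - xs[:-1]) * (ys[1:] + ys[:-1]) / 2))
```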
8 Results

We describe here the results obtained on AO-CLEVr (overlapping split) and Zappos. Additional results are reported in the supplemental, including the full set of metrics and numeric values; results using random initialization; results with the non-overlapping split (showing a similar trend to the overlapping split); ablation experiments studying our approach in greater depth; and an error analysis.

AO-CLEVr: Figure 3 (right) shows the Harmonic metric for AO-CLEVr over the whole range of unseen:seen ratios. Unsurprisingly, the more seen pairs are available, the better all models perform on unseen combinations of attributes and objects. Our approach, Causal, performs better than or on par with all the compared methods. VisProd easily confuses the unseen classes. ATTOP is better than LE on the open unseen pairs but performs substantially worse than all methods on the seen pairs. TMN performs as well as our approach on splits with mostly seen pairs, but degrades when the fraction of seen pairs drops below 4:6.

The seen-unseen plane: By definition, our task aims to perform well on two different metrics (multi-objective): accuracy on seen pairs and on unseen pairs. It is therefore natural to compare approaches by their performance on the seen-unseen plane. This is important because different approaches may select different operating points, trading off accuracy on unseen pairs for accuracy on seen pairs. Figure 3 (left) shows how the compared approaches trade off the two metrics on the 5:5 split. ATTOP tends to favor unseen-pair accuracy over seen-pair accuracy; vanilla models like VisProd and LE tend to favor seen classes. Importantly, the figure reveals that modelling the core features largely improves the unseen accuracy, without hurting the seen accuracy much. Specifically, comparing Causal to the vanilla baseline LE improves the unseen accuracy from 26% to 47% and reduces the seen accuracy from 86% to 84%. Comparing VisProd&CI to VisProd improves the unseen accuracy from 19% to 38% and reduces the seen accuracy from 85% to 82%.

| Method | Unseen | Seen | Harmonic | Closed | AUSUC |
|---|---|---|---|---|---|
| *With prior embeddings* | | | | | |
| LE | 10.7 ± 0.8 | 52.9 ± 1.3 | 17.8 ± 1.1 | 55.1 ± 2.3 | 19.4 ± 0.3 |
| ATTOP | 22.6 ± 2.9 | 35.2 ± 2.7 | 26.5 ± 1.4 | 52.2 ± 1.8 | 20.3 ± 1.8 |
| TMN | 9.7 ± 0.6 | 51.9 ± 2.4 | 16.4 ± 1.0 | 60.9 ± 1.1 | 24.6 ± 0.8 |
| *No prior embeddings* | | | | | |
| LE* | 15.6 ± 0.6 | 52.0 ± 1.0 | 24.0 ± 0.7 | 58.1 ± 1.2 | 22.0 ± 0.9 |
| ATTOP* | 16.5 ± 1.5 | 15.8 ± 1.9 | 15.8 ± 1.4 | 42.3 ± 1.5 | 16.7 ± 1.1 |
| TMN* | 6.3 ± 1.4 | 55.3 ± 1.6 | 11.1 ± 2.3 | 58.4 ± 1.5 | 24.5 ± 0.8 |
| Causal λ_indep=0 | 22.5 ± 2.0 | 45.5 ± 3.7 | 29.4 ± 1.5 | 55.3 ± 1.1 | 22.2 ± 0.9 |
| Causal | 26.6 ± 1.6 | 39.7 ± 2.2 | 31.8 ± 1.7 | 55.4 ± 0.8 | 23.3 ± 0.3 |

Table 1: Results for Zappos. ± denotes the standard error of the mean (S.E.M.) over 5 random model initializations.

Figure 3: Left: The seen-unseen plane for the 5:5 split. Modelling the core features largely improves the unseen accuracy: compare Causal to LE or to Causal λ_indep=0, and compare VisProd&CI to VisProd. Error bars denote the S.E.M. over 3 random splits and three random seeds. Right: Harmonic mean of seen-unseen accuracy for AO-CLEVr over a range of 20% to 70% unseen classes. To reduce visual clutter, error bars are shown only for our Causal method and for a vanilla baseline (LE).

Zappos: Our approach improves the Unseen and Harmonic metrics (Table 1). On the Closed and AUSUC metrics it loses to TMN. We note that results on the Closed metric are less interesting from a causal point of view: a model cannot easily rely on knowledge of which attribute-object pairs tend to appear in the test set.
For both AO-CLEVr and Zappos, the independence loss improves recognition of unseen pairs but hurts recognition of seen pairs. This is a known and important trade-off when learning models that are robust to interventions [51]. The independence loss discourages certain types of correlations, so models do not benefit from them when the test and train distributions are identical (seen pairs). However, the loss is constructed such that these are exactly the correlations that fail to hold once the test distribution changes (unseen pairs). Thus, ignoring these correlations improves performance on samples of unseen pairs.

9 Discussion

We present a new causal perspective on the problem of recognizing new attribute-object combinations in images. We propose to view inference in this setup as answering the question "which intervention on attribute and object caused the image?". This perspective gives rise to a causal-inspired embedding model. The model learns disentangled representations of attributes and objects even though they are dependent in the training data. It provides better accuracy on two benchmark datasets.

The trade-off between seen accuracy and unseen accuracy reflects the fact that prior information about the co-occurrence of attributes and objects in the training set is useful and predictive. A related problem has been studied in the setting of (non-compositional) generalized zero-shot learning [8]. We suggest that some of those techniques could be adapted to the compositional setting.

Several aspects of the model can be further extended by relaxing its assumptions. First, the assumption that image features are normally distributed may be limiting, and alternative ways to model this distribution may improve the model's accuracy. Second, the model is premised on the prior knowledge that the attributes and objects have distinct and stable generation processes. However, this prior knowledge may not always be available, and some attributes may not have an obvious physical meaning, e.g., "cute", "comfortable" or "sporty"; in a multi-label setting [47] it is challenging to reveal the independent generation mechanisms themselves from confounded training data.

This paper focused on the case where attributes and objects are fully disentangled. Clearly, in natural language, many attributes and objects are used in ways that make them dependent. For example, white wine is actually yellow, and a "small" bear is larger than a "large" bird. It remains an interesting question to extend the fully disentangled case to learn such specific dependencies while leveraging the power of disentangled representations.

Broader Impact

Compositional generalization, the key challenge of this work, is critical for learning in real-world domains where the long tail of new combinations dominates the distribution, as in vision-and-language tasks or the perception modules of autonomous driving. A causality-based approach, like the one we propose, may allow vision systems to make more robust inferences and debias correlations that naturally exist in the training data, allowing vision systems to be used in complex environments where the distribution of labels and their combinations varies. It has been shown in the past that vision systems may emphasize biases in the data, and the ideas put forward in this paper may help make systems more robust to such biases.
Such an approach may be useful for improving fairness in various applications, for example by providing more balanced visual recognition of individuals from minority groups.

Acknowledgements

We thank Daniel Greenfeld, Idan Schwartz, Eli Meirom, Haggai Maron, Lior Bracha and Ohad Amosy for their helpful feedback on an early version. Uri Shalit was partially supported by the Israel Science Foundation (grant No. 1950/19).

Funding Disclosure

Uri Shalit was partially supported by the Israel Science Foundation. Yuval Atzmon was supported by the Israel Science Foundation and Bar-Ilan University during his Ph.D. studies.

References

[1] Aishwarya Agrawal, Aniruddha Kembhavi, Dhruv Batra, and Devi Parikh. C-VQA: A compositional split of the visual question answering (VQA) v1.0 dataset. arXiv preprint arXiv:1704.08243, 2017.
[2] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. VQA: Visual question answering. International Journal of Computer Vision, 123(1):4–31, 2017.
[3] Kei Akuzawa, Yusuke Iwasawa, and Yutaka Matsuo. Adversarial invariant feature learning with accuracy constraint for domain generalization. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2019.
[4] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. In NAACL, 2016.
[5] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
[6] Yuval Atzmon, Jonathan Berant, Vahid Kezami, Amir Globerson, and Gal Chechik. Learning to generalize to new compositions in image understanding. arXiv preprint arXiv:1608.07639, 2016.
[7] Yuval Atzmon and Gal Chechik. Probabilistic AND-OR attribute grouping for zero-shot learning. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, 2018.
[8] Yuval Atzmon and Gal Chechik. Adaptive confidence smoothing for generalized zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[9] Dzmitry Bahdanau, Harm de Vries, Timothy J. O'Donnell, Shikhar Murty, Philippe Beaudoin, Yoshua Bengio, and Aaron Courville. CLOSURE: Assessing systematic generalization of CLEVR models, 2019.
[10] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations, 2020.
[11] Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599, 2018.
[12] W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, 2016.
[13] Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610–2620, 2018.
[14] Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems 32, pages 11698–11709, 2019.
[15] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky.
Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[16] Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Conditional dependence via Shannon capacity: Axioms, estimators and applications. arXiv preprint arXiv:1602.03476, 2016.
[17] Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In International Conference on Machine Learning, pages 2839–2848, 2016.
[18] Daniel Greenfeld and Uri Shalit. Robust learning with the Hilbert-Schmidt independence criterion. arXiv preprint arXiv:1910.00270, 2019.
[19] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pages 63–77. Springer, 2005.
[20] Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, and Alex J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, pages 585–592, 2008.
[21] Christina Heinze-Deml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness. arXiv preprint arXiv:1710.11469, 2017.
[22] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.
[23] Phillip Isola, Joseph J. Lim, and Edward H. Adelson. Discovering states and transformations in image collections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1383–1391, 2015.
[24] D. Jayaraman, F. Sha, and K. Grauman. Decorrelating semantic visual attributes by resisting the urge to share. In CVPR, 2014.
[25] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
[26] Keizo Kato, Yin Li, and Abhinav Gupta. Compositional learning for human object interaction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 234–251, 2018.
[27] Nan Rosemary Ke, Olexa Bilaniuk, Anirudh Goyal, Stefan Bauer, Hugo Larochelle, Chris Pal, and Yoshua Bengio. Learning neural causal models from unknown interventions. arXiv preprint arXiv:1910.01075, 2019.
[28] Niki Kilbertus, Giambattista Parascandolo, and Bernhard Schölkopf. Generalization in anti-causal learning. arXiv preprint arXiv:1812.00524, 2018.
[29] Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, and Junmo Kim. Learning not to learn: Training deep neural networks with biased data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9012–9020, 2019.
[30] D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[31] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, 2014.
[32] Murat Kocaoglu, Christopher Snyder, Alexandros G. Dimakis, and Sriram Vishwanath. CausalGAN: Learning causal implicit generative models with adversarial training. In International Conference on Learning Representations, 2018.
[33] B. Lake. Towards more human-like concept learning in machines: compositionality, causality, and learning-to-learn. Massachusetts Institute of Technology, 2014.
[34] B. Lake, T. Ullman, J. Tenenbaum, and S. Gershman.
Building machines that learn and think like people. In Behavioral and Brain Sciences, 2017.
[35] C.H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[36] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 624–639, 2018.
[37] Yong-Lu Li, Yue Xu, Xiaohan Mao, and Cewu Lu. Symmetry and group in attribute-object compositions. In CVPR, 2020.
[38] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.
[39] Francesco Locatello, Michael Tschannen, Stefan Bauer, Gunnar Rätsch, Bernhard Schölkopf, and Olivier Bachem. Disentangling factors of variation using few labels. ICLR, 2020.
[40] David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Schölkopf, and Léon Bottou. Discovering causal signals in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6979–6987, 2017.
[41] Emile Mathieu, Tom Rainforth, N. Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. ICML, 2019.
[42] Ishan Misra, Abhinav Gupta, and Martial Hebert. From red wine to red tomato: Composition with context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1792–1801, 2017.
[43] Tanmoy Mukherjee, Makoto Yamada, and Timothy M. Hospedales. Deep matching autoencoders. arXiv preprint arXiv:1711.06047, 2017.
[44] Tushar Nagarajan and Kristen Grauman. Attributes as operators: factorizing unseen attribute-object compositions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 169–185, 2018.
[45] Tushar Nagarajan and Kristen Grauman. Attributes as operators: factorizing unseen attribute-object compositions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 169–185, 2018.
[46] Zhixiong Nan, Yang Liu, Nanning Zheng, and Song-Chun Zhu. Recognizing unseen attribute-object pair with generative model. In AAAI, volume 33, 2019.
[47] Genevieve Patterson and James Hays. COCO attributes: Attributes for people, animals, and objects. In European Conference on Computer Vision, 2016.
[48] Tzuf Paz-Argaman, Yuval Atzmon, Gal Chechik, and Reut Tsarfaty. ZEST: Zero-shot learning from text descriptions using textual similarity and visual summarization, 2020.
[49] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
[50] Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc'Aurelio Ranzato. Task-driven modular networks for zero-shot compositional learning. In ICCV, 2019.
[51] Dominik Rothenhäusler, Nicolai Meinshausen, Peter Bühlmann, and Jonas Peters. Anchor regression: heterogeneous data meets causality. arXiv preprint arXiv:1801.06229, 2018.
[52] Paul K. Rubenstein, Bernhard Schölkopf, and Ilya Tolstikhin. Learning disentangled representations with Wasserstein auto-encoders. ICLR workshop, 2018.
[53] Eduardo Hugo Sanchez, Mathieu Serrurier, and Mathias Ortner. Learning disentangled representations via mutual information estimation. arXiv preprint arXiv:1912.03915, 2019.
[54] Bernhard Schölkopf. Causality for machine learning. arXiv preprint arXiv:1911.10500, 2019.
[55] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, pages 459–466. Omnipress, 2012.
[56] Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. Feature selection via dependence maximization. Journal of Machine Learning Research, 13(May):1393–1434, 2012.
[57] Raphael Suter, Ðorđe Miladinović, Bernhard Schölkopf, and Stefan Bauer. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. arXiv preprint arXiv:1811.00007, 2018.
[58] Raphael Suter, Ðorđe Miladinović, Bernhard Schölkopf, and Stefan Bauer. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In ICML, 2019.
[59] Yao-Hung Hubert Tsai and Ruslan Salakhutdinov. Improving one-shot learning through fusing side information. arXiv preprint arXiv:1710.08347, 2017.
[60] Kun Wei, Muli Yang, Hao Wang, Cheng Deng, and Xianglong Liu. Adversarial fine-grained composition learning for unseen attribute-object recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3741–3749, 2019.
[61] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In CVPR, 2018.
[62] Y. Xian, B. Schiele, and Z. Akata. Zero-shot learning - the good, the bad and the ugly. In CVPR, 2017.
[63] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In Computer Vision and Pattern Recognition (CVPR), June 2014.
[64] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827, 2013.