# Efficient Domain Generalization via Common-Specific Low-Rank Decomposition

Vihari Piratla*¹, Praneeth Netrapalli², Sunita Sarawagi¹

*Most work done during an internship at Microsoft Research, India. ¹Department of Computer Science, Indian Institute of Technology, Mumbai, India. ²Microsoft Research, Bangalore, India. Correspondence to: Vihari Piratla. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Domain generalization refers to the task of training a model that generalizes to new domains not seen during training. We present CSD (Common-Specific Decomposition) for this setting, which jointly learns a common component (which generalizes to new domains) and a domain-specific component (which overfits on the training domains). The domain-specific components are discarded after training and only the common component is retained. The algorithm is extremely simple and involves only modifying the final linear classification layer of any given neural network architecture. We present a principled analysis to understand existing approaches, provide identifiability results for CSD, and study the effect of low rank on domain generalization. We show that CSD either matches or beats state-of-the-art approaches for domain generalization based on domain erasure, domain-perturbed data augmentation, and meta-learning. Further diagnostics on rotated MNIST, where domains are interpretable, confirm the hypothesis that CSD successfully disentangles common and domain-specific components and hence leads to better domain generalization. Our code and dataset are publicly available at https://github.com/vihari/csd.

1. Introduction

In the domain generalization (DG) task we are given domain-demarcated data from multiple domains during training, and our goal is to create a model that generalizes to instances from new domains during testing. Unlike the more popular domain adaptation task (Mansour et al., 2009; Ben-David et al., 2006; Daumé III et al., 2010), which explicitly adapts to a fixed target domain, DG requires zero-shot generalization to individual instances from multiple new domains.

Domain generalization is of particular significance for large-scale deep learning because large training sets often entail aggregation from multiple distinct domains, on which today's high-capacity networks easily overfit. Standard methods of regularization only address generalization to unseen instances sampled from the distribution of the training domains, and have been shown to perform poorly on instances from unseen domains.

Domain generalization approaches have a rich history. Earlier methods were simpler and either sought to learn feature representations that were invariant across domains (Motiian et al., 2017; Muandet et al., 2013; Ghifary et al., 2015; Wang et al., 2019), or decomposed parameters into shared and domain-specific components (Khosla et al., 2012; Li et al., 2017). Of late, however, the methods proposed for DG are significantly more complicated and expensive. A recent favorite is gradient-based meta-learning, which trains with sampled domain pairs to either minimize the loss on one domain while updating parameters on another (Balaji et al., 2018; Li et al., 2018a), or minimize the divergence between their representations (Dou et al., 2019).
Another approach is to augment training data with domain-adversarial perturbations (Shankar et al., 2018).

Contributions. Our paper starts by analyzing the domain generalization problem in a simple and natural generative setting. We use this setting to provide a principled understanding of existing DG approaches and to improve upon prior decomposition methods. We design an algorithm, CSD, that decomposes only the last softmax parameters into a common component and a low-rank domain-specific component, with a regularizer that promotes orthogonality of the two parts. We prove identifiability of the shared parameters, which was missing in earlier decomposition-based approaches. We analytically study the effect of rank in trading off domain-specific noise suppression and domain generalization, which in earlier work was largely heuristics-driven. We show that our method, despite its simplicity, provides consistent and competent gains on a battery of six tasks and under 12 domain settings. We emphasize the broad applicability of CSD, unlike several existing DG approaches, as a strength as much as its competent performance on popular datasets. Our experiments span both image and speech datasets and a large range of numbers of training domains (5 to 1000), in contrast to recent DG approaches evaluated only on a few domains. We provide empirical insights into the working of CSD on the rotated MNIST datasets, where the domains are interpretable, and show that CSD indeed manages to separate out the generalizable shared parameters while training with simple domain-specific losses. We also present an ablation study to evaluate the importance of the different terms in our training objective that led to improvements over earlier decomposition approaches.

2. Related Work

The work on domain generalization is broadly characterized by four major themes.

Domain Erasure. Many early approaches attempted to repair the feature representations so as to reduce divergence between representations of different training domains. Muandet et al. (2013) learn a kernel-based domain-invariant representation. Ghifary et al. (2015) estimate shared features by jointly learning multiple data-reconstruction tasks. Li et al. (2018b) use MMD to maximize the match in the feature distributions of two different domains. The idea of domain erasure is further specialized in Wang et al. (2019), which tries to project superficial (say, textural) features out using image-specific kernels. Domain erasure is also the founding idea behind many domain adaptation approaches, for example (Ben-David et al., 2006; Hoffman et al., 2018; Ganin et al., 2016), to name a few.

Augmentation. The idea behind these approaches is to train the classifier with instances obtained from domains hallucinated from the training domains, and thus make the network ready for these neighboring domains. Shankar et al. (2018) propose to augment training data with instances perturbed along directions of domain change; a second classifier is trained in parallel to capture directions of domain change. Volpi et al. (2018) apply such augmentation to single-domain data. Another type of augmentation is to simultaneously solve an auxiliary task. For example, Carlucci et al. (2019) achieve domain generalization for images by solving an auxiliary unsupervised jigsaw puzzle on the side.
Meta-Learning/Meta-Training. A recent popular approach is to pose the problem as a meta-learning task, whereby parameters are updated using a meta-train loss while simultaneously minimizing a meta-test loss (Li et al., 2018a; Balaji et al., 2018), or discriminative features are learned that allow for semantic coherence across meta-train and meta-test domains (Dou et al., 2019). More recently, this problem has been pursued in the spirit of estimating an invariant optimizer across different domains and solved by a form of meta-learning in (Arjovsky et al., 2019). Meta-learning approaches are complicated to implement and slow to train.

Decomposition. In these approaches the parameters of the network are expressed as the sum of a common parameter and domain-specific parameters during training. Daumé III (2007) first applied this idea for domain adaptation. Khosla et al. (2012) applied decomposition to DG by retaining only the common parameter for inference. Li et al. (2017) extended this work to CNNs, where each layer of the network was decomposed into common and specific low-rank components. Our work provides a principled understanding of when and why these methods might work and uses this understanding to design an improved algorithm, CSD. Three key differences are: CSD decomposes only the last layer, imposes a loss on both the common and domain-specific parameters, and constrains the two parts to be orthogonal. We show that orthogonality is required for theoretically proving identifiability. As a result, this newer avatar of an old decomposition-based approach surpasses recent, more involved augmentation and meta-learning approaches.

Other. Distributionally robust optimization (Sagawa et al., 2019) techniques deliver robustness to any mixture of the training distributions. This problem can be seen as a specific version of domain generalization whose objective is to provide robustness to any distribution shift, including shifts in the population of the training domains. There has also been increased interest in learning invariant predictors through a causal viewpoint (Arjovsky et al., 2019; Ahuja et al., 2020) for better out-of-domain generalization.

3. Our Approach

Our approach is guided by the following assumption about domain generalization settings.

Assumption: There are common features in the data whose correlation with the label is consistent across domains, and domain-specific features whose correlation with the label varies (from positive to negative) across domains. Classifiers that rely on common features generalize to new unseen domains far better than those that rely on domain-specific features.

Note that we make no assumptions about (a) the domain-predictive power of common features and (b) the net correlation between domain-specific features and the label. Let us consider the following simple example which illustrates these points. There are $D$ training domains, and examples $(x, y)$ from domain $i \in [D]$ are generated as follows:

$$x = y(e_c + \beta_i e_s) + N(0, \Sigma_i) \in \mathbb{R}^m, \quad i \in [D] \quad (1)$$

where $y = \pm 1$ with equal probability, $m$ is the dimension of the training examples, $e_c \in \mathbb{R}^m$ is a common feature whose correlation with the label is constant across domains, and $e_s \perp e_c \in \mathbb{R}^m$ is a domain-specific feature whose correlation with the label, given by the coefficients $\beta_i$, varies from domain to domain. In particular, for each domain $i$, suppose $\beta_i \sim \mathrm{Unif}[-1, 2]$.
Note that though the $\beta_i$ vary from positive to negative across domains, there is a net positive correlation between $e_s$ and the label $y$. $N(0, \Sigma_i)$ denotes a normal random variable with mean zero and covariance matrix $\Sigma_i$. Since $\Sigma_i$ varies across domains, every feature captures some domain information. We note that the assumption $e_s \perp e_c$ is indeed restrictive; we use it here only to make the discussion and expressions simpler. We relax this assumption later in this section when we discuss the identifiability of the domain-generalizing classifier.

Our assumption at the beginning of this section (which envisages the possibility of seeing $\beta_i \notin [-1, 2]$ at test time) implies that the only classifier that generalizes to new domains is one that depends solely on $e_c$.¹ Consider training a linear classifier on this dataset. We now describe the issues faced by existing domain generalization methods.

¹ This last statement relies on the assumption that $e_c \perp e_s$. If this is not the case, the correct domain-generalizing classifier is the component of $e_c$ that is orthogonal to $e_s$, i.e., $e_c - \frac{\langle e_c, e_s \rangle}{\|e_s\|^2} e_s$. See (4).

Empirical risk minimization (ERM): When we train a linear classifier using ERM with cross-entropy loss on all of the training data, the resulting classifier puts significant nonzero weight on the domain-specific component $e_s$. The reason is that there is a bias in the training data which gives an overall positive correlation between $e_s$ and the label.

Domain erasure (Ganin et al., 2016): Domain erasure methods seek to extract features that have the same distribution across different domains and construct a classifier using those features. However, since the noise covariance matrix in (1) varies across domains, none of the features have the same distribution across domains. We further verified empirically that domain erasure methods do not improve much over ERM.

Domain adversarial perturbations (Shankar et al., 2018): These methods augment the training dataset with adversarial examples obtained using a domain classification loss, and train a classifier on the resulting augmented dataset. Since the common component $e_c$ carries domain signal, the adversarial examples induce variation in this component, and so the resulting classifier puts less weight on the common component. We verified this empirically and further observed that it does not improve much over ERM.

Meta-learning: Meta-learning based DG approaches such as (Dou et al., 2019) work with pairs of domains: parameters updated using gradients on the loss of one domain, when applied to samples of both domains in the pair, should lead to similar class distributions. If the method used to detect similarity is robust to domain-specific noise, meta-learning methods could work well in this setting. But meta-learning methods can be expensive in practice, either because they require second-order gradient updates or because they have a quadratic dependence on the number of domains, and therefore do not scale well to a large number of training domains.

Decomposition based approaches (Khosla et al., 2012; Li et al., 2017): Decomposition based approaches rely on the observation that for problems like (1), there exist good domain-specific classifiers $w_i$, one for each domain $i$, such that $w_i = e_c + \gamma_i e_s$, where $\gamma_i$ is a function of $\beta_i$. Note that all these domain-specific classifiers share the common component $e_c$, which is precisely the domain-generalizing classifier that we are looking for!
If we are able to find domain-specific classifiers of this form, we can extract $e_c$ from them. This idea can be extended to a generalized version of (1), where the latent dimension of the domain space is $k$, i.e.,

$$x = y\Big(e_c + \sum_{j=1}^{k} \beta_{i,j}\, e_{s_j}\Big) + N(0, \Sigma_i). \quad (2)$$

Here $e_{s_j} \perp e_c \in \mathbb{R}^m$, and $e_{s_j} \perp e_{s_\ell}$ for $j, \ell \in [k]$, $j \neq \ell$, are domain-specific features whose correlation with the label, given by the coefficients $\beta_{i,j}$, varies from domain to domain. In this setting, there exist good domain-specific classifiers $w_i$ such that $w_i = e_c + E_s \gamma_i$, where $e_c \in \mathbb{R}^m$ is a domain-generalizing classifier, $E_s = [e_{s_1}\; e_{s_2}\; \cdots\; e_{s_k}] \in \mathbb{R}^{m \times k}$ consists of the domain-specific components, and $\gamma_i \in \mathbb{R}^k$ is a domain-specific combination of the domain-specific components that depends on $\beta_{i,j}$ for $j = 1, \ldots, k$.

With this observation, the algorithm is simple to state: train domain-specific classifiers $w_i$ that can be represented as $w_i = w_c + W_s \gamma_i \in \mathbb{R}^m$. Here the training variables are $w_c \in \mathbb{R}^m$, $W_s \in \mathbb{R}^{m \times k}$ and $\gamma_i \in \mathbb{R}^k$. After training, discard all the domain-specific components $W_s$ and $\gamma_i$ and return the common classifier $w_c$. Note that this can equivalently be written as

$$W = w_c \mathbf{1}^\top + W_s \Gamma^\top, \quad (3)$$

where $W := [w_1\; w_2\; \cdots\; w_D]$, $\mathbf{1} \in \mathbb{R}^D$ is the all-ones vector, and $\Gamma := [\gamma_1\; \gamma_2\; \cdots\; \gamma_D]^\top \in \mathbb{R}^{D \times k}$.

This framing of the decomposition approach, in the context of simple examples as in (1) and (2), lets us address the three main aspects that are not properly handled by prior works: 1) identifiability of $w_c$, 2) choice of $k$, and 3) extension to non-linear models such as neural networks.

Identifiability of the common component $w_c$: None of the prior decomposition based approaches investigate identifiability of $w_c$. In fact, given a general matrix $W$ which can be written as $w_c \mathbf{1}^\top + W_s \Gamma^\top$, there are multiple ways of decomposing $W$ into this form, so $w_c$ cannot be uniquely determined by this decomposition alone. For example, given a decomposition (3), for any $(k+1) \times (k+1)$ invertible matrix $R$, we can write

$$W = [w_c\; W_s]\, R^{-1} R\, [\mathbf{1}\; \Gamma]^\top.$$

As long as the first row of $R$ is equal to $[1\; 0\; \cdots\; 0]$, the structure of the decomposition (3) is preserved while $w_c$ might no longer be the same. Out of all the different $w_c$ that can be obtained this way, which one is the correct domain-generalizing classifier? In the setting of (2), where $e_c \perp E_s$, we saw that the correct domain-generalizing classifier is $w_c = e_c$. In the setting where $e_c \not\perp E_s$, the correct domain-generalizing classifier is the projection of $e_c$ onto the space orthogonal to $\mathrm{Span}(E_s)$, i.e.,

$$w_c = e_c - P_{E_s} e_c, \quad (4)$$

where $P_{E_s}$ is the projection matrix onto the span of the domain-specific vectors $e_{s_j}$. The next lemma shows that (4) is equivalent to $w_c \perp \mathrm{Span}(W_s)$, and so we train classifiers (3) satisfying this condition.

Lemma 1. Suppose $W := e_c \mathbf{1}^\top + E_s \hat{\Gamma}^\top = w_c \mathbf{1}^\top + W_s \Gamma^\top$ is a rank-$(k+1)$ matrix, where $E_s \in \mathbb{R}^{m \times k}$, $\hat{\Gamma} \in \mathbb{R}^{D \times k}$, $W_s \in \mathbb{R}^{m \times k}$ and $\Gamma \in \mathbb{R}^{D \times k}$ are all rank-$k$ matrices with $k < m, D$. Then, $w_c = e_c - P_{E_s} e_c$ if and only if $w_c \perp \mathrm{Span}(W_s)$.

Proof of Lemma 1. If direction: Suppose $w_c \perp \mathrm{Span}(W_s)$. Then, $W^\top w_c = \langle e_c, w_c \rangle \mathbf{1} + \hat{\Gamma} E_s^\top w_c = \|w_c\|^2 \mathbf{1}$. Since $W$ is a rank-$(k+1)$ matrix, we know that $\mathbf{1} \notin \mathrm{Span}(\hat{\Gamma})$, and so it has to be the case that $\langle e_c, w_c \rangle = \|w_c\|^2$ and $E_s^\top w_c = 0$. Both of these together imply that $w_c$ is the projection of $e_c$ onto the space orthogonal to $E_s$, i.e., $w_c = e_c - P_{E_s} e_c$.

Only if direction: Let $w_c = e_c - P_{E_s} e_c$. Then $e_c \mathbf{1}^\top - w_c \mathbf{1}^\top + E_s \hat{\Gamma}^\top = P_{E_s} e_c \mathbf{1}^\top + E_s \hat{\Gamma}^\top$ is a rank-$k$ matrix and can be written as $W_s \Gamma^\top$ with $\mathrm{Span}(W_s) = \mathrm{Span}(E_s)$. Since $w_c \perp \mathrm{Span}(E_s)$, we also have $w_c \perp \mathrm{Span}(W_s)$.
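The following NumPy sketch (ours, not from the paper; all variable names are illustrative) numerically exercises Lemma 1 and Equation (4): it builds $W$ from a common vector that is deliberately not orthogonal to the domain-specific directions, recovers the identifiable common part as the component of $e_c$ orthogonal to $\mathrm{Span}(E_s)$, and shows that an arbitrary re-mixing by an invertible $R$ preserves the form (3) but changes the "common" vector.

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, k = 8, 20, 2

# A common direction e_c that is deliberately NOT orthogonal to the
# domain-specific directions E_s, to exercise Equation (4).
e_c = rng.normal(size=m)
E_s = rng.normal(size=(m, k))
Gamma_hat = rng.normal(size=(D, k))          # per-domain coefficients

# Per-domain classifiers stacked as columns: W = e_c 1^T + E_s Gamma_hat^T
W = np.outer(e_c, np.ones(D)) + E_s @ Gamma_hat.T

# Projection of e_c onto Span(E_s), and the identifiable common part (4):
P_Es = E_s @ np.linalg.pinv(E_s)             # orthogonal projector onto Span(E_s)
w_c = e_c - P_Es @ e_c

# The residual W - w_c 1^T should be rank k and orthogonal to w_c (Lemma 1).
residual = W - np.outer(w_c, np.ones(D))
print("rank of residual:", np.linalg.matrix_rank(residual))   # -> k
print("max |w_c^T residual|:", np.abs(w_c @ residual).max())  # -> ~0

# In contrast, re-mixing with an invertible R whose first row is [1, 0, ..., 0]
# preserves the form (3) but yields a different "common" vector.
R = np.eye(k + 1)
R[1:, 0] = rng.normal(size=k)                # shear: invertible, first row intact
B = np.column_stack([w_c, E_s]) @ np.linalg.inv(R)
print("alternative w_c differs:", np.linalg.norm(B[:, 0] - w_c) > 1e-6)
```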
Why low rank?: An important choice in decomposition approaches is the rank in decomposition (3), which in prior works was justified heuristically by appealing to the number of parameters. We prove the following result, which gives a more principled reason for the choice of the low-rank parameter $k$.

Theorem 1. Given any matrix $W \in \mathbb{R}^{m \times D}$, the minimizers of the function $f(w_c, W_s, \Gamma) = \|W - w_c \mathbf{1}^\top - W_s \Gamma^\top\|_F^2$, where $W_s \in \mathbb{R}^{m \times k}$ and $w_c \perp \mathrm{Span}(W_s)$, can be computed by a simple SVD-based algorithm (Algorithm 2 in Appendix A). In particular, we have: for $k = 0$, $w_c = \frac{1}{D} W \mathbf{1}$, and for $k = D - 1$, $w_c = (W^\top)^{+}\mathbf{1} / \|(W^\top)^{+}\mathbf{1}\|^2$.

The proof of this theorem is similar to that of the classical low-rank approximation theorem of Eckart-Young-Mirsky, and is presented in the supplementary material. When $W = w_c \mathbf{1}^\top + W_s \Gamma^\top + N$, where $N$ is a noise matrix (for example due to finite samples), the two extremes $k = 0$ and $k = D - 1$ have different advantages and drawbacks:

$k = 0$: Averaging effectively reduces the noise component $N$, but ends up retaining some domain-specific components if there is a net correlation with the label in the training data, similar to ERM.

$k = D - 1$: The pseudoinverse effectively removes domain-specific components and retains only the common component. However, the pseudoinverse does not reduce noise to the same extent as plain averaging would (since the empirical mean is often asymptotically the best estimator of the mean).

In general, the sweet spot for $k$ lies between $0$ and $D - 1$, and its precise value depends on the dimension and magnitude of the domain-specific components as well as the magnitude of the noise. In our implementation, we perform cross-validation to choose a good value for $k$, but also note that the performance of our algorithm is relatively stable with respect to this choice (Section 4.3).

Extension to neural networks: Finally, prior works extend this approach to non-linear models such as neural networks by imposing a decomposition of the form (3) on the parameters of all layers separately. This increases the size of the model significantly and leads to worse generalization performance. Further, it is not clear whether any of the insights we gained above for linear models continue to hold when we include non-linearities and stack them together. So, we propose the following two simple modifications instead: enforcing the structure (3) only in the final linear layer, in place of the standard single softmax layer, and including a loss term for the predictions of the common component, in addition to the domain-specific losses. Both of these encourage learning of features with common-specific structure. Our experiments (Section 4.2) show that these modifications (orthogonality, changing only the final linear layer, and including the common loss) are instrumental in making decomposition methods state of the art for domain generalization. Our overall training algorithm below details the steps.

3.1. Algorithm CSD

Our method of training neural networks for domain generalization appears as Algorithm 1 and is called CSD, for Common-Specific Decomposition. The analysis above was for the binary setting, but we present the algorithm for the multi-class case with $C$ = number of classes. The only extra parameters that CSD requires, beyond the usual feature parameters $\theta$ and softmax parameters $w_c \in \mathbb{R}^{C \times m}$, are the domain-specific low-rank parameters $W_s \in \mathbb{R}^{C \times m \times k}$ and $\gamma_i \in \mathbb{R}^k$ for $i \in [D]$. Here $m$ is the representation size of the penultimate layer. Thus, $\Gamma = [\gamma_1, \ldots, \gamma_D]$ can be viewed as a domain-specific embedding matrix of size $k \times D$.
Note that, unlike in a standard mixture of softmaxes, the $\gamma_i$ values are not required to lie on the simplex. Each training instance consists of an input $x$, a true label $y$, and a domain identifier $i$ from 1 to $D$. Its domain-specific softmax parameter is computed as $w_i = w_c + W_s \gamma_i$. Instead of first computing the full-rank parameters and then performing an SVD, we directly compute the low-rank decomposition while training the network parameters $\theta$. For this we add a weighted combination of three terms to our training objective: (1) an orthonormality regularizer that makes $w_c[y]$ orthogonal to the domain-specific softmax parameters $W_s[y]$ for each label $y$, and avoids degeneracy by keeping the norm of each softmax parameter close to 1; (2) a cross-entropy loss between $y$ and the distribution computed from the $w_i$ parameters, which trains both the common and the low-rank domain-specific parameters; and (3) a cross-entropy loss between $y$ and the distribution computed from the $w_c$ parameters. This last loss might look like a normal ERM loss, but when coupled with the orthogonality regularizer above it achieves domain generalization.

Algorithm 1: Common-Specific Low-Rank Decomposition (CSD)
1: Given: $D, m, k, C, \lambda, \kappa$, train-data
2: Initialize parameters $w_c \in \mathbb{R}^{C \times m}$, $W_s \in \mathbb{R}^{C \times m \times k}$
3: Initialize $\gamma_i \in \mathbb{R}^k$ for all $i \in [D]$
4: Initialize parameters $\theta$ of the feature network $G_\theta : \mathcal{X} \to \mathbb{R}^m$
5: $\hat{W} = [w_c^\top, W_s^\top]^\top$
6: $R \leftarrow \sum_{y=1}^{C} \|I_{k+1} - \hat{W}[y]^\top \hat{W}[y]\|_F^2$   (orthonormality constraint)
7: for $(x, y, i) \in$ train-data do
8:   $w_i \leftarrow w_c + W_s \gamma_i$
9:   loss $\mathrel{+}= L(G_\theta(x), y; w_i) + \lambda\, L(G_\theta(x), y; w_c)$
10: end for
11: Optimize loss $+\, \kappa R$ with respect to $\theta, w_c, W_s, \gamma_i$
12: Return $\theta, w_c$ for inference

3.2. Synthetic setting: comparing CSD with ERM

We use the data model proposed in Equation (1) to simulate multi-domain training data with $D = 10$ domains and $m = 2$ features. For each domain, we sample $\beta_i$ uniformly from $[-1, 2]$ and $\sigma_{ij}$ uniformly from $[0, 1]$. We set $e_c = [1, 0]$ and $e_s = [0, 1]$. We sample 100 data points for each domain using its corresponding values $\beta_i, \Sigma_i$. We then fit the parameters of a linear classifier with log loss using either the standard empirical risk minimization (ERM) estimator or CSD. The scaled solution obtained using ERM is $[1, 0.2]$ and using CSD is $[1, 0.03]$, with high probability over ten runs. As expected, the ERM solution has a positive but small coefficient on the second component due to the net positive correlation on this component. CSD, on the other hand, correctly isolates the common component.
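The synthetic comparison above can be reproduced with a short sketch. The PyTorch code below is ours (it is not the released implementation): it simulates data from Equation (1) and fits both a plain ERM logistic classifier and a linear instance of Algorithm 1 (feature network $G_\theta$ = identity, a single logit, rank $k = 1$, $\lambda = \kappa = 1$). The optimizer, step count, and learning rate are our own choices, and exact numbers will vary with the seed.

```python
import torch

torch.manual_seed(0)
D, m, k, n_per_dom = 10, 2, 1, 100
e_c = torch.tensor([1.0, 0.0])
e_s = torch.tensor([0.0, 1.0])

# Generate data according to Equation (1), one block per training domain.
xs, ys, ds = [], [], []
for i in range(D):
    beta_i = torch.empty(1).uniform_(-1.0, 2.0)             # domain-specific correlation
    sigma_i = torch.empty(m).uniform_(0.0, 1.0)             # diagonal noise covariance
    y = torch.randint(0, 2, (n_per_dom,)).float() * 2 - 1   # labels in {-1, +1}
    x = y[:, None] * (e_c + beta_i * e_s) + torch.randn(n_per_dom, m) * sigma_i
    xs.append(x); ys.append(y); ds.append(torch.full((n_per_dom,), i, dtype=torch.long))
X, Y, DOM = torch.cat(xs), torch.cat(ys), torch.cat(ds)
Y01 = (Y + 1) / 2                                           # {0, 1} targets for BCE

bce = torch.nn.BCEWithLogitsLoss()

def fit(params, loss_fn, steps=2000, lr=0.05):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn().backward()
        opt.step()

# ERM: one linear classifier on the pooled data.
w_erm = torch.zeros(m, requires_grad=True)
fit([w_erm], lambda: bce(X @ w_erm, Y01))

# CSD, linear case of Algorithm 1 (one logit, rank k = 1).
w_c = torch.zeros(m, requires_grad=True)
W_s = torch.randn(m, k, requires_grad=True)
gamma = torch.randn(D, k, requires_grad=True)
lam, kappa = 1.0, 1.0

def csd_loss():
    w_i = w_c + gamma[DOM] @ W_s.T                          # per-example specific classifier
    L_s = bce((X * w_i).sum(dim=1), Y01)                    # loss through w_i
    L_c = bce(X @ w_c, Y01)                                 # loss through w_c alone
    W_hat = torch.cat([w_c.unsqueeze(1), W_s], dim=1)       # [w_c, W_s], shape (m, k+1)
    R = ((torch.eye(k + 1) - W_hat.T @ W_hat) ** 2).sum()   # orthonormality regularizer
    return L_s + lam * L_c + kappa * R

fit([w_c, W_s, gamma], csd_loss)

# The paper reports scaled solutions of roughly [1, 0.2] for ERM and [1, 0.03]
# for CSD in this setting; exact values depend on the run and optimizer.
print("ERM direction:", (w_erm / w_erm[0]).detach())
print("CSD direction:", (w_c / w_c[0]).detach())
```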
4. Experiments

We compare our method with three existing domain generalization methods: (1) MASF (Dou et al., 2019), a recently proposed meta-learning based strategy to learn domain-invariant features; (2) CG (Shankar et al., 2018), as a representative of methods that augment data for domain generalization; and (3) LRD, the low-rank decomposition approach of Li et al. (2017) applied only at the last softmax layer. Our baseline is standard empirical risk minimization (ERM) using cross-entropy loss that ignores domain boundaries altogether. We evaluate on six different datasets spanning image and speech data types and varying numbers of training domains. We assess the quality of domain generalization as accuracy on a set of test domains that are disjoint from the set of training domains.

Experiment setup details. We use ResNet-18 to evaluate on the rotated image tasks, LeNet for the handwritten character datasets, and a multi-layer convolutional network similar to the one used for the speech tasks in (Shankar et al., 2018). We add layer normalization just before the final layer in all these networks since it helped generalization error for all methods, including the baseline. CSD is relatively stable to hyper-parameter choice; we set the default rank to 1 and the parameters of the weighted loss to λ = 1 and κ = 1. These hyper-parameters, along with the learning rates of all other methods, the number of meta-train/meta-test domains for MASF, and the step size of perturbation in CG, are all picked using a task-specific development set.

Handwritten character datasets: In these datasets we have characters written by many different people, where the person writing serves as the domain, and generalizing to new writers is a natural requirement. Handwriting datasets are challenging since it is difficult to disentangle a person's writing style from the character (label), and methods that attempt to erase domains are unlikely to succeed. We have two such datasets. (1) The LipitK dataset (http://lipitk.sourceforge.net/datasets/dvngchardata.htm), earlier used in (Shankar et al., 2018), is a Devanagari character dataset with classification over 111 characters (label) collected from 106 people (domain). We train three different models on 25, 50, and 76 domains, and test on a disjoint set of 20 domains while using 10 domains for validation. (2) The Nepali Handwritten Character dataset (NepaliC, https://www.kaggle.com/ashokpant/devanagari-character-dataset) contains data collected from 41 different people on consonants as the character set, which has 36 classes. Since the number of available domains is small, in this case we create a fixed split of 27 domains for training, 5 for validation and the remaining 9 for testing. We use LeNet as the base classifier on both datasets.

In Table 1 we show the accuracy of the different methods for different numbers of training domains on the LipitK dataset, and on the Nepali dataset. We observe that across all four models CSD provides significant gains in accuracy over the baseline (ERM) and over all three existing methods LRD, CG and MASF. The gap between the prior decomposition-based approach (LRD) and ours establishes the importance of our orthogonality regularizer and common loss term. MASF is better than CSD only for 25 domains, and as the number of domains increases to 76, CSD's accuracy is 87.3 whereas MASF's is 85.9. In terms of training time, MASF is 5 to 10 times slower than CSD, and CG is 3 to 4 times slower than CSD. In contrast, CSD is just 1.1 times slower than ERM. Thus, the increased generalization of CSD incurs little additional training-time overhead compared to existing methods.

PACS: PACS (https://domaingeneralization.github.io/) is a popular domain generalization benchmark. The dataset in aggregate contains around 10,000 images from seven object categories collected from four different sources: Photo, Art, Cartoon and Sketch. Evaluation on this dataset trains on three of the four sources and tests on the left-out domain. This setting is challenging since it tests generalization to a radically different target domain. We present comparisons of our method with JiGen (Carlucci et al., 2019), Epi-FCR (Li et al., 2019) and the ERM baseline in Table 2. For a fair comparison, we use the implementation of JiGen and compare only with methods that have reported numbers using a consistent implementation.
We report numbers for the cases where the baseline network is ResNet-18 and AlexNet. We perform almost the same as or slightly better than the image-specific JiGen (Carlucci et al., 2019). We refrain from comparing with MASF (Dou et al., 2019) and Epi-FCR (Li et al., 2019) on AlexNet, since their numbers are reported from a different implementation with a different baseline number.

Speech utterances dataset: We use the utterance data released by Google (https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html), which was also used in (Shankar et al., 2018) and is collected from a large number of subjects. The base classifier and the preprocessing pipeline for the utterances are borrowed from the implementation provided in the TensorFlow examples (https://github.com/tensorflow/tensorflow/tree/r1.15/tensorflow/examples/speech_commands). We use the default ten (of the 30 total) classes for classification, similar to (Shankar et al., 2018). We use ten percent of the total number of domains for each of validation and test. The accuracy comparison for each of the methods with a varying number of training domains is shown in Table 3. We could not compare with MASF since their implementation is only made available for image tasks. We also skip the comparison with LRD since the earlier experiments established that it can be worse than even the baseline. Table 3 shows that CSD is better than both the baseline and CG in all domain settings. When the number of domains is very large (for example, 1000 in the table), even standard training can suffice since the training domains could cover the variations in the test domains.

Rotated MNIST and Fashion-MNIST: Rotated MNIST is a popular benchmark for evaluating domain generalization where the angle by which images are rotated serves as a proxy for domain. We randomly select a subset of 2,000 images for MNIST and 10,000 images for Fashion-MNIST; the original set of images is considered to be rotated by 0° and is denoted M0. (Earlier work on this dataset lacks standardization of splits, train sizes, and baseline network across the various papers (Shankar et al., 2018; Wang et al., 2019); hence we rerun experiments using the different methods on our split and baseline network.)

| Method | LipitK (25) | LipitK (50) | LipitK (76) | NepaliC (27) |
|---|---|---|---|---|
| ERM (Baseline) | 74.5 (0.4) | 83.2 (0.8) | 85.5 (0.7) | 83.4 (0.4) |
| LRD (Li et al., 2017) | 76.2 (0.7) | 83.2 (0.4) | 84.4 (0.2) | 82.5 (0.5) |
| CG (Shankar et al., 2018) | 75.3 (0.5) | 83.8 (0.3) | 85.5 (0.3) | 82.6 (0.5) |
| MASF (Dou et al., 2019) | 78.5 (0.5) | 84.3 (0.3) | 85.9 (0.3) | 83.3 (1.6) |
| CSD (Ours) | 77.6 (0.4) | 85.1 (0.6) | 87.3 (0.4) | 84.1 (0.5) |

Table 1. Comparison of our method on two handwriting datasets: LipitK and NepaliC. For LipitK, since the number of available training domains is large, we also report results with an increasing number of training domains. The numbers are averages (and standard deviations) over three runs.

| Alg. | Photo | Art | Cartoon | Sketch | Avg. |
|---|---|---|---|---|---|
| ResNet-18: ERM | 95.7 | 77.8 | 74.9 | 67.7 | 79.0 |
| ResNet-18: JiGen | 96.0 | 79.4 | 75.2 | 71.3 | 80.5 |
| ResNet-18: Epi-FCR | 93.9 | 82.1 | 77.0 | 73.0 | 81.5 |
| ResNet-18: CSD | 94.1 (.2) | 78.9 (1.1) | 75.8 (1.0) | 76.7 (1.2) | 81.4 (.3) |
| AlexNet: ERM | 90.0 | 66.7 | 69.4 | 60.0 | 71.5 |
| AlexNet: JiGen | 89.0 | 67.6 | 71.7 | 65.2 | 73.4 |
| AlexNet: CSD | 90.2 (.2) | 68.3 (1.2) | 69.7 (.3) | 63.4 (1.8) | 72.9 (.6) |

Table 2. Comparison of CSD with JiGen (Carlucci et al., 2019) and Epi-FCR (Li et al., 2019) on the PACS dataset with ResNet-18 and AlexNet architectures. The header of each column identifies the target domain, with the last column showing the average accuracy.
Also shown is the standard deviation for CSD in parentheses.

| Method | D = 50 | D = 100 | D = 200 | D = 1000 |
|---|---|---|---|---|
| ERM | 72.6 (.1) | 80.0 (.1) | 86.8 (.3) | 90.8 (.2) |
| CG | 73.3 (.1) | 80.4 (.0) | 86.9 (.4) | 91.2 (.2) |
| CSD | 73.7 (.1) | 81.4 (.4) | 87.5 (.1) | 91.3 (.2) |

Table 3. Accuracy comparison on speech utterance data with a varying number of training domains. The numbers are averages (and standard deviations) over three runs.

Each image in the data split, when rotated by θ degrees, is denoted Mθ. The training data is the union of all images rotated by 15° through 75° in intervals of 15°, creating a total of 5 domains. We evaluate on M0 and M90. In that sense, only in these artificially created domains are we truly sure that the test domains lie outside the span of the train domains. Further, we employ batch augmentations such as left-right flips and random crops, since they significantly improve generalization error and are commonly used in practice. We train using the ResNet-18 architecture.

Table 4 compares the baseline, MASF, and CSD on MNIST and Fashion-MNIST. We show accuracy on a test set from the same domains as training (in-domain) and on test sets from 0° and 90°, which are outside the training domains. Note how CSD's improvement in in-domain accuracy is insignificant, while it gains substantially on out-of-domain data. This shows that CSD specifically targets domain generalization. Surprisingly, MASF does not perform well at all, and is significantly worse than even the baseline. One possibility is that the domain-invariance loss introduced by MASF conflicts with the standard data augmentations used on this dataset. To test this, we compared all the methods without such augmentation (Table 5). We observe that although all numbers drop by 1-4%, MASF now shows sensible improvements over the baseline, but CSD is better than MASF even in this setting. The results on Fashion-MNIST without batch augmentations are shown in Table 8 of the Appendix and are further discussed in Section 6.

Hyper-parameter optimization and validation set: The hyper-parameters in our implementation include an algorithm-specific hyper-parameter, the rank of the decomposition (k), and optimization-specific parameters such as the number of training epochs, learning rate, and optimizer. We use a validation set to pick the optimal values for these parameters. The construction of the validation set varies slightly between tasks. In the handwritten character and speech tasks, we construct the validation set from test data of unseen users; the set of users considered for validation is mutually exclusive with the users (or domains) of the train and test splits. In the PACS and rotated experiments, the validation set is the validation split of the train data, and thereby only reflects performance on the train domains.

| Method | MNIST in-domain | MNIST out-domain | Fashion-MNIST in-domain | Fashion-MNIST out-domain |
|---|---|---|---|---|
| ERM | 98.3 (0.0) | 93.6 (0.7) | 89.5 (0.1) | 75.8 (0.7) |
| MASF | 98.2 (0.1) | 93.2 (0.2) | 86.9 (0.3) | 72.4 (2.9) |
| CSD | 98.4 (0.0) | 94.7 (0.2) | 89.7 (0.2) | 78.0 (1.5) |

Table 4.
Performance comparison on rotated MNIST and rotated Fashion-MNIST. Shown are the in-domain and out-domain accuracies averaged over three runs, with standard deviations in brackets.

| Method | in-domain | out-domain |
|---|---|---|
| ERM | 97.7 (0.0) | 89.0 (.8) |
| MASF | 97.8 (0.0) | 89.5 (.6) |
| CSD | 97.8 (0.0) | 90.8 (.3) |

Table 5. In-domain and out-domain accuracies on rotated MNIST without batch augmentations. Shown are averages and standard deviations over three runs.

[Figure 1: Distribution of the probability assigned to the correct class using the common or the specialized component alone. (a) Beta fit on the estimated probabilities of the correct class using the common component. (b) Beta fit on the estimated probabilities of the correct class using the specialized component.]

4.1. How does CSD work?

We provide empirical evidence in this section that CSD effectively decomposes common and low-rank specialized components. Consider the rotated MNIST task trained on ResNet-18 as discussed in Section 4. Since each domain differs only in the amount of rotation, we expect Ws to be of rank 1, and so we chose k = 1, giving us one common and one specialized component. We are interested in finding out whether the common component is agnostic to the domains, and in seeing how the specialized component varies across domains. We look at the probability assigned to the correct class for all the train instances using only the common component wc and using only the specialized component Ws. For the probabilities assigned to examples in each domain using each component, we fit a Beta distribution. Figure 1(a) shows the fitted Beta distributions for probabilities assigned using wc, and Figure 1(b) for Ws. Note how in Figure 1(a) the colors are distinguishable, yet largely overlapping. However, in Figure 1(b), the modes corresponding to each domain are widely spaced; moreover, the order of the modes and the spacing between them cleanly reflects the underlying degree of rotation from 15° to 75°. These observations support our claims on the utility of CSD for low-rank decomposition.

4.2. Ablation study

In this section we study the importance of each of the three terms in CSD's final loss: the common loss Lc computed from wc; the specialized loss Ls computed from wi, which sums the common (wc) and domain-specific (Ws, Γ) parameters; and the orthonormality loss R that makes wc orthogonal to the domain-specific softmax parameters (see Algorithm 1). In Table 6, we demonstrate the contribution of each term to the CSD loss by comparing accuracy on LipitK with 76 domains.

| Common loss Lc | Specialized loss Ls | Orthonormality regularizer R | Accuracy |
|---|---|---|---|
| Y | N | N | 85.5 (.7) |
| N | Y | N | 84.4 (.2) |
| N | Y | Y | 85.3 (.1) |
| Y | N | Y | 85.7 (.4) |
| Y | Y | N | 85.8 (.6) |
| Y | Y | Y | 87.3 (.3) |

Table 6. Ablation analysis of the CSD loss using LipitK (76).

The first row is the baseline with only the common loss. The second row corresponds to prior decomposition methods that imposed only the specialized loss, without any orthogonality or a separate common loss; this is worse than even the baseline, which can be attributed to decomposition without identifiability guarantees, so that part of wc is lost when the specialized Ws is discarded. Adding the orthogonality constraint (third row) fixes this ill-posed decomposition; however, understandably, just fixing the last layer does not gain much over the baseline. Using both the common and specialized losses even without the orthogonality constraint shows some merit, perhaps because feature sharing from the common loss covers up for the bad decomposition. Finally, fixing this bad decomposition with the orthogonality constraint and using both the common and specialized losses constitutes our CSD algorithm, and is significantly better than any other variant. This empirical study shows that both Lc and R are important: imposing Lc on wc does not help feature sharing if wc is not devoid of specialized components due to a bad decomposition, and a good decomposition of the final layer without Lc does not help generalization much.
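For reference, the per-domain diagnostic of Section 4.1 (Figure 1) amounts to fitting a Beta distribution to the correct-class probabilities of each domain. The SciPy sketch below is ours and uses synthetic stand-in probabilities; in practice these arrays would come from scoring training instances with only wc or only Ws.

```python
import numpy as np
from scipy import stats

# Stand-in inputs: for each training domain (rotation angle), the probability
# assigned to the correct class when scoring with only w_c ("common") or only
# W_s ("specialized"). Here we fake them purely for illustration.
rng = np.random.default_rng(0)
domains = [15, 30, 45, 60, 75]
probs_common = {d: rng.beta(8, 2, size=500) for d in domains}              # similar across domains
probs_specific = {d: rng.beta(2 + 0.1 * d, 2, size=500) for d in domains}  # shifts with rotation

def fit_beta_per_domain(probs_by_domain):
    """Fit a Beta(a, b) to the correct-class probabilities of each domain."""
    fits = {}
    for d, p in probs_by_domain.items():
        # fix location/scale so the fit stays on [0, 1]
        a, b, _, _ = stats.beta.fit(p, floc=0, fscale=1)
        fits[d] = (round(a, 2), round(b, 2))
    return fits

print("common    :", fit_beta_per_domain(probs_common))    # modes overlap (cf. Fig. 1a)
print("specialized:", fit_beta_per_domain(probs_specific)) # modes spread out (cf. Fig. 1b)
```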
4.3. Importance of Low Rank

Table 7 shows accuracy on the speech tasks with increasing k, which controls the rank of the domain-specific component. Rank 0 corresponds to the baseline ERM without any domain-specific part. We observe that accuracy drops as the rank increases beyond 1, and the best value is k = 1 when the number of domains D is at most 100. As we increase D to 200 domains, a higher rank (4) becomes optimal and the results stay stable over a large range of rank values. This matches the analytical understanding resulting from Theorem 1: we can successfully disentangle only those domain-specific components which have been observed in the training domains, and using a higher rank will increase noise in the estimation of wc.

| Rank k | D = 50 | D = 100 | D = 200 |
|---|---|---|---|
| 0 | 72.6 (.1) | 80.0 (.1) | 86.8 (.3) |
| 1 | 74.1 (.3) | 81.4 (.4) | 87.3 (.5) |
| 4 | 73.7 (.1) | 80.6 (.7) | 87.5 (.1) |
| 9 | 73.0 (.6) | 80.1 (.5) | 87.5 (.2) |
| 24 | 72.3 (.2) | 80.5 (.4) | 87.4 (.3) |

Table 7. Effect of the rank constraint (k) on test accuracy for the speech task with a varying number of training domains.

5. Conclusion

We considered a natural multi-domain setting, looked at how a standard classifier can overfit on domain signals, and examined the efficacy of several existing solutions to the domain generalization problem. Motivated by this simple setting, we developed a new algorithm called CSD that effectively decomposes classifier parameters into a common part and a low-rank domain-specific part. We presented a principled analysis to provide identifiability results for CSD while delineating the underlying assumptions. We analytically studied the effect of rank in trading off domain-specific noise suppression and domain generalization, which in earlier work was largely heuristics-driven. We empirically evaluated CSD against four existing algorithms on five datasets spanning speech and images and a large range of numbers of domains. We showed that CSD generalizes better and is considerably faster than existing algorithms, while being very simple to implement. In the future, we plan to investigate algorithms that combine data augmentation with parameter decomposition to gain even higher accuracy on test domains that are related to the training domains.

6. Additional Results

6.1. Rotated Experiments

Our ResNet-18 implementation for the rotation experiments does not resize the original images of MNIST or Fashion-MNIST. As a result, our architecture has fewer parameters than the standard ResNet-18 architecture, which processes images of much larger size.

Table 8 contrasts the performance of CSD with the baseline on the rotated Fashion-MNIST dataset without batch augmentation. The default batch augmentations for Fashion-MNIST include random lateral inversion (left-right flips), which effectively simulates more rotations in the train data. Since the batch augmentations are targeted toward rotation, we see a large gap with respect to the use of batch augmentations, more acute on Fashion-MNIST, as reported in Table 4 with batch augmentation and Tables 5 and 8 without.

| Method | in-domain | out-domain |
|---|---|---|
| ERM | 88.6 (0.0) | 35.8 (1.6) |
| CSD | 88.7 (0.1) | 38.0 (1.1) |

Table 8. In-domain and out-domain accuracies on rotated Fashion-MNIST without batch augmentations. Shown are averages and standard deviations over three runs.

Acknowledgement

We thank the anonymous reviewers for their constructive feedback on this work. The third author's research was partly sponsored by IBM AI Horizon Networks - IIT Bombay.

References

Ahuja, K., Shanmugam, K., Varshney, K., and Dhurandhar, A. Invariant risk minimization games. arXiv preprint arXiv:2002.04692, 2020.
Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Balaji, Y., Sankaranarayanan, S., and Chellappa, R. MetaReg: Towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems, pp. 998-1008, 2018.

Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS'06, 2006. URL http://dl.acm.org/citation.cfm?id=2976456.2976474.

Carlucci, F. M., D'Innocente, A., Bucci, S., Caputo, B., and Tommasi, T. Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2229-2238, 2019.

Daumé III, H. Frustratingly easy domain adaptation. pp. 256-263, 2007.

Daumé III, H., Kumar, A., and Saha, A. Co-regularization based semi-supervised domain adaptation. In NIPS, pp. 478-486, 2010.

Dou, Q., de Castro, D. C., Kamnitsas, K., and Glocker, B. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems, pp. 6447-6458, 2019.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096-2030, 2016.

Ghifary, M., Bastiaan Kleijn, W., Zhang, M., and Balduzzi, D. Domain generalization for object recognition with multi-task autoencoders. In ICCV, pp. 2551-2559, 2015.

Hoffman, J., Mohri, M., and Zhang, N. Algorithms and theory for multiple-source adaptation. In Advances in Neural Information Processing Systems, pp. 8246-8256, 2018.

Khosla, A., Zhou, T., Malisiewicz, T., Efros, A., and Torralba, A. Undoing the damage of dataset bias. In ECCV, pp. 158-171, 2012.

Li, D., Yang, Y., Song, Y., and Hospedales, T. M. Deeper, broader and artier domain generalization. In IEEE International Conference on Computer Vision, ICCV, 2017.

Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. Learning to generalize: Meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018a.

Li, D., Zhang, J., Yang, Y., Liu, C., Song, Y.-Z., and Hospedales, T. M. Episodic training for domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1446-1455, 2019.

Li, H., Pan, S. J., Wang, S., and Kot, A. C. Domain generalization with adversarial feature learning. CVPR, 2018b.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems, pp. 1041-1048, 2009.

Motiian, S., Piccirilli, M., Adjeroh, D. A., and Doretto, G. Unified deep supervised domain adaptation and generalization. In ICCV, pp. 5715-5725, 2017.

Muandet, K., Balduzzi, D., and Schölkopf, B. Domain generalization via invariant feature representation. In ICML, pp. 10-18, 2013.

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Shankar, S., Piratla, V., Chakrabarti, S., Chaudhuri, S., Jyothi, P., and Sarawagi, S. Generalizing across domains via cross-gradient training. In International Conference on Learning Representations, 2018.
URL https://openreview.net/forum?id=r1Dx7fbCW.

Volpi, R., Namkoong, H., Sener, O., Duchi, J. C., Murino, V., and Savarese, S. Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems, pp. 5334-5344, 2018.

Wang, H., He, Z., Lipton, Z. C., and Xing, E. P. Learning robust representations by projecting superficial statistics out. arXiv preprint arXiv:1903.06256, 2019.

Appendix

A. Proof of Theorem 1

Algorithm 2: $\min_{w_c, W_s, \Gamma} f(w_c, W_s, \Gamma) = \|W - w_c\mathbf{1}^\top - W_s\Gamma^\top\|_F^2$, s.t. $W_s \in \mathbb{R}^{m \times k}$ and $w_c \perp \mathrm{Span}(W_s)$
1: Input: $W, k$
2: $w_c \leftarrow \frac{1}{D} W \mathbf{1}$
3: $W_s, \Gamma \leftarrow$ top-$k$ SVD of $W - w_c \mathbf{1}^\top$
4: $w_c^{new} \leftarrow \frac{1}{\|((w_c\mathbf{1}^\top + W_s\Gamma^\top)^\top)^{+}\mathbf{1}\|^2}\, \big((w_c\mathbf{1}^\top + W_s\Gamma^\top)^\top\big)^{+}\mathbf{1}$
5: $W_s^{new} (\Gamma^{new})^\top \leftarrow w_c\mathbf{1}^\top + W_s\Gamma^\top - w_c^{new}\mathbf{1}^\top$
6: Return $w_c^{new}, W_s^{new}, \Gamma^{new}$

Proof of Theorem 1. The high-level outline of the proof is to show that the first two steps obtain the best rank-$(k+1)$ approximation of $W$ whose row space contains $\mathbf{1}$. The last two steps retain this low-rank approximation while ensuring that $w_c^{new} \perp \mathrm{Span}(W_s^{new})$. The proof that the first two steps obtain the best rank-$(k+1)$ approximation follows that of the classical low-rank approximation theorem of Eckart-Young-Mirsky.

We first note that the minimization problem of $f$ under the given constraints can equivalently be written as

$$\min_{\widetilde{W}} \|W - \widetilde{W}\|_F^2 \quad \text{s.t.} \quad \mathrm{rank}(\widetilde{W}) \le k + 1 \ \text{and} \ \mathbf{1} \in \mathrm{Span}(\widetilde{W}^\top). \quad (5)$$

Let $\widetilde{w} := \frac{1}{D} W \mathbf{1}$ and let $W - \widetilde{w}\mathbf{1}^\top = \widetilde{U}\widetilde{\Sigma}\widetilde{V}^\top$ be the SVD of $W - \widetilde{w}\mathbf{1}^\top$. Since $(W - \widetilde{w}\mathbf{1}^\top)\mathbf{1} = 0$, we have that $\mathbf{1} \perp \mathrm{Span}(\widetilde{V})$. We will first show that $\widetilde{W}^* := \widetilde{w}\mathbf{1}^\top + \widetilde{U}_k\widetilde{\Sigma}_k\widetilde{V}_k^\top$, where $\widetilde{U}_k\widetilde{\Sigma}_k\widetilde{V}_k^\top$ is the top-$k$ component of $\widetilde{U}\widetilde{\Sigma}\widetilde{V}^\top$, minimizes both $\|W - \widetilde{W}\|_F^2$ and $\|W - \widetilde{W}\|_2^2$ among all matrices $\widetilde{W}$ satisfying the conditions in (5). Let $\widetilde{\sigma}_i$ denote the $i$-th largest singular value of $\widetilde{U}\widetilde{\Sigma}\widetilde{V}^\top$.

Optimality in operator norm: Fix any $\widetilde{W}$ satisfying the conditions in (5). Since $\mathbf{1} \in \mathrm{Span}(\widetilde{W}^\top)$ and $\mathrm{rank}(\widetilde{W}) \le k + 1$, there is a unit vector $\widetilde{v} \in \mathrm{Span}(\widetilde{V}_{k+1})$ such that $\widetilde{v} \perp \mathrm{Span}(\widetilde{W}^\top)$. Let $\widetilde{v} = \widetilde{V}_{k+1} x$. Since $\|\widetilde{v}\| = 1$ and $\widetilde{V}$ is an orthonormal matrix, we have $\sum_i x_i^2 = 1$. We have

$$\|W - \widetilde{W}\|_2^2 \ge \|(W - \widetilde{W})\widetilde{v}\|^2 = \|W\widetilde{v}\|^2 = \|W\widetilde{V}_{k+1}x\|^2 = \|\widetilde{U}_{k+1}\widetilde{\Sigma}_{k+1}x\|^2 = \sum_{i=1}^{k+1} \widetilde{\sigma}_i^2 x_i^2 \ge \widetilde{\sigma}_{k+1}^2 = \|W - \widetilde{W}^*\|_2^2.$$

This proves the optimality in operator norm.

Optimality in Frobenius norm: Fix again any $\widetilde{W}$ satisfying the conditions in (5). Let $\sigma_i(A)$ denote the $i$-th largest singular value of $A$ and $A_i$ denote the best rank-$i$ approximation to $A$. Let $W' := W - \widetilde{W}$. We have that

$$\sigma_i(W') = \|W' - W'_{i-1}\|_2 = \|W - (\widetilde{W} + W'_{i-1})\|_2 \ge \min_{\widehat{W}} \|W - \widehat{W}\|_2,$$

where the minimum is taken over all $\widehat{W}$ such that $\mathrm{rank}(\widehat{W}) \le i + k$ and $\mathbf{1} \in \mathrm{Span}(\widehat{W}^\top)$. Picking $\widehat{W} = \widetilde{w}\mathbf{1}^\top + \widetilde{U}_{k+i-1}\widetilde{\Sigma}_{k+i-1}\widetilde{V}_{k+i-1}^\top$, we see that $\sigma_i(W - \widetilde{W}) \ge \widetilde{\sigma}_{k+i}$. It follows from this that $\|W - \widetilde{W}\|_F \ge \|W - \widetilde{w}\mathbf{1}^\top - \widetilde{U}_k\widetilde{\Sigma}_k\widetilde{V}_k^\top\|_F$.

For the last two steps, note that they preserve the matrix $w_c\mathbf{1}^\top + W_s\Gamma^\top$. If $w_c'\mathbf{1}^\top + W_s'\Gamma'^\top$ is the unique way to write $w_c\mathbf{1}^\top + W_s\Gamma^\top$ such that $w_c' \perp \mathrm{Span}(W_s')$, we see that

$$(w_c')^\top (w_c\mathbf{1}^\top + W_s\Gamma^\top) = \|w_c'\|^2 \mathbf{1}^\top,$$

meaning that $w_c' = \|w_c'\|^2 \big((w_c\mathbf{1}^\top + W_s\Gamma^\top)^\top\big)^{+}\mathbf{1}$, which is exactly what step 4 computes (the scaling follows by taking norms on both sides). This proves the theorem.
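A compact NumPy rendering of Algorithm 2 (ours; the function name and test harness are illustrative) that computes the decomposition and checks the orthogonality condition of Theorem 1:

```python
import numpy as np

def csd_decompose(W, k):
    """Sketch of Algorithm 2: best rank-(k+1) fit W ~ w_c 1^T + W_s Gamma^T
    with w_c orthogonal to Span(W_s). W has shape (m, D)."""
    m, D = W.shape
    ones = np.ones(D)

    # Steps 2-3: column mean, then top-k SVD of the centered matrix.
    w_c = W @ ones / D
    U, S, Vt = np.linalg.svd(W - np.outer(w_c, ones), full_matrices=False)
    low_rank = U[:, :k] * S[:k] @ Vt[:k]           # W_s Gamma^T, rank k

    # Step 4: re-extract the common vector so that it is orthogonal to the
    # rank-k remainder (see the identity at the end of the proof).
    B = np.outer(w_c, ones) + low_rank             # rank-(k+1) approximation of W
    u = np.linalg.pinv(B.T) @ ones
    w_c_new = u / (u @ u)

    # Step 5: whatever is left is the domain-specific part.
    W_s_Gamma_t = B - np.outer(w_c_new, ones)
    return w_c_new, W_s_Gamma_t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, D, k = 6, 12, 2
    W = rng.normal(size=(m, D))
    w_c, rest = csd_decompose(W, k)
    print("rank of remainder:", np.linalg.matrix_rank(rest))         # -> k
    print("orthogonality |w_c^T rest|:", np.abs(w_c @ rest).max())   # -> ~0
```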