# Robust Optimization over Multiple Domains

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Qi Qian, Shenghuo Zhu, Jiasheng Tang, Rong Jin, Baigui Sun, Hao Li
Alibaba Group, Bellevue, WA, 98004, USA
{qi.qian, shenghuo.zhu, jiasheng.tjs, jinrong.jr, baigui.sbg, lihao.lh}@alibaba-inc.com

## Abstract

In this work, we study the problem of learning a single model for multiple domains. Unlike the conventional machine learning scenario where each domain has its own model, multiple domains (i.e., applications/users) may have to share the same machine learning model due to maintenance loads in cloud computing services. For example, a digit-recognition model should be applicable to hand-written digits, house numbers, car plates, etc. Therefore, an ideal model for cloud computing has to perform well on every applicable domain. To address this new challenge from cloud computing, we develop a framework of robust optimization over multiple domains. In lieu of minimizing the empirical risk, we aim to learn a model optimized for the adversarial distribution over multiple domains. Hence, we propose to learn the model and the adversarial distribution simultaneously with a stochastic algorithm for efficiency. Theoretically, we analyze the convergence rate for convex and non-convex models. To the best of our knowledge, this is the first study of the convergence rate of learning a robust non-convex model with a practical algorithm. Furthermore, we demonstrate that both the robustness of the framework and the convergence rate can be further enhanced by an appropriate regularizer over the adversarial distribution. The empirical study on real-world fine-grained visual categorization and digit recognition tasks verifies the effectiveness and efficiency of the proposed framework.

## Introduction

Learning a single model for multiple domains has become a fundamental problem in machine learning and has found applications in cloud computing services. Cloud computing has witnessed the rapid development of machine learning in recent years. Users of these cloud computing services can benefit from sophisticated models provided by the service carrier, e.g., Aliyun. However, the robustness of deployed models becomes a challenge due to the explosive popularity of cloud computing services. Specifically, to maintain the scalability of a cloud computing service, only a single model exists in the cloud for the same problem from different domains. For example, given a model for digit recognition in the cloud, some users may call it to identify handwritten digits while others may try to recognize printed digits (e.g., house numbers).

Figure 1: Illustration of optimizing over multiple domains. In this example, a digit-recognition model provided by a cloud service carrier should be applicable to multiple domains, e.g., handwritten digits and printed digits.

A satisfactory model has to handle both domains (i.e., handwritten digits and printed digits) well in the modern architecture of cloud computing services. This problem is illustrated in Fig. 1. Note that the problem is different from multi-task learning (Zhang and Yang 2017), which aims to learn different models (i.e., multiple models) for different tasks by exploiting the shared information between related tasks.
In a conventional learning procedure, an algorithm may mix the data from multiple domains by assigning an ad-hoc weight to each example, and then learn a model accordingly. The weight is pre-defined and can be uniform over examples, which is known as empirical risk minimization (ERM). Consequently, the learned model may handle certain domains well but perform arbitrarily poorly on the others. Unsatisfactory performance on certain domains will result in business interruption for users. Moreover, assigning even weights to all examples can suffer from the data imbalance problem when examples from certain domains dominate.

Recently, distributionally robust optimization has attracted much attention (Chen et al. 2017; Namkoong and Duchi 2016; Shalev-Shwartz and Wexler 2016). Unlike the conventional strategy with the uniform distribution, it aims to optimize the performance of the model under the worst-case distribution over examples. The learned model is explicitly more robust since it focuses on the hard examples. To learn a robust model, much of the existing work applies convex loss functions, while the state-of-the-art performance for several important practical problems is reported by methods with non-convex loss functions, e.g., deep neural networks (He et al. 2016; Krizhevsky, Sutskever, and Hinton 2012; Szegedy et al. 2015). (Chen et al. 2017) proposed an algorithm to solve the non-convex problem, but their analysis relies on a near-optimal oracle for the non-convex subproblem, which is not feasible for most non-convex problems in real tasks. Besides, their algorithm has to go through the whole data set at least once to update the parameters at every iteration, which makes it too expensive for large-scale data sets.

In this work, we propose a framework to learn a robust model over multiple domains rather than over examples. By learning the model and the adversarial distribution simultaneously, the algorithm can balance the performance between different domains adaptively. Compared with the previous work, the empirical data distribution in each domain remains unchanged and our framework only learns the distribution over multiple domains. Therefore, the learned model will not be potentially misled by an adversarial distribution over examples. Our framework is also comparatively efficient due to the adoption of stochastic gradient descent (SGD) for optimization. More importantly, we are the first to prove that the proposed method converges at a rate of $O(1/T^{1/3})$ without any dependency on an oracle. To further improve the robustness of the framework, we introduce a regularizer for the adversarial distribution. We find that an appropriate regularizer not only prevents the model from a trivial solution but also accelerates the convergence rate to $O(\sqrt{\log(T)/T})$. The detailed theoretical results are summarized in Table 1. The empirical study on pets categorization and digit recognition demonstrates the effectiveness and efficiency of the proposed method.

Table 1: Convergence rates for the non-convex model and the adversarial distribution (Adv-Dist) under different settings.

| Setting (Model) | Setting (Adv-Dist) | Convergence (Model) | Convergence (Adv-Dist) |
| --- | --- | --- | --- |
| Smooth | Concave | $O(1/T^{1/3})$ | $O(1/T^{1/3})$ |
| Smooth | Strongly concave | $O(\sqrt{\log(T)/T})$ | $O(\log(T)/T)$ |

## Related Work

Robust optimization has been extensively studied in the past decades (Bertsimas, Brown, and Caramanis 2011).
Recently, it has been investigated to improve the performance of the model under the worst-case data distribution, which can be interpreted as regularizing the variance (Duchi, Glynn, and Namkoong 2016). For a set of convex loss functions (e.g., a single data set), (Namkoong and Duchi 2016) and (Shalev-Shwartz and Wexler 2016) proposed to optimize the maximal loss, which is equivalent to minimizing the loss under the worst-case distribution generated from the empirical distribution of data. (Namkoong and Duchi 2016) showed that for an f-divergence constraint, a standard stochastic mirror descent algorithm converges at the rate of $O(1/\sqrt{T})$ for convex losses. In (Shalev-Shwartz and Wexler 2016), the analysis indicates that minimizing the maximal loss can improve the generalization performance. In contrast to a single data set, we focus on dealing with multiple data sets and propose to learn a non-convex model in this work.

To tackle non-convex losses, (Chen et al. 2017) proposed to apply a near-optimal oracle. At each iteration, the oracle is called to return a near-optimal model for the given distribution. After that, the adversarial distribution over examples is updated according to the model from the oracle. With an $\alpha$-optimal oracle, the authors proved that the algorithm converges to an $\alpha$-optimal solution at the rate of $O(1/\sqrt{T})$, where $T$ is the number of iterations. The limitation is that even if a near-optimal oracle were accessible for the non-convex problem, the algorithm would be too expensive for real-world applications, because it has to enumerate the whole data set to update the parameters at each iteration. Without a near-optimal oracle, we prove that the proposed method converges at a rate of $O(\sqrt{\log(T)/T})$ with an appropriate regularizer, and the computational cost is much cheaper.

## Robust Optimization over Multiple Domains

Given $K$ domains, we denote the data sets as $\{S_1, \dots, S_K\}$. For the $k$-th domain, $S_k = \{x_i^k, y_i^k\}$, where $x_i^k$ is an example (e.g., an image) and $y_i^k$ is the corresponding label. We aim to learn a model that performs well over all domains. It can be cast as the robust optimization problem

$$\min_{W,\epsilon} \epsilon \quad \text{s.t.} \quad \forall k,\ f_k(W) \le \epsilon$$

where $W$ is the parameter of a prediction model, $f_k(\cdot)$ is the empirical risk of the $k$-th domain, i.e., $f_k(W) = \frac{1}{|S_k|}\sum_i \ell(x_i^k, y_i^k; W)$, and $\ell(\cdot)$ can be any non-negative loss function. Since the cross-entropy loss is popular in deep learning, we adopt it in the experiments. The problem is equivalent to the minimax problem

$$\min_W \max_{p \in \Delta} L(p, W) = p^\top f(W) \tag{1}$$

where $f(W) = [f_1(W), \dots, f_K(W)]^\top$ and $p$ is an adversarial distribution over the multiple domains from the simplex $\Delta = \{p \in \mathbb{R}^K \mid \sum_{k=1}^K p_k = 1;\ \forall k,\ p_k \ge 0\}$. It is a game between the prediction model and the adversarial distribution.

The minimax problem can be solved in an alternating manner, which applies gradient descent to learn the model and gradient ascent to update the adversarial distribution. Considering the large number of examples in each data set, we adopt SGD to obtain an unbiased estimate of the gradient at each iteration, which avoids enumerating the whole data set. Specifically, at the $t$-th iteration, a mini-batch of size $m$ is randomly sampled from each domain. The loss of the mini-batch from the $k$-th domain is

$$\hat f_k^t(W) = \frac{1}{m}\sum_{i=1}^m \ell(\hat x_{i:t}^k, \hat y_{i:t}^k; W)$$

It is apparent that $\mathbb{E}[\hat f_k^t(W)] = f_k(W)$ and $\mathbb{E}[\nabla \hat f_k^t(W)] = \nabla f_k(W)$.
Algorithm 1: Stochastic Algorithm for Robust Optimization

Input: data sets $\{S_1, \dots, S_K\}$, mini-batch size $m$, step-sizes $\eta_w$, $\eta_p$
Initialize $p_1 = [1/K, \dots, 1/K]^\top$
for $t = 1$ to $T$ do
&nbsp;&nbsp;Randomly sample $m$ examples from each domain
&nbsp;&nbsp;Update $W_{t+1}$ as in Eqn. (2)
&nbsp;&nbsp;Update $p_{t+1}$ as in Eqn. (3)
end for
return $\bar W = \frac{1}{T}\sum_t W_t$, $\bar p = \frac{1}{T}\sum_t p_t$

After sampling, we first update the model by gradient descent as

$$W_{t+1} = W_t - \eta_w \hat g_t; \quad \text{where} \quad \hat g_t = \sum_k p_k^t \nabla \hat f_k^t(W_t) \tag{2}$$

Then, the distribution $p$ is updated in an adversarial way. Since $p$ lies on the simplex, we can adopt the multiplicative updating criterion (Arora, Hazan, and Kale 2012) to update it as

$$p_{t+1}^k = \frac{p_t^k \exp(\eta_p \hat f_k^t(W_t))}{Z_t}; \quad \text{where} \quad Z_t = \sum_k p_t^k \exp(\eta_p \hat f_k^t(W_t)) \tag{3}$$

Alg. 1 summarizes the main steps of the approach.
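To make the alternating updates concrete, below is a minimal PyTorch-style sketch of Alg. 1; it is not the authors' released code. The model, the per-domain data loaders, and the step-size values are illustrative assumptions, and each loader is assumed to yield at least $T$ labeled mini-batches.

```python
import torch

def robust_train(model, domain_loaders, T, eta_w=0.01, eta_p=0.1):
    """Minimal sketch of Alg. 1: SGD on W (Eqn. 2), multiplicative weights on p (Eqn. 3)."""
    K = len(domain_loaders)
    p = torch.full((K,), 1.0 / K)          # p_1 = [1/K, ..., 1/K]
    optimizer = torch.optim.SGD(model.parameters(), lr=eta_w)
    criterion = torch.nn.CrossEntropyLoss()
    iters = [iter(loader) for loader in domain_loaders]
    for t in range(T):
        # one mini-batch of size m per domain gives the unbiased estimates f_k^t(W_t)
        batch_losses = []
        for k in range(K):
            x, y = next(iters[k])
            batch_losses.append(criterion(model(x), y))
        losses = torch.stack(batch_losses)
        optimizer.zero_grad()
        (p * losses).sum().backward()      # Eqn. (2): g_t = sum_k p_k * grad f_k^t
        optimizer.step()
        with torch.no_grad():              # Eqn. (3): multiplicative update of p
            w = p * torch.exp(eta_p * losses.detach())
            p = w / w.sum()                # normalize by Z_t
    return model, p
```

Note how cheap the adversarial step is: it reuses the $K$ scalar mini-batch losses already computed for the model update, so one iteration never touches more than $K \cdot m$ examples.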
For convex loss functions, the convergence rate is well known (Nemirovski et al. 2009), and we provide a high-probability bound for completeness. All detailed proofs of this work can be found in the supplementary.

Lemma 1. Assume the gradients and function values are bounded as $\forall t,\ \|\nabla \hat f_k^t(W_t)\|_F \le \sigma$, $\|\hat f^t(W_t)\|_2 \le \gamma$, and $\forall W,\ \|W\|_F \le R$. Let $(\bar W, \bar p)$ denote the results returned by Alg. 1 after $T$ iterations, and set the step-sizes as $\eta_w = \frac{R}{\sigma\sqrt{T}}$ and $\eta_p = \frac{\sqrt{\log(K)}}{\gamma\sqrt{T}}$. Then, with probability $1 - \delta$, we have

$$\max_p L(p, \bar W) - \min_W L(\bar p, W) \le \frac{c_1 + c_2\sqrt{\log(1/\delta)}}{\sqrt{T}}$$

where $c_1 = O(\sqrt{\log(K)})$ and $c_2$ is a constant.

Lemma 1 shows that the proposed method with a convex loss converges to the saddle point at the rate of $O(1/\sqrt{T})$ with high probability, which is a stronger result than the expectation bound in (Namkoong and Duchi 2016). Note that setting $\eta_w = O(1/\sqrt{T})$ and $\eta_p = O(1/\sqrt{T})$ will not change the order of the convergence rate, which means $\sigma$, $\gamma$ and $R$ are not required for implementation.

### Non-convexity

Despite the extensive studies on convex losses, there is little research on the minimax problem with non-convex losses. To provide the convergence rate for the non-convex problem, we first have the following lemma.

Lemma 2. With the same assumptions as in Lemma 1, if $\ell(\cdot)$ is non-convex but $L$-smooth, we have

$$\sum_t \mathbb{E}\big[\|\nabla_{W_t} L(p_t, W_t)\|_F^2\big] \le \frac{L(p_0, W_0)}{\eta_w} + \frac{\eta_p T\gamma^2}{2\eta_w} + TL\eta_w\sigma^2$$

$$\sum_t \mathbb{E}[L(p_t, W_t)] \ge \max_p \sum_t \mathbb{E}[L(p, W_t)] - \Big(\frac{\log(K)}{\eta_p} + \frac{T\eta_p\gamma^2}{2}\Big)$$

Since the loss is non-convex, the convergence is measured by the norm of the gradient (i.e., convergence to a stationary point), which is a standard criterion for analysis in non-convex problems (Ghadimi and Lan 2013). Lemma 2 indicates that $W$ can converge to a stationary point at which $p_t$ is a qualified adversary by setting the step-sizes elaborately. Furthermore, it demonstrates that the convergence rate of $W$ is influenced by the convergence rate of $p$ via $\eta_p$. With Lemma 2, we have the convergence analysis of the non-convex minimax problem as follows.

Theorem 1. With the same assumptions as in Lemma 2, if we set the step-sizes as $\eta_w = \frac{1}{LT^{1/3}}$ and $\eta_p = \frac{\sqrt{2\log(K)}}{\gamma T^{2/3}}$, we have

$$\frac{1}{T}\sum_t \mathbb{E}\big[\|\nabla_{W_t} L(p_t, W_t)\|_F^2\big] \le \Big(L\,L(p_0, W_0) + \frac{L\gamma\sqrt{2\log(K)}}{2} + \sigma^2\Big)\frac{1}{T^{1/3}}$$

$$\mathbb{E}\Big[\frac{1}{T}\sum_t L(p_t, W_t)\Big] \ge \mathbb{E}\Big[\max_p \frac{1}{T}\sum_t L(p, W_t)\Big] - \frac{\gamma\sqrt{2\log(K)}}{T^{1/3}}$$

Remark. Compared with the convex case in Lemma 1, the convergence rate for the non-convex problem degrades from $O(1/\sqrt{T})$ to $O(1/T^{1/3})$. It is well known that the convergence rate of general minimization problems with a smooth non-convex loss can be up to $O(1/\sqrt{T})$ (Ghadimi and Lan 2013). Our results further demonstrate that minimax problems with non-convex losses are usually harder than non-convex minimization problems.

Different step-sizes lead to different convergence rates. For example, if the step-size for updating $p$ is increased to $\eta_p = 1/\sqrt{T}$ and that for the model is decreased to $\eta_w = 1/T^{1/4}$, the convergence rate of $p$ accelerates to $O(1/\sqrt{T})$ while the convergence rate of $W$ degenerates to $O(1/T^{1/4})$. Therefore, if a sufficiently small step-size is applicable for $p$, the convergence rate of $W$ can be significantly improved. We exploit this observation to enhance the convergence rate in the next subsection.

### Regularized Non-convex Optimization

A critical problem in minimax optimization is that the formulation is very sensitive to outliers. For example, if there is a domain with significantly worse performance than the others, it will dominate the learning procedure according to Eqn. (1) (i.e., $p$ collapses to a one-hot vector). Besides the issue of robustness, it is prevalent in real-world applications that domains differ in importance according to their budgets, popularity, etc. Incorporating such side information into the formulation is essential for success in practice. Given a prior distribution, the problem can be written as

$$\min_W \max_{p\in\Delta} p^\top f(W) \quad \text{s.t.} \quad D(p\|q) \le \tau$$

where $q$ is the prior distribution, which can be a distribution defined from the side information or a uniform distribution for robustness. $D(\cdot\|\cdot)$ defines the distance between two distributions, e.g., the $L_2$ distance or the KL-divergence:

$$D_{L_2}(p\|q) = \|p - q\|_2^2; \qquad D_{KL}(p\|q) = \sum_k p_k \log(p_k/q_k)$$

Since the KL-divergence cannot handle a prior distribution with zero elements, the optimal transport (OT) distance has recently become popular to overcome this drawback:

$$D_{OT}(p\|q) = \min_{P \in U(p,q)} \langle P, M\rangle$$

For computational efficiency, we use the version with an entropy regularizer (Cuturi 2013) and we have

Proposition 1. Define the OT regularizer as

$$D_{OT}(p\|q) = \max_{\alpha,\beta}\min_P \sum_{i,j}\Big(\frac{1}{\nu} P_{i,j}\log(P_{i,j}) + P_{i,j}M_{i,j}\Big) + \alpha^\top(P\mathbf{1}_K - p) + \beta^\top(P^\top\mathbf{1}_K - q) \tag{4}$$

and it is convex in $p$.

According to duality theory (Boyd and Vandenberghe 2004), for each $\tau$ we can obtain the equivalent problem with a specified $\lambda$

$$\min_W \max_{p\in\Delta} \hat L(p, W) = p^\top f(W) - \frac{\lambda}{2}D(p\|q) \tag{5}$$

Compared with the formulation in Eqn. (1), we introduce a regularizer for the adversarial distribution. If $D(p\|q)$ is convex in $p$, a convergence rate similar to Theorem 1 can be obtained with the same analysis. Moreover, according to the research on SGD, strong convexity is the key to achieving the optimal convergence rate (Rakhlin, Shamir, and Sridharan 2012). Hence, we adopt a strongly convex regularizer, i.e., the $L_2$ regularizer, for the distribution. The convergence rate for other strongly convex regularizers can be obtained with a similar analysis by defining the smoothness and the strong convexity with the corresponding norm.

Equipped with the $L_2$ regularizer, the problem in Eqn. (5) can be solved with a projected first-order algorithm. We adopt projected gradient ascent to update the adversarial distribution as

$$p_{t+1} = P_\Delta(p_t + \eta_p^t \hat h_t); \quad \text{where} \quad \hat h_t = \hat f_t - \lambda(p_t - q)$$

$P_\Delta(p)$ projects the vector $p$ onto the simplex; the projection algorithm, which is based on the KKT conditions, can be found in (Duchi et al. 2008). We also provide the gradient of the OT regularizer in the supplementary.
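To make the projected ascent step concrete, here is a minimal NumPy sketch of the Euclidean projection onto the simplex (in the spirit of Duchi et al. 2008) together with one regularized update; the function names and the assumption that the mini-batch losses `f_hat` are precomputed are our own illustrative choices.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (Duchi et al. 2008)."""
    u = np.sort(v)[::-1]                     # sort in decreasing order
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)           # threshold from the KKT conditions
    return np.maximum(v - theta, 0.0)

def update_p(p, f_hat, q, lam, eta):
    """One projected gradient-ascent step on the L2-regularized objective (Eqn. 5)."""
    h = f_hat - lam * (p - q)                # h_t = f_t - lambda (p_t - q)
    return project_simplex(p + eta * h)
```

With a uniform $q$ and a large $\lambda$, the update keeps $p$ near uniform; as $\lambda \to 0$ it approaches the unregularized adversarial ascent, which illustrates the role of the regularizer discussed above.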
Since the regularizer (i.e., $L_2$) is strongly concave in $p$, the convergence of $p$ can be accelerated dramatically, which leads to a better convergence rate for the minimax problem. The theoretical result is as follows.

Theorem 2. With the same assumptions as in Theorem 1, if we further assume $\forall t,\ \|\hat h_t\|_2 \le \mu$ and set the step-sizes as $\eta_w = \sqrt{\frac{\lambda}{LT}}$ and $\eta_p^t = \frac{1}{\lambda t}$, we have

$$\frac{1}{T}\sum_t \mathbb{E}\big[\|\nabla_{W_t}\hat L(p_t, W_t)\|_F^2\big] \le \Big(L(p_0, W_0) + \frac{\mu\pi^2\sigma}{12} + 2\mu\sigma\Big)\sqrt{\frac{L\log(T)}{\lambda T}}$$

$$\mathbb{E}\Big[\frac{1}{T}\sum_t \hat L(p_t, W_t)\Big] \ge \mathbb{E}\Big[\max_p \frac{1}{T}\sum_t \hat L(p, W_t)\Big] - \frac{\mu^2(1+\log(T))}{2\lambda T}$$

Remark. With the strongly concave regularizer, it is not surprising to obtain the $O(\log(T)/T)$ convergence rate for $p$. As discussed after Lemma 2, a faster convergence rate for $p$ improves that of $W$: in Theorem 2, the convergence rate of $W$ is improved from $O(1/T^{1/3})$ to $O(\sqrt{\log(T)/T})$. This shows that the applied regularizer not only improves the robustness of the proposed framework but also accelerates the learning procedure. Moreover, the step-size for the adversarial distribution provides a trade-off between the bias and the variance of the gradient; therefore, the convergence rate can be further improved by reducing the variance. We shrink the gradient with a factor $c$ and update the distribution as

$$p_{t+1} = P_\Delta\Big(p_t + \frac{\eta_p^t}{1 + c/t}\hat h_t\Big)$$

When taking $\eta_p^t = \frac{1}{\lambda t}$, the update becomes

$$p_{t+1} = P_\Delta\Big(p_t + \frac{1}{\lambda(t+c)}\hat h_t\Big) \tag{6}$$

With an analysis similar to Theorem 2, we have

Theorem 3. With the same assumptions as in Theorem 2, if we set the step-size $\eta_p^t = \frac{1}{\lambda(t+c)}$, we have

$$\mathbb{E}\Big[\frac{1}{T}\sum_t \hat L(p_t, W_t)\Big] \ge \mathbb{E}\Big[\max_p \frac{1}{T}\sum_t \hat L(p, W_t)\Big] - \frac{1}{T}\Big(\lambda c + \frac{\mu^2}{\lambda(c+1)}\Big) - \frac{\mu^2\log(T)}{\lambda T}$$

It shows that the constant $c$ controls the trade-off between the bias (i.e., $\lambda c$) and the variance (i.e., $\mu^2/(\lambda(c+1))$). By setting the constant appropriately, we have the following corollary.

Corollary 1. When setting $c = \mu/\lambda - 1$, the RHS in Theorem 3 is maximized.

The optimality follows from the fact that the RHS is concave in $c$; a detailed discussion can be found in the supplementary. The algorithm for robust optimization with the regularizer is summarized in Alg. 2.

Algorithm 2: Stochastic Regularized Robust Optimization

Input: data sets $\{S_1, \dots, S_K\}$, mini-batch size $m$, step-sizes $\eta_w$, $\eta_p$
Initialize $p_1 = [1/K, \dots, 1/K]^\top$
Compute the constant $c$ as in Corollary 1
for $t = 1$ to $T$ do
&nbsp;&nbsp;Randomly sample $m$ examples from each domain
&nbsp;&nbsp;Update $W_{t+1}$ with gradient descent
&nbsp;&nbsp;(Optional) Solve the problem in Eqn. (4) if applying $D_{OT}(p_t\|q)$
&nbsp;&nbsp;Update $p_{t+1}$ with gradient ascent
&nbsp;&nbsp;Project $p_{t+1}$ onto the simplex
end for

### Trade Efficiency for Convergence

In this subsection, we study whether we can recover the optimal convergence rate for the general non-convex problem as in (Ghadimi and Lan 2013). Note that (Chen et al. 2017) applies a near-optimal oracle to achieve the $O(1/\sqrt{T})$ convergence rate. Given a distribution, it is hard to obtain such an oracle for the non-convex model. In contrast, obtaining a near-optimal adversarial distribution with a fixed model is feasible. For the original problem in Eqn. (1), the solution is trivial: return the index of the domain with the largest empirical loss. For the problem with the regularizer in Eqn. (5), a near-optimal $p$ can be obtained efficiently by any first-order method (Boyd and Vandenberghe 2004). Therefore, we can change the updating criterion for the distribution at the $t$-th iteration to

$$\text{obtain } p_{t+1} \text{ such that } \|p_{t+1} - p_{t+1}^*\|_1 \le \xi_{t+1}, \quad \text{where } p_{t+1}^* = \arg\max_{p\in\Delta} L(p, W_t) \tag{7}$$

With the new updating criterion, and letting $F(W) = \max_{p\in\Delta} L(p, W)$, we have a better convergence rate as follows.

Theorem 4. With the same assumptions as in Theorem 1, if we update $p$ as in Eqn. (7) with $\xi_t = \frac{1}{t}$ and set the step-size as $\eta_w = \frac{1}{L\sqrt{T}}$, we have

$$\frac{1}{T}\sum_t \mathbb{E}\big[\|\nabla F(W_t)\|_F^2\big] \le O\Big(\frac{F(W_0) + 1}{\sqrt{T}}\Big)$$

For the problem in Eqn. (1), $\xi_t$ can be 0 with a single pass through the whole data set. This shows that with an expensive but feasible operation as in Eqn. (7), the proposed method recovers the optimal convergence rate for the non-convex problem.
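For Eqn. (1) the oracle step in Eqn. (7) is easy to state in code; the sketch below is ours, under the assumption that the full per-domain losses $f_k(W_t)$ have already been computed, which is precisely the expensive full-pass part.

```python
import numpy as np

def exact_adversary(full_losses):
    """Exact maximizer p* of Eqn. (1) for a fixed W: one-hot on the worst domain.

    full_losses[k] must be f_k(W_t), the loss over ALL of domain k, so this
    update costs one complete pass over every data set.
    """
    p = np.zeros(len(full_losses))
    p[int(np.argmax(full_losses))] = 1.0
    return p
```

With this exact $p$ (i.e., $\xi_t = 0$), the update satisfies Eqn. (7) and yields the $O(1/\sqrt{T})$ rate of Theorem 4, at the price of trading per-iteration efficiency for convergence.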
## Experiments

We conduct experiments on training deep neural networks over multiple domains. The methods in the comparison are summarized as follows.

- Individual: learns the model from an individual domain.
- Mixture Even: learns the model from multiple domains with even weights, which is equivalent to fixing $p$ as a uniform distribution.
- Mixture Opt: implements the approach proposed in Alg. 2, which learns the model and the adversarial distribution over multiple domains simultaneously.

We adopt the popular cross-entropy loss as the loss function $\ell(\cdot)$ in this work. Deep models are trained with SGD and the size of each mini-batch is set to 200. For the methods learning with multiple domains, the number of examples from each domain in a mini-batch is the same, i.e., $m = 200/K$. Compared with a strategy that samples examples according to the learned distribution, this strategy is deterministic and does not introduce extra noise. The methods are evaluated by the worst-case performance among multiple domains: the worst-case accuracy is defined as $\text{Acc}_w = \min_k\{\text{Acc}_1, \dots, \text{Acc}_K\}$ and the worst-case loss as $f_w(W) = \max_k\{f_1(W), \dots, f_K(W)\}$. All experiments are implemented on an NVIDIA Tesla P100 GPU.
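As a concrete reading of this protocol, here is a minimal Python sketch of the deterministic balanced batching and the worst-case metrics; all names and the toy values are our own illustrations, not from the paper's code.

```python
def balanced_batch(domain_batches):
    """Build one mini-batch from m = 200 / K examples per domain, so every
    update sees all K domains deterministically (no sampling noise from p)."""
    xs, ys = zip(*domain_batches)  # one (examples, labels) pair per domain
    return ([x for batch in xs for x in batch],
            [y for batch in ys for y in batch])

def worst_case_metrics(per_domain_acc, per_domain_loss):
    """Acc_w = min_k Acc_k and f_w(W) = max_k f_k(W)."""
    return min(per_domain_acc), max(per_domain_loss)

# Example with K = 2 domains and m = 2 examples each (toy values):
xb, yb = balanced_batch([(["img1", "img2"], [0, 1]), (["img3", "img4"], [2, 3])])
acc_w, loss_w = worst_case_metrics([0.97, 0.92], [0.03, 0.11])  # -> 0.92, 0.11
```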
### Pets Categorization

First, we compare the methods on a fine-grained visual categorization task. Given the data sets of VGG cats&dogs (Parkhi et al. 2012) and ImageNet (Russakovsky et al. 2015), we extract the labels shared between them and then generate subsets with the desired labels from each. The resulting data set consists of 24 classes, and the task is to assign an image of a pet to one of these classes. For ImageNet, each class contains about 1,200 training images while that of VGG only has 100 images. Therefore, we apply data augmentation by flipping (horizontal+vertical) and rotating ({45°, ..., 315°}) the VGG images to avoid overfitting; after that, the number of images in VGG is similar to that of ImageNet. Some exemplar images from these data sets are illustrated in Fig. 3. We can find that the task in ImageNet is more challenging than that in VGG due to complex backgrounds.

We adopt ResNet18 (He et al. 2016) as the base model in this experiment. It is initialized with the parameters learned from ILSVRC2012 (Russakovsky et al. 2015) and we set the learning rate to $\eta_w = 0.005$ for fine-tuning. Considering the small size of the data sets, we also include the method of (Chen et al. 2017) in the comparison, denoted as Mixture Oracle. Since the near-optimal oracle is infeasible for Mixture Oracle, we apply the model obtained after 100 SGD iterations instead, as suggested in (Chen et al. 2017). The prior distribution in the regularizer is set to the uniform distribution.

Fig. 2 summarizes the worst-case training loss among multiple domains for the methods in the comparison. Since the performance of models learned from multiple domains is significantly better than those learned from an individual set, we illustrate the results in separate figures. Fig. 2 (a) compares the proposed method to those trained on an individual data set. It is evident that the proposed method has superior performance and that learning with an individual domain cannot handle the data from the other domains well. Fig. 2 (b) shows the results of the methods learning with multiple data sets.

Figure 2: Illustration of worst-case training loss (x-axis: number of mini-batches; y-axis: worst-case training loss). (a) Pets categorization: Individual ImageNet, Individual VGG, Mixture Opt. (b) Pets categorization: Mixture Even, Mixture Oracle, Mixture Opt. (c) Digits recognition: Individual MNIST, Individual SVHN, Mixture Opt. (d) Digits recognition: Mixture Even, Mixture Opt.

Table 2: Comparison on pets categorization. We report the training loss and accuracy (%) on each data set.

| Methods | ImageNet Loss (Tr) | ImageNet Acc (Tr) | ImageNet Acc (Te) | VGG Loss (Tr) | VGG Acc (Tr) | VGG Acc (Te) | Acc_w (Tr) | Acc_w (Te) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Individual ImageNet | 0.07 | 98.95 | 89.92 | 0.85 | 74.56 | 80.44 | 74.56 | 80.44 |
| Individual VGG | 0.90 | 75.47 | 77.92 | 0.02 | 100.00 | 86.85 | 75.47 | 77.92 |
| Mixture Even | 0.17 | 95.56 | 88.50 | 0.05 | 99.58 | 89.85 | 95.56 | 88.50 |
| Mixture Oracle | 0.15 | 96.04 | 88.92 | 0.06 | 99.41 | 89.99 | 96.04 | 88.92 |
| Mixture Opt | 0.12 | 97.36 | 89.42 | 0.11 | 97.72 | 89.35 | 97.36 | 89.35 |

Figure 3: Exemplar images from ImageNet and VGG.

Figure 4: Comparison of discrepancy in losses (Mixture Even, Mixture Oracle, Mixture Opt; x-axis: number of mini-batches; y-axis: difference in training loss).

Figure 5: Comparison of running time in seconds (Mixture Even, Mixture Opt, Mixture OT, Mixture Oracle).

Figure 6: Illustration of the best and worst training loss on ImageNet with Gaussian noise $\mathcal{N}(0, \sigma^2)$ (Mixture Even-Best/Worst, Mixture Opt-Best/Worst). (a) $\sigma \in \{0, 4, 8, 12\}$; (b) $\sigma \in \{0, 10, 20, 30\}$.

First, we find that both Mixture Oracle and Mixture Opt achieve a lower worst-case loss than Mixture Even, which confirms the effectiveness of the robust optimization. Second, Mixture Opt performs best among all of these methods, which demonstrates that the proposed method can optimize the performance over the adversarial distribution.

To investigate the discrepancy between the performances on the two domains, we illustrate the result in Fig. 4. The discrepancy is measured by the difference between the empirical losses, $f_{\text{ImageNet}} - f_{\text{VGG}}$. We can find that $f_{\text{ImageNet}}$ is smaller than $f_{\text{VGG}}$ at the beginning, because the model is initialized with parameters pre-trained on ImageNet; however, the task in VGG is easier than that in ImageNet, so $f_{\text{VGG}}$ drops faster after a few iterations. Compared with the benchmark methods, the discrepancy of the proposed method is an order of magnitude smaller throughout the learning procedure. It verifies the robustness of Mixture Opt and also shows that the proposed method can handle the drifting between multiple domains well. Finally, to compare the performance explicitly, we include the detailed results in Table 2.
Compared with Mixture Even, we observe that Mixture Opt pays more attention to ImageNet than to VGG and trades the performance between them. To further demonstrate that Mixture Opt can trade the performance effectively, we conduct experiments with noisy data. We simulate individual domains by adding random Gaussian noise from $\mathcal{N}(0, \sigma^2)$ to each pixel of the images from ImageNet pets. We vary the variance to generate different domains and obtain two tasks, each with four domains: $\sigma \in \{0, 4, 8, 12\}$ and $\sigma \in \{0, 10, 20, 30\}$, respectively. Fig. 6 compares the gap between the best and the worst performance on different domains for Mixture Even and Mixture Opt. First, we can find that the proposed method improves the worst-case performance significantly while keeping the best performance almost the same. Besides, all domains achieve similar performance on the simple task with variances in $\{0, 4, 8, 12\}$. For the hard task that includes an extreme domain with noise from $\mathcal{N}(0, 30^2)$, the best performance is not sacrificed much due to the appropriate regularizer in Mixture Opt.

After the comparison of performance, we illustrate the influence of the parameter $\lambda$ in Fig. 7. The parameter appears in Eqn. (5) and constrains the distance of the adversarial distribution to the prior distribution. Besides the $L_2$ regularizer applied in Mixture Opt, we also include the results of the OT regularizer defined in Proposition 1; this method is denoted as Mixture OT.

Figure 7: Illustration of the influence of the regularizer. (a) Difference in training loss and (b) difference in distribution for $D_{L_2}(p\|q)$ with $\lambda \in \{0.1, 0.05, 0.01\}$; (c) difference in training loss and (d) difference in distribution for $D_{OT}(p\|q)$ with $\lambda \in \{0.001, 0.0005, 0.0001\}$.

Table 3: Comparison on digits recognition. We report the training loss and accuracy (%) on each data set.

| Methods | MNIST Loss (Tr) | MNIST Acc (Tr) | MNIST Acc (Te) | SVHN Loss (Tr) | SVHN Acc (Tr) | SVHN Acc (Te) | Acc_w (Tr) | Acc_w (Te) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Individual MNIST | 0.001 | 100.00 | 98.81 | 4.01 | 30.80 | 29.58 | 30.80 | 29.58 |
| Individual SVHN | 1.91 | 66.66 | 68.25 | 0.10 | 97.11 | 91.84 | 66.66 | 68.25 |
| Mixture Even | 0.001 | 100.00 | 98.74 | 0.14 | 96.20 | 91.33 | 96.20 | 91.33 |
| Mixture Opt | 0.03 | 99.03 | 98.13 | 0.11 | 97.05 | 92.14 | 97.05 | 92.14 |

Fig. 7 (a) and (c) compare the discrepancy between the losses as in the previous experiments. It is obvious that the smaller the $\lambda$, the smaller the gap between the two domains. Fig. 7 (b) and (d) summarize the drift in the distribution, which is defined as $p_{\text{ImageNet}} - p_{\text{VGG}}$. Evidently, the learned adversarial distribution switches adaptively according to the performance of the current model, and the importance of multiple domains can be constrained well by setting $\lambda$ appropriately.

Finally, we compare the running time in Fig. 5. Due to the lightweight update for the adversarial distribution, Mixture Opt and Mixture OT have almost the same running time as Mixture Even. Mixture Oracle has to enumerate the whole data set after every 100 SGD iterations to update the current distribution; hence, its running time with only 50 complete iterations is nearly 3 times that of the proposed method with 5,000 iterations on these small data sets.
### Digits Recognition

In this experiment, we examine the methods on the task of digits recognition, which is to identify the 10 digits (i.e., 0-9) from images. There are two benchmark data sets for this task: MNIST and SVHN. MNIST (LeCun et al. 1998) is collected for recognizing handwritten digits; it contains 60,000 images for training and 10,000 images for test. SVHN (Netzer et al. 2011) is for identifying house numbers from Google Street View images, and it consists of 604,388 training images and 26,032 test images. Note that the examples in MNIST are 28x28 gray-scale images while those in SVHN are 32x32 color images. To make the format consistent, we resize the images in MNIST to 32x32 and repeat the gray channel over the RGB channels to generate color images. Since the task is more straightforward than pets categorization, we apply AlexNet (Krizhevsky, Sutskever, and Hinton 2012) as the base model in this experiment and set the learning rate to $\eta_w = 0.01$. With a different deep model, we also demonstrate that the proposed framework can be combined with various deep architectures.

Fig. 2 (c) and (d) show the comparison of the worst-case training loss, and Table 3 summarizes the detailed results. We observe conclusions similar to the experiments on pets categorization: Mixture Even can achieve good performance on these simple domains, while the proposed method further improves the worst-case performance and provides a more reliable model for multiple domains.

## Conclusion

In this work, we propose a framework to learn a robust model over multiple domains, which is essential for cloud computing services. The introduced algorithm learns the model and the adversarial distribution simultaneously, for which we provide a theoretical guarantee on the convergence rate. The empirical study on real-world applications confirms that the proposed method can obtain a robust non-convex model. In the future, we plan to examine the performance of the method in more applications. Besides, extending the framework to multiple domains with partially overlapping labels is also important for real-world applications.

## Acknowledgments

We would like to thank Dr. Juhua Hu from the University of Washington Tacoma and the anonymous reviewers for their valuable suggestions that helped to improve this work.

## References

Arora, S.; Hazan, E.; and Kale, S. 2012. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing 8(1):121-164.

Bertsimas, D.; Brown, D. B.; and Caramanis, C. 2011. Theory and applications of robust optimization. SIAM Review 53(3):464-501.

Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Chen, R. S.; Lucier, B.; Singer, Y.; and Syrgkanis, V. 2017. Robust optimization for non-convex objectives. In NIPS, 4708-4717.

Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2292-2300.

Duchi, J. C.; Shalev-Shwartz, S.; Singer, Y.; and Chandra, T. 2008. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, 272-279.

Duchi, J. C.; Glynn, P.; and Namkoong, H. 2016. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv e-prints.

Ghadimi, S., and Lan, G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4):2341-2368.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770-778.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1106-1114.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278-2324.

Namkoong, H., and Duchi, J. C. 2016. Stochastic gradient methods for distributionally robust optimization with f-divergences. In NIPS, 2208-2216.

Nemirovski, A.; Juditsky, A.; Lan, G.; and Shapiro, A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4):1574-1609.

Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.

Parkhi, O. M.; Vedaldi, A.; Zisserman, A.; and Jawahar, C. V. 2012. Cats and dogs. In CVPR.

Rakhlin, A.; Shamir, O.; and Sridharan, K. 2012. Making gradient descent optimal for strongly convex stochastic optimization. In ICML.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. IJCV 115(3):211-252.

Shalev-Shwartz, S., and Wexler, Y. 2016. Minimizing the maximal loss: How and why. In ICML, 793-801.

Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S. E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In CVPR, 1-9.

Zhang, Y., and Yang, Q. 2017. A survey on multi-task learning. CoRR abs/1707.08114.