Published as a conference paper at ICLR 2022

# DOMAIN ADVERSARIAL TRAINING: A GAME PERSPECTIVE

David Acuna¹²³, Marc T. Law², Guojun Zhang³⁴, Sanja Fidler¹²³
¹University of Toronto  ²NVIDIA  ³Vector Institute  ⁴University of Waterloo

ABSTRACT

The dominant line of work in domain adaptation has focused on learning invariant representations using domain-adversarial training. In this paper, we interpret this approach from a game theoretical perspective. Defining optimal solutions in domain-adversarial training as local Nash equilibria, we show that gradient descent in domain-adversarial training can violate the asymptotic convergence guarantees of the optimizer, oftentimes hindering the transfer performance. Our analysis leads us to replace gradient descent with high-order ODE solvers (i.e., Runge-Kutta), for which we derive asymptotic convergence guarantees. This family of optimizers is significantly more stable and allows more aggressive learning rates, leading to high performance gains when used as a drop-in replacement over standard optimizers. Our experiments show that in conjunction with state-of-the-art domain-adversarial methods, we achieve up to 3.5% improvement with less than half of the training iterations. Our optimizers are easy to implement, free of additional parameters, and can be plugged into any domain-adversarial framework.

1 INTRODUCTION

Unsupervised domain adaptation (UDA) deals with the lack of labeled data in a target domain by transferring knowledge from a labeled source domain (i.e., a related dataset with a different distribution where abundant labeled data already exists). The paramount importance of this paradigm has led to remarkable advances in the field in terms of both theory and algorithms (Ben-David et al., 2007; 2010a;b; Mansour et al., 2009). Several state-of-the-art algorithms tackle UDA by learning domain-invariant representations in an adversarial fashion (Shu et al., 2018; Long et al., 2018; Saito et al., 2018; Hoffman et al., 2018; Zhang et al., 2019; Acuna et al., 2021). Their goal is to fool an auxiliary classifier that operates in a representation space and aims to classify whether the datapoint belongs to either the source or the target domain. This idea, called Domain-Adversarial Learning (DAL), was introduced by Ganin et al. (2016) and can be more formally understood as minimizing the discrepancy between source and target domain in a representation space (Acuna et al., 2021).

Despite DAL being a dominant approach for UDA, alternative solutions have been sought as DAL is noticeably unstable and difficult to train in practice (Sener et al., 2016; Sun et al., 2019; Chang et al., 2019). One major cause of instability is the adversarial nature of the learning algorithm, which results from the introduction of the Gradient Reversal Layer (GRL, Ganin et al., 2016) (Figure 1). The GRL flips the sign of the gradient during the backward pass, which has profound implications on the training dynamics and asymptotic behavior of the learning algorithm. Indeed, the GRL transforms gradient descent into a competitive gradient-based algorithm which may converge to periodic orbits and other non-trivial limiting behavior that arises, for instance, in chaotic systems (Mazumdar et al., 2020). Surprisingly, little attention has been paid to this fact, and specifically to the adversarial component and interaction among the three different networks in the algorithm.
In particular, three fundamental questions have not been answered from an algorithmic point of view: 1) What is optimality in DAL? 2) What makes DAL difficult to train? 3) How can we mitigate this problem? In this work, we aim to answer these questions by interpreting the DAL framework through the lens of game theory. Specifically, we use tools developed by the game theoretical community in Başar & Olsder (1998); Letcher et al. (2019); Mazumdar et al. (2020) and draw inspiration from the existing two-player zero-sum game interpretations of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). We emphasize that in DAL, however, we have three rather than two networks interacting with each other, with partial cooperation and competition. We propose a natural three-player game interpretation for DAL, which we coin the Domain-Adversarial Game, and which is not necessarily equivalent to two-player zero-sum game interpretations (see Example 1). We also propose to interpret and characterize optimal solutions in DAL as local Nash Equilibria (see Section 3). This characterization introduces a proper mathematical definition of algorithmic optimality for DAL. It also provides sufficient conditions for optimality that drive the algorithmic analysis.

With our proposed game perspective in mind, a simple optimization solution would be to use the Gradient Descent (GD) algorithm, which is the de facto solution but known to be unstable. Alternatively, we could also use other popular gradient-based optimizers proposed in the context of differentiable games (e.g., Korpelevich, 1976; Mescheder et al., 2017). However, we notice that these do not outperform GD in practice (see Section 6). To understand why, we analyze the asymptotic behavior of gradient-based algorithms in the proposed domain-adversarial game (Section 4). The main result of Section 4.2 (Theorem 2) shows that GD with the GRL (i.e., the existing solution for DAL) violates the asymptotic convergence guarantees to a local NE unless an upper bound is placed on the learning rate, which may explain its training instability and sensitivity to optimizer parameters. In Section 4.3, Appendix B.2 and Appendix E, we also provide a similar analysis for the popular game optimization algorithms mentioned above. We emphasize, however, that while some of our results may be of independent interest for learning in general games, our focus is DAL. Sections 4.3 and 6 show both theoretically and experimentally that the limitations mentioned above disappear if standard optimizers are replaced with ODE solvers of at least second order. These are straightforward to implement as drop-in replacements to existing optimizers. They also lead to more stable algorithms, allow for more aggressive learning rates and provide notable performance gains.

2 PRELIMINARIES

Figure 1: We study domain-adversarial training from a game perspective. In DAL (Ganin et al., 2016), three networks interact with each other: the feature extractor ($g$), the domain classifier ($\hat h'$) and the classifier ($\hat h$). During backpropagation, the GRL ($R_\lambda$) flips the sign of the gradient with respect to $g$.

We focus on the UDA scenario and follow the formulation from Acuna et al. (2021). This makes our analysis general and applicable to most state-of-the-art DAL algorithms (e.g., Ganin et al. (2016); Saito et al. (2018); Zhang et al. (2019)).
We assume that the learner has access to a source dataset (S) with labeled examples and a target dataset (T) with unlabeled examples, where the source inputs $x_i^s$ are sampled i.i.d. from a (source) distribution $P_s$ and the target inputs $x_i^t$ are sampled i.i.d. from a (target) distribution $P_t$, both over $\mathcal X$. We have $\mathcal Y = \{0, 1\}$ for binary classification, and $\mathcal Y = \{1, \dots, k\}$ in the multiclass case. The risk of a hypothesis $h : \mathcal X \to \mathcal Y$ w.r.t. the labeling function $f$, using a loss function $\ell : \mathcal Y \times \mathcal Y \to \mathbb R_+$ under distribution $\mathcal D$, is defined as $R^\ell_{\mathcal D}(h, f) := \mathbb E_{\mathcal D}[\ell(h(x), f(x))]$. For simplicity, we define $R^\ell_S(h) := R^\ell_{P_s}(h, f_s)$ and $R^\ell_T(h) := R^\ell_{P_t}(h, f_t)$. The hypothesis class of $h$ is denoted by $\mathcal H$. UDA aims to minimize the risk in the target domain while only having access to labeled data in the source domain. This risk is upper bounded in terms of the risk of the source domain, the discrepancy between the two distributions and the joint hypothesis error $\lambda^*$:

Theorem 1 (Acuna et al. (2021)). Let $\ell : \mathcal Y \times \mathcal Y \to [0, 1]$, $\lambda^* := \min_{h \in \mathcal H} R^\ell_S(h) + R^\ell_T(h)$, and $D^{\phi}_{h,\mathcal H}(P_s \| P_t) := \sup_{h' \in \mathcal H} \big| \mathbb E_{x \sim P_s}[\ell(h(x), h'(x))] - \mathbb E_{x \sim P_t}[\phi^*(\ell(h(x), h'(x)))] \big|$. We have:

$$R^\ell_T(h) \leq R^\ell_S(h) + D^{\phi}_{h,\mathcal H}(P_s \| P_t) + \lambda^*. \quad (1)$$

The function $\phi : \mathbb R_+ \to \mathbb R$ defines a particular f-divergence and $\phi^*$ is its (Fenchel) conjugate. As is typical in UDA, we assume that the hypothesis class is complex enough and that $f_s$ and $f_t$ are similar enough that the non-estimable term ($\lambda^*$) is negligible and can be ignored.

Domain-Adversarial Training (see Figure 1) aims to find a hypothesis $h \in \mathcal H$ that jointly minimizes the first two terms of Theorem 1. To this end, the hypothesis $h$ is interpreted as the composition $h = \hat h \circ g$ with $g : \mathcal X \to \mathcal Z$ and $\hat h : \mathcal Z \to \mathcal Y$. Another function class $\hat{\mathcal H}$ is then defined to formulate $\mathcal H := \{\hat h \circ g : \hat h \in \hat{\mathcal H},\, g \in \mathcal G\}$. The algorithm tries to find the function $g \in \mathcal G$ such that $\hat h \circ g$ minimizes the risk of the source domain (i.e., the first term in Theorem 1), and such that its compositions with $\hat h$ and $\hat h'$ minimize the divergence between the two distributions (i.e., the second term in Theorem 1).

Algorithmically, the divergence in Theorem 1 is estimated by a so-called domain classifier $\hat h' \in \hat{\mathcal H}$ whose role is to detect whether the datapoint $g(x_i) \in \mathcal Z$ belongs to the source or to the target domain. When there does not exist a function $\hat h' \in \hat{\mathcal H}$ that can properly distinguish between $g(x_i^s)$ and $g(x_i^t)$, $g$ is said to be invariant to the domains. Learning is performed using GD and the GRL (denoted by $R_\lambda$) on the following objective:

$$\min_{\hat h \in \hat{\mathcal H},\, g \in \mathcal G,\, \hat h' \in \hat{\mathcal H}} \; \mathbb E_{x, y \sim p_s}\big[\ell(\hat h \circ g(x), y)\big] - \alpha\, d_{s,t}\big(\hat h, \hat h', R_\lambda(g)\big), \quad (2)$$

where $d_{s,t}(\hat h, \hat h', g) := \mathbb E_{x \sim p_s}\big[\hat\ell(\hat h' \circ g, \hat h \circ g)\big] - \mathbb E_{x \sim p_t}\big[(\phi^* \circ \hat\ell)(\hat h' \circ g, \hat h \circ g)\big]$. Mathematically, the GRL $R_\lambda$ is treated as a pseudo-function defined by two (incompatible) equations describing its forward and back-propagation behavior (Ganin & Lempitsky, 2015; Ganin et al., 2016). Specifically,

$$R_\lambda(x) := x \quad \text{and} \quad \frac{d R_\lambda(x)}{dx} := -\lambda, \quad (3)$$

where $\lambda$ and $\alpha$ are hyper-parameters that control the trade-off between achieving a small source error and learning an invariant representation. The surrogate loss $\ell : \mathcal Y \times \mathcal Y \to \mathbb R$ (e.g., cross-entropy) is used to minimize the empirical risk in the source domain. The choice of the function $\hat\ell : \mathcal Y \times \mathcal Y \to \mathbb R$ and of the conjugate $\phi^*$ of the f-divergence defines the particular algorithm (Ganin et al., 2016; Saito et al., 2018; Zhang et al., 2019; Acuna et al., 2021). From Eq. (2), we can notice that the GRL introduces an adversarial scheme. We next interpret Eq. (2) as a three-player game where the players are $\hat h$, $\hat h'$ and $g$, and study its continuous gradient dynamics.
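In practice, the pseudo-function of Eq. (3) is realized as a custom autograd operation that is the identity in the forward pass and scales the incoming gradient by $-\lambda$ in the backward pass. The snippet below is a minimal sketch of this pattern (our illustration; the class and function names are not from the paper), following the standard DANN-style implementation:

```python
# Minimal sketch of the Gradient Reversal Layer R_lambda of Eq. (3):
# identity forward, gradient scaled by -lambda backward.
import torch


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)                       # R_lambda(x) = x

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None     # dR_lambda/dx = -lambda


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


# Usage: the domain classifier is fed grad_reverse(g(x), lambd), so a single backward
# pass hands the feature extractor the sign-flipped gradient of d_{s,t}.
z = torch.randn(4, 8, requires_grad=True)
grad_reverse(z, 0.5).sum().backward()
print(z.grad[0, :3])    # each entry equals -0.5
```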
3 A GAME PERSPECTIVE ON DAL

We now interpret DAL from a game-theoretical perspective. In Section 3.1, we rewrite the DAL objective as a three-player game. In this view, the feature extractor and the two classifiers are each a player. This allows us to define optimality in terms of a local Nash Equilibrium (see Def. 2 in the appendices). In Section 3.2, we introduce the vector field, the game Hessian and the tools that allow us to characterize local NE for the players. This characterization leads to our analysis of the continuous dynamics in Section 4.

3.1 DOMAIN-ADVERSARIAL GAME

We now rewrite and analyze the DAL problem in Eq. (2) as a three-player game. Let $\hat{\mathcal H}$, $\hat{\mathcal H}$ and $\mathcal G$ be classes of neural network functions and define $\omega_1 \in \Omega_1$ and $\omega_3 \in \Omega_3$ as the vectors of parameters of the classifier $\hat h \in \hat{\mathcal H}$ and of the domain classifier $\hat h' \in \hat{\mathcal H}$, respectively. Similarly, let $\omega_2 \in \Omega_2$ be the parameters of the feature extractor network $g \in \mathcal G$. Their joint domain is denoted by $\Omega = \Omega_1 \times \Omega_2 \times \Omega_3$ and the joint parameter set is $\omega = (\omega_1, \omega_2, \omega_3)$. Let each neural network be a player and its parameter choice be its individual (here continuous) strategy. The goal of each player is then to selfishly minimize its own cost function $J_i : \Omega \to \mathbb R$. We use the subscript $-i$ to refer to all parameters/players but $i$.

With the notation introduced, we can now formally define the Domain-Adversarial Game as the three-player game $\mathcal G(\mathcal I, \Omega_i, J_i)$ where $\mathcal I := \{1, 2, 3\}$, $\dim(\Omega) = \sum_{i=1}^{3} \dim(\Omega_i) = d$, $\Omega_i \subseteq \mathbb R^{d_i}$ and:

$$J_1(\omega_1, \omega_{-1}) := \ell(\omega_1, \omega_2) + \alpha\, d_{s,t}(\omega), \qquad J_2(\omega_2, \omega_{-2}) := \ell(\omega_1, \omega_2) + \alpha\lambda\, d_{s,t}(\omega), \qquad J_3(\omega_3, \omega_{-3}) := -\alpha\, d_{s,t}(\omega). \quad (4)$$

We use the shorthand $\ell(\omega_1, \omega_2)$ for $\mathbb E_{x,y \sim p_s}[\ell(\omega_1 \circ \omega_2(x), y)]$, where the $\omega_i$'s refer to the feature extractor $g$ and the classifiers ($\hat h$ and $\hat h'$). Similar notation follows for $d_{s,t}$. Here, we assume that each $J_i$ is smooth in each of its arguments $\omega_i \in \Omega_i$. The gradient field of Equation (2) and the game's vector field (see Section 3.2) are equivalent, making the original interpretation of DAL and our three-player formulation equivalent. However, it is worth noting that our interpretation does not explicitly require the use of $R_\lambda$ in $d_{s,t}$ in Equation (4). We can write optimality conditions for the above problem through the concept of Nash Equilibrium:

Definition 1 (Nash Equilibrium (NE)). A point $\omega^* \in \Omega$ is said to be a Nash Equilibrium of the Domain-Adversarial Game if $\forall i \in \{1, 2, 3\}$, $\forall \omega_i \in \Omega_i$, we have $J_i(\omega_i^*, \omega_{-i}^*) \leq J_i(\omega_i, \omega_{-i}^*)$.

In our scenario, the losses are not convex/concave. A NE then does not necessarily exist and, in general, finding a NE is analogous to, but much harder than, finding global minima in neural networks, which is unrealistic using gradient-based methods (Letcher et al., 2019). Thus, we focus on the local NE, which relaxes the NE to a local neighborhood $B(\omega^*, \delta) := \{\omega : \|\omega - \omega^*\| < \delta\}$ with $\delta > 0$ (see Definition 2).

Intuitively, a NE means that no player has an incentive to change its own strategy (here the parameters of its neural network) because doing so will not generate any additional payoff (here it will not further minimize its cost function). We emphasize that each player only has access to its own strategy set. In other words, the first player ($J_1$) cannot change the parameters $\omega_2, \omega_3$; it only has access to $\omega_1 \in \Omega_1$. While the motivation for the three-player game follows naturally from the original formulation of DAL, where three networks interact with each other (see Figure 1), the optimization problem (2) could also be interpreted as the minimax objective of a two-player zero-sum game.
Thus, a natural question arises: can we interpret the domain-adversarial game as a two-player zero-sum game? This can be done, for example, by defining $\omega_{12} := (\omega_1, \omega_2)$ and considering the costs of the two players $(\omega_{12}, \omega_3)$ to be $J_{12} = J$ and $J_3 = -J$, where $J(\omega_{12}, \omega_3) := \mathbb E_{p_s}[\ell(\omega_1, \omega_2)] + d_{s,t}(\omega)$. In general, however, the solution of the two-player game $(\omega_{12}^*, \omega_3^*)$ is not equal to the NE solution of the three-player game $(\omega_1^*, \omega_2^*, \omega_3^*)$. This is because the team optimal solution $\omega_{12}^* \neq (\omega_1^*, \omega_2^*)$ in general. We illustrate this in the following counterexample (see Başar & Olsder (1998) for more details):

Example 1. Let $J(\omega) := \tfrac{1}{2}\big(\omega_1^2 + 4\omega_1\omega_2 + \omega_2^2 - \omega_3^2\big)$. (a) Consider the three-player game over $\omega = (\omega_1, \omega_2, \omega_3)$ with $J_1 = J_2 = J$ and $J_3 = -J$. Each $J_i$ is strictly convex in $\omega_i$. The NE of the game, $\omega^* = (0, 0, 0)$, is unique. (b) Consider the two-player game over $\omega = (\omega_{12}, \omega_3)$ with $J_{12} = J$ and $J_3 = -J$. The solution $\omega^*$ from (a) is not a NE of this game. To see this, let $\hat\omega := (-1, 1, 0)$. We have $J_{12}(\hat\omega) = -1 < J_{12}(\omega^*) = 0$, which contradicts Definition 1. One can verify that there is no NE in this two-player scenario.

3.2 CHARACTERIZATION OF THE DOMAIN-ADVERSARIAL GAME

We now introduce the game's vector field (also called the pseudo-gradient) and the pseudo-gradient's Jacobian. We also provide a characterization of the local NE based on them (see Proposition 3). These are the core concepts used in our analysis (Section 4). We first define the game's vector field $v(\omega)$ and its Jacobian $H(\omega)$ (also called the game Hessian (Letcher et al., 2019)):

$$v(\omega) := \big(\nabla_{\omega_1} J_1,\ \nabla_{\omega_2} J_2,\ \nabla_{\omega_3} J_3\big) \in \mathbb R^d, \qquad H(\omega) := \nabla v(\omega) \in \mathbb R^{d \times d}. \quad (5)$$

Note that the vector field $v(\omega)$ and the three-player formulation naturally capture the behavior introduced by the GRL in the original formulation. Specifically, $v(\omega)$ is identical to the gradient with respect to the parameters of the original DAL objective with the GRL (Equation (2)). Therefore, in both cases the behavior of GD is identical: assuming the same initial conditions, they reach the same solution. This shows the equivalence between our perspective and the original DAL formulation. We emphasize that by equivalence we mean the same dynamics, and the same intermediate and final solutions. Another fact worth emphasizing is that $H(\omega)$ is asymmetric. This is in contrast with the Hessian in supervised learning. Before proceeding with a characterization of the local NE in terms of $v(\omega)$ and $H(\omega)$, we first state necessary and sufficient conditions for a local NE:

Proposition 1 (Necessary condition). Suppose each $J_i$ is twice continuously differentiable in each $\omega_i$. Any local NE $\omega^*$ satisfies: i) $\nabla_{\omega_i} J_i(\omega^*) = 0$ and ii) $\forall i \in \{1, 2, 3\}$, $\nabla^2_{\omega_i,\omega_i} J_i(\omega^*) \succeq 0$.

Proposition 2 (Sufficient condition). Suppose each $J_i$ is twice continuously differentiable in each $\omega_i$. $\omega^*$ is a local NE if i) $\nabla_{\omega_i} J_i(\omega^*) = 0$ and ii) $\forall i$, $\nabla^2_{\omega_i,\omega_i} J_i(\omega^*) \succ 0$.

The necessary and sufficient conditions from Propositions 1 and 2 are reminiscent of conditions for local optimality in continuous optimization (Nocedal & Wright, 2006). Similar conditions were also proposed in Ratliff et al. (2016), where the sufficient condition defines the differential Nash equilibrium. We can now characterize a local NE in terms of $v(\omega)$ and $H(\omega)$:

Proposition 3 (Strict Local NE). $\omega^*$ is a strict local NE if $v(\omega^*) = 0$ and $H(\omega^*) + H(\omega^*)^\top \succ 0$.

The sufficient condition implies that the NE is structurally stable (Ratliff et al., 2016). Structural stability is important as it implies that slightly biased estimators of the gradient (e.g., due to sampling noise) will not have vastly different behaviors in neighborhoods of equilibria (Mazumdar et al., 2020). In the following, we focus on the strict local NE (i.e., $\omega^*$ for which Proposition 3 is satisfied).
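For concreteness, the components of $v(\omega)$ can be assembled with one backward pass per player. The sketch below is ours (not the authors' code): it instantiates three tiny linear players with the DANN-style logistic surrogate for $d_{s,t}$; the architectures, the mini-batch and the choice $\alpha = \lambda = 1$ are illustrative assumptions.

```python
# Sketch: building the pseudo-gradient v(w) = (grad_{w1} J1, grad_{w2} J2, grad_{w3} J3)
# of Eqs. (4)-(5) with one backward pass per player.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
h = nn.Linear(4, 1)          # task classifier   (player 1, parameters w1)
g = nn.Linear(5, 4)          # feature extractor (player 2, parameters w2)
h_prime = nn.Linear(4, 1)    # domain classifier (player 3, parameters w3)
alpha = lam = 1.0

xs, ys = torch.randn(8, 5), torch.randint(0, 2, (8, 1)).float()   # labeled source batch
xt = torch.randn(8, 5)                                            # unlabeled target batch

zs, zt = g(xs), g(xt)
task = F.binary_cross_entropy_with_logits(h(zs), ys)              # l(w1, w2)
# DANN-style surrogate for d_{s,t}: E_s[log sigma(D(z))] + E_t[log(1 - sigma(D(z)))]
d_st = F.logsigmoid(h_prime(zs)).mean() + F.logsigmoid(-h_prime(zt)).mean()

J1 = task + alpha * d_st          # cost of the classifier
J2 = task + alpha * lam * d_st    # cost of the feature extractor
J3 = -alpha * d_st                # cost of the domain classifier (it maximizes d_{s,t})

v = (torch.autograd.grad(J1, tuple(h.parameters()), retain_graph=True)
     + torch.autograd.grad(J2, tuple(g.parameters()), retain_graph=True)
     + torch.autograd.grad(J3, tuple(h_prime.parameters())))
print([t.shape for t in v])       # components of v(w), one tensor per parameter block
```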
4 LEARNING ALGORITHMS

We defined optimality as the local NE and provided sufficient conditions in terms of the pseudo-gradient and its Jacobian. In this section, we assume the existence of a strict local NE (Prop. 3) in the neighborhood of the current point (e.g., initialization), and analyze the continuous gradient dynamics of the Domain-Adversarial Game (Eq. (4) and Eq. (5)). We show that, given the sufficient conditions from Prop. 3, asymptotic convergence to a local NE is guaranteed through an application of the Hurwitz condition (Khalil, 2002). Most importantly, we show that using GD with the GRL could violate those guarantees unless its learning rate is upper bounded (see Thm. 2 and Cor. 1). This is in sharp contrast with known results from supervised learning, where the implicit regularization introduced by GD has been shown to be desirable (Barrett & Dherin, 2021). We also analyze the use of higher-order ODE solvers for DAL and show that the above restrictions are not required if GD is replaced with them. Finally, we compare our resulting optimizers with recently proposed algorithms in the context of games.

Our algorithmic analysis is based on the continuous gradient-play dynamics and the derivation of the modified or high-resolution ODEs of popular integrators (e.g., GD/Euler method and Runge-Kutta). This type of analysis is also known in the numerical integration community as backward error analysis (Hairer et al., 2006) and has recently been used to understand the implicit regularization effect of GD in supervised learning (Barrett & Dherin, 2021). High-resolution ODEs have also been used in Shi et al. (2018) to understand the acceleration effect of optimization algorithms, and more recently in Lu (2020). As in Shi et al. (2018); Lu (2020); Barrett & Dherin (2021), our derivation of the high-resolution ODEs is in the full-batch setting. The derivation of the stochastic dynamics of stochastic discrete-time algorithms is significantly more complicated and is beyond the scope of this work. We experimentally demonstrate that our results are also valid when there is stochasticity due to sampling noise in the mini-batch. We emphasize that our analysis does not put any constraint or structure on the players' cost functions, as opposed to Azizian et al. (2020); Zhang & Yu (2020): in our problem, the game is neither bilinear nor necessarily strongly monotone. See proofs in the appendices.

4.1 CONTINUOUS GRADIENT DYNAMICS

Given $v(\omega)$, the continuous gradient dynamics can be written as:

$$\dot\omega(t) = -v(\omega). \quad (6)$$

For later reasons, and to distinguish between Eq. (6) and the gradient flow, we will refer to these as the gradient-play dynamics, as in Başar & Olsder (1998); Mazumdar et al. (2020). These dynamics are well studied and understood when the game is either a potential or a purely adversarial game (see definitions in the appendices). While Eq. (2) may look like a single objective, the introduction of the GRL ($R_\lambda$) makes a fundamental difference between our case and the dynamics that are analyzed in the single-objective gradient-based learning and optimization literature. We summarize this below:

Proposition 4.
The domain-adversarial game is neither a potential nor necessarily a purely adversarial game. Moreover, its gradient dynamics are not equivalent to the gradient flow.

Fortunately, we can directly apply the Hurwitz condition (Khalil, 2002) (also known as the condition for asymptotic stability, see Appendix A.1) to derive sufficient conditions under which the continuous dynamics of the gradient play converge.

Lemma 1 (Hurwitz condition). Let $\nabla v(\omega^*)$ be the Jacobian of the vector field at a stationary point $\omega^*$ where $v(\omega^*) = 0$. If the real part of every eigenvalue $\lambda$ of $\nabla v(\omega^*)$ (i.e., in the spectrum $\mathrm{Sp}(\nabla v(\omega^*))$) is positive, then the continuous gradient dynamics are asymptotically stable.

In this work, we assume the algorithms are initialized in a neighborhood of a strict local NE $\omega^*$. Therefore, Lemma 1 provides sufficient conditions for the asymptotic convergence of the gradient-play dynamics to a local NE. In practice this assumption may not hold, and it is computationally hard to verify. Despite this, our experiments show noticeable performance gains in several tasks, benchmarks and network architectures (see Section 6).

4.2 ANALYSIS OF GD WITH THE GRL

We showed above that, given the existence of a strict local NE, the gradient-play dynamics are attracted to it. A natural question then arises: if under this assumption local asymptotic convergence is guaranteed, what could make DAL notoriously hard to train and unstable? In practice, we do not have access to an explicit solution of the ODE. Thus, we rely on integration algorithms to approximate the solution. One simple approach is to use the Euler method:

$$w^+ = w - \eta\, v(w). \quad (7)$$

This is commonly known as GD. The equivalence between $v(w)$ (the game's vector field) and the gradient of Equation (2) (the original DAL formulation) follows from the use of the GRL ($R_\lambda$). We remind the reader that the GRL is a pseudo-function defined by two (incompatible) equations describing its forward and back-propagation behavior, i.e., a flip in the gradient's sign for the backward pass (see Figure 1, Section 2 and Ganin et al. (2016)). Equation (7) is then the default algorithm used in DAL. Now, to provide an answer to the motivating question of this section, we propose to analyze the high-resolution ODE of this numerical integrator (i.e., Euler) and, in turn, its asymptotic behavior. This is similar to deriving the modified continuous dynamics for which the integrator produces the exact solution (Hairer et al., 2006) and applying the Hurwitz condition to the high-resolution ODE.

Theorem 2. The high-resolution ODE of GD with the GRL up to $O(\eta)$ is:

$$\dot w = -v(w) - \frac{\eta}{2}\,\nabla v(w)\, v(w). \quad (8)$$

Moreover, this is asymptotically stable (see Appendix A.1) at a stationary point $w^*$ (Proposition 3) iff, for every eigenvalue written as $\lambda = a + ib \in \mathrm{Sp}(-\nabla v(w^*))$, we have $0 > \eta(a^2 - b^2)/2 > a$.

A striking difference between Equation (6) and Equation (8) is now clear: the additional second term. This additional term is a result of the discretization of the gradient-play dynamics using Euler's method (i.e., GD) and leads to a different Jacobian of the dynamics. This term was recently shown to be beneficial for standard supervised learning (Barrett & Dherin, 2021), where $\nabla v(\omega^*)$ is symmetric and thus only has real eigenvalues. In our scenario, this term is undesirable. In fact, this additional term puts an upper bound on the learning rate $\eta$. The following corollary formalizes this:

Corollary 1.
The high-resolution ODE of GD with the GRL in Equation (8) is asymptotically stable only if the learning rate $\eta$ is in the interval

$$0 < \eta < \frac{-2a}{b^2 - a^2},$$

for all $\lambda = a + ib \in \mathrm{Sp}(-\nabla v(w^*))$ with large imaginary part (i.e., such that $|a| < |b|$).

To have good convergence properties, the imaginary parts of the eigenvalues of $\nabla v(w^*)$ must be small enough. Therefore, if some eigenvalue $\lambda = a + ib$ satisfies $a < 0$ and $b^2 - a^2 \gg -2a$, the learning rate must be chosen to be very small. This is verified in Section 6 and in Example 2.

Example 2. Consider the three-player game where $\ell(w_1, w_2) = w_1^2 + 2w_1 w_2 + w_2^2$, $\lambda = 1$ and $d_{s,t}(w_2, w_3) = w_2^2 + 99 w_2 w_3 - w_3^2$. Then $\dot w = -v(w)$ becomes:

$$\dot w = A w, \qquad A = \begin{pmatrix} -2 & -2 & 0 \\ -2 & -4 & -99 \\ 0 & 99 & -2 \end{pmatrix}.$$

The eigenvalues of $A$ are $-2$ and $-3 \pm 2i\sqrt{2449}$. From Corollary 1, $\eta$ should satisfy $0 < \eta < 6.2 \times 10^{-3}$.

4.3 HIGHER-ORDER ODE SOLVERS

The limitation described above exists because GD with the GRL can be understood as a discretization of the gradient-play dynamics using Euler's method. Thus, it only approximates the continuous dynamics up to $O(\eta)$. To obtain a better approximation, we consider Runge-Kutta (RK) methods of order two and beyond (Butcher, 1996). For example, the improved Euler method (a particular RK method of second order) can be written as:

$$w^+ = w - \frac{\eta}{2}\Big(v(w) + v\big(w - \eta\, v(w)\big)\Big). \quad (9)$$

Comparing Equation (9) (the update rule of RK2) with Equation (7) (the update rule of GD), one can see that the RK2 method is straightforward to implement in standard deep learning frameworks. Moreover, it does not introduce additional hyper-parameters. More importantly, such discrete dynamics approximate the continuous ODE of Equation (6) to a higher precision. In Appendix C, we provide asymptotic guarantees for the high-resolution ODEs of general RK methods, their generalized expression and the algorithm's pseudo-code. See also the PyTorch pseudo-code in Appendix E.

Limitation. A disadvantage of using higher-order solvers is that they require additional extra steps: specifically, one extra step in the case of RK2 (the computation of the additional second term in Equation (9)). In our implementation, however, this was less than 2x slower in wall-clock time (see Appendix E.5 for more details and a wall-clock comparison). Moreover, if not initialized in the neighborhood of a local NE, higher-order solvers and gradient-based methods might also converge to a non-NE, as described in Mazumdar et al. (2019), although this is likely a rare case.

Comparison vs other game optimization algorithms. DAL has not been previously interpreted from a game perspective. Our interpretation allows us to bring algorithms recently proposed in the context of differentiable games (Zhang & Yu, 2020; Azizian et al., 2020) to DAL. Classic examples are the Extra-Gradient (EG) method (Korpelevich, 1976) and Consensus Optimization (CO) (Mescheder et al., 2017). In Appendix B.2 we analyze the continuous dynamics of the EG method and show that we cannot take the learning rate of EG to be large either. Thus, we obtain a similar conclusion as in Corollary 1. In practice for DAL, then, stability for EG comes at the price of slow convergence due to the use of small learning rates. We experimentally show this in Figure 3. With respect to CO, we show in Appendix C that this algorithm can be interpreted in the limit as an approximation of the RK2 solver. In practice, if its additional hyper-parameter ($\gamma$) is tuned thoroughly, CO may approximate the continuous dynamics better than GD and EG. We believe this may be the reason why CO slightly outperforms GD and EG (see Appendix E.4). In all cases, RK solvers outperform GD, EG and CO. This is in line with our theoretical analysis, since they better approximate the continuous dynamics (Hairer et al., 2006). It is worth noting that many other optimizers have recently been proposed in the context of games, e.g., Gidel et al. (2019a); Hsieh et al. (2020); Lorraine et al. (2021a;b). Some of them are modifications of the EG method that we compare to, e.g., Extra-Adam (Gidel et al., 2019a) or double step-size EG (Hsieh et al., 2020). More practical modifications in terms of adaptive step sizes could also be applied on top of RK solvers, as done in Qin et al. (2020). A comparison of all existing game optimizers in DAL, and a better theoretical understanding of such modifications on RK solvers, are beyond the scope of this work. However, we believe this is an interesting and unexplored research direction that our game perspective on DAL enables.
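For reference, a minimal PyTorch-style sketch of the update in Equation (9) is shown below. It is our illustration rather than the appendix pseudo-code: `pseudo_gradient` is an assumed closure that recomputes the players' losses on the current mini-batch and returns the components of $v(w)$ in the same order as `params` (e.g., as in the earlier pseudo-gradient sketch).

```python
# Sketch of one improved-Euler (RK2) update, Eq. (9): w <- w - lr/2 * (v(w) + v(w - lr * v(w))).
import torch


@torch.no_grad()
def rk2_step(params, pseudo_gradient, lr):
    params = list(params)
    w0 = [p.detach().clone() for p in params]
    with torch.enable_grad():
        v1 = pseudo_gradient()                    # v(w) on the current mini-batch
    for p, g in zip(params, v1):
        p.sub_(lr * g)                            # move to the trial point w - lr * v(w)
    with torch.enable_grad():
        v2 = pseudo_gradient()                    # v(w - lr * v(w)) on the same mini-batch
    for p, w, g1, g2 in zip(params, w0, v1, v2):
        p.copy_(w - 0.5 * lr * (g1 + g2))         # Eq. (9)
```

A training iteration then amounts to sampling a mini-batch, building such a closure, and calling `rk2_step` once; the closure is evaluated twice, which is the extra step discussed under Limitation above.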
5 RELATED WORK

To the best of our knowledge, DAL has not been previously analyzed from a game perspective. Moreover, the stability of the optimizer and the implications of introducing the GRL have not been analyzed either. Here, we compare our results with the general literature.

Gradient-Based Learning in Games. Ratliff et al. (2016) proposed a characterization of the local Nash equilibrium, providing sufficient and necessary conditions for its existence. Mazumdar et al. (2020) proposed a general framework to analyze the limiting behavior of gradient-play algorithms in games using tools from dynamical systems. Our work builds on top of this characterization but specializes it to the domain-adversarial problem. We propose a more stable learning algorithm that better approximates the gradient-play dynamics. Our resulting algorithm does not introduce explicit adjustments or modify the learning dynamics, nor does it require the computation of several Hessian-vector products or new hyperparameters. This is in contrast with general algorithms previously analyzed in the context of differentiable games (Azizian et al., 2020; Letcher et al., 2019).

Integration Methods and ML. Scieur et al. (2017) showed that accelerated optimization methods can be interpreted as integration schemes of the gradient flow equation. Zhang et al. (2018) showed that the use of high-order RK integrators can achieve acceleration for convex functions. In the context of two-player games (i.e., GANs), Gemp & Mahadevan (2018) consider using a second-order ODE integrator. More recently, Qin et al. (2020) proposed to combine RK solvers with regularization of the generator's gradient norm. Chen et al. (2018) interpreted the residual connections in modern networks as the Euler integration of a continuous system. In our case, we notice that the combination of GD with the GRL can be interpreted as the Euler discretization of the continuous gradient-play dynamics, which could prevent asymptotic convergence guarantees. We then study the discretization step of popular ODE solvers and provide simple guarantees for stability. Moreover, our analysis is based on a novel three-player game interpretation of the domain-adaptation problem. This is also different from a single potential function or two-player games (i.e., GANs).

Two-Player Zero-Sum Games have recently received significant attention in the machine learning literature due to the popularity of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014).
For example, several algorithms have been proposed and analyzed (Mescheder et al., 2017; Mertikopoulos et al., 2019; Gidel et al., 2019a;b; Zhang & Yu, 2020; Hsieh et al., 2020), in both deterministic and stochastic scenarios. In our problem, we have a general three-player game resulting from a novel game interpretation of the domain-adversarial problem. It is worth noting that while Gidel et al. (2019a) focused on GANs, their convergence theory and methods for stochastic variational inequalities could also be applied to three-player games and thus to DAL using our perspective.

6 EXPERIMENTAL RESULTS

We conduct an extensive experimental analysis. We compare with the default optimizers used in domain-adversarial training, such as GD, GD with Nesterov momentum (GD-NM) (as in Sutskever et al. (2013)) and Adam (Kingma & Ba, 2014). We also compare against recently proposed optimizers in the context of differentiable games, such as EG (Korpelevich, 1976) and CO (Mescheder et al., 2017). We focus our experimental analysis on the original domain-adversarial framework of Ganin et al. (2016) (DANN). However, in Section 6.2, we also show the versatility and efficacy of our approach by improving the performance of a recently proposed state-of-the-art DAL framework (f-DAL (Acuna et al., 2021) combined with Implicit Alignment (Jiang et al., 2020)).

6.1 EXPERIMENTAL ANALYSIS ON DIGITS

Figure 2: Our method vs. popular optimizers on the Digits benchmark. (Top-Left) Loss in the target domain. (Top-Right) Transfer performance. (Bottom) t-SNE visualization of the last-layer representations during training. Our method converges faster, has better performance and produces more aligned features faster.

Implementation details. Our first experimental analysis is based on the Digits benchmark with models trained from scratch (i.e., with random initialization). This benchmark consists of two digits datasets, MNIST (CC BY-SA 3.0) and USPS (LeCun et al., 1998; Long et al., 2018), with two transfer tasks (M→U and U→M). We adopt the splits and evaluation protocol from Long et al. (2018) and follow their standard implementation details. For GD-NM, we use the default momentum value (0.9). We follow the same approach for the additional hyper-parameters of Adam. Hyper-parameters such as the learning rate, the learning-rate schedule and the adaptation coefficient ($\lambda$) are determined for all optimizers by running a dense grid search and selecting the best hyper-parameters on the transfer task M→U. As usual in UDA, the selection criterion is the best transfer accuracy. The same parameters are then used for the other task (i.e., U→M). We use the same seed and identically initialize the network weights for all optimizers. This analysis is conducted in JAX (Bradbury et al., 2018) (see Appendix D).

Figure 3: Comparison among optimization algorithms (M→U) with DANN.

Comparison vs. optimizers used in DAL. Figure 2 (top) illustrates the training dynamics for the loss in the target domain and the transfer performance. As expected, our optimizer converges faster and achieves noticeable performance gains.
A core idea of DAL is to learn domain-invariant representations; thus, in Figure 2 (bottom) we plot t-SNE (Van der Maaten & Hinton, 2008) visualizations of the last-layer features of the network. We show this over a sequence of epochs for GD with the GRL vs. RK2. A different color is used for the source and target datasets. In the comparison vs. Adam, we emphasize that Adam computes adaptive learning rates, which our method does not. That said, Figure 2 shows that our two methods, RK2 and RK4, outperform all baselines in terms of both convergence and transfer performance. In Figure 7, we show how unstable these standard optimizers are when more aggressive step sizes are used. This is in line with our theoretical analysis. Experimentally, it can be seen that in DAL, GD is more stable than GD-NM and Adam, with the latter being the most unstable. This sheds light on why well-tuned GD-NM is often preferred over Adam in DAL.

Comparison vs. game optimization algorithms. We now compare RK solvers against other recently proposed game optimization algorithms, specifically the EG method (Korpelevich, 1976) and CO (Mescheder et al., 2017). In every case, we perform a dense grid search under the same budget for all optimizers and report the best selection (see Appendix E for details). In line with our theoretical analysis of the continuous dynamics of EG, we notice that the EG method is not able to train with learning rates bigger than 0.006; as a result, it performs significantly worse than any other optimizer (including simple GD). Also in line with our theoretical analysis, CO performs better than EG and all other popular gradient algorithms used in DAL. This is because CO can be seen as an approximation of Heun's method (RK2). More details are given in the supplementary material.

Robustness to hyper-parameters. Figure 4 shows the transfer performance of our method for different choices of hyper-parameters, while highlighting (green line) the best score of the best-performing GD hyper-parameters on the same dataset. Our method is robust to a wide variety of hyper-parameters.

Figure 4: Robustness to hyper-parameters. We compare the transfer performance of our method for different hyper-parameters on the task M→U of the Digits benchmark. The green line shows the best score for the best-performing hyper-parameters of GD. The blue star corresponds to the best solution. Our method performs well for a wide variety of hyper-parameters.

Figure 5: Sensitivity to sampling noise. Different amounts of sampling noise controlled by the batch size (64, 128, 160) (VisDA).

Table 3: Accuracy (%) on Digits (DANN).

| Method     | M→U        | U→M        | Avg  |
|------------|------------|------------|------|
| GD         | 90.0 ± 0.4 | 93.4 ± 0.7 | 91.7 |
| Adam       | 92.8 ± 0.3 | 96.8 ± 0.2 | 94.8 |
| GD-NM      | 91.8 ± 0.3 | 94.4 ± 0.4 | 93.1 |
| Ours (RK2) | 95.1 ± 0.1 | 97.5 ± 0.2 | 96.3 |
| Ours (RK4) | 95.0 ± 1.4 | 97.3 ± 0.1 | 96.1 |

Figure 6: Transfer performance on VisDA (DANN).

Figure 7: Stability analysis on Digits. Most aggressive step size before divergence. Adam diverges for η > 0.001.

6.2 COMPARISON IN COMPLEX ADAPTATION TASKS

Table 1: Accuracy (DANN) on VisDA-2017 with ResNet-50.

| Method     | Sim→Real   |
|------------|------------|
| GD-NM      | 71.7 ± 0.7 |
| Ours (RK2) | 73.8 ± 0.3 |

We evaluate the performance of our algorithm with ResNet-50 (He et al., 2016) on more challenging adaptation benchmarks.
Specifically, this analysis is conducted on the VisDA-2017 benchmark (Peng et al., 2017). This is a simulation-to-real dataset with two different domains: (S) synthetic renderings of 3D models and (R) real images. For this experiment, we use PyTorch (Paszke et al., 2019); our evaluation protocol follows Zhang et al. (2019) and uses ResNet-50 as the backbone network. For the optimizer parameters, we thoroughly tune GD-NM, which is the optimizer used in this setting (Long et al., 2018; Zhang et al., 2019; Jiang et al., 2020; Acuna et al., 2021). For ours, we keep the same hyper-parameters but increase the learning rate (to 0.2) and the batch size to 128. In this task, our approach corresponds to the improved Euler method (RK2). Table 1 shows the comparison. Figure 6 compares the training dynamics of our method vs. GD-NM. In Figure 5, we evaluate the sensitivity of our method (in terms of transfer performance) to sampling noise as controlled by the batch size.

Table 2: Comparison using a state-of-the-art adversarial DA framework with ResNet-50 on VisDA.

| Method             | Sim→Real           |
|--------------------|--------------------|
| f-DAL - GD-NM      | 72.9 (29.5K iter)  |
| f-DAL - RK2 (Ours) | 76.4 (10.5K iter)  |

Improving state-of-the-art DAL frameworks. We use this complex visual adaptation task to showcase the applicability of our method to state-of-the-art DAL frameworks. Specifically, we let the DA method be f-DAL Pearson, as in Acuna et al. (2021), with Implicit Alignment (Jiang et al., 2020). We use the tuned parameters and optimizer from Acuna et al. (2021); Jiang et al. (2020) as the baseline. In our case, we only increase the learning rate (to 0.2). Table 2 shows that our method achieves peak results (+3.5%) in 10.5K iterations (vs. 29.5K iterations for GD-NM).

Natural Language Processing Tasks. We also evaluate our approach on natural language processing tasks on the Amazon product reviews dataset (Blitzer et al., 2006). We show noticeable gains by replacing GD with either RK2 or RK4. Results and details can be found in Appendix E.1.

7 CONCLUSIONS

We analyzed DAL from a game-theoretical perspective where optimality is defined as a local NE. From this view, we showed that standard optimizers in DAL can violate the asymptotic guarantees of the gradient-play dynamics, requiring careful tuning and small learning rates. Based on our analysis, we proposed to replace existing optimizers with higher-order ODE solvers. We showed both theoretically and experimentally that these are more stable and allow for higher learning rates, leading to noticeable improvements in terms of transfer performance and the number of training iterations. We showed that these ODE solvers can be used as a drop-in replacement and outperform strong baselines.

Acknowledgements. We would like to thank James Lucas, Jonathan Lorraine, Tianshi Cao, Rafid Mahmood, Mark Brophy and the anonymous reviewers for feedback on earlier versions of this work.

REFERENCES

David Acuna, Guojun Zhang, Marc T. Law, and Sanja Fidler. f-domain adversarial learning: Theory and algorithms. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pp. 66-75. PMLR, 2021.

Waïss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 2863-2873. PMLR, 26-28 Aug 2020.

David Barrett and Benoit Dherin.
Implicit gradient regularization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=3q5Iq Urkc F. Tamer Ba sar and Geert Jan Olsder. Dynamic Noncooperative Game Theory, 2nd Edition. Society for Industrial and Applied Mathematics, 1998. doi: 10.1137/1.9781611971132. URL https://epubs.siam.org/ doi/abs/10.1137/1.9781611971132. Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in neural information processing systems, pp. 137 144, 2007. Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151 175, 2010a. Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 129 136, 2010b. Hugo Berard, Gauthier Gidel, Amjad Almahairi, Pascal Vincent, and Simon Lacoste-Julien. A closer look at the optimization landscapes of generative adversarial networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJe Vn CEKw H. John Blitzer, Ryan Mc Donald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing, pp. 120 128, 2006. James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. URL http://github.com/google/jax. John Charles Butcher. A history of runge-kutta methods. Applied numerical mathematics, 20(3):247 260, 1996. Woong-Gi Chang, Tackgeun You, Seonguk Seo, Suha Kwak, and Bohyung Han. Domain-specific batch normalization for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7354 7362, 2019. Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31, 2018. Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, volume 30, 2017. Sofien Dhouib, Ievgen Redko, and Carole Lartizien. Margin-aware adversarial domain adaptation with optimal transport. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2514 2524. PMLR, 13 18 Jul 2020. URL https://proceedings.mlr.press/v119/dhouib20b.html. Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pp. 1180 1189. PMLR, 2015. Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096 2030, 2016. Ian Gemp and Sridhar Mahadevan. Global convergence to the equilibrium of gans using variational inequalities. ar Xiv preprint ar Xiv:1808.01531, 2018. Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. 
A variational inequality perspective on generative adversarial networks. In ICLR, 2019a. Published as a conference paper at ICLR 2022 Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1802 1811. PMLR, 2019b. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672 2680, 2014. Ernst Hairer, Marlis Hochbruck, Arieh Iserles, and Christian Lubich. Geometric numerical integration. Oberwolfach Reports, 3(1):805 882, 2006. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning, pp. 1989 1998. PMLR, 2018. Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. In Advances in Neural Information Processing Systems, volume 33, pp. 16223 16234, 2020. Xiang Jiang, Qicheng Lao, Stan Matwin, and Mohammad Havaei. Implicit class-conditioned domain alignment for unsupervised domain adaptation. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 4816 4827. PMLR, 13 18 Jul 2020. URL http://proceedings.mlr.press/v119/jiang20d.html. Chi Jin, Praneeth Netrapalli, and Michael Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pp. 4880 4889. PMLR, 2020. Mingsheng Long Junguang Jiang, Bo Fu. Transfer-learning-library. https://github.com/thuml/ Transfer-Learning-Library, 2020. Hassan K Khalil. Nonlinear systems. Prentice-Hall, 2002. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. GM Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747 756, 1976. Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, volume 49, pp. 1246 1257. PMLR, 2016. Alistair Letcher, David Balduzzi, Sébastien Racanière, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. Differentiable game mechanics. Journal of Machine Learning Research, 20(84):1 40, 2019. URL http://jmlr.org/papers/v20/19-008.html. Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Ben Recht, and Ameet Talwalkar. Massively parallel hyperparameter tuning. 2018. Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E Gonzalez, and Ion Stoica. Tune: A research platform for distributed model selection and training. ar Xiv preprint ar Xiv:1807.05118, 2018. 
Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640 1650, 2018. Jonathan Lorraine, David Acuna, Paul Vicol, and David Duvenaud. Complex momentum for optimization in games, 2021a. Jonathan Lorraine, Paul Vicol, Jack Parker-Holder, Tal Kachman, Luke Metz, and Jakob Foerster. Lyapunov exponents for diversity in differentiable games. ar Xiv preprint ar Xiv:2112.14570, 2021b. Haihao Lu. An o (sr)-resolution ode framework for discrete-time optimization algorithms and applications to convex-concave saddle-point problems. ar Xiv preprint ar Xiv:2001.08826, 2020. Published as a conference paper at ICLR 2022 Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of The 22nd Annual Conference on Learning Theory (COLT 2009), Montréal, Canada, 2009. URL http://www.cs.nyu.edu/~mohri/postscript/nadap.pdf. Eric Mazumdar, Lillian J Ratliff, and S Shankar Sastry. On gradient-based learning in continuous games. SIAM Journal on Mathematics of Data Science, 2(1):103 131, 2020. Eric V Mazumdar, Michael I Jordan, and S Shankar Sastry. On finding local nash equilibria (and only local nash equilibria) in zero-sum games. ar Xiv preprint ar Xiv:1901.00838, 2019. Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile. In International Conference on Learning Representations, 2019. URL https://openreview.net/ forum?id=Bkg8jj C9KQ. Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. Advances in neural information processing systems, 30:1825 1835, 2017. Dov Monderer and Lloyd S Shapley. Potential games. Games and economic behavior, 14(1):124 143, 1996. Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/ 9015-pytorch-an-imperative-style-high-performance-deep-learning-library. pdf. Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge. ar Xiv preprint ar Xiv:1710.06924, 2017. Chongli Qin, Yan Wu, Jost Tobias Springenberg, Andy Brock, Jeff Donahue, Timothy Lillicrap, and Pushmeet Kohli. Training generative adversarial networks by solving ordinary differential equations. In Advances in Neural Information Processing Systems, volume 33, pp. 5599 5609, 2020. Lillian J Ratliff, Samuel A Burden, and S Shankar Sastry. On the characterization of local nash equilibria in continuous games. IEEE Transactions on Automatic Control, 61(8):2301 2307, 2016. Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3723 3732, 2018. Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d'Aspremont. Integration methods and optimization algorithms. In Advances in Neural Information Processing Systems, volume 30, 2017. Ozan Sener, Hyun Oh Song, Ashutosh Saxena, and Silvio Savarese. Learning transferrable representations for unsupervised domain adaptation. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/ b59c67bf196a4758191e42f76670ceba-Paper.pdf. Bin Shi, Simon S Du, Michael I Jordan, and Weijie J Su. Understanding the acceleration phenomenon via high-resolution differential equations. ar Xiv preprint ar Xiv:1810.08907, 2018. Rui Shu, Hung Bui, Hirokazu Narui, and Stefano Ermon. A dirt-t approach to unsupervised domain adaptation. In International Conference on Learning Representations, 2018. Yu Sun, Eric Tzeng, Trevor Darrell, and Alexei A Efros. Unsupervised domain adaptation through selfsupervision. ar Xiv preprint ar Xiv:1909.11825, 2019. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139 1147. PMLR, 2013. Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. Published as a conference paper at ICLR 2022 Guojun Zhang and Yaoliang Yu. Convergence of gradient-based methods on bilinear games. In International Conference on Learning Representations, 2020. Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct runge-kutta discretization achieves acceleration. ar Xiv preprint ar Xiv:1805.00521, 2018. Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7404 7413, Long Beach, California, USA, 09 15 Jun 2019. PMLR. URL http://proceedings.mlr.press/ v97/zhang19i.html. Published as a conference paper at ICLR 2022 SUPPLEMENTARY MATERIAL 1 Introduction 1 2 Preliminaries 2 3 A Game Perspective on DAL 3 3.1 Domain-Adversarial Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.2 Characterization of the Domain-Adversarial Game . . . . . . . . . . . . . . . . . . . . . . . 4 4 Learning Algorithms 4 4.1 Continuous Gradient Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4.2 Analysis of GD with the GRL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4.3 Higher order ODE Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 5 Related Work 7 6 Experimental Results 7 6.1 Experimental Analysis on Digits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 6.2 Comparison in complex adaptation tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 7 Conclusions 9 A Concepts in Game Theory 15 A.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.2 Games Characterizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.3 Case of Study in DANN. Original Formulation from Ganin et al. (2016) . . . . . . . . . . 
. . 15 B Derivation of high-resolution ODEs 17 B.1 High-resolution ODE of second-order Runge Kutta method . . . . . . . . . . . . . . . . . . 17 B.2 Continuous dynamics of Extra-Gradient (EG) . . . . . . . . . . . . . . . . . . . . . . . . . . 17 B.3 High-resolution ODE of classic fourth-order Runge Kutta method (RK4) . . . . . . . . . . . 18 C Proofs and additional theoretical results 20 C.1 Proposed Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 C.2 CO approximates RK2 (Heun s Method) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 D Experimental Setup Additional Details 22 E Additional Experiments 23 E.1 Natural Language Processing Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 E.2 Sensitivity to Sampling Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 E.3 Additional Comparison vs Game Optimization Algorithms . . . . . . . . . . . . . . . . . . . 24 E.4 CO vs Gradient Descent and Extra-Gradient Algorithms . . . . . . . . . . . . . . . . . . . . 24 Published as a conference paper at ICLR 2022 E.5 Wall-Clock Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 F Py Torch Pseudo Code of RK2 Solver 25 A CONCEPTS IN GAME THEORY A.1 DEFINITIONS Definition 2. (Local Nash Equilibrium) : A point (w i , w i) Ωis said to be a local Nash Equilibrium of the domain-adversarial game if there exists some δ > 0 such that: i {1, 2, 3}, Ji(w i , w i) Ji(wi, w i), s.t. ||ωi ω i ||2 < δ (10) Intuitively, this is restricting the concept of NE to a local neighborhood B(x , δ) := {||x x ||2 < δ} with δ > 0. A more practical characterization of the NE can be given in terms of the Best Response Map of each player which we now define. Definition 3. (Best Response Map (BR)) The best response map BRi : Ω i Ωi of player i is defined as: BRi(ω i) := arg min ωi Ωi Ji(ωi, ω i), (11) here the symbol emphasizes that the best response map is generally a set map and not a singleton, thus it is not a function in general. In other words, there may be a subset of element in Ωi for which Ji(., ω i) is a minimum. The notion of NE can be defined in terms of the generalized BR : Ω Ωmap. This can be thought as an stacked vector where the i-th element of BR is BRi(ω i). Proposition 5. A point w i Ωis said to be a NE of the game if it is a fixed point of the generalized BR : Ω Ω map. That is, ω BR(ω ) = i {1, 2, 3}, ω i BRi(ω i) (12) Proof. This follows from the definitions of BR map and NE. Definition 4. (Asymptotically Stable) A point ω is said to be a locally asymptotically stable point of the continuous dynamics ω = f(ω) if Re(λ) < 0 for all λ Sp( f(ω)), where Sp( f(ω)) is the spectrum of f(ω). Definition 4 is also known as the Hurwitz condition Khalil (2002). Definition 5. A stationary point x of a C2 function φ : Rn R is said to be a strict saddle point if: λmin( 2 xxφ(x )) < 0 and, λ( 2 xxφ(x )) > 0, for any other λ sp( 2 xxφ(x )) A.2 GAMES CHARACTERIZATIONS Potential Games. Potential Games were introduced in Monderer & Shapley (1996) and can be defined as a type of game for which there exists an implicit potential function φ : Ω R such that φ(ω) = v(ω). Consequently, a necessary and sufficient condition for the game to be potential is the Jacobian of the vector field v(ω) being symmetric (see 3.3 in Mazumdar et al. (2020) and Monderer & Shapley (1996)). Purely Adversarial Games. 
This particular type of game refers to the other extreme, in which $H(\omega)$ is a non-symmetric matrix with purely imaginary eigenvalues. If the game Hessian is skew-symmetric, these have also been called Hamiltonian games (Letcher et al., 2019).

A.3 CASE STUDY IN DANN: ORIGINAL FORMULATION FROM GANIN ET AL. (2016)

As mentioned in the main text (Section 2), our analysis is compatible with both the original and more recent formulations of domain-adversarial training, such as Zhang et al. (2019); Acuna et al. (2021). In this section, we specifically derive additional results for DANN (Ganin et al., 2016).

In order to obtain the original formulation of DANN, let us define $\hat\ell(\_, b) = \log(\sigma(b))$ and $\phi^*(t) = -\log(1 - e^t)$ in Equation (2). This corresponds to the Jensen-Shannon divergence (JS) (up to a constant shift that does not affect optimization). We can then rewrite $d_{s,t}$ as:
$$d_{s,t} = \mathbb{E}_{x \sim p_s}\left[\log \sigma\!\left(\hat h'(g(x))\right)\right] + \mathbb{E}_{x \sim p_t}\left[\log\left(1 - \sigma\!\left(\hat h'(g(x))\right)\right)\right], \tag{13}$$
where $\sigma(x) := \frac{1}{1 + e^{-x}}$. To simplify the notation, we write $\hat{\mathcal H} := \sigma \circ \mathcal H$. We can now re-define the pseudo-gradient $v(\omega)$ of the game as the gradient of each player's loss with respect to its own parameters. Letting $\alpha = 1$, we get from Equation (4):
$$v(\omega) := \left(\nabla_{\omega_1}\ell,\ \nabla_{\omega_2}(\ell + \lambda d_{s,t}),\ -\nabla_{\omega_3} d_{s,t}\right) \in \mathbb{R}^d. \tag{14}$$
The following propositions characterize local NE in terms of the pseudo-gradient $v(\omega)$ and its Jacobian $H(\omega)$.

Proposition 6 (Local NE). Suppose $v(\omega^*) = 0$ and
$$\begin{pmatrix} \nabla^2_{\omega_1}\ell & \nabla^2_{\omega_1,\omega_2}\ell \\ \nabla^2_{\omega_1,\omega_2}\ell & \nabla^2_{\omega_2}(\ell + \lambda d_{s,t}) \end{pmatrix} \succ 0, \qquad \nabla^2_{\omega_3} d_{s,t} \prec 0, \tag{15}$$
then $\omega^*$ is an isolated local NE.

The proof is simple and follows from Propositions 1 and 2, the definition of the vector field $v(\omega)$ and the condition $H(\omega^*) + H(\omega^*)^\top \succ 0$.

Cooperation with Competition. By examining the matrix $H(\omega)$, one can see that, in our scenario, the game is neither a potential game nor a purely adversarial game. However, we can write the vector field in the following form:
$$v(\omega) = \underbrace{\left(\nabla_{\omega_1}\ell,\ \nabla_{\omega_2}\ell,\ 0\right)}_{\nabla\phi(\omega)} + \underbrace{\left(0,\ \lambda\nabla_{\omega_2} d_{s,t},\ -\nabla_{\omega_3} d_{s,t}\right)}_{\hat v(\omega)}, \tag{16}$$
where the first part corresponds to the gradient of the potential function $\phi(\omega) = \ell(\omega_1, \omega_2)$. The second part, on the other hand, corresponds to a function $\hat v(\omega)$ whose Jacobian is a non-symmetric matrix. Analyzing them separately leads to either a potential or an adversarial game, respectively. We define this particular type of game as cooperation (i.e., the potential term) with competition (i.e., the adversarial term). It is worth noting that, while the spectrum of the game Hessian for the first term has only real eigenvalues, the second term can have complex eigenvalues with a large imaginary component. Indeed, it can be shown that this second term approximates the one obtained for a GAN using the non-saturating loss proposed by Goodfellow et al. (2014) (e.g., $\lambda = 1$). In other words, the second term can be written as the pseudo-gradient of the two-player zero-sum game $\min_{\omega_2}\max_{\omega_3} d_{s,t}$. Building on this key observation and the work of Mescheder et al. (2017); Berard et al. (2020) (Figure 4), where it was experimentally shown that the eigenvalues of the game Hessian for GANs indeed have a large imaginary component around stationary points, we can assume that the spectrum of the game Hessian in our case also has eigenvalues with a large imaginary component around stationary points. This observation can also be used with Corollary 1 to further motivate the use of higher-order ODE solvers instead of GD with the GRL.

Example 3. Consider the three-player game of Equation (16) where $\ell(w_1, w_2) = w_1^2 + 2w_1 w_2 + w_2^2$, $\lambda = 1$ and $d_{s,t}(w_2, w_3) = w_2^2 + 99 w_2 w_3 - w_3^2$.
The gradient play dynamics $\dot w = -v(w)$ become $\dot w = -Aw$, with
$$A = \nabla v(w) = \begin{pmatrix} 2 & 2 & 0 \\ 2 & 4 & 99 \\ 0 & -99 & 2 \end{pmatrix}.$$
The eigenvalues of $A$ are $2$ and $3 \pm 2i\sqrt{2449}$. From Corollary 1, $\eta$ should satisfy $0 < \eta < 6.2 \times 10^{-3}$.

Is the three-player game formulation desired? In domain adaptation, optimization is a means to an end. The final goal is to minimize the upper bound from Theorem 1 to ensure better performance in the target domain. One might then wonder whether interpreting optimality in terms of NE is desirable. In our context, NE means finding the optimal $g^*$, $\hat h^*$ and $\hat h'^*$ for the cost functions defined in Equation (4). This in turn leads to minimizing the upper bound in Theorem 1.

Remark on sequential games: Recently, Jin et al. (2020) introduced a notion of local min-max optimality for two-player games, exploiting the sequential nature of some problems in adversarial ML (e.g., GANs). In domain-adversarial learning, updates are usually performed in a simultaneous manner using the GRL. Thus, we focus here on the general case where the order of the players is not known.

B DERIVATION OF HIGH-RESOLUTION ODES

Lemma 2. The high-resolution ODE resulting from the GD algorithm with the GRL is:
$$\dot w = -v(w) - \frac{\eta}{2}\nabla v(w)v(w) + O(\eta^2). \tag{17}$$
Proof. This follows from Corollary 1 of Lu (2020).

B.1 HIGH-RESOLUTION ODE OF SECOND-ORDER RUNGE-KUTTA METHOD

The high-resolution ODE was discussed in Shi et al. (2018); Lu (2020). For discrete algorithms with the following update:
$$w^+ = w + f(\eta, w), \tag{18}$$
we can think of the trajectory as a discretization of the continuous dynamics $w : [0, +\infty) \to \mathbb{R}^d$, and in Equation (18) we have $w = w(t)$ and $w^+ = w(t + \eta)$. Here, with a slight abuse of notation, we also use $w$ for the continuous function describing the dynamics. We derive the high-resolution ODE of the second-order Runge-Kutta method:
$$w_{k+1/2} = w_k - \frac{\eta}{2\alpha}v(w_k), \qquad w_{k+1} = w_k - \eta\big((1-\alpha)v(w_k) + \alpha v(w_{k+1/2})\big),$$
where $0 < \alpha \le 1$ is a constant. If $\alpha = 1/2$, we obtain Heun's method; if $\alpha = 1$, we obtain the midpoint method; if $\alpha = 2/3$, we obtain Ralston's method. Combining the two equations, we have:
$$\frac{w_{k+1} - w_k}{\eta} = -(1-\alpha)v(w_k) - \alpha v\!\left(w_k - \frac{\eta}{2\alpha}v(w_k)\right). \tag{19}$$
Using the Taylor expansion:
$$v\!\left(w_k - \frac{\eta}{2\alpha}v(w_k)\right) = v(w_k) - \frac{\eta}{2\alpha}\nabla v(w_k)v(w_k) + O(\eta^2).$$
Plugging it back into Equation (19) and using the Taylor expansion $w_{k+1} = w_k + \eta\dot w + \eta^2\ddot w/2 + O(\eta^3)$, we have:
$$\dot w + \frac{\eta}{2}\ddot w = -v(w) + \frac{\eta}{2}\nabla v(w)v(w) + O(\eta^2). \tag{20}$$
Now we make the assumption that the high-resolution ODE has the form:
$$\dot w = f_0(w) + \eta f_1(w) + O(\eta^2). \tag{21}$$
Taking the derivative over $t$ we have:
$$\ddot w = \nabla f_0(w)f_0(w) + O(\eta). \tag{22}$$
Combining Equation (20), Equation (21) and Equation (22), we obtain:
$$f_0(w) = -v(w), \qquad f_1(w) = 0, \tag{23}$$
i.e., the high-resolution ODE of the second-order Runge-Kutta method is:
$$\dot w = -v(w) + O(\eta^2). \tag{24}$$

B.2 CONTINUOUS DYNAMICS OF EXTRA-GRADIENT (EG)

The continuous dynamics of Gradient Descent Ascent (GDA), Extra-Gradient (EG) and Heun's method can be summarized as follows:
$$\dot w = -v(w) + \alpha\nabla v(w)v(w).$$
For GDA, we have $\alpha = -\eta/2$; for EG, we have $\alpha = \eta/2$ (Lu, 2020); for Heun's method, $\dot w = -v(w) + O(\eta^2)$. The Jacobian of the dynamics at the stationary point is $-\nabla v(w^*) + \alpha\nabla v(w^*)^2$. Take $\lambda = a + ib \in \mathrm{Sp}(-\nabla v(w^*))$. The eigenvalue of the Jacobian of the dynamics is:
$$\alpha(a+ib)^2 + a + ib = a + \alpha(a^2 - b^2) + i\,b(1 + 2a\alpha). \tag{25}$$
We want the real part to be negative, i.e.:
$$a + \alpha(a^2 - b^2) < 0, \tag{26}$$
$$a(1 + \alpha a) < \alpha b^2. \tag{27}$$
For EG, $\alpha = \eta/2$ and the dynamics diverges if $a(1 + (\eta/2)a) \ge \eta b^2/2$: when $\eta$ is large and $\eta(a^2 - b^2)/2 \ge -a$, it diverges. However, the high-resolution ODE of second-order Runge-Kutta methods only requires $a < 0$.
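To make the contrast between these stability conditions concrete, the following short NumPy sketch (ours, not part of the original derivations) evaluates the Hurwitz condition of Definition 4 on the linear game of Example 3. It compares the Jacobian of the GD-with-GRL high-resolution ODE (Lemma 2) with that of the RK2 ODE (Equation (24)) for one small and one aggressive learning rate; the helper function and the choice of learning rates are illustrative assumptions.

    # Minimal sketch: Hurwitz check for the Example 3 linear game, where v(w) = A w.
    # GD with the GRL (Lemma 2): Jacobian at the stationary point is -A - (eta/2) A^2.
    # Any RK2 method (Equation (24)): Jacobian is -A, up to O(eta^2).
    import numpy as np

    A = np.array([[2.0, 2.0, 0.0],
                  [2.0, 4.0, 99.0],
                  [0.0, -99.0, 2.0]])  # Jacobian of v for Example 3

    def max_real_part(J):
        """Largest real part among the eigenvalues of the Jacobian J."""
        return np.max(np.linalg.eigvals(J).real)

    for eta in [1e-4, 1e-1]:
        J_gd_grl = -A - 0.5 * eta * (A @ A)   # GD with GRL (high-resolution ODE)
        J_rk2 = -A                            # RK2 / Heun, up to O(eta^2)
        print(f"eta={eta:.0e}  GD+GRL: {max_real_part(J_gd_grl):+.3f}  "
              f"RK2: {max_real_part(J_rk2):+.3f}")

For the small learning rate both Jacobians satisfy the Hurwitz condition, whereas for the aggressive step size only the RK2 dynamics remain stable (its Jacobian does not depend on $\eta$ at this order), which is the qualitative behavior the analysis above predicts.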
B.3 HIGH-RESOLUTION ODE OF CLASSIC FOURTH-ORDER RUNGE-KUTTA METHOD (RK4)

In this subsection, we derive the high-resolution ODE of the classic fourth-order Runge-Kutta method. We prove the following result:

Theorem 3. The high-resolution ODE of the classic fourth-order Runge-Kutta method (RK4),
$$w_{k+1} = w_k - \frac{\eta}{6}\left(v(w) + 2v_2(w) + 2v_3(w) + v_4(w)\right), \tag{28}$$
$$v_2(w) = v\!\left(w - \tfrac{\eta}{2} v(w)\right), \quad v_3(w) = v\!\left(w - \tfrac{\eta}{2} v_2(w)\right), \quad v_4(w) = v\left(w - \eta v_3(w)\right), \tag{29}$$
is
$$\dot w = -v(w) + O(\eta^4). \tag{30}$$

Proof. We use the following Taylor expansion:
$$v(w + \delta) = v(w) + \nabla v(w)\delta + \frac{1}{2}\nabla^2 v(w)(\delta, \delta) + \frac{1}{6}\nabla^3 v(w)(\delta, \delta, \delta) + O(\|\delta\|^4), \tag{31}$$
where $\nabla^2 v(w) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ is a symmetric bilinear form and $\nabla^3 v(w) : \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ is a symmetric trilinear form. With this formula we have:
$$v_4(w) = v(w) - \eta\nabla v(w)v_3(w) + \frac{\eta^2}{2}\nabla^2 v(w)(v_3(w), v_3(w)) - \frac{\eta^3}{6}\nabla^3 v(w)(v_3(w), v_3(w), v_3(w)) + O(\eta^4),$$
$$v_3(w) = v(w) - \frac{\eta}{2}\nabla v(w)v_2(w) + \frac{\eta^2}{8}\nabla^2 v(w)(v_2(w), v_2(w)) - \frac{\eta^3}{48}\nabla^3 v(w)(v_2(w), v_2(w), v_2(w)) + O(\eta^4),$$
$$v_2(w) = v(w) - \frac{\eta}{2}\nabla v(w)v(w) + \frac{\eta^2}{8}\nabla^2 v(w)(v(w), v(w)) - \frac{\eta^3}{48}\nabla^3 v(w)(v(w), v(w), v(w)) + O(\eta^4).$$
Putting them together we have:
$$v_4(w) + 2v_3(w) + 2v_2(w) + v(w) = 6v(w) - \eta\nabla v(w)\big(v_3(w) + v_2(w) + v(w)\big) + \frac{\eta^2}{2}\Big[\nabla^2 v(w)(v_3(w), v_3(w)) + \frac{1}{2}\nabla^2 v(w)(v_2(w), v_2(w)) + \frac{1}{2}\nabla^2 v(w)(v(w), v(w))\Big] - \frac{\eta^3}{4}\nabla^3 v(w)(v(w), v(w), v(w)) + O(\eta^4), \tag{35}$$
$$v_3(w) + v_2(w) + v(w) = 3v(w) - \frac{\eta}{2}\nabla v(w)\big(v_2(w) + v(w)\big) + \frac{\eta^2}{4}\nabla^2 v(w)(v(w), v(w)) + O(\eta^3), \tag{36}$$
$$v_2(w) + v(w) = 2v(w) - \frac{\eta}{2}\nabla v(w)v(w) + O(\eta^2). \tag{37}$$
Bringing Equation (37) into Equation (36), we obtain:
$$v_3(w) + v_2(w) + v(w) = 3v(w) - \eta\nabla v(w)v(w) + \frac{\eta^2}{4}\nabla^2 v(w)(v(w), v(w)) + \frac{\eta^2}{4}(\nabla v(w))^2 v(w) + O(\eta^3). \tag{38}$$
Putting Equation (38) and Equation (37) together, we have:
$$-\eta\nabla v(w)\big(v_3(w) + v_2(w) + v(w)\big) = -3\eta\nabla v(w)v(w) + \eta^2(\nabla v(w))^2 v(w) - \frac{\eta^3}{4}\nabla v(w)\nabla^2 v(w)(v(w), v(w)) - \frac{\eta^3}{4}(\nabla v(w))^3 v(w) + O(\eta^4). \tag{39}$$
Moreover, we have
$$\nabla^2 v(w)(v_2(w), v_2(w)) = \nabla^2 v(w)\!\left(v(w) - \tfrac{\eta}{2}\nabla v(w)v(w),\ v(w) - \tfrac{\eta}{2}\nabla v(w)v(w)\right) + O(\eta^2) = \nabla^2 v(w)(v(w), v(w)) - \eta\nabla^2 v(w)(\nabla v(w)v(w), v(w)) + O(\eta^2), \tag{40}$$
and similarly
$$\nabla^2 v(w)(v_3(w), v_3(w)) = \nabla^2 v(w)(v(w), v(w)) - \eta\nabla^2 v(w)(\nabla v(w)v(w), v(w)) + O(\eta^2). \tag{41}$$
Bringing Equation (39) and Equation (40) into Equation (35) results in
$$v_4(w) + 2v_3(w) + 2v_2(w) + v(w) = 6v(w) - 3\eta\nabla v(w)v(w) + \eta^2\Big((\nabla v(w))^2 v(w) + \nabla^2 v(w)(v(w), v(w))\Big) - \frac{\eta^3}{4}\Big((\nabla v(w))^3 v(w) + 3\nabla^2 v(w)(\nabla v(w)v(w), v(w)) + \nabla v(w)\nabla^2 v(w)(v(w), v(w)) + \nabla^3 v(w)(v(w), v(w), v(w))\Big) + O(\eta^4). \tag{42}$$
Let us now derive the high-resolution ODE. From Equation (28), we have:
$$\frac{w^+ - w}{\eta} = -\frac{1}{6}\big(v_4(w) + 2v_3(w) + 2v_2(w) + v(w)\big). \tag{43}$$
Let us assume that $w^+ = w(t + \eta)$ and $w = w(t)$. Expanding the left-hand side we have:
$$\frac{w^+ - w}{\eta} = \dot w + \frac{\eta}{2}\ddot w + \frac{\eta^2}{6}\dddot w + \frac{\eta^3}{24}\ddddot w + O(\eta^4). \tag{44}$$
Let us assume that the high-resolution ODE up to $O(\eta^4)$ has the form:
$$\dot w = f_0(w) + \eta f_1(w) + \eta^2 f_2(w) + \eta^3 f_3(w) + O(\eta^4). \tag{45}$$
Taking derivatives on both sides, we have:
$$\ddot w = \big(\nabla f_0(w) + \eta\nabla f_1(w) + \eta^2\nabla f_2(w) + \eta^3\nabla f_3(w)\big)\dot w + O(\eta^4) = \nabla f_0(w)f_0(w) + \eta\big(\nabla f_0(w)f_1(w) + \nabla f_1(w)f_0(w)\big) + \eta^2\big(\nabla f_1(w)f_1(w) + \nabla f_0(w)f_2(w) + \nabla f_2(w)f_0(w)\big) + O(\eta^3). \tag{46}$$
Comparing the orders $O(1)$ and $O(\eta)$ on both sides of Equation (43) we have:
$$f_0(w) = -v(w), \qquad f_1(w) + \frac{1}{2}\nabla f_0(w)f_0(w) = \frac{1}{2}\nabla v(w)v(w), \tag{47}$$
which gives $f_0(w) = -v(w)$ and $f_1(w) = 0$. Bringing this back into Equation (46) we obtain:
$$\ddot w = \nabla v(w)v(w) + O(\eta^2). \tag{48}$$
We take the derivatives on both sides of Equation (48) to get:
$$\dddot w = \nabla\big(\nabla v(w)v(w)\big)\dot w + O(\eta^2) = -\nabla\big(\nabla v(w)v(w)\big)v(w) + O(\eta^2). \tag{49}$$
Let us now compute $\nabla(\nabla v(w)v(w))v(w)$. We note that $\nabla(\nabla v(w)v(w))$ is a linear form and
$$\nabla v(w + \delta)v(w + \delta) = \big(\nabla v(w) + \nabla^2 v(w)\delta + o(\|\delta\|)\big)\big(v(w) + \nabla v(w)\delta + o(\|\delta\|)\big) = \nabla v(w)v(w) + \nabla^2 v(w)(\delta, v(w)) + (\nabla v(w))^2\delta + o(\|\delta\|). \tag{50}$$
Therefore, we have
$$\dddot w = -\nabla\big(\nabla v(w)v(w)\big)v(w) + O(\eta^2) = -\nabla^2 v(w)(v(w), v(w)) - (\nabla v(w))^2 v(w) + O(\eta^2). \tag{51}$$
With Equation (51) we can compare the $O(\eta^2)$ terms on both sides of Equation (43) and obtain
$$f_2(w) = 0. \tag{52}$$
Finally, we take the derivative of Equation (51). Since
$$\nabla^2 v(w + \delta)(v(w + \delta), v(w + \delta)) = \big(\nabla^2 v(w) + \nabla^3 v(w)\delta\big)\big(v(w) + \nabla v(w)\delta,\ v(w) + \nabla v(w)\delta\big) = \nabla^2 v(w)(v(w), v(w)) + \nabla^3 v(w)(\delta, v(w), v(w)) + 2\nabla^2 v(w)(v(w), \nabla v(w)\delta) + o(\|\delta\|), \tag{53}$$
we have
$$\nabla\big(\nabla^2 v(w)(v(w), v(w))\big)v(w) = \nabla^3 v(w)(v(w), v(w), v(w)) + 2\nabla^2 v(w)(v(w), \nabla v(w)v(w)). \tag{54}$$
Similarly, since
$$\nabla v(w + \delta)\,\nabla v(w + \delta)\,v(w + \delta) = \big(\nabla v(w) + \nabla^2 v(w)\delta\big)\big(\nabla v(w) + \nabla^2 v(w)\delta\big)\big(v(w) + \nabla v(w)\delta\big) = (\nabla v(w))^2 v(w) + \nabla^2 v(w)(\delta, \nabla v(w)v(w)) + \nabla v(w)\nabla^2 v(w)(\delta, v(w)) + (\nabla v(w))^3\delta + o(\|\delta\|), \tag{55}$$
we have
$$\nabla\big((\nabla v(w))^2 v(w)\big)v(w) = \nabla^2 v(w)(v(w), \nabla v(w)v(w)) + \nabla v(w)\nabla^2 v(w)(v(w), v(w)) + (\nabla v(w))^3 v(w). \tag{56}$$
Combining Equation (51), Equation (54) and Equation (56) gives:
$$\ddddot w = \nabla\dddot w\,\dot w = (\nabla v(w))^3 v(w) + \nabla^3 v(w)(v(w), v(w), v(w)) + 3\nabla^2 v(w)(v(w), \nabla v(w)v(w)) + \nabla v(w)\nabla^2 v(w)(v(w), v(w)) + O(\eta). \tag{57}$$
Combining Equation (45), Equation (46), Equation (51) and Equation (57) and comparing the $O(\eta^3)$ term of Equation (43), we obtain
$$f_3(w) = 0. \tag{58}$$
Therefore, the high-resolution ODE is:
$$\dot w = -v(w) + O(\eta^4). \tag{59}$$

C PROOFS AND ADDITIONAL THEORETICAL RESULTS

Several proofs in this section are based on the derivations of the high-resolution ODEs proven in Appendix B.

Theorem 4. Suppose $\phi : \mathbb{R}^n \to \mathbb{R}$ is $C^2$. Suppose also that any stationary point $x^*$ of $\phi$, i.e. such that $\nabla\phi(x^*) = 0$, is either: 1. a strict local minimum, i.e. $\nabla^2_{xx}\phi(x^*) \succ 0$; or 2. a strict saddle (Definition 5). Then, the gradient flow
$$\dot x = -\nabla\phi(x) \tag{60}$$
is attracted towards a local minimum of $\phi$.

Proof. Define $g(x) := -\nabla\phi(x)$.

a) Let us first show that strict saddles are not asymptotically stable fixed points; thus, gradient descent is not attracted to them. Near a strict saddle $x^*$ we have
$$g(x) \approx -\nabla^2_{xx}\phi(x^*)(x - x^*) \quad \text{since } \nabla\phi(x^*) = 0, \tag{61}$$
$$g(z) \approx -\Lambda(z - z^*) \quad \text{where } z := U^\top x \text{ and } \nabla^2_{xx}\phi(x^*) = U\Lambda U^\top. \tag{62}$$
Along the coordinate $i$ associated with $\lambda_{\min}(\nabla^2_{xx}\phi(x^*)) < 0$, the linearized dynamics have the positive eigenvalue $-\lambda_{\min}(\nabla^2_{xx}\phi(x^*)) > 0$. Thus, we just showed that strict saddles are not asymptotically stable, as desired. Let us now show that all the local minima are asymptotically stable, so gradient descent is attracted to them.

b) Define $V(x) := \phi(x) - \phi(x^*)$ to be a Lyapunov function. The point $x^*$ is asymptotically stable if:
$$\nabla V(x)^\top g(x) < 0 \quad \forall x \ne x^*, \tag{63}$$
from which the result follows (e.g., $\nabla V(x)^\top g(x) = -\|g(x)\|_2^2$). Similar results with a different proof can be derived from Lee et al. (2016).

Theorem 2. The high-resolution ODE of GD with the GRL up to $O(\eta)$ is:
$$\dot w = -v(w) - \frac{\eta}{2}\nabla v(w)v(w). \tag{64}$$
Moreover, this is asymptotically stable (see Appendix A.1) at a stationary point $w^*$ (Proposition 3) iff for every eigenvalue written as $\lambda = a + ib \in \mathrm{Sp}(-\nabla v(w^*))$, we have $0 > \eta(a^2 - b^2)/2 > a$.

Proof. This theorem can be obtained by computing the Jacobian of the ODE dynamics and using the Hurwitz condition (Lemma 1). Specifically, the Jacobian of the dynamics at the stationary point is:
$$-\nabla v(\omega^*) - \frac{\eta}{2}\nabla v(\omega^*)^2. \tag{65}$$
Take $\lambda = a + ib \in \mathrm{Sp}(-\nabla v(\omega^*))$; we must have:
$$\Re\left[(a + ib) - \frac{\eta}{2}(a + ib)^2\right] < 0, \quad \text{where } \Re \text{ stands for the real part}, \tag{66}$$
that is,
$$a - \frac{\eta}{2}(a^2 - b^2) < 0, \tag{67}$$
as desired.

Corollary 1. For the high-resolution ODE of GD with GRL (i.e., Equation (64)) to be asymptotically stable, the learning rate $\eta$ should be upper bounded by:
$$0 < \eta < \frac{-2a}{b^2 - a^2}, \tag{68}$$
for all $\lambda = a + ib \in \mathrm{Sp}(-\nabla v(w^*))$ with a large imaginary part (i.e., such that $|a| < |b|$).

Proof. With $|a| < |b|$, where $b$ is the imaginary part of $\lambda \in \mathbb{C}$, an algebraic manipulation of Theorem 2, i.e. $\eta(b^2 - a^2) < -2a$, leads to the desired result.

Theorem 3. The high-resolution ODE of any RK2 method up to $O(\eta)$, $\dot w = -v(w)$, is asymptotically stable if for all eigenvalues $\lambda = a + ib \in \mathrm{Sp}(-\nabla v(w^*))$ we have $a < 0$.
Proof. This theorem can be obtained by computing the Jacobian of the ODE dynamics (see Appendix B for the derivation) and using the Hurwitz condition (Lemma 1). For details, see the proof of Theorem 2.

Unlike the high-resolution ODE of GD with GRL in Corollary 1, up to $O(\eta)$, there is no upper-bound constraint on the learning rate. Therefore, at the cost of an additional extra-gradient computation, we are allowed to take a more aggressive step. We observe in practice that this leads to both faster convergence and better transfer performance.

Lemma 3. Suppose the game $G(I, \Omega_i, J_i)$ is a potential game with potential function $\phi$. Suppose the stationary points of $\phi$ are either local minima or strict saddles (see Definition 5). Then the gradient play dynamics converge to a local Nash Equilibrium.

Proof. A potential game can be analyzed in terms of the implicit function $\phi(\omega)$ since $\nabla\phi(\omega) = v(\omega)$. From Theorem 4, we know the gradient flow on $\phi(\omega)$ converges to a point $\omega^*$ that is a local minimum. Thus, the result follows from Proposition 6.

Proposition 4. The domain-adversarial game is neither a potential nor necessarily a purely adversarial game. Moreover, its gradient dynamics are not equivalent to the gradient flow.

Proof. From Appendix A.2, a necessary and sufficient condition for a game to be potential is that $\nabla v(\omega)$ is symmetric. Computing $\nabla v(\omega)$ using Equation (4) shows that $\nabla v(\omega)$ is asymmetric, from which the first part of the result follows. Another way to see this is to notice that there is no potential function $\phi$ satisfying $\nabla\phi(\omega) = v(\omega)$ (because of the flip in the sign of $d_{s,t}$). Since the game is not a potential game, the vector field is not equal to the gradient flow (see Mazumdar et al. (2020) for more details). To see that the game is not necessarily a purely adversarial game, it suffices to show an example for which this does not happen; see Appendix A.3 and the example therein.

C.1 PROPOSED LEARNING ALGORITHM

Algorithm 1 Pseudo-Code of the Proposed Learning Algorithm
Input: source data $D_s$, target data $D_t$, player losses $J_i$, network weights $\{\omega_i\}_{i=1}^3$, learning rate $\eta$
for t = 1 to T - 1 do
    $x_s, y_s \sim D_s$, $x_t \sim D_t$
    v = [gradient(xs, xt, ys, Ji) for i in 1..3]
    ω = GradientDescent(v, ω, η)            # standard DAL update, replaced by:
    ω = RungeKuttaSolver(v, ω, η, steps=1)
end for

C.2 CO APPROXIMATES RK2 (HEUN'S METHOD)

We have the update rule of CO (Mescheder et al., 2017):
$$\omega^+ = \omega - \left[\eta v(\omega) + \gamma\,\nabla\tfrac{1}{2}\|v(\omega)\|_2^2\right], \qquad \nabla\tfrac{1}{2}\|v(\omega)\|_2^2 = \nabla v(\omega)^\top v(\omega). \tag{69}$$
From RK2 (Heun's method), we have:
$$\omega^+ = \omega - \frac{\eta}{2}\big(v(\omega) + v(\omega - \eta v(\omega))\big) \tag{70}$$
$$= \omega - \frac{\eta}{2}\big(2v(\omega) - \eta\nabla v(\omega)v(\omega) + O(\eta^2)\big). \tag{71}$$
If $\nabla v(\omega)^\top = -\nabla v(\omega)$, i.e. $\nabla v(\omega)$ is skew-symmetric, we can see that Equation (71) approximates the CO optimization method from Mescheder et al. (2017). This assumption, however, does not necessarily hold in our case, which may explain why RK2 performed better in experiments. Experimentally, for CO, the best transfer performance was obtained for $\gamma = 0.0001$. We also observed that CO is very sensitive to the choice of $\gamma$.

D EXPERIMENTAL SETUP: ADDITIONAL DETAILS

Our algorithm is implemented in Jax (Bradbury et al., 2018) for the Digits and NLP benchmarks and in PyTorch for the visual task. Experiments are conducted on NVIDIA Titan V and V100 GPU cards. The digits benchmark consists of two digits datasets, MNIST and USPS (LeCun et al., 1998; Long et al., 2018), with two transfer tasks (M→U and U→M). We adopt the splits and evaluation protocol from Long et al. (2018), which uses 60,000 and 7,291 training images and the standard test sets of 10,000 and 2,007 test images for MNIST and USPS, respectively.
We follow the standard implementation details of Long et al. (2018) for this benchmark. Therefore, we use LeNet (LeCun et al., 1998) as the backbone architecture, dropout (0.5), a fixed batch size of 32, 937 iterations per epoch, and weight decay (0.005). For GD-NM, we use the default momentum value (0.9). We follow the same approach for the additional hyper-parameters of Adam.

Grid Search Details for Comparison vs Standard Optimizers. For the learning rate, learning-rate schedule and adaptation coefficient (λ), we run the following grid search and select the best hyper-parameters for each optimizer:
learning rate: 0.0001, 0.001, 0.01, 0.03, 0.1, 0.3, 0.4
learning-rate scheduler: none, polynomial
adaptation coefficient (λ): 0.1, 0.5, 1
The grid is run using Ray Tune (Liaw et al., 2018) on the transfer task M→U. To avoid expending resources on bad runs, we use a HyperBand scheduler (Li et al., 2018). The same hyper-parameters are used for the other task (i.e., U→M). The selection criterion is transfer accuracy (as is typically done in UDA).

Grid Search Details for Comparison vs Game Optimization Algorithms. Similar to the previous experiment, for the comparison vs game optimization algorithms we run the following grid search in order to select their best parameters. In the case of CO, we additionally add the extra hyper-parameter γ to the grid, as we experimentally found this optimizer to be sensitive to it:
learning rate: 0.0001, 0.001, 0.006, 0.01, 0.03, 0.1, 0.3, 0.4
learning-rate scheduler: none, polynomial
adaptation coefficient (λ): 0.1, 0.5, 1
CO only, additional (γ): 10, 1.0, 0.1, 0.01, 0.001, 0.0001
We show the best results of this grid search in Table 6. For EG, we observe that training diverges for learning rates greater than 0.006.

Experiment on VisDA 2017. For the visual tasks, our implementation is built on PyTorch (1.5.1) and on top of the publicly available implementations of Long et al. (2018); Zhang et al. (2019); Junguang Jiang (2020). We did not run a grid search here, as it would be very computationally expensive. For the optimizer parameters, we thoroughly tune GD with Nesterov momentum following the recommendations and implementations of Long et al. (2018); Zhang et al. (2019). For our method, we keep the exact same hyper-parameters as GD with NM, but increase the learning rate. We do not experiment with RK4 since its performance was similar to RK2 on the other benchmarks (see Tables 3 and 4) and it is more computationally expensive.

Experiment on VisDA 2017 with SoTA algorithms. For the comparison using SoTA algorithms, we use the well-tuned hyper-parameters from Jiang et al. (2020) and set the discrepancy measure to f-DAL Pearson as in Acuna et al. (2021), since this was empirically shown to be better than both JS and γ-JS/MDD. We use Implicit Alignment (Jiang et al., 2020), as this method also allows accounting for the dissimilarity between the label marginals. We did not run a grid search here, as it would be very computationally expensive. In this case, we only increase the learning rate to 0.2.

Experiment on Natural Language. In this experiment, we use the Amazon product reviews dataset (Blitzer et al., 2006), which contains online reviews of products collected on the Amazon website. We follow the splits and evaluation protocol from Courty et al. (2017); Dhouib et al. (2020)
and choose 4 of its subsets corresponding to different product categories, namely: books (B), dvd (D), electronics (E) and kitchen (K). As in Courty et al. (2017); Dhouib et al. (2020), from which we took the baseline result, the network architecture is a two-layer MLP with sigmoid activation. In our case, hyper-parameters are kept the same but the learning rate is increased by a factor of 10.

Implementation on different frameworks. We use both Jax and PyTorch in our paper for simplicity. Since the proposed method only requires changing a few lines of code, it can easily be implemented in different frameworks. Prototyping small-scale experiments, such as those on the Digits datasets, is very easy in Jax. We then use PyTorch to integrate our optimizers with large-scale SoTA UDA methods (whose codebases are implemented in PyTorch). This shows the versatility and simplicity of implementation of high-order ODE solvers. We did not encounter any issue while working with either Jax or PyTorch; our choice of framework was simply based on modifying existing source code to evaluate our optimizer.

Remark on error bars. For experimental analyses involving a grid search, we use the same seed and identically initialize the networks. That is, we give every optimizer the same fair opportunity to win, as reported above. Note that reporting the average and standard deviation over a grid search significantly increases the computational demand, so we did not do this. That said, we report the average and standard deviation for our main experiments using the best parameters found in the grid search.

E ADDITIONAL EXPERIMENTS

E.1 NATURAL LANGUAGE PROCESSING TASKS

We also evaluate our approach on natural language processing tasks. We use the Amazon product reviews dataset (Blitzer et al., 2006). We follow the splits and evaluation protocol from Courty et al. (2017); Dhouib et al. (2020) and choose 4 of its subsets corresponding to different product categories, namely: books (B), dvd (D), electronics (E) and kitchen (K). In our case, hyper-parameters are the same as provided by the authors, but the learning rate is increased by a factor of 10. Table 4 shows noticeable gains from replacing the optimizer with either RK2 or RK4.

Table 4: Accuracy (%) on the Amazon Reviews Sentiment Analysis Dataset (NLP)
                            B→D   B→E   B→K   D→B   D→E   D→K   E→B   E→D   E→K   K→B   K→D   K→E   Avg
DANN (Dhouib et al., 2020)  80.6  74.7  76.7  74.7  73.8  76.5  71.8  72.6  85.0  71.8  73.0  84.7  76.3
with RK2                    82.3  74.3  75.7  80.2  79.2  79.6  72.3  74.2  86.5  72.6  73.6  86.2  78.1
with RK4                    81.6  75.0  78.1  79.4  78.3  80.0  74.1  74.6  86.7  73.5  74.6  86.0  78.5

E.2 SENSITIVITY TO SAMPLING NOISE

In Table 5, we show the sensitivity of our method to sampling noise, controlled by the mini-batch size. Specifically, we show results for batch sizes 64, 128 and 160 (with 160 being the largest size we could fit in GPU memory). We also add results for GD. We observe that our method performs slightly better when the batch size is larger.

Table 5: Sensitivity to sampling noise, controlled by the batch size, on the VisDA dataset.
Method       Batch Size   Avg     Std
ResNet-50    64           67.82   0.29
             128          67.50   0.11
             160          67.42   0.24
Ours (RK2)   64           73.20   0.36
             128          73.81   0.26
             160          74.18   0.15

Table 6: Comparison vs game optimization algorithms (best result from the grid search).
Algorithm                      M→U
GD                             90.0
GD-NM                          93.2
EG (Korpelevich, 1976)         86.7
CO (Mescheder et al., 2017)    93.8
Ours (RK2)                     95.3
Ours (RK4)                     95.9

E.3 ADDITIONAL COMPARISON VS GAME OPTIMIZATION ALGORITHMS

Table 6 shows the results of the grid search for the comparison vs game optimization algorithms.
For each optimizer, these results correspond to the best configuration (in terms of transfer performance) determined from the grid search explained in Appendix D. We also add GD for comparison. Our method notably outperforms the baselines.

E.4 CO VS GRADIENT DESCENT AND EXTRA-GRADIENT ALGORITHMS

In Table 6 we can observe that RK solvers outperform the baselines. Interestingly, we can also observe that CO outperforms both EG and GD-NM. In this section, we provide some intuition about why this may be the case in practice and conduct an additional experiment in this direction. In the analysis presented in Section 4, we show that a better discretization implies better convergence to local optimality (assuming $\omega^*$ exists and that it is a strict local NE). Specifically, under those assumptions, we show that for both the GD and EG methods we must place an upper bound on the learning rate (see Section 4 and Appendix B.2). In Appendix C.2, on the other hand, we show that CO can be interpreted in the limit as an approximation of the RK2 solver. Based on this, we believe that in practice CO may approximate the continuous dynamics better than GD and EG, particularly given its additional hyper-parameter (γ) that can be tuned thoroughly. We believe this might also be the reason why we found CO to be very sensitive to the γ parameter. In Table 7, we compare the performance before and after removing the best-performing γ from the grid search. We can see that for values other than the best γ, CO does not outperform GD-NM either.

Table 7: Performance of CO vs others. CO (γ = 1e-3) corresponds to the best result after removing γ = 1e-4 from the grid search.
Method           M→U
GD               90.0
GD-NM            93.2
CO (γ = 1e-4)    93.8
CO (γ = 1e-3)    91.6

E.5 WALL-CLOCK COMPARISON

Table 8: Average time per iteration (wall-clock comparison).
Algorithm     Avg time per iteration
GD-NM         0.26 ± 0.003
Ours (RK2)    0.49 ± 0.008

In Table 8, we show the experimental results of a wall-clock comparison between RK2 and GD-NM. Specifically, we compute the average over 100 runs on an NVIDIA TITAN V GPU, CUDA version 10.1 and PyTorch version 1.5.1. As we can see, RK2 is less than 2x slower per iteration in wall-clock time. This is in line with the claims made in the paper. We additionally emphasize that this time can be improved with more efficient implementations (e.g., CUDA). Moreover, many more efficient high-order ODE solvers exist and could be used in the future. Our work also inspires future research in this direction.

F PYTORCH PSEUDO-CODE OF RK2 SOLVER

Below, we show how to implement the RK2 solver in PyTorch by simply modifying the existing implementation of SGD.

# Based on https://github.com/pytorch/pytorch/blob/release/1.5/torch/optim/sgd.py
import torch

class RK2(torch.optim.Optimizer):
    # ...... (constructor identical to SGD, omitted)
    def update_step(self, fwd_bwd_fn):
        """
        fwd_bwd_fn: fn that computes y = model(x); loss(y, yhat).backward()
        example:
            # training loop
            x, yhat = sample_from_dataloader()
            optimizer.zero_grad()
            def fwd_bwd_fn():
                y = model(x)
                loss = compute_loss(y, yhat)
                loss.backward()
                return loss, y
            fwd_bwd_fn()
            optimizer.update_step(fwd_bwd_fn)
        """
        # We want to compute the update equation:
        # w - n/2 * v(w) - n/2 * v(w - n*v(w))
        temp_step = {}
        with torch.no_grad():
            for jj_, group in enumerate(self.param_groups):
                # ignore weight decay for simplicity
                for ii_, p in enumerate(group['params']):
                    if p.grad is None:
                        continue
                    d_p = p.grad  # v(w)
                    # first part of the step: w - n/2 * v(w)
                    p1 = p.data.clone().detach()
                    p1.add_(d_p, alpha=-0.5 * group['lr'])
                    # storing for later use
                    temp_step[f"{jj_}_{ii_}"] = p1
                    # computing now w - n*v(w) for the second part of the step
                    p.add_(d_p, alpha=-1.0 * group['lr'])
                    # set grads to none
                    p.grad = None  # self.zero_grad()
        # extra step evaluation:
        # this function does a model forward, computes the loss and calls backward.
        loss, _ = fwd_bwd_fn()
        # gradients are now v(w - n*v(w))
        with torch.no_grad():
            for jj_, group in enumerate(self.param_groups):
                for ii_, p in enumerate(group['params']):
                    if p.grad is None:
                        continue
                    d_p = p.grad  # v(w - n*v(w))
                    # unpacking the first part of the step
                    p1 = temp_step[f"{jj_}_{ii_}"]
                    # updating p with the stored value from before.
                    # this is equivalent to p = p1 = w - n/2 * v(w) (but slightly faster)
                    p.zero_()
                    p.add_(p1)
                    # w = w - n/2*v(w) - n/2 * v(w - n*v(w))
                    p.add_(d_p, alpha=-0.5 * group['lr'])
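For completeness, the following is a hypothetical end-to-end usage sketch (not taken from the paper's released code) showing how the RK2 solver above could drive one domain-adversarial update in the spirit of Algorithm 1. The module names (feat, clf, dom), the GradReverse layer, the toy tensors and the SGD-style constructor assumed for RK2 are all illustrative placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GradReverse(torch.autograd.Function):
        # GRL: identity in the forward pass, flips the gradient sign in the backward pass
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)
        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    feat = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())  # g
    clf = nn.Linear(128, 10)   # classifier \hat{h}
    dom = nn.Linear(128, 1)    # domain classifier \hat{h}'
    params = list(feat.parameters()) + list(clf.parameters()) + list(dom.parameters())
    optimizer = RK2(params, lr=0.3)  # assumes an SGD-style constructor (omitted above)

    xs, ys = torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))  # toy source batch
    xt = torch.randn(32, 1, 28, 28)                                   # toy target batch

    def fwd_bwd_fn():
        # one forward/backward over the same mini-batch; the GRL flips the sign of
        # the domain-loss gradient w.r.t. the features
        zs, zt = feat(xs), feat(xt)
        task_loss = F.cross_entropy(clf(zs), ys)
        d_logits = dom(GradReverse.apply(torch.cat([zs, zt]), 1.0))
        d_labels = torch.cat([torch.ones(len(zs), 1), torch.zeros(len(zt), 1)])
        dom_loss = F.binary_cross_entropy_with_logits(d_logits, d_labels)
        loss = task_loss + dom_loss
        loss.backward()
        return loss, None

    optimizer.zero_grad()
    fwd_bwd_fn()                       # populates v(w)
    optimizer.update_step(fwd_bwd_fn)  # RK2 step: re-evaluates the gradients at w - eta*v(w)

The key point is that the second call to fwd_bwd_fn inside update_step re-evaluates the gradients at $w - \eta v(w)$, so the only change with respect to a standard DANN training loop is passing the forward-backward closure to the optimizer.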