# Deep Learning Games

Dale Schuurmans* (Google), daes@ualberta.ca
Martin Zinkevich (Google), martinz@google.com

We investigate a reduction of supervised learning to game playing that reveals new connections and learning methods. For convex one-layer problems, we demonstrate an equivalence between global minimizers of the training problem and Nash equilibria in a simple game. We then show how the game can be extended to general acyclic neural networks with differentiable convex gates, establishing a bijection between the Nash equilibria and critical (or KKT) points of the deep learning problem. Based on these connections we investigate alternative learning methods, and find that regret matching can achieve competitive training performance while producing sparser models than current deep learning strategies.

## 1 Introduction

In this paper, we investigate a new approach to reducing supervised learning to game playing. Unlike well-known reductions [8, 29, 30], we avoid duality as a necessary component in the reduction, which allows a more flexible perspective that can be extended to deep models. An interesting finding is that the no-regret strategies used to solve large-scale games [35] provide effective stochastic training methods for supervised learning problems. In particular, regret matching [12], a step-size-free algorithm, appears capable of efficient stochastic optimization performance in practice.

A central contribution of this paper is to demonstrate how supervised learning of a directed acyclic neural network with differentiable convex gates can be expressed as a simultaneous-move game with simple player actions and utilities. For variations of the learning problem (i.e., whether regularization is considered) we establish connections between the critical points (or KKT points) and Nash equilibria in the corresponding game. As expected, deep learning games are not simple, since even approximately training deep models is hard in the worst case [13]. Nevertheless, the reduction reveals new possibilities for training deep models that have not been previously considered. In particular, we discover that regret matching with simple initialization can offer competitive training performance compared to state-of-the-art deep learning heuristics while providing sparser solutions.

Recently, we have become aware of unpublished work [2] that also proposes a reduction of supervised deep learning to game playing. Although the reduction presented in this paper was developed independently, we acknowledge that others have also begun to consider the connection between deep learning and game theory. We compare these two specific reductions in Appendix J, and outline the distinct advantages of the approach developed in this paper.

## 2 One-Layer Learning Games

We start by considering the simpler one-layer case, which allows us to introduce the key concepts that will then be extended to deep models. Consider the standard supervised learning problem where one is given a set of paired data $\{(x_t, y_t)\}_{t=1}^T$ such that $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$, and wishes to learn a predictor $h : \mathcal{X} \to \mathcal{Y}$. For simplicity, we assume $\mathcal{X} = \mathbb{R}^m$ and $\mathcal{Y} = \mathbb{R}^n$. A standard generalized linear model can be expressed as $h(x) = \phi(\theta x)$ for some output transfer function $\phi : \mathbb{R}^n \to \mathbb{R}^n$ and matrix $\theta \in \mathbb{R}^{n \times m}$ denoting the trainable parameters of the model.

*Work performed at Google Brain while on a sabbatical leave from the University of Alberta.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
Despite the presence of the transfer function $\phi$, such models are typically trained by minimizing an objective that is convex in $z = \theta x$.

**OLP (One-layer Learning Problem).** Given a loss function $\ell : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ that is convex in the first argument, let $\ell_t(z) = \ell(z, y_t)$ and $L_t(\theta) = \ell_t(\theta x_t)$. The training problem is to minimize $L(\theta) = \frac{1}{T}\sum_{t=1}^T L_t(\theta)$ with respect to the parameters $\theta$.

We first identify a simple game whose Nash equilibria correspond to global minima of the one-layer learning problem. This basic relationship establishes a connection between supervised learning and game playing that we will exploit below. Although this reduction is not a significant contribution by itself, the one-layer case allows us to introduce some key concepts that we will deploy later when considering deep neural networks. A one-shot simultaneous-move game is defined by specifying: a set of players, a set of actions for each player, and a set of utility functions that specify the value to each player given a joint action selection [36, page 9] (also see Appendix E). Corresponding to the OLP specified above, we propose the following game.

**OLG (One-layer Learning Game).** There are two players, a protagonist $p$ and an antagonist $a$. The protagonist chooses a parameter matrix $\theta \in \mathbb{R}^{n \times m}$. The antagonist chooses a set of $T$ vectors and scalars $\{(a_t, b_t)\}_{t=1}^T$, $a_t \in \mathbb{R}^n$, $b_t \in \mathbb{R}$, such that $a_t^\top z + b_t \le \ell_t(z)$ for all $z \in \mathbb{R}^n$; that is, the antagonist chooses an affine minorant of the local loss for each training example. (For a differentiable convex $\ell_t$, the tangent at any point $z_0$, given by $a_t = \nabla \ell_t(z_0)$ and $b_t = \ell_t(z_0) - a_t^\top z_0$, is one such minorant.) Both players make their action choice without knowledge of the other player's choice. Given a joint action selection $(\theta, \{a_t, b_t\})$, we define the utility of the antagonist as $U^a = \frac{1}{T}\sum_{t=1}^T \big(a_t^\top \theta x_t + b_t\big)$, and the utility of the protagonist as $U^p = -U^a$.

This is a two-person zero-sum game with continuous actions. A Nash equilibrium is defined by a joint assignment of actions such that no player has any incentive to deviate. That is, if $\sigma_p = \theta$ denotes the action choice of the protagonist and $\sigma_a = \{a_t, b_t\}$ the choice of the antagonist, then the joint action $\sigma = (\sigma_p, \sigma_a)$ is a Nash equilibrium if $U^p(\tilde\sigma_p, \sigma_a) \le U^p(\sigma_p, \sigma_a)$ for all $\tilde\sigma_p$, and $U^a(\sigma_p, \tilde\sigma_a) \le U^a(\sigma_p, \sigma_a)$ for all $\tilde\sigma_a$. Using this characterization one can then determine a bijection between the Nash equilibria of the OLG and the global minimizers of the OLP.

**Theorem 1.** (1) If $(\theta^*, \{a_t, b_t\})$ is a Nash equilibrium of the OLG, then $\theta^*$ must be a global minimum of the OLP. (2) If $\theta^*$ is a global minimizer of the OLP, then there exists an antagonist strategy $\{a_t, b_t\}$ such that $(\theta^*, \{a_t, b_t\})$ is a Nash equilibrium of the OLG. (All proofs are given in the appendix.)

Thus far, we have ignored the fact that it is important to control model complexity to improve generalization, not merely minimize the loss. Although model complexity is normally controlled by regularizing $\theta$, we will find it more convenient to equivalently introduce a constraint $\theta \in \Theta$ for some convex set $\Theta$ (which we assume satisfies an appropriate constraint qualification; see Appendix C). The learning problem and corresponding game can then be modified accordingly while still preserving the bijection between their solution concepts.

**OCP (One-layer Constrained Learning Problem).** Add the optimization constraint $\theta \in \Theta$ to the OLP.

**OCG (One-layer Constrained Learning Game).** Add the protagonist action constraint $\theta \in \Theta$ to the OLG.

**Theorem 2.** (1) If $(\theta^*, \{a_t, b_t\})$ is a Nash equilibrium of the OCG, then $\theta^*$ must be a constrained global minimum of the OCP.
(2) If $\theta^*$ is a constrained global minimizer of the OCP, then there exists an antagonist strategy $\{a_t, b_t\}$ such that $(\theta^*, \{a_t, b_t\})$ is a Nash equilibrium of the OCG.

### 2.1 Learning Algorithms

The tight connection between convex learning and two-person zero-sum games raises the question of whether techniques for finding Nash equilibria might offer alternative training approaches. Surprisingly, the answer appears to be yes. There has been substantial progress in online algorithms for finding Nash equilibria, both in theory [5, 24, 34] and practice [35]. In the two-person zero-sum case, large games are solved by pitting two regret-minimizing learning algorithms against each other, exploiting the fact that when both achieve a regret rate of $\epsilon/2$, their respective average strategies form an $\epsilon$-Nash equilibrium [38]. For the game as described above, where the protagonist action is $\theta \in \Theta$ and the antagonist action is denoted $\sigma_a$, we imagine playing in rounds, where on round $k$ the joint action is denoted by $\sigma^{(k)} = (\theta^{(k)}, \sigma_a^{(k)})$. Since the utility function of each player, $U^i$ for $i \in \{p, a\}$, is affine in their own action choice for any fixed action chosen by the other player, each player faces an online convex optimization problem [37] (note that maximizing $U^i$ is equivalent to minimizing $-U^i$; see also Appendix G). The total regret of a player, say the protagonist, is defined with respect to their utility function after $K$ rounds as
$$R^p(\sigma^{(1)}, \ldots, \sigma^{(K)}) = \max_{\theta \in \Theta} \sum_{k=1}^K \Big( U^p(\theta, \sigma_a^{(k)}) - U^p(\theta^{(k)}, \sigma_a^{(k)}) \Big).$$
(Nature can also be introduced to choose a random training example on each round, which simply requires the definition of regret to be expressed in terms of expectations over nature's choices.)

To accommodate regularization in the learning problem, we impose parameter constraints $\Theta$. A particularly interesting case occurs when one defines $\Theta = \{\theta : \|\theta\|_1 \le \beta\}$, since the $L_1$ ball constraint is equivalent to imposing $L_1$ regularization. There are two distinct advantages to $L_1$ regularization in this context. First, as is well known, $L_1$ encourages sparsity in the solution. Second, and much less appreciated, is the fact that any polytope constraint allows one to reduce the constrained online convex optimization problem to learning from expert advice over a finite number of experts [37]: given a polytope $\Theta$, define the convex hull basis $H(\Theta)$ to be a matrix whose columns are the vertices of $\Theta$. An expert can then be assigned to each vertex in $H(\Theta)$, and an algorithm for learning from expert advice can be applied by mapping its strategy on round $k$, $\rho^{(k)}$ (a probability distribution over the experts), back to an action choice in the original problem via $\theta^{(k)} = H(\Theta)\rho^{(k)}$, while the utility vector on round $k$, $u^{(k)}$, can be passed back to the experts via $H(\Theta)^\top u^{(k)}$ [37].

Since this reduction allows any method for learning from expert advice to be applied to $L_1$-constrained online convex optimization, we investigated whether alternative algorithms for supervised training might be uncovered. We considered two algorithms for learning from expert advice: the normalized exponentiated weight algorithm (EWA) [22, 32] (Algorithm 3), and regret matching (RM), a simpler method from the economics and game theory literature [12] (Algorithm 2). For supervised learning, these algorithms operate by using a stochastic sample of the gradient to perform their updates (outer loop, Algorithm 1).
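As a concrete illustration (ours, not the paper's code), the following sketch applies regret matching over the $2d$ vertices of the $L_1$ ball to a univariate logistic-loss problem with labels $y_t \in \{-1, +1\}$; the antagonist's best response is implicit, since for a differentiable convex loss the tangent at $z = \theta x_t$ is an affine minorant that is tight at the current prediction, so the expert utilities are just $H(\Theta)^\top$ applied to the negative sampled gradient. Function names and defaults are illustrative assumptions.

```python
import numpy as np

def logistic_grad(theta, X, y):
    """Gradient of the mean logistic loss log(1 + exp(-y * (X @ theta))), y in {-1, +1}."""
    margin = y * (X @ theta)
    return X.T @ (-y / (1.0 + np.exp(margin))) / len(y)

def regret_matching_l1(X, y, beta=10.0, steps=5000, batch=10, seed=0):
    """Regret matching over the 2d vertices {+beta*e_i, -beta*e_i} of the L1 ball."""
    rng = np.random.default_rng(seed)
    T, d = X.shape
    H = beta * np.hstack([np.eye(d), -np.eye(d)])    # convex hull basis H(Theta), d x 2d
    r = np.zeros(2 * d)                              # cumulative regrets, one per expert
    rho = np.full(2 * d, 1.0 / (2 * d))              # distribution over experts
    theta_sum = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, T, size=batch)         # nature samples a mini-batch
        theta = H @ rho                              # protagonist action theta = H(Theta) rho
        u = -logistic_grad(theta, X[idx], y[idx])    # gradient of the protagonist utility U^p
        eu = H.T @ u                                 # utilities passed back to the experts
        r += eu - rho @ eu                           # instantaneous regret vs. played mixture
        pos = np.maximum(r, 0.0)
        rho = pos / pos.sum() if pos.sum() > 0 else np.full(2 * d, 1.0 / (2 * d))
        theta_sum += H @ rho
    return theta_sum / steps                         # running average protagonist strategy
```

Tracking the average strategy, rather than the last iterate, is what the regret-to-equilibrium argument above licenses.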
EWA possesses superior regret bounds that demonstrate only a logarithmic dependence on the number of actions; however, RM is simpler, hyperparameter-free, and still possesses reasonable regret bounds [9, 10]. Although exponentiated gradient methods have been applied to supervised learning [18, 32], we are not aware of any previous attempt to apply regret matching to supervised training. We compared these to projected stochastic gradient descent (PSGD), which is the obvious modification of stochastic gradient descent (SGD) that retains a similar regret bound [7, 28] (Algorithm 4).

### 2.2 Evaluation

To investigate the utility of these methods for supervised learning, we conducted experiments on synthetic data and on the MNIST data set [20]. Note that PSGD and EWA have a step-size parameter, $\eta^{(k)}$, that greatly affects their performance. The best regret bounds are achieved for step sizes of the form $\eta\, k^{-1/2}$ and $\eta \log(m)\, k^{-1/2}$ respectively [28]; we also tuned $\eta$ to generate the best empirical results. Since the underlying optimization problems are convex, these experiments merely focus on the speed of convergence to a global minimum of the constrained training problem.

The first set of experiments considered synthetic problems. The data dimension was set to $m = 10$, and $T = 100$ training points were drawn from a standard multivariate Gaussian. For univariate prediction, a random hyperplane was chosen to label the data (hence the data was linearly separable, but not with a large margin). The logistic training loss over the entire training set achieved by the running average protagonist strategy $\bar\theta$ is plotted in Figure 1a. For multivariate prediction, a $4 \times 10$ target matrix $\theta^*$ was randomly generated to label the training data by $\arg\max(\theta^* x_t)$. The softmax training loss over the entire training set achieved by the running average protagonist strategy $\bar\theta$ is shown in Figure 1b. The third experiment was conducted on MNIST, which is an $n = 10$ class problem over $m = 784$-dimensional inputs with $T = 60{,}000$ training examples, evidently not linearly separable. For this experiment, we used mini-batches of size 100. The training loss of the running average protagonist strategy $\bar\theta$ (single run) is shown in Figure 1c.

The apparent effectiveness of RM in these experiments is a surprising outcome. Even after tuning $\eta$ for both PSGD and EWA, they do not surpass the performance of RM, which is hyperparameter-free. We did not anticipate this observation; the effectiveness of RM for supervised learning appears not to have been previously noticed. (We do not expect RM to be competitive on high-dimensional sparse problems, since its regret bound has a square-root rather than logarithmic dependence on $n$ [9].)

Figure 1: Training loss achieved by different no-regret algorithms. (a) Logistic loss, synthetic data. (b) Softmax loss, synthetic data. (c) Softmax loss, MNIST data. Subfigures (a) and (b) are averaged over 100 repeats (log-scale x-axis); subfigure (c) is averaged over 10 repeats (PSGD with the theoretical step size is off scale).

## 3 Deep Learning Games

A key contribution of this paper is to show how the problem of training a feedforward neural network with differentiable convex gates can be reduced to a game. A practical consequence of this reduction is that it suggests new approaches to training deep models that are inspired by methods that have recently proved successful for solving massive-scale games.
**Feedforward Neural Network.** A feedforward neural network is defined by a directed acyclic graph with additional objects attached to the vertices and edges. The network architecture is specified by $N = (V, E, I, O, F)$, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of edges, $I = \{i_1, \ldots, i_m\} \subseteq V$ is a set of input vertices, $O = \{o_1, \ldots, o_n\} \subseteq V$ is a set of output vertices, and $F = \{f_v : v \in V\}$ is a set of activation functions, where $f_v : \mathbb{R} \to \mathbb{R}$. The trainable parameters are given by $\theta : E \to \mathbb{R}$.

In the graph $G = (V, E)$, a path $(v_1, \ldots, v_k)$ is a sequence of vertices such that $(v_j, v_{j+1}) \in E$ for all $j$. A cycle is a path whose first and last vertices are equal. We assume that $G$ contains no cycles, the input vertices have no incoming edges (i.e., $(u, i) \notin E$ for all $i \in I$, $u \in V$), and the output vertices have no outgoing edges (i.e., $(o, v) \notin E$ for all $o \in O$, $v \in V$). A directed acyclic graph generates a partial order on the vertices where $u \preceq v$ if and only if there is a path from $u$ to $v$. For all $v \in V$, define $E_v = \{(u, u') \in E : u' = v\}$, the set of edges incoming to $v$.

The network is related to the training data by assuming $|I| = m$, so the number of input vertices corresponds to the number of input features, and $|O| = n$, so the number of output vertices corresponds to the number of output dimensions. It is a good idea (but not required) to have two additional bias inputs, whose corresponding input features are always set to 0 and 1, respectively, and which have edges to all non-input nodes in the graph. Usually, the activation functions on input and output nodes are the identity, i.e., $f_v(x) = x$ for $v \in I \cup O$. Given a training input $x_t \in \mathbb{R}^m$, the computation of the network $N$ is expressed by a circuit value function $c_t$ that assigns values to each vertex based on the partial order over vertices:
$$c_t(i_k, \theta) = f_{i_k}(x_{tk}) \ \text{ for } i_k \in I; \qquad c_t(v, \theta) = f_v\Big(\sum_{u : (u, v) \in E} c_t(u, \theta)\,\theta(u, v)\Big) \ \text{ for } v \in V \setminus I. \tag{1}$$
Let $c_t(o, \theta)$ denote the vector of values at the output vertices, i.e., $(c_t(o, \theta))_k = c_t(o_k, \theta)$. Since each $f_v$ is assumed differentiable, the output $c_t(o, \theta)$ must also be differentiable with respect to $\theta$.

When we wish to impose constraints on $\theta$, we assume the constraints factor over vertices and are applied across the incoming edges to each vertex. That is, for each $v \in V \setminus I$ the parameters $\theta$ restricted to $E_v$ are required to be in a set $\Theta_v \subseteq \mathbb{R}^{E_v}$, and $\Theta = \prod_{v \in V \setminus I} \Theta_v$. (We additionally assume each $\Theta_v$ satisfies constraint qualifications, see Appendix C, and we can also alter the factorization requirement to allow more complex network architectures, see Appendix H.) If $\Theta = \mathbb{R}^E$, we consider the network to be unconstrained. If $\Theta$ is bounded, we consider the network to be bounded.

**DLP (Deep Learning Problem).** Given a loss function $\ell(z, y)$ that is convex in the first argument and satisfies $0 \le \ell(z, y) < \infty$ for all $z \in \mathbb{R}^n$, define $\ell_t(z) = \ell(z, y_t)$ and $L_t(\theta) = \ell_t(c_t(o, \theta))$. The training problem is to find a $\theta \in \Theta$ that minimizes $L(\theta) = \frac{1}{T}\sum_{t=1}^T L_t(\theta)$.

**DLG (Deep Learning Game).** We define a one-shot simultaneous-move game [36, page 9] with infinite action sets (Appendix E); we need to specify the players, action sets, and utility functions.

*Players:* The players consist of a protagonist $p_v$ for each $v \in V \setminus I$, an antagonist $a$, and a set of self-interested zannis $s_v$, one for each vertex $v \in V$.²

² Nomenclature explanation: protagonists nominally strive toward a common goal, but their actions can interfere with one another. Zannis are traditionally considered servants, but their motivations are not perfectly aligned with the protagonists'. The antagonist is diametrically opposed to the protagonists.

*Actions:* The protagonist for vertex $v$ chooses a parameter function $\theta_v \in \Theta_v$. The antagonist chooses a set of $T$ vectors and scalars $\{(a_t, b_t)\}_{t=1}^T$, $a_t \in \mathbb{R}^n$, $b_t \in \mathbb{R}$, such that $a_t^\top z + b_t \le \ell_t(z)$ for all $z \in \mathbb{R}^n$; that is, the antagonist chooses an affine minorant of the local loss for each training example.
Each zanni $s_v$ chooses a set of $2T$ scalars $\{(q_{vt}, d_{vt})\}_{t=1}^T$, $q_{vt} \in \mathbb{R}$, $d_{vt} \in \mathbb{R}$, such that $q_{vt} z + d_{vt} \le f_v(z)$ for all $z \in \mathbb{R}$; that is, each zanni chooses an affine minorant of its local activation function $f_v$ for each training example. All players make their action choices without knowledge of the other players' choices.

*Utilities:* For a joint action $\sigma = (\theta, \{a_t, b_t\}, \{q_{vt}, d_{vt}\})$, the zannis' utilities are defined recursively following the partial order on vertices. First, for each $i \in I$ the utility of zanni $s_i$ on training example $t$ is $U^s_{it}(\sigma) = d_{it} + q_{it} x_{ti}$, and for each $v \in V \setminus I$ the utility of zanni $s_v$ on example $t$ is $U^s_{vt}(\sigma) = d_{vt} + q_{vt} \sum_{u : (u, v) \in E} U^s_{ut}(\sigma)\,\theta(u, v)$. The total utility of each zanni $s_v$ is given by $U^s_v(\sigma) = \sum_{t=1}^T U^s_{vt}(\sigma)$ for $v \in V$. The utility of the antagonist $a$ is then given by $U^a(\sigma) = \frac{1}{T}\sum_{t=1}^T U^a_t(\sigma)$, where $U^a_t(\sigma) = b_t + \sum_{k=1}^n a_{tk}\, U^s_{o_k t}(\sigma)$. The utility of every protagonist is the same, $U^p(\sigma) = -U^a(\sigma)$. (This representation also allows for an equivalent game where nature selects an example $t$, tells the antagonist and the zannis, and then everyone plays their actions simultaneously.)

The next lemma shows how the zannis and the antagonist can be expected to act.

**Lemma 3.** Given a fixed protagonist action $\theta$, there exists a unique joint action for all agents $\sigma = (\theta, \{a_t, b_t\}, \{q_{vt}, d_{vt}\})$ in which the zannis and the antagonist are playing best responses to $\sigma$. Moreover, $U^p(\sigma) = -L(\theta)$, $\nabla_\theta U^p(\sigma) = -\nabla L(\theta)$, and, given some protagonist at $v \in V \setminus I$, if we hold all other agents' strategies fixed, $U^p(\sigma)$ is an affine function of the strategy of the protagonist at $v$.

We call this $\sigma$ the joint action expansion for $\theta$. There is more detail in the appendix about the joint action expansion. However, the key point is that if the current cost and partial derivatives can be calculated for each parameter, one can construct the affine utility function facing each agent. We will return to this in Section 3.1.

A KKT point is a point that satisfies the KKT conditions [15, 19]: roughly, either it is a critical point (where the gradient is zero), or it is a point on the boundary of $\Theta$ where the gradient points perpendicularly out of $\Theta$ (see Appendix C). We can now state the main theorem of the paper, showing a one-to-one relationship between KKT points and Nash equilibria.

**Theorem 4 (DLG Nash Equilibrium).** The joint action $\sigma = (\theta, \{a_t, b_t\}, \{q_{vt}, d_{vt}\})$ is a Nash equilibrium of the DLG iff it is the joint action expansion for $\theta$ and $\theta$ is a KKT point of the DLP.

**Corollary 5.** If the network is unbounded, the joint action $\sigma = (\theta, \{a_t, b_t\}, \{q_{vt}, d_{vt}\})$ is a Nash equilibrium of the DLG iff it is the joint action expansion for $\theta$ and $\theta$ is a critical point of the DLP.

Finally, we note that sometimes we need to add constraints between edges incident on different nodes. For example, in a convolutional neural network one will have edges $e = (u, v)$ and $e' = (u', v')$ such that there is a constraint $\theta_e = \theta_{e'}$ (see Appendix H). In game theory, if two agents act simultaneously it is difficult to have one agent's viable actions depend on another agent's action. Therefore, if parameters are constrained in this manner, it is better to have one agent control both. The appendix (beginning with Appendix B) extends our model and theory to handle such parameter tying, which allows us to handle both convolutional networks and non-convex activation functions (Appendix I). Our theory does not apply to non-smooth activation functions, however (e.g., ReLU gates), but these can be approximated arbitrarily closely by differentiable activations.
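To make the circuit value function in (1) concrete, here is a minimal sketch of the forward evaluation over a DAG in topological order; the graph encoding, the function names, and the tiny example network are our own illustrative assumptions rather than anything specified in the paper.

```python
import math
from typing import Callable, Dict, List, Tuple

def circuit_values(order: List[str],
                   incoming: Dict[str, List[Tuple[str, float]]],
                   act: Dict[str, Callable[[float], float]],
                   x: Dict[str, float]) -> Dict[str, float]:
    """Evaluate c_t(v, theta) of Eq. (1) over vertices listed in topological order.

    `incoming[v]` holds (u, theta(u, v)) pairs for the edges into v; input vertices
    have no incoming edges and read their feature value from `x`.
    """
    c: Dict[str, float] = {}
    for v in order:
        if not incoming.get(v):            # input vertex: c_t(i_k) = f_{i_k}(x_{tk})
            c[v] = act[v](x[v])
        else:                              # internal/output: f_v(sum_u c_t(u) * theta(u, v))
            c[v] = act[v](sum(c[u] * w for u, w in incoming[v]))
    return c

# Tiny example: two inputs -> one hidden unit (smoothed ReLU) -> one identity output.
soft_relu = lambda z, tau=0.5: tau * math.log(1.0 + math.exp(z / tau))
identity = lambda z: z
order = ["i1", "i2", "h", "o"]
incoming = {"h": [("i1", 0.7), ("i2", -0.3)], "o": [("h", 1.2)]}
act = {"i1": identity, "i2": identity, "h": soft_relu, "o": identity}
print(circuit_values(order, incoming, act, {"i1": 1.0, "i2": 2.0}))
```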
### 3.1 Learning Algorithms

Characterizing the deep learning problem as a game motivates the consideration of equilibrium-finding methods as potential training algorithms. Given the previous reduction to expert algorithms, we will consider the use of the $L_1$ ball constraint $\Theta_v = \{\theta_v : \|\theta_v\|_1 \le \beta\}$ at each vertex $v$. For deep learning, we have investigated a simple approach that trains independent protagonist agents at each vertex against a best-response antagonist and best-response zannis [14]. In this case, it is possible to devise interesting and novel learning strategies based on the algorithms for learning from expert advice. Since the optimization problem is no longer convex in a local protagonist action $\theta_v$, we do not expect convergence to a joint, globally optimal strategy among the protagonists. Nevertheless, one can develop a generic approach for using the game to generate a learning algorithm.

**Algorithm 1 (Main Loop).**
- On round $k$, observe some $x_t$ (or mini-batch).
- The antagonist and zannis choose best responses, which ensures $U^p_v(\theta^{(k)}_v) = -L(\theta^{(k)})$.
- $g^{(k)}_v \leftarrow \nabla U^p_v(\theta_v)$.
- Apply the update to $r^{(k)}_v$, $\rho^{(k)}_v$ and $\theta^{(k)}_v$ for all $v \in V$.

**Algorithm 2 (Regret Matching, RM).**
- $r^{(k+1)}_v \leftarrow r^{(k)}_v + H(\Theta_v)^\top g^{(k)}_v - \big(\rho^{(k)\top}_v H(\Theta_v)^\top g^{(k)}_v\big)\mathbf{1}$
- $\rho^{(k+1)}_v \leftarrow \big[r^{(k+1)}_v\big]^+ \big/\, \mathbf{1}^\top \big[r^{(k+1)}_v\big]^+$
- $\theta^{(k+1)}_v \leftarrow H(\Theta_v)\,\rho^{(k+1)}_v$

**Algorithm 3 (Exponentiated Weighted Average, EWA).**
- $r^{(k+1)}_v \leftarrow r^{(k)}_v + \eta^{(k)} H(\Theta_v)^\top g^{(k)}_v$
- $\rho^{(k+1)}_v \leftarrow \exp\big(r^{(k+1)}_v\big) \big/\, \mathbf{1}^\top \exp\big(r^{(k+1)}_v\big)$
- $\theta^{(k+1)}_v \leftarrow H(\Theta_v)\,\rho^{(k+1)}_v$

**Algorithm 4 (Projected SGD, PSGD).**
- $r^{(k+1)}_v \leftarrow r^{(k)}_v + \eta^{(k)} H(\Theta_v)^\top g^{(k)}_v$
- $\rho^{(k+1)}_v \leftarrow \mathrm{L2\_project\_to\_simplex}\big(r^{(k+1)}_v\big)$
- $\theta^{(k+1)}_v \leftarrow H(\Theta_v)\,\rho^{(k+1)}_v$

**Algorithm Outline.** On each round, nature chooses a random training example (or mini-batch). For each $v \in V$, each protagonist $v$ selects her action $\theta_v \in \Theta_v$ deterministically. The antagonist and zannis then select their actions, which are best responses to the $\theta_v$ and to each other.³ The protagonist utilities $U^p_v$ are then calculated. Given the zanni and antagonist choices, $U^p_v$ is affine in the protagonist's action, and, by Lemma 3, for all $e \in E_v$ we have $\partial L_t / \partial w_e = -\partial U^p_v(\theta_v) / \partial w_e$. Each protagonist $v \in V$ then observes her utility and uses it to update her strategy. See Algorithm 1 for the general loop, and Algorithms 2, 3 and 4 for the specific updates.

³ Conceptually, each zanni has a copy of the algorithm of each protagonist and an algorithm for selecting a joint action for all antagonists and zannis, and thus does not technically depend upon $\theta_v$. In practice, these multiple copies are unnecessary, and one merely calculates $\theta_v \in \Theta_v$ first.

Given the characterization developed previously, we know that a Nash equilibrium will correspond to a critical point of the training problem (which is almost certain to be a local minimum rather than a saddle point [21]). It is interesting to note that the usual process of backpropagating the sampled (sub)gradients corresponds to computing the best-response actions for the zannis and the antagonist, which then yields the resulting affine utility for the protagonists.
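To make the per-vertex update concrete, here is a minimal sketch of one regret-matching step (Algorithm 2) at a single vertex, under the assumption that $\Theta_v$ is an $L_1$ ball of radius $\beta$, so the convex-hull basis is $H(\Theta_v) = \beta\,[I, -I]$; the function name and interface are our own, not the paper's implementation.

```python
import numpy as np

def rm_vertex_step(r_v, g_v, beta):
    """One regret-matching step at a single vertex v (sketch of Algorithm 2).

    r_v  : cumulative regret vector, length 2*d (one entry per L1-ball vertex/expert)
    g_v  : gradient of the protagonist utility U^p_v w.r.t. theta_v (length d),
           i.e. minus the backpropagated loss gradient at this vertex
    beta : L1-ball radius, so H(Theta_v) = beta * [I, -I]
    """
    d = g_v.shape[0]
    expert_util = beta * np.concatenate([g_v, -g_v])   # H(Theta_v)^T g_v
    rho = np.maximum(r_v, 0.0)                         # current mixture rho^(k) from r^(k)
    rho = rho / rho.sum() if rho.sum() > 0 else np.full(2 * d, 1.0 / (2 * d))
    r_next = r_v + expert_util - rho @ expert_util     # regret vs. the played mixture
    pos = np.maximum(r_next, 0.0)
    rho_next = pos / pos.sum() if pos.sum() > 0 else np.full(2 * d, 1.0 / (2 * d))
    theta_v = beta * (rho_next[:d] - rho_next[d:])     # theta_v = H(Theta_v) rho^(k+1)
    return r_next, theta_v
```

In the experiments of Section 3.2, $r_v$ would be initialized with Gaussian noise of standard deviation $\sigma$ rather than zeros, which is the simple initialization that makes RM competitive in the deep setting.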
### 3.2 Experimental Evaluation

We conducted a set of experiments to investigate the plausibility of applying expert algorithms at each vertex in a feedforward neural network. For comparison, we considered current methods for training deep models, including SGD [3], SGD with momentum [33], RMSProp, Adagrad [6], and Adam [17]. Since none of these impose constraints, they technically solve an easier optimization problem, but they are also unregularized and therefore might exhibit weaker generalization. We tuned the step-size parameter for each comparison method on each problem.

For the expert algorithms, RM, EWA and PSGD, we found that EWA and PSGD were not competitive, even after tuning their step sizes. For RM, we initially found that it learned too quickly, with the top layers of the model becoming sparse; however, we discovered that RM works remarkably well simply by initializing the cumulative regret vectors $r^{(0)}_v$ with random values drawn from a Gaussian with large standard deviation $\sigma$.

As a sanity check, we first conducted experiments on synthetic combinatorial problems: parity, defined by $y = x_1 \oplus \cdots \oplus x_m$, and folded parity, defined by $y = (x_1 \wedge x_2) \oplus \cdots \oplus (x_{m-1} \wedge x_m)$ [27]. Parity cannot be approximated by a single-layer model but is representable with a single hidden layer of linear threshold gates [11], while folded parity is known to be not representable by a (small-weights) linear threshold circuit with only a single hidden layer; at least two hidden layers are required [27]. For parity we trained an m-4m-1 architecture, and for folded parity we trained an m-4m-4m-1 architecture, both fully connected, with $m = 8$. Here we chose the $L_1$ constraint bound to be $\beta = 10$ and the initialization scale $\sigma = 100$. For the nonlinear activation functions we used a smooth approximation of the standard ReLU gate, $f_v(x) = \tau \log(1 + e^{x/\tau})$ with $\tau = 0.5$.

Figure 2: Experimental results. (a) Parity, logistic loss, m-4m-1 architecture, 100 repeats. (d) Folded parity, logistic loss, m-4m-4m-1 architecture, 100 repeats. (b), (c) MNIST, fully connected 784-1024-1024-10 architecture, training loss and test error, 10 repeats. (e), (f) MNIST, convolutional 28×28-c(5×5, 64)-c(5×5, 64)-c(5×5, 64)-10 architecture, training loss and test error, 10 repeats.

The results shown in Figures 2a and 2d confirm that RM performs competitively, even when producing models with sparsity, from top to bottom, of 18% and 13% for parity, and 27%, 19% and 21% for folded parity.

We next conducted a few experiments on MNIST data. The first experiment used a fully connected 784-1024-1024-10 architecture, where RM was run with $\beta = 30$ and initialization scales $(\sigma_1, \sigma_2, \sigma_3) = (50, 200, 50)$. The second experiment used a convolutional architecture 28×28-c(5×5, 64)-c(5×5, 64)-c(5×5, 64)-10 (convolution windows 5×5 with depth 64), where RM was run with $(\beta_1, \beta_2, \beta_3, \beta_4) = (30, 30, 30, 10)$ and initialization scale $\sigma = 500$. The mini-batch size was 100, and the x-axis in the plots reports results after each block of 600 mini-batch updates (i.e., one epoch over the training data). Figures 2b, 2c, 2e and 2f show the evolution of the training loss and test misclassification error. We dropped all but SGD, Adam, RMSProp and RM here, since these seemed to dominate the other methods in our experiments. It is surprising that RM can demonstrate convergence rates that are competitive with tuned RMSProp, and it even outperforms methods like SGD and Adam that are routinely used in practice. An even more interesting finding is that the solutions found by RM were sparse while achieving lower test misclassification errors than standard deep learning methods.
In particular, in the fully connected case, the final solution produced by RM zeroed out 32%, 26% and 63% of the parameter matrices (from the input to the output layer), respectively. For the convolutional case, RM zeroed out 29%, 27%, 28% and 43% of the parameter matrices, respectively. Regarding run times, we observed that our TensorFlow implementation of RM was only 7% slower than RMSProp on the convolutional architecture, but 85% slower in the fully connected case.

## 4 Related Work

There are several works that consider using regret minimization to solve offline optimization problems. Once stochastic gradient descent was connected to regret minimization in [4], a series of papers followed [26, 25, 31]. Two popular approaches currently are Adagrad [6] and traditional stochastic gradient descent. The theme of simplifying the loss is very common: it appears in batch gradient and incremental gradient approaches [23] as the majorization-minimization family of algorithms. In the regret minimization literature, the idea of simplifying the class of losses by choosing a minimizer from a particular family of functions first appeared in [37], and has since been further developed.

By contrast, the use of games for optimization has a much shorter history. It has been shown that a game between people can be used to solve optimal coloring [16]. There is also a history of using regret minimization in games: of particular interest is [38], which decomposes a single agent into multiple agents, providing some inspiration for this paper. In the context of deep networks, a paper of interest connects brain processes to prediction markets [1]. However, the closest work appears to be the recent manuscript [2], which also poses the optimization of a deep network as a game. Although the games described there are similar, unlike [2] we focus on differentiable activation functions, and define agents with different information and motivations. Importantly, [2] does not characterize all the Nash equilibria of the game it proposes. We discuss these issues in more detail in Appendix J.

## 5 Conclusion

We have investigated a reduction of deep learning to game playing that allowed a bijection between KKT points and Nash equilibria. One of the novel algorithms considered for supervised learning, regret matching, appears to provide a competitive alternative that has the additional benefit of achieving sparsity without unduly sacrificing speed or accuracy. It will be interesting to investigate alternative training heuristics for deep games, and whether similar successes can be achieved on larger deep models or recurrent models.

## References

[1] D. Balduzzi. Cortical prediction markets. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pages 1265–1272, 2014.
[2] D. Balduzzi. Deep online convex optimization using gated games. http://arxiv.org/abs/1604.01952, 2016.
[3] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, Second Edition, pages 421–436, 2012.
[4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.
[5] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[6] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[7] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra.
Efficient projections onto the l1-ball for learning in high dimensions. In International Conference on Machine Learning, pages 272–279, 2008.
[8] Y. Freund and R. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
[9] G. Gordon. No-regret algorithms for structured prediction problems. Technical Report CMU-CALD-05-112, Carnegie Mellon University, 2005.
[10] G. Gordon. No-regret algorithms for online convex programs. In NIPS 19, 2006.
[11] A. Hajnal. Threshold circuits of bounded depth. JCSS, 46(2):129–154, 1993.
[12] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
[13] K. Hoeffgen, H. Simon, and K. Van Horn. Robust trainability of single neurons. JCSS, 52(2):114–125, 1995.
[14] M. Johanson, N. Bard, N. Burch, and M. Bowling. Finding optimal abstract strategies in extensive-form games. In AAAI Conference on Artificial Intelligence, pages 1371–1379, 2012.
[15] W. Karush. Minima of functions of several variables with inequalities as side constraints. Master's thesis, University of Chicago, Chicago, Illinois, 1939.
[16] M. Kearns, S. Suri, and N. Montfort. An experimental study of the coloring problem on human subject networks. Science, 313:824–827, 2006.
[17] D. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[18] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
[19] H. Kuhn and A. Tucker. Nonlinear programming. In Proceedings of the 2nd Berkeley Symposium, pages 481–492. University of California Press, 1951.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[21] J. Lee, M. Simchowitz, M. Jordan, and B. Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, volume 49, 2016.
[22] N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[23] J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
[24] A. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems 26, pages 3066–3074, 2013.
[25] N. Ratliff, D. Bagnell, and M. Zinkevich. Subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-07), 2007.
[26] N. Ratliff, J. A. Bagnell, and M. Zinkevich. Maximum margin planning. In Twenty-Second International Conference on Machine Learning (ICML-06), 2006.
[27] A. Razborov. On small depth threshold circuits. In Algorithm Theory (SWAT '92), 1992.
[28] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[29] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In NIPS 19, 2006.
[30] S. Shalev-Shwartz and Y. Singer. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115–142, 2007.
[31] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
[32] N. Srinivasan, V. Ravichandran, K. Chan, J. Vidhya, S. Ramakirishnan, and S. Krishnan.
Exponentiated backpropagation algorithm for multilayer feedforward neural networks. In ICONIP, volume 1, 2002.
[33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of ICML, pages 1139–1147, 2013.
[34] V. Syrgkanis, A. Agarwal, H. Luo, and R. Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems 28, pages 2971–2979, 2015.
[35] O. Tammelin, N. Burch, M. Johanson, and M. Bowling. Solving heads-up limit Texas hold'em. In International Joint Conference on Artificial Intelligence (IJCAI), pages 645–652, 2015.
[36] V. Vazirani, N. Nisan, T. Roughgarden, and É. Tardos. Algorithmic Game Theory. Cambridge University Press, 2007.
[37] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Twentieth International Conference on Machine Learning, 2003.
[38] M. Zinkevich, M. Bowling, M. Johanson, and C. Piccione. Regret minimization in games with incomplete information. In NIPS, 2007.