# Symmetry Teleportation for Accelerated Optimization

Bo Zhao (University of California, San Diego) bozhao@ucsd.edu
Nima Dehmamy (IBM Research) nima.dehmamy@ibm.com
Robin Walters (Northeastern University) r.walters@northeastern.edu
Rose Yu (University of California, San Diego) roseyu@ucsd.edu

Existing gradient-based optimization methods update parameters locally, in a direction that minimizes the loss function. We study a different approach, symmetry teleportation, that allows parameters to travel a large distance on the loss level set, in order to improve the convergence speed in subsequent steps. Teleportation exploits symmetries in the loss landscape of optimization problems. We derive loss-invariant group actions for test functions in optimization and multi-layer neural networks, and prove a necessary condition for teleportation to improve convergence rate. We also show that our algorithm is closely related to second-order methods. Experimentally, we show that teleportation improves the convergence speed of gradient descent and AdaGrad for several optimization problems, including test functions, multi-layer regressions, and MNIST classification. Our code is available at https://github.com/Rose-STL-Lab/Symmetry-Teleportation.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## 1 Introduction

Consider the optimization problem of finding $\arg\min_w L(w)$, where $L$ is a loss function and $w$ are the parameters. While finding global minima of $L(w)$ is hard for non-convex problems, we can use gradient descent (GD) to find local minima. In GD we apply the following update rule at every step:
$$ w_{t+1} \leftarrow w_t - \eta\, \nabla L, \tag{1} $$
where $\eta$ is the learning rate. Gradient descent is a first-order method that uses only gradient information. It is easy to compute but suffers from slow convergence. Second-order methods such as Newton's method use the second derivative to account for the landscape geometry. These methods enjoy faster convergence, but calculating the second derivative (Hessian) can be computationally expensive over high-dimensional spaces (Hazan, 2019).

Figure 1: Left to right: gradient descent, second-order methods, proposed method.

In this paper, we propose a new optimization method based on parameter space symmetries, which are group actions on the parameter space that leave the loss unchanged. Our algorithm takes advantage of higher-order landscape geometry but uses only gradient information in most steps, thus avoiding the computational cost of second-order methods.

Specifically, we look beyond local optimization and ask: what if we allow the parameters to make a big jump once in a while? As shown in Figure 1, during optimization, we teleport the parameters to another point with steeper gradients using a loss-invariant symmetry transformation. After teleportation, the loss stays the same, but the gradient, and hence the rate of loss decay, changes. The increased magnitude of the gradient can reduce the number of steps required to converge, leading to acceleration of gradient descent.

The locality of gradient descent is reflected in its formulation in terms of a proximal term; our method circumvents this locality by teleporting to new locations with the same loss. A step of GD is equivalent to the following proximal mapping (Combettes and Pesquet, 2011). Letting $\langle \cdot, \cdot \rangle$ denote the inner product, we have
$$ w_{t+1} = \arg\min_w \; \eta\, \langle (\nabla L)|_{w_t},\, w \rangle + \tfrac{1}{2}\, \|w - w_t\|^2, \tag{2} $$
where $\tfrac{1}{2}\|w - w_t\|^2$ is the proximal term that keeps $w_{t+1}$ close to $w_t$.
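To make the equivalence explicit (this short derivation is ours, assuming the proximal objective takes the form written above), setting the derivative of the proximal objective to zero recovers the GD update (1):
$$
0 = \frac{\partial}{\partial w}\left[\eta\, \langle (\nabla L)|_{w_t},\, w\rangle + \tfrac{1}{2}\|w - w_t\|^2\right]
= \eta\, (\nabla L)|_{w_t} + (w - w_t)
\quad\Longrightarrow\quad
w_{t+1} = w_t - \eta\, (\nabla L)|_{w_t}.
$$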
Adaptive gradient methods define the proximal term using the Mahalanobis distance $\|w - w_t\|^2_{G^{1/2}}$ to account for landscape geometry. For example, in AdaGrad, $G$ is the sum of the outer products of the gradients (Duchi et al., 2011). Our proposed teleportation technique relaxes the requirement imposed by the proximal term: we teleport to points on the same level set of $L(w_t)$, but allow $w$ to be far from $w_t$ in Euclidean distance.

In summary, our main contributions are:

- We propose symmetry teleportation, an accelerated gradient-based optimization algorithm that exploits symmetries in the loss landscape.
- We derive loss-invariant group actions for test functions and multi-layer neural networks.
- We provide necessary conditions and examples of when teleportation improves the convergence rate of the current and subsequent steps.
- We show that our algorithm is closely related to second-order methods.
- Experimentally, we show that teleportation improves the convergence speed of gradient descent and AdaGrad in various optimization problems.

## 2 Related Work

Continuous parameter space symmetries have been identified in neural networks with homogeneous (Badrinarayanan et al., 2015; Du et al., 2018) and radial activation functions (Ganev et al., 2021). The effect of symmetry transformations on gradients has been examined for translation, scale, and rescale symmetries in Kunin et al. (2021). We contribute to this line of work by deriving loss-invariant group actions in multi-layer neural networks with invertible activation functions.

Several works exploit symmetry to facilitate optimization. For example, Path-SGD (Neyshabur et al., 2015) improves optimization in ReLU networks by path regularization. G-SGD (Meng et al., 2019) performs weight updates in a scale-invariant space, which also exploits the symmetry of ReLU networks. Van Laarhoven (2017) analyzes the effect of L2 regularization in batch normalization, which gives scale-invariant functions. Bamler and Mandt (2018) transform parameters within symmetry orbits to address slow movement along directions of weakly broken symmetry, and find the optimal $g \in G$ that minimizes $L(g \cdot w)$. In comparison, we search within $G$-orbits for points which maximize $|dL(g \cdot w)/dt| = \|\nabla L(g \cdot w)\|^2$. Natural gradient (Amari, 1998), adaptive gradient methods (Duchi et al., 2011; Kingma and Ba, 2015), and their approximations (Martens and Grosse, 2018; Gupta et al., 2018) improve the direction of parameter updates instead of directly transforming parameters. If the group acts transitively on the level set, and we teleport to the point with maximum gradient, then our update direction is the same as that of Newton's method. We prove that our algorithm is connected to second-order optimization methods and show that teleportation can be used to improve these methods empirically.

The concept of neural teleportation was first explored using quiver representation theory (Armenta and Jodoin, 2021; Armenta et al., 2020). These works provide a way to explore level curves of the loss of neural networks and show, both experimentally and theoretically, that random teleportation speeds up gradient descent. Our algorithm improves on neural teleportation by searching for teleportation destinations that lead to the largest improvement in the magnitude of the gradient.

Several works study how parameter initialization affects convergence rate (Saxe et al., 2014; Tarmoun et al., 2021; Min et al., 2021). If we apply a group transformation only at initialization, our method is similar to that of Tarmoun et al. (2021).
We do not guarantee that the transformed parameters lead to a faster convergence rate throughout the entire training; however, we accelerate convergence at least for a short time after initialization. Additionally, our method is not restricted to initialization and can be applied at any time during training.

Most contemporary neural networks are overparametrized. While this has been shown to improve generalization (Belkin et al., 2019), an important question is how overparametrization affects optimization. A series of works starting from Arora et al. (2018) show that the overparametrization resulting from the depth of a neural network accelerates convergence. Another view is that the symmetry created by overparametrization imposes constraints on trajectories in the form of conserved quantities (Głuch and Urbanke, 2021). Additionally, the symmetry generates additional trajectories. When the trajectories created by overparametrization are equivalent, model compression by removing symmetry reduces training time (Ganev et al., 2021). However, when trajectories are not equivalent, gradient flows on some paths are faster than others. We search for the better paths created by overparametrization.

## 3 Symmetry Teleportation

We propose symmetry teleportation, an accelerated gradient-based optimization algorithm that exploits symmetries in the loss landscape. Symmetry teleportation searches for the best gradient descent trajectory by teleporting parameters to a point with larger gradients using a group action. The resulting algorithm requires only gradient computations but is able to account for the global landscape geometry, leading to faster loss decay.

Let the group $G$ be a set of symmetries which leave the loss function $L$ invariant: $L(g \cdot (w, X)) = L(w, X)$ for all $g \in G$. We perform gradient descent for $t_{\max}$ steps. We define an index set $K \subseteq \{0, 1, 2, \dots, t_{\max} - 1\}$ as a teleportation schedule. At epochs that are in the schedule, we transform the parameters using a group element $g \in G$ to the location where the gradient is largest, then continue with gradient descent. Algorithm 1 describes the details of this procedure. Note that the loss does not change after teleportation (Lines 2-5) since $L$ is $G$-invariant.

Algorithm 1: Symmetry Teleportation. Input: loss function $L(w)$, learning rate $\eta$, number of epochs $t_{\max}$, initialized parameters $w_0$, symmetry group $G$, teleportation schedule $K$. Output: $w_{t_{\max}}$.

1. for $t \leftarrow 0$ to $t_{\max} - 1$ do
2. if $t \in K$ then
3. $g \leftarrow \arg\max_{g \in G} \|(\nabla L)|_{g \cdot w_t}\|^2$
4. $w_t \leftarrow g \cdot w_t$
5. end if
6. $w_{t+1} \leftarrow w_t - \eta\, (\nabla L)|_{w_t}$
7. end for
8. return $w_{t_{\max}}$

Algorithm 1 can be generalized to apply teleportation to stochastic gradient descent (Appendix A.1). We discuss some design choices below and provide detailed analysis in the next two sections. When the action of $G$ is continuous, teleportation can be implemented by parameterizing $g$ and performing gradient ascent on it. For example, the $SO(2)$ group can be parameterized by the rotation angle $\theta$. Small transformations $g \in GL_d(\mathbb{R})$ (the general linear group) can be parameterized as $g \approx I + \varepsilon T$, where $\varepsilon \ll 1$ and $T$ is an arbitrary $d \times d$ matrix. For discrete groups, search algorithms or random sampling can be used to find a group element that improves the magnitude of the gradient.

Although $g \cdot (w, X)$ does not change $X$ in the cases we consider in this paper, Algorithm 1 can be extended to allow transformations on both parameters and data. The group actions on data during teleportations can be precomposed and applied to the input data at inference time. The extension of the algorithm that allows transformations on the input can be found in Appendix A.2.
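To make the procedure concrete, here is a minimal PyTorch sketch of Algorithm 1 for a two-layer linear network $L(W_1, W_2) = \|Y - W_2 W_1 X\|^2$, using the linear-network symmetry $g \cdot (W_2, W_1) = (W_2 g^{-1},\, g W_1)$ derived in Section 4.2. The function names, the normalized ascent step, the schedule $K$, and all hyperparameters are our own illustrative choices, not the released implementation.

```python
import torch

def loss(W1, W2, X, Y):
    # L(W1, W2) = ||Y - W2 W1 X||^2 for a two-layer linear network
    return ((Y - W2 @ W1 @ X) ** 2).sum()

def grad_norm_sq(W1, W2, X, Y):
    # ||dL/dw||^2, kept differentiable so we can ascend on the group element
    g1, g2 = torch.autograd.grad(loss(W1, W2, X, Y), (W1, W2), create_graph=True)
    return (g1 ** 2).sum() + (g2 ** 2).sum()

def teleport(W1, W2, X, Y, steps=10, lr=1e-2):
    # Line 3 of Algorithm 1: ascend ||(grad L)|_{g.w}||^2 over g, parameterized as g = I + T
    d = W1.shape[0]
    T = torch.zeros(d, d, requires_grad=True)
    for _ in range(steps):
        g = torch.eye(d) + T
        W1g, W2g = g @ W1, W2 @ torch.linalg.inv(g)      # (W2, W1) -> (W2 g^-1, g W1); exact inverse keeps L unchanged
        obj = grad_norm_sq(W1g, W2g, X, Y)
        (grad_T,) = torch.autograd.grad(obj, T)
        with torch.no_grad():
            T += lr * grad_T / (grad_T.norm() + 1e-12)   # normalized ascent step (for numerical stability)
    g = torch.eye(d) + T.detach()
    return g @ W1, W2 @ torch.linalg.inv(g)              # Line 4: w <- g . w

def train(W1, W2, X, Y, lr=1e-3, t_max=300, K=(0, 100, 200)):
    for t in range(t_max):
        if t in K:                                       # teleportation schedule (Line 2)
            W1, W2 = teleport(W1, W2, X, Y)
        W1, W2 = W1.detach().requires_grad_(True), W2.detach().requires_grad_(True)
        g1, g2 = torch.autograd.grad(loss(W1, W2, X, Y), (W1, W2))
        with torch.no_grad():
            W1, W2 = W1 - lr * g1, W2 - lr * g2          # Line 6: gradient descent step
    return W1, W2

torch.manual_seed(0)
X, Y = torch.randn(4, 8), torch.randn(3, 8)
W1, W2 = 0.5 * torch.randn(5, 4), 0.5 * torch.randn(3, 5)
print(loss(W1, W2, X, Y).item(), loss(*train(W1, W2, X, Y), X, Y).item())
```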
The teleportation schedule $K$ is a hyperparameter that determines when to perform teleportation. We define $K$ as a set to allow flexible teleportation schedules, such as non-fixed frequencies or teleporting only at earlier epochs.

## 4 Symmetry Groups of Certain Optimization Problems

We give a few practical examples to demonstrate how teleportation can be used to accelerate optimization. Specifically, we first consider two test functions which are often used to evaluate optimization algorithms (Back, 1996). We then derive the symmetries of multi-layer neural networks.

### 4.1 Test functions

Rosenbrock function. The Rosenbrock function, originally introduced by Rosenbrock (1960), has a characteristic global minimum that lies inside a long, narrow, parabola-shaped flat valley. Finding the valley is easy, but reaching the minimum is difficult. On a 2-dimensional space, the Rosenbrock function has the following form:
$$ L_r(x_1, x_2) = 100(x_1^2 - x_2)^2 + (x_1 - 1)^2. \tag{3} $$

Booth function. The Booth function (Jamil and Yang, 2013) is also defined on $\mathbb{R}^2$ and has one global minimum at $(1, 3)$, where the function evaluates to 0:
$$ L_b(x_1, x_2) = (x_1 + 2x_2 - 7)^2 + (2x_1 + x_2 - 5)^2. \tag{4} $$

The following proposition identifies the symmetry of these two test functions.

Proposition 4.1. The Rosenbrock and Booth functions have rotational symmetry. In other words, there exist action maps $a_r, a_b : SO(2) \times \mathbb{R}^2 \to \mathbb{R}^2$ such that for all $g \in SO(2)$, $L_r(x_1, x_2) = L_r(a_r(g, [x_1, x_2]))$ and $L_b(x_1, x_2) = L_b(a_b(g, [x_1, x_2]))$.

The exact forms of the action maps are deferred to Appendix B.1.2 and B.1.3. During the teleportation step, our goal is to maximize the gradient within a level set of the loss: $\max_{g \in SO(2)} \left| \frac{dL(g \cdot (x_1, x_2))}{dt} \right|$.

### 4.2 Multi-layer Neural Networks

Next, we consider feed-forward neural networks. Denote the output of the $m$-th layer by $h_m \in \mathbb{R}^{d_m \times n}$, where $d_m$ is the hidden dimension and $n$ is the number of samples. Denote the input by $h_0 = X \in \mathbb{R}^{d_0 \times n}$. In terms of the previous layer's output, $\hat{h}_m = W_m h_{m-1}$, where $W_m \in \mathbb{R}^{d_m \times d_{m-1}}$ (we absorb biases into $W_m$ by adding an extra row of ones to $h_{m-1}$), and the output $h_m$ is
$$ h_m = \sigma(\hat{h}_m) = \sigma(W_m h_{m-1}). \tag{5} $$
We assume the activation functions $\sigma : \mathbb{R} \to \mathbb{R}$ are bijections. For instance, Leaky-ReLU is bijective. For other activations, such as Sigmoid or Tanh, we analytically extend them to bijective functions (e.g., $\tanh(x) + e^{x-N} - e^{-x+N}$ for $N \gg 1$). For linear activations, we have linear symmetries:

Proposition 4.2. A linear network is invariant under all groups $G_m = GL_{d_{m-1}}$ acting as
$$ g \cdot (W_m, W_{m-1}) = (W_m g^{-1},\; g W_{m-1}), \qquad g \cdot W_k = W_k, \;\; \forall k \notin \{m, m-1\}. \tag{6} $$

Similarly, in the nonlinear case, we want to find $g \cdot W_m$ that keeps the outputs $h_k$ for $k \neq m-1$ invariant. With nonlinearity, $\hat{h}_m = W_m \sigma(W_{m-1} h_{m-2})$. This network has a $GL_{d_{m-1}}$ symmetry given by
$$ g \cdot (W_m, h_{m-1}) = (W_m g^{-1},\; g h_{m-1}), \qquad g \cdot \hat{h}_m = W_m g^{-1} g h_{m-1} = \hat{h}_m. \tag{7} $$
Thus, we have the following proposition regarding symmetries of nonlinear networks:

Proposition 4.3. Assume that $h_{m-2}$ is invertible. A multi-layer network with bijective activation $\sigma$ has a $GL_{d_{m-1}}$ symmetry. For $g_m \in G_m = GL_{d_{m-1}}(\mathbb{R})$, the following group action keeps $h_p$ with $p \geq m$ invariant:
$$ g_m \cdot W_k = \begin{cases} W_m\, g_m^{-1} & k = m \\ \sigma^{-1}\!\left(g_m\, \sigma(W_{m-1} h_{m-2})\right) h_{m-2}^{-1} & k = m-1 \\ W_k & k \notin \{m, m-1\}. \end{cases} \tag{8} $$

Proofs can be found in Appendix B.2. Note that this group action depends on the input to the network as well as the current weights of all the lower layers. Yet, since the action keeps the output of upper and lower layers invariant, multiple $G_m$ for different $m$ applied at the same time still keep the network output invariant.
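As a sanity check, the following NumPy snippet (our own construction, not the paper's code) verifies numerically that the action in Eq. (8) leaves the network output, and hence any loss computed from it, unchanged for a small Leaky-ReLU network with a square, invertible input.

```python
import numpy as np

rng = np.random.default_rng(0)
slope = 0.1

def sigma(x):       # Leaky-ReLU with negative slope 0.1
    return np.where(x > 0, x, slope * x)

def sigma_inv(x):   # its inverse, needed for Eq. (8)
    return np.where(x > 0, x, x / slope)

d0 = n = 4                                    # square input so h_0^{-1} = X^{-1} exists
X  = rng.standard_normal((d0, n))             # h_0
W1 = rng.standard_normal((4, d0))             # layer m-1 = 1
W2 = rng.standard_normal((3, 4))              # layer m   = 2

def forward(W1, W2):
    return sigma(W2 @ sigma(W1 @ X))          # h_2

g, _ = np.linalg.qr(rng.standard_normal((4, 4)))          # an invertible element of GL_{d_1}

W2_new = W2 @ np.linalg.inv(g)                             # Eq. (8), k = m
W1_new = sigma_inv(g @ sigma(W1 @ X)) @ np.linalg.inv(X)   # Eq. (8), k = m - 1

print(np.allclose(forward(W1, W2), forward(W1_new, W2_new)))  # True: h_2 is unchanged
```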
The proposition assumes that $h_{m-2}$ is square and full-rank. When $n < d_{m-2}$ and $h_{m-2}$ has rank $n$, (8) (with the left inverse of $h_{m-2}$) keeps the loss invariant but does not satisfy the identity axiom of a group action.

## 5 Theoretical Analysis

### 5.1 What symmetries help accelerate optimization

We now discuss the conditions that need to be satisfied for teleportation to accelerate GD. For brevity, we denote all trainable parameters (e.g. $\{W_1, \dots, W_p\}$ for a $p$-layer neural network) by a single flattened vector $w \in \mathbb{R}^n$. Consider a symmetry $G$ of the loss function $L(w)$, meaning $L(w) = L(g \cdot w)$ for all $g \in G$. We quantify how teleportation by $G$ changes the rate of loss decay, given by
$$ \frac{dL}{dt} = -[\nabla L]^\top \eta\, \nabla L = -\|\nabla L\|^2_\eta, $$
where $\eta$ is the matrix of learning rates (which must be positive semi-definite) and $\|v\|^2_\eta = v^\top \eta\, v$ is the Mahalanobis norm. A constant learning rate corresponds to $\eta$ being a multiple of the identity. The following proposition provides the condition a symmetry needs to satisfy to accelerate optimization (proofs can be found in Appendix C.1).

Proposition 5.1. Let $w' = g \cdot w$ be a point we teleport to, and let $J = \partial w' / \partial w$ be the Jacobian. Symmetry teleportation using $g$ accelerates the rate of decay in $L$ if it satisfies
$$ \left\| \left(J^{-1}\right)^{\!\top} \nabla L(w) \right\|^2_\eta > \|\nabla L(w)\|^2_\eta. $$

If the action of the symmetry group $G \leq GL_n$ is linear, we have $w' = g \cdot w = g w$ and $J = g$. It follows that if $G$ is a subgroup of the orthogonal group, $dL/dt$ will be invariant:

Corollary 5.2. Let $O_\eta$ denote the orthogonal group of invariances of the inverse of the learning rate, $\eta^{-1}$, meaning that for $g \in O_\eta$, $g^\top \eta^{-1} g = \eta^{-1}$. Then for all $g \in O_\eta$, $\frac{dL(g \cdot w)}{dt} = \frac{dL(w)}{dt}$.

In the simple case where the learning rate is a constant, $O_\eta = O(n)$ becomes the classic orthogonal group (e.g. rotations) with $g^\top g = I$. In general, when $g$ preserves the norm of the gradient, symmetry teleportation has no effect on the convergence rate. From Theorem 3.2 in Hazan (2019), assuming that $L$ is $\beta$-smooth and bounded by $|L| \leq M$, the gradient norms along the gradient descent trajectory satisfy
$$ \frac{1}{2\beta} \sum_{t=1}^{t_{\max}} \left\| (\nabla L)|_{w_t} \right\|^2 \leq 2M. $$
Teleportation increases $\|(\nabla L)|_{w_t}\|$, therefore requiring less time to reach convergence.

### 5.2 Improvement of subsequent steps

Since teleportation moves the parameters to a point with a larger gradient, and subsequent GD steps are local, we would expect that teleportation improves the magnitude of the gradient for a few future steps as well. The following results formalize this intuition (proofs in Appendix C.2).

Assumption 5.3 (Lipschitz continuity). The $\ell_2$ norm of the gradient is Lipschitz continuous with constant $L \in \mathbb{R}_{\geq 0}$, that is,
$$ \big|\, \|\nabla L(w_1)\|_2 - \|\nabla L(w_2)\|_2 \,\big| \leq L\, \|w_1 - w_2\|_2, \tag{12} $$
where $w_1, w_2$ are two points in the parameter space and $L$ is the Lipschitz constant.

Proposition 5.4. Consider gradient descent with a $G$-invariant loss $L(w)$ and learning rate $\eta \in \mathbb{R}^+$. Let $w_t$ be the parameters at time $t$ and $w'_t = g \cdot w_t$ the parameters teleported by $g \in G$. Let $w_{t+T}$ and $w'_{t+T}$ be the parameters after $T$ more steps of gradient descent from $w_t$ and $w'_t$, respectively. Under Assumption 5.3, if $\eta L < 1$ and
$$ \|\partial L / \partial w'_t\|_2 \;\geq\; \|\partial L / \partial w_t\|_2\, (1 - \eta L)^{-T}, \tag{13} $$
then $\|\partial L / \partial w'_{t+T}\|_2 \geq \|\partial L / \partial w_{t+T}\|_2$.

The proposition provides a sufficient condition for teleportation to improve $T$ future steps. The condition is met when $L$ is small, $\eta$ is small, or the initial improvement in gradient norm, $\|\partial L / \partial w'_t\|_2 / \|\partial L / \partial w_t\|_2$, is large.

### 5.3 Convergence analysis for convex quadratic functions

Teleportation improves the magnitude of the gradient for the current step. We have also shown that the magnitude of the gradient stays large for a few subsequent steps (Proposition 5.4). In this section, we analyze a class of functions where teleporting once guarantees optimality at all future times.
We consider a trajectory optimal if, for every point on the trajectory, the magnitude of the gradient is at a local maximum in the loss level set that contains the point. Consider a positive definite quadratic form $L_A(w) = w^\top A w$, where $w \in \mathbb{R}^n$ is the parameter and $A \in \mathbb{R}^{n \times n}$ is a positive definite matrix. The gradient of $L_A$ is $\nabla L_A = 2Aw$, and the Hessian of $L_A$ is $H = 2A$. Since $A$ is defined to be positive definite, $L_A$ is convex and has its global minimum at the single point $w^* = 0$.

Let $\rho$ be a representation of $O(n)$ acting on $\mathbb{R}^n$. For $g \in O(n)$, we define the following group action:
$$ g \cdot w = A^{-\frac{1}{2}}\, \rho(g)\, A^{\frac{1}{2}}\, w. \tag{15} $$
Then it can be shown that $L_A(w)$ admits an $O(n)$ symmetry:
$$ L_A(g \cdot w) = w^\top A^{\frac{1}{2}} \rho(g)^\top \rho(g) A^{\frac{1}{2}} w = w^\top A w = L_A(w). \tag{16} $$

Let $S_c = \{w : L_A(w) = c\}$ be a level set of $L_A$. We show that after a teleportation, every point on the gradient flow trajectory is optimal in its level set (full proof in Appendix C.3).

Proposition 5.5. If at a point $w$, $\|\nabla L_A|_w\|_2$ is at a maximum in $S_{L_A(w)}$, then for any point $w'$ on the gradient flow trajectory starting from $w$, $\|\nabla L_A|_{w'}\|_2$ is at a maximum in $S_{L_A(w')}$.

Finally, we observe that for $L_A$, teleportation moves the parameters closer to the global minimum in Euclidean distance. In other words, maximizing the magnitude of the gradient minimizes the distance to $w^*$ within a loss level set.

Proposition 5.6. Consider a point $w$ in the parameter space, and let $g^* = \arg\max_{g \in G} \|\nabla L_A|_{g \cdot w}\|^2_2$. Then $g^* \cdot w = \arg\min_{w' \in S_{L_A(w)}} \|w' - w^*\|_2$.

### 5.4 Relation to second-order optimization methods

Since our algorithm involves optimizing the gradients, symmetry teleportation is closely related to second-order optimization methods. At the target point of a teleportation, gradient descent becomes equivalent to Newton's method (proof in Appendix C.4).

Proposition 5.7. Let $S_{L_0} = \{w : L(w) = L_0\}$ be a level set of $L$. If at a particular $w \in S_{L_0}$ we have $\|\nabla L(w)\|_2 \geq \|\nabla L(w')\|_2$ for all $w'$ in a small neighborhood of $w$ in $S_{L_0}$, then the gradient $\nabla L(w)$ has the same direction as the direction given by Newton's method, $H^{-1} \nabla L(w)$.

Proposition 5.7 provides an alternative way to interpret teleportation. Instead of computing Newton's direction, we search within the loss level set for a point where the gradient has the same direction as Newton's direction. However, symmetry teleportation does not require computing the full Hessian matrix. The second derivative required for optimizing over continuous groups is obtained by taking derivatives with respect to the parameters and then with respect to the group element, as opposed to taking the derivative with respect to the parameters twice. This makes the computation significantly more feasible than Newton's method on neural networks with a large number of parameters.

In practice, however, the group action used for teleportation is usually not transitive. Additionally, we do not teleport using the optimal group element, since it can be unbounded, and we do not apply teleportation in every gradient descent step. Therefore, Proposition 5.7 serves as an intuition rather than an exact formulation of how teleportation works. We provide empirical evidence in Section 6 that these approximations do not erase the benefits of teleportation completely, and leave theoretical investigations of the connection to second-order methods under these approximations as future work.
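The following small NumPy check (our own construction) illustrates Proposition 5.7 on the quadratic loss $L_A(w) = w^\top A w$ of Section 5.3: at a point where the gradient norm is maximal within its level set (here, the direction of the top eigenvector of $A$), the gradient is parallel to the Newton direction $H^{-1} \nabla L_A = w$, whereas at a generic point it is not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)              # symmetric positive definite

# Maximizing ||grad L_A||^2 = 4 w^T A^2 w subject to w^T A w = c gives an
# eigenvector of A (Lagrange condition A^2 w = lambda A w  =>  A w = lambda w).
evals, evecs = np.linalg.eigh(A)
w_max  = evecs[:, -1]                    # top eigenvector: max-gradient point on its level set
w_rand = rng.standard_normal(n)          # a generic point, for comparison

def cosine(u, v):
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

for w in (w_max, w_rand):
    grad   = 2 * A @ w                       # gradient of L_A
    newton = np.linalg.solve(2 * A, grad)    # Newton direction H^{-1} grad = w
    print(cosine(grad, newton))              # ~1.0 at the max-gradient point, < 1 generically
```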
## 6 Experiments

### 6.1 Acceleration through symmetry teleportation

We examine the effect of symmetry teleportation on optimization. We illustrate teleportation in the parameter space on two test functions and show a speedup in regression and classification problems using multi-layer neural networks. For test functions, we compare with GD for illustration purposes. For multi-layer neural networks, we also include AdaGrad as a more competitive baseline.

Rosenbrock function. We apply symmetry teleportation to optimize the 2-variable Rosenbrock function (3). The parameters $x_1, x_2$ are initialized to $(-1, -1)$. Each algorithm is run for 1000 steps with learning rate $10^{-3}$. We teleport the parameters every 100 steps. The group elements are found by gradient ascent on $\theta$, the parameter of the $SO(2)$ group, for 10 steps with learning rate $10^{-1}$. The trajectory of the parameters and the loss level sets are plotted in Figures 2a and 2b. The blue star denotes the final position of the parameters, the green star denotes the target, and orange dots are the positions from which symmetry transformations start. While gradient descent is not able to reach the target in 1000 steps, teleportation allows large steps and reaches the target much earlier. Figure 2d shows that teleportation improves the norm of the gradient in the following step, and Figure 2c shows its effect on the loss value. Teleportations clearly reduce the number of steps needed for convergence.

Figure 2: Optimization of the Rosenbrock function (top row) and Booth function (bottom row) using (a) gradient descent and (b) the proposed algorithm. Contours represent the level sets of the loss function. Loss $L$ and convergence rate $dL/dt$ are shown in (c) and (d). Teleportation helps move the parameters towards the target.

Booth function. We also test symmetry teleportation on the Booth function defined in Eqn. (4). We initialize the parameters $x_1, x_2$ to $(5, 5)$. Each algorithm is run for 10 steps with learning rate 0.08. We perform symmetry teleportation on the parameters once, before epoch 5. The group elements are found by gradient ascent on $\theta$ for 10 steps with learning rate 0.001; $\theta$ is initialized uniformly at random over $[0, \pi)$. Similar to the Rosenbrock function, teleportation moves the parameters to a trajectory with a larger convergence rate (Figure 2, bottom row).

Multilayer neural network regression. We further evaluate our method on a three-layer neural network with the regression loss $\min_{W_1, W_2, W_3} \|Y - W_3 \sigma(W_2 \sigma(W_1 X))\|^2$. The dimensions of the weight matrices are $W_3 \in \mathbb{R}^{8 \times 7}$, $W_2 \in \mathbb{R}^{7 \times 6}$, and $W_1 \in \mathbb{R}^{6 \times 5}$. $X \in \mathbb{R}^{5 \times 4}$ is the data, $Y \in \mathbb{R}^{8 \times 4}$ is the target, and $\sigma$ is the LeakyReLU activation with slope coefficient 0.1. Additional hyperparameter details can be found in Appendix D.2.

Teleportation with the $GL(\mathbb{R})$ group can be performed by finding an element $x$ of its Lie algebra, such that transforming $w$ by the group element $g = \exp(x)$ improves the gradient $\frac{dL}{dt}$. We use the first-order approximation of the exponential map: between each pair of weight matrices, we replace $g_m$ by $I + T$ and $g_m^{-1}$ by $I - T$ in (8), where the square matrix $T$ (of the same dimension as $g_m$) is initialized to 0. We then perform gradient ascent steps on $T$ with the objective defined in Line 3 of Algorithm 1 and update the pair of weights; a minimal sketch of this update appears below.
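The sketch below (our own illustrative code, not the released implementation) performs one such teleportation on the pair $(W_2, W_1)$ of the three-layer regression, using the first-order parameterization $g \approx I + T$, $g^{-1} \approx I - T$ and the left pseudoinverse of $X$ in Eq. (8). With these approximations the loss is preserved only up to $O(\|T\|^2)$.

```python
import torch

def sigma(x):     return torch.where(x > 0, x, 0.1 * x)   # LeakyReLU, slope 0.1
def sigma_inv(x): return torch.where(x > 0, x, x / 0.1)   # its inverse

def loss(Ws, X, Y):
    W1, W2, W3 = Ws
    return ((Y - W3 @ sigma(W2 @ sigma(W1 @ X))) ** 2).sum()

def grad_norm_sq(Ws, X, Y):
    grads = torch.autograd.grad(loss(Ws, X, Y), Ws, create_graph=True)
    return sum((g ** 2).sum() for g in grads)

def teleport_pair(W1, W2, W3, X, Y, steps=10, lr=1e-2):
    d = W2.shape[1]                        # g acts on the output space of layer 1
    X_pinv = torch.linalg.pinv(X)          # left inverse of X (X has full column rank)
    T = torch.zeros(d, d, requires_grad=True)
    for _ in range(steps):
        g, g_inv = torch.eye(d) + T, torch.eye(d) - T         # first-order approximation
        W1_new = sigma_inv(g @ sigma(W1 @ X)) @ X_pinv         # Eq. (8), k = m - 1
        W2_new = W2 @ g_inv                                    # Eq. (8), k = m
        obj = grad_norm_sq((W1_new, W2_new, W3), X, Y)         # Line 3 of Algorithm 1
        (grad_T,) = torch.autograd.grad(obj, T)
        with torch.no_grad():
            T += lr * grad_T / (grad_T.norm() + 1e-12)         # normalized ascent step
    g, g_inv = torch.eye(d) + T.detach(), torch.eye(d) - T.detach()
    W1_new = (sigma_inv(g @ sigma(W1 @ X)) @ X_pinv).detach().requires_grad_(True)
    W2_new = (W2 @ g_inv).detach().requires_grad_(True)
    return W1_new, W2_new

torch.manual_seed(0)
X, Y = torch.randn(5, 4), torch.randn(8, 4)
W1 = torch.randn(6, 5, requires_grad=True)
W2 = torch.randn(7, 6, requires_grad=True)
W3 = torch.randn(8, 7, requires_grad=True)
W1t, W2t = teleport_pair(W1, W2, W3, X, Y)
print(loss((W1, W2, W3), X, Y).item(), loss((W1t, W2t, W3), X, Y).item())   # close (differs by O(||T||^2))
print(grad_norm_sq((W1, W2, W3), X, Y).item(),
      grad_norm_sq((W1t, W2t, W3), X, Y).item())                            # typically larger after teleport
```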
Figure 3: Multi-layer network optimization using gradient descent with and without teleportation. Teleportation reduces both the number of epochs and the total computational time required to reach convergence. At the same loss values, the teleported version has a larger gradient.

Figures 3a and 3b show the training curves plotted against epochs and wall-clock time; the shaded area denotes one standard deviation over 5 runs. Since GD and AdaGrad use different learning rates, they are not directly comparable; however, the addition of teleportation clearly improves both algorithms. Figures 3c and 3d show the squared norm of the gradient from a single run. Teleportation increases the magnitude of the gradient, and the trajectory with teleportation has a larger $dL/dt$ value at the same loss values, which demonstrates that teleportation finds a better trajectory.

MNIST classification. We apply symmetry teleportation to the MNIST classification task (Deng, 2012). We split the training set into 48,000 examples for training and 12,000 for validation. The input data has dimension $28 \times 28$ and is flattened into a vector. The output of the neural network has dimension 10, corresponding to the 10 digit classes. We use a three-layer neural network with hidden dimensions [512, 512], LeakyReLU activations, and the cross-entropy loss. The learning rate is $2 \times 10^{-3}$, and the learning rate for teleportation is $10^{-3}$. Each optimization algorithm is run for 80 epochs with a batch size of 20. Immediately after the first epoch, we apply teleportation using data from one mini-batch, and repeat this for 4 different mini-batches. For each mini-batch, 10 gradient ascent steps are used to optimize $g_m$.

Figure 4: MNIST classification using gradient descent with and without teleportation. Solid lines are training loss and dashed lines are validation loss.

Figures 4a and 4b show the effect of teleportation on the training and validation loss, and Figures 4c and 4d show the norm of the gradient during training. While teleportation significantly accelerates the decrease of the training loss in SGD, its effect on the validation loss is limited, and it is detrimental for AdaGrad. Therefore, teleportation on MNIST makes training faster at the beginning but leads to earlier overfitting and slightly worse validation accuracy. A possible reason is that regions with large gradients have sharp minima that do not generalize well.

### 6.2 Teleportation schedule

The effect of teleportation varies depending on its timing and frequency. Figure 5 shows the result of teleportation on MNIST with different hyperparameters. In all experiments, we use SGD with batch size 20, learning rate $2 \times 10^{-3}$, and 10 gradient ascent steps for each teleportation.

Figure 5: Teleportation (a) once at different epochs, (b) 5 times with different teleportation schedules, and (c) using different numbers of mini-batches. The lists in the legends of (a) and (c) denote the epoch numbers in the teleportation schedule where teleportation happens.

In Figure 5a, we randomly select 4 different mini-batches and apply teleportation on each of them individually, but at different epochs. Teleportation before training has the worst performance. After epoch 0, the effect of teleportation is stronger when it is applied earlier. In Figure 5b, we again apply teleportations on 4 randomly selected mini-batches, but repeat this with 5 different teleportation schedules (hyperparameter $K$ in Algorithm 2) as shown in the legend. All schedules have the same number of teleportations. Smaller intervals between teleportations accelerate convergence more significantly. In Figure 5c, we apply teleportation immediately after the first epoch, but use different numbers of mini-batches (hyperparameter $B$ in Algorithm 2) and teleport using each of them individually. Using more mini-batches to teleport leads to a faster decrease in training loss but is also more prone to overfitting.
### 6.3 Runtime analysis

The additional amount of computation introduced by teleportation depends on the implementation of Lines 2-5 of Algorithm 1. We discuss our implementation for teleporting multi-layer neural networks as an example. Teleporting each pair of adjacent weight matrices requires computing the inverse of the output from the previous layer. Denote the batch size by $n$, the largest dimension of all layers by $d_{\max}$, and the number of layers by $l$, and assume that $d_{\max} > n$. Computing the inverse of the output of each layer has complexity $O(d_{\max}^2 n)$, and computing the pseudoinverses for all layers has complexity $O(d_{\max}^2 n l)$. Note that all matrices we invert have dimensions at most $d_{\max} \times n$. For one gradient ascent step on $g$, the forward and backward pass both have complexity $O(d_{\max}^2 n l)$. This is the same as the forward and backward pass of gradient descent on $w$, because the architecture is the same except with approximately twice as many layers. We perform $t$ gradient ascent steps on $g$; therefore, the computational cost of one teleportation is $O(d_{\max}^2 n l t)$.

We show empirically that the runtime for teleportation scales polynomially with the matrix dimensions and linearly with the number of layers. We record the runtime of gradient descent on a Leaky-ReLU neural network for 300 epochs, with a 10-step teleportation every 10 epochs. Each pair of adjacent weight matrices is transformed by a group action during teleportation. Figure 6(a) shows the runtime for a 3-layer network using square weight and data matrices with different dimensions. Figure 6(b) shows the runtime for 128-by-128 weight and data matrices with different numbers of layers. Although the teleportation step has the same complexity as a gradient descent step, the runtime is dominated by teleportation due to larger constants hidden in the complexity analysis. The trade-off between teleportation time and convergence rate depends on the specific problem. In our experiments, the convergence rate is improved by a small number of teleportation steps, which does not add significant computational overhead.

Figure 6: Gradient descent training time on a Leaky-ReLU neural network for 300 epochs, with a 10-step teleportation every 10 epochs. (a) 3-layer network using square weight and data matrices with different dimensions. (b) Weights and data are all 128-by-128 matrices, but the network has different numbers of layers.

## 7 Discussion and Conclusions

We proposed a new optimization algorithm that exploits the symmetry in the loss landscape to improve convergence. It is interesting to note that optimizing $dL/dt$ locally sometimes leads to an improved global path. One example is the ellipse function (Figure 1), where teleporting once ensures that $dL/dt$ is optimal at every $L$ value along the trajectory. Another example is the matrix factorization problem $L(U, V) = \|Y - UV\|_2^2$. Tarmoun et al. (2021) show that the convergence rate increases with the imbalance $U^\top U - V V^\top$. Consider the transformation $U, V \to Ug,\, g^{-1}V$. To optimize $dL(U, V)/dt$ locally, we would need a large $g \in GL_n$, but a large $g$ also leads to a large $U^\top U - V V^\top$, which is positively correlated with the overall convergence rate of the entire trajectory. Hence teleportation is guaranteed to produce a better trajectory (a small numerical check of this invariance is sketched below). A potential future direction is to derive the exact expression for how teleportation affects the loss value at a later time in gradient flow, which may lead to a closed-form solution for the optimal teleportation destination.
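As a quick numerical illustration (our own; we read the imbalance as $U^\top U - V V^\top$), the snippet below confirms that the transformation $U, V \to Ug,\, g^{-1}V$ leaves the factorization loss unchanged while a large $g$ inflates the imbalance:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, n = 6, 3, 5
Y = rng.standard_normal((m, n))
U = rng.standard_normal((m, k))
V = rng.standard_normal((k, n))

def loss(U, V):                              # L(U, V) = ||Y - UV||^2
    return np.linalg.norm(Y - U @ V) ** 2

def imbalance(U, V):                         # ||U^T U - V V^T||_F
    return np.linalg.norm(U.T @ U - V @ V.T)

g = 5.0 * np.eye(k)                          # a "large" element of GL_k
Ug, Vg = U @ g, np.linalg.inv(g) @ V         # the transformation U, V -> Ug, g^{-1} V

print(np.isclose(loss(U, V), loss(Ug, Vg)))  # True: the loss is invariant
print(imbalance(U, V), imbalance(Ug, Vg))    # the imbalance grows under the large g
```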
Additionally, inspired by the landscape view of parameter space symmetry (Şimşek et al., 2021), teleportation using discrete (permutation) symmetries may allow us to reach a better minimum. Finally, additional theory can be developed to explain the relationship between teleportation and second-order methods under the approximations we introduced, especially to quantify the improvement in the overall convergence rate and to derive its effect on generalization bounds. Integrating teleportation with other advanced optimizers such as Adam and RMSprop would be another interesting future step.

## Acknowledgments and Disclosure of Funding

This work was supported in part by the U.S. Department of Energy, Office of Science; the U.S. Army Research Office under Grant W911NF-20-1-0334; a Google Faculty Award; an Amazon Research Award; and NSF Grants #2134274, #2107256 and #2134178. R. Walters is supported by the Roux Institute and the Harold Alfond Foundation. We are grateful to Iordan Ganev for many helpful discussions.

## References

- Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
- Marco Armenta and Pierre-Marc Jodoin. The representation theory of neural networks. Mathematics, 9(24), 2021.
- Marco Armenta, Thierry Judge, Nathan Painchaud, Youssef Skandarani, Carl Lemaire, Gabriel Gibeau Sanchez, Philippe Spino, and Pierre-Marc Jodoin. Neural teleportation. arXiv preprint arXiv:2012.01118, 2020.
- Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, pages 244-253. PMLR, 2018.
- Thomas Back. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, 1996.
- Vijay Badrinarayanan, Bamdev Mishra, and Roberto Cipolla. Symmetry-invariant optimization in deep networks. arXiv preprint arXiv:1511.01754, 2015.
- Robert Bamler and Stephan Mandt. Improving optimization for models with continuous symmetry breaking. In International Conference on Machine Learning, pages 423-432. PMLR, 2018.
- Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854, 2019.
- Patrick L. Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185-212. Springer, 2011.
- Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141-142, 2012.
- Simon S. Du, Wei Hu, and Jason D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. Neural Information Processing Systems, 2018.
- John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
- Iordan Ganev, Twan van Laarhoven, and Robin Walters. Universal approximation and model compression for radial neural networks. arXiv preprint arXiv:2107.02550, 2021.
- Grzegorz Głuch and Rüdiger Urbanke. Noether: The more things change, the more stay the same. arXiv preprint arXiv:2104.05508, 2021.
- Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In 35th International Conference on Machine Learning, 2018.
- Elad Hazan. Lecture notes: Optimization for machine learning. arXiv preprint arXiv:1909.03550, 2019.
- Momin Jamil and Xin-She Yang. A literature survey of benchmark functions for global optimisation problems. International Journal of Mathematical Modelling and Numerical Optimisation, 4(2):150-194, 2013.
- Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
- Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel L. K. Yamins, and Hidenori Tanaka. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics. In International Conference on Learning Representations, 2021.
- Serge Lang. Algebra. Graduate Texts in Mathematics. Springer, 2002.
- James Martens and Roger B. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning. PMLR, 2018.
- Qi Meng, Shuxin Zheng, Huishuai Zhang, Wei Chen, Zhi-Ming Ma, and Tie-Yan Liu. G-SGD: Optimizing ReLU neural networks in its positively scale-invariant space. International Conference on Learning Representations, 2019.
- Hancheng Min, Salma Tarmoun, René Vidal, and Enrique Mallada. On the explicit role of initialization on the convergence and implicit bias of overparametrized linear networks. In International Conference on Machine Learning. PMLR, 2021.
- Behnam Neyshabur, Russ R. Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, 2015.
- H. H. Rosenbrock. An automatic method for finding the greatest or least value of a function. The Computer Journal, 3(3):175-184, 1960.
- Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations, 2014.
- Berfin Şimşek, François Ged, Arthur Jacot, Francesco Spadaro, Clément Hongler, Wulfram Gerstner, and Johanni Brea. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. In International Conference on Machine Learning, pages 9722-9732. PMLR, 2021.
- Salma Tarmoun, Guilherme Franca, Benjamin D. Haeffele, and Rene Vidal. Understanding the dynamics of gradient flow in overparameterized linear models. In International Conference on Machine Learning, pages 10153-10161. PMLR, 2021.
- Twan Van Laarhoven. L2 regularization versus batch and weight normalization. Advances in Neural Information Processing Systems, 2017.

## Checklist

1. For all authors...
    - (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
    - (b) Did you describe the limitations of your work? [Yes]
    - (c) Did you discuss any potential negative societal impacts of your work? [No]
    - (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
    - (a) Did you state the full set of assumptions of all theoretical results? [Yes]
    - (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
    - (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
    - (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
    - (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
    - (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
    - (a) If your work uses existing assets, did you cite the creators? [N/A]
    - (b) Did you mention the license of the assets? [N/A]
    - (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
    - (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
    - (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
    - (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
    - (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
    - (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]