# Target-based Surrogates for Stochastic Optimization

Jonathan Wilder Lavington*1, Sharan Vaswani*2, Reza Babanezhad3, Mark Schmidt1,4, Nicolas Le Roux5,6

We consider minimizing functions for which it is expensive to compute the (possibly stochastic) gradient. Such functions are prevalent in reinforcement learning, imitation learning and adversarial training. Our target optimization framework uses the (expensive) gradient computation to construct surrogate functions in a target space (e.g. the logits output by a linear model for classification) that can be minimized efficiently. This allows for multiple parameter updates to the model, amortizing the cost of gradient computation. In the full-batch setting, we prove that our surrogate is a global upper-bound on the loss, and can be (locally) minimized using a black-box optimization algorithm. We prove that the resulting majorization-minimization algorithm ensures convergence to a stationary point of the loss. Next, we instantiate our framework in the stochastic setting and propose the SSO algorithm, which can be viewed as projected stochastic gradient descent in the target space. This connection enables us to prove theoretical guarantees for SSO when minimizing convex functions. Our framework allows the use of standard stochastic optimization algorithms to construct surrogates which can be minimized by any deterministic optimization method. To evaluate our framework, we consider a suite of supervised learning and imitation learning problems. Our experiments indicate the benefits of target optimization and the effectiveness of SSO.

*Equal contribution. 1University of British Columbia 2Simon Fraser University 3Samsung - SAIT AI Lab, Montreal 4Canada CIFAR AI Chair (Amii) 5Microsoft Research 6Canada CIFAR AI Chair (MILA). Correspondence to: Jonathan Wilder Lavington, Sharan Vaswani. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Introduction

Stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its variants (Duchi et al., 2011; Kingma and Ba, 2015) are ubiquitous optimization methods in machine learning (ML). For supervised learning, iterative first-order methods require computing the gradient over individual mini-batches of examples, and the cost of computing the gradient often dominates the total computational cost of these algorithms. For example, in reinforcement learning (RL) (Williams, 1992; Sutton et al., 2000) or online imitation learning (IL) (Ross et al., 2011), policy optimization requires gathering data via potentially expensive interactions with a real or simulated environment.

We focus on algorithms that access the expensive gradient oracle to construct a sequence of surrogate functions. Typically, these surrogates are chosen to be global upper-bounds on the underlying function, and hence minimizing the surrogate allows for iterative minimization of the original function. Algorithmically, these surrogate functions can be minimized efficiently without additional accesses to the gradient oracle, making this technique advantageous for the applications of interest. The technique of incrementally constructing and minimizing surrogate functions is commonly referred to as majorization-minimization and includes the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) as an example.
In RL, common algorithms (Schulman et al., 2015; 2017) also rely on minimizing surrogates. Typically, surrogate functions are constructed by using the convexity and/or smoothness properties of the underlying function. Such surrogates have been used in the stochastic setting (Mairal, 2013; 2015). The prox-linear algorithm (Drusvyatskiy, 2017) for instance, uses the composition structure of the loss function and constructs surrogate functions in the parametric space. Unlike these existing works, we construct surrogate functions over a well-chosen target space rather than the parametric space, leveraging the composition structure of the loss functions prevalent in ML to build better surrogates. For example, in supervised learning, typical loss functions are of the form h(θ) = ℓ(f(θ)), where ℓis (usually) a convex loss (e.g. the squared loss for regression or the logistic loss for classification), while f corresponds to a transformation (e.g. linear or high-dimensional, non-convex as in Target-based Surrogates for Stochastic Optimization the case of neural networks) of the inputs. Similarly, in IL, ℓmeasures the divergence between the policy being learned and the ground-truth expert policy, whereas f corresponds to a specific parameterization of the policy being learned. More formally, if Θ is the feasible set of parameters, f : Θ Z is a potentially non-convex mapping from the parametric space Θ Rd to the target space Z Rp and ℓ: Z R is a convex loss function. For example, for linear regression, h(θ) = 1 2 Xθ y 2 2, z = f(θ) = Xθ and ℓ(z) = 1 2 z y 2 2. In our applications of interest, computing zℓ(z) requires accessing the expensive gradient oracle, but θf(θ) can be computed efficiently. Unlike Nguyen et al. (2022) who exploit this composition structure to prove global convergence, we will use it to construct surrogate functions in the target space. Johnson and Zhang (2020) also construct surrogate functions using the target space, but require access to the (stochastic) gradient oracle for each model update, making the algorithm proposed inefficient in our setting. Moreover, unlike our work, Johnson and Zhang (2020) do not have theoretical guarantees in the stochastic setting. Concurrently with our work, Woodworth et al. (2023) also consider minimizing expensive-to-evaluate functions, but do so by designing proxy loss function which is similar to the original function. We make the following contributions. Target smoothness surrogate: In Section 3, we use the smoothness of ℓwith respect to z in order to define the target smoothness surrogate and prove that it is a global upper-bound on the underlying function h. In particular, these surrogates are constructed using tighter bounds on the original function. This ensures that additional progress can be made towards the minimizer using multiple model updates before recomputing the expensive gradient. Target optimization in the deterministic setting: In Section 3, we devise a majorization-minimization algorithm where we iteratively form the target smoothness surrogate and then (locally) minimize it using any black-box algorithm. Although forming the target smoothness surrogate requires access to the expensive gradient oracle, it can be minimized without additional oracle calls resulting in multiple, computationally efficient updates to the model. We refer to this framework as target optimization. 
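To make the composition structure $h(\theta) = \ell(f(\theta))$ concrete, the following minimal sketch (synthetic least-squares data; all variable names are illustrative) separates the cheap map $f(\theta) = X\theta$ from a loss $\ell$ whose gradient is treated as the expensive oracle. By the chain rule, $\nabla_\theta h(\theta) = \nabla f(\theta)^\top \nabla_z \ell(f(\theta))$, so a single expensive call to $\nabla_z \ell$ can be reused while $f$ and its Jacobian are evaluated cheaply.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))            # features: n = 100, d = 5
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)

def f(theta):
    """Cheap model map from parameters to targets, z = f(theta) = X theta."""
    return X @ theta

def loss_and_grad_z(z):
    """'Expensive' oracle: squared loss and its gradient in the target space."""
    return 0.5 * np.sum((z - y) ** 2), (z - y)

theta = np.zeros(5)
z = f(theta)
ell, grad_z = loss_and_grad_z(z)             # one expensive oracle call

# Chain rule: grad_theta h = J_f(theta)^T grad_z ell = X^T grad_z ell.
grad_theta = X.T @ grad_z
# Sanity check against the usual least-squares gradient X^T (X theta - y).
assert np.allclose(grad_theta, X.T @ (X @ theta - y))
```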
This idea of constructing surrogates in the target space has been recently explored in the context of designing efficient off-policy algorithms for reinforcement learning (Vaswani et al., 2021). However, unlike our work, Vaswani et al. (2021) do not consider the stochastic setting, or provide theoretical convergence guarantees. In Algorithm 1, we instantiate the target optimization framework and prove that it converges to a stationary point at a sublinear rate (Lemma C.1 in Appendix C). Stochastic target smoothness surrogate: In Section 4, we consider the setting where we have access to an expensive stochastic gradient oracle that returns a noisy, but unbiased estimate of the true gradient. Similar to the deterministic setting, we access the gradient oracle to form a stochastic target smoothness surrogate. Though the surrogate is constructed by using a stochastic gradient in the target space, it is a deterministic function with respect to the parameters and can be minimized using any standard optimization algorithm. In this way, our framework disentangles the stochasticity in zℓ(z) (in the target space) from the potential non-convexity in f (in the parametric space). Target optimization in the stochastic setting: Similar to the deterministic setting, we use m steps of GD to minimize the stochastic target smoothness surrogate and refer to the resulting algorithm as stochastic surrogate optimization (SSO). We then interpret SSO as inexact projected SGD in the target space. This interpretation of SSO allows the use of standard stochastic optimization algorithms to construct surrogates which can be minimized by any deterministic optimization method. Minimizing surrogate functions in the target space is also advantageous since it allows us to choose the space in which to constrain the size of the updates. Specifically, for overparameterized models such as deep neural networks, there is only a loose connection between the updates in the parameter and target space. In order to directly constrain the updates in the target space, methods such as natural gradient (Amari, 1998; Kakade, 2001) involve computationally expensive operations. In comparison, SSO has direct control over the updates in the target space and can also be implemented efficiently. Theoretical results for SSO: Assuming h(θ) to be smooth, strongly-convex, in Section 4.1, we prove that SSO (with a constant step-size in the target space) converges linearly to a neighbourhood of the true minimizer (Theorem 4.2). In Proposition 4.3, we prove that the size of this neighbourhood depends on the noise (σ2 z) in the stochastic gradients in the target space and an error term ζ2 that depends on the dissimilarity in the stochastic gradients at the optimal solution. In Proposition 4.4, we provide a quadratic example that shows the necessity of this error term in general. However, as the size of the mini-batch increases or the model is overparameterized enough to interpolate the data (Schmidt and Le Roux, 2013; Vaswani et al., 2019a), ζ2 t becomes smaller. In the special case when interpolation is exactly satisfied, we prove that SSO with O (log(1/ϵ)) iterations is sufficient to guarantee convergence to an ϵ-neighbourhood of the true minimizer. Finally, we argue that SSO can be more efficient than using parametric SGD for expensive gradient oracles and common loss functions (Section 4.2). Experimental evaluation: To evaluate our target optimization framework, we consider online imitation learning (OIL) as our primary example. 
For policy optimization in OIL, Target-based Surrogates for Stochastic Optimization computing zℓinvolves gathering data through interaction with a computationally expensive simulated environment. Using the Mujoco benchmark suite (Todorov et al., 2012) we demonstrate that SSO results in superior empirical performance (Section 5). We then consider standard supervised learning problems where we compare SSO with different choices of the target surrogate to standard optimization methods. These empirical results indicate the practical benefits of target optimization and SSO. 2. Problem Formulation We focus on minimizing functions that have a composition structure and for which the gradient is expensive to compute. Formally, our objective is to solve the following problem: minθ Θ h(θ) := ℓ(f(θ)) where Θ Rd, Z Rp, f : Θ Z and ℓ: Z R. Throughout this paper, we will assume that h is Lθ-smooth in the parameters θ and that ℓ(z) is L-smooth in the targets z. For all generalized linear models including linear and logistic regression, f = X θ is a linear map in θ and ℓis convex in z. Example: For logistic regression with features X Rn d and labels y { 1, +1}n, h(θ) = Pn i=1 log (1 + exp( yi Xi, θ )). If we target the logits1, then Z = {z|z = Xθ} Rn and ℓ(z) = Pn i=1 log (1 + exp( yizi)). In this case, Lθ is the maximum eigenvalue of X TX, whereas L = 1 4. A similar example follows for linear regression. In settings that use neural networks, it is typical for the function f mapping X to y to be non-convex, while for the loss ℓto be convex. For example in OIL, the target space is the space of parameterized policies, where ℓis the cumulative loss when using a policy distribution π := f(θ) who s density is parameterized by θ. Though our algorithmic framework can handle non-convex ℓand f, depending on the specific setting, our theoretical results will assume that ℓ(or h) is (strongly)-convex in θ and f is an affine map. For our applications of interest, computing zℓ(z) is computationally expensive, whereas f(θ) (and its gradient) can be computed efficiently. For example, in OIL, computing the cumulative loss ℓ(and the corresponding gradient zℓ(z)) for a policy involves evaluating it in the environment. Since this operation involves interactions with the environment or a simulator, it is computationally expensive. On the other hand, the cost of computing θf(θ) only depends on the policy parameterization and does not involve additional interactions with the environment. In some cases, it is more natural to consider access to a stochastic gradient oracle that returns a noisy, unbiased gradient ℓ(z) 1The target space is not unique. For example, we could directly target the classification probabilities for logistic regression resulting in different f and ℓ. such that E[ ℓ(z)] = ℓ(z). We consider the effect of stochasticity in Section 4. If we do not take advantage of the composition structure nor explicitly consider the cost of the gradient oracle, iterative first-order methods such as GD or SGD can be directly used to minimize h(θ). At iteration t [T], the parametric GD update is: θt+1 = θt η ht(θt) where η is the step-size to be selected or tuned according to the properties of h. Since h is Lθ-smooth, each iteration of parametric GD can be viewed as exactly minimizing the quadratic surrogate function derived from the smoothness condition with respect to the parameters. Specifically, θt+1 := arg min gp t (θ) where gp t is the parametric smoothness surrogate: gp t (θ) := h(θt) + h(θt), θ θt + 1 2η θ θt 2 2. 
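As a quick numerical sanity check of this view (a sketch on synthetic logistic-regression data, with SciPy used as a stand-in black-box minimizer and $\eta = 1/L_\theta$ chosen as in the example above), minimizing the quadratic surrogate $g^p_t$ recovers the gradient-descent step $\theta_t - \eta \nabla h(\theta_t)$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))      # labels in {-1, +1}

def h(theta):
    """Logistic loss written as ell(f(theta)) with logits z = X theta."""
    z = X @ theta
    return np.sum(np.log1p(np.exp(-y * z)))

def grad_h(theta):
    z = X @ theta
    grad_z = -y / (1.0 + np.exp(y * z))      # gradient of ell in the target space
    return X.T @ grad_z                      # chain rule back to parameters

L_theta = np.linalg.eigvalsh(X.T @ X).max()  # parameter-space smoothness scale
eta = 1.0 / L_theta
theta_t = rng.standard_normal(d)

def g_p(theta):
    """Parametric smoothness surrogate around theta_t."""
    diff = theta - theta_t
    return h(theta_t) + grad_h(theta_t) @ diff + 0.5 / eta * diff @ diff

# Minimizing the quadratic surrogate recovers the gradient-descent step.
theta_surrogate = minimize(g_p, theta_t).x
theta_gd = theta_t - eta * grad_h(theta_t)
print(np.max(np.abs(theta_surrogate - theta_gd)))   # ~0 up to solver tolerance
```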
The quadratic surrogate is tight at $\theta_t$, i.e. $h(\theta_t) = g^p_t(\theta_t)$, and becomes looser as we move away from $\theta_t$. For $\eta \leq \frac{1}{L_\theta}$, the surrogate is a global (for all $\theta$) upper-bound on $h$, i.e. $g^p_t(\theta) \geq h(\theta)$. Minimizing the global upper-bound results in descent on $h$ since $h(\theta_{t+1}) \leq g^p_t(\theta_{t+1}) \leq g^p_t(\theta_t) = h(\theta_t)$. Similarly, the parametric SGD update consists of accessing the stochastic gradient oracle to obtain $\tilde{h}(\theta), \nabla \tilde{h}(\theta)$ such that $\mathbb{E}[\tilde{h}(\theta)] = h(\theta)$ and $\mathbb{E}[\nabla \tilde{h}(\theta)] = \nabla h(\theta)$, and iteratively constructing the stochastic parametric smoothness surrogate $\tilde{g}^p_t(\theta)$. Specifically, $\theta_{t+1} = \arg\min \tilde{g}^p_t(\theta)$, where $\tilde{g}^p_t(\theta) := \tilde{h}(\theta_t) + \langle \nabla \tilde{h}(\theta_t), \theta - \theta_t \rangle + \frac{1}{2\eta_t} \|\theta - \theta_t\|_2^2$. Here, $\eta_t$ is the iteration-dependent step-size, decayed according to properties of $h$ (Robbins and Monro, 1951). In contrast to these methods, in the next section, we exploit the smoothness of the losses with respect to the target space and propose a majorization-minimization algorithm in the deterministic setting.

3. Deterministic Setting

We consider minimizing $\ell(f)$ in the deterministic setting where we can exactly evaluate the gradient $\nabla_z \ell(z)$. Similar to the parametric case in Section 2, we use the smoothness of $\ell(z)$ w.r.t the target space and define the target smoothness surrogate around $z_t$ as: $\ell(z_t) + \langle \nabla_z \ell(z_t), z - z_t \rangle + \frac{1}{2\eta} \|z - z_t\|_2^2$, where $\eta$ is the step-size in the target space and will be determined theoretically. Since $z = f(\theta)$, the surrogate can be expressed as a function of $\theta$ as: $g_t(\theta) := \ell(z_t) + \langle \nabla_z \ell(z_t), f(\theta) - z_t \rangle + \frac{1}{2\eta} \|f(\theta) - z_t\|_2^2$, which in general is not quadratic in $\theta$.

Example: For linear regression, $f(\theta) = X\theta$ and $g_t(\theta) = \frac{1}{2} \|X\theta_t - y\|_2^2 + \langle X\theta_t - y, X(\theta - \theta_t) \rangle + \frac{1}{2\eta} \|X(\theta - \theta_t)\|_2^2$.

Similar to the parametric smoothness surrogate, we see that $h(\theta_t) = \ell(f(\theta_t)) = g_t(\theta_t)$. If $\ell$ is $L$-smooth w.r.t the target space $\mathcal{Z}$, then for $\eta \leq \frac{1}{L}$, we have $g_t(\theta) \geq \ell(f(\theta)) = h(\theta)$ for all $\theta$, that is, the surrogate is a global upper-bound on $h$. Since $g_t$ is a global upper-bound on $h$, similar to GD, we can minimize $h$ by minimizing the surrogate at each iteration, i.e. $\theta_{t+1} = \arg\min_\theta g_t(\theta)$. However, unlike GD, in general, there is no closed-form solution for the minimizer of $g_t$, and we will consider minimizing it approximately.

Algorithm 1: (Stochastic) Surrogate optimization
Input: $\theta_0$ (initialization), $T$ (number of iterations), $m_t$ (number of inner-loops), $\eta$ (step-size for the target space), $\alpha$ (step-size for the parametric space)
for $t = 0$ to $T - 1$ do
  Access the (stochastic) gradient oracle to construct $g_t(\theta)$
  Initialize inner-loop: $\omega_0 = \theta_t$
  for $k = 0$ to $m_t - 1$ do
    $\omega_{k+1} = \omega_k - \alpha \nabla_\omega g_t(\omega_k)$
  end for
  $\theta_{t+1} = \omega_{m_t}$ ; $z_{t+1} = f(\theta_{t+1})$
end for
Return $\theta_T$

This results in the following meta-algorithm: for each $t \in [T]$, at iterate $\theta_t$, form the surrogate $g_t$ and compute $\theta_{t+1}$ by (approximately) minimizing $g_t(\theta)$. This meta-algorithm enables the use of any black-box algorithm to minimize the surrogate at each iteration. In Algorithm 1, we instantiate this meta-algorithm by minimizing $g_t(\theta)$ using $m \geq 1$ steps of gradient descent. For $m = 1$, Algorithm 1 results in the following update: $\theta_{t+1} = \theta_t - \alpha \nabla g_t(\theta_t) = \theta_t - \alpha \nabla h(\theta_t)$, and is thus equivalent to parametric GD with step-size $\alpha$.

Example: For linear regression, instantiating Algorithm 1 with $m = 1$ recovers parametric GD on the least-squares objective. On the other hand, minimizing the surrogate exactly (corresponding to $m = \infty$) to compute $\theta_{t+1}$ results in the following update: $\theta_{t+1} = \theta_t - \eta (X^\top X)^{-1} X^\top (X\theta_t - y)$ and recovers the Newton update in the parameter space.
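The sketch below instantiates Algorithm 1 for the deterministic linear-regression example above (synthetic, ill-conditioned data; step-size choices are illustrative). Each outer iteration calls the expensive oracle once to build $g_t$ and then takes $m$ cheap GD steps on the surrogate; $m = 1$ reduces to parametric GD, while larger $m$ moves the update towards the preconditioned step described above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 20
X = rng.standard_normal((n, d)) * np.linspace(0.1, 3.0, d)   # ill-conditioned features
y = X @ rng.standard_normal(d)

def grad_ell(z):
    """'Expensive' oracle: gradient of ell(z) = 0.5 ||z - y||^2 in the target space."""
    return z - y

def target_opt(m, T=50, eta=1.0):
    """Algorithm 1: the outer loop builds g_t, the inner loop takes m GD steps on it."""
    L_g = np.linalg.eigvalsh(X.T @ X).max() / eta    # smoothness of g_t w.r.t. theta
    alpha = 1.0 / L_g
    theta = np.zeros(d)
    for _ in range(T):
        z_t = X @ theta
        g_z = grad_ell(z_t)                          # one oracle call per outer step
        omega = theta.copy()
        for _ in range(m):                           # cheap inner updates, no oracle calls
            # gradient of g_t(omega) = X^T [g_z + (X omega - z_t) / eta]
            omega -= alpha * (X.T @ (g_z + (X @ omega - z_t) / eta))
        theta = omega
    return 0.5 * np.linalg.norm(X @ theta - y) ** 2

for m in (1, 10, 100):
    print(m, target_opt(m))     # loss after T oracle calls improves with larger m
```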
In this case, approximately minimizing gt using m (1, ) steps of GD interpolates between a first and second-order method in the parameter space. In this case, our framework is similar to quasi-Newton methods (Nocedal and Wright, 1999) that attempt to model the curvature in the loss without explicitly modelling the Hessian. Unlike these methods, our framework does not have an additional memory overhead. In Lemma C.1 in Appendix C, we prove that Algorithm 1 with any value of m 1 and appropriate choices of α and η results in an O(1/T) convergence to a stationary point of h. Importantly, this result only relies on the smoothness of ℓand gz t , and does not require either ℓ(z) or f(θ) to be convex. Hence, this result holds when using a non-convex model such as a deep neural networks, or for problems with non-convex loss functions such as in reinforcement learning. In the next section, we extend this framework to consider the stochastic setting where we can only obtain a noisy (though unbiased) estimate of the gradient. 4. Stochastic Setting In the stochastic setting, we use the noisy but unbiased estimates ( ℓ(z), ℓ(z)) from the gradient oracle to construct the stochastic target surrogate. To simplify the theoretical analysis, we will focus on the special case where ℓis a finite-sum of losses i.e. ℓ(z) = 1 n Pn i=1 ℓi(z). In this case, querying the stochastic gradient oracle at iteration t returns the individual loss and gradient corresponding to the loss index it i.e. ℓ(z), ℓ(z) = (ℓit(z), ℓit(z)). This structure is present in the use-cases of interest, for example, in supervised learning when using a dataset of n training points or in online imitation learning where multiple trajectories are collected using the policy at iteration t. In this setting, the deterministic surrogate is gt(θ) := 1 n h P [ℓi(z) + ℓi(zt), f(θ) zt ] + 1 2ηt f(θ) zt 2 2 i In order to admit an efficient implementation of the stochastic surrogate and the resulting algorithms, we only consider loss functions that are separable w.r.t the target space, i.e. for z Z, if zi R denotes coordinate i of z, then ℓ(z) = 1 n P i ℓi(zi). For example, this structure is present in the loss functions for all supervised learning problems where Z Rn and zi = fi(θ) := f(Xi, θ). In this setting, ℓi zj = 0 for all i and j = i and the stochastic target surrogate is defined as gt(θ) := ℓit(zt) + ℓit(zt) zit fit(θ) zit t + 1 2ηt fit(θ) zit t 2, where ηt is the target space step-size at iteration t. Note that gt(θ) only depends on it and only requires access to ℓit(zt) (meaning it can be constructed efficiently). We make two observations about the stochastic surrogate: (i) unlike the parametric stochastic smoothness surrogate gp that uses it to form the stochastic gradient, gt(θ) uses it (the same random sample) for both the stochastic gradient ℓit(zt) and the regularization fit(θ) zit t 2; (ii) while Eit[ gt(θ)] = gt(θ), Eit[arg min gt(θ)] = arg min gt(θ), in contrast to the parametric case where E[arg min gp t (θ)] = arg min gp t (θ). Example: For linear regression, zi = fi(θ) = Xiθ and ℓi(zi) = 1 2(zi yi)2. In this case, the stochastic target surrogate is equal to gt(θ) = 1 2 (Xitθt yit)2 + [Xitθt yit] Xit(θ θt) + 1 2ηt Xit(θ θt) 2 2. Similar to the deterministic setting, the next iterate can be obtained by (approximately) minimizing gt(θ). Algorithmically, we can form the surrogate gt at iteration t and minimize it approximately by using any black-box algorithm. We refer to the resulting framework as stochastic surrogate optimization (SSO). 
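A minimal sketch of one SSO outer step for this linear-regression example (illustrative data and step-sizes): the sampled index $i_t$ is used both for the stochastic gradient and for the proximity term, and once $i_t$ is drawn, $\tilde{g}_t$ is a deterministic function of $\theta$ that is minimized with a few inner GD steps as in Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

def sso_step(theta_t, i_t, eta_t, m=10):
    """One outer iteration of SSO for least squares: build and approximately minimize g_t."""
    x_i, y_i = X[i_t], y[i_t]
    z_i_t = x_i @ theta_t                       # current target coordinate z_t^{i_t}
    g_i = z_i_t - y_i                           # stochastic gradient of ell_{i_t} at z_t

    def grad_surrogate(theta):
        # g_t(theta) = ell_{i_t}(z_t^{i_t}) + g_i (x_i theta - z_i_t)
        #              + (x_i theta - z_i_t)^2 / (2 eta_t); its gradient w.r.t. theta:
        return (g_i + (x_i @ theta - z_i_t) / eta_t) * x_i

    alpha = eta_t / (x_i @ x_i)                 # 1 / smoothness of the one-sample surrogate
    omega = theta_t.copy()
    for _ in range(m):                          # deterministic inner loop, no new samples
        omega -= alpha * grad_surrogate(omega)
    return omega

theta = np.zeros(d)
for t in range(2000):
    theta = sso_step(theta, rng.integers(n), eta_t=0.5)
print(0.5 * np.mean((X @ theta - y) ** 2))      # mean squared error after SSO
```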
For example, we can minimize gt using m steps of GD. The resulting algorithm is the same as Algorithm 1 but uses gt (the changes to the algorithm are highlighted in green). Note that the surrogate depends on Target-based Surrogates for Stochastic Optimization the randomly sampled it and is therefore random. However, once the surrogate is formed, it can be minimized using any deterministic algorithm, i.e. there is no additional randomness in the inner-loop in Algorithm 1. Moreover, for the special case of m = 1, Algorithm 1 has the same update as parametric stochastic gradient descent. Previous work like the retrospective optimization framework in Newton et al. (2021) also considers multiple updates on the same batch of examples. However, unlike Algorithm 1, which forces proximity between consecutive iterates in the target space, Newton et al. (2021) use a specific stopping criterion in every iteration and consider a growing batch-size. In order to prove theoretical guarantees for SSO, we interpret it as projected SGD in the target space. In particular, we prove the following equivalence in Appendix D.1. Lemma 4.1. The following updates are equivalent: (1) θt+1 = arg min θ gt(θ); (SSO) z(1) t+1 = f( θt+1) (2) zt+1/2 = zt ηt zℓit(zt) ; (Target-space SGD) z(2) t+1 = arg min z Z zt+1/2 z 2 Pt where, Pt Rp p is a random diagonal matrix such that Pt(it, it) = 1 and Pt(j, j) = 0 for all j = it. That is, SSO (1) and target space SGD (2), result in the same iterate in each step i.e. if zt = f(θt), then zt+1 := z(1) t+1 = z(2) t+1. The second step in target-space SGD corresponds to the projection (using randomly sampled index it) onto Z2. Using this equivalence enables us to interpret the inexact minimization of gt (for example, using m steps of GD in Algorithm 1) as an inexact projection onto Z and will be helpful to prove convergence guarantees for SSO. The above interpretation also enables us to use the existing literature on SGD (Robbins and Monro, 1951; Li et al., 2021; Vaswani et al., 2019b) to specify the step-size sequence {ηt}T t=1 in the target space, completing the instantiation of the stochastic surrogate. Moreover, alternative stochastic optimization algorithms such as follow the regularized leader (Abernethy et al., 2009) and adaptive gradient methods like Ada Grad (Duchi et al., 2011), online Newton method (Hazan et al., 2007), and stochastic mirror descent (Bubeck et al., 2015, Chapter 6) in the target space result in different stochastic surrogates (refer to Appendix B) that can then be optimized using a black-box deterministic algorithm. Next, we prove convergence guarantees for SSO when ℓ(z) is a smooth, strongly-convex function. 2For linear parameterization, zit = Xit, θ and the set Z is convex. For non-convex f, the set Z can be non-convex and the projection is not well-defined. However, gt can still be minimized, albeit without any guarantees on the convergence. 4.1. Theoretical Results For the setting where ℓ(z) is a smooth and strongly-convex function, we first analyze the convergence of inexact projected SGD in the target space (Section 4.1.1), and then bound the projection errors in in Section 4.1.2. 4.1.1. CONVERGENCE ANALYSIS For the theoretical analysis, we assume that the choice of f ensures that the projection is well-defined (e.g. for linear parameterization where f = X Tθ). We define zt+1 := f( θt+1), where θt+1 := arg min qt(θ) and qt := ℓit(zt) + ℓit(zt) zit fit(θ) zit t + 1 2η t f(θ) zt 2 2. In order to ensure that E[ qt(θ)] = gt(θ), we will set η t = ηt n. 
Note that qt is similar to gt, but there is no randomness in the regularization term. Analogous to zt+1, zt+1 can be interpreted as a result of projected (where the projection is w.r.t ℓ2-norm) SGD in the target space (Lemma D.1). Note that qt is only defined for the analysis of SSO. For the theoretical analysis, it is convenient to define ϵt+1 := zt+1 zt+1 2 as the projection error at iteration t. Here, zt+1 = f(θt+1) where θt+1 is obtained by (approximately) minimizing gt. Note that the projection error incorporates the effect of both the random projection (that depends on it) as well as the inexact minimization of gt, and will be bounded in Section 4.1.2. In the following lemma (proved in Appendix D.2), we use the equivalence in Lemma 4.1 to derive the following guarantee for two choices of ηt: (a) constant and (b) exponential step-size (Li et al., 2021; Vaswani et al., 2022). Theorem 4.2. Assuming that (i) ℓit is L-smooth and convex, (ii) ℓis µ-strongly convex, (iii) z = arg minz Z ℓ(z) (iv) and that for all t, ϵt ϵ, T iterations of SSO result in the following bound for z T = f(θT ), E z T +1 z YT i=t+1 ρiη t 2 1/2 where ρt = (1 µη t), σ2 := E h ℓ(z ) ℓt(z ) 2 2 i , (a) Constant step-size: When ηt = 1 2L n and for ρ = 1 µ E z T +1 z z1 z 1 1 + σ µL + 2ϵ (b) Exponential step-size: When ηt = 1 2L nαt for α = Target-based Surrogates for Stochastic Optimization E z T +1 z c1 exp T 4κ α ln(T/β) + 4κc1(ln(T/β)) T σ + 2ϵ c2 . where c1 = exp 1 4κ 2β ln(T/β) , and c2 = exp β ln(T ) 2κ ln(T/β) In the above result, σ2 is the natural analog of the noise in the unconstrained case. In the unconstrained case, ℓ(z ) = 0 and we recover the standard notion of noise used in SGD analyses (Bottou et al., 2018; Gower et al., 2019). Unlike parametric SGD, both σ2 and κ do not depend on the specific model parameterization, and only depend on the properties of ℓin the target space. For both the constant and exponential step-size, the above lemma generalizes the SGD proofs in Bottou et al. (2018) and Li et al. (2021); Vaswani et al. (2022) to projected SGD and can handle inexact projection errors similar to Schmidt et al. (2011). For constant step-size, SSO results in convergence to a neighbourhood of the minimizer where the neighbourhood depends on σ2 and the projection errors ϵ2 t. For the exponential step-sizes, SSO results in a noise-adaptive (Vaswani et al., 2022) O exp T T convergence to a neighbourhood of the solution which only depends on the projection errors. In the next section, we bound the projection errors. 4.1.2. CONTROLLING THE PROJECTION ERROR In order to complete the theoretical analysis of SSO, we need to control the projection error ϵt+1 = zt+1 zt+1 2 that can be decomposed into two components as follows: ϵt+1 zt+1 zt+1 2 + zt+1 zt+1 2, where zt+1 = f( θt+1) and θt+1 is the minimizer of gt. The first part of this decomposition arises because of using a stochastic projection that depends on it (in the definition of zt+1) versus a deterministic projection (in the definition of zt+1). The second part of the decomposition arises because of the inexact minimization of the stochastic surrogate, and can be controlled by minimizing gt to the desired tolerance. In order to bound ϵt+1, we will make the additional assumption that f is Lf-Lipschitz in θ.3 The following proposition (proved in Appendix D.3) bounds ϵt+1. Proposition 4.3. 
Assuming that (i) f is Lf-Lipschitz continuous, (ii) gt is µg strongly-convex and Lg smooth with κg := Lg/µg4, (iii) qt is µq strongly-convex (iv) ζ2 t := 8 min{µg,µq} ([min {Eit [ gt]} Eit [min { gt}]]) + 8 min{µg,µq} [min {Eit [ qt]} Eit [min { qt}]] , (v) 3For example, when using a linear parameterization, Lf = X This property is also satisfied for certain neural networks. 4We consider strongly-convex functions for simplicity, and it should be possible to extend these results to when g is non-convex but satisfies the PL inequality, or is weakly convex. σ2 z := minz Z {Eit [ℓit(z)]} Eit [minz Z {ℓit(z)}], if gt is minimized using mt inner-loops of GD with the appropriate step-size, then, E[ϵ2 t+1] L2 f ζ2 t + 4L2 f µg exp ( mt/κg) E[ℓ(zt) ℓ(z )] + σ2 z . Note that the definition of both ζ2 t and σ2 z is similar to that used in the SGD analysis in Vaswani et al. (2022), and can be interpreted as variance terms. In particular, using strong-convexity, min {Eit [ gt]} Eit [min { gt}] 2 µg E[ gt(θ t+1) E[ gt(θ t+1)] 2 2] which is the variance in gt(θ t+1). Similarly, we can interpret the other term in the definition of ζ2 t . The first term L2 f ζ2 t because of the stochastic versus deterministic projection, whereas the second term 4L2 f/µg exp ( mt/κg) E[ℓ(zt) ℓ(z )] + σ2 z arises because of the inexact minimization of the stochastic surrogate. Setting mt = O(log(1/ϵ)) can reduce the second term to O(ϵ). When sampling a mini-batch of examples in each iteration of SSO, both ζ2 t and σ2 z will decrease as the batch-size increases (because of the standard sampling-withreplacement bounds (Lohr, 2019)) becoming zero for the full-batch. Another regime of interest is when using overparameterized models that can interpolate the data (Schmidt and Le Roux, 2013; Ma et al., 2018; Vaswani et al., 2019a). The interpolation condition implies that the stochastic gradients become zero at the optimal solution, and is satisfied for models such as non-parametric regression (Liang and Rakhlin, 2018; Belkin et al., 2019) and over-parametrized deep neural networks (Zhang et al., 2017). Under this condition, ζ2 t = σ2 z = 0 (Vaswani et al., 2020; Loizou et al., 2021). From an algorithmic perspective, the above result implies that in cases where the noise dominates, using large m for SSO might not result in substantial improvements. However, when using a large batch-size or over-parameterized models, using large m can result in the superior performance. Given the above result, a natural question is whether the dependence on ζ2 t is necessary in the general (noninterpolation) setting. To demonstrate this, we construct an example (details in Appendix D.4) and show that even for a sum of two one-dimensional quadratics, when minimizing the surrogate exactly i.e. m = and for any sequence of convergent step-sizes, SSO will converge to a neighborhood of the solution. Proposition 4.4. Consider minimizing the sum h(θ) := h1(θ)+h2(θ) 2 of two one-dimensional quadratics, h1(θ) := 1 2(θ 1)2 and h2(θ) = 1 2 (2θ + 1/2)2, using SSO with mt = and ηt = c αt for any sequence of αt and any constant c (0, 1]. SSO results in convergence to a neighbourhood of the solution, specifically, if θ is the minimizer of h and θ1 > 0, then, E(θT θ ) min θ1, 3 In order to show the above result, we use the fact that for quadratics, SSO (with m = ) is equivalent to the subsampled Newton method and in the one-dimensional case, Target-based Surrogates for Stochastic Optimization we can recover the example in Vaswani et al. (2022). 
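This bias can be checked numerically. The sketch below (the $1/\sqrt{t}$ schedule, the number of iterations, and the initialization are illustrative choices) runs the sub-sampled Newton update which, as noted above, coincides with SSO with $m = \infty$ for quadratics; in this simulation the iterates settle near $3/8$ rather than at the true minimizer $\theta^\star = 0$.

```python
import numpy as np

rng = np.random.default_rng(4)

# h_1(t) = 0.5 (t - 1)^2 and h_2(t) = 0.5 (2 t + 0.5)^2, so h = (h_1 + h_2)/2
# has its minimum at theta* = 0.
grads = [lambda t: t - 1.0, lambda t: 4.0 * t + 1.0]
hessians = [1.0, 4.0]

def sso_exact(theta0, T=20000):
    """SSO with m = infinity on a sampled quadratic, i.e. a sub-sampled Newton step."""
    theta = theta0
    for t in range(1, T + 1):
        i = rng.integers(2)
        eta = 1.0 / np.sqrt(t)                  # a decreasing schedule shows the same bias
        theta -= eta * grads[i](theta) / hessians[i]
    return theta

runs = [sso_exact(theta0=1.0) for _ in range(20)]
print(np.mean(runs))    # concentrates near 0.375, not at the true minimizer 0
```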
Since the above example holds for m = , we conclude that this bias is not because of the inexact surrogate minimization. Moreover, since the example holds for all step-sizes including any decreasing step-size, we can conclude that the optimization error is not a side-effect of the stochasticity. In order to avoid such a bias term, sub-sampled Newton methods use different batches for computing the sub-sampled gradient and Hessian, and either use an increasing batch-size (Bollapragada et al., 2019) or consider using over-parameterized models (Meng et al., 2020). Agarwal et al. (2020) also prove theoretical guarantees when doing multiple SGD steps on the same batch. In contrast to our work, their motivation is to analyze the performance of data-echoing (Choi et al., 2019). From a technical perspective, they consider (i) updates in the parameteric space, and (ii) their inner-loop step-size decreases as m increases. Finally, we note our framework and subsequent theoretical guarantees would also apply to this setting. 4.2. Benefits of Target Optimization In order to gain intuition about the possible benefits of target optimization, let us consider the simple case where each hi is µθ-strongly convex, Lθ-smooth and κθ = Lθ/µθ. For convenience, we define ζ2 := maxt [T ] ζ2 t . Below, we show that under certain regimes, for ill-conditioned least squares problems (where κθ >> 1), target optimization has a provable advantage over parametric SGD. Example: For the least squares setting, κθ is the condition number of the X TX matrix. In order to achieve an ϵ sub-optimality, assuming complete knowledge of σ2 θ := E hi(θ ) 2 2 and all problemdependent constants, parametric SGD requires Tparam = O max κθ log 1/ϵ , σ2 θ/µ2ϵ iterations (Gower et al., 2019, Theorem 3.1) where the first term is the bias term and the second term is the effect of the noise. In order to achieve an ϵ sub-optimality for SSO, we require that ζ2 ϵ (for example, by using a large enough batchsize) and O(κθ log(1/ϵ)) inner iterations. Similar to the parametric case, the number of outer iterations for SSO is Ttarget = O max log 1/ϵ , σ2/ϵ , since the condition number w.r.t the targets is equal to 1. In order to model the effect of an expensive gradient oracle, let us denote the cost of computing the gradient of ℓas τ and the cost of computing the gradient of the surrogate as equal to 1. When the noise in the gradient is small and the bias dominates the number of iterations for both parametric and target optimization, the cost of parametric SGD is dominated by τTparam = O(κθτ), whereas the cost of target optimization is given by Ttarget [τ + κθ log(1/ϵ)] = O(τ + κθ). Alternatively, when the noise dominates, we need to compare the O τ σ2 θ/µ2ϵ cost for the parametric case against the ϵ [τ +κθ log(1/ϵ)] cost for target optimization. Using the definition of the noise, σ2 θ = Ei X T i (Xiθ yi) 2 2 Lσ2. By replacing σ2 θ by Lσ2, we again see that the complexity of parametric SGD depends on O(κθτ), whereas that of SSO depends on O(κθ + τ). A similar property can also be shown for the logistic regression. Similarly, we could use other stochastic optimization algorithms to construct surrogates for target optimization. See Appendix B and Appendix E for additional discussion. 5. Experimental Evaluation We evaluate the target optimization framework for online imitation learning and supervised learning5. In the subsequent experiments, we use either the theoretically chosen step-size when available, or the default step-size provided by Paszke et al. 
(2019). We do not include a decay schedule for any experiments in the main text, but consider both the standard $1/\sqrt{t}$ schedule (Bubeck et al., 2015), as well as the exponential step-size schedule (Vaswani et al., 2022; Orabona, 2019) in Appendix E. For SSO, since optimization of the surrogate is a deterministic problem, we use the standard back-tracking Armijo line-search (Armijo, 1966) with the same hyper-parameters across all experiments. For each experiment, we plot the average loss against the number of calls to the (stochastic) gradient oracle. The mean and the relevant quantiles are reported using three random seeds.

Online Imitation Learning: We consider a setting in which the losses are generated through interaction with a simulator. In this setting, a behavioral policy gathers examples by observing a state of the simulated environment and taking an action at that state. For each state gathered through the interaction, an expert policy provides the action that it would have taken. The goal in imitation learning is to produce a policy which imitates the expert. The loss measures the discrepancy between the learned policy $\pi$ and the expert policy. In this case, $z = \pi$ where $\pi$ is a distribution over actions given states, $\ell_{i_t}(z) = \mathbb{E}_{s_t}[\mathrm{KL}(\pi(\cdot|s_t) \,\|\, \pi_{\mathrm{expert}}(\cdot|s_t))]$ where the expectation is over the states visited by the behavioral policy and $f(\theta)$ is the policy parameterization. Since computing $\ell_{i_t}$ requires the behavioral policy to interact with the environment, it is expensive. Furthermore, the KL divergence is 1-strongly convex in the $\ell_1$-norm, hence OIL satisfies all our assumptions. When the behavioral policy is the expert itself, we refer to the problem as behavioral cloning. When the learned policy is used to interact with the environment (Florence et al., 2022), we refer to the problem as online imitation learning (OIL) (Lavington et al., 2022; Ross et al., 2011).

5 The code is available at http://github.com/WilderLavington/Target-Based-Surrogates-For-Stochastic-Optimization.

[Figure 1: four panels of log policy loss vs. Total-Examples (0 to 50k) for Behavioral Cloning and Online Imitation on (a) the Hopper-v2 environment and (b) the Walker2d-v2 environment, comparing Adagrad, Adam, SLS, SSO-1, SSO-10, SSO-100 and SSO-1000.]

Figure 1: Comparison of log policy loss (mean-squared error between the expert labels and the mean action produced by the policy model) incurred by SGD, SLS, Adam, Adagrad, and SSO as a function of the total interactions (equal to $t$ in Algorithm 1). SSO-$m$ in the legend indicates that the surrogate has been minimized for $m$ GD steps. The bottom row shows experiments where the policy is parameterized by a neural network, while the top row displays an example where the policy is parameterized by a linear model. Across all environments, behavioral policies, and model types, SSO outperforms all other online-optimization algorithms. Additionally, as $m$ increases, so does the performance of SSO.
[Figure 2: training loss vs. optimization steps (note the differing x-axis scales) for batch-sizes 25, 125 and 625 on rcv1, comparing SGD, Adam, SLS, Adagrad, SSO-1, SSO-10 and SSO-20.]

Figure 2: Comparison of SGD and its SSO variant (top row), SLS and its SSO variant (bottom row) over the rcv1 dataset (Chang and Lin, 2011) using a logistic loss. Adam and Adagrad are included as baselines. All plots are in log space, where the x-axis defines optimization steps (equal to $t$ in Algorithm 1). We note that SGD with its theoretical step-size is outperformed by more sophisticated algorithms like SLS or Adam. In contrast, SSO with the theoretical step-size is competitive with both SLS and Adam with default hyper-parameters. Notably, the SSO variant of both SLS and SGD outperforms its parametric counterpart across both $m$ and batch-size.

[Figure 3: relative run-time per optimization step for sample-sizes 10, 100 and 1000, comparing SGD, SLS, SSO-1, SSO-10 and SSO-100.]

Figure 3: Comparison of run-times (normalized with respect to the cost of a single SGD update) between SSO and relevant baselines. These plots illustrate that when data collection is expensive, SSO will be as fast as SGD, even for relatively large $m$. For Fig. 1, $\tau \approx 1000$, and the ratio of the time required to compute $\ell$ vs. $f$ is approximately 0.001.

In Fig. 1, we consider continuous control environments from the Mujoco benchmark suite (Todorov et al., 2012). The policy corresponds to a standard normal distribution whose mean is parameterized by either a linear function or a neural network. For gathering states, we sample from the stochastic (multivariate normal) policy in order to take actions. At each round of environment interaction, 1000 states are gathered and used to update the policy. The expert policy, defined by a normal distribution and parameterized by a two-layer MLP, is trained using the Soft Actor-Critic algorithm (Haarnoja et al., 2018). Fig. 1 shows that SSO with the theoretical step-size drastically outperforms standard optimization algorithms in terms of the log-loss as a function of environment interactions (calls to the gradient oracle). Further, we see that for both the linear and neural network parameterization, the performance of the learned policy consistently improves as $m$ increases.

Runtime Comparison: In Fig. 3, we demonstrate the relative run-time between algorithms. Each column represents the average run-time required to take a single optimization step (sample states and update the model parameters) normalized by the time for SGD. We vary the number of states gathered (referred to as sample-size) per step and consider sample-sizes of 10, 100 and 1000. The comparison is performed on the Hopper-v2 environment using a two-layer MLP. We observe that for small sample-sizes, the time it takes to gather states does not dominate the time it takes to update the model (for example, in column 1, SSO-100 takes almost 50 times longer than SGD). On the other hand, for large sample-sizes, the multiple model updates made by SSO are no longer a dominating factor (for example, see the right-most column where SSO-100 only takes about three times as long as SGD but results in much better empirical performance). This experiment shows that in cases where data-access is the major bottleneck in computing the stochastic gradients, target optimization can be beneficial, matching the theoretical intuition developed in Section 4.2.
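As a rough worked example of the oracle-cost accounting from Section 4.2 (all constants below are hypothetical, loosely inspired by the $\tau \approx 1000$ regime reported for the imitation-learning experiments): parametric SGD pays the oracle cost $\tau$ on every iteration, while SSO pays $\tau$ once per outer iteration plus $m$ cheap surrogate steps.

```python
# Illustrative cost accounting (hypothetical constants): tau is the relative cost of
# one call to the expensive gradient oracle, m the number of inner surrogate steps.
tau = 1000          # oracle call ~1000x the cost of one surrogate gradient step
m = 100             # inner GD steps per outer SSO iteration
T_param = 10_000    # iterations needed by parametric SGD (illustrative)
T_sso = 2_000       # outer iterations needed by SSO (illustrative, T_sso <= T_param)

cost_sgd = tau * T_param            # every update requires an oracle call
cost_sso = T_sso * (tau + m)        # oracle once per outer step, then m cheap steps
print(cost_sgd / cost_sso)          # ~4.5x saving when the oracle cost dominates
```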
For applications with more expensive simulators such as those for autonomous vehicle (Dosovitskiy et al., 2017), SSO can result in further improvements. Supervised Learning: In order to explore using other optimization methods in the target optimization framework, we consider a simple supervised learning setup. In particular, we use the the rcv1 dataset from libsvm (Chang and Lin, 2011) across four different batch sizes under a logistic-loss. We include additional experiments over other data-sets, and optimization algorithms in Appendix E.6 Extensions to the Stochastic Surrogate: We consider using a different optimization algorithm stochastic linesearch (Vaswani et al., 2019b) (SLS) in the target space, to construct a different surrogate. We refer to the resulting algorithm as SSO-SLS. For SSO-SLS, at every iteration, we perform a backtracking line-search in the target space to set ηt that satisfies the Armijo condition: ℓit(zt ηt zℓit(zt)) ℓit(zt) ηt 2 zℓit(zt) 2 2. We use the chosen ηt to instantiate the surrogate gt and follow Algorithm 1. In Fig. 2, we compare both SSO-SGD (top row) and SSO-SLS (bottom row) along with its parametric variant. We observe that the SSO variant of both SGD and SLS (i) does as well as its parametric counterpart across all m, and (ii), improves it for m sufficiently large. 6We also compared SSO against SVRG (Johnson and Zhang, 2013), a variance reduced method, and found that SSO consistently outperformed it across the batch-sizes and datasets we consider. For an example of this behavior see Fig. 15 in Appendix E. 6. Discussion In the future, we aim to extend our theoretical results to a broader class of functions, and empirically evaluate target optimization for more complex models. Since our framework allows using any optimizer in the target space, we will explore other optimization algorithms in order to construct better surrogates. Another future direction is to construct better surrogate functions that take advantage of the additional structure in the model. For example, Taylor et al. (2016); Amid et al. (2022) exploit the composition structure in deep neural network models, and construct local layerwise surrogates enabling massive parallelization. Finally, we also aim to extend our framework to applications such as RL where ℓis non-convex. 7. Acknowledgements We would like to thank Frederik Kunstner and Reza Asad for pointing out mistakes in the earlier version of this paper. This research was partially supported by the Canada CIFAR AI Program, the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants RGPIN-2022-03669 and RGPIN-2022-04816. Abernethy, J. D., Hazan, E., and Rakhlin, A. (2009). Competing in the dark: An efficient algorithm for bandit linear optimization. COLT. (cited on 5) Agarwal, N., Anil, R., Koren, T., Talwar, K., and Zhang, C. (2020). Stochastic optimization with laggard data pipelines. Advances in Neural Information Processing Systems, 33:10282 10293. (cited on 7) Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation. (cited on 2) Amid, E., Anil, R., and Warmuth, M. (2022). Locoprop: Enhancing backprop via local loss optimization. In Camps Valls, G., Ruiz, F. J. R., and Valera, I., editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 9626 9642. PMLR. (cited on 9) Armijo, L. (1966). Minimization of functions having lipschitz continuous first partial derivatives. 
Pacific Journal of mathematics, 16(1):1 3. (cited on 7) Belkin, M., Rakhlin, A., and Tsybakov, A. B. (2019). Does data interpolation contradict statistical optimality? In AISTATS. (cited on 6) Target-based Surrogates for Stochastic Optimization Bollapragada, R., Byrd, R. H., and Nocedal, J. (2019). Exact and inexact subsampled newton methods for optimization. IMA Journal of Numerical Analysis, 39(2):545 578. (cited on 7) Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. Siam Review, 60(2):223 311. (cited on 6) Bubeck, S. et al. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231 357. (cited on 5, 7) Chang, C.-C. and Lin, C.-J. (2011). Libsvm: A library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1 27. (cited on 8, 9, 30, 36) Choi, D., Passos, A., Shallue, C. J., and Dahl, G. E. (2019). Faster neural network training with data echoing. ar Xiv preprint ar Xiv:1907.05550. (cited on 7) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1 22. (cited on 1) Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Koltun, V. (2017). CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1 16. (cited on 9) Drusvyatskiy, D. (2017). The proximal point method revisited. ar Xiv preprint ar Xiv:1712.06038. (cited on 1) Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. JMLR. (cited on 1, 5) Florence, P., Lynch, C., Zeng, A., Ramirez, O. A., Wahid, A., Downs, L., Wong, A., Lee, J., Mordatch, I., and Tompson, J. (2022). Implicit behavioral cloning. In Faust, A., Hsu, D., and Neumann, G., editors, Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, pages 158 168. PMLR. (cited on 7) Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and Richtárik, P. (2019). Sgd: General analysis and improved rates. In International Conference on Machine Learning, pages 5200 5209. PMLR. (cited on 6, 7) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861 1870. PMLR. (cited on 8) Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169 192. (cited on 5) Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, Neur IPS. (cited on 9, 36) Johnson, R. and Zhang, T. (2020). Guided learning of nonconvex models through successive functional gradient optimization. In International Conference on Machine Learning, pages 4921 4930. PMLR. (cited on 2) Kakade, S. M. (2001). A natural policy gradient. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada]. (cited on 2) Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR. (cited on 1, 30) Lavington, J. W., Vaswani, S., and Schmidt, M. (2022). 
Improved policy optimization for online imitation learning. ar Xiv preprint ar Xiv:2208.00088. (cited on 7, 37) Li, X., Zhuang, Z., and Orabona, F. (2021). A second look at exponential and cosine step sizes: Simplicity, adaptivity, and performance. In International Conference on Machine Learning, pages 6553 6564. PMLR. (cited on 5, 6) Liang, T. and Rakhlin, A. (2018). Just interpolate: Kernel" ridgeless" regression can generalize. ar Xiv preprint ar Xiv:1808.00387. (cited on 6) Lohr, S. L. (2019). Sampling: Design and Analysis: Design and Analysis. Chapman and Hall/CRC. (cited on 6) Loizou, N., Vaswani, S., Laradji, I. H., and Lacoste-Julien, S. (2021). Stochastic polyak step-size for sgd: An adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, pages 1306 1314. PMLR. (cited on 6) Ma, S., Bassily, R., and Belkin, M. (2018). The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In ICML. (cited on 6) Mairal, J. (2013). Stochastic majorization-minimization algorithms for large-scale optimization. Advances in Neural Information Processing Systems, 26. (cited on 1) Mairal, J. (2015). Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829 855. (cited on 1) Target-based Surrogates for Stochastic Optimization Meng, S. Y., Vaswani, S., Laradji, I. H., Schmidt, M., and Lacoste-Julien, S. (2020). Fast and furious convergence: Stochastic second order methods under interpolation. In International Conference on Artificial Intelligence and Statistics, pages 1375 1386. PMLR. (cited on 7) Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media. (cited on 22) Newton, D., Bollapragada, R., Pasupathy, R., and Yip, N. K. (2021). Retrospective approximation for smooth stochastic optimization. ar Xiv preprint ar Xiv:2103.04392. (cited on 5) Nguyen, L. M., Tran, T. H., and van Dijk, M. (2022). Finitesum optimization: A new perspective for convergence to a global solution. ar Xiv preprint ar Xiv:2202.03524. (cited on 2) Nocedal, J. and Wright, S. J. (1999). Numerical optimization. Springer. (cited on 4) Orabona, F. (2019). A modern introduction to online learning. ar Xiv preprint ar Xiv:1912.13213. (cited on 7, 30) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024 8035. Curran Associates, Inc. (cited on 7, 33, 34, 35, 38) Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, pages 400 407. (cited on 1, 3, 5) Ross, S., Gordon, G., and Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. Journal of machine learning research, pages 627 635. (cited on 1, 7) Schmidt, M. and Le Roux, N. (2013). Fast convergence of stochastic gradient descent under a strong growth condition. ar Xiv preprint ar Xiv:1308.6370. (cited on 2, 6) Schmidt, M., Roux, N., and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. 
Advances in neural information processing systems, 24. (cited on 6, 28) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning (ICML), pages 1889 1897. (cited on 1) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. Co RR, abs/1707.06347. (cited on 1) Sutton, R. S., Mc Allester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (Neur IPS), pages 1057 1063. (cited on 1, 38) Taylor, G., Burmeister, R., Xu, Z., Singh, B., Patel, A., and Goldstein, T. (2016). Training neural networks without gradients: A scalable admm approach. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2722 2731, New York, New York, USA. PMLR. (cited on 9) Todorov, E., Erez, T., and Tassa, Y. (2012). Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. (cited on 3, 8, 37) Vaswani, S., Bach, F., and Schmidt, M. (2019a). Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. In The 22nd international conference on artificial intelligence and statistics, pages 1195 1204. PMLR. (cited on 2, 6) Vaswani, S., Bachem, O., Totaro, S., Müller, R., Garg, S., Geist, M., Machado, M. C., Castro, P. S., and Roux, N. L. (2021). A general class of surrogate functions for stable and efficient reinforcement learning. ar Xiv preprint ar Xiv:2108.05828. (cited on 2) Vaswani, S., Dubois-Taine, B., and Babanezhad, R. (2022). Towards noise-adaptive, problem-adaptive (accelerated) stochastic gradient descent. In International Conference on Machine Learning, pages 22015 22059. PMLR. (cited on 5, 6, 7, 20, 25, 30) Vaswani, S., Laradji, I., Kunstner, F., Meng, S. Y., Schmidt, M., and Lacoste-Julien, S. (2020). Adaptive gradient methods converge faster with over-parameterization (but you should do a line-search). ar Xiv preprint ar Xiv:2006.06835. (cited on 6) Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., and Lacoste-Julien, S. (2019b). Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, pages 3727 3740. (cited on 5, 9, 30, 33, 34, 35, 38) Ward, R., Wu, X., and Bottou, L. (2020). Adagrad stepsizes: Sharp convergence over nonconvex landscapes. The Journal of Machine Learning Research, 21(1):9047 9076. (cited on 34) Target-based Surrogates for Stochastic Optimization Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229 256. (cited on 1) Woodworth, B., Mishchenko, K., and Bach, F. (2023). Two losses are better than one: Faster optimization using a cheaper proxy. ar Xiv preprint ar Xiv:2302.03542. (cited on 2) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In ICLR. (cited on 6) Target-based Surrogates for Stochastic Optimization Organization of the Appendix A Definitions B Algorithms C Proofs in the Deterministic Setting D Proofs in the Stochastic Setting E Additional Experimental Results A. 
Definitions Our main assumptions are that each individual function fi is differentiable, has a finite minimum f i , and is Li-smooth, meaning that for all v and w, fi(v) fi(w) + fi(w), v w + Li 2 v w 2 2 , (Individual Smoothness) which also implies that f is L-smooth, where L is the maximum smoothness constant of the individual functions. A consequence of smoothness is the following bound on the norm of the stochastic gradients, fi(w) f i 2 2L(fi(w) f i f i , w w i ). (1) We also assume that each fi is convex, meaning that for all v and w, fi(v) fi(w) + fi(w), v w , (Convexity) Depending on the setting, we will also assume that f is µ strongly-convex, meaning that for all v and w, f(v) f(w) + f(w), v w + µ 2 v w 2 2 , (Strong Convexity) B. Algorithms In this section, we will formulate the algorithms beyond the standard SGD update in the target space. We will do so in two ways (i) extending SGD to the online Newton step that uses second-order information in Appendix B.1 and (ii) extend SGD to the more general stochastic mirror descent algorithm in Appendix B.2. For both (i) and (ii), we will instantiate the resulting algorithms for the squared and logistic losses. B.1. Online Newton Step Let us consider the online Newton step w.r.t to the targets. The corresponding update is: zt+1/2 = zt ηt[ 2 zℓt(zt)] 1 zℓt(zt) ; zt+1 = arg min z Z z zt+1/2 2 Pt (2) zt+1 = f(θt+1) ; θt+1 = arg min θ zℓt(zt), f(θ) zt + 1 2ηt f(θ) zt 2 2ℓt(zt) where 2 zℓt(zt) is the Hessian of example of the loss corresponding to sample it w.r.t z. Let us instantiate this update for the squared-loss. In this case, ℓt(z) = 1 2 z yt 2 2, and hence, ℓt(z) = z yt, [ 2ℓt(z)]it,it = 1 and [ 2ℓt(z)]j,j = 0 for all j = it. Hence, for the squared loss, Eq. (2) is the same as GD in the target space. For the logistic loss, ℓt(z) = log (1 + exp ( ytz)). If it is the loss index sampled at iteration t, then, [ ℓt(z)]j = 0 for all j = it. Similarly, all entries of 2ℓt(z) except the [it, it] are zero. [ ℓt(z)]it = yt 1 + exp(yt zt) ; [ 2ℓ(z)]it,it, = 1 1 + exp(ytzt) 1 1 + exp( ytzt) = (1 pt) pt , Target-based Surrogates for Stochastic Optimization where, pt = 1 1+exp( ytzt) is the probability of classifying the example it to have the +1 label. In this case, the surrogate can be written as: gz t (θ) = yt 1 + exp(yt zt) fit(θ) zit t + (1 pt) pt fit(θ) zit t 2 (4) As before, the above surrogate can be implemented efficiently. B.2. Stochastic Mirror Descent If ϕ is a differentiable, strictly-convex mirror map, it induces a Bregman divergence between x and y: Dϕ(y, x) := ϕ(y) ϕ(x) ϕ(x), y x . For an efficient implementation of stochastic mirror descent, we require the Bregman divergence to be separable, i.e. Dϕ(y, x) = Pp j=1 Dϕj(yj, xj) = Pp j=1 ϕj(yj) ϕj(xj) ϕj(x) xj [yj xj]. Such a separable structure is satisfied when ϕ is the Euclidean norm or negative entropy. We define stochastic mirror descent update in the target space as follows, ϕ(zt+1/2) = ϕ(zt) ηt zℓt(zt) ; zt+1 = arg min z Z j=1 I(j = it)Dϕj(zj, zj t+1/2)) (5) = zt+1 = arg min z Z zℓt(zt), z zt + 1 j=1 I(j = it)Dϕj(zj, zj t+1/2) where I is an indicator function and it corresponds to the index of the sample chosen in iteration t. For the Euclidean mirror map, ϕ(z) = 1 2 z 2 2, Dϕ(z, zt) = 1 2 z zt 2 2 and we recover the SGD update. Another common choice of the mirror map is the negative entropy function: ϕ(x) = PK i=1 xi log(xi) where xi is coordinate i of the x RK. 
B.2. Stochastic Mirror Descent

If $\phi$ is a differentiable, strictly-convex mirror map, it induces a Bregman divergence between $x$ and $y$: $D_\phi(y, x) := \phi(y) - \phi(x) - \langle \nabla\phi(x), y - x \rangle$. For an efficient implementation of stochastic mirror descent, we require the Bregman divergence to be separable, i.e. $D_\phi(y, x) = \sum_{j=1}^{p} D_{\phi_j}(y_j, x_j) = \sum_{j=1}^{p} \left[ \phi_j(y_j) - \phi_j(x_j) - \frac{\partial \phi_j(x)}{\partial x_j} (y_j - x_j) \right]$. Such a separable structure is satisfied when $\phi$ is the squared Euclidean norm or the negative entropy. We define the stochastic mirror descent update in the target space as follows:
$$\nabla\phi(z_{t+1/2}) = \nabla\phi(z_t) - \eta_t \nabla_z \ell_t(z_t)\,; \qquad z_{t+1} = \arg\min_{z \in \mathcal{Z}} \sum_{j=1}^{p} \mathbb{I}(j = i_t)\, D_{\phi_j}\!\left(z^j, z^j_{t+1/2}\right)$$ (5)
or, equivalently,
$$z_{t+1} = \arg\min_{z \in \mathcal{Z}} \left\langle \nabla_z \ell_t(z_t), z - z_t \right\rangle + \frac{1}{\eta_t} \sum_{j=1}^{p} \mathbb{I}(j = i_t)\, D_{\phi_j}\!\left(z^j, z^j_t\right),$$
where $\mathbb{I}$ is an indicator function and $i_t$ corresponds to the index of the sample chosen in iteration $t$. For the Euclidean mirror map, $\phi(z) = \frac{1}{2}\|z\|_2^2$, $D_\phi(z, z_t) = \frac{1}{2}\|z - z_t\|_2^2$ and we recover the SGD update. Another common choice of mirror map is the negative entropy function $\phi(x) = \sum_{i=1}^{K} x_i \log(x_i)$, where $x_i$ is coordinate $i$ of $x \in \mathbb{R}^K$. This induces the (generalized) KL divergence as the Bregman divergence,
$$D_\phi(x, y) = \sum_{k=1}^{K} x_k \log\frac{x_k}{y_k} - \sum_{k=1}^{K} x_k + \sum_{k=1}^{K} y_k\,.$$ (6)
If both $x$ and $y$ correspond to probability distributions, i.e. $\sum_{k=1}^{K} x_k = \sum_{k=1}^{K} y_k = 1$, then the induced Bregman divergence corresponds to the standard KL divergence between the two distributions.

For multi-class classification, $\mathcal{Z} \subseteq \mathbb{R}^{p \times K}$ and each $z^i \in \Delta_K$, where $\Delta_K$ is the $K$-dimensional simplex. We refer to coordinate $j$ of $z^i$ as $[z^i]_j$. Since $z^i \in \Delta_K$, $[z^i]_k \geq 0$ and $\sum_{k=1}^{K} [z^i]_k = 1$. Let us instantiate the general SMD update in Eq. (5) with the negative entropy mirror map. In this case, for $z \in \Delta_K$, $[\nabla\phi(z)]_k = 1 + \log([z]_k)$. Denoting $\nabla_t := \nabla_z \ell_t(z_t)$ and using $[\nabla_t]_k$ to refer to coordinate $k$ of the $K$-dimensional vector $\nabla_t$, Eq. (5) can be written as:
$$[z^{i_t}_{t+1/2}]_k = [z^{i_t}_t]_k \exp\left(-\eta_t [\nabla_t]_k\right)$$ (7)
For multi-class classification, the $y_i \in \{0, 1\}^K$ are one-hot vectors. If $[y_i]_k$ refers to coordinate $k$ of vector $y_i$, the corresponding log-likelihood for $n$ observations can be written as $\sum_{i=1}^{n} \sum_{k=1}^{K} [y_i]_k \log([z_i]_k)$. In our target optimization framework, the targets correspond to the probabilities of classifying the points into one of the classes. We use a parameterization (e.g. a softmax) to model the vector-valued function $f_i(\theta) : \mathbb{R}^d \to \mathbb{R}^K$; this ensures that for all $i$, $\sum_{k=1}^{K} [f_i(\theta)]_k = 1$. Hence, the projection step in Eq. (5) can be rewritten as:
$$\min_{z \in \mathcal{Z}} D_\phi(z, z_{t+1/2}) = \min_{\theta} \sum_{k=1}^{K} [f_{i_t}(\theta)]_k \log\frac{[f_{i_t}(\theta)]_k}{[z^{i_t}_{t+1/2}]_k}$$
where $[z^{i_t}_{t+1/2}]_k$ is computed according to Eq. (7). Since the computation of $[z^{i_t}_{t+1/2}]_k$ and the resulting projection only depend on sample $i_t$, they can be implemented efficiently.

C. Proofs in the Deterministic Setting

Lemma C.1. Assuming that $g^z_t(\theta)$ is $\beta$-smooth w.r.t. the Euclidean norm and $\eta \leq \frac{1}{L}$, then, for $\alpha = 1/\beta$, iteration $t$ of Algorithm 1 guarantees that $h(\theta_{t+1}) \leq h(\theta_t)$ for any number $m \geq 1$ of surrogate steps. In this setting, under the additional assumption that $h$ is lower-bounded by $h^*$, Algorithm 1 results in the following guarantee:
$$\min_{t \in \{0, \ldots, T-1\}} \|\nabla h(\theta_t)\|_2^2 \leq \frac{2\beta \left[h(\theta_0) - h^*\right]}{T}\,.$$

Proof. Using the update in Algorithm 1 with $\alpha = \frac{1}{\beta}$ and the $\beta$-smoothness of $g^z_t(\theta)$, for all $k \in [m-1]$,
$$g^z_t(\omega_{k+1}) \leq g^z_t(\omega_k) - \frac{1}{2\beta} \|\nabla g^z_t(\omega_k)\|_2^2\,.$$
After $m$ steps,
$$g^z_t(\omega_m) \leq g^z_t(\omega_0) - \frac{1}{2\beta} \sum_{k=0}^{m-1} \|\nabla g^z_t(\omega_k)\|_2^2\,.$$
Since $\theta_{t+1} = \omega_m$ and $\omega_0 = \theta_t$ in Algorithm 1,
$$\implies g^z_t(\theta_{t+1}) \leq g^z_t(\theta_t) - \frac{1}{2\beta} \|\nabla g^z_t(\theta_t)\|_2^2 - \frac{1}{2\beta} \sum_{k=1}^{m-1} \|\nabla g^z_t(\omega_k)\|_2^2\,.$$
Note that $\nabla h(\theta_t) = \nabla g^z_t(\theta_t)$ and, if $\eta \leq \frac{1}{L}$, then $h(\theta_{t+1}) \leq g^z_t(\theta_{t+1})$. Using these relations,
$$h(\theta_{t+1}) \leq h(\theta_t) - \frac{1}{2\beta} \|\nabla g^z_t(\theta_t)\|_2^2 - \underbrace{\frac{1}{2\beta} \sum_{k=1}^{m-1} \|\nabla g^z_t(\omega_k)\|_2^2}_{\geq 0} \implies h(\theta_{t+1}) \leq h(\theta_t)\,.$$
This proves the first part of the Lemma. Since $\sum_{k=1}^{m-1} \|\nabla g^z_t(\omega_k)\|_2^2 \geq 0$,
$$h(\theta_{t+1}) \leq h(\theta_t) - \frac{1}{2\beta} \|\nabla g^z_t(\theta_t)\|_2^2 \implies \|\nabla h(\theta_t)\|_2^2 \leq 2\beta \left[h(\theta_t) - h(\theta_{t+1})\right] \quad \text{(since } \nabla h(\theta_t) = \nabla g^z_t(\theta_t)\text{)}$$
Summing from $t = 0$ to $T-1$, telescoping (using $h(\theta_T) \geq h^*$), and dividing by $T$,
$$\frac{\sum_{t=0}^{T-1} \|\nabla h(\theta_t)\|_2^2}{T} \leq \frac{2\beta \left[h(\theta_0) - h^*\right]}{T} \implies \min_{t \in \{0, \ldots, T-1\}} \|\nabla h(\theta_t)\|_2^2 \leq \frac{2\beta \left[h(\theta_0) - h^*\right]}{T}\,.$$
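Algorithm 1 (the deterministic target-optimization loop) is stated in the main text; the following is a minimal sketch of the inner loop analyzed by Lemma C.1, written under the paper's composition $h(\theta) = \ell(f(\theta))$. The helper names (`target_mm_step`, `jac_f`, `grad_ell`) and the constant step-sizes are ours and should be read as placeholders.

```python
import numpy as np

def target_mm_step(theta, f, jac_f, grad_ell, eta, beta, m):
    """Sketch of one outer iteration from Lemma C.1: form the target-space surrogate
    at z_t = f(theta_t) using a single expensive gradient evaluation, then take m
    cheap gradient steps (step-size alpha = 1/beta) on the surrogate."""
    z_t = f(theta)
    g_t = grad_ell(z_t)              # one call to the expensive gradient oracle

    def surrogate_grad(th):
        r = f(th) - z_t              # f(theta) - z_t
        # gradient of <g_t, f(theta) - z_t> + (1 / (2 eta)) ||f(theta) - z_t||^2
        return jac_f(th).T @ (g_t + r / eta)

    omega = theta.copy()
    for _ in range(m):               # m surrogate steps, no new oracle calls
        omega = omega - (1.0 / beta) * surrogate_grad(omega)
    return omega
```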
D. Proofs in the Stochastic Setting

D.1. Equivalence of SSO and SGD in the Target Space

Lemma 4.1. The following updates are equivalent:
$$\text{(1) SSO:} \quad \theta_{t+1} = \arg\min_{\theta} g_t(\theta)\,; \qquad z^{(1)}_{t+1} = f(\theta_{t+1})$$
$$\text{(2) Target-space SGD:} \quad z_{t+1/2} = z_t - \eta_t \nabla_z \ell_{i_t}(z_t)\,; \qquad z^{(2)}_{t+1} = \arg\min_{z \in \mathcal{Z}} \|z_{t+1/2} - z\|^2_{P_t}$$
where $P_t \in \mathbb{R}^{p \times p}$ is a random diagonal matrix such that $P_t(i_t, i_t) = 1$ and $P_t(j, j) = 0$ for all $j \neq i_t$. That is, SSO (1) and target-space SGD (2) result in the same iterate in each step, i.e. if $z_t = f(\theta_t)$, then $z_{t+1} := z^{(1)}_{t+1} = z^{(2)}_{t+1}$.

Proof. Since $\ell$ is separable, if $i_t$ is the coordinate sampled from $z$, we can rewrite the target-space update as follows:
$$z^{i_t}_{t+1/2} = z^{i_t}_t - \eta_t \frac{\partial \ell_{i_t}(z_t)}{\partial z^{i_t}}\,; \qquad z^{j}_{t+1/2} = z^{j}_t \;\; \text{when } j \neq i_t\,.$$
Substituting the above update into the projection step, we have
$$z_{t+1} = \arg\min_{z \in \mathcal{Z}} \frac{1}{2} \left\| z^{i_t}_{t+1/2} - z^{i_t} \right\|_2^2 = \arg\min_{z \in \mathcal{Z}} \frac{1}{2} \left\| z^{i_t}_t - \eta_t \frac{\partial \ell_{i_t}(z_t)}{\partial z^{i_t}} - z^{i_t} \right\|_2^2 = \arg\min_{z \in \mathcal{Z}} \frac{\partial \ell_{i_t}(z_t)}{\partial z^{i_t}} \left[ z^{i_t} - z^{i_t}_t \right] + \frac{1}{2\eta_t} \left\| z^{i_t} - z^{i_t}_t \right\|_2^2 \quad \text{(due to the separability of } \ell\text{)}$$
Since for all $z \in \mathcal{Z}$, $z = f(\theta)$ and $z^i = f_i(\theta)$ for all $i$, we have $z_{t+1} = f(\theta_{t+1})$ such that
$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \frac{\partial \ell_{i_t}(z_t)}{\partial z^{i_t}} \left[ f_{i_t}(\theta) - f_{i_t}(\theta_t) \right] + \frac{1}{2\eta_t} \left\| f_{i_t}(\theta) - f_{i_t}(\theta_t) \right\|_2^2 = \arg\min_{\theta \in \Theta} g_t(\theta)\,.$$

Lemma D.1. Consider the following updates:
$$\text{SSO:} \quad \theta_{t+1} = \arg\min_{\theta} q_t(\theta)\,; \qquad z^{(1)}_{t+1} = f(\theta_{t+1})$$
$$\text{Target-space SGD:} \quad z_{t+1/2} = z_t - \eta'_t \nabla_z \ell_{i_t}(z_t)\,; \qquad z^{(2)}_{t+1} = \arg\min_{z \in \mathcal{Z}} \|z_{t+1/2} - z\|_2^2$$
SSO and target-space SGD result in the same iterate in each step, i.e. if $z_t = f(\theta_t)$, then $z_{t+1} := z^{(1)}_{t+1} = z^{(2)}_{t+1}$.

Proof.
$$z_{t+1} = \arg\min_{z \in \mathcal{Z}} \frac{1}{2} \|z_{t+1/2} - z\|_2^2 = \arg\min_{z \in \mathcal{Z}} \frac{1}{2} \left\| z_t - \eta'_t \nabla_z \ell_{i_t}(z_t) - z \right\|_2^2 = \arg\min_{z \in \mathcal{Z}} \frac{\partial \ell_{i_t}(z_t)}{\partial z^{i_t}} \left[ z^{i_t} - z^{i_t}_t \right] + \frac{1}{2\eta'_t} \|z - z_t\|_2^2 \quad \text{(due to the separability of } \ell\text{)}$$
Since for all $z \in \mathcal{Z}$, $z = f(\theta)$, we have $z_{t+1} = f(\theta_{t+1})$ such that
$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \frac{\partial \ell_{i_t}(z_t)}{\partial z^{i_t}} \left[ f_{i_t}(\theta) - f_{i_t}(\theta_t) \right] + \frac{1}{2\eta'_t} \|f(\theta) - f(\theta_t)\|_2^2 = \arg\min_{\theta \in \Theta} q_t(\theta)\,.$$

D.2. Proof for Strongly-convex Functions

We consider the case where $\ell(z)$ is strongly-convex and the set $\mathcal{Z}$ is convex. We focus on SGD in the target space and consider the following updates:
$$z_{t+1/2} = z_t - \eta'_t \nabla \ell_t(z_t)\,; \qquad \bar{z}_{t+1} = \Pi_{\mathcal{Z}}[z_{t+1/2}] := \arg\min_{z \in \mathcal{Z}} \|z - z_{t+1/2}\|_2^2\,; \qquad \|z_{t+1} - \bar{z}_{t+1}\| \leq \epsilon_{t+1}\,,$$
where $\bar{z}_{t+1}$ is the exact projection, $z_{t+1}$ is the iterate actually produced, and $\epsilon_{t+1}$ bounds the resulting projection error.

Lemma D.2. Bounding the suboptimality (to $z^*$) of $z_{t+1}$ based on the sub-optimality of $\bar{z}_{t+1}$ and $\epsilon_{t+1}$, we get that
$$\|z_{t+1} - z^*\|_2^2 \leq \|\bar{z}_{t+1} - z^*\|_2^2 + 2\epsilon_{t+1} \|z_{t+1} - z^*\|\,.$$ (8)

Proof.
$$\|z_{t+1} - z^*\|_2^2 = \|z_{t+1} - \bar{z}_{t+1} + \bar{z}_{t+1} - z^*\|_2^2 = \|z_{t+1} - \bar{z}_{t+1}\|_2^2 + \|\bar{z}_{t+1} - z^*\|_2^2 + 2\langle z_{t+1} - \bar{z}_{t+1},\, \bar{z}_{t+1} - z^* \rangle$$
$$= \|z_{t+1} - \bar{z}_{t+1}\|_2^2 + \|\bar{z}_{t+1} - z^*\|_2^2 + 2\langle z_{t+1} - \bar{z}_{t+1},\, \bar{z}_{t+1} - z_{t+1} + z_{t+1} - z^* \rangle$$
$$= -\|z_{t+1} - \bar{z}_{t+1}\|_2^2 + \|\bar{z}_{t+1} - z^*\|_2^2 + 2\langle z_{t+1} - \bar{z}_{t+1},\, z_{t+1} - z^* \rangle$$
$$\leq \|\bar{z}_{t+1} - z^*\|_2^2 + 2\,\|z_{t+1} - \bar{z}_{t+1}\|\, \|z_{t+1} - z^*\| \leq \|\bar{z}_{t+1} - z^*\|_2^2 + 2\epsilon_{t+1} \|z_{t+1} - z^*\|\,.$$

Now we bound the exact sub-optimality at iteration $t+1$ by the inexact sub-optimality at iteration $t$ to get a recursion.

Lemma D.3. Assuming (i) each $\ell_t$ is $L$-smooth, (ii) $\ell$ is $\mu$-strongly convex, and (iii) $\eta'_t \leq \frac{1}{2L}$, we have
$$\mathbb{E}\,\|\bar{z}_{t+1} - z^*\|_2^2 \leq (1 - \mu \eta'_t)\, \mathbb{E}\,\|z_t - z^*\|_2^2 + 2{\eta'_t}^2 \sigma^2$$ (9)
where $\sigma^2 = \mathbb{E}\,\|\nabla\ell(z^*) - \nabla\ell_t(z^*)\|_2^2$.

Proof. Since $z^* \in \mathcal{Z}$ and is optimal, $z^* = \Pi_{\mathcal{Z}}[z^* - \eta'_t \nabla\ell(z^*)]$. Hence,
$$\|\bar{z}_{t+1} - z^*\|_2^2 = \left\| \Pi_{\mathcal{Z}}[z_t - \eta'_t \nabla\ell_t(z_t)] - \Pi_{\mathcal{Z}}[z^* - \eta'_t \nabla\ell(z^*)] \right\|_2^2 \leq \left\| [z_t - \eta'_t \nabla\ell_t(z_t)] - [z^* - \eta'_t \nabla\ell(z^*)] \right\|_2^2 \quad \text{(since projections are non-expansive)}$$
$$= \left\| [z_t - \eta'_t \nabla\ell_t(z_t)] - [z^* - \eta'_t \nabla\ell_t(z^*)] + \eta'_t \left[\nabla\ell(z^*) - \nabla\ell_t(z^*)\right] \right\|_2^2$$
$$= \|z_t - z^*\|_2^2 + {\eta'_t}^2 \|\nabla\ell(z^*) - \nabla\ell_t(z^*)\|_2^2 + {\eta'_t}^2 \|\nabla\ell_t(z_t) - \nabla\ell_t(z^*)\|_2^2 + \underbrace{2\eta'_t \langle z_t - z^*,\, \nabla\ell(z^*) - \nabla\ell_t(z^*) \rangle}_{A_t} - 2\eta'_t \langle z_t - z^*,\, \nabla\ell_t(z_t) - \nabla\ell_t(z^*) \rangle + \underbrace{2{\eta'_t}^2 \langle \nabla\ell_t(z^*) - \nabla\ell(z^*),\, \nabla\ell_t(z_t) - \nabla\ell_t(z^*) \rangle}_{B_t}$$
$$\leq \|z_t - z^*\|_2^2 + 2{\eta'_t}^2 \|\nabla\ell(z^*) - \nabla\ell_t(z^*)\|_2^2 + 2{\eta'_t}^2 \|\nabla\ell_t(z_t) - \nabla\ell_t(z^*)\|_2^2 + A_t - 2\eta'_t \langle z_t - z^*,\, \nabla\ell_t(z_t) - \nabla\ell_t(z^*) \rangle \quad \text{(Young's inequality on } B_t\text{)}$$
Taking expectation with respect to $i_t$, knowing that $\mathbb{E}[A_t] = 0$ and using $\sigma^2 = \mathbb{E}\,\|\nabla\ell(z^*) - \nabla\ell_t(z^*)\|_2^2$,
$$\mathbb{E}\,\|\bar{z}_{t+1} - z^*\|_2^2 \leq \mathbb{E}\,\|z_t - z^*\|_2^2 + 2{\eta'_t}^2\, \mathbb{E}\,\|\nabla\ell_t(z_t) - \nabla\ell_t(z^*)\|_2^2 - 2\eta'_t \langle z_t - z^*,\, \nabla\ell(z_t) - \nabla\ell(z^*) \rangle + 2{\eta'_t}^2 \sigma^2$$
$$\leq \mathbb{E}\,\|z_t - z^*\|_2^2 + 4{\eta'_t}^2 L\, \mathbb{E}\left\{ \ell_t(z_t) - \ell_t(z^*) - \langle \nabla\ell_t(z^*),\, z_t - z^* \rangle \right\} - 2\eta'_t \langle z_t - z^*,\, \nabla\ell(z_t) - \nabla\ell(z^*) \rangle + 2{\eta'_t}^2 \sigma^2 \qquad (10)$$
$$\leq \mathbb{E}\,\|z_t - z^*\|_2^2 + 2\eta'_t \left\{ \ell(z_t) - \ell(z^*) - \langle \nabla\ell(z^*),\, z_t - z^* \rangle \right\} - 2\eta'_t \langle z_t - z^*,\, \nabla\ell(z_t) - \nabla\ell(z^*) \rangle + 2{\eta'_t}^2 \sigma^2 \quad (\eta'_t \leq \tfrac{1}{2L})$$
$$= \mathbb{E}\,\|z_t - z^*\|_2^2 + 2\eta'_t \left\{ \ell(z_t) - \ell(z^*) - \langle \nabla\ell(z_t),\, z_t - z^* \rangle \right\} + 2{\eta'_t}^2 \sigma^2$$
$$\leq \mathbb{E}\,\|z_t - z^*\|_2^2 - \mu\eta'_t\, \mathbb{E}\,\|z_t - z^*\|_2^2 + 2{\eta'_t}^2 \sigma^2 \quad \text{(strong convexity of } \ell\text{)}$$
$$= (1 - \mu\eta'_t)\, \mathbb{E}\,\|z_t - z^*\|_2^2 + 2{\eta'_t}^2 \sigma^2\,,$$
where in Eq. (10) we used the smoothness of $\ell_t$.

Theorem 4.2.
Assuming that (i) ℓit is L-smooth and convex, (ii) ℓis µ-strongly convex, (iii) z = arg minz Z ℓ(z) (iv) and that for all t, ϵt ϵ, T iterations of SSO result in the following bound for z T = f(θT ), E z T +1 z YT i=t+1 ρiη t 2 1/2 where ρt = (1 µη t), σ2 := E h ℓ(z ) ℓt(z ) 2 2 i , (a) Constant step-size: When ηt = 1 2L n and for ρ = 1 µ 2L, E z T +1 z z1 z 1 1 + σ µL + 2ϵ (b) Exponential step-size: When ηt = 1 2L nαt for α = ( β E z T +1 z c1 exp T 4κ α ln(T/β) + 4κc1(ln(T/β)) T σ + 2ϵ c2 . where c1 = exp 1 4κ 2β ln(T/β) , and c2 = exp β ln(T ) 2κ ln(T/β) Proof. Using Lemma D.2 and Lemma D.3 we have E zt+1 z 2 2 E zt+1 z 2 2 + 2E[ϵt+1 zt+1 z ] (1 µη t) | {z } ρt E zt z 2 2 + 2η 2 t σ2 + 2E[ϵt+1 zt+1 z ] Recursing from t = 1 to T, E[ z T +1 z 2 2] z1 z 2 2 + 2σ2 T X i=t+1 ρiη t 2 + 2 i=t+1 ρi E[ϵt+1 zt+1 z ] Target-based Surrogates for Stochastic Optimization Denote ut := E zt z . By applying Jensen s inequality we know that u2 T E[ z T z 2 2]. u2 1 + 2σ2 T X i=t+1 ρiη t 2 + 2ϵ i=t+1 ρi ut+1 Dividing both sides in the previous inequality by QT i=1 ρi leads to: u2 T +1 u2 1 + 2σ2 T Y t=1 η t 2 T Y Simplify the above inequality for a generic τ T, u2 1 + 2σ2 τ X t=1 η t 2 t Y Let vτ := (Qτ i=1 ρi) 1 2 uτ+1 and Sτ := u2 1 + 2σ2 Pτ t=1 η t 2 Qt i=1 ρi 1 . Let us also denote λt := 2ϵ Qt i=1 ρi 1 Observe that S0 = u2 1 = v2 0 and Sτ+1 = Sτ +2σ2η t 2 τ+1 Qτ+1 i=1 ρi 1 . Therefore Sτ is an increasing sequence. Re-writing the previous inequality using the new variables leads to the following inequality: Using the result from Lemma D.10 we have: b for a, b 0) Writing the inequality above using the original variables results in: u2 1 + 2σ2 τ X t=1 η t 2 t Y t=1 η t 2 t Y (a) Constant step size: Choosing a constant step size η t = η = 1 2L implies ρi = ρ = 1 µη = 1 µ 2L. Plugging this into the previous inequality leads to: uτ+1 2ϵρ τ 2 2 + u1ρ τ 2 + 2ϵ 1 ρ + u1ρ τ 2 + 2ση 1 ρ (applying the formula for finite geometric series) E z T +1 z z1 z 1 1 2 + σ µL + 2ϵ Target-based Surrogates for Stochastic Optimization (b) Exponential step size. Starting from Eq. (10) and using the proof from Vaswani et al. (2022), we have E zt+1 z 2 2 E zt z 2 2 + 4η 2 t LE {ℓt(zt) ℓt(z ) ℓt(z ), zt z } 2η t zt z , ℓ(zt) ℓ(z ) + 2η 2 t σ2 E zt z 2 2 + α2t L E {ℓt(zt) ℓt(z ) ℓt(z ), zt z } αt L zt z , ℓ(zt) ℓ(z ) + 2η 2 t σ2 ( using η t = αt E zt z 2 2 + αt L {ℓ(zt) ℓ(z ) ℓ(z ), zt z } αt L zt z , ℓ(zt) ℓ(z ) + 2η 2 t σ2 ( using αt 1) = E zt z 2 2 + αt L {ℓ(zt) ℓ(z ) ℓ(zt), zt z } + 2η 2 t σ2 E zt z 2 2 µαt 2L E zt z 2 2 + 2η 2 t σ2 = 1 1 E zt z 2 2 + 2η 2 t σ2 (using strong convexity of ℓ) Combining the above with Lemma D.2 we get: E zt+1 z 2 2 1 1 E zt z 2 2 + 2η 2 t σ2 + 2E[ϵt+1 zt+1 z ] E zt z 2 2 + σ2 2L2 α2t + 2E[ϵt+1 zt+1 z ] (1 x exp( x)) E zt z 2 2 + σ2 2L2 α2t + 2ϵE[ zt+1 z ] (ϵt+1 ϵ) Unrolling the recursion starting from t = 1 to T, denoting ut := E zt z and applying Jensen s inequality to deduce that that u2 T +1 E z T +1 z 2 2, we get that, u2 T +1 u2 1 exp 1 t=1 α2t exp 1 By multiplying both sides by exp 1 2κ PT t=1 αt 2 u2 1 + σ2 t=1 α2t exp 1 Now let us define vτ := exp 1 4κ Pτ t=1 αt uτ+1, Sτ := u2 1 + σ2 2L2 Pτ t=1 α2t exp 1 2κ Pt i=1 αi 2ϵ exp 1 4κ Pt i=1 αi . Note that Sτ is increasing and S0 = u2 1 = v2 0 0 and λt > 0, vτ > 0. 
By applying Lemma D.10, similar to the fixed step-size case, for τ = T, we get: u T +1 u2 1 + σ2 t=1 α2t exp 1 Target-based Surrogates for Stochastic Optimization Multiplying both sides by exp 1 4κ PT t=1 αt u T +1 exp 1 t=1 α2t exp 1 | {z } :=BT = u2 1 exp 1 1/2 + 2ϵ CT To bound A, we use Lemma D.6 and get 2κA z1 z 2 2 exp 1 2κ 2β ln(T/β) | {z } :=c2 1 2κ α ln(T/β) To bound BT we use Lemma D.7 BT 16κ2c2 1(ln(T/β))2 Finally using Lemma D.8 to bound CT we get E z T +1 z c2 1 exp T 2κ α ln(T/β) z1 z 2 2 + 16κ2c2 1(ln(T/β))2 2L2e2α2T σ2 1/2 + 2ϵ exp β ln(T) = E z T +1 z c1 exp T 4κ α ln(T/β) z1 z + 4κc1(ln(T/β)) T σ + 2ϵ c2 . Target-based Surrogates for Stochastic Optimization D.3. Controlling the Projection Error Let us recall the following definitions for the theoretical analysis: θt+1 := arg min θ gt(θ); gt(θ) := ℓit(zt) + ℓit(zt) zit fit(θ) zit t + 1 fit(θ) zit t 2 ; zt+1 = f(θt+1) θt+1 := arg min θ qt(θ) ; qt(θ) := ℓit(zt) + ℓit(zt) zit fit(θ) zit t + 1 2η t f(θ) zt 2 2 ; zt+1 = f( θt+1) θ t+1 := arg min θ gt(θ) ; gt(θ) := 1 i ℓi(zt) + ℓi(zt), f(θ) zt 2ηt f(θ) zt 2 2 zt+1 = f(θt+1) , where θt+1 is obtained by running mt iterations of GD on gt(θ). We will use these definitions to prove the following proposition to control the projection error in each iteration. Proposition 4.3. Assuming that (i) f is Lf-Lipschitz continuous, (ii) gt is µg strongly-convex and Lg smooth with κg := Lg/µg7, (iii) qt is µq strongly-convex (iv) ζ2 t := 8 min{µg,µq} ([min {Eit [ gt]} Eit [min { gt}]]) + 8 min{µg,µq} [min {Eit [ qt]} Eit [min { qt}]] , (v) σ2 z := minz Z {Eit [ℓit(z)]} Eit [minz Z {ℓit(z)}], if gt is minimized using mt inner-loops of GD with the appropriate step-size, then, E[ϵ2 t+1] L2 f ζ2 t + 4L2 f µg exp ( mt/κg) E[ℓ(zt) ℓ(z )] + σ2 z . Proof. Since we obtain θt+1 by minimizing gt(θ) using mt iterations of GD starting from θt, using the convergence guarantees of gradient descent (Nesterov, 2003), θt+1 θt+1 2 2 exp ( mt/κg) θt+1 θt 2 2 θt+1 θt+1 2 2 = θt+1 θt+1 + θt+1 θt+1 2 2 2 θt+1 θt+1 2 2 + 2 θt+1 θt+1 2 2 ( a + b 2 2 2 a 2 2 + 2 b 2 2) 2 exp ( mt/κg) θt+1 θt 2 2 + 2 θt+1 θt+1 2 µg exp ( mt/κg) [ gt(θt) gt( θt+1)] + 2 θt+1 θt+1 2 Taking expectation w.r.t it, Eit θt+1 θt+1 2 2 4 h exp ( mt/κg) Eit[ gt(θt) gt( θt+1)] i + 2E θt+1 θt+1 2 Let us first simplify E θt+1 θt+1 2 E θt+1 θt+1 2 2 = E θt+1 θ t+1 + θ t+1 θt+1 2 2E θt+1 θ t+1 2 2 + 2E θ t+1 θt+1 2 2 µg E[ gt(θ t+1) gt( θt+1)] + 4 µq E[ qt(θ t+1) qt( θt+1)] µg [min {Eit [ gt]} Eit [min { gt}]] + 4 µq [min {Eit [ qt]} Eit [min { qt}]] (Since E[ q] = E[ g] = g) 7We consider strongly-convex functions for simplicity, and it should be possible to extend these results to when g is non-convex but satisfies the PL inequality, or is weakly convex. 
Target-based Surrogates for Stochastic Optimization 2E θt+1 θt+1 2 2 8 min{µg, µq} ([min {Eit [ gt]} Eit [min { gt}]] + [min {Eit [ qt]} Eit [min { qt}]]) | {z } :=ζ2 t = 2E θt+1 θt+1 2 Using the above relation, Eit θt+1 θt+1 2 2 4 h exp ( mt/κg) Eit[ gt(θt) gt( θt+1)] i + ζ2 t h exp ( mt/κg) Eit[ht(θt) ht( θt+1)] i + ζ2 t (Since gt(θt) = ht(θt) and gt(θ) ht(θ) for all θ) h exp ( mt/κg) Eit[ht(θt) h t + h t ht( θt+1)] i + ζ2 t (h t := minθ ht(θ)) exp ( mt/κg) Eit[ht(θt) h t ] + ζ2 t (Since h t ht(θ) for all θ) µg [exp ( mt/κg) [Eit[ht(θt) ht(θ )] + Eit[ht(θ ) h t ]]] + ζ2 t µg [exp ( mt/κg) [Eit[ℓt(zt) ℓt(z )] + Eit[ℓt(z ) ℓ t ]]] + ζ2 t (Since h(θ) = ℓ(f(θ)) = ℓ(z)) Eit[ θt+1 θt+1 2 2] 4 µg [exp ( mt/κg) [Eit[ℓt(zt) ℓt(z )] + Eit[ℓt(z ) ℓ t ]]] + ζ2 t exp ( mt/κg) [ℓ(zt) ℓ(z )] + Eit[ℓt(z ) ℓ t ] | {z } :=σ2z (Since both zt and z are independent of the randomness in ℓt and Eit[ℓt] = ℓ) = E[ θt+1 θt+1 2 2] 4 exp ( mt/κg) ℓ(zt) ℓ(z ) + σ2 z + ζ2 t Now, we will bound E[ϵ2 t+1] by using the above inequality and the Lipschitzness of f. E[ϵ2 t+1] = zt+1 zt+1 2 2 = f(θt+1) f( θt+1) 2 2 L2 f θt+1 θt+1 2 2 (Since f is Lf-Lipschitz) = E[ϵ2 t+1] 4L2 f µg exp ( mt/κg) ℓ(zt) ℓ(z ) + σ2 z + L2 f ζ2 t Taking expectation w.r.t the randomness from iterations k = 0 to t, E[ϵ2 t+1] 4L2 f µg exp ( mt/κg) E[ℓ(zt) ℓ(z )] + σ2 z + L2 f ζ2 t Target-based Surrogates for Stochastic Optimization D.4. Example to show the necessity of ζ2 term Proposition 4.4. Consider minimizing the sum h(θ) := h1(θ)+h2(θ) 2 of two one-dimensional quadratics, h1(θ) := 1 and h2(θ) = 1 2 (2θ + 1/2)2, using SSO with mt = and ηt = c αt for any sequence of αt and any constant c (0, 1]. SSO results in convergence to a neighbourhood of the solution, specifically, if θ is the minimizer of h and θ1 > 0, then, E(θT θ ) min θ1, 3 Proof. Let us first compute θ := arg min h(θ). 4(θ 1)2 + 1 2(z 1)2 where z = θ 2(θt 1)2 + (θt 1) (θt θ) + 1 2ηt (θ θt)2 If mt = , SSO will minimize gt exactly. Since gt(θt+1) = 0, = 1 ηt (θt+1 θt) = (θt 1) = θt+1 = θt ηt (θt 1) = θt+1 = θt ηt h1(θt) Similarly, for h2, 2(z + 1/2)2 where z = 2θ 2(2θt + 1/2)2 + (2θt + 1/2) (2θt 2θ) + 1 2ηt (2θ 2θt)2 If mt = , SSO will minimize gt exactly. Since gt(θt+1) = 0, = 4 ηt (θt+1 θt) = (4θt + 1) = θt+1 = θt ηt (θt + 1/4) = θt+1 = θt ηt If it = 1 and η = c αt θt+1 = θt c αt (θt 1) = c αt + (1 cαt)θt θt+1 = θt c αt 2 4(2θt + 1 2) = (1 c αt)θt 1 Eθt+1 = (1 c αt)θt + 1 8c αt = (1 c αt)θt + 3 EθT = E(θT θ ) = (θ1 θ ) t=1 (1 c αt) + 3 t=1 2(1 c)αt i=t+1 (1 c αi) Using Lemma D.9 and the fact that c αt 1 for all t, we have that if θ1 θ = θ1 > 0, then, E(θT θ ) min θ1, 3 Target-based Surrogates for Stochastic Optimization D.5. Helper Lemmas The proofs of Lemma D.4, Lemma D.6, and Lemma D.7 can be found in (Vaswani et al., 2022). Lemma D.4. For all x > 1, 1 x 1 2 ln(x) Proof. For x > 1, we have 1 x 1 2 ln(x) ln(x) < 2x 2 Define f(x) = 2x 2 ln(x). We have f (x) = 2 1 x. Thus for x 1, we have f (x) > 0 so f is increasing on [1, ). Moreover we have f(1) = 2 2 ln(1) = 0 which shows that f(x) 0 for all x > 1 and ends the proof. Lemma D.5. For all x, γ > 0, Proof. Let x > 0. Define f(γ) = γ ex γ exp( x). We have f(γ) = exp (γ ln(γ) γ ln(ex)) exp( x) f (γ) = γ 1 γ + ln(γ) ln(ex) exp (γ ln(γ) γ ln(ex)) f (γ) 0 1 + ln(γ) ln(ex) 0 γ exp (ln(ex) 1) = x So f is decreasing on (0, x] and increasing on [x, ). Moreover, x exp( x) = 1 x exp( x) = 0 and thus f(γ) 0 for all γ > 0 which proves the lemma. Lemma D.6. 
Assuming α < 1 we have t=1 αt αT ln(T/β) 2β ln(T/β) t=1 αt = α αT +1 1 α = α 1 α αT +1 1 α = αβ T(1 α) = β T 2 ln(1/α) = β T 2 1 T ln(T/β) = 2β ln(T/β) (11) Target-based Surrogates for Stochastic Optimization where in the inequality we used Lemma D.4 and the fact that 1/α > 1. Plugging back into A we get, A α 1 α 2β ln(T/β) α ln(1/α) 2β ln(T/β) (1 x ln( 1 = αT ln(T/β) 2β ln(T/β) Lemma D.7. For α = β T 1/T and any κ > 0, t=1 α2t exp 16κ2c2(ln(T/β))2 where c2 = exp 1 2κ 2β ln(T/β Proof. First, observe that, i=t+1 αi = αt+1 αT +1 1 α = αβ T(1 α) = β T 2 ln(1/α) = β T 2 1 T ln(T/β) = 2β ln(T/β) where in the inequality we used D.4 and the fact that 1/α > 1. These relations imply that, i=t+1 αi αt+1 1 α 2β ln(T/β) 2κ 2β ln(T/β) We then have t=1 α2t exp t=1 α2t exp 1 t=1 α2t 2(1 α)2κ 2 (Lemma D.5) e2α2 T(1 α)2 e2α2 T(ln(1/α))2 = 16κ2c2(ln(T/β))2 Target-based Surrogates for Stochastic Optimization Lemma D.8. For α = β T 1/T and any ζ > 0, exp 2β ln(T) Proof. First, observe that, i=t αi = αt αT +1 1 α = αβ T(1 α) = β T 2 ln(1/α) = β T 2 1 T ln(T/β) = 2β ln(T/β) where in the inequality we used Lemma D.4 and the fact that 1/α > 1. Using the above bound we have t=1 exp 2β ζ ln(T/β) = exp 2β ln(T) Lemma D.9. For any sequence αt t=1 (1 αt) + i=t+1 (1 αi) = 1 Proof. We show this by induction on T. For T = 1, (1 α1) + α1 = 1 Induction step: t=1 (1 αt) + i=t+1 (1 αi) = (1 αT +1) t=1 (1 αt) + i=t+1 (1 αi) = (1 αT +1) t=1 (1 αt) + αT +1 + (1 αT +1) i=t+1 (1 αi) = (1 αT +1) t=1 (1 αt) + i=t+1 (1 αi) (Induction hypothesis) = (1 αT +1) + αT +1 = 1 Target-based Surrogates for Stochastic Optimization Lemma D.10. (Schmidt et al., 2011, Lemma 1). Assume that the non-negative sequence {vτ} for τ 1 satisfies the following recursion: where {Sτ} is an increasing sequence, such that S0 v2 0 and λt 0. Then for all τ 1, Proof. We prove this lemma by induction. For τ = 1 we have v2 1 S1 + λ1v1 = (v1 λ1 2 )2 S1 + λ2 1 4 = v1 Inductive hypothesis: Assume that the conclusion holds for τ {1, . . . , k}. Specifically, for τ = k, Now we show that the conclusion holds for τ = k + 1. Define ψk := Pk t=1 λt. Note that ψk+1 ψk since λt 0 for all t. Hence, vk 1 2ψk + Sk + 1 2ψk 2 1/2 . Define k := arg maxτ {0,...,k} vτ. Using the main assumption for τ = k + 1, v2 k+1 Sk+1 + = v2 k+1 λk+1vk+1 Sk+1 + = vk+1 λk+1 2 Sk+1 + λ2 k+1 Sk+1 + λ2 k+1 t=1 λt (since vk is the maximum) = Sk+1 + λ2 k+1 4 + vk ψk (based on the definition for ψk) Since k k, by the inductive hypothesis, 2 Sk+1 + λ2 k+1 = Sk+1 + λ2 k+1 Sk+1 + λ2 k+1 (Since {ψτ} is non-decreasing and k k) Target-based Surrogates for Stochastic Optimization Furthermore, since {Sτ} is increasing and k < k + 1, 2 Sk+1 + λ2 k+1 Sk+1 + λ2 k+1 (Since {ψτ} is non-decreasing and k < k + 1) = Sk+1 + λ2 k+1 4ψ2 k+1 + ψk (a2 + b2 (a + b)2 for a = ψk > 0 and b = λk+1 > 0) 4ψ2 k+1 | {z } :=x2 | {z } =2xy 4ψ2 k |{z} :=y2 = vk+1 λk+1 where replacing ψk+1 with its definition gives us the required result. Target-based Surrogates for Stochastic Optimization E. Additional Experimental Results Supervised Learning: We evaluate our framework on the Lib SVM benchmarks (Chang and Lin, 2011), a standard suite of convex-optimization problems. Here consider two datasets mushrooms, and rcv1, two losses squared loss and logistic loss, and four batch sizes {25, 125, 625, full-batch} for the linear parameterization (f = X θ). Each optimization algorithm is run for 500 epochs (full passes over the data). E.1. 
Stochastic Surrogate Optimization

Comparisons of SGD, SLS, Adam, Adagrad, and SSO evaluated on three LIBSVM benchmarks (mushrooms, ijcnn, and rcv1), two losses (squared loss and logistic loss), and four batch sizes {25, 125, 625, full-batch}. Each run was evaluated over three random seeds following the same initialization scheme. All plots are in log-log space to make trends between optimization algorithms more apparent. All algorithms and batch sizes are evaluated for 500 epochs and performance is represented as a function of total optimization steps.

We compare stochastic surrogate optimization (SSO) against SGD with the standard theoretical $1/(2L_\theta)$ step-size, SGD with the step-size set according to a stochastic line-search (SLS; Vaswani et al., 2019b), and finally Adam (Kingma and Ba, 2015) using default hyper-parameters. Since SSO is equivalent to projected SGD in the target space, we set $\eta$ (in the surrogate definition) to $1/(2L)$, where $L$ is the smoothness of $\ell$ w.r.t. $z$. For the squared loss, $L$ is therefore set to 1, while for the logistic loss it is set to 2. These figures show (i) SSO improves over SGD when the step-sizes are set theoretically, (ii) SSO is competitive with SLS and Adam, and (iii) as $m$ increases, the performance of SSO improves on average as the projection error decreases. For further details see the attached code repository. Below we include three different step-size schedules: constant, $1/\sqrt{t}$ (Orabona, 2019), and $(1/T)^{t/T}$ (Vaswani et al., 2022); a minimal sketch of these schedules is given after Figure 6.

[Figure 4 plots: training loss vs. optimization steps (log-log) for batch sizes 25, 125, 625, and full-batch; curves for SGD, Adam, SLS, Adagrad, SSO-1, SSO-5, SSO-10, SSO-20.]

Figure 4: Constant step-size: comparison of optimization algorithms under a mean squared error loss. We note that SSO significantly outperforms its parametric counterpart and maintains performance on par with both SLS and Adam. Additionally, taking additional steps in the surrogate generally improves performance.

[Figure 5 plots: training loss vs. optimization steps (log-log) for batch sizes 25, 125, 625, and full-batch; same legend as Figure 4.]

Figure 5: Constant step-size: comparison of optimization algorithms under an average logistic loss. We note that SSO significantly outperforms its parametric counterpart and maintains performance on par with both SLS and Adam. Additionally, taking additional steps in the surrogate generally improves performance.

[Figure 6 plots: training loss vs. optimization steps (log-log) for batch sizes 25, 125, 625, and full-batch; same legend as Figure 4.]

Figure 6: Decreasing step-size: comparison of optimization algorithms under a mean squared error loss. A decaying step-size of $1/\sqrt{t}$ is used for SSO, SGD, and SLS; Adam (and Adagrad) are unchanged. Again, we note that taking additional steps in the surrogate generally improves performance. Additionally, the decreasing step-size seems to help maintain strictly monotonic improvement.
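The following is a small illustrative helper for the three step-size schedules referenced above. The function name and the `eta0` argument are ours, and the values should be read as placeholders rather than the exact constants used in the experiments.

```python
import math

def step_size(schedule, t, T, eta0):
    """Return eta_t for the three schedules in Appendix E.1: constant,
    1/sqrt(t) (Orabona, 2019), and the exponential schedule (1/T)^(t/T)
    (Vaswani et al., 2022). `eta0` plays the role of 1/(2L) for the
    target-space smoothness constant L."""
    if schedule == "constant":
        return eta0
    if schedule == "sqrt":                     # decreasing: eta0 / sqrt(t)
        return eta0 / math.sqrt(max(t, 1))
    if schedule == "exponential":              # eta0 * (1/T)^(t/T)
        return eta0 * (1.0 / T) ** (t / T)
    raise ValueError(f"unknown schedule: {schedule}")
```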
[Figure 7 plots: training loss vs. optimization steps (log-log) for batch sizes 25, 125, 625, and full-batch; curves for SGD, Adam, SLS, Adagrad, SSO-1, SSO-5, SSO-10, SSO-20.]

Figure 7: Decreasing step-size: comparison of optimization algorithms under a logistic loss. A decaying step-size of $1/\sqrt{t}$ is used for SSO, SGD, and SLS; Adam (and Adagrad) are unchanged. Again, we note that taking additional steps in the surrogate generally improves performance. Additionally, the decreasing step-size seems to help maintain strictly monotonic improvement.

[Figure 8 plots: training loss vs. optimization steps (log-log) for batch sizes 25, 125, 625, and full-batch; same legend as Figure 7.]

Figure 8: Exponential step-size: comparison of optimization algorithms under a mean squared error loss. A decaying step-size of $(1/T)^{t/T}$ is used for SSO, SGD, and SLS; Adam is unchanged. Again, we note that taking additional steps in the surrogate generally improves performance, and the decreasing step-size seems to help maintain strictly monotonic improvement. Lastly, because of the less aggressive step-size decay, the optimization algorithms make more progress than their $1/\sqrt{t}$ counterparts.

[Figure 9 plots: training loss vs. optimization steps (log-log) for batch sizes 25, 125, 625, and full-batch; same legend as Figure 7.]

Figure 9: Exponential step-size: comparison of optimization algorithms under a logistic loss. A decaying step-size of $(1/T)^{t/T}$ is used for SSO, SGD, and SLS; Adam remains the same as above. Again, we note that taking additional steps in the surrogate generally improves performance, and the decreasing step-size seems to help maintain strictly monotonic improvement. Lastly, because of the less aggressive step-size decay, the optimization algorithms make more progress than their $1/\sqrt{t}$ counterparts.

E.2. Stochastic Surrogate Optimization with a Line-search

Comparisons of SGD, SLS, Adam, Adagrad, and SSO-SLS evaluated on three LIBSVM benchmarks (mushrooms, ijcnn, and rcv1). Each run was evaluated over three random seeds following the same initialization scheme. All plots are in log-log space to make trends between optimization algorithms more apparent. As before, in all settings, algorithms use either their theoretical step-size when available, or the default as defined by Paszke et al. (2019). The inner optimization loop is set according to line-search parameters and heuristics following Vaswani et al. (2019b). All algorithms and batch sizes are evaluated for 500 epochs and performance is represented as a function of total optimization steps.
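For concreteness, here is a hedged sketch of the backtracking (Armijo) step used inside such an inner loop. The constants `c`, `beta_ls`, and `eta_max`, and the function names, are illustrative and follow the general recipe of Vaswani et al. (2019b) rather than the exact implementation.

```python
import numpy as np

def armijo_inner_step(surrogate, grad_surrogate, theta, eta_max=1.0, c=0.5, beta_ls=0.9):
    """One backtracking Armijo step on the surrogate: shrink the step-size until the
    sufficient-decrease condition holds, then take a gradient step. Sketch only."""
    g = grad_surrogate(theta)
    g_norm2 = float(np.dot(g, g))
    eta = eta_max
    # Backtrack until the Armijo (sufficient decrease) condition is satisfied.
    while surrogate(theta - eta * g) > surrogate(theta) - c * eta * g_norm2 and eta > 1e-8:
        eta *= beta_ls
    return theta - eta * g
```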
[Figure 10 plots: training loss vs. optimization steps (log-log) for batch sizes 25, 125, 625, and full-batch; curves for SGD, Adam, SLS, Adagrad, SSO-1, SSO-5, SSO-10, SSO-20.]

Figure 10: Constant step-size: comparison of optimization algorithms under a mean squared error loss. We note that SSO-SLS outperforms its parametric counterpart and maintains performance on par with both SLS and Adam. Additionally, taking additional steps in the surrogate generally improves performance, especially in settings with less noise (full-batch and batch-size 625).

[Figure 11 plots: training loss vs. optimization steps (log-log) for batch sizes 25, 125, 625, and full-batch; same legend as Figure 10.]

Figure 11: Constant step-size: comparison of optimization algorithms under an average logistic loss. We note that SSO-SLS outperforms its parametric counterpart and maintains performance on par with both SLS and Adam. Additionally, taking additional steps in the surrogate generally improves performance.

E.3. Combining Stochastic Surrogate Optimization with Adaptive Gradient Methods

Comparisons of SGD, SLS, Adam, Adagrad, and SSO-Adagrad evaluated on three LIBSVM benchmarks (mushrooms, ijcnn, and rcv1). Each run was evaluated over three random seeds following the same initialization scheme. All plots are in log-log space to make trends between optimization algorithms more apparent. As before, in all settings, algorithms use either their theoretical step-size when available, or the default as defined by Paszke et al. (2019). The inner optimization loop is set according to line-search parameters and heuristics following Vaswani et al. (2019b). All algorithms and batch sizes are evaluated for 500 epochs and performance is represented as a function of total optimization steps. Here we update $\eta$ according to the same schedule as scalar Adagrad (termed AdaGrad-Norm in Ward et al. (2020)). Because Adagrad does not have an easy-to-compute optimal theoretical step-size, in our setting we set the negative log learning rate ($-\log \eta$) to 2; a minimal sketch of this schedule is given after Figure 12. For further details see the attached code repository.

[Figure 12 plots: training loss vs. optimization steps (log-log) for batch sizes 25, 125, 625, and full-batch; curves for SGD, Adam, SLS, Adagrad, SSO-1, SSO-5, SSO-10, SSO-20.]

Figure 12: Comparison in terms of the average MSE loss of SGD, SLS, Adam, and SSO-Adagrad, evaluated under a mean squared error loss. These plots show that SSO-Adagrad outperforms its parametric counterpart and maintains performance on par with both SLS and Adam. Additionally, we again find that taking additional steps in the surrogate generally improves performance.
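A minimal sketch of the AdaGrad-Norm schedule (scalar Adagrad; Ward et al., 2020) used to set $\eta$ follows. The class name and default constants are illustrative placeholders.

```python
import numpy as np

class AdaGradNormStep:
    """Scalar AdaGrad ("AdaGrad-Norm") schedule: divide a base step-size by the
    square root of the running sum of squared gradient norms. Sketch only."""
    def __init__(self, eta0=1e-2, b0=1e-8):
        self.eta0 = eta0        # base step-size (placeholder value)
        self.accum = b0         # running sum of squared gradient norms

    def eta(self, grad):
        self.accum += float(np.dot(grad, grad))
        return self.eta0 / np.sqrt(self.accum)
```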
[Figure 13 plots: training loss vs. optimization steps (log-log) for batch sizes 25, 125, 625, and full-batch; curves for SGD, Adam, SLS, Adagrad, SSO-1, SSO-5, SSO-10, SSO-20.]

Figure 13: Comparison in terms of the average logistic loss of SGD, SLS, Adam, and SSO-Adagrad, evaluated under the logistic loss. These plots show that SSO-Adagrad outperforms its parametric counterpart and maintains performance on par with both SLS and Adam. Additionally, we again find that taking additional steps in the surrogate generally improves performance.

E.4. Combining Stochastic Surrogate Optimization with Online Newton Steps

Comparisons of SGD, SLS, Adam, Adagrad, and SSO-Newton evaluated on three LIBSVM benchmarks (mushrooms, ijcnn, and rcv1). Each run was evaluated over three random seeds following the same initialization scheme. All plots are in log-log space to make trends between optimization algorithms more apparent. As before, in all settings, algorithms use either their theoretical step-size when available, or the default as defined by Paszke et al. (2019). The inner optimization loop is set according to line-search parameters and heuristics following Vaswani et al. (2019b). All algorithms and batch sizes are evaluated for 500 epochs and performance is represented as a function of total optimization steps. Here we update $\eta$ according to the same schedule as the online Newton step. We omit the MSE example, as SSO-Newton in this setting is equivalent to SSO. In the logistic loss setting displayed below, however, we re-scale the regularization term by $(1 - p)p$, where $p = \sigma(f(x))$, $\sigma$ is the sigmoid function, and $f(x)$ is the target. This operation is done per data point and, as can be seen below, often leads to extremely good performance, even in the stochastic setting. In the plots below, if a line vanishes before the maximum number of optimization steps has been reached, this indicates that the algorithm converged to the minimum and was no longer executed. Notably, SSO-Newton achieves this for multiple datasets and batch sizes.

[Figure 14 plots: logistic loss vs. optimization steps (log-log) for batch sizes 25, 125, 625, and full-batch; curves for SGD, Adam, SLS, Adagrad, SSO-1, SSO-5, SSO-10, SSO-20.]

Figure 14: Comparison in terms of the average logistic loss of SGD, SLS, Adam, and SSO-Newton evaluated on the logistic loss. This plot shows that significant improvement can be made at no additional cost by re-scaling the regularization term correctly. Note that in the case of mushrooms, SSO-Newton with $m = 20$ reaches the stopping criterion before the 500th epoch in all cases. Second, even in many stochastic settings, SSO-Newton outperforms both SLS and Adam.

E.5. Comparison with SVRG

Below we include a simple supervised learning example in which we compare the standard variance-reduction algorithm SVRG (Johnson and Zhang, 2013) to the SSO algorithm. Like the examples above, we test the optimization algorithms on the rcv1 dataset (Chang and Lin, 2011) under an MSE loss. This plot shows that SSO can outperform SVRG even in small batch-size settings.
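For reference, here is a minimal sketch of the SVRG update used as the baseline in this comparison. This is the standard algorithm of Johnson and Zhang (2013); the function and argument names are ours, not the authors' implementation.

```python
import numpy as np

def svrg_epoch(w, n, grad_fi, step, m):
    """One outer SVRG epoch: compute a full-gradient snapshot, then take m inner
    steps with variance-reduced stochastic gradients. `grad_fi(w, i)` returns the
    gradient of the i-th example's loss; illustrative sketch only."""
    full_grad = np.mean([grad_fi(w, i) for i in range(n)], axis=0)  # snapshot gradient
    w_snap = w.copy()
    for _ in range(m):
        i = np.random.randint(n)
        g = grad_fi(w, i) - grad_fi(w_snap, i) + full_grad          # variance-reduced gradient
        w = w - step * g
    return w
```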
For additional details on the implementation and hyper-parameters used, see https://github.com/WilderLavington/Target-Based-Surrogates-For-Stochastic-Optimization. We note that, for small enough batch sizes, SVRG may become competitive with SSO; however, we leave such a comparison for future work.

[Figure 15 plots: average MSE vs. optimization steps for batch sizes 25 (steps in units of 10^5), 125 (10^4), 625 (10^3), and full-batch (10^2); curves for SGD, SVRG, SSO-1, SSO-10, SSO-20.]

Figure 15: Comparison of the average mean-squared error between SGD, SVRG, and SSO-SGD. This plot shows that even in settings where the batch size is small, SSO can improve over variance-reduction techniques such as SVRG.

E.6. Imitation Learning

[Figure 16 plots: log-policy-loss and policy return vs. total examples (up to 50,000) under behavioral cloning and online imitation; panels (a) Walker2d-v2 and (b) Hopper-v2; curves for Adagrad, Adam, SLS, SSO-1, SSO-10, SSO-100, SSO-1000.]

Figure 16: Comparison of the policy return and log policy loss incurred by SGD, SLS, Adam, Adagrad, and SSO as a function of the total interactions. Unlike Section 5, the mean of the policy is parameterized by a neural network model. In both environments, for both behavioral policies, SSO outperforms all other online-optimization algorithms. Additionally, as $m$ increases, so too does the performance of SSO in terms of both the return and the loss.

[Figure 17 plots: policy loss and policy return vs. total examples (up to 50,000) under behavioral cloning and online imitation; panels (a) Walker2d-v2 and (b) Hopper-v2; same legend as Figure 16.]

Figure 17: Comparison of the policy return and log policy loss incurred by SGD, SLS, Adam, Adagrad, and SSO as a function of the total interactions. Unlike Section 5, the mean of the policy is parameterized by a linear model. In both environments, for both behavioral policies, SSO outperforms all other online-optimization algorithms. Additionally, as $m$ increases, so too does the performance of SSO in terms of both the return and the loss.

Comparisons of Adagrad, SLS, Adam, and SSO evaluated on two MuJoCo (Todorov et al., 2012) imitation learning benchmarks (Lavington et al., 2022): Hopper-v2 and Walker2d-v2. In this setting, training and evaluation proceed in rounds. At every round, a behavioral policy samples data from the environment and an expert labels that data. The goal is to predict at the next stage (conditioned on the sampled states) how the expert will label the gathered examples. Here, unlike the supervised learning setting, we receive a stream of new data points which can be correlated and drawn from different distributions through time.
Theoretically this makes the optimization problem significantly more difficult, and because we must interact with a simulator, querying the stochastic gradient can be expensive. Like the example in Appendix E, in this setting we interact under both the expert's policy distribution (behavioral cloning) and the policy distribution induced by the agent (online imitation). We parameterize a standard normal distribution whose mean is learned through a mean squared error loss between the expert labels and the mean of the agent policy. Again, the expert is trained following soft actor-critic. All experiments were run using an NVIDIA GeForce RTX 2070 graphics card with an AMD Ryzen 9 3900 12-core processor.

[Figure 18 plots: log-policy-loss vs. total examples (up to 500k) under behavioral cloning and online imitation for the Walker2d environment; curves for Adagrad, Adam, SLS, SSO-1, SSO-10, SSO-100, SSO-1000.]

Figure 18: Comparison of the log policy loss incurred by SGD, SLS, Adam, Adagrad, and SSO as a function of the total interactions for the Walker2d gym environment. Unlike Section 5, the mean of the policy is parameterized by a linear model. Unlike Fig. 17 and Fig. 16, this plot displays 500 thousand iterations instead of only 50 thousand. We again see that SSO outperforms all other online-optimization algorithms, even for a very large number of iterations. Additionally, as $m$ increases, the log loss of SSO again improves.

In this setting we evaluate two measures: the per-round log-policy loss and the policy return. The log policy loss is as described above, while the return measures how well the imitation-learning policy actually solves the task. In the MuJoCo benchmarks, this reward is defined as a function of how quickly the agent can move through space, as well as the power exerted to move. Imitation learning generally functions by taking a policy which has a high reward (e.g. one that can move through space with very little effort in terms of torque) and directly imitating it, instead of attempting to learn a cost-to-go function as is done in RL (Sutton et al., 2000). Each algorithm is evaluated over three random seeds following the same initialization scheme. As before, in all settings, algorithms use either their theoretical step-size when available, or the default as defined by Paszke et al. (2019). The inner optimization loop is set according to line-search parameters and heuristics following Vaswani et al. (2019b). All algorithms and batch sizes are evaluated for 50 rounds of interaction (except in the case of Fig. 18, which is evaluated for 500), and performance is represented as a function of total interactions with the environment. Below, we learn a policy which is parameterized by a two-layer perceptron with 256 hidden units and ReLU activations (as was done in the main paper). For further details, please see the attached code repository.
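To make the round-based protocol described above concrete, here is a minimal sketch of one training run. The `env`, `expert`, and `policy` objects are hypothetical interfaces standing in for the benchmark's actual classes, and the inner loop performs plain gradient steps on the gathered batch, playing the role of SSO's cheap surrogate updates that amortize the cost of environment interaction.

```python
import numpy as np

def online_imitation(env, expert, policy, rounds=50, samples_per_round=1000, m=10, alpha=0.1):
    """Sketch of the round-based online imitation protocol: gather states with the
    behavioral policy, query the expert for labels (the expensive oracle), then take
    m cheap parameter updates on the resulting regression loss. Placeholder interfaces."""
    for _ in range(rounds):
        # 1. Interact with the environment under the current (behavioral) policy.
        states = np.stack([env.sample_state(policy) for _ in range(samples_per_round)])
        # 2. Query the expert for target actions.
        targets = np.stack([expert.act(s) for s in states])
        # 3. Take m updates on the squared error between the policy mean and the
        #    expert labels, amortizing the cost of data collection.
        for _ in range(m):
            grad = policy.grad_mse(states, targets)
            policy.params -= alpha * grad
    return policy
```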