# Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation

Sobihan Surendran1,2, Adeline Fermanian2, Antoine Godichon-Baggioni1, Sylvain Le Corff1
1Sorbonne Université, CNRS, Laboratoire de Probabilités, Statistique et Modélisation, Paris, France
2LOPF, Califrais Machine Learning Lab, Paris, France
Corresponding author: sobihan.surendran@sorbonne-universite.fr
38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Stochastic Gradient Descent (SGD) with adaptive steps is widely used to train deep neural networks and generative models. Most theoretical results assume that it is possible to obtain unbiased gradient estimators, which is not the case in several recent deep learning and reinforcement learning applications that use Monte Carlo methods. This paper provides a comprehensive non-asymptotic analysis of SGD with biased gradients and adaptive steps for non-convex smooth functions. Our study incorporates time-dependent bias and emphasizes the importance of controlling the bias of the gradient estimator. In particular, we establish that Adagrad, RMSProp, and AMSGRAD, an exponential moving average variant of Adam, with biased gradients, converge to critical points for smooth non-convex functions at a rate similar to existing results in the literature for the unbiased case. Finally, we provide experimental results using Variational Autoencoders (VAE) and applications to several learning frameworks that illustrate our convergence results and show how the effect of bias can be reduced by appropriate hyperparameter tuning.

1 Introduction

Stochastic Gradient Descent (SGD) algorithms are standard methods to train statistical models based on deep architectures. Consider a general optimization problem:
$$\theta^* \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} V(\theta)\,, \qquad (1)$$
where $V$ is the objective function. Gradient Descent methods then produce a sequence of parameter estimates as follows: $\theta_0 \in \mathbb{R}^d$ and, for all $n \in \mathbb{N}$,
$$\theta_{n+1} = \theta_n - \gamma_{n+1} \nabla V(\theta_n)\,,$$
where $\nabla V$ denotes the gradient of $V$ and, for all $n \geq 1$, $\gamma_n > 0$ is the learning rate. In many cases, it is not possible to compute the exact gradient of the objective function, hence the introduction of vanilla Stochastic Gradient Descent, defined for all $n \in \mathbb{N}$ by:
$$\theta_{n+1} = \theta_n - \gamma_{n+1} \widehat{\nabla V}(\theta_n)\,,$$
where $\widehat{\nabla V}(\theta_n)$ is an estimator of $\nabla V(\theta_n)$. For example, in deep learning, stochasticity emerges with the use of mini-batches. While these algorithms have been extensively studied, both theoretically and practically, see, e.g., [10], many questions remain open. In particular, most results are based on the case where the estimator $\widehat{\nabla V}$ is unbiased. Although this assumption is valid in the case of vanilla SGD, it breaks down in many common applications. For example, zeroth-order methods used to optimize black-box functions [61] in generative adversarial networks [58, 16] have access only to noisy, biased realizations of the objective function. Furthermore, in reinforcement learning algorithms such as Q-learning [42], policy gradient [5], and temporal difference learning [8, 52, 18], gradient estimators are often obtained using a Markov chain with state-dependent transition probability. These estimators are then biased [69, 23]. Other examples of biased gradients can be found in the field of generative modeling with Markov Chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) [34, 13].
In particular, the Importance Weighted Autoencoder (IWAE) proposed by [12], which is an extension of the standard Variational Autoencoder (VAE) [48], yields biased estimators. Finally, this is also the case in Bilevel Optimization [43, 36, 41] and Conditional Stochastic Optimization [40, 39]. Moreover, in practical applications, vanilla SGD shows difficulties in calibrating the step sequences. Therefore, modern variants of SGD employ adaptive steps that use past stochastic gradients or Hessians to avoid saddle points and deal with ill-conditioned problems. The idea of adaptive steps was first proposed in the online learning literature by [4] and later adopted in stochastic optimization, with the Adagrad algorithm of [27].

In this paper, we give non-asymptotic convergence guarantees for modern variants of SGD where both the estimators are biased and the steps are adaptive. To our knowledge, existing results consider either adaptive steps but unbiased estimators [27, 77, 67, 74, 19], or biased estimators with non-adaptive steps [70, 44, 2, 22, 21]. More precisely, our contributions are summarized as follows. We provide convergence guarantees for the Biased Adaptive Stochastic Approximation framework, under weak assumptions on the bias. To the best of our knowledge, these are the first convergence results to incorporate adaptive steps in biased Stochastic Approximation. In particular, we establish that Adagrad, RMSProp, and AMSGRAD, an exponential moving average variant of Adam, with a biased gradient, converge to a critical point for non-convex smooth functions with a convergence rate of $O(\log n/\sqrt{n} + b_n)$, where $b_n$ is related to the bias at iteration $n$. In addition, we achieve an improved linear convergence rate under the Polyak-Łojasiewicz (PL) condition. Finally, we show how our theoretical results apply to several applications with biased gradients. In particular, we show that our hypotheses hold for Stochastic Bilevel Optimization and Conditional Stochastic Optimization, but also for Self-Normalized Importance Sampling estimators and Coordinate Sampling. We also propose a first non-asymptotic bound on the bias of IWAE, which allows us to illustrate through several experiments the effect of bias on the convergence of the optimization, and to show how this effect can be reduced by an appropriate choice of hyperparameters.

Organization of the paper. In Section 2, we introduce the setting of the paper and relevant related works. In Section 3, we present the Adaptive Stochastic Approximation framework and the main assumptions. In Section 4, we present our principal results, i.e., convergence rates for the risk when the PL condition is assumed, and on the gradient norm without this hypothesis. We illustrate our results in Section 5. All proofs are postponed to the appendix.

2 Setting and Related Works

Stochastic Approximation. Stochastic Approximation (SA) methods go far beyond SGD. They consist of sequential algorithms designed to find the zeros of a function when only noisy observations are available. Indeed, [68] introduced the Stochastic Approximation algorithm as an iterative recursive procedure to solve the following integral equation:
$$h(\theta) = \mathbb{E}_\pi\left[H_\theta(X)\right] = \int_{\mathsf{X}} H_\theta(x)\,\pi(x)\,\mathrm{d}x = 0\,, \qquad (2)$$
where $h$ is the mean field function, $X$ is a random variable taking values in a measurable space $(\mathsf{X}, \mathcal{X})$, and $\mathbb{E}_\pi$ is the expectation under the distribution $\pi$. In this context, $H_\theta$ can be any arbitrary function. If $H_\theta(X)$ is an unbiased estimator of the gradient of the objective function, then $h(\theta) = \nabla V(\theta)$.
As a result, the minimization problem (1) is then equivalent to solving problem (2), and we can note that SGD is a specific instance of SA. SA methods are then defined as follows:
$$\theta_{n+1} = \theta_n - \gamma_{n+1} H_{\theta_n}(X_{n+1})\,,$$
where the term $H_{\theta_n}(X_{n+1})$ is the $n$-th stochastic update, also known as the drift term, and is a potentially biased estimator of $\nabla V(\theta_n)$. It depends on a random variable $X_{n+1}$ which takes its values in $(\mathsf{X}, \mathcal{X})$. In machine learning, $V$ typically represents the theoretical risk, $\theta$ the model parameters, and $X_{n+1}$ the data.

Adaptive Stochastic Gradient Descent. SGD can be traced back to [68], and its averaged counterpart was proposed by [66]. The non-asymptotic analysis of SGD in both the convex and strongly convex cases can be found in [59]. [32] prove the convergence of a random iterate of SGD for non-convex smooth functions, which was already suggested by the results of [9]. They show that SGD with constant or decreasing stepsize $\gamma_n = 1/\sqrt{n}$ converges to a stationary point of a non-convex smooth function $V$ at a rate of $O(1/\sqrt{n})$, where $n$ is the number of iterations. Most adaptive first-order methods, such as Adam [47], Adadelta [78], RMSProp [72], and NAdam [25], are based on the blueprint provided by the Adagrad family of algorithms. The first known work on adaptive steps for non-convex stochastic optimization, in the asymptotic case, was presented by [50]. [74] proved that Adagrad converges to a critical point for non-convex objectives at a rate of $O(\log n/\sqrt{n})$ when using a scalar adaptive step. In addition, [79] extended this proof to multidimensional settings. More recently, [19] focused on the convergence rates for Adagrad and Adam. Furthermore, several modified versions of Adam have been proposed, such as AMSGRAD [67] and YOGI [77].

Biased Stochastic Approximation. The asymptotic behavior of Biased SA has been studied by [70]. Non-asymptotic analyses can be found in the reinforcement learning literature, especially in the context of temporal difference (TD) learning, as explored by [8, 52, 18]. The case of non-convex smooth functions has been studied by [44]. The authors establish convergence results for the mean field function at a rate of $O(\log n/\sqrt{n} + b)$, where $b$ corresponds to the bias and $n$ to the number of iterations. For strongly convex functions, the convergence of SGD with biased gradients can be found in [2], specifically addressing the case of martingale noise with a constant step size. [46, 21] introduce a novel assumption, known as Expected Smoothness, which is the weakest assumption in the existing literature on biased SGD, and which we extend to cover the adaptive case. The authors provide convergence results in the case of non-convex smooth functions. Convergence results with assumptions on the control of the bias and of the MSE can be found in [56, 22]. Applications of biased gradients can be found in Bilevel Optimization [43, 36, 41] and Conditional Stochastic Optimization [40, 39]. Moreover, biased gradients are also used in various other applications [38, 54, 6, 56]. Finally, [3] studied convergence results of biased gradients with Adagrad in the Markov chain case, focusing on the norm of the gradient of the Moreau envelope while assuming the boundedness of the objective function. Our analysis provides non-asymptotic results in a more general setting, for a wide variety of objective functions and adaptive algorithms, treating both the martingale and Markov chain cases.
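Before turning to the adaptive setting of Section 3, here is a minimal sketch (not from the paper) of the biased SA recursion above on a toy quadratic objective; the oracle, the noise level, and the bias schedule are illustrative assumptions only.

```python
import numpy as np

def biased_sa(grad_oracle, theta0, n_iters, step=lambda n: 1.0 / np.sqrt(n)):
    """Plain (non-adaptive) stochastic approximation:
    theta_{n+1} = theta_n - gamma_{n+1} * H_{theta_n}(X_{n+1}),
    where grad_oracle returns a possibly biased estimate of grad V(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for n in range(1, n_iters + 1):
        drift = grad_oracle(theta, n)     # H_{theta_n}(X_{n+1}), possibly biased
        theta = theta - step(n) * drift
    return theta

# Toy example: V(theta) = 0.5 * ||theta||^2, with Gaussian noise and a bias
# whose norm decays as n^{-1/2} (playing the role of b_n in the analysis).
rng = np.random.default_rng(0)
oracle = lambda th, n: th + 0.1 * rng.standard_normal(th.shape) + 0.5 * n ** -0.5
print(biased_sa(oracle, theta0=np.ones(3), n_iters=10_000))
```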
3 Adaptive Stochastic Approximation

3.1 Framework

Consider the optimization problem (1) where the objective function $V$ is assumed to be differentiable. In this paper, we focus on the following SA algorithm with adaptive steps: $\theta_0 \in \mathbb{R}^d$ and, for all $n \in \mathbb{N}$,
$$\theta_{n+1} = \theta_n - \gamma_{n+1} A_n H_{\theta_n}(X_{n+1})\,, \qquad (3)$$
where $\gamma_{n+1} > 0$ and $(A_n)$ is a sequence of symmetric and positive definite matrices. In a context of biased gradient estimates, choosing
$$A_n = \left(\delta I_d + \frac{1}{n+1}\sum_{k=0}^{n} H_{\theta_k}(X_{k+1}) H_{\theta_k}(X_{k+1})^\top\right)^{-1/2}$$
can be assimilated to the full Adagrad algorithm [27]. However, computing the square root of the inverse becomes expensive in high dimensions, so in practice, Adagrad is often used with diagonal matrices. This approach has been shown to be particularly effective in sparse optimization settings. Denoting by $\mathrm{Diag}(A)$ the matrix formed with the diagonal terms of $A$ and setting all other terms to 0, Adagrad with diagonal matrices is defined in our context as:
$$A_n = \left[\delta I_d + \mathrm{Diag}\left(H_n(X_{1:n+1}, \theta_{0:n})\right)\right]^{-1/2}\,, \qquad (4)$$
$$H_n(X_{1:n+1}, \theta_{0:n}) = \frac{1}{n+1}\sum_{k=0}^{n} H_{\theta_k}(X_{k+1}) H_{\theta_k}(X_{k+1})^\top\,.$$
In RMSProp [72], $H_n(X_{1:n+1}, \theta_{0:n})$ in (4) is an exponential moving average of the past squared gradients, defined by:
$$H_n(X_{1:n+1}, \theta_{0:n}) = (1-\rho)\sum_{k=0}^{n} \rho^{\,n-k} H_{\theta_k}(X_{k+1}) H_{\theta_k}(X_{k+1})^\top\,,$$
where $\rho$ is the moving average parameter. Furthermore, when $A_n$ is a recursive estimate of the inverse Hessian, it corresponds to the Stochastic Newton algorithm [11].

3.2 Assumptions

Consider the following assumptions.

H1 There exists a constant $\mu > 0$ such that for all $\theta \in \mathbb{R}^d$,
$$2\mu\left(V(\theta) - V(\theta^*)\right) \leq \|\nabla V(\theta)\|^2\,.$$
H1 corresponds to the Polyak-Łojasiewicz condition, which is weaker than strong convexity and remains satisfied even when the function is non-convex. It ensures uniqueness of the minimizer $\theta^*$. The PL condition has been extensively studied theoretically [45] and has been verified empirically in many applications, such as over-parameterized deep networks [26] and Linear Quadratic Regulator models [29].

H2 The objective function $V$ is $L$-smooth: for all $(\theta, \theta') \in \mathbb{R}^d \times \mathbb{R}^d$,
$$\|\nabla V(\theta) - \nabla V(\theta')\| \leq L\,\|\theta - \theta'\|\,.$$
This assumption is crucial to obtain our convergence rate and is very common, see, e.g., [59, 10]. Under this assumption, for all $(\theta, \theta') \in \mathbb{R}^d \times \mathbb{R}^d$,
$$V(\theta) \leq V(\theta') + \langle \nabla V(\theta'), \theta - \theta' \rangle + \frac{L}{2}\,\|\theta - \theta'\|^2\,. \qquad (5)$$

H3 (i) Biased Gradients: there exist two non-increasing positive sequences $(\lambda_n)_{n\geq 1}$ and $(r_n)_{n\geq 1}$ such that for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\langle \nabla V(\theta_n), A_n H_{\theta_n}(X_{n+1}) \rangle\right] \geq \lambda_{n+1}\,\mathbb{E}\left[\|\nabla V(\theta_n)\|^2\right] - r_{n+1}\,.$$
(ii) Expected Smoothness: there exist a non-increasing non-negative sequence $(\sigma^2_n)_{n\geq 1}$ and positive constants $\sigma_1, \sigma_2$ such that for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|H_{\theta_n}(X_{n+1})\|^2\right] \leq \sigma^2_n + \sigma_1\,\mathbb{E}\left[\|\nabla V(\theta_n)\|^2\right] + \sigma_2\,\mathbb{E}\left[V(\theta_n) - V(\theta^*)\right]\,.$$

In this assumption, $r_{n+1}$ represents an additive bias term, generally of the order of the square of the bias, and $\lambda_{n+1}$ may depend on the minimum eigenvalue of $A_n$. In [21, Theorem 2], it has been demonstrated that this assumption is weaker than the alternatives used in the literature on biased SGD. We have extended these assumptions to the adaptive case. It is important to note that the first point of H3 depends on the application (objective function $V$) and on the adaptive algorithm (matrix $A_n$) that we want to use. The purpose of this assumption is to provide a more general framework that covers all possible applications and adaptive algorithms. In the biased SGD setting, if the bias term $\|\mathbb{E}[H_{\theta_n}(X_{n+1}) \mid \mathcal{F}_n] - \nabla V(\theta_n)\|$ is bounded by $b_{n+1}$, we can easily verify the first point of H3 by considering $\lambda_{n+1} = 1/2$ and $r_{n+1} = b^2_{n+1}$. We show in Section 4.3 that this assumption is also verified in algorithms such as Adagrad and RMSProp.
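For concreteness, the sketch below (not from the paper) implements the recursion (3) with the diagonal Adagrad statistic of (4) or its RMSProp moving-average variant; grad_oracle, delta, and the step schedule are illustrative placeholders.

```python
import numpy as np

def adaptive_sa(grad_oracle, theta0, n_iters, gamma0=0.1, delta=1e-8,
                mode="adagrad", rho=0.99):
    """Recursion (3) with the diagonal preconditioners of (4):
    Adagrad averages the past squared drifts, RMSProp keeps an
    exponential moving average with parameter rho."""
    theta = np.asarray(theta0, dtype=float)
    second_moment = np.zeros_like(theta)            # diagonal of H_n in (4)
    for n in range(n_iters):
        h = grad_oracle(theta, n + 1)               # H_{theta_n}(X_{n+1})
        if mode == "adagrad":
            second_moment = (n * second_moment + h ** 2) / (n + 1)
        else:                                       # RMSProp
            second_moment = rho * second_moment + (1.0 - rho) * h ** 2
        A_diag = 1.0 / np.sqrt(delta + second_moment)   # diagonal of A_n
        gamma = gamma0 / np.sqrt(n + 1)
        theta = theta - gamma * A_diag * h
    return theta
```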
The second point of H3 is a weaker assumption than bounding the variance of the noise term. Applications where we can verify these assumptions are discussed in Appendix D. We finally consider an additional assumption on $A_n$. Let $\|A\|$ be the spectral norm of a matrix $A$.

H4 There exists $(\beta_n)_{n\geq 1}$ such that for all $n \in \mathbb{N}$, $\|A_n\| := \lambda_{\max}(A_n) \leq \beta_{n+1}$.

In our setting, since $A_n$ is assumed to be a symmetric matrix, the spectral norm is equal to the largest eigenvalue. H4 plays a crucial role, as the estimates may diverge when this assumption is not satisfied. Given a sequence $(\beta_n)_{n\geq 1}$, one way to ensure that H4 is satisfied is to replace the random matrices $A_n$ with
$$\bar{A}_n = \min\{\|A_n\|, \beta_{n+1}\}\,\frac{A_n}{\|A_n\|}\,. \qquad (6)$$
It is then clear that $\|\bar{A}_n\| \leq \beta_{n+1}$. Furthermore, in most cases, especially for Adagrad, RMSProp, and Stochastic Newton, the control of $\lambda_{\max}(A_n)$ in H4 is satisfied. For example, in Adagrad and RMSProp, in (4), we have $\lambda_{\max}(A_n) \leq \delta^{-1/2}$.

4 Convergence Results

4.1 Convergence under the PL condition

In this section, we study the convergence rate of SGD with biased gradients and adaptive steps under the PL condition. We give below a simplified version of the bound we obtain on the risk and refer to Theorem A.2 in the appendix for a formal statement with explicit constants.

Theorem 4.1. Assume that H1-H4 hold. Let $\theta_n \in \mathbb{R}^d$ be the $n$-th iterate of the recursion (3) and $\gamma_n = C_\gamma n^{-\gamma}$, $\beta_n = C_\beta n^{\beta}$, $\lambda_n = C_\lambda n^{-\lambda}$ with $C_\gamma > 0$, $C_\beta > 0$, and $C_\lambda > 0$. Assume that $\gamma, \beta, \lambda \geq 0$ and $\gamma + \lambda < 1$. Then,
$$\mathbb{E}\left[V(\theta_n) - V(\theta^*)\right] = O\!\left(n^{-\gamma+2\beta+\lambda} + r_n\right)\,. \qquad (7)$$

The rate obtained is classical and shows the tradeoff between a term coming from the adaptive steps (with a dependence on $\gamma, \beta, \lambda$) and a term $r_n$ which depends on the control of the bias. To minimize the right-hand side of (7), we would like to have $\beta = \lambda = 0$. For example, this is verified in the case of Adagrad and RMSProp if the gradients are bounded, as will be discussed later. We stress that Theorem 4.1 applies to any adaptive algorithm of the form (3), with the only assumption being H4. Without any information on these eigenvalues, the choice $\beta_n \propto n^{\beta}$ and $\lambda_n \propto n^{-\lambda}$ allows us to remain very general, which can even be seen as a worst-case scenario. Finally, note that non-adaptive SGD is a particular case of Theorem 4.1. Thus, our theorem also gives new results in the non-adaptive case with generic step sizes and biased gradients with decreasing bias.

4.2 Convergence without the PL condition

In the non-convex smooth case, theoretical results are generally based on a randomized version of SA, as described in [60, 32, 44]. Instead of considering the final parameter $\theta_n$, we introduce a random variable $R$, which takes its values in $\{1, \ldots, n\}$, and the quantity of interest becomes $\theta_R$. Note that this procedure is a technical tool; in practical applications we use classical SA. The following theorem provides a bound in expectation on the gradient of the objective function $V$, which is the best we can have given that no assumption is made about the existence of a global minimum of $V$.

Theorem 4.2. Assume that H2-H4 hold. Assume also that for all $k \in \mathbb{N}$, $\gamma_{k+1} \leq \lambda_{k+1}/(\sigma_1 L \beta^2_{k+1})$. For any $n \geq 1$, let $R \in \{0, \ldots, n\}$ be a discrete random variable such that:
$$\mathbb{P}(R = k) := \frac{w_{k+1}\gamma_{k+1}\lambda_{k+1}}{\sum_{j=0}^{n} w_{j+1}\gamma_{j+1}\lambda_{j+1}}\,,$$
where $w_{k+1} = \prod_{j=1}^{k+1}(1 + \sigma_2 \delta_j)^{-1}$ with $\delta_j = L\gamma_j^2\beta_j^2/2$. Then,
$$\mathbb{E}\left[\|\nabla V(\theta_R)\|^2\right] \leq \frac{2\left(\Delta_V + \alpha_{1,n} + \alpha_{2,n}\right)}{\sum_{j=0}^{n} w_{j+1}\gamma_{j+1}\lambda_{j+1}}\,,$$
where $\alpha_{1,n} = \sum_{k=0}^{n} w_{k+1}\gamma_{k+1}\lambda_{k+1}r_{k+1}$, $\alpha_{2,n} = \sum_{k=0}^{n} w_{k+1}\delta_{k+1}\sigma^2_k$, and $\Delta_V = \mathbb{E}[V(\theta_0) - V(\theta^*)]$.
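As a small illustration of Theorem 4.2, the following sketch (not from the paper) computes the sampling distribution of the random iterate R from given sequences (gamma_k), (beta_k), (lambda_k) and assumed constants L and sigma_2; all inputs are illustrative.

```python
import numpy as np

def random_iterate_distribution(gammas, betas, lambdas, L=1.0, sigma2=0.0):
    """Weights of the random iterate R in Theorem 4.2:
    delta_j = L * gamma_j^2 * beta_j^2 / 2,
    w_{k+1} = prod_{j=1}^{k+1} (1 + sigma2 * delta_j)^{-1},
    P(R = k) proportional to w_{k+1} * gamma_{k+1} * lambda_{k+1}."""
    gammas, betas, lambdas = map(np.asarray, (gammas, betas, lambdas))
    deltas = L * gammas ** 2 * betas ** 2 / 2.0
    w = np.cumprod(1.0 / (1.0 + sigma2 * deltas))
    probs = w * gammas * lambdas
    return probs / probs.sum()

# Example with gamma_k = k^{-1/2} and beta_k = lambda_k = 1 (illustrative constants).
k = np.arange(1, 1001)
p = random_iterate_distribution(k ** -0.5, np.ones_like(k, dtype=float),
                                np.ones_like(k, dtype=float))
R = np.random.default_rng(0).choice(len(p), p=p)   # sampled index of theta_R
```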
If $\sigma_2 = 0$, Theorem 4.2 recovers the asymptotic convergence rate obtained by [44] with respect to the hyperparameters $\gamma$, $\beta$, and $\lambda$, and to the bias. We can observe that if $\gamma \geq \lambda + 2\beta$, the condition on $(\gamma_k)_{k\geq 1}$ can be met simply by tuning $C_\gamma$. In particular, if $A_n = I_d$, the requirement on the step sizes can be expressed as $\gamma_{k+1} \leq 1/(\sigma_1 L)$. We give below the convergence rates obtained from Theorem 4.2 under the same assumptions on $\gamma_n$, $\beta_n$, and $\lambda_n$ as in Theorem 4.1.

Corollary 4.3. Assume that H2-H4 hold. Let $\gamma_n = C_\gamma n^{-\gamma}$, $\beta_n = C_\beta n^{\beta}$, $\lambda_n = C_\lambda n^{-\lambda}$ with $C_\gamma > 0$, $C_\beta > 0$, and $C_\lambda > 0$. Assume that $\gamma, \beta, \lambda \geq 0$ and $\gamma + \lambda < 1$. Then, if $\sigma_2 = 0$, we have:
$$\mathbb{E}\left[\|\nabla V(\theta_R)\|^2\right] = \begin{cases} O\!\left(n^{-\gamma+\lambda+2\beta} + b_n\right) & \text{if } \gamma - \beta < 1/2\,,\\ O\!\left(n^{\gamma+\lambda-1} + b_n\right) & \text{if } \gamma - \beta > 1/2\,,\\ O\!\left(n^{\gamma+\lambda-1}\log n + b_n\right) & \text{if } \gamma - \beta = 1/2\,,\end{cases}$$
where the bias term $b_n$ can be constant or decreasing. In the latter case, writing $r_n = C_r n^{-r}$, we have:
$$b_n = \begin{cases} O\!\left(n^{-r}\right) & \text{if } r + \lambda + \gamma < 1\,,\\ O\!\left(n^{\gamma+\lambda-1}\right) & \text{if } r + \lambda + \gamma > 1\,,\\ O\!\left(n^{\gamma+\lambda-1}\log n\right) & \text{if } r + \lambda + \gamma = 1\,.\end{cases}$$

In practice, the value of $r$ is known in advance while the other parameters can be tuned to achieve the optimal rate of convergence. In any scenario, we can never achieve a bound of $O(1/\sqrt{n} + b_n)$, and the best rate we can reach is $O(\log n/\sqrt{n} + b_n)$ when $\gamma = 1/2$, $\beta = 0$, and $\lambda = 0$. In this case, all eigenvalues of $A_n$ must be bounded from both below and above. Note that we could also have obtained such a rate by taking $\lambda_n = n^{-1/2}$ and $\beta_n = n^{-1/2}$ while keeping $\gamma_n$ constant. However, the assumption that $\beta_n = n^{-1/2}$ is too strong (fast decay of the eigenvalues of $A_n$), hence our choice of $\beta_n = C_\beta n^{\beta}$. Finally, for a decreasing bias, if $r \leq 1/2$, the bias term contributes to the convergence rate of the algorithm. Otherwise, the other term is the leading term of the upper bound. In both cases, the best achievable bound is $O(\log n/\sqrt{n})$ as soon as $r \geq 1/2$.

Bounded Gradient Case. We now analyze the convergence of Randomized Adaptive Stochastic Approximation when the stochastic updates are bounded, as given by the following assumption.

H5 There exists $M \geq 0$ such that for all $n \in \mathbb{N}$, $\|H_{\theta_n}(X_{n+1})\| \leq M$.

Boundedness of the stochastic gradient of the objective function is a classical assumption in adaptive stochastic optimization [67, 74, 19, 73].

Corollary 4.4. Assume that H2-H5 hold. Let $\gamma_n = C_\gamma n^{-\gamma}$, $\beta_n = C_\beta n^{\beta}$, $\lambda_n = C_\lambda n^{-\lambda}$ with $C_\gamma > 0$, $C_\beta > 0$, and $C_\lambda > 0$. Assume that $\gamma, \beta, \lambda \geq 0$ and $\gamma + \lambda < 1$. For any $n \geq 1$, let $R \in \{0, \ldots, n\}$ be a uniformly distributed random variable. Then,
$$\mathbb{E}\left[\|\nabla V(\theta_R)\|^2\right] \leq \frac{\Delta_V + \alpha'_{1,n} + L M^2 \alpha'_{2,n}/2}{(n+1)\,\gamma_{n+1}\lambda_{n+1}}\,,$$
where $\alpha'_{1,n} = \sum_{k=0}^{n} \gamma_{k+1}\lambda_{k+1}r_{k+1}$, $\alpha'_{2,n} = \sum_{k=0}^{n} \gamma^2_{k+1}\beta^2_{k+1}$, and $\Delta_V = \mathbb{E}[V(\theta_0) - V(\theta^*)]$.

Importantly, in Corollary 4.4, there are no assumptions on the step sizes, and we obtain a better bound than in Theorem 4.2.

4.3 Application to Adagrad and RMSProp

We give a convergence analysis of Adagrad and RMSProp with a biased gradient estimator. First, note that, under H5, every eigenvalue $\lambda$ of $A_n$, the adaptive matrix in Adagrad or RMSProp, satisfies $(M^2 + \delta)^{-1/2} \leq \lambda \leq \delta^{-1/2}$, i.e., H4 is satisfied with $\lambda = 0$ and $\beta = 0$.

Corollary 4.5. Assume that H2 and H5 hold. Let $\gamma_n = c_\gamma n^{-1/2}$ and let $A_n$ denote the adaptive matrix in Adagrad or RMSProp. For any $n \geq 1$, let $R \in \{0, \ldots, n\}$ be a uniformly distributed random variable. Suppose that for any $n \geq 1$, there exist positive constants $\alpha$ and $C_\alpha$ such that:
$$\left\|\mathbb{E}\left[H_{\theta_n}(X_{n+1}) \mid \mathcal{F}_n\right] - \nabla V(\theta_n)\right\| \leq C_\alpha n^{-\alpha}\,. \qquad (8)$$
Then,
$$\mathbb{E}\left[\|\nabla V(\theta_R)\|^2\right] = O\!\left(\frac{\log n}{\sqrt{n}} + b_n\right)\,,$$
where the bias term $b_n$ is explicitly given in Appendix A.5.

In the case of an unbiased gradient, we obtain the same bound of $O(\log n/\sqrt{n})$ as in [74, 79, 19] under the same assumptions. If the bias is of the order $O(n^{-1/4})$, the algorithm achieves the same convergence rate as in the case of an unbiased gradient.
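As a quick sanity check of the eigenvalue bounds quoted at the beginning of this subsection, the toy snippet below builds the diagonal Adagrad matrix of (4) from drifts bounded by M (assumption H5) and verifies that its eigenvalues lie in [(M^2 + delta)^{-1/2}, delta^{-1/2}]; the constants are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, delta, d, n = 2.0, 1e-2, 5, 200

# Drifts with norm at most M (assumption H5): clipped Gaussian vectors.
drifts = rng.standard_normal((n, d))
drifts *= np.minimum(1.0, M / np.linalg.norm(drifts, axis=1, keepdims=True))

# Diagonal Adagrad matrix from (4): [delta I + Diag(mean of squared drifts)]^{-1/2}.
H_diag = np.mean(drifts ** 2, axis=0)
A_eigs = 1.0 / np.sqrt(delta + H_diag)

# Eigenvalue bounds used in Section 4.3.
assert np.all(A_eigs <= delta ** -0.5 + 1e-12)
assert np.all(A_eigs >= (M ** 2 + delta) ** -0.5 - 1e-12)
print(A_eigs.min(), A_eigs.max())
```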
4.4 AMSGRAD with Biased Gradients

Finally, we show the convergence of AMSGRAD [67] with a biased gradient estimator. At each iteration, AMSGRAD uses an exponential moving average of past gradients instead of the current gradient in Equation (3), as detailed in Algorithm 1.

Algorithm 1: AMSGRAD with Biased Gradients
Input: initial point $\theta_0$, maximum number of iterations $n$, step sizes $\{\gamma_k\}_{k\geq 1}$, momentum parameters $\rho_1, \rho_2 \in [0, 1)$, and regularization parameter $\delta \geq 0$. Set $m_0 = 0$, $V_0 = 0$, and $\hat{V}_0 = 0$.
for $k = 0$ to $n - 1$ do
    Compute the stochastic update $H_{\theta_k}(X_{k+1})$
    $m_k = \rho_1 m_{k-1} + (1 - \rho_1) H_{\theta_k}(X_{k+1})$
    $V_k = \rho_2 V_{k-1} + (1 - \rho_2) H_{\theta_k}(X_{k+1}) H_{\theta_k}(X_{k+1})^\top$
    $\hat{V}_k = \max\left(\hat{V}_{k-1}, \mathrm{Diag}(V_k)\right)$
    $A_k = \left(\delta I_d + \hat{V}_k\right)^{-1/2}$
    $\theta_{k+1} = \theta_k - \gamma_{k+1} A_k m_k$
end for
Output: $(\theta_k)_{0 \leq k \leq n}$

The key difference between Adam and AMSGRAD lies in their handling of the second moment estimate. Specifically, AMSGRAD uses the updated term $\hat{V}_k = \max(\hat{V}_{k-1}, \mathrm{Diag}(V_k))$ instead of directly using $V_k$, with the maximum taken coordinate-wise. This approach is crucial, as it ensures that the eigenvalues of $A_n$ decrease at each iteration. The following theorem provides a bound in expectation on the gradient of the objective function $V$ using randomized iterations with AMSGRAD.

Theorem 4.6. Assume that H2, H3(i), and H5 hold. Let $\gamma_n = c_\gamma n^{-1/2}$, let $A_n$ denote the adaptive matrix of AMSGRAD in Algorithm 1, and let $\rho_1, \rho_2 \in [0, 1)$. For any $n \geq 1$, let $R \in \{0, \ldots, n\}$ be a uniformly distributed random variable. Then,
$$\mathbb{E}\left[\|\nabla V(\theta_R)\|^2\right] = O\!\left(\frac{\log n}{\sqrt{n}} + b_n\right)\,,$$
where $b_n$ corresponds to the bias which comes from $r_n$ in H3(i). Choosing $r_n = C_r n^{-r}$, we get:
$$b_n = \begin{cases} O\!\left(n^{-r}\right) & \text{if } r < 1/2\,,\\ O\!\left(n^{-1/2}\right) & \text{if } r > 1/2\,,\\ O\!\left(n^{-1/2}\log n\right) & \text{if } r = 1/2\,.\end{cases}$$

If the bias is of the order $O(n^{-1/4})$, we achieve a convergence rate of $O(\log n/\sqrt{n})$, which is the same as that of an unbiased gradient [19] and similar to that of Adagrad and RMSProp. It is worth noting that our results are also applicable to SGD with momentum by taking $A_n = I_d$ in Algorithm 1.
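For reference, here is a minimal NumPy sketch of Algorithm 1 in its diagonal form, assuming a generic (possibly biased) drift oracle grad_oracle; the step schedule γ_k = γ_0/√k matches Theorem 4.6, while the default momentum and regularization constants are illustrative.

```python
import numpy as np

def amsgrad_biased(grad_oracle, theta0, n_iters, gamma0=0.01,
                   rho1=0.9, rho2=0.999, delta=1e-8):
    """AMSGRAD with (possibly biased) stochastic updates, following
    Algorithm 1 with diagonal second-moment matrices."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)        # m_k: EMA of past drifts
    v = np.zeros_like(theta)        # diag(V_k): EMA of squared drifts
    v_hat = np.zeros_like(theta)    # hat V_k = max(hat V_{k-1}, diag(V_k))
    for k in range(n_iters):
        h = grad_oracle(theta, k + 1)            # H_{theta_k}(X_{k+1})
        m = rho1 * m + (1.0 - rho1) * h
        v = rho2 * v + (1.0 - rho2) * h ** 2
        v_hat = np.maximum(v_hat, v)             # coordinate-wise maximum
        A_diag = 1.0 / np.sqrt(delta + v_hat)    # diagonal of A_k
        theta = theta - (gamma0 / np.sqrt(k + 1)) * A_diag * m
    return theta
```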
4.5 Convergence Results in the i.i.d. and Markov Chain Cases

For illustrative purposes, in this subsection we give the form of the bias of the gradient estimator, denoted by $b_n$, in two simple scenarios, i.e., when $\{X_n,\, n \in \mathbb{N}\}$ is either an i.i.d. sequence or a Markov chain. For Adagrad, RMSProp, and AMSGRAD, bounding the bias of the gradient estimator is a sufficient condition for verifying H3(i), which in turn enables us to derive convergence results in each scenario.

I.i.d. case. Assume that $\{X_n,\, n \in \mathbb{N}\}$ are i.i.d. random variables. If the mean field function $h(\theta_n) = \mathbb{E}[H_{\theta_n}(X_{n+1}) \mid \mathcal{F}_n]$ aligns with the true gradient, then the estimator is unbiased. Otherwise, the bias of the gradient estimator is $b_{n+1} = \|h(\theta_n) - \nabla V(\theta_n)\|$.

Markov chain case. Assume now that $\{X_n,\, n \in \mathbb{N}\}$ is a Markov chain. The bias consists of two parts: the difference between the mean field function and the true gradient, and a term due to the Markov chain dynamics. For all $T \geq 0$, we define the stochastic update as follows:
$$H_{\theta_k}(X_{k+1}) = \frac{1}{T}\sum_{i=1}^{T} H_{\theta_k}\!\left(X^{(i)}_{k+1}\right)\,,$$
where $X^{(i)}_{k+1}$ represents the $i$-th sample generated at iteration $k+1$. This multi-sample estimator is commonly used in applications such as Reinforcement Learning, Markov Chain Monte Carlo, and Sequential Monte Carlo methods, effectively reducing the variance of the gradient estimator. The mixing time $\tau_{\mathrm{mix}}$ of a Markov chain with stationary distribution $\pi$ and transition kernel $P$ is characterized as:
$$\tau_{\mathrm{mix}} := \inf\left\{t \geq 0 \,;\; \sup_x D_{\mathrm{TV}}\!\left(P^t(x, \cdot), \pi\right) \leq 1/4\right\}\,,$$
where $D_{\mathrm{TV}}$ denotes the total variation distance. For an ergodic Markov chain with stationary distribution $\pi$, the bias of this gradient estimator when using $T$ samples per step is
$$b_{n+1} = \left\|h(\theta_n) - \nabla V(\theta_n)\right\| + M\sqrt{\tau_{\mathrm{mix}}/T}\,,$$
where $h(\theta) = \int H_\theta(x)\,\pi(\mathrm{d}x)$. If the general optimization problem reduces to the following stochastic optimization problem with Markov noise, as considered in most of the literature [28, 24, 7]:
$$\min_{\theta \in \mathbb{R}^d} V(\theta) := \mathbb{E}_{x \sim \pi}\left[f(\theta; x)\right]\,,$$
where $\theta \mapsto f(\theta; x)$ is a loss function, $\pi$ is the stationary data distribution of the Markov chain, and $H_{\theta_k}(X^{(i)}_{k+1}) = \nabla_\theta f(\theta_k; X^{(i)}_{k+1})$, then $b_{n+1} = M\sqrt{\tau_{\mathrm{mix}}/T}$, similar to SGD with Markov noise [24].

5 Applications and Experiments

5.1 Bilevel and Conditional Stochastic Optimization

We can now apply our theoretical results in various settings where biased gradients are involved. In particular, they apply to the fields of Stochastic Bilevel Optimization and Conditional Stochastic Optimization. Stochastic Bilevel Optimization consists of minimizing an objective function $V$ with respect to $\theta$, where $V$ is itself a function of $\phi^*(\theta)$ and $\phi^*(\theta)$ is obtained by solving another minimization problem. Conditional Stochastic Optimization focuses on optimizing the expected value of a function that contains a nested conditional expectation on a random variable $\eta$. We provide in Table 1 a summary of the assumptions satisfied in these settings, which allows us to apply the results of Section 4 and to obtain a $O(\log n/\sqrt{n} + b_n)$ convergence rate in both cases, as well as explicit forms for $b_n$. To our knowledge, these are the first convergence rates obtained in these settings. We refer to Appendix D for other examples in which the bias of the estimator can be controlled, in particular Self-Normalized Importance Sampling (Appendix D.1), Sequential Monte Carlo Methods (Appendix D.2), Policy Gradient (Appendix D.3), Zeroth-Order Gradient (Appendix D.4), and Coordinate Sampling (Appendix D.5).

Table 1: Bilevel and Conditional Stochastic Optimization with our Biased Adaptive SA framework.

| | Stochastic Bilevel Optimization | Conditional Stochastic Optimization |
|---|---|---|
| Problem | $\min_{\theta \in \mathbb{R}^d} \mathbb{E}_\xi\left[f(\theta, \phi^*(\theta); \xi)\right]$ s.t. $\phi^*(\theta) \in \operatorname*{argmin}_{\phi \in \mathbb{R}^q} \mathbb{E}_\zeta\left[g(\theta, \phi; \zeta)\right]$ | $\min_{\theta \in \mathbb{R}^d} \mathbb{E}_\xi\left[f_\xi\!\left(\mathbb{E}_{\eta \mid \xi}\left[g_\eta(\theta, \xi)\right]\right)\right]$ |
| Lipschitz constant (H2) | Lemma C.2 | Lemma C.6 |
| Bias control (H3) | Lemma C.3 | Lemma C.5 |
| Gradient bound (H5) | Lemma C.3 | Lemma C.6 |
| Convergence | Theorem C.4 | Theorem C.7 |

5.2 Experiments with IWAE and BR-IWAE

In this section, we illustrate our theoretical results in the context of deep VAEs. The experiments were conducted using PyTorch [65], and the source code is available at https://github.com/SobihanSurendran/Adaptive-SA. In generative models, the objective is to maximize the marginal likelihood $\log p_\theta(x)$, which is the marginalization of $(x, z) \mapsto p_\theta(x, z)$, where $x$ represents the observations and $z$ is the latent variable. Under some simple technical assumptions, by Fisher's identity, we have:
$$\nabla_\theta \log p_\theta(x) = \int \nabla_\theta \log p_\theta(x, z)\, p_\theta(z \mid x)\,\mathrm{d}z\,. \qquad (9)$$
However, in most cases, the conditional density $z \mapsto p_\theta(z \mid x)$ is intractable and can only be sampled. Variational Autoencoders introduce an additional parameter $\phi$ and a family of variational distributions $z \mapsto q_\phi(z \mid x)$ to approximate the true posterior distribution. Parameters are estimated by maximizing the Evidence Lower Bound (ELBO):
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(\cdot \mid x)}\left[\log \frac{p_\theta(x, Z)}{q_\phi(Z \mid x)}\right] =: \mathcal{L}_{\mathrm{ELBO}}(\theta, \phi; x)\,.$$
The Importance Weighted Autoencoder (IWAE) [12] is a variant of the VAE that incorporates importance weighting to obtain a tighter ELBO.
The IWAE objective can be written as follows:
$$\mathcal{L}^{\mathrm{IWAE}}_k(\theta, \phi; x) = \mathbb{E}_{q_\phi^{\otimes k}(\cdot \mid x)}\left[\log \frac{1}{k}\sum_{\ell=1}^{k} \frac{p_\theta(x, Z^{(\ell)})}{q_\phi(Z^{(\ell)} \mid x)}\right]\,,$$
where $k$ corresponds to the number of samples drawn from the variational posterior distribution. The estimator of the gradient of the ELBO in IWAE is a biased estimator of $\nabla_\theta \log p_\theta(x)$. In Theorem B.1, we establish that the bias of this estimator is of order $O(1/k)$, thereby allowing us to derive a convergence rate for IWAE. Since bias has an impact on convergence rates, we propose to use one of the bias reduction techniques, the Bias-Reduced Importance Weighted Autoencoder (BR-IWAE) [14], which is detailed in Appendix B.

Dataset and Model. We conduct our experiments on the CIFAR-10 dataset [51] and use a Convolutional Neural Network (CNN) architecture with the Rectified Linear Unit (ReLU) activation function for both the encoder and the decoder. The latent space dimension is set to 100. We estimate the log-likelihood using VAE, IWAE, and BR-IWAE models, all of which are trained for 100 epochs.

Figure 1: Negative Log-Likelihood on the test set for different generative models (VAE, IWAE, and BR-IWAE) with Adagrad, RMSProp, and Adam on CIFAR-10 (x-axis: epochs; y-axis: test loss, ×10³). Bold lines represent the mean over 5 independent runs.

Training is conducted using Adagrad, RMSProp, and Adam with a decaying learning rate. Although AMSGRAD is analyzed in our theoretical results, we use Adam for the experiments due to its widespread use in practice. Additional details are provided in Appendix E. First, we set $k = 5$ samples in both IWAE and BR-IWAE. The test losses are presented in Figure 1. We show the negative log-likelihood on the test dataset for VAE, IWAE, and BR-IWAE with Adagrad, RMSProp, and Adam. As expected, we observe that IWAE outperforms VAE, while BR-IWAE outperforms IWAE by reducing the bias in all cases.

Then, we illustrate empirically the convergence rates obtained in Corollary 4.5 and Theorem 4.6 for IWAE. Since the bias of the gradient estimator in IWAE is of the order $O(1/k)$, choosing a bias of order $O(n^{-\alpha})$ is equivalent to using $n^{\alpha}$ samples at iteration $n$ to estimate the gradient. We plot in Figure 2 the gradient squared norm $\|\nabla V(\theta_n)\|^2$; the Negative Log-Likelihood is given in Appendix E.2. Note that all figures are with respect to epochs, whereas here $n$ represents the number of gradient updates. The dashed curves correspond to the expected convergence rate $O(n^{-1/4})$ for $\alpha = 1/8$ and $O(\log n/\sqrt{n})$ for $\alpha = 1/4$ and $\alpha = 1/2$.

Figure 2: Value of $\|\nabla V(\theta_n)\|^2$ in IWAE with Adagrad (on the left) and with RMSProp and Adam (on the right), for $\alpha \in \{1/8, 1/4, 1/2\}$, over 100 epochs. Bold lines represent the mean over 5 independent runs. Figures are plotted on a logarithmic scale for better visualization. Both figures have the same scale, so we have not shown the dashed theoretical curves on the right for better clarity.

We observe that the algorithms converge at the expected theoretical rates, and even faster. In Appendix E.2, we have included an additional experiment on the Fashion-MNIST dataset [76], which shows similar behavior, but with convergence closer to the expected rates, suggesting that our upper bounds may be tight. We see similar convergence rates for Adagrad, RMSProp, and Adam, although, as expected, Adam performs slightly better.
Moreover, it is clear that convergence is faster with a larger α but beyond a certain threshold for α the rate of convergence does not change significantly. Since choosing a larger α induces an additional computational cost, it is crucial to choose an appropriate value that achieves fast convergence without being too computationally expensive. Choosing an optimal number of samples at each iteration remains an open problem depending on the chosen generative model. 6 Discussion This paper provides a non-asymptotic analysis of Biased Adaptive Stochastic Approximation with and without the PL condition in the non-convex smooth setting. We derive a convergence rate of O(log n/ n + bn) for non-convex smooth functions, where bn corresponds to the time-dependent decreasing bias, and an improved linear convergence rate with the Polyak-Łojasiewicz (PL) condition. We also establish that Adagrad, RMSProp, and AMSGRAD with biased gradients converge to critical points for non-convex smooth functions. Our results provide insights on hyper-parameters tuning to achieve fast convergence and reduce computational time. A natural extension of this work is the analysis of the assumptions, the bias and convergence rates for specific deep learning architectures. A theoretical analysis of the Monte Carlo effort required at each iteration to obtain an optimal convergence rate is another interesting perspective. Acknowledgements The Ph.D. of Sobihan Surendran was funded by the Paris Region Ph D Fellowship Program of Région Ile-de-France. We would like to thank SCAI (Sorbonne Center for Artificial Intelligence) for providing the computing clusters. We also express our gratitude to the reviewers for their insightful comments and suggestions, which have helped improve this paper. [1] Sergios Agapiou, Omiros Papaspiliopoulos, Daniel Sanz-Alonso, and Andrew M Stuart. Importance sampling: Intrinsic dimension and computational cost. Statistical Science, pages 405 431, 2017. [2] Ahmad Ajalloeian and Sebastian U Stich. On the Convergence of SGD with Biased Gradients. ar Xiv preprint ar Xiv:2008.00051, 2020. [3] Ahmet Alacaoglu and Hanbaek Lyu. Convergence of first-order methods for constrained nonconvex optimization with dependent data. In International Conference on Machine Learning, pages 458 489. PMLR, 2023. [4] Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48 75, 2002. [5] Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319 350, 2001. [6] Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression for distributed learning. Journal of Machine Learning Research, 24(276):1 50, 2023. [7] Aleksandr Beznosikov, Sergey Samsonov, Marina Sheshukova, Alexander Gasnikov, Alexey Naumov, and Eric Moulines. First order methods with markovian noise: from acceleration to variational inequalities. In Advances in Neural Information Processing Systems, volume 36, 2024. [8] Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. In Conference On Learning Theory, pages 1691 1692. PMLR, 2018. [9] Léon Bottou. Une approche théorique de l apprentissage connexioniste; applications à la reconnaissance de la parole. Ph D thesis, Paris 11, 1991. [10] Léon Bottou, Frank E Curtis, and Jorge Nocedal. 
Optimization methods for large-scale machine learning. SIAM Review, 60(2):223 311, 2018. [11] Claire Boyer and Antoine Godichon-Baggioni. On the asymptotic rate of convergence of stochastic newton algorithms and their weighted averaged versions. Computational Optimization and Applications, 84(3):921 972, 2023. [12] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016. [13] Gabriel Cardoso, Yazid Janati El Idrissi, Sylvain Le Corff, Eric Moulines, and Jimmy Olsson. State and parameter learning with Pa RIS particle Gibbs. In International Conference on Machine Learning, pages 3625 3675. PMLR, 2023. [14] Gabriel Cardoso, Sergey Samsonov, Achille Thin, Eric Moulines, and Jimmy Olsson. BRSNIS: bias reduced self-normalized importance sampling. In Advances in Neural Information Processing Systems, volume 35, pages 716 729, 2022. [15] Congliang Chen, Li Shen, Fangyu Zou, and Wei Liu. Towards practical adam: Non-convexity, convergence theory, and mini-batch acceleration. Journal of Machine Learning Research, 23(229):1 47, 2022. [16] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15 26, 2017. [17] Tianyi Chen, Yuejiao Sun, and Wotao Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. In Advances in Neural Information Processing Systems, volume 34, pages 25294 25307, 2021. [18] Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for td (0) with function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6144 6160, 2018. [19] Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of Adam and Adagrad. ar Xiv preprint ar Xiv:2003.02395, 2020. [20] Pierre Del Moral, Arnaud Doucet, and Sumeetpal S Singh. A backward particle interpretation of Feynman-Kac formulae. ESAIM: Mathematical Modelling and Numerical Analysis, 44(5):947 975, 2010. [21] Yury Demidovich, Grigory Malinovsky, Igor Sokolov, and Peter Richtárik. A guide through the zoo of biased sgd. In Advances in Neural Information Processing Systems, volume 36, 2024. [22] Aymeric Dieuleveut, Gersende Fort, Eric Moulines, and Hoi-To Wai. Stochastic approximation beyond gradient for signal processing and machine learning. IEEE Transactions on Signal Processing, 2023. [23] Thinh T Doan, Lam M Nguyen, Nhan H Pham, and Justin Romberg. Finite-time analysis of stochastic gradient descent under markov randomness. ar Xiv preprint ar Xiv:2003.10973, 2020. [24] Ron Dorfman and Kfir Yehuda Levy. Adapting to mixing time in stochastic optimization with markovian data. In International Conference on Machine Learning, pages 5429 5446. PMLR, 2022. [25] Timothy Dozat. Incorporating Nesterov momentum into Adam. In ICLR Workshop track, 2016. [26] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675 1685. PMLR, 2019. [27] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011. [28] John C Duchi, Alekh Agarwal, Mikael Johansson, and Michael I Jordan. 
Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549 1578, 2012. [29] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467 1476. PMLR, 2018. [30] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126 1135. PMLR, 2017. [31] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1568 1577. PMLR, 2018. [32] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341 2368, 2013. [33] Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. ar Xiv preprint ar Xiv:1802.02246, 2018. [34] Pierre Gloaguen, Sylvain Le Corff, and Jimmy Olsson. A pseudo-marginal sequential Monte Carlo online smoothing algorithm. Bernoulli, 28(4):2606 2633, 2022. [35] Antoine Godichon-Baggioni and Pierre Tarrago. Non asymptotic analysis of adaptive stochastic gradient algorithms and applications. ar Xiv preprint ar Xiv:2303.01370, 2023. [36] Riccardo Grazzi, Massimiliano Pontil, and Saverio Salzo. Bilevel optimization with a lowerlevel contraction: Optimal sample complexity without warm-start. Journal of Machine Learning Research, 24(167):1 37, 2023. [37] Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actorcritic. SIAM Journal on Optimization, 33(1):147 180, 2023. [38] Bin Hu, Peter Seiler, and Laurent Lessard. Analysis of biased stochastic gradient descent using sequential semidefinite programs. Mathematical programming, 187:383 408, 2021. [39] Yifan Hu, Xin Chen, and Niao He. On the bias-variance-cost tradeoff of stochastic optimization. In Advances in Neural Information Processing Systems, volume 34, pages 22119 22131, 2021. [40] Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. In Advances in Neural Information Processing Systems, volume 33, pages 2759 2770, 2020. [41] Feihu Huang, Junyi Li, and Shangqian Gao. Biadam: Fast adaptive bilevel optimization methods. ar Xiv preprint ar Xiv:2106.11396, 2021. [42] Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, volume 6, 1993. [43] Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, pages 4882 4892. PMLR, 2021. [44] Belhal Karimi, Blazej Miasojedow, Eric Moulines, and Hoi-To Wai. Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory, pages 1944 1974. PMLR, 2019. [45] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximalgradient methods under the polyak-łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pages 795 811. Springer, 2016. 
[46] Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world. ar Xiv preprint ar Xiv:2002.03329, 2020. [47] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. [48] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014. [49] Vijaymohan R Konda and Vivek S Borkar. Actor-critic type learning algorithms for markov decision processes. SIAM Journal on control and Optimization, 38(1):94 123, 1999. [50] Milena Kresoja, Zorana Lužanin, and Irena Stojkovska. Adaptive stochastic approximation algorithm. Numerical Algorithms, 76(4):917 937, 2017. [51] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical Report, 2009. [52] Chandrashekar Lakshminarayanan and Csaba Szepesvari. Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics, pages 1347 1355. PMLR, 2018. [53] Rémi Leluc and François Portier. Sgd with coordinate sampling: Theory and practice. The Journal of Machine Learning Research, 23(1):15470 15516, 2022. [54] Qiang Li and Hoi-To Wai. State dependent performative prediction with stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pages 3164 3186. PMLR, 2022. [55] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019. [56] Yin Liu and Sam Davanloo Tajbakhsh. Adaptive stochastic optimization algorithms for problems with biased oracles. ar Xiv preprint ar Xiv:2306.07810, 2023. [57] Don Mc Leish. A general method for debiasing a Monte Carlo estimator. Monte Carlo methods and applications, 17(4):301 315, 2011. [58] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1765 1773, 2017. [59] Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, volume 24, 2011. [60] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574 1609, 2009. [61] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527 566, 2017. [62] Sebastian Nowozin. Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference. In International Conference on Learning Representations, 2018. [63] Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, and Hoyt Koepke. Coordinate descent converges faster with the gauss-southwell rule than random selection. In International Conference on Machine Learning, pages 1632 1641. PMLR, 2015. [64] Jimmy Olsson and Johan Westerborn. Efficient particle-based online smoothing in general hidden Markov models: the Pa RIS algorithm. Bernoulli, 23(3):1951 1996, 2017. [65] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in Py Torch. In NIPS Autodiff Workshop, 2017. 
[66] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838 855, 1992. [67] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018. [68] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400 407, 1951. [69] Tao Sun, Yuejiao Sun, and Wotao Yin. On Markov chain gradient descent. In Advances in Neural Information Processing Systems, volume 31, 2018. [70] Vladislav B Tadi c and Arnaud Doucet. Asymptotic bias of stochastic gradient search. In 2011 50th IEEE Conference on Decision and Control and European Control Conference, pages 722 727. IEEE, 2011. [71] Yee Teh, David Newman, and Max Welling. A collapsed variational bayesian inference algorithm for latent dirichlet allocation. In Advances in Neural Information Processing Systems, volume 19, 2006. [72] Tijmen Tieleman, Geoffrey Hinton, et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26 31, 2012. [73] Qianqian Tong, Guannan Liang, and Jinbo Bi. Calibrating the adaptive learning rate to improve convergence of adam. Neurocomputing, 481:333 356, 2022. [74] Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes. The Journal of Machine Learning Research, 21(1):9047 9076, 2020. [75] Xidong Wu, Jianhui Sun, Zhengmian Hu, Junyi Li, Aidong Zhang, and Heng Huang. Federated conditional stochastic optimization. In Advances in Neural Information Processing Systems, volume 36, 2024. [76] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. [77] Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive methods for nonconvex optimization. In Advances in neural information processing systems, volume 31, 2018. [78] Matthew D Zeiler. Adadelta: an adaptive learning rate method. ar Xiv preprint ar Xiv:1212.5701, 2012. [79] Fangyu Zou, Li Shen, Zequn Jie, Ju Sun, and Wei Liu. Weighted adagrad with unified momentum. ar Xiv preprint ar Xiv:1808.03408, 2018. Supplementary Material for Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation Table of Contents A Convergence Proofs 17 A.1 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.2 Proof of Theorem 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 A.3 Proof of Corollary 4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 A.4 Proof of Corollary 4.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 A.5 Proof of Corollary 4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 A.6 Proof of Theorem 4.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 A.7 The Impact of regularization parameter δ in Adam . . . . . . . . . . . . . . . . 25 B IWAE / BR-IWAE 26 B.1 Importance Weighted Autoencoder (IWAE) . . . . . . . . . . . . . . . . . . . . 26 B.2 BR-IWAE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 B.3 Some Other Techniques for Reducing Bias . . . . . . . . . . . . . . . . . . . . 29 C Application of Our Theorem to Bilevel and Conditional Stochastic Optimization 30 C.1 Stochastic Bilevel Optimization . . . . . . . . . . . . . . . 
. . . . . . . . . . . 30 C.2 Conditional Stochastic Optimization . . . . . . . . . . . . . . . . . . . . . . . 32 D Some Other Examples of Biased Gradients with Control on Bias 34 D.1 Self-Normalized Importance Sampling . . . . . . . . . . . . . . . . . . . . . . 34 D.2 Sequential Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . 34 D.3 Policy Gradient for Average Reward over Infinite Horizon . . . . . . . . . . . . 35 D.4 Zeroth-Order Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 D.5 Compressed Stochastic Approximation: Coordinate Sampling . . . . . . . . . . 36 E Experiment details and supplementary results 37 E.1 Experiment with a Synthetic Time-Dependent Bias . . . . . . . . . . . . . . . . 37 E.2 Additional Experiments of IWAE . . . . . . . . . . . . . . . . . . . . . . . . . 37 A Convergence Proofs A.1 Proof of Theorem 4.1 We first establish a technical lemma which is essential for the proof. Lemma A.1. Let (δn)n 0 , (γn)n 1 , (ηn)n 1, and (vn)n 1 be some positive sequences satisfying the following assumptions. The sequence δn follows the recursive relation: δn (1 2ωγn + ηnγn) δn 1 + vnγn , with δ0 0 and ω > 0. Let n0 = inf {n 1 : ηn ω}, then for all n n0 + 1, we assume that ωγn 1. Then, for all n N, δ0 + 2 max 1 k n vk ηk ω max n/2 k n vk. The proof is given in [35, Proposition 6.1] Theorem A.2. Assume that H1 - H4 hold. Let θn Rd be the n-th iterate of the recursion (3). Then, E [V (θn) V (θ )] E [V (θ0) V (θ )] + 2 σ max 1 k n λk+1vk β2 k+1γk+1 k=n/2 λk+1γk+1 k=1 Ckβ2 k+1γ2 k+1 µ max n/2 k n vk , 2 + σ1L2, Ck = max σ, µ2λ2 k+1 4β2 k+1 and vk = rk+1 + Lσ2 k 2 β2 k+1 λk+1 γk+1 . with the convention Ck = 1 if σ1 = σ2 = 0. Proof. Since V is L-smooth (Assumption H2) and using the recursion (3) of Adaptive SA, we obtain: V (θn+1) V (θn) + V (θn) , θn+1 θn + L 2 θn+1 θn 2 V (θn) γn+1 V (θn) , An Hθn (Xn+1) + Lγ2 n+1 2 An 2 Hθn (Xn+1) 2 . Writing Vn = V (θn) V (θ ), we get Vn+1 Vn γn+1 V (θn) , An Hθn (Xn+1) + L 2 γ2 n+1β2 n+1 Hθn (Xn+1) 2 . Then, using H3, E [Vn+1] E [Vn] γn+1E V (θn) , An Hθn (Xn+1) + L 2 γ2 n+1β2 n+1σ2 n 2 γ2 n+1β2 n+1 σ1E[ V (θn) 2] + σ2E[Vn] 2 β2 n+1γ2 n+1 2 γn+1β2 n+1 E[ V (θn) 2] + γn+1λn+1rn+1 + Lσ2 n 2 γ2 n+1β2 n+1 . Given the smoothness condition (5), with θ = θn and θ = θn 1 L V (θn), we derive: V (θ ) V (θn) 1 L V (θn) 2 + 1 2L V (θn) 2 , V (θn) 2 2LVn . Using the above inequality and the Polyak-Łojasiewicz condition (H1), we obtain: E [Vn+1] 1 µλn+1γn+1 + σ2L 2 + σ1L2 β2 n+1γ2 n+1 E [Vn] + γn+1λn+1rn+1 + Lσ2 n 2 γ2 n+1β2 n+1 . By choosing γn+1 = λn+1γn+1, we get: E [Vn+1] 1 µ γn+1 + σ2L 2 + σ1L2 β2 n+1 λ2 n+1 γ2 n+1 E [Vn] + γn+1rn+1 + Lσ2 n 2 β2 n+1 λn+1 γn+1γn+1 . In order to satisfy the assumptions of Lemma A.1, consider Cn = max σ, µ2λ2 n+1 4β2 n+1 with σ = σ2L Then, since Cn σ, we have: E [Vn+1] 1 µ γn+1 + Cnβ2 n+1 λ2 n+1 γ2 n+1 E [Vn] + γn+1rn+1 + Lσ2 n 2 β2 n+1 λn+1 γn+1γn+1 . Now, using lemma A.1 by choosing: δn = E [Vn] , ηn = Cnβ2 n+1 λ2 n+1 γn+1 , ω = µ 2 , vn = rn+1 + Lσ2 n 2 β2 n+1 λn+1 γn+1 , E [V (θn) V (θ )] E [V (θ0) V (θ )] + 2 σ max 1 k n vkλ2 k+1 β2 k+1 γk+1 k=1 Ckβ2 k+1 γ2 k+1/λ2 k+1 + 2 µ max n/2 k n {vk} , which concludes the proof by taking γn+1 = λn+1γn+1. A.2 Proof of Theorem 4.2 By H2, using (5), we obtain: V (θk+1) V (θk) + V (θk) , θk+1 θk + L 2 θk+1 θk 2 , which, using the recursion (3) of Adaptive SA and H4, yields: V (θk+1) V (θk) γk+1 V (θk) , Ak Hθk (Xk+1) + δk+1 Hθk (Xk+1) 2 , with δk+1 = Lγ2 k+1β2 k+1/2. 
Using Assumption H3, we have: E[V (θk+1)] E[V (θk)] γk+1λk+1E V (θk) 2 + γk+1λk+1rk+1 + δk+1σ2 k + δk+1 σ1E[ V (θk) 2] + σ2E [V (θk) V (θ )] . γk+1 λk+1 L σ1 2 γk+1β2 k+1 E V (θk) 2 (1 + σ2δk+1) E [V (θk)] V (θ ) E [V (θk+1)] V (θ ) + γk+1λk+1rk+1 + δk+1σ2 k . Let us now consider the sequence of weights wk defined by w0 = 1 and wk = Qk j=1 (1 + σ2δj) 1. Then, wk+1γk+1 λk+1 L σ1 2 γk+1β2 k+1 E V (θk) 2 wk E[V (θk)] V (θ ) wk+1 E [V (θk+1)] V (θ ) + wk+1γk+1λk+1rk+1 + wk+1δk+1σ2 k. In the sequel, let us denote Vn = V (θn) V (θ ), so that k=0 wk+1γk+1λk+1 2λk+1 γk+1β2 k+1 E h V (θk) 2i w0E [V0] wn+1E [Vn+1] + 1 k=0 wk+1γk+1λk+1rk+1 + k=0 wk+1δk+1σ2 k. Then, given that γk+1 λk+1/(L σ1β2 k+1), we have k=0 wk+1γk+1λk+1 V (θk) 2 # w0E [V0] wn+1E [Vn+1] k=0 wk+1γk+1λk+1rk+1 + k=0 wk+1δk+1σ2 k . Consequently, by definition of the discrete random variable R, E h V (θR) 2i = wk+1γk+1λk+1 Pn j=0 wj+1γj+1λj+1 E h V (θk) 2i 2E [V0] wn+1E [Vn+1] + Pn k=0 wk+1γk+1rk+1 + Pn k=0 wk+1δk+1σ2 k Pn j=0 wj+1γj+1λj+1 , which concludes the proof by noting that V (θn+1) V (θ ). A.3 Proof of Corollary 4.3 The proof is a direct consequence of the fact that for a sufficiently large n: O n s+1 if 0 s < 1 , O (1) if s > 1 , O (log n) if s = 1 . A.4 Proof of Corollary 4.4 By H2, using (5), we obtain: V (θk+1) V (θk) + V (θk) , θk+1 θk + L 2 θk+1 θk 2 V (θk) γk+1 V (θk) , Ak Hθk (Xk+1) + Lγ2 k+1 2 Ak 2 Hθk (Xk+1) 2 , which, using H4 and H5 yields: V (θk+1) V (θk) γk+1 V (θk) , Ak Hθk (Xk+1) + L 2 γ2 k+1β2 k+1M 2. E[V (θk+1)|Fk] V (θk) γk+1λn+1 V (θk) 2 + γk+1λk+1rk+1 + LM 2 2 γ2 k+1β2 k+1 . γk+1λk+1 V (θk) 2 V (θk) E [V (θk+1) |Fk] + γk+1λk+1rk+1 + LM 2 2 γ2 k+1β2 k+1 , k=0 γk+1λk+1E h V (θk) 2i E [V (θ0) V (θn+1)] + k=0 γk+1λk+1rk+1 k=0 γ2 k+1β2 k+1 . Consequently, by definition of the discrete random variable R, E h V (θR) 2i = 1 k=0 E h V (θk) 2i γk+1λk+1 n E h V (θk) 2i V0,n + Pn k=0 γk+1λk+1rk+1 + LM 2 Pn k=0 γ2 k+1β2 k+1/2 n , where V0,n = E[V (θ0) V (θn+1)], which conclude the proof by noting that V (θn+1) V (θ ). A.5 Proof of Corollary 4.5 Here, we consider the case where the regularization is non-increasing, i.e., where δ = β 2 n+1. The constant case is strictly analogous. To verify H4, we demonstrate that the control of the maximum and minimum eigenvalues is satisfied for Adagrad and RMSProp. Lower bound for the smallest eigenvalue of An. By assumption H5, we have: 1 n + 1 k=0 Hθk(Xk+1)Hθk(Xk+1) M 2 . This implies that: λmin(An) = λmax β 2 n+1Id + Diag k=0 Hθk(Xk+1)Hθk(Xk+1) !! 1/2 (β 2 1 + M 2) 1/2. Upper bound for the largest eigenvalue of An. λmax(An) = λmin β 2 n+1Id + Diag k=0 Hθk(Xk+1)Hθk(Xk+1) !! 1/2 Therefore, by setting λn+1 = (β 2 1 + M 2) 1/2 and βn = Cβnβ, we have λ = 0 and one can arbitrarily choose β (one can take β = 0 for the constant regularization case). Lower bound for the smallest eigenvalue of An. By assumption H5, we have: k=1 ρn k Hθk(Xk+1) 2 M 2(1 ρ) k=1 ρn k M 2 , where we used the fact that Pn k=1 ρn k (1 ρ) 1. This implies that: λmin(An) = λmax β 2 n+1Id + Diag (Vn) 1/2 (β 2 1 + M 2) 1/2 . Upper bound for the largest eigenvalue of An. Note that λmax(An) = λmin β 2 n+1Id + Diag (Vn) 1/2 βn+1 . Therefore, under H2, H3(i), and H5, we can conclude that E h V (θR) 2i = O log n n + bn where bn corresponds to the bias which comes from rn in H3(i). Choosing rn = Crn r, we get: O (n r) if r < 1/2 , O n 1/2 if r > 1/2 , O n 1/2 log n if r = 1/2 . 
Now, we show that under the control of bias, i.e., E [Hθn (Xn+1) |Fn] V (θn) Cαn α , we can verify H3(i) with a similar bound on the bias, where r = 2α. This yields the bias term bn as follows: O n 2α if α < 1/4 , O n 1/2 if α > 1/4 , O n 1/2 log n if α = 1/4 . Verifying Assumption H3 (i) for Adagrad. Using the tower property, we have: E [ V (θn) , An Hθn (Xn+1) ] = E [E [ V (θn) , An Hθn (Xn+1) |Fn]] , where (Fn)n 0 represents the filtration generated by the random variables (θ0, {Xk}k n). Let An be an adaptive Fn-measurable matrix. Then, E [ V (θn) , An Hθn (Xn+1) |Fn] = D V (θn) , An E [Hθn (Xn+1) |Fn] E | {z } Treated as in SGD but with λmin( An) + E h D V (θn) , (An An)Hθn (Xn+1) E |Fn i | {z } Control error between An and An We only verify Assumption H3(i) for Adagrad algorithm since it is analogous to RMSProp. Consider An given by: β 2 n+1Id + 1 n + 1 k=0 Hθk (Xk+1) Hθk (Xk+1) !! 1/2 First, writing β 2 n+1Id + 1 n + 1 k=0 Hθk (Xk+1) Hθk (Xk+1) !! 1/2 and denoting by A[i] the i-th element of the diagonal of a matrix A, we have An[i] An[i] = u 1/2 n v1/2 n u1/2 n v 1/2 n 0 , un = β 2 n+1 + 1 n + 1 k=0 (Hθk (Xk+1) [i])2 and vn = β 2 n+1 + 1 n + 1 k=0 (Hθk (Xk+1) [i])2 . Then, since un vn, An[i] An[i] = vn un unvn un + vn 1 n + 1 (Hθn (Xn+1) [i])2 1 2vn3/2 β3 n+1 n + 1 (Hθn (Xn+1) [i])2 . Since the bias of Hθn (Xn+1) is bounded by bn := Cαn α, E [ V (θn) , An Hθn (Xn+1) |Fn] = D V (θn) , An E [Hθn (Xn+1) |Fn] E + E h D V (θn) , (An An)Hθn (Xn+1) E |Fn i λmin An V (θn) 2 λmax An V (θn) bn V (θn) β3 n+1 n + 1E h Hθn (Xn+1) 3 |Fn i . As Hθn (Xn+1) and the gradient of V are uniformly bounded by M, λmin( An) (β 2 1 + M 2) 1/2, so that E [ V (θn) , An Hθn (Xn+1) |Fn] 1 q β 2 1 + M 2 V (θn) 2 βn+1M bn M 4 β3 n+1 n + 1 , and Assumption H3(i) is satisfied with λn+1 = (β 2 1 + M 2) 1/2 and rn+1 = Mβ2 n+1 b2 n/λn+1 + M 4β3 n+1/(n + 1). A.6 Proof of Theorem 4.6 The proof of this theorem is inspired by [67] and [73], considering biased gradient estimators and decreasing step sizes. We define the operation max(D1, D2) for diagonal matrices D1 and D2 as the matrix formed by taking the maximum between the diagonal elements of D1 and D2. We say that the sequence (An)n 1 of diagonal matrices is decreasing if all diagonal terms are decreasing, in other words, if all eigenvalues are decreasing. Let θk+1 = θk+1 + κ (θk+1 θk), for k 1, κ [0, 1) and mk = ρ1mk 1 + (1 ρ1)gk with gk = Hθk(Xk+1). Using the recursion of AMSGRAD, we have: θk+1 θk = (1 + κ)θk+1 (1 + 2κ)θk + κθk 1 = (1 + κ) (θk+1 θk) κ (θk θk 1) = (1 + κ)γk+1Akmk + κγk Ak 1mk 1 . Choosing κ = ρ1/(1 ρ1), we can rewrite it as: θk+1 θk = κ (γk Ak 1 γk+1Ak) mk 1 γk+1Akgk . By Assumption H2, V is L-smooth, using the recursion of AMSGRAD together with a Taylor expansion with θk, we obtain: V ( θk+1) V θk + D V θk , θk+1 θk E + L V θk γk+1 D V θk , Akgk E + κ D V θk , (γk Ak 1 γk+1Ak) mk 1 E + Lγ2 k+1 Akgk 2 + Lκ2 (γk Ak 1 γk+1Ak) mk 1 2 V θk + T1,k + T2,k + T3,k + T4,k , where T1,k = γk+1 V (θk) , Akgk + Lγ2 k+1 Akgk 2 , T2,k = γk+1 D V θk V (θk) , Akgk E , T3,k = κ D V θk , (γk Ak 1 γk+1Ak) mk 1 E , T4,k = Lκ2 (γk Ak 1 γk+1Ak) mk 1 2 . Note first that n X k=1 E [T1,k] = k=1 γk+1E [ V (θk) , Akgk ] + L k=1 γ2 k+1E h Akgk 2i k=1 γk+1E h V (θk) 2i + Cλ k=1 γk+1rk+1 + L k=1 γ2 k+1E h Akgk 2i , where Cλ = (δ + M 2) 1/2. 
For the second term, using the inequality xy x2/2 + y2/2 for all x, y, and the smoothness of V , we get: n X k=1 E [T2,k] = k=1 E h D V θk V (θk) , γk+1Akgk Ei k=1 E V θk V (θk) 2 + 1 k=1 E h γk+1Akgk 2i k=1 E θk θk 2 + 2 E h Akgk 2i k=1 E h θk θk 1 2i + 2 E h Akgk 2i k=1 γ2 k E h Ak 1mk 1 2i + 2 E h Akgk 2i . For the third term, using the boundedness of the gradient of V and the fact that mk M by Lemma A.3, we have: n X k=1 E [T3,k] = κ k=1 E h D V θk , (γk Ak 1 γk+1Ak) mk 1 Ei k=1 E [γk Ak 1[i] γk+1Ak[i]] i=1 E [γ1A0[i] γn+1An[i]] κM 2d Cγ , where in the second inequality, we used the fact that γk and Ak are decreasing since we use ˆVk = max( ˆVk 1, Vk). For the last term, using the boundedness of the gradient of V yields: n X k=1 E [T4,k] = Lκ2 n X k=1 E h (γk Ak 1 γk+1Ak) mk 1 2i k=1 E h (γk Ak 1[i] γk+1Ak[i])2i k=1 E h (γk Ak 1[i])2 (γk+1Ak[i])2i Lκ2M 2d C2 γ , where we used the inequality (x y)2 x2 y2 when x y in the second last inequality. Combining all these terms, we finally obtain: k=1 γk+1E h V (θk) 2i k=1 γk+1rk+1 + L k=1 γ2 k+1E h Akgk 2i + 2 E h Akgk 2i k=1 γ2 k E h Ak 1mk 1 2i + κM 2d Cγ + Lκ2M 2d C2 γ , where V = E[V (θ0) V (θ )] E[V (θ0) V ( θn+1)]. Choosing γn = n 1/2 and using Lemma A.3 and [15, Lemma 24] yields k=1 γ2 k+1E h Akmk 2i (1 ρ1) k=1 γ2 k+1E h Akgk 2i (1 ρ1)d C2 γ log 1 + n M 2 = O (d log n) . Therefore, by dividing both sides by Cλn 1/2, we obtain k=1 E h V (θk) 2i = O 1 n + d log n n + d n + bn which concludes the proof. Lemma A.3. Let γk+1 γk for all k N, and let Ak be the adaptive matrix defined in Algorithm 1. Assume that ρ1 [0, 1). Then, for all k N: k=1 γ2 k+1E h Akmk 2i (1 ρ1) k=1 γ2 k+1 Akgk 2 . Proof. For the first inequality, we have: ℓ=1 ρk ℓ 1 gℓ ℓ=1 ρk ℓ 1 gℓ M(1 ρ1) X ℓ 0 ρℓ 1 M , where we used the fact that P ℓ 0 ρℓ 1 = 1/(1 ρ1). For the second inequality, using the fact that γk and Ak are decreasing (in the sense that all eigenvalues of Ak are decreasing), since we use ˆVk = max( ˆVk 1, Vk), we can write: k=1 γ2 k+1 Akmk 2 = ℓ=1 ρk ℓ 1 gℓ (1 ρ1)2 n X ℓ=1 ρk ℓ 1 Aℓgℓ 2 (1 ρ1)2 n X ℓ=1 ρk ℓ 1 γ2 ℓ+1 Aℓgℓ 2 (1 ρ1)2 n X k=ℓ ρk ℓ 1 γ2 ℓ+1 Aℓgℓ 2 , which concludes the proof. A.7 The Impact of regularization parameter δ in Adam In our case, we have a dependence on δ in the logarithm, which is common for adaptive algorithms. The regularization parameter δ, originally introduced to avoid the zero denominator issue when Vk approaches 0, is often overlooked. However, it has been empirically observed that the performance of adaptive methods can be sensitive to the choice of this parameter, especially when a very small δ is used, which has resulted in performance issues in some applications. In practice, δ is typically chosen as 10 8. In our convergence rate analysis, even though the logarithm of δ 1 is small, it still impacts the convergence rate. A larger δ will lead to a better convergence rate, while a smaller δ will preserve stronger adaptivity. We need to find a better compromise between the convergence rate and the adaptivity to choose δ. In [77, 67, 73], it was shown that by choosing δ between 10 3 and 10 1, better results were obtained in some applications of deep learning. Furthermore, several modified versions of Adam have been proposed, such as AMSGRAD [77] and YOGI [67] with the discussion of the regularization parameter δ. The authors of [73] proposed a new modified version of Adam called SADAM to represent the calibrated ADAM using the softplus function. In this algorithm, they define ˆVk = softplus Vk while other terms remain unchanged. 
Since we have: ˆVk = softplus p b log 1 + eb Vk 1 b log eb Vk = p where b is the parameter to control for achieving a better convergence rate. In this case, we have λmax(Ak) b/ log 2, which is similar to δ 1/2 in Adagrad and Adam. Additionally, they demonstrate that b 50 appears to be a good choice based on the empirical observations. B IWAE / BR-IWAE B.1 Importance Weighted Autoencoder (IWAE) In this section, we elaborate on the IWAE procedure within our framework to illustrate its convergence rate. The IWAE objective function is defined as: LIWAE k (θ, ϕ; x) = Eq k ϕ ( |x) pθ(x, Z(ℓ)) qϕ(Z(ℓ) | x) where k corresponds to the number of samples drawn from the encoder s approximate posterior distribution. Denoting V as the objective function, i.e., V (θ) = log pθ(x), the gradient of V and the estimator of the gradient of the ELBO of the IWAE objective are given by: θV (θ) = θ log pθ(x) = Epθ( |x) [ θ log pθ(x, z)] , b θLIWAE k (θ, ϕ; x) = w(ℓ) Pk ℓ=1 w(ℓ) θ log pθ(x, z(ℓ)) , (10) where w(ℓ) = pθ(x, z(ℓ))/qϕ(z(ℓ)|x) the unnormalized importance weights. Theorem B.1 provides an upper bound for the bias of this estimator. Theorem B.1. Let X Rdx and Z Rdz denote the data space and the latent space, respectively. Assume that there exists M such that for all θ Θ Rd, x X and z Z, θ log pθ(x, z) M(x). Then, there exists a constant C > 0 such that for all θ Θ, ϕ Φ and x X, Eq k ϕ ( |x) h b θLIWAE k (θ, ϕ; x) θV (θ) i C where θV (θ) and b θLIWAE k (θ, ϕ; x) are defined in (10). Proof. The proof is adapted from [1, Theorem 2.1]. By definition, b θLIWAE k (θ, ϕ; x) θV (θ) = Pk ℓ=1 w(ℓ) θ log pθ(x, z(ℓ)) Epθ( |x) [ θ log pθ(x, z)] Pk ℓ=1 w(ℓ) . Writing H(x, z(ℓ)) = θ log pθ(x, z(ℓ)) Epθ( |x) [ θ log pθ(x, z)], yields ˆ θLIWAE k (θ, ϕ; x) θV (θ) = Pk ℓ=1 w(ℓ) H(x, z(ℓ)) Pk ℓ=1 w(ℓ) . Since Eqϕ[w H(x, z)] = 0, we have: b θLIWAE k (θ, ϕ; x) θV (θ) = 1 k Pk ℓ=1 w(ℓ) H(x, z(ℓ)) Eqϕ h w H(x, z) i 1 k Pk ℓ=1 w(ℓ) . As Pk ℓ=1 w(ℓ) H(x, z(ℓ))/k is an unbiased estimator of Eqϕ h w H(x, z) i , Eqϕ h ˆ θLIWAE k (θ, ϕ; x) θV (θ) i 1 k Pk ℓ=1 w(ℓ) 1 Eqϕ [w] ℓ=1 w(ℓ) H(x, z(ℓ)) Eqϕ h w H(x, z) i!# Eqϕ h ˆ θLIWAE k (θ, ϕ; x) θV (θ) i 1 k Pk ℓ=1 w(ℓ) H(x, z(ℓ)) Eqϕ h w H(x, z) i Eqϕ [w] 1 k Pk ℓ=1 w(ℓ) k Pk ℓ=1 w(ℓ) Therefore, Eqϕ h ˆ θLIWAE k (θ, ϕ; x) θV (θ) i A1 + A2, A1 = Eqϕ h ˆ θLIWAE k (θ, ϕ; x) θV (θ) 1{ 2 k Pk ℓ=1 w(ℓ)>Eqϕ[w]} A2 = Eqϕ h b θLIWAE k (θ, ϕ; x) θV (θ) 1{ 2 k Pk ℓ=1 w(ℓ) Eqϕ[w]} ℓ=1 w(ℓ) H(x, z(ℓ)) Eqϕ h w H(x, z) i! ℓ=1 w(ℓ) !# Eqϕ [w]2 Eqϕ ℓ=1 w(ℓ) H(x, z(ℓ)) Eqϕ h w H(x, z) i ℓ=1 w(ℓ) Eqϕ [w] Eqϕ [w]2 Eqϕ ℓ=1 w(ℓ) H(x, z(ℓ)) Eqϕ h w H(x, z) i ℓ=1 w(ℓ) Eqϕ [w] where we used Cauchy-Schwarz inequality in the last inequality. On the other hand, ℓ=1 w(ℓ) Eqϕ [w] ℓ=1 w(ℓ) H(x, z(ℓ)) Eqϕ h w H(x, z) i ℓ=1 w(ℓ) H(x, z(ℓ)) 4d M 2 Eqϕ w2 Finally, we deduce that k Eqϕ w2 1/2 2 k Eqϕ w2 1/2 = Eqϕ w2 Using the assumption on the boundedness of θ log pθ(x, z) and the Markov inequality, we obtain: ℓ=1 w(ℓ) Eqϕ [w] ℓ=1 w(ℓ) Eqϕ [w] ℓ=1 w(ℓ) Eqϕ [w] Eqϕ [w]2 8M which concludes the proof. Algorithm 2 Adaptive Stochastic Approximation for IWAE Input: Initial point θ0, maximum number of iterations n, step sizes {γk}k 1 and a hyperparameter α 0 to control the bias and MSE. for k = 0 to n 1 do Compute the stochastic update θ,ϕLIWAE kα (θk, ϕk; Xk+1) using kα samples from the variational posterior distribution and adaptive steps Ak. Set θk+1 = θk γk+1Ak θLIWAE kα (θk, ϕk; Xk+1). Set ϕk+1 = ϕk γk+1Ak ϕLIW AE kα (θk, ϕk; Xk+1). 
end for
Output: $(\theta_k)_{0 \le k \le n}$

B.2 BR-IWAE

In this section, we provide additional details on the Bias-Reduced Importance Weighted Autoencoder (BR-IWAE). In IWAE, instead of estimating the gradient of the ELBO with respect to $\theta$ via the Monte Carlo method, we estimate the gradient of the true objective function $\mathbb{E}_{p_\theta(\cdot|x)}[\nabla_\theta \log p_\theta(x, z)]$ using the BR-SNIS estimator [14]. This estimator aims to reduce the bias of self-normalized importance sampling estimators without increasing the variance.

Algorithm 3 BR-IWAE Gradient Estimator
Input: Maximum number of MCMC iterations $t_{\max}$ and number of samples $k$ from the variational distribution $q_\phi(\cdot \mid x)$.
Initialization: Draw $z_0$ from the variational distribution $q_\phi(\cdot \mid x)$.
for $t = 0$ to $t_{\max} - 1$ do
  Draw $I_{t+1} \in \{1, \ldots, k\}$ uniformly at random and set $z^{I_{t+1}}_{t+1} = z_t$.
  Draw $z^{1:k \setminus \{I_{t+1}\}}_{t+1}$ independently from the variational distribution $q_\phi(\cdot \mid x)$.
  Compute the unnormalized importance weights: $w^{(\ell)}_{t+1} = p_\theta(x, z^{(\ell)}_{t+1}) / q_\phi(z^{(\ell)}_{t+1} \mid x)$, $\ell \in \{1, \ldots, k\}$.
  Normalize the importance weights: $\omega^{(\ell)}_{t+1} = w^{(\ell)}_{t+1} / \sum_{j=1}^{k} w^{(j)}_{t+1}$, $\ell \in \{1, \ldots, k\}$.
  Select $z_{t+1}$ from the set $z^{1:k}_{t+1}$ by choosing $z^{\ell}_{t+1}$ with probability $\omega^{(\ell)}_{t+1}$.
end for
Output: $(z^{1:k}_t)_{1 \le t \le t_{\max}}$ and $(\omega^{1:k}_t)_{1 \le t \le t_{\max}}$.

The BR-SNIS estimator of $\mathbb{E}_{p_\theta(\cdot|x)}[\nabla_\theta \log p_\theta(x, z)]$ is given by:
\[
\widehat{\nabla}_\theta \log p_\theta\big(x, z^{1:k}_{t_0:t_{\max}}\big) = \frac{1}{t_{\max} - t_0} \sum_{t = t_0 + 1}^{t_{\max}} \sum_{\ell = 1}^{k} \omega^{(\ell)}_t \, \nabla_\theta \log p_\theta\big(x, z^{\ell}_t\big) \;,
\]
where $t_0$ corresponds to a burn-in period. By [14, Theorem 4], the bias of this estimator decreases exponentially with $t_0$. The BR-IWAE algorithm alternates between two steps during optimization:
- Update the parameter $\phi$ as in the IWAE algorithm, that is, for all $n \ge 1$:
\[
\phi_{n+1} = \phi_n - \gamma_{n+1} A_n \nabla_\phi \mathcal{L}^{\mathrm{IWAE}}_k(\theta_n, \phi_n; X_{n+1}) \;.
\]
- Update the parameter $\theta$ by estimating (9) with BR-SNIS as detailed in Algorithm 3:
\[
\theta_{n+1} = \theta_n - \gamma_{n+1} A_n \widehat{\nabla}_\theta \log p_\theta\big(X_{n+1}, z^{1:k}_{t_0:t_{\max}}\big) \;.
\]

B.3 Some Other Techniques for Reducing Bias

In the previous section, we discussed one technique for reducing bias, BR-IWAE. Here, we provide an overview of other bias reduction techniques within our context. First, the jackknife bias-corrected estimator [62] is defined as:
\[
\mathcal{L}^{\mathrm{Jackknife}}(\theta, \phi; x) = k \, \mathcal{L}^{\mathrm{IWAE}}_k(\theta, \phi; x) - (k-1) \, \mathcal{L}^{\mathrm{IWAE}}_{k-1}(\theta, \phi; x) \;,
\]
which achieves a reduced bias of $O(k^{-2})$. This can be generalized to a bias of order $O(k^{-m})$ for some $m \ge 1$ by considering:
\[
\mathcal{L}^{\mathrm{Jackknife}}_{k,m} = \sum_{j=0}^{m} c(k, m, j) \, \mathcal{L}^{\mathrm{IWAE}}_{k-j} \;,
\]
where the coefficients $c(k, m, j) \propto (-1)^j (k-j)^m$ are given in [62]. The Delta method Variational Inference (DVI) objective [71] is defined by:
\[
\mathcal{L}^{\mathrm{DVI}}_k = \mathbb{E}_{q_\phi^{\otimes k}(\cdot|x)}\Big[ \log\Big( \tfrac{1}{k}\textstyle\sum_{\ell=1}^{k} w^{(\ell)} \Big) + \frac{s_k^2}{2 k \, \bar{w}_k^2} \Big] \;, \quad
w^{(\ell)} = \frac{p_\theta(x, z^{(\ell)})}{q_\phi(z^{(\ell)} \mid x)} \;, \quad
\bar{w}_k = \frac{1}{k}\sum_{\ell=1}^{k} w^{(\ell)} \;, \quad
s_k^2 = \frac{1}{k-1}\sum_{\ell=1}^{k} \big( w^{(\ell)} - \bar{w}_k \big)^2 \;.
\]
The Monte Carlo estimator of the Delta method Variational Inference objective achieves a reduced bias of $O(k^{-2})$. Some other techniques for reducing bias include the iterated bootstrap for bias correction, the debiasing lemma [57], and Multi-Level Monte Carlo and its variants [39].

C Application of Our Theorem to Bilevel and Conditional Stochastic Optimization

C.1 Stochastic Bilevel Optimization

We consider the Stochastic Bilevel Optimization problem given by:
\[
\min_{\theta \in \mathbb{R}^d} V(\theta) = \mathbb{E}_\xi\left[ f(\theta, \phi^*(\theta); \xi) \right] \quad \text{(upper-level)} \tag{11}
\]
\[
\text{subject to } \phi^*(\theta) \in \operatorname*{argmin}_{\phi \in \mathbb{R}^q} \mathbb{E}_\zeta\left[ g(\theta, \phi; \zeta) \right] \quad \text{(lower-level)} \;,
\]
where the upper-level and lower-level functions $f$ and $g$ are both jointly continuously differentiable and $\xi$ and $\zeta$ are random variables. The goal of equation (11) is to minimize the objective function $V$ with respect to $\theta$, where $\phi^*(\theta)$ is obtained by solving the lower-level minimization problem.
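To make the nested structure of (11) concrete, here is a minimal sketch (a toy quadratic problem of our own, not taken from the paper): the lower-level solution $\phi^*(\theta)$ is only approximated by a few SGD steps, and this approximation is precisely one source of the bias analyzed below.

```python
import numpy as np

rng = np.random.default_rng(1)
d, q = 3, 3
B = rng.standard_normal((q, d))

# Toy lower level (illustrative assumption): g(theta, phi; zeta) = 0.5*||phi - B theta||^2 + noise,
# so that phi*(theta) = B theta in closed form.
def grad_phi_g(theta, phi, noise_std=0.1):
    return (phi - B @ theta) + noise_std * rng.standard_normal(q)

def approx_lower_level(theta, T=10, lr=0.5):
    # T steps of SGD on the lower-level objective: a biased surrogate of phi*(theta)
    phi = np.zeros(q)
    for _ in range(T):
        phi -= lr * grad_phi_g(theta, phi)
    return phi

theta = rng.standard_normal(d)
for T in (1, 10, 100):
    phi_hat = approx_lower_level(theta, T=T)
    print(f"T = {T:3d}   ||phi_hat - phi*(theta)|| = {np.linalg.norm(phi_hat - B @ theta):.4f}")
```

Increasing the number of inner steps $T$ reduces $\|\hat{\phi} - \phi^*(\theta)\|$ down to a noise floor set by the constant inner step size, and hence reduces the bias of any upper-level gradient evaluated at $\hat{\phi}$ instead of $\phi^*(\theta)$.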
This bilevel problem involves many machine learning problems with a hierarchical structure, which include hyper-parameter optimization [31], metalearning [30], policy optimization [37] and neural network architecture search [55]. The gradient of the objective function V is given by: V (θ) = θf(θ, ϕ (θ)) θϕg(θ, ϕ (θ))v , where v is the solution of the following linear system: 2 ϕg(θ, ϕ (θ))v = ϕf(θ, ϕ (θ)) . Instead of computing v , the solution of the linear system above, [43, 17] proposes a method to estimate v . This estimation introduces bias in the gradient of the objective function. Consider the following assumptions. H6 For all θ Rd, g(θ, ϕ) is strongly convex with respect to ϕ with parameter µg > 0. H7 (Regularity Lipschitz condition) Assume that f, f, g, 2g are respectively Lipschitz continuous with Lipschitz constants ℓf,0, ℓf,1, ℓg,1 and ℓg,2. Assumptions H6 and H7 are the same assumptions used in [17] to obtain the convergence results with SGD. Furthermore, these two assumptions ensure that the firstand second-order derivatives of f and g, as well as the solution mapping ϕ (θ), are well-behaved. Proposition C.1. ([17, Lemma 2.2]) Under Assumption 6, we have: V (θ) = θf (θ, ϕ (θ)) 2 θϕg (θ, ϕ (θ)) 2 ϕg (θ, ϕ (θ)) 1 ϕf (θ, ϕ (θ)) . (12) Due to the dependence of the minimizer of the lower-level problem ϕ (θ), obtaining an unbiased estimate of V (θ) is challenging. To address this, we replace ϕ (θ) in the gradient with ϕ and define θf(θ, ϕ) := θf(θ, ϕ) 2 θϕg(θ, ϕ) 2 ϕg(θ, ϕ) 1 ϕf(θ, ϕ) . Furthermore, by estimating h 2 ϕg(θ, ϕ) i 1 , we define the stochastic update Hk [17] as follows: Hk = θf (θk, ϕk+1; ξk) 2 θϕg θk, ϕ; ζ(0) k b G ϕf (θk, ϕk+1; ξk) , (13) where b G = N ℓg,1 QN i=1 I 1 ℓg,1 2 ϕg θk, ϕk+1; ζ(i) k with N is drawn from {1, . . . , N} uni- formly at random and n ζ(1), . . . , ζ(N )o are i.i.d. samples. In Algorithm 4, we perform T steps of SGD on the lower-level variable ϕk before updating the upper-level variable θk using adaptive methods such as Adagrad, RMSProp, or AMSGRAD. Lemma C.2. ([33, Lemma 2.2]) Under Assumptions H6 and H7, for all (θ, θ ) Rd Rd, V (θ) V (θ ) LV θ θ , with the constant LV is given by LV = ℓf,1 + ℓg,1 (ℓf,1 + Lf) ℓg,2 + ℓg,1ℓg,2 and Lf is defined as Lf = ℓf,1 + ℓg,1ℓf,1 ℓg,2 + ℓg,1ℓg,2 Algorithm 4 Stochastic Bilevel Optimization Input: Initial points θ0, ϕ0, maximum number of iterations for the upper-level n and for the lower-level T, step sizes {γk, γk}k 1, momentum parameters ρ1, ρ2 [0, 1) and regularization parameter δ 0. Set m0 = 0, V0 = 0 and ˆV0 = 0 for k = 0 to n 1 do Set ϕk,0 = ϕk. for t = 0 to T 1 do ϕk,t+1 = ϕk,t γk+1 ϕg (θk, ϕk,t; ζk,t) end for Set ϕk+1 = ϕk,T . Compute the stochastic update Hk using ϕk+1. mk = ρ1mk 1 + (1 ρ1)Hk Vk = ρ2Vk 1 + (1 ρ2)Hk H k ˆVk = max ˆVk 1, Diag(Vk) Ak = δId + ˆVk 1/2 θk+1 = θk γk+1Akmk end for Output: (θk, ϕk)0 k n Lemma C.3. Under Assumptions H6 and H7, the following inequalities hold: V (θk) E [Hk | Fk] 2 2L2 f ϕk+1 ϕ (θk) 2 + 2 b2 k , θf(θ, ϕ) ℓf,0 + ℓg,1ℓf,0 where Lf = ℓf,1 + ℓg,1ℓf,1 ℓg,2 + ℓg,1ℓg,2 and bk = ℓg,1ℓf,1 1 Proof. For the bias term, since V (θk) = f (θk, ϕ (θk)), we have: V (θk) E [Hk | Fk] 2 = f (θk, ϕ (θk)) f (θk, ϕk+1) + f (θk, ϕk+1) E [Hk | Fk] 2 2 f (θk, ϕ (θk)) f (θk, ϕk+1) 2 + 2 f (θk, ϕk+1) E [Hk | Fk] 2 2L2 f ϕk+1 ϕ (θk) 2 + 2 b2 k , where we used [33, Lemma 2.2] for the first term and [37, Lemma 11] for the second term. 
For the second inequality, we have: θf(θ, ϕ) = θf(θ, ϕ) 2 θϕg(θ, ϕ) 2 ϕg(θ, ϕ) 1 ϕf(θ, ϕ) θf(θ, ϕ) + 2 θϕg(θ, ϕ) 2 ϕg(θ, ϕ) 1 ϕf(θ, ϕ) ℓf,0 + ℓg,1ℓf,0 Theorem C.4. Assume that H6 and H7 hold. Let θn Rd be the n-th iterate of Algorithm 4, γn = cγn 1/2 and γn = c γn 1/2/T. For any n 1, let R {0, . . . , n} be a uniformly distributed random variable. Assume the boundedness of the variance of the estimators of f, g, and 2g. Then, E h V (θR) 2i = O log n n + bn Proof. By using Lemma C.3, V is smooth and Lemma C.3, the bias and the gradient of V are bounded. Using our Corollary 4.5, we obtain: E h V (θR) 2i = O log n n + bn Pn k=0 γk+1 b2 k + Pn k=0 γk+1 ϕk+1 ϕ (θk) 2 Then, with [33, Lemma 2.3] and [17, Lemma 3], we derive: Pn k=0 γk+1 b2 k n + 1 n We achieve a classical convergence rate of O(log n/ n) for Stochastic Bilevel Optimization problems. Two types of bias emerge in this context: firstly, the challenge of directly computing ϕ (θ), and secondly, the necessity of estimating [ 2 ϕg(θ, ϕ)] 1. Our results extend those of [17] to the adaptive case, particularly Adagrad, RMSProp, and AMSGRAD. This provides convergence guarantees for the Alternating Stochastic Gradient Descent (ALSET) method. We can apply our convergence analysis to Stochastic Min-Max and Compositional Problems, as well as to the Actor-Critic method with linear value function approximation [49], which can be viewed as a special case of the Stochastic Bilevel algorithm. C.2 Conditional Stochastic Optimization We now consider a class of Conditional Stochastic Optimization: min θ Rd V (θ) := Eξ fξ Eη|ξ [gη(θ, ξ)] , (14) where fξ( ) : Rq R depends on the random vector ξ and gη( , ξ) : Rd Rq is a vector-valued function dependent on both random vectors ξ and η. The inner expectation is taken with respect to the conditional distribution of η given ξ. Given certain conditions on the regularity of these functions, the gradient of V as defined in (14) can be expressed as: V (θ) = Eξ h Eη|ξ [ gη(θ, ξ)] fξ Eη|ξ [gη(θ, ξ)] i . (15) Constructing an unbiased stochastic estimator of this gradient can be both costly and, in some cases, impractical. Instead, we opt for a biased estimator of V (θ), using just one sample ξ and m i.i.d. samples {ηj}m j=1 from the conditional distribution of η given ξ: b V (θ; ξ, {ηj}m j=1) := j=1 gηj(θ, ξ) j=1 gηj(θ, ξ) H8 For all ξ and η, assume that fξ( ), fξ( ), gη( , ξ), and gη( , ξ) are respectively Lipschitz continuous with Lipschitz constants ℓf,0, ℓf,1, ℓg,0 and ℓg,1. H9 For all θ and ξ, we assume that Eη|ξ h gη(θ, ξ) Eη|ξ [gη(θ, ξ)] 2i σ2 g. Lemma C.5. ([40, Lemma 2.2]) Under Assumptions H8 and H9, the following holds: E h b V (θ; ξ, {ηj}m j=1) i V (θ) 2 ℓ2 g,0ℓ2 f,1σ2 g m . Lemma C.6. Under Assumption H8, we have: V (θ) V (θ ) ℓg,1ℓf,0 + ℓ2 g,0ℓf,1 θ θ , V (θ) ℓg,0ℓf,0 . Proof. Denoting Gθ = Eη|ξ [gη(θ, ξ)] and Gθ = Eη|ξ [ gη(θ, ξ)], we establish the smoothness of V and Boundness of V . Smoothness of V : V (θ) V (θ ) = Eξ GT θ fξ (Gθ) Eξ GT θ fξ (Gθ ) Eξ GT θ fξ (Gθ) Eξ GT θ fξ (Gθ) + Eξ GT θ fξ (Gθ) Eξ [ Gθ fξ (Gθ )] Eξ h ( Gθ Gθ )T fξ (Gθ) i + Eξ GT θ ( fξ (Gθ) fξ (Gθ )) Eξ [ Gθ Gθ fξ (Gθ) ] + Eξ [ Gθ fξ (Gθ) fξ (Gθ ) ] ℓg,1ℓf,0 θ θ + ℓg,0ℓf,1Eξ [ Gθ Gθ ] ℓg,1ℓf,0 θ θ + ℓ2 g,0ℓf,1 θ θ . Boundness of V : V (θ) = Eξ GT θ fξ (Gθ) Eξ [ Gθ fξ (Gθ) ] ℓg,0ℓf,0 . Theorem C.7. Assume that H8 and H9 hold. Let γn = cγn 1/2, An denote the adaptive matrix in AMSGRAD and ρ1, ρ2 [0, 1). For any n 1, let R {0, . . . , n} be a uniformly distributed random variable. 
Then,
\[
\mathbb{E}\big[ \| \nabla V(\theta_R) \|^2 \big] = O\Big( \frac{\log n}{\sqrt{n}} + b_n \Big) \;,
\]
where $b_n$ is the bias term induced by the per-iteration bias bound of Lemma C.5, with $m_k$ denoting the number of conditional samples drawn at iteration $k$.

Proof. This is an immediate implication of Theorem 4.6 using Lemmas C.5 and C.6.

These results can also be extended to the Federated Conditional Stochastic Optimization problem [75], which is defined by:
\[
\min_{\theta \in \mathbb{R}^d} V(\theta) = \frac{1}{L} \sum_{\ell=1}^{L} \mathbb{E}_{\xi^\ell}\Big[ f^\ell_{\xi^\ell}\big( \mathbb{E}_{\eta^\ell \mid \xi^\ell}\big[ g^\ell_{\eta^\ell}(\theta, \xi^\ell) \big] \big) \Big] \;,
\]
where $f^\ell_{\xi^\ell}(\cdot) : \mathbb{R}^q \to \mathbb{R}$ is the outer-layer function on the $\ell$-th device with randomness $\xi^\ell$, and $g^\ell_{\eta^\ell}(\cdot, \xi^\ell) : \mathbb{R}^d \to \mathbb{R}^q$ is the inner-layer function on the $\ell$-th device with respect to the conditional distribution of $\eta^\ell$ given $\xi^\ell$. If the functions $f^\ell_{\xi^\ell}(\cdot)$ and $g^\ell_{\eta^\ell}(\cdot, \xi^\ell)$ of all $L$ devices satisfy Assumptions H8 and H9, we obtain the same convergence rate.

Table 2 below summarizes the key points, including the verification of our assumptions and the convergence results obtained for both Stochastic Bilevel Optimization and Conditional Stochastic Optimization.

Table 2: Bilevel and Conditional Stochastic Optimization within our biased adaptive SA framework.
Problem — Bilevel: $\min_{\theta \in \mathbb{R}^d} \mathbb{E}_\xi[f(\theta, \phi^*(\theta); \xi)]$ s.t. $\phi^*(\theta) \in \operatorname{argmin}_{\phi \in \mathbb{R}^q} \mathbb{E}_\zeta[g(\theta, \phi; \zeta)]$; Conditional: $\min_{\theta \in \mathbb{R}^d} \mathbb{E}_\xi\big[ f_\xi\big( \mathbb{E}_{\eta|\xi}[g_\eta(\theta, \xi)] \big) \big]$.
Gradient — Bilevel: $\nabla_\theta f(\theta, \phi^*(\theta)) - \nabla^2_{\theta\phi} g(\theta, \phi^*(\theta)) v^*$; Conditional: $\mathbb{E}_\xi\big[ \mathbb{E}_{\eta|\xi}[\nabla g_\eta(\theta, \xi)]^{\top} \nabla f_\xi\big( \mathbb{E}_{\eta|\xi}[g_\eta(\theta, \xi)] \big) \big]$.
Lipschitz constant (H2) — Bilevel: the smoothness constant $L_V$ of Lemma C.2; Conditional: $\ell_{g,1}\ell_{f,0} + \ell^2_{g,0}\ell_{f,1}$.
Bias control (H3) — Bilevel: $\frac{\ell_{g,1}\ell_{f,1}}{\mu_g}\big(1 - \frac{\mu_g}{\ell_{g,1}}\big)^{N}$ (Lemma C.3); Conditional: $\ell^2_{g,0}\ell^2_{f,1}\sigma^2_g / m$.
Gradient bound (H5) — Bilevel: $\ell_{f,0} + \ell_{g,1}\ell_{f,0} / \mu_g$; Conditional: $\ell_{g,0}\ell_{f,0}$.
Convergence — Bilevel: $O(\log n / \sqrt{n} + b_n)$; Conditional: $O(\log n / \sqrt{n} + b_n)$.

D Some Other Examples of Biased Gradients with Control on Bias

In this section, we explore examples of applications that use biased gradient estimators while still allowing control over the bias.

D.1 Self-Normalized Importance Sampling

Let $\pi$ be a probability measure on a measurable space $(\mathsf{X}, \mathcal{X})$. The objective is to estimate $\pi(f) = \mathbb{E}_\pi[f(X)]$ for a measurable function $f : \mathsf{X} \to \mathbb{R}^d$ such that $\pi(|f|) < \infty$. Assume that $\pi(\mathrm{d}x) \propto w(x) \lambda(\mathrm{d}x)$, where $w$ is a positive weight function and $\lambda$ is a proposal probability distribution, and that $\lambda(w) = \int w(x) \lambda(\mathrm{d}x) < \infty$. For a function $f : \mathsf{X} \to \mathbb{R}^d$ such that $\pi(|f|) < \infty$, the identity
\[
\pi(f) = \frac{\lambda(w f)}{\lambda(w)} \tag{17}
\]
leads to the Self-Normalized Importance Sampling (SNIS) estimator:
\[
\Pi_N f\big(X^{1:N}\big) = \sum_{i=1}^{N} \omega^i_N f\big(X^i\big) \;, \qquad \omega^i_N = \frac{w(X^i)}{\sum_{\ell=1}^{N} w(X^\ell)} \;,
\]
where $X^{1:N} = (X^1, \ldots, X^N)$ are independent draws from $\lambda$ and the $\omega^i_N$ are called the normalized weights. [1] shows that the bias of the SNIS estimator, $\mathbb{E}[\Pi_N f(X^{1:N})] - \pi(f)$, is of order $O(1/N)$, with an explicit constant depending on the weight function $w$ and on $\sup_x \|f(x)\|$. This type of estimator can be found in the domain of Monte Carlo methods, particularly in the context of Bayesian inference and Sequential Monte Carlo methods.

D.2 Sequential Monte Carlo Methods

We focus here on the task of estimating the parameters, denoted by $\theta$, of Hidden Markov Models. In this context, the hidden Markov chain is denoted by $(X_t)_{t \ge 0}$. The distribution of $X_0$ has density $\chi$ with respect to the Lebesgue measure $\mu$ and, for all $t \ge 0$, the conditional distribution of $X_{t+1}$ given $X_{0:t}$ has density $m_\theta(X_t, \cdot)$. It is assumed that this state is partially observed through an observation process $(Y_t)_{0 \le t \le T}$. The observations $Y_{0:t}$ are assumed to be independent conditionally on $X_{0:t}$ and, for all $0 \le t \le T$, the distribution of $Y_t$ given $X_{0:t}$ depends on $X_t$ only and has density $g_\theta(X_t, \cdot)$ with respect to the Lebesgue measure. The joint distribution of hidden states and observations is given by
\[
p_\theta(x_{0:T}, y_{0:T}) = \chi(x_0) g_\theta(x_0, y_0) \prod_{t=0}^{T-1} m_\theta(x_t, x_{t+1}) g_\theta(x_{t+1}, y_{t+1}) \;.
\]
Our objective is to maximize the likelihood of the model:
\[
p_\theta(y_{0:T}) = \int p_\theta(x_{0:T}, y_{0:T}) \, \mathrm{d}x_{0:T} \;.
\]
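To fix ideas, here is a minimal sketch (a toy scalar linear-Gaussian HMM of our own choosing, not from the paper) that evaluates the joint density $p_\theta(x_{0:T}, y_{0:T})$ defined above; the marginal likelihood $p_\theta(y_{0:T})$ integrates this quantity over the hidden trajectory, which is what the SMC machinery discussed next approximates.

```python
import numpy as np

def gauss_logpdf(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def joint_log_density(x, y, phi, sigma_x, sigma_y):
    # log p_theta(x_{0:T}, y_{0:T}) for a toy scalar linear-Gaussian HMM:
    #   X_0 ~ N(0, sigma_x^2), X_{t+1} | X_t ~ N(phi * X_t, sigma_x^2),
    #   Y_t | X_t ~ N(X_t, sigma_y^2); here theta = (phi, sigma_x, sigma_y).
    lp = gauss_logpdf(x[0], 0.0, sigma_x ** 2) + gauss_logpdf(y[0], x[0], sigma_y ** 2)
    for t in range(len(x) - 1):
        lp += gauss_logpdf(x[t + 1], phi * x[t], sigma_x ** 2)    # m_theta(x_t, x_{t+1})
        lp += gauss_logpdf(y[t + 1], x[t + 1], sigma_y ** 2)      # g_theta(x_{t+1}, y_{t+1})
    return lp

# Simulate a short trajectory and evaluate the joint log-density.
rng = np.random.default_rng(0)
phi, sigma_x, sigma_y, T = 0.9, 1.0, 0.5, 20
x = np.zeros(T + 1); y = np.zeros(T + 1)
x[0] = sigma_x * rng.standard_normal(); y[0] = x[0] + sigma_y * rng.standard_normal()
for t in range(T):
    x[t + 1] = phi * x[t] + sigma_x * rng.standard_normal()
    y[t + 1] = x[t + 1] + sigma_y * rng.standard_normal()
print(joint_log_density(x, y, phi, sigma_x, sigma_y))
```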
To use a gradient-based method for this maximization problem, we need to compute the gradient of the objective function. Under simple technical assumptions, by Fisher's identity,
\[
\nabla_\theta \log p_\theta(y_{0:T}) = \int \nabla_\theta \log p_\theta(x_{0:T}, y_{0:T}) \, p_\theta(x_{0:T} \mid y_{0:T}) \, \mathrm{d}x_{0:T}
= \mathbb{E}_{x_{0:T} \sim p_\theta(\cdot \mid y_{0:T})}\big[ \nabla_\theta \log p_\theta(x_{0:T}, y_{0:T}) \big]
= \mathbb{E}_{x_{0:T} \sim p_\theta(\cdot \mid y_{0:T})}\Big[ \sum_{t=0}^{T-1} s_{t,\theta}(x_t, x_{t+1}) \Big] \;,
\]
where $s_{t,\theta}(x, x') = \nabla_\theta \log\{ m_\theta(x, x') g_\theta(x', y_{t+1}) \}$ for $t > 0$ and, by convention, $s_{0,\theta}(x, x') = \nabla_\theta \log g_\theta(x, y_0)$. Given that the gradient of the log-likelihood is the smoothed expectation of an additive functional, one may use Online Smoothing algorithms to mitigate computational costs. The estimator of the gradient $\nabla_\theta \log p_\theta(y_{0:T})$ is given by:
\[
H_\theta(y_{0:T}) = \sum_{i=1}^{N} \frac{\omega^i_T}{\Omega_T} \, \tau^i_{T,\theta} \;, \qquad \Omega_T = \sum_{i=1}^{N} \omega^i_T \;,
\]
where $\{\tau^i_{T,\theta}\}_{i=1}^N$ are particle approximations obtained using particles $\{(\xi^i_T, \omega^i_T)\}_{i=1}^N$ targeting the filtering distribution $\phi_T$, i.e., the conditional distribution of $x_T$ given $y_{0:T}$. In the Forward-only implementation of FFBSm [20], the particle approximations $\{\tau^i_{T,\theta}\}_{i=1}^N$ are computed using the following recursion, with the initialization $\tau^i_0 = 0$ for all $i \in \llbracket 1, N \rrbracket$:
\[
\tau^i_{t+1,\theta} = \sum_{j=1}^{N} \frac{\omega^j_t \, m_\theta(\xi^j_t, \xi^i_{t+1})}{\sum_{\ell=1}^{N} \omega^\ell_t \, m_\theta(\xi^\ell_t, \xi^i_{t+1})} \Big\{ \tau^j_{t,\theta} + s_{t,\theta}(\xi^j_t, \xi^i_{t+1}) \Big\} \;, \quad t \in \mathbb{N} \;.
\]
The estimator of the gradient $H_\theta(y_{0:T})$ computed by the Forward-only implementation of FFBSm is biased. The bias and MSE of this estimator are of order $O(1/N)$ [20], where $N$ corresponds to the number of particles used to estimate it. Using alternative recursions to compute $\{\tau^i_{T,\theta}\}_{i=1}^N$ results in different algorithms, such as the particle-based rapid incremental smoother (PaRIS) [64], its pseudo-marginal extension [34], and the Parisian particle Gibbs (PPG) [13]. In such cases, one can also control the bias and MSE of the estimator.

D.3 Policy Gradient for Average Reward over Infinite Horizon

Consider a finite Markov Decision Process (MDP) denoted by $(\mathsf{S}, \mathsf{A}, R, P)$, where $\mathsf{S}$ represents the state space, $\mathsf{A}$ denotes the action space, $R : \mathsf{S} \times \mathsf{A} \to [0, R_{\max}]$ is a reward function, and $P$ is the transition model. The agent's decision-making process is characterized by a parametric family of policies $\{\pi_\theta\}_{\theta \in \mathbb{R}^d}$, employing the soft-max parameterization. The objective function (the average reward) is given by:
\[
V(\theta) := \mathbb{E}_{(S,A) \sim v_\theta}\left[ R(S, A) \right] = \sum_{(s,a) \in \mathsf{S} \times \mathsf{A}} v_\theta(s, a) R(s, a) \;,
\]
where $v_\theta$ represents the unique stationary distribution of the state-action Markov chain $\{(S_t, A_t)\}_{t \ge 1}$ generated by the policy $\pi_\theta$. Let $\lambda \in (0, 1)$ be a discount factor and let $T$ be sufficiently large; the estimator of the gradient of the objective function $V$ is given by:
\[
H_\theta(S_{1:T}, A_{1:T}) = R(S_T, A_T) \sum_{i=0}^{T-1} \lambda^i \, \nabla \log \pi_\theta(A_{T-i}; S_{T-i}) \;,
\]
where $(S_{1:T}, A_{1:T}) := (S_1, A_1, \ldots, S_T, A_T)$ is a realization of the state-action sequence generated by the policy $\pi_\theta$. It is important to note that this gradient estimator is biased, with a bias of order $O(1 - \lambda)$ [44].

D.4 Zeroth-Order Gradient

Consider the problem of minimizing the objective function $V$. The zeroth-order gradient method is particularly valuable in scenarios where direct access to the gradient of the objective function is challenging or computationally expensive. The zeroth-order gradient oracle obtained by Gaussian smoothing [61] is given by:
\[
H_\theta(X) = \frac{V(\theta + \tau X) - V(\theta)}{\tau} \, X \;,
\]
where $\tau > 0$ is a smoothing parameter and $X \sim \mathcal{N}(0, I_d)$ is a random Gaussian vector. [61, Lemma 3] provides the bias of this estimator:
\[
\big\| \mathbb{E}[H_\theta(X)] - \nabla V(\theta) \big\| \le \frac{\tau}{2} L (d+3)^{3/2} \;. \tag{19}
\]
Applications of these zeroth-order gradient methods can be found in generative adversarial networks [58, 16].
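To illustrate the oracle above, here is a minimal sketch (our own, using an arbitrary smooth test function) that estimates $\mathbb{E}[H_\theta(X)]$ by Monte Carlo and compares it to $\nabla V(\theta)$ for a few values of the smoothing parameter $\tau$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta = rng.standard_normal(d)

def V(Z):
    # Smooth (non-quadratic) test function: V(z) = sum_i log(1 + exp(z_i)); Z may be a batch.
    return np.logaddexp(0.0, Z).sum(axis=-1)

grad_V = 1.0 / (1.0 + np.exp(-theta))        # exact gradient: sigmoid(theta_i)

n_samples = 500_000
for tau in (1.0, 0.5, 0.25):
    X = rng.standard_normal((n_samples, d))
    # H_theta(X) = (V(theta + tau X) - V(theta)) / tau * X
    H = ((V(theta + tau * X) - V(theta)) / tau)[:, None] * X
    bias = np.linalg.norm(H.mean(axis=0) - grad_V)
    print(f"tau = {tau:4.2f}   estimated bias norm = {bias:.4f}")
    # Note: for very small tau, the Monte Carlo error dominates the true bias.
```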
D.5 Compressed Stochastic Approximation: Coordinate Sampling

The coordinate descent method is based on the iteration:
\[
\theta_{n+1} = \theta_n - \gamma_{n+1} H_{\theta_n}(X_{n+1})_{j_n} e_{j_n} \;,
\]
where $\{e_1, \ldots, e_d\}$ is the canonical basis of $\mathbb{R}^d$ and $H_{\theta_n}(X_{n+1})_j$ is the $j$-th coordinate of the gradient estimate. The randomized coordinate selection rule chooses $j_n$ uniformly from the set $\{1, 2, \ldots, d\}$. Alternatively, the Gauss-Southwell selection rule [63] uses:
\[
j_{n+1} := \operatorname*{argmax}_{j \in \{1,\ldots,d\}} \big| H_{\theta_n}(X_{n+1})_j \big| \;.
\]
This corresponds to a greedy selection procedure, since at each iteration we choose the coordinate with the largest directional derivative. Another approach to choosing $j_n$ is Coordinate Sampling [53], a variant of the stochastic gradient descent algorithm that incorporates a sampling-based selection step to perform random coordinate descent. The distribution of $\zeta_{n+1}$, which selects the coordinate, is characterized by the probability weight vector $(w^{(1)}_n, \ldots, w^{(d)}_n)$ defined as:
\[
w^{(j)}_n = \mathbb{P}(\zeta_{n+1} = j \mid \mathcal{F}_n) \;, \quad j \in \{1, \ldots, d\} \;.
\]
This distribution of $\zeta_{n+1}$ is referred to as the coordinate sampling policy. The Stochastic Coordinate Gradient Descent algorithm is defined by:
\[
\theta_{n+1} = \theta_n - \gamma_{n+1} D(\zeta_{n+1}) H_{\theta_n}(X_{n+1}) \;,
\]
where $D(k) = e_k e_k^{\top} \in \mathbb{R}^{d \times d}$ has all entries equal to $0$ except the $(k, k)$ entry, which is $1$. Observe that the distribution of the random matrix $D(\zeta_{n+1})$ is fully characterized by the matrix $D_n = \mathbb{E}[D(\zeta_{n+1}) \mid \mathcal{F}_n] = \mathrm{Diag}(w^{(1)}_n, \ldots, w^{(d)}_n)$. In this context, $A_n$ is the diagonal matrix $D_n$ whose diagonal entries are the probability weights for sampling each coordinate. These weights typically depend on preceding iterations and even on current gradients. In this case, we always have $\beta_{n+1} \le 1$, and to control the minimum eigenvalue, we only require a lower bound on the probability weights. This method can easily be extended to incorporate biased gradients and adaptive steps by introducing $\bar{A}_n = D_n A_n$, where $A_n$ represents the adaptive matrix as before and $D_n$ is the matrix of probability weights.

E Experiment details and supplementary results

E.1 Experiment with a Synthetic Time-Dependent Bias

To illustrate Theorem 4.1 and the impact of bias, we consider in Figure 3 a simple least squares objective function $V(\theta) = \|A\theta\|^2 / 2$ in dimension $d = 10$. We artificially add to every gradient a zero-mean Gaussian noise with variance $\sigma^2 = 0.01$ and a bias term $r_n = C_r n^{-r}$ at each iteration $n$. We use Adagrad with a learning rate $\gamma_n = n^{-1/2}$ (i.e., $\gamma = 1/2$), $\beta = 0$ and $\lambda = 0$. Then, the bound of Theorem 4.1 is of the form $O(n^{-1/2} + n^{-r})$.

[Figure 3: Value of $V(\theta_n) - V(\theta^*)$ (left) and $\|\nabla V(\theta_n)\|^2$ (right) over 10,000 epochs with Adagrad for different values of $r_n = n^{-r}$ and a learning rate $\gamma_n = n^{-1/2}$. The dashed curve corresponds to the expected convergence rate $O(n^{-1/4})$ for $r = 1/4$ and $O(n^{-1/2})$ for $r \ge 1/2$.]

We explore different values of $r_n \in \{1, n^{-1/4}, n^{-1/2}, n^{-1}, n^{-2}, 0\}$, where $r_n = 1$ corresponds to a constant bias, $r_n = 0$ to an unbiased gradient, and the other choices exhibit a decreasing bias. First, note that the impact of a constant bias term ($r_n = 1$) on the risk and on the norm of the gradients never vanishes. From $r_n = 1$ to $r_n = n^{-1/2}$, the effect of the bias decreases until a threshold is reached beyond which there is no significant improvement. The convergence rate in the case $r_n = n^{-1/2}$ is then the same as in the unbiased case, illustrating the fact that the dominating term then comes from the learning rate.
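For reference, the following minimal sketch (our own simplified version; the random matrix A, the bias direction, and the unregularized Adagrad step are illustrative assumptions) reproduces the spirit of this synthetic experiment, with a gradient perturbed by Gaussian noise and by a bias of norm $C_r n^{-r}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_iter = 10, 10_000
A = rng.standard_normal((d, d))
Q = A.T @ A                      # V(theta) = 0.5 * ||A theta||^2 = 0.5 * theta^T Q theta
sigma = 0.1                      # noise standard deviation, sigma^2 = 0.01

def run(r, Cr=1.0):
    theta = np.ones(d)
    sum_sq = np.zeros(d)
    values = []
    for n in range(1, n_iter + 1):
        bias = Cr * n ** (-r) if r is not None else 0.0            # r = None: unbiased gradient
        grad = Q @ theta + sigma * rng.standard_normal(d) + bias * np.ones(d) / np.sqrt(d)
        sum_sq += grad ** 2
        precond = 1.0 / np.sqrt(sum_sq / n)                        # Adagrad-type diagonal step (no regularization)
        theta -= n ** (-0.5) * precond * grad
        values.append(0.5 * theta @ Q @ theta)
    return values

for r in (0.0, 0.25, 0.5, 1.0, None):
    label = "unbiased" if r is None else f"r={r}"
    print(f"{label:>9}: V(theta_n) - V(theta*) = {run(r)[-1]:.3e}")
```

As in the figure, a constant bias ($r = 0$, i.e., $r_n = 1$) keeps the final risk away from zero, while faster-decaying biases recover the behavior of the unbiased run.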
E.2 Additional Experiments of IWAE

In this section, we provide detailed information about the experiments on CIFAR-10. We also conduct additional experiments on the Fashion MNIST dataset. For all experiments, we use Adagrad, RMSProp, and Adam with a learning rate decay given by $\gamma_n = C_\gamma / \sqrt{n}$, where $C_\gamma = 0.01$ for Adagrad and $C_\gamma = 0.001$ for RMSProp and Adam. The momentum parameters are set to $\rho_1 = 0.9$ and $\rho_2 = 0.999$, and the regularization parameter $\delta$ is fixed at $5 \times 10^{-2}$. The impact of this regularization parameter will be illustrated later.

Datasets. We conduct our experiments on two datasets: Fashion MNIST [76] and CIFAR-10. The Fashion MNIST dataset is a variant of MNIST and consists of 28x28 pixel images of various fashion items, with 60,000 images in the training set and 10,000 images in the test set. CIFAR-10 consists of 32x32 pixel images categorized into 10 different classes, with 50,000 images in the training set and 10,000 images in the test set.

Models. For Fashion MNIST, we use a fully connected neural network with a single hidden layer of 400 hidden units and ReLU activation functions for both the encoder and the decoder. The latent space dimension is set to 20. We use 256 images per iteration (235 iterations per epoch). For CIFAR-10 and CIFAR-100, we use a Convolutional Neural Network (CNN) architecture with 3 convolutional layers and 2 fully connected layers with ReLU activation functions. The latent space dimension is set to 100. For both of these datasets, we use 256 images per iteration (196 iterations per epoch).

We estimate the log-likelihood using the VAE, IWAE, and BR-IWAE models, all of which are trained for 100 epochs. Training is conducted using the SGD, SGD with momentum, Adagrad, RMSProp, and Adam algorithms with a decaying learning rate, as mentioned before. For SGD, we clip the gradients to prevent excessively large steps. For this experiment, we set $k = 5$ samples in both IWAE and BR-IWAE, while restricting the maximum number of MCMC iterations to 5 and the burn-in period to 2 for BR-IWAE. For comparison, we estimate the Negative Log-Likelihood using these three models with SGD, SGD with momentum, Adagrad, RMSProp, and Adam; the results are presented in Table 3. As in the case of CIFAR-10, we observe that IWAE outperforms VAE, while BR-IWAE outperforms IWAE by reducing the bias in all cases. The adaptive methods surpass SGD, and momentum further improves their performance. Consequently, Adam excels among all algorithms thanks to its adaptive steps and momentum.

Table 3: Comparison of Negative Log-Likelihood on the Fashion MNIST test set (lower is better).
Algorithm | VAE | IWAE | BR-IWAE
SGD | 247.2 | 244.9 | 244.0
SGD with momentum | 244.6 | 240.2 | 238.4
Adagrad | 245.8 | 241.4 | 240.5
RMSProp | 242.6 | 239.3 | 237.8
Adam | 240.3 | 237.8 | 236.1

As we did in the case of CIFAR-10, we incorporate a time-dependent bias that decreases over iterations by choosing a bias of order $O(n^{-\alpha})$ at iteration $n$. We vary the value of $\alpha$ for both Fashion MNIST and CIFAR-100.

[Figure 4: IWAE on the Fashion MNIST dataset with Adagrad for different values of $\alpha \in \{1/8, 1/4, 1/2\}$: squared gradient norm (left) and test loss ($\times 10^2$, right) over 100 epochs. Bold lines represent the mean over 5 independent runs.]

[Figure 5: IWAE on the Fashion MNIST dataset with RMSProp for different values of $\alpha \in \{1/8, 1/4, 1/2\}$: squared gradient norm (left) and test loss ($\times 10^2$, right) over 100 epochs. Bold lines represent the mean over 5 independent runs.]
[Figure 6: IWAE on the Fashion MNIST dataset with Adam for different values of $\alpha \in \{1/8, 1/4, 1/2\}$: squared gradient norm (left) and test loss ($\times 10^2$, right) over 100 epochs. Bold lines represent the mean over 5 independent runs.]

All figures are plotted on a logarithmic scale for better visualization and with respect to the number of epochs. The dashed curve corresponds to the expected convergence rate $O(n^{-1/4})$ for $\alpha = 1/8$, and $O(\log n / \sqrt{n})$ for $\alpha = 1/4$ as well as for $\alpha = 1/2$, just as in the case of CIFAR-10. We clearly observe that, in all cases, convergence is achieved when $n$ is sufficiently large. For the Fashion MNIST dataset, the bound seems tight and the convergence rate of $O(n^{-1/2})$ does not seem attainable, in contrast to the case of CIFAR-10, where the curves corresponding to $\alpha = 1/4$ and $\alpha = 1/2$ approach the $O(n^{-1/2})$ convergence rate. For all figures, a larger $\alpha$ leads to faster convergence in both the squared gradient norm and the negative log-likelihood.

Additional Experiments on the CIFAR-10 Dataset.

[Figure 7: Negative Log-Likelihood on the test set ($\times 10^3$) of the CIFAR-10 dataset over 100 epochs for IWAE with Adagrad, RMSProp, and Adam, for $\alpha \in \{1/8, 1/4, 1/2\}$. Bold lines represent the mean over 5 independent runs.]

The effect of $C_\gamma$. Figure 8 illustrates the convergence in both the squared gradient norm and the negative log-likelihood for $C_\gamma = 0.001$ and $C_\gamma = 0.01$ in Adagrad. For the squared gradient norm, we only plot the results for $C_\gamma = 0.001$ for better visualization, the plot for $C_\gamma = 0.01$ having already been presented in Figure 2. It is clear that when $C_\gamma$ is set to 0.001, the convergence of the negative log-likelihood is slower. Similarly, the squared gradient norm still converges for $C_\gamma = 0.001$, but more slowly than for $C_\gamma = 0.01$.

[Figure 8: IWAE on the CIFAR-10 dataset with Adagrad for different values of $\alpha$ and $C_\gamma$: squared gradient norm (left) and test loss ($\times 10^3$, right) over 100 epochs. Bold lines represent the mean over 5 independent runs.]

The Impact of the regularization parameter $\delta$. In Section A.7, we discussed the impact of the regularization parameter $\delta$ in Adam. It has been empirically observed that the performance of adaptive methods can be sensitive to the choice of this parameter. Here, we illustrate the impact of this regularization parameter in IWAE. To this end, we plot the test loss for $\delta \in \{10^{-8}, 10^{-5}, 10^{-3}, 10^{-2}, 5 \times 10^{-2}, 10^{-1}\}$ in Figure 9.

[Figure 9: IWAE on the CIFAR-10 dataset with Adam for different values of $\delta$: test loss ($\times 10^3$) over 100 epochs. Lines represent the mean over 5 independent runs.]

Our experimental results align with prior work [77, 67, 73], affirming the consistent impact of $\delta$. Notably, we find that employing $\delta = 5 \times 10^{-2}$ yields improved performance for IWAE.

The Impact of Bias over Time. Our experiments illustrate the negative log-likelihood with respect to epochs, and we observed that a higher value of $\alpha$ leads to faster convergence. The key point to consider when tuning $\alpha$ is that, while convergence may be faster in terms of iterations, it may lead to higher computational costs.
To illustrate this, we set a fixed time limit of 1000 seconds and tested different values of $\alpha$, plotting the test loss as a function of time in Figure 10. It is clear that with $\alpha = 1/8$ the convergence is always slower, whereas choosing $\alpha = 1/4$ achieves faster convergence than $\alpha = 1/2$. While the difference may seem small here, the disparity becomes significant for more complex models. Therefore, it is essential to tune the value of $\alpha$ to attain fast convergence and reduce computational time. In this paper, all simulations were conducted on an Nvidia Tesla T4 GPU. The total computing time required for the results presented in this paper is estimated to be around 100 to 200 hours of GPU usage.

[Figure 10: Negative Log-Likelihood on the test set ($\times 10^3$) of the CIFAR-10 dataset for IWAE with Adagrad (left) and RMSProp (right) for different values of $\alpha \in \{1/8, 1/4, 1/2\}$ over time (in seconds, up to 1000 s). Bold lines represent the mean over 5 independent runs.]

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The paper's contributions are mentioned in the abstract and clearly detailed in the introduction.
Guidelines:
The answer NA means that the abstract and introduction do not include the claims made in the paper.
The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: In our paper, we discuss the limitations of our results. Furthermore, for each assumption, we also discuss where it can be verified and provide the limitations of these assumptions.
Guidelines:
The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
The authors are encouraged to create a separate "Limitations" section in their paper.
The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: The paper provides a comprehensive set of assumptions with discussions, along with complete proofs for each theoretical result in the Appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: Our work is primarily theoretical, and the experiments are conducted to illustrate our results. The paper provides detailed descriptions of the experimental setup, parameters, and methodologies, ensuring that the main results can be reproduced and verified. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. 
While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code supporting our experiments is available on Git Hub, as referenced in the Experiments section. All data used in our study is publicly accessible (CIFAR dataset). Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: The paper provides comprehensive details on the experimental setup, including hyperparameters, all optimizer algorithms, and other relevant algorithms with pseudo code, ensuring the reproducibility of the results. Guidelines: The answer NA means that the paper does not include experiments. 
The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: All algorithms used to illustrate the results are run 5 times, and the mean and significance of all results are plotted, ensuring an appropriate representation of statistical significance. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: The paper provides detailed information on the computer resources required for all experiments, including the type of compute workers (CPU or GPU), and approximate time of execution. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The research conducted in the paper aligns with the Neur IPS Code of Ethics, ensuring adherence to ethical guidelines throughout the study. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. 
If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: Given that the paper is primarily theoretical and the experiments are conducted solely for illustrative purposes, the work does not involve significant broader societal impacts to discuss. The focus of the paper is on theoretical contributions rather than practical applications with societal implications. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper is theoretical and does not involve the release of data or models that have a high risk for misuse. Therefore, no safeguards were necessary. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. 
We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] Justification: The paper does not use existing assets; therefore, no creators or original owners need to be credited, and no license or terms of use need to be mentioned or respected. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: The paper does not introduce new assets; therefore, there is no documentation provided alongside any assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing experiments or research with human subjects, so there are no instructions provided to participants or details about compensation. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. 
Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The paper does not involve research with human subjects; thus, there are no potential risks, disclosure of risks, or IRB approvals to describe. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.