# Adam Optimization with Adaptive Batch Selection

Published as a conference paper at ICLR 2025

Gyu Yeol Kim, Seoul National University, Seoul, South Korea (gyuyeolkim@snu.ac.kr)
Min-hwan Oh, Seoul National University, Seoul, South Korea (minoh@snu.ac.kr)

## Abstract

Adam is a widely used optimizer in neural network training due to its adaptive learning rate. However, because different data samples influence model updates to varying degrees, treating them equally can lead to inefficient convergence. To address this, a prior work proposed adapting the sampling distribution using a bandit framework to select samples adaptively. While promising, the bandit-based variant of Adam suffers from limited theoretical guarantees. In this paper, we introduce Adam with Combinatorial Bandit Sampling (Adam CB), which integrates combinatorial bandit techniques into Adam to resolve these issues. Adam CB is able to fully utilize feedback from multiple samples at once, enhancing both theoretical guarantees and practical performance. Our regret analysis shows that Adam CB achieves faster convergence than Adam-based methods, including the previous bandit-based variant. Numerical experiments demonstrate that Adam CB consistently outperforms existing methods.

## 1 Introduction

Adam (Kingma & Ba, 2015) is one of the most widely used optimizers for training neural networks, primarily due to its ability to adapt learning rates. The standard version of Adam and its numerous variants treat each training sample equally by employing uniform sampling over the dataset. In practice, however, different data samples can influence model updates to varying degrees. As a result, simply performing full dataset sweeps or sampling data with equal weighting may lead to inefficient convergence and, consequently, unnecessary computational overhead when aiming to satisfy a given convergence criterion. To address these challenges, Liu et al.
(2020) introduced a dynamic approach called Adam BS, which adapts the sampling distribution during training using a multi-armed bandit (MAB) framework. In this method, each training sample is treated as an arm of the MAB, allowing more important samples to be selected with higher probability and to have a greater influence on model updates. This approach was intended to improve both the adaptability and efficiency of the optimization process, presenting a promising direction for further advancements. However, despite its potential benefits, critical issues remain. Specifically, the convergence issue previously identified in the original analysis of Adam (initially reported by Reddi et al. (2018) and later resolved by Zhang et al. (2022)) also affects its bandit-based variant, Adam BS, as newly discovered in our work. Consequently, the existing theoretical guarantees regarding the efficiency and effectiveness of Adam BS are invalid (see Section 2.5.3). As a result, to the best of our knowledge, there is no existing Adam-based method that can adaptively sample while providing rigorous performance guarantees. This raises a critical question: is it possible to design an algorithm that adaptively adjusts the sampling distribution while ensuring both provable guarantees and practical performance improvements? In this paper, we propose a new optimization method, Adam with Combinatorial Bandit Sampling (Adam CB), which addresses the limitations in both the analysis and the implementation of Adam BS by incorporating a combinatorial bandit approach into the sample selection process. In this approach, batch selection is formulated as a combinatorial action, where multiple arms (samples) are selected simultaneously. This combinatorial bandit framework can take advantage of feedback from multiple samples at once, significantly enhancing the adaptivity of the optimizer.
For the first time, we provide provable performance guarantees for adaptive batch selection in Adam-based methods, leading to faster convergence and demonstrating both theoretical and practical improvements over existing approaches. Our main contributions are summarized as follows:

- We propose Adam with Combinatorial Bandit Sampling (Adam CB), a novel optimization algorithm that integrates the Adam method with a combinatorial bandit approach for sample selection. To the best of our knowledge, Adam CB is not only the first algorithm to successfully combine combinatorial bandit techniques with the Adam framework, but also the first to correctly adapt any bandit technique to Adam, significantly enhancing its adaptability.
- We provide a rigorous regret analysis of the proposed Adam CB algorithm, demonstrating that it achieves a sharper regret bound compared to both the original Adam (which uses uniform sampling) and its bandit-based variant, Adam BS (Liu et al., 2020). Additionally, we correct the theoretical errors in the analysis of Adam BS and present a revised regret bound (see Table 1 for comparisons).
- We perform empirical evaluations across multiple datasets and models, showing that Adam CB consistently outperforms existing Adam-based optimization methods in terms of both convergence rate and practical performance. Our results establish Adam CB as the first Adam-based algorithm to offer both provable convergence guarantees and practical efficiency for bandit-based Adam optimization methods.

## 2 Preliminaries

### 2.1 Notations

We denote by $[n]$ the set $\{1, 2, \ldots, n\}$ for a positive integer $n$. For a vector $x \in \mathbb{R}^d$, we denote by $\|x\|$ the vector's Euclidean norm. For two positive sequences $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$, $a_n = O(b_n)$ implies that there exists an absolute constant $C > 0$ such that $a_n \le C b_n$ holds for all $n \ge 1$. Similarly, $a_n = o(b_n)$ indicates that $\lim_{n \to \infty} a_n / b_n = 0$.

### 2.2 Expected Risk and Empirical Risk

**Expected Risk.**
In many machine learning problems, the primary goal is to develop a model with robust generalization performance. By generalization, we mean that while models are trained on a finite sample of data points, we aim for them to perform well on the entire population of data. To achieve this, we focus on minimizing a quantity known as the expected risk. The expected risk is the average loss across the entire population data distribution, reflecting the model's anticipated error if it had access to the complete set of possible data samples. Formally, the expected risk is defined as:

$$\mathbb{E}_{(x,y) \sim P}\big[\ell(\theta; x, y)\big] := \int \ell(\theta; x, y)\, dP(x, y) \tag{1}$$

where $\theta \in \mathbb{R}^d$ is the model parameter, $\ell(\theta; x, y)$ is the loss function that measures the error of the model on a single data sample $(x, y)$, and $P$ is the true distribution of the data. The gold-standard goal is to find the $\theta$ that minimizes the expected risk in Eq. (1), ensuring that the model generalizes well to all data drawn from $P$.

**Empirical Risk.** In practice, however, the true distribution $P$ is typically unknown. Instead, we only work with a finite dataset $D$ consisting of $n$ samples, denoted as $D := \{(x_i, y_i)\}_{i=1}^n$. To approximate the expected risk, we use the empirical distribution $\hat{P}$ derived from the dataset $D$. For this empirical distribution $\hat{P}$ to be a reliable approximation, we assume that the dataset $D$ is representative of the true distribution $P$. This requires that each sample in the dataset $D$ is equally likely and independently drawn from the true distribution $P$ (i.e., the samples $(x_i, y_i)$ are i.i.d. according to $P$). The empirical distribution $\hat{P}$ can be expressed as:

$$\hat{P}(x, y; D) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i, y = y_i) \tag{2}$$

where $\delta$ is the Dirac delta function. With the empirical distribution at hand, the empirical risk is the average loss over the given finite dataset $D$.
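As a quick numerical illustration of this setup (a hedged sketch: the linear model, Gaussian data, and squared loss below are hypothetical choices, not from the paper), the empirical risk computed from i.i.d. samples approaches the expected risk as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: x ~ N(0, 1), y = 2x + noise, squared loss, fixed theta.
theta = 1.5

def loss(theta, x, y):
    return (theta * x - y) ** 2

def empirical_risk(n):
    # Average loss over n i.i.d. samples: the empirical risk under Eq. (2).
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(scale=0.1, size=n)
    return loss(theta, x, y).mean()

# Expected risk in closed form for this toy model:
# E[(theta*x - 2x - eps)^2] = (theta - 2)^2 + Var(eps) = 0.25 + 0.01 = 0.26.
expected_risk = (theta - 2.0) ** 2 + 0.01
for n in [10, 1000, 100000]:
    print(n, empirical_risk(n))  # approaches expected_risk as n grows
```

Because the expected risk is available in closed form here, the gap between the two quantities can be observed directly.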
The empirical risk serves as an estimate of the expected risk and is formally defined as:

$$\mathbb{E}_{(x,y) \sim \hat{P}}\big[\ell(\theta; x, y)\big] := \int \ell(\theta; x, y)\, d\hat{P}(x, y; D) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; x_i, y_i). \tag{3}$$

However, if the dataset is non-uniformly distributed, some samples may be over-represented or under-represented, leading to a biased estimate of the expected risk. To address this issue, one can use importance sampling (Katharopoulos & Fleuret, 2018), which adjusts the sample weights to ensure the empirical risk remains an unbiased estimate of the expected risk.

### 2.3 Objective Function and Mini-Batches

**Objective Function.** In the context of optimizing machine learning models, the objective function $f(\theta; D)$ is often the empirical risk shown in Eq. (3). Given a dataset $D = \{(x_i, y_i)\}_{i=1}^n$, the objective function is defined as $f(\theta; D) := \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; x_i, y_i)$. As studied in the relevant literature on Adam optimization (Duchi et al., 2011; Tieleman & Hinton, 2012; Kingma & Ba, 2015; Dozat, 2016; Reddi et al., 2018), we focus on the problem setting where $f$ is convex (i.e., $\ell$ is convex). The goal of the optimization problem is then to find a parameter $\theta^* \in \mathbb{R}^d$ that minimizes the objective function $f(\theta; D)$. This problem is known as empirical risk minimization: $\theta^* \in \arg\min_{\theta \in \mathbb{R}^d} f(\theta; D)$. The gradient of the objective function $f$ with respect to $\theta$ is denoted by $g := \nabla_\theta f(\theta; D) = \frac{1}{n} \sum_{i=1}^{n} g_i$, where $g_i := \nabla_\theta \ell(\theta; x_i, y_i)$ is the gradient of the loss on the $i$-th data sample in $D$.

**Mini-Batches.** When the full dataset $D = \{(x_i, y_i)\}_{i=1}^n$ is very large (i.e., large $n$), computing the gradient over the entire dataset at each optimization iteration becomes computationally expensive. To address this, mini-batches (smaller subsets of the full dataset) are commonly used to reduce the computational overhead per iteration. Consider the sequence of mini-batches $D_1, D_2, \ldots, D_T \subseteq D$ used for training, with corresponding objective functions $f_t(\theta) := f(\theta; D_t)$ for each $t \in \{1, \ldots, T\}$.
Let $K$ be the size of the mini-batch $D_t$ for all $t$; then $D_t := \{(x_{J^1_t}, y_{J^1_t}), (x_{J^2_t}, y_{J^2_t}), \ldots, (x_{J^K_t}, y_{J^K_t})\}$, where $J_t := \{J^1_t, J^2_t, \ldots, J^K_t\} \subseteq [n]$ is the set of indices of the samples in the mini-batch $D_t$. The objective function $f_t(\theta)$ for the mini-batch $D_t$ is defined as the expected risk over this mini-batch:

$$f_t(\theta) = f(\theta; D_t) := \int \ell(\theta; x, y)\, d\hat{P}(x, y; D_t) \tag{4}$$

where $\hat{P}(x, y; D_t)$ is the empirical distribution derived from the mini-batch $D_t$. The gradient of the objective function $f_t$ with respect to $\theta$ is denoted by $g_t := \nabla_\theta f_t$. Note that the sequence of mini-batches $\{D_t\}_{t=1}^T$ can be selected adaptively. Adaptive selection involves choosing mini-batches based on results observed during previous optimization steps, potentially adjusting the importance assigned to specific samples. The empirical distribution $\hat{P}(x, y; D_t)$ is significantly influenced by the method used to select the mini-batch $D_t$ from the full dataset $D$.

### 2.4 Regret Minimization

**Cumulative Regret.** An online optimization method can be analyzed within the framework of regret minimization. Consider an online optimization algorithm $\pi$ that generates a sequence of model parameters $\theta_1, \ldots, \theta_T$ over $T$ iterations. The performance of $\pi$ can be compared to the optimal parameter $\theta^* \in \arg\min_{\theta \in \mathbb{R}^d} f(\theta; D)$, which minimizes the objective function over the full dataset $D$. The cumulative regret after $T$ iterations is defined as:

$$R^\pi(T) := \mathbb{E}\left[\sum_{t=1}^{T} f(\theta_t; D)\right] - T \min_{\theta \in \mathbb{R}^d} f(\theta; D) \tag{5}$$

where the expectation is taken with respect to any stochasticity in data sampling and parameter estimation. For the optimization algorithm $\pi$ to converge to optimality, we require the cumulative regret $R^\pi(T)$ to grow slower than the number of iterations $T$, specifically $R^\pi(T) = o(T)$.

**Online Regret.** An alternative notion of regret is the online regret, defined over a sequence of mini-batch datasets $\{D_t\}_{t=1}^T$, or equivalently, over the sequence of functions $\{f_t\}_{t=1}^T$.
Specifically, the online regret of the optimization algorithm $\pi$ after $T$ iterations is given by:

$$R^\pi_{\mathrm{online}}(T) := \mathbb{E}\left[\sum_{t=1}^{T} f_t(\theta_t)\right] - \min_{\theta \in \mathbb{R}^d} \sum_{t=1}^{T} f_t(\theta)$$

where the expectation is again taken over any stochasticity in the optimization process. It is important to note that the primary focus should not solely be on minimizing the online regret. An algorithm might select $D_t \subseteq D$ in a way that allows $\pi$ to perform well on $\{D_t\}_{t=1}^T$, but it may perform poorly on the full dataset $D$. Therefore, our ultimate goal remains minimizing the cumulative regret $R^\pi(T)$. Later, in the proof of Theorem 1, we demonstrate how minimizing the cumulative regret $R^\pi(T)$ in Eq. (5) relates to minimizing the online regret $R^\pi_{\mathrm{online}}(T)$ with respect to the sequence $\{f_t\}_{t=1}^T$.

### 2.5 Related Work: Adam and Technical Issues in Convergence Guarantees

#### 2.5.1 Adam Optimizer

Adam (Kingma & Ba, 2015) is a widely used first-order gradient-based optimization method that computes adaptive learning rates for each parameter by using both the first and second moment estimates of the gradients. In each iteration $t$, Adam maintains the accumulated gradients $m_t \leftarrow \beta_{1,t} m_{t-1} + (1 - \beta_{1,t})\, g_t$ and the accumulated squared gradients $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$, where $g_t$ is the gradient at iteration $t$ and $g_t^2$ represents the element-wise square of the gradient $g_t$. The hyperparameters $\beta_1, \beta_2 \in [0, 1)$ control the decay rates of $m_t$ and $v_t$, respectively. Since these moment estimates are initially biased towards zero, the estimates are corrected as $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ and $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$. The Adam algorithm then updates the parameters using $\theta_t \leftarrow \theta_{t-1} - \alpha_t\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, where $\epsilon$ is a small positive constant added to prevent division by zero. The key characteristic of Adam lies in its use of exponential moving averages for both the gradient estimates (first-order) and the element-wise squares of gradients (second-order). This approach has shown empirical effectiveness for optimizing deep neural networks and has led to many follow-up works, such as Reddi et al.
(2018), Loshchilov & Hutter (2019), Chen et al. (2020), Alacaoglu et al. (2020), and Chen et al. (2023).

#### 2.5.2 Convergence of Adam

After Adam was first introduced, there was considerable debate regarding its convergence properties. In particular, Reddi et al. (2018) provided a counterexample demonstrating that Adam might fail to converge under certain conditions (see Section 3 of Reddi et al. 2018). In response, numerous variants of adaptive gradient methods have been proposed, such as AMSGrad (Reddi et al., 2018), AdamW (Loshchilov & Hutter, 2019), and AdaBelief (Zhuang et al., 2020), to address this issue and ensure convergence. However, recent studies (Zhang et al., 2022; Défossez et al., 2022) indicate that the standard Adam algorithm itself can achieve convergence with appropriate hyperparameter choices, thereby resolving its earlier theoretical concerns. These recent works provide alternative convergence proofs for the original Adam algorithm without requiring modifications to its update rules, contingent upon hyperparameter conditions being satisfied.

#### 2.5.3 Technical Issues in Adam with Bandit Sampling (Liu et al., 2020)

The work most closely related to ours is by Liu et al. (2020), who proposed Adam BS, an extension of the original Adam algorithm and proof framework incorporating a bandit sampling approach. However, the initial convergence issue present in the original proof of Adam, discussed in Section 2.5.2, also affects Adam BS, thus invalidating the convergence guarantee provided by Liu et al. (2020). Moreover, even if the alternative convergence proofs of Adam are adapted to the Adam BS framework, several other critical shortcomings persist. We summarize these remaining issues as follows: Adam BS unfortunately fails to provide guarantees on convergence despite its claims, both on the regret bound and on the effectiveness of the adaptive sample selection via the bandit approach. Specifically, the claimed regret bound in Theorem 1 of Liu et al.
(2020) is incorrect. In particular, Eq. (7) on page 3 of the supplemental material of Liu et al. (2020) contains an error in the formula expansion.¹ This technical error is critical to their claims regarding the convergence rate of Adam BS and its dependence on the mini-batch size $K$. Their problem setting is also limited and impractical, even if the analysis were corrected. The analysis assumes that feature vectors follow a doubly heavy-tailed distribution, a strong and restrictive condition that may not hold in practical scenarios. Importantly, no analysis is provided for bounded or sub-Gaussian (light-tailed) distributions, which are more commonly encountered in real-world applications. Despite their claim regarding mini-batch selection of size $K$, their algorithm design allows the same sample to be selected multiple times within a single mini-batch. This occurs because the bandit algorithm they employ is based on single-action selection rather than a combinatorial bandit approach. As a result, their method may repeatedly sample the same data points within a mini-batch. Moreover, due to this limitation, their method fails to achieve performance gains with an increasing mini-batch size $K$, contradicting their claim. Numerical evaluations (in Section 5) demonstrate poor performance of Adam BS. Our numerical experiments across various models and datasets reveal that Adam BS exhibits poor and inconsistent performance. Additionally, an independent evaluation by a separate group has also reported inconsistent results for Adam BS (Bansal et al., 2022).
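For concreteness, the standard Adam update from Section 2.5.1 (plain uniform-sampling Adam, not the bandit-based variants discussed above) can be sketched in a few lines; the quadratic objective and hyperparameter values below are illustrative only:

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One iteration of the standard Adam update described in Section 2.5.1."""
    m = beta1 * m + (1 - beta1) * g           # first moment (EMA of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment (EMA of squared grads)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize the toy objective f(theta) = theta^2 starting from theta = 1.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    g = 2 * theta                             # gradient of theta^2
    theta, m, v = adam_step(theta, g, m, v, t, alpha=0.01)
print(theta)  # close to the minimizer 0
```

Note how the effective per-coordinate step size is adapted by $\sqrt{\hat{v}_t}$, which is the characteristic that the bandit-based variants inherit.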
## 3 Proposed Algorithm: Adam CB

### 3.1 Adam CB Algorithm

**Algorithm 1:** Adam with Combinatorial Bandit Sampling (Adam CB)

Input: learning rates $\{\alpha_t\}_{t=1}^T$, decay rates $\{\beta_{1,t}\}_{t=1}^T$ and $\beta_2$, batch size $K$, exploration parameter $\gamma \in [0, 1)$.
Initialize: model parameters $\theta_0$; first moment estimate $m_0 \leftarrow 0$; second moment estimates $v_0 \leftarrow 0$, $\hat{v}_0 \leftarrow 0$; sample weights $w_{i,0} \leftarrow 1$ for all $i \in [n]$.

1. for $t = 1$ to $T$ do
2. $\quad J_t, p_t, S_{\mathrm{null},t} \leftarrow \text{Batch-Selection}(w_{t-1}, K, \gamma)$ (Algorithm 2)
3. $\quad$ Compute the unbiased gradient estimate $g_t$ with respect to $J_t$ using Eq. (7)
4. $\quad m_t \leftarrow \beta_{1,t} m_{t-1} + (1 - \beta_{1,t})\, g_t$
5. $\quad v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
6. $\quad \hat{v}_1 \leftarrow v_1$; $\hat{v}_t \leftarrow \max\left\{\frac{(1 - \beta_{1,t})^2}{(1 - \beta_{1,t-1})^2}\, \hat{v}_{t-1},\; v_t\right\}$ if $t \ge 2$
7. $\quad \theta_{t+1} \leftarrow \theta_t - \alpha_t\, m_t / (\sqrt{\hat{v}_t} + \epsilon)$
8. $\quad w_t \leftarrow \text{Weight-Update}(w_{t-1}, p_t, J_t, \{g_{j,t}\}_{j \in J_t}, S_{\mathrm{null},t}, \gamma)$ (Algorithm 3)

We present our proposed algorithm, Adam with Combinatorial Bandit Sampling (Adam CB), which is described in Algorithm 1. The algorithm begins by initializing the sample weights $w_0 := \{w_{1,0}, w_{2,0}, \ldots, w_{n,0}\}$ uniformly, assigning an equal weight of 1 to each of the $n$ training samples. At each iteration $t \in [T]$, the current sample weights $w_{t-1} = \{w_{1,t-1}, w_{2,t-1}, \ldots, w_{n,t-1}\}$ are used to determine the sample selection probabilities $p_t := \{p_{1,t}, p_{2,t}, \ldots, p_{n,t}\}$, where these probabilities are controlled by the exploration parameter $\gamma$ (Line 2). A subset of samples, denoted by $D_t \subseteq D$, is chosen based on these probabilities. The set of indices of the samples chosen for the mini-batch $D_t$ is denoted by $J_t := \{J^1_t, J^2_t, \ldots, J^K_t\} \subseteq [n]$. Using this mini-batch $D_t$, an unbiased gradient estimate $g_t$ is computed (Line 3). The algorithm then updates the moment estimates $m_t$, $v_t$, and $\hat{v}_t$ following the Adam-based update rules (Lines 4-6). The model parameters $\theta_t$ are subsequently updated based on these moment estimates (Line 7). Finally, the weights $w_{t-1}$ are adjusted to reflect the importance of each sample, improving the batch selection process in future iterations (Line 8).

¹Liu et al.
(2020) apply Jensen's inequality to handle the expectation of the squared norm of the sum of gradient estimates. However, the convexity assumption required for Jensen's inequality does not hold in this context, rendering this step in their proof invalid.

The following sections describe the detailed process for deriving the sample probabilities $p_t$ and selecting the mini-batch $D_t = \{(x_j, y_j)\}_{j \in J_t}$ from the sample weights $w_{t-1}$ utilizing our proposed combinatorial bandit sampling.

### 3.2 Batch Selection: Combinatorial Bandit Sampling

In our approach, we incorporate a bandit framework where each sample is treated as an arm. Since multiple samples must be selected for a mini-batch, we extend the selection process to handle multiple arms. There are two primary methods for sampling multiple arms: with replacement or without replacement. The previous method, Adam BS (Liu et al., 2020), samples multiple arms with replacement. In contrast, our proposed method, Adam CB, employs a combinatorial bandit algorithm to sample multiple arms without replacement, achieved by Batch-Selection (Algorithm 2).

**Algorithm 2:** Batch-Selection

Input: sample weights $w_{t-1}$, batch size $K$, exploration parameter $\gamma \in [0, 1)$.

1. Set $C \leftarrow (1/K - \gamma/n)/(1 - \gamma)$
2. if $\max_{i \in [n]} w_{i,t-1} \ge C \sum_{i=1}^n w_{i,t-1}$ then
3. $\quad$ Let $w'_{t-1}$ be a sorted list of $\{w_{i,t-1}\}_{i=1}^n$ in descending order
4. $\quad$ Set $S \leftarrow \sum_{i=1}^n w_{i,t-1}$
5. $\quad$ for $i = 1$ to $n$ do
6. $\qquad$ Compute $\tau \leftarrow C \cdot S / (1 - i\, C)$
7. $\qquad$ if $w'_{i,t-1} < \tau$ then break, else update $S \leftarrow S - w'_{i,t-1}$
8. $\quad$ Set $S_{\mathrm{null},t} \leftarrow \{i : w_{i,t-1} \ge \tau\}$ and $w_{i,t-1} \leftarrow \tau$ for $i \in S_{\mathrm{null},t}$
9. else
10. $\quad$ Set $S_{\mathrm{null},t} \leftarrow \emptyset$
11. Set $p_{i,t} \leftarrow K\left((1 - \gamma)\, \frac{w_{i,t-1}}{\sum_{j=1}^n w_{j,t-1}} + \frac{\gamma}{n}\right)$ for all $i \in [n]$
12. Set $J_t \leftarrow \text{DepRound}(K, (p_{1,t}, p_{2,t}, \ldots, p_{n,t}))$ (Algorithm 6)
13. return $J_t, p_t, S_{\mathrm{null},t}$

**Weight Adjustment (Lines 2-10).** Unlike a single-arm-selection bandit approach such as Adam BS, where $\sum_{i=1}^n p_{i,t} = 1$ because only one sample is selected at a time, Adam CB must select $K$ samples simultaneously for a mini-batch.
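The capping computation in Lines 1-11 of Algorithm 2 can be sketched as follows (a minimal sketch: the loop's index handling is our own rendering of the capping step, and the DepRound call of Line 12 is omitted here):

```python
import numpy as np

def capped_probabilities(w, K, gamma):
    """Sketch of Algorithm 2, Lines 1-11: cap oversized weights at a
    threshold tau so that every selection probability stays <= 1, then set
    p_i = K * ((1 - gamma) * w_i / sum(w) + gamma / n)."""
    w = np.asarray(w, dtype=float).copy()
    n = len(w)
    C = (1.0 / K - gamma / n) / (1.0 - gamma)
    S_null = np.array([], dtype=int)
    if w.max() >= C * w.sum():
        sorted_w = np.sort(w)[::-1]        # weights in descending order
        S = sorted_w.sum()                 # total weight of uncapped samples
        k, tau = 0, C * sorted_w.sum()
        # Cap the k largest weights at tau, where tau solves
        # tau = C * (k * tau + S), i.e. tau / (capped total weight) = C.
        while k < n and sorted_w[k] >= tau:
            S -= sorted_w[k]
            k += 1
            tau = C * S / (1.0 - k * C)
        S_null = np.flatnonzero(w >= tau)  # indices removed from active play
        w[S_null] = tau
    p = K * ((1.0 - gamma) * w / w.sum() + gamma / n)
    return p, w, S_null

# Usage: one dominant weight gets capped so its probability is exactly 1,
# while the probabilities still sum to K.
p, w_capped, S_null = capped_probabilities([10.0, 1.0, 1.0, 1.0, 1.0], K=2, gamma=0.1)
print(round(p.sum(), 6), round(p.max(), 6))  # -> 2.0 1.0
```

The invariant checked here ($\sum_i p_{i,t} = K$ and $\max_i p_{i,t} \le 1$) is exactly what makes the probabilities valid marginals for selecting $K$ distinct samples.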
Therefore, it is natural to scale the sum of the probabilities to $K$, reflecting the expected number of samples selected in each round.² Allowing the sum of probabilities to equal $K$ can lead to individual probabilities $p_{i,t}$ exceeding 1, especially when certain samples are assigned significantly higher weights due to their importance (or gradient magnitude). To ensure valid probabilities and prevent any sample from being over-represented, Adam CB introduces a threshold $\tau$. If a weight $w_{i,t-1}$ exceeds $\tau$, the index $i$ is added to a null set $S_{\mathrm{null},t}$, effectively removing it from active consideration for selection. The probabilities of the remaining samples are adjusted to redistribute the excess weight while ensuring the sum of probabilities remains $K$.

**Probability Computation (Line 11).** After adjusting the weights, the probabilities $p_t$ for selecting each sample are computed using the adjusted weights $w_{t-1}$ and the exploration parameter $\gamma$. This computation balances the need to exploit samples with higher weights (more likely to provide useful gradients) and to explore other samples. The inclusion of $K$ in the scaling ensures that the sum of probabilities matches the batch size: $\sum_{i=1}^n p_{i,t} = K$.

**Mini-batch Selection (Line 12).** The final selection of $K$ distinct samples for the mini-batch is performed using DepRound (Algorithm 6), originally proposed by Gandhi et al. (2006) and later adapted by Uchiya et al. (2010). DepRound efficiently selects $K$ distinct samples from a set of $n$ samples, ensuring that each sample $i$ is selected with probability $p_{i,t}$. The algorithm has a computational complexity of $O(n)$, which is significantly more efficient than a naive approach requiring consideration of all possible combinations, with a complexity of at least $\binom{n}{K}$.

²If the sum of probabilities were constrained to 1, the algorithm would need to perform additional rescaling or sampling adjustments.
Instead, directly setting $\sum_{i=1}^n p_{i,t} = K$ aligns the probability distribution with the batch-level selection requirements.

### 3.3 Computing Unbiased Gradient Estimates

Given the mini-batch data $D_t = \{(x_j, y_j)\}_{j \in J_t}$ from Algorithm 2, and since $p_t$ assigns a selection probability to every sample in the full dataset $D$ and $D_t$ is sampled according to $p_t$, we employ an importance sampling technique to compute the empirical distribution $\hat{P}$ for $D_t$:

$$\hat{P}(x, y; D_t) := \frac{1}{n} \sum_{j \in J_t} \frac{1}{p_{j,t}}\, \delta(x = x_j, y = y_j) \tag{6}$$

where $\delta$ is the Dirac delta function. This formulation ensures that the empirical distribution $\hat{P}$ for the mini-batch $D_t$ closely approximates the original empirical distribution $\hat{P}(x, y; D)$ defined over the full dataset $D$, as expressed in Eq. (2). According to the empirical distribution $\hat{P}(x, y; D_t)$ in Eq. (6), the online objective function $f_t$ corresponding to the mini-batch $D_t$ (as defined in Eq. (4)) can be computed as

$$f_t(\theta) = f(\theta; D_t) = \int \ell(\theta; x, y)\, d\hat{P}(x, y; D_t) = \frac{1}{n} \sum_{j \in J_t} \frac{\ell(\theta; x_j, y_j)}{p_{j,t}}.$$

This implies that the gradient $g_t = \nabla_\theta f_t(\theta)$ obtained from the mini-batch $D_t$ at iteration $t$ is computed as follows:

$$g_t = \nabla_\theta f_t(\theta) = \frac{1}{n} \sum_{j \in J_t} \frac{\nabla_\theta \ell(\theta; x_j, y_j)}{p_{j,t}} = \sum_{j \in J_t} \frac{g_{j,t}}{n\, p_{j,t}}. \tag{7}$$

Here, we denote the gradients for each individual sample in the mini-batch $D_t$ as $\{g_{j,t}\}_{j \in J_t}$, where $J_t$ is the set of indices for $D_t$. In stochastic optimization methods like SGD and Adam, it is crucial to use an unbiased gradient estimate when updating the moment vectors. We can easily show that $g_t$ is an unbiased estimate of the true gradient $g$ over the entire dataset by taking the expectation over $p_t$, i.e., $\mathbb{E}_{p_t}[g_t] = g$. The unbiased gradient estimate $g_t$ in Eq. (7) is then used to update the first moment estimate $m_t$ and the second moment estimate $v_t$ in each iteration of the algorithm.

### 3.4 Update of Sample Weights

The final step in each iteration of Algorithm 1 involves updating the sample weights $w_t$. Treating the optimization problem as an adversarial semi-bandit, our partial feedback consists only of the gradients $\{g_{j,t}\}_{j \in J_t}$.
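The unbiasedness claim $\mathbb{E}_{p_t}[g_t] = g$ can be checked numerically. Below is a hedged sketch: the `depround` function is our own rendering of the standard dependent-rounding scheme (not the paper's Algorithm 6 verbatim), and the per-sample gradients are synthetic:

```python
import numpy as np

def depround(p, rng):
    """Dependent rounding: select exactly K = sum(p) indices, each index i
    with marginal probability p_i. Mass is moved between two fractional
    coordinates until every entry is 0 or 1; marginals are preserved."""
    p = np.asarray(p, dtype=float).copy()
    while True:
        frac = np.flatnonzero((p > 1e-12) & (p < 1.0 - 1e-12))
        if len(frac) < 2:
            break
        i, j = frac[0], frac[1]
        alpha, beta = min(1.0 - p[i], p[j]), min(p[i], 1.0 - p[j])
        if rng.random() < beta / (alpha + beta):
            p[i], p[j] = p[i] + alpha, p[j] - alpha  # rounds p_i up or p_j down
        else:
            p[i], p[j] = p[i] - beta, p[j] + beta    # rounds p_i down or p_j up
    return np.flatnonzero(p > 0.5)

rng = np.random.default_rng(0)
n, d, K = 6, 2, 2
G = rng.normal(size=(n, d))            # hypothetical per-sample gradients g_i
g_full = G.mean(axis=0)                # full gradient g = (1/n) * sum_i g_i

w = rng.uniform(0.8, 1.2, size=n)      # arbitrary weights (no capping needed here)
gamma = 0.1
p = K * ((1 - gamma) * w / w.sum() + gamma / n)

# Eq. (7): g_t = (1/n) * sum_{j in J_t} g_j / p_j, averaged over many draws.
est = np.zeros(d)
trials = 20000
for _ in range(trials):
    J = depround(p, rng)
    est += (G[J] / p[J, None]).sum(axis=0) / n
est /= trials
print(np.abs(est - g_full).max())  # close to 0: the estimator is unbiased
```

Because each index $j$ lands in $J_t$ with probability exactly $p_{j,t}$, the $1/(n\,p_{j,t})$ weighting cancels the selection bias in expectation.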
The loss $\ell_{i,t}$ incurred when the $i$-th arm is pulled is computed based on the norm of the gradient $\|g_{i,t}\|$. Specifically, the loss $\ell_{i,t}$ is always non-negative and inversely related to $\|g_{i,t}\|$. This implies that samples with smaller gradient norms are assigned lower weights, while samples with larger gradient norms are more likely to be selected in future iterations.

**Algorithm 3:** Weight-Update

Input: $w_{t-1}$, $p_t$, $J_t$, $\{g_{j,t}\}_{j \in J_t}$, $S_{\mathrm{null},t}$, $\gamma \in [0, 1)$.

1. for $j = 1$ to $n$ do
2. $\quad$ Compute loss $\ell_{j,t} = \frac{p_{\min}^2}{L^2}\left(\frac{L^2}{p_{\min}^2} - \frac{\|g_{j,t}\|^2}{(p_{j,t})^2}\right)$ if $j \in J_t$; otherwise $\ell_{j,t} = 0$
3. $\quad$ if $j \notin S_{\mathrm{null},t}$ then
4. $\qquad w_{j,t} \leftarrow w_{j,t-1} \exp\left(-K \gamma\, \ell_{j,t} / n\right)$
5. return $w_t$

## 4 Regret Analysis

In this section, we present a regret analysis for our proposed algorithm, Adam CB. We begin by introducing the standard assumptions commonly used in the analysis of optimization algorithms.

**Assumption 1** (Bounded gradient). There exists $L > 0$ such that $\|g_{i,t}\| \le L$ for all $i \in [n]$ and $t \in [T]$.

**Assumption 2** (Bounded parameter). There exists $D > 0$ such that $\|\theta_s - \theta_t\| \le D$ for any $s, t \in [T]$.

**Discussion of Assumptions.** Both Assumptions 1 and 2 are standard in the relevant literature studying regret bounds of Adam-based optimization (Kingma & Ba, 2015; Reddi et al., 2018; Luo et al., 2019; Liu et al., 2020; Chen et al., 2020). A closely related work (Liu et al., 2020) relies on the additional, stronger assumption of a doubly heavy-tailed feature distribution. In contrast, the regret bound for Adam CB is derived using only these two standard assumptions.

### 4.1 Regret Bound of Adam CB

**Theorem 1** (Regret bound of Adam CB). Suppose Assumptions 1-2 hold, and we run Adam CB for a total of $T$ iterations with $\alpha_t = \alpha/\sqrt{t}$ and with $\beta_{1,t} := \beta_1 \lambda^{t-1}$, $\lambda \in (0, 1)$. Then, the cumulative regret of Adam CB (Algorithm 1) with batch size $K$ is upper-bounded by

$$R^\pi(T) = O\!\left(d\sqrt{T} + \frac{\sqrt{d}}{n^{3/4}} \left(\frac{T}{K} \ln\frac{n}{K}\right)^{1/4}\right).$$

**Discussion of Theorem 1.** Theorem 1 establishes that the cumulative regret bound of Adam CB is sub-linear in $T$, i.e., $R^\pi(T) = o(T)$.
Hence, Adam CB is guaranteed to converge to the optimal solution. The first term in the regret bound, $d\sqrt{T}$, is commonly shared by the results of all Adam-based methods (Kingma & Ba, 2015; Reddi et al., 2018; Liu et al., 2020). The second term, $(\sqrt{d}/n^{3/4})\,((T/K) \ln(n/K))^{1/4}$, illustrates the impact of the number of samples $n$ as well as the batch size $K$ on the regret. As the number of samples $n$ increases, this term decreases, suggesting that having more data generally helps in reducing regret (hence converging faster to optimality). Similarly, increasing the batch size $K$ also decreases this term, reflecting that larger mini-batches can reduce the variance of gradient estimates, thus improving performance.

### 4.2 Proof Sketch of Theorem 1

In this section, we present the proof sketch of the regret bound in Theorem 1. The proof starts by decomposing the cumulative regret $R^\pi(T)$ into three parts: the cumulative online regret $R^\pi_{\mathrm{online}}(T)$ and auxiliary terms (A) and (B), as shown below:

$$R^\pi(T) = R^\pi_{\mathrm{online}}(T) + \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} \big(f(\theta_t; D) - f(\theta_t; D_t)\big)\right]}_{(A)} + \underbrace{\mathbb{E}\left[\min_{\theta \in \mathbb{R}^d} \sum_{t=1}^{T} f(\theta; D_t)\right] - T \min_{\theta \in \mathbb{R}^d} f(\theta; D)}_{(B)} \tag{8}$$

We now prove the following two key lemmas to bound the online regret $R^\pi_{\mathrm{online}}(T)$.

**Lemma 1.** Suppose Assumptions 1-2 hold. Adam CB (Algorithm 1) with a mini-batch of size $K$, formed dynamically according to the distribution $p_t$, achieves an upper bound on the cumulative online regret $R^\pi_{\mathrm{online}}(T)$ over $T$ iterations of the form $\rho_1 d \sqrt{\frac{1}{n^2 K}(\cdots)}$ plus lower-order terms, where $\rho_1$, $\rho_2$, and $\rho_3$ are constants (see Appendix B.2).

Lemma 1 provides an upper bound for the cumulative online regret over $T$ iterations. This lemma shows that $p_t$ affects the upper bound of $R^\pi_{\mathrm{online}}(T)$. Hence, we wish to choose $p_t$ so as to minimize the upper bound. The following key lemma shows that this can be achieved by a combinatorial semi-bandit approach, adapted from EXP3 (Auer et al., 2002).

**Lemma 2.** Suppose Assumptions 1-2 hold.
If we set $\gamma = \min\!\big\{1, \sqrt{\cdots}\big\}$, the batch selection (Algorithm 2) and the weight update rule (Algorithm 3) following Adam CB (Algorithm 1) imply an upper bound, sub-linear in $T$, on the expected cumulative loss difference between the chosen mini-batches and the optimal mini-batch (see Appendix B.3).

**Table 1: Comparison of Regret Bounds**

| Optimizer | Regret Bound |
| --- | --- |
| Adam X (Tran et al., 2019) (variant of Adam*) | $O\big(d\sqrt{T} + \frac{\sqrt{d}}{\sqrt{n}}\,\sqrt{T}\big)$ |
| Adam BS (Liu et al., 2020) (corrected†) | $O\big(d\sqrt{T} + \frac{\sqrt{d}}{n^{3/4}}\,(T \ln n)^{1/4}\big)$ |
| Adam CB (ours) | $O\big(d\sqrt{T} + \frac{\sqrt{d}}{n^{3/4}}\,\big(\frac{T}{K} \ln\frac{n}{K}\big)^{1/4}\big)$ |

*While the convergence of Adam (Kingma & Ba, 2015) has been correctly established by Zhang et al. (2022), the results in Zhang et al. (2022) do not provide regret guarantees. For fair comparisons of regret, we use a slight modification of Adam, Adam X (Tran et al., 2019), for which a correct regret analysis is feasible. †Similarly, for Adam BS (Liu et al., 2020), since the existing proof of the regret guarantee contains technical errors, we present a corrected proof, resulting in a new regret bound (Theorem 3).

Lemma 2 bounds the difference between the expected cumulative loss of the chosen mini-batch and that of the optimal mini-batch, showing sub-linear growth in $T$ with dependence on the batch size $K$. Combining Lemma 1 and Lemma 2, we can bound the cumulative online regret $R^\pi_{\mathrm{online}}(T)$, which also grows sub-linearly in $T$. Proofs of Lemma 1 and Lemma 2 are given in Appendix B.2 and Appendix B.3, respectively. The discrepancy terms (A) and (B) in Eq. (8) capture the difference between the full dataset $D$ and the mini-batches $\{D_t\}_{t=1}^T$, and are also bounded sub-linearly in $T$ (see Lemma 11 in Appendix B.4). Since the cumulative regret $R^\pi(T)$ decomposes into the online regret $R^\pi_{\mathrm{online}}(T)$ plus additional sub-linear terms, we obtain the cumulative regret bound for Adam CB.

### 4.3 Comparisons with Adam and Adam BS

Our main goal is to demonstrate that the convergence rate of Adam CB (Algorithm 1) is provably more efficient than those of existing Adam-based methods, including ones that employ uniform sampling and Adam BS (Liu et al., 2020), which utilizes (non-combinatorial) bandit sampling.
Since the claimed regret bounds for the original Adam and Adam BS are invalid, we provide new regret bounds for both a variant of Adam, called Adam X (Tran et al., 2019), and Adam BS in Theorems 2 and 3, respectively, which may be of independent interest. For comparison purposes, we additionally introduce the following assumption:

**Assumption 3** (Bounded variance of gradient). There exists $\sigma > 0$ such that $\mathrm{Var}(\|g_{i,t}\|) \le \sigma^2$ for all $i \in [n]$ and $t \in [T]$.

Assumption 3 is commonly used in the previous literature (Reddi et al., 2016; Nguyen et al., 2018; Zou et al., 2019; Patel et al., 2022). It is important to note that Assumption 3 is not required for the analysis of our algorithm in Theorem 1. Rather, we employ the assumption to compare fairly with the corrected results for the existing Adam-based methods (Tran et al., 2019; Liu et al., 2020). Under Assumptions 1, 2, and 3, the regret bound for the Adam variant (Adam X) using uniform sampling is $O\big(d\sqrt{T} + \frac{\sqrt{d}}{\sqrt{n}}\sqrt{T}\big)$ (Theorem 2 in Appendix C), while the regret bound for a corrected version of Adam BS using (non-combinatorial) bandit sampling is $O\big(d\sqrt{T} + \frac{\sqrt{d}}{n^{3/4}}(T \ln n)^{1/4}\big)$ (Theorem 3 in Appendix D) when Assumptions 1 and 2 hold. The comparisons of regret bounds are outlined in Table 1.

**Faster Convergence of Adam CB.** The second term in the regret bound of Adam X exhibits a dependence on $n^{-1/2}$, which is the rate of regret decrease as the dataset size increases. However, this reduction in regret occurs at a slower rate compared to bandit-based sampling methods. Both Adam BS (corrected) and Adam CB achieve an improved $n^{-3/4}$ dependency, resulting in faster convergence. When comparing the two bandit-based sampling methods, Adam CB surpasses Adam BS (corrected) in terms of convergence, particularly by a factor of the batch size $K$. That is, as far as regret performance is concerned, Adam BS does not benefit from multiple samples in a batch, while our Adam CB enjoys faster convergence.
Hence, to the best of our knowledge, Adam CB is not only the first Adam-based algorithm with correct performance guarantees under adaptive batch selection, but also the method with the fastest convergence guarantees in terms of regret.

[Figure 1: Performance with the MLP model on MNIST, Fashion MNIST, and CIFAR-10. Panels plot training and test losses over 10 epochs for Adam, Adam X, Adam BS, and Adam CB (ours).]

[Figure 2: Performance with the CNN model on MNIST, Fashion MNIST, and CIFAR-10. Panels plot training and test losses over 10 epochs for Adam, Adam X, Adam BS, and Adam CB (ours).]

5 Numerical Experiments

Experimental Setup. To evaluate our proposed algorithm, Adam CB, we conduct experiments using deep neural networks, including multilayer perceptrons (MLPs) and convolutional neural networks (CNNs), on three benchmark datasets: MNIST, Fashion MNIST, and CIFAR-10. Comparisons are made with Adam, Adam X, and Adam BS, with all experiments implemented in PyTorch. Performance is assessed by plotting training and test losses over epochs, with the training loss computed on the full training set and the test loss computed on the held-out validation set. Results are averaged over five runs with different random seeds, with standard deviations reported. All methods use the same hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\gamma = 0.4$, $K = 128$, and $\alpha = 0.001$. Additional experimental details are provided in Appendix F.

Results. Figures 1 and 2 show that Adam CB consistently outperforms Adam, Adam X, and Adam BS, demonstrating faster reductions in both training and test losses across all datasets. These results suggest that combinatorial bandit sampling is more effective than uniform sampling for optimization.
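For intuition on how adaptive batch selection plugs into the Adam update, here is a self-contained toy sketch on synthetic least-squares data. The names and the simplified weight update are ours, not the paper's exact Algorithms 1-3; in particular, the per-sample gradient-norm reweighting below is a crude stand-in for Algorithm 3:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K, T = 200, 5, 16, 300
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star                                   # noiseless linear targets

alpha, beta1, beta2, gamma, eps = 1e-2, 0.9, 0.999, 0.4, 1e-8
theta = np.zeros(d)
m, v = np.zeros(d), np.zeros(d)
w = np.ones(n)                                       # per-sample weights

init_loss = np.mean((X @ theta - y) ** 2)
for t in range(1, T + 1):
    p = (1 - gamma) * w / w.sum() + gamma / n        # mixed sampling distribution
    idx = rng.choice(n, size=K, replace=False, p=p)
    resid = X[idx] @ theta - y[idx]
    grads = resid[:, None] * X[idx]                  # per-sample gradients of 0.5*resid^2
    g = (grads / (n * p[idx, None])).sum(axis=0) / K # importance-weighted estimate, cf. Eq.(7)
    m = beta1 * m + (1 - beta1) * g                  # Adam moment updates
    v = beta2 * v + (1 - beta2) * g * g
    theta -= alpha * m / (np.sqrt(v) + eps)
    # Upweight samples with large gradient norm (simplified weight update).
    gn = np.linalg.norm(grads, axis=1)
    w[idx] *= np.exp(gamma * gn / (gn.max() + 1e-12) / n)

final_loss = np.mean((X @ theta - y) ** 2)
assert final_loss < init_loss                        # training loss decreased
```

Even this crude version illustrates the mechanism: samples with larger gradient norms are revisited more often, while the $\gamma/n$ mixing keeps every $p_{i,t}$ bounded away from zero so the importance weights stay controlled.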
Attempts to replicate the results of Adam BS from Liu et al. (2020) revealed inconsistent outcomes, with significant fluctuations in losses, indicating potential instability and divergence. In contrast, Adam CB exhibits consistent convergence across all datasets, highlighting its superior performance and practical efficiency compared to Adam, Adam X, and Adam BS. Additional experimental results in Appendix F further reinforce the superior performance of Adam CB.

Reproducibility Statement

For each theoretical result, we present the complete set of assumptions in the main paper; detailed proofs of the main results are provided in the appendix, along with the experimental details and additional experiments in Appendix F needed to reproduce the main experimental results.

Acknowledgements

The authors would like to thank Se Young Chun and Byung Hyun Lee for bringing the previous work (Liu et al., 2020) to our attention. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2022-NR071853 and RS-2023-00222663) and by an AI-Bio Research Grant through Seoul National University.

References

Ahmet Alacaoglu, Yura Malitsky, Panayotis Mertikopoulos, and Volkan Cevher. A new regret analysis for Adam-type algorithms. In International Conference on Machine Learning, pp. 202–210. PMLR, 2020.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Aman Bansal, Shubham Anand Jain, and Bharat Khandelwal. Bag of tricks for faster & stable image classification. CS231n Course Project Report, 2022. URL https://cs231n.stanford.edu/reports/2022/pdfs/122.pdf.

Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and Quanquan Gu. Closing the generalization gap of adaptive gradient methods in training deep neural networks.
In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020.

Yineng Chen, Zuchao Li, Lefei Zhang, Bo Du, and Hai Zhao. Bidirectional looking with a novel double exponential moving average to adaptive and non-adaptive momentum optimizers. In International Conference on Machine Learning, pp. 4764–4803. PMLR, 2023.

Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of Adam and Adagrad. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=ZPQhzTSWA7.

Timothy Dozat. Incorporating Nesterov momentum into Adam. International Conference on Learning Representations, 2016.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Rajiv Gandhi, Samir Khuller, Srinivasan Parthasarathy, and Aravind Srinivasan. Dependent rounding and its applications to approximation algorithms. Journal of the ACM (JACM), 53(3):324–360, 2006.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In International Conference on Machine Learning, pp. 2525–2534. PMLR, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Rui Liu, Tianyi Wu, and Barzan Mozafari. Adam with bandit sampling for deep learning. Advances in Neural Information Processing Systems, 33:5393–5404, 2020.

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie.
A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986, 2022.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

Liangchen Luo, Yuanhao Xiong, and Yan Liu. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg3g2R9FX.

Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. In International Conference on Machine Learning, pp. 3750–3758. PMLR, 2018.

Vivak Patel, Shushu Zhang, and Bowen Tian. Global convergence and stability of stochastic gradient descent. Advances in Neural Information Processing Systems, 35:36014–36025, 2022.

Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pp. 314–323. PMLR, 2016.

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

Tijmen Tieleman and Geoffrey Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 17, 2012.

Phuong Thi Tran et al. On the convergence proof of AMSGrad and a new version. IEEE Access, 7:61706–61716, 2019.

Taishi Uchiya, Atsuyoshi Nakamura, and Mineichi Kudo. Algorithms for adversarial bandit problems with multiple plays. In International Conference on Algorithmic Learning Theory, pp. 375–389. Springer, 2010.

Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, and Zhi-Quan Luo. Adam can converge without any modification on update rules.
Advances in Neural Information Processing Systems, 35:28386–28399, 2022.

Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C. Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James Duncan. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems, 33:18795–18806, 2020.

Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11127–11135, 2019.

A Auxiliary Lemmas

Definition 1. A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if for all $u, v \in \mathbb{R}^d$ and all $\lambda \in [0, 1]$,
$$\lambda f(u) + (1 - \lambda) f(v) \ge f(\lambda u + (1 - \lambda) v).$$

Lemma 3. If a function $f : \mathbb{R}^d \to \mathbb{R}$ is convex, then for all $u, v \in \mathbb{R}^d$,
$$f(v) \ge f(u) + \nabla f(u)^\top (v - u),$$
where $(\cdot)^\top$ denotes the transpose of $(\cdot)$.

Lemma 4 (Cauchy-Schwarz inequality). For all $n \ge 1$ and $a_i, b_i \in \mathbb{R}$ ($1 \le i \le n$),
$$\Big( \sum_{i=1}^n a_i b_i \Big)^2 \le \Big( \sum_{i=1}^n a_i^2 \Big) \Big( \sum_{i=1}^n b_i^2 \Big).$$

Lemma 5 (Taylor series). For $\alpha \in \mathbb{R}$ with $0 \le \alpha < 1$,
$$\sum_{t \ge 1} \alpha^{t-1} = \frac{1}{1 - \alpha} \quad \text{and} \quad \sum_{t \ge 1} t \alpha^{t-1} = \frac{1}{(1 - \alpha)^2}.$$

Lemma 6 (Upper bound for the harmonic series). For $N \in \mathbb{N}$,
$$\sum_{n=1}^{N} \frac{1}{n} \le \ln N + 1 \quad \text{and} \quad \sum_{n=1}^{N} \frac{1}{\sqrt{n}} \le 2\sqrt{N}.$$

Lemma 7. For all $n \in \mathbb{N}$ and $a_i, b_i \in \mathbb{R}$ such that $a_i \ge 0$ and $b_i > 0$ for all $i \in [n]$,
$$\frac{\sum_{i=1}^n a_i}{\sum_{j=1}^n b_j} \le \sum_{i=1}^n \frac{a_i}{b_i}.$$

B Proof for Adam CB Regret Bound

In this section, we provide proofs of the key lemmas, Lemma 1 and Lemma 2, which are needed to prove Theorem 1, the regret bound for Adam CB. At the end of this section, we present the proof of Theorem 1.

B.1 Auxiliary Lemmas for Lemma 1

We first present auxiliary lemmas and proofs for Lemma 1. Our proofs largely follow the arguments in Tran et al. (2019). For the sake of completeness, all lemmas from Tran et al. (2019) are restated in our problem setting.

Lemma 8. For all $t \ge 1$, we have
$$\hat{v}_t = \max \left\{ \frac{(1 - \beta_{1,t})^2}{(1 - \beta_{1,s})^2} v_s \;:\; 1 \le s \le t \right\}, \qquad (9)$$
where $\hat{v}_t$ is as in Adam CB (Algorithm 1).

Proof. We prove Eq.(9) by induction on $t$. Recall that by the update rule on $\hat{v}_t$, we have $\hat{v}_1 \leftarrow v_1$ and $\hat{v}_t \leftarrow \max\big\{ \frac{(1 - \beta_{1,t})^2}{(1 - \beta_{1,t-1})^2} \hat{v}_{t-1},\; v_t \big\}$ if $t \ge 2$.
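The auxiliary lemmas above are elementary; as a quick numerical sanity check (reading Lemma 5 as $\sum_{t \ge 1} \alpha^{t-1} = \frac{1}{1-\alpha}$ and Lemma 7 as the sum-of-ratios bound, which are our readings):

```python
import math

# Lemma 5: geometric-series identities for 0 <= alpha < 1.
alpha = 0.9
s1 = sum(alpha ** (t - 1) for t in range(1, 2000))
s2 = sum(t * alpha ** (t - 1) for t in range(1, 2000))
assert abs(s1 - 1 / (1 - alpha)) < 1e-6
assert abs(s2 - 1 / (1 - alpha) ** 2) < 1e-3

# Lemma 6: the harmonic series is at most ln(N) + 1.
N = 10_000
assert sum(1 / k for k in range(1, N + 1)) <= math.log(N) + 1

# Lemma 7: (sum_i a_i) / (sum_j b_j) <= sum_i a_i / b_i for a_i >= 0, b_i > 0.
a, b = [0.3, 1.2, 0.0, 2.5], [0.5, 0.7, 1.1, 2.0]
assert sum(a) / sum(b) <= sum(ai / bi for ai, bi in zip(a, b))
```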
Thus, ˆv2 = max (1 β1,2)2 (1 β1,1)2 ˆv1, v2 = max (1 β1,2)2 (1 β1,1)2 v1, v2 = max (1 β1,2)2 (1 β1,s)2 vs, 1 s 2 Published as a conference paper at ICLR 2025 which we proved for the case when t = 2 in Eq.(9). Now, assume that ˆvt 1 = max (1 β1,t 1)2 (1 β1,s)2 vs for all 1 s t 1 , and Eq.(9) holds for all 1 j t 1. By the update rule on ˆvt, ˆvt = max (1 β1,t)2 (1 β1,t 1)2 ˆvt 1, vt = max (1 β1,t)2 (1 β1,t 1)2 max (1 β1,t 1)2 (1 β1,s)2 vs for all 1 s t 1 , vt = max max (1 β1,t)2 (1 β1,t 1)2 (1 β1,t 1)2 (1 β1,s)2 vs for all 1 s t 1 , (1 β1,t)2 (1 β1,t 1)2 vt = max max (1 β1,t)2 (1 β1,s)2 vs for all 1 s t 1 , (1 β1,t)2 (1 β1,t 1)2 vt = max (1 β1,t)2 (1 β1,s)2 vs for all 1 s t which ends the proof. Lemma 9. For all t 1, we have p ˆvt L γ(1 β1) where ˆvt is in Adam CB (Algorithm 1). Proof. By Lemma 8, ˆvt = max (1 β1,t)2 (1 β1,s)2 vs for all 1 s t Therefore, there is some 1 s t such that ˆvt = (1 β1,t)2 (1 β1,s)2 vs. Recall that by the update rule on vt, we have vt β2vt 1 + (1 β2)g2 t . This implies vt = (1 β2) k=1 βt k 2 g2 k (1 β1,s)2 vs k=1 βs k 2 g2 k k=1 βs k 2 ( max 1 r s gr )2 Recall the unbiased gradient estimate gt in Eq.(7), By the triangle inequality property of norms and the fact that pi,t γ/n and gi,t L for all i [n] and t [T] from Assumption 1, the unbiased gradient estimate is bounded by L/γ, i.e, Published as a conference paper at ICLR 2025 gt L/γ. Therefore, ˆvt (L/γ) p = (L/γ) 1 β1,t which ends the proof. Lemma 10. For the parameter settings and conditions assumed in Lemma 1, we have ln T + 1 (1 β1) 1 β2(1 η) g1:T,u Proof. Recall that by the update rule on mt, vt, we have mt β1,tmt 1 + (1 β1,t)gt and vt β2vt 1 + (1 β2)g2 t . 
This implies k=1 (1 β1,k) gk, vt = (1 β2) k=1 βt k 2 g2 k Since for all t 1, ˆvt,u vt,u by Lemma 8, we have tˆvt,u m2 t,u tvt,u h Pt k=1(1 β1,k) Qt r=k+1 β1,r gk,u i2 (1 β2)t Pt k=1 βt k 2 g2 k,u Pt k=1(1 β1,k)2 Qt r=k+1 β1,r Pt k=1 Qt r=k+1 β1,r g2 k,u (1 β2)t Pt k=1 βt k 2 g2 k,u Pt k=1 βt k 1 Pt k=1 βt k 1 g2 k,u (1 β2)t Pt k=1 βt k 2 g2 k,u 1 (1 β1) 1 β2 Pt k=1 βt k 1 g2 k,u q t Pt k=1 βt k 2 g2 k,u where the second inequality is by Lemma 4, the third inequality is from the fact that β1,k 1 and β1,k β1 for all 1 k T, and the fourth inequality is obtained by applying Lemma 5 to Published as a conference paper at ICLR 2025 Pt k=1 βt k 1 . Therefore, tˆvt,u 1 (1 β1) 1 β2 Pt k=1 βt k 1 g2 k,u q Pt k=1 βt k 2 g2 k,u 1 (1 β1) 1 β2 βt k 1 g2 k,u q βt k 2 g2 k,u = 1 (1 β1) 1 β2 βt k 2 |gk,u| = 1 (1 β1) 1 β2 k=1 ηt k|gk,u| where the second inequality is by Lemma 7 and we define η := β1 β2 . Therefore, tˆvt,u = 1 (1 β1) 1 β2 k=1 ηt k|gk,u| (10) It is sufficient to consider PT t=1 1 t Pt k=1 ηt k|gk,u|. Firstly, this can be expanded as: k=1 ηt k|gk,u| = η0|g1,u| η1|g1,u + η0|g2,u|] η2|g1,u + η1|g2,u| + η0|g3,u|] ηT 1|g1,u + ηT 2|g2,u| + + η0|g T,u|] Changing the role of |g1,u| as the common factor, we obtain, k=1 ηt k|gk,u| = |g1,u| η0 + 1 + |g T,u| 1 In other words, T X k=1 ηt k|gk,u| = Moreover, since k=t ηk t = 1 Published as a conference paper at ICLR 2025 where the last inequality is by Lemma 5, we obtain k=1 ηt k|gk,u| t=1 |gt,u| 1 Furthermore, since t=1 g2 t,u ( 1 + ln T) g1:T,u where the first inequality is by Lemma 4 and the last inequality is by Lemma 6, we obtain k=1 ηt k|gk,u| Hence, by Eq.(10), T X 1 + ln T (1 β1) 1 β2(1 η) g1:T,u which ends the proof. B.2 Proof for Lemma 1 Lemma 1. Suppose Assumptions 1-2 hold. 
Adam CB (Algorithm 1) with a mini-batch of size K, which is formed dynamically by distribution pt, achieves the following upper-bound for the cumulative online regret Rπ online(T) over T iterations, Rπ online(T) ρ1d v u u t 1 n2K where ρ1, ρ2, and ρ3 are defined as follows: ρ1 = D2L 2αγ(1 β1)2 , ρ2 = α 1 + ln T (1 β1)2 1 β2(1 η), ρ3 = dβ1D2L 2αγ(1 β1)2(1 λ)2 Note that d is the dimension of parameter space and the inputs of Algorithm 1 follows these conditions: (a) αt = α t, (b) β1, β2 [0, 1), β1,t := β1λt 1 for all t [T], λ (0, 1), (c) η = β1/ β2 1, and (d) γ [0, 1). Proof. Recall Lemma 3. Since ft : Rd R is convex, we have, ft(θ ) ft(θt) g T t (θ θt). This means that ft(θt) ft(θ ) g T t (θt θ ) = u=1 gt,u(θt,u θ ,u) From the parameter update rule presented in Algorithm 1, θt+1 = θt αtmt/ p β1,t ˆvt mt 1 + (1 β1,t) ˆvt gt We focus on the u-th dimension of the parameter vector θt Rd. Substract the scalar θ ,u and square both sides of the above update rule, we have, (θt+1,u θ ,u)2 = (θt,u θ ,u)2 2αt ˆvt,u mt 1,u + (1 β1,t) p (θt,u θ ,u) + α2 t Published as a conference paper at ICLR 2025 We can rearrange the above equation gt,u(θt,u θ ,u) = ˆvt,u 2αt(1 β1,t) (θt,u θ ,u)2 (θt+1,u θ ,u)2 + αt 2(1 β1,t) m2 t,u p ˆvt,u β1,t (1 β1,t)mt 1,u(θt,u θ ,u) (11) Rπ online(T) = E t=1 ft(θt) min θ Rd t=1 [ft(θt) ft(θ )] where θ arg minθ Rd PT t=1 ft(θ) is defined as the optimal parameter that minimizes the cumulative loss over given T iterations. Hence, Rπ online(T) = E t=1 [ft(θt) ft(θ )] t=1 g T t (θt θ ) u=1 gt,u(θt,u θ ,u) Combining Eq.(11) with Eq.(12), we obtain Rπ online(T) E ˆvt,u 2αt(1 β1,t) (θt,u θ ,u)2 (θt+1,u θ ,u)2 # αt 2(1 β1,t) m2 t,u p β1,t (1 β1,t)mt 1,u(θ ,u θt,u) On the other hand, for all t 2, we have mt 1,u(θ ,u θt,u) = (ˆvt 1,u)1/4 αt 1 (θ ,u θt,u) αt 1 mt 1,u (ˆvt 1,u)1/4 ˆvt 1,u 2αt 1 (θ ,u θt,u)2 + αt 1 m2 t 1,u 2 p ˆvt 1,u where the inequality is from the fact that pq p2/2 + q2/2 for any p, q R. 
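Throughout this proof, $g_t$ denotes the importance-weighted gradient estimate of Eq.(7). Its unbiasedness, $\mathbb{E}[g_t] = \frac{1}{n}\sum_{i=1}^n g_{i,t}$, can be checked numerically; the shapes and the sampling distribution below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K, trials = 8, 3, 3, 100_000
G = rng.normal(size=(n, d))                 # per-sample gradients g_{i,t}
full_grad = G.mean(axis=0)                  # (1/n) * sum_i g_{i,t}

r = rng.random(n)
p = 0.5 / n + 0.5 * r / r.sum()             # some distribution, bounded away from 0

# `trials` mini-batches of K indices drawn i.i.d. from p (with replacement).
J = rng.choice(n, size=(trials, K), p=p)
# Estimate (1/K) * sum_{j in J_t} g_{j,t} / (n * p_{j,t}), averaged over trials.
est = (G[J] / (n * p[J][..., None])).mean(axis=(0, 1))

assert np.allclose(est, full_grad, atol=0.02)   # matches the full-data gradient
```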
Hence, Rπ online(T) E ˆvt,u 2αt(1 β1,t) (θt,u θ ,u)2 (θt+1,u θ ,u)2 # αt 2(1 β1,t) m2 t,u p β1,tαt 1 2(1 β1,t) m2 t 1,u p ˆvt 1,u 2αt 1(1 β1,t)(θ ,u θt,u)2 # Since β1,t β1(1 t T), we obtain ˆvt 1,u 2αt 1(1 β1,t)(θ ,u θt,u)2 ˆvt 1,u 2αt 1(1 β1)(θ ,u θt,u)2 Moreover, we have β1,tαt 1 2(1 β1,t) m2 t 1,u p β1,t+1αt 2(1 β1,t+1) m2 t,u p αt 2(1 β1,t+1) m2 t,u p αt 2(1 β1) m2 t,u p Published as a conference paper at ICLR 2025 where the last inequality is from the assumption that β1,t β1 < 1(1 t T). Therefore, αt 2(1 β1,t) m2 t,u p β1,tαt 1 2(1 β1,t) m2 t 1,u p and we obtain the bound for Rπ online(T) as: Rπ online(T) E ˆvt,u 2αt(1 β1,t) (θt,u θ ,u)2 (θt+1,u θ ,u)2 # ˆvt 1,u 2αt 1(1 β1)(θ ,u θt,u)2 # Now, we start to bound each term: (13), (14), and (15). Bound for the term (13). Let us rewrite the term (13) as ˆvt,u 2αt(1 β1,t) (θt,u θ ,u)2 (θt+1,u θ ,u)2 # ˆv1,u 2α1(1 β1,1)(θ1,u θ ,u)2 # ˆvt,u 2αt(1 β1,t)(θt,u θ ,u)2 # ˆvt 1,u 2αt 1(1 β1,t 1)(θt,u θ ,u)2 # ˆv T,u 2αT (1 β1,T )(θT,u θ ,u)2 # Omitting the last term and replacing αt = α/ t(1 t T), we obtain ˆvt,u 2αt(1 β1,t) (θt,u θ ,u)2 (θt+1,u θ ,u)2 # ˆv1,u 2α(1 β1,1)(θ1,u θ ,u)2 # t=2 (θt,u θ ,u)2 p tˆvt,u (1 β1,t) (t 1)ˆvt 1,u (1 β1,t 1) Recall that by the update rule on ˆvt, we have ˆvt,u max n (1 β1,t)2 (1 β1,t 1)2 ˆvt 1,u, vt,u o . Therefore, ˆvt,u (1 β1,t)2 (1 β1,t 1)2 ˆvt 1,u, and hence tˆvt,u (1 β1,t) (t 1)ˆvt 1,u (1 β1,t 1) t (1 β1,t)2 (1 β1,t 1)2 ˆvt 1,u (t 1)ˆvt 1,u (1 β1,t 1) tˆvt 1,u (1 β1,t 1) (t 1)ˆvt 1,u (1 β1,t 1) > 0 Published as a conference paper at ICLR 2025 Now by the positivity of the essential formula tˆvt,u (1 β1,t) (t 1)ˆvt 1,u (1 β1,t 1) , we obtain ˆvt,u 2αt(1 β1,t) (θt,u θ ,u)2 (θt+1,u θ ,u)2 # ˆv1,u (1 β1) + D2 tˆvt,u (1 β1,t) (t 1)ˆvt 1,u (1 β1,t 1) T ˆv T,u (1 β1,T ) d D2L 2αγ(1 β1)2 where the last inequality is by Lemma 9. Bound for the term (14). ln T + 1 (1 β1) 1 β2(1 η) g1:T,u ln T + 1 (1 β1)2 1 β2(1 η) u=1 E [ g1:T,u ] where the last inequality is by Lemma 10. Bound for the term (15). 
By Assumption 2 that θm θn D for any m, n [T], αt = α/ t, and β1,t = β1λt 1 β1 1, we obtain ˆvt 1,u 2αt 1(1 β1)(θ ,u θt,u)2 # (t 1)ˆvt 1,u Therefore, from Lemma 9, we obtain ˆvt 1,u 2αt 1(1 β1)(θ ,u θt,u)2 # d D2L 2αγ(1 β1)2 E t=2 β1λt 1p t=2 β1tλt 1 β1 (1 λ)2 where the first inequality is from the fact that β1 1, and the last inequality is from Lemma 5. Thus, the bound for the term (15) is ˆvt 1,u 2αt 1(1 β1)(θ ,u θt,u)2 # dβ1D2L 2αγ(1 β1)2(1 λ)2 We bounded for terms (13), (14), and (15). Rπ online(T) d D2L 2αγ(1 β1)2 ln T + 1 (1 β1)2 1 β2(1 η) u=1 E [ g1:T,u ] + dβ1D2L 2αγ(1 β1)2(1 λ)2 Rπ online(T) ρ1d u=1 E [ g1:T,u ] + ρ3 (16) Published as a conference paper at ICLR 2025 where ρ1, ρ2, and ρ3 are defined as the following: ρ1 = D2L 2αγ(1 β1)2 , ρ2 = α 1 + ln T (1 β1)2 1 β2(1 η), ρ3 = dβ1D2L 2αγ(1 β1)2(1 λ)2 Now, we consider Pd u=1 E [ g1:T,u ], which is in the right-hand side of Eq.(16). u=1 E [ g1:T,u ] = d t=1 E [ gt 2] where the first inequality is due to the concavity of square root. Recall that the unbiased gradient estimate is gt = 1 j Jt gj,t npj,t . Hence, Rπ online(T) ρ1d u=1 Ept [ g1:T,u ] + ρ3 t=1 Ept [ gt 2] + ρ3 v u u u u t The last inequality uses Jensen s inequality to the convex function 2. Therefore, Rπ online(T) ρ1d v u u u u t v u u u t 1 n2K where the last inequality is by Lemma 4. This completes the proof of Lemma 1. B.3 Proof for Lemma 2 Lemma 2. Suppose Assumptions 1-2 hold. If we set γ = min 1, q , the batch selection (Algorithm 2) and the weight update rule (Algorithm 3) following Adam CB (Algorithm 1) implies Proof. We set ℓj,t = p2 min L2 gj,t 2 (pj,t)2 + L2 in Algorithm 3. Since gi,t L and pi,t pmin for all i [n] and t [T] by Assumption 1, we have ℓi,t [0, 1]. Let Wt := Pn i=1 wt. Then, for any t [T], Wt Wt 1 = X i [n]\Snull,t wi,t Wt 1 + X i [n]\Snull,t Wt 1 exp Kγ Published as a conference paper at ICLR 2025 The last equality is by the weight update rule in Algorithm 3. 
From the probability computation in Algorithm 2, we have pi,t = K (1 γ) wi,t 1 Pn j=1 wj,t 1 + γ Thus, we obtain the following bound, n ˆℓi,t = Kγℓi,t npi,t ℓi,t 1 By the fact that e x 1 x + (e 2)x2 for all x [0, 1], and considering Kγ n ˆℓi,t as x, we have i [n]\Snull,t n ˆℓi,t + (e 2) Kγ n ˆℓi,t 2 + X i [n]\Snull,t n ˆℓi,t + (e 2) Kγ i [n]\Snull,t n ˆℓi,t + (e 2) Kγ i [n]\Snull,t pi,tˆℓi,t + K(e 2)γ2 i [n]\Snull,t pi,t(ˆℓi,t)2 i Jt\Snull,t ℓi,t + K(e 2)γ2 The last inequality uses the fact that pi,tˆℓi,t = ℓi,t 1 for i Jt and pi,tˆℓi,t = 0 for i / Jt. Taking logarithms and using the fact that ln(1 + x) x for all x > 1 gives Wt 1 γ n(1 γ) i Jt\Snull,t ℓi,t + K(e 2)γ2 By summing over t, we obtain W1 γ n(1 γ) i Jt\Snull,t ℓi,t + K(e 2)γ2 On the other hand, for the sequence {J t }T t=1 of batches with the optimal PT t=1 P j Jt ℓj,t among all subsets Jt containing K elements, P j J t wj,T W1 P j J t ln wj,T t:j / Snull,t ˆℓj,t + ln K The first line above uses the fact that X j J t wj,T K(Πj J t wj,T )1/K and the second line uses wj,T = exp (Kγ/n) P t:j / Snull,t ˆℓj,t . From combining results, t:j / Snull,t i Jt\Snull,t ℓi,t + (e 2)Kγ Published as a conference paper at ICLR 2025 t:j Snull,t ℓj,t 1 1 γ PT t=1 P i Snull,t ℓi,t trivially holds, we have t:j / Snull,t t:j Snull,t ℓj,t + n i Jt ℓi,t + (e 2)Kγ Let LMIN-K(T) := PT t=1 P j J t ℓj,t and LEXP3-K(T) := PT t=1 P j Jt ℓj,t. Taking the expectation of both sides and using the properties of ˆℓi,t, we obtain, LMIN-K(T) + n n 1 (1 γ)E[LEXP3-K(T)] + (e 2)Kγ This is because the expectation of ˆℓj,t is ℓj,t from the fact that Dep Round selects i-th sample with probability pi,t. 
Since PT t=1 Pn i=1 ℓi,t n LMIN-K(T ) K , we have the following statement, LMIN-K(T) E[LEXP3-K(T)] (e 1)γLMIN-K(T) + n Using the fact that LMIN-K(T) TK and choosing the input parameter as γ = , we obtain the following, LMIN-K(T) E[LEXP3-K(T)] 2 Therefore, considering the scaling factor, we have: p2 min (LMIN-K(T) E[LEXP3-K(T)]) This completes the proof of Lemma 2. Published as a conference paper at ICLR 2025 B.4 Proof for Theorem 1 (Regret Bound of Adam CB) In this section, we present the full proof of Theorem 1. Recall that the online regret only focuses on the minimization over the sequence of mini-batch datasets {Dt}T t=1. Thus, the online regret of the algorithm at the end of T iterations is defined as Rπ online(T) := E t=1 f(θt; Dt) min θ Rd t=1 f(θ; Dt) However, our ultimate goal is to find the optimal selection of the parameter under the full dataset. Consider an online optimization algorithm π that computes the sequence of model parameters θ1, . . . , θT . Then, we can compare the performance of π with the optimal selection of the parameter minθ Rd f(θ; D) under the full dataset. The cumulative regret after T iterations is t=1 f(θt; D) T min θ Rd f(θ; D) where the expectation is taken with respect to any stochasticity in data sampling and parameter estimation. Before we prove Theorem 1, we first prove the following lemma. Lemma 11. The cumulative regret Rπ(T) can be decomposed into sub-parts which includes the cumulative online regret Rπ online(T) and additional terms that are sub-linear in T: Rπ(T) = Rπ online(T) + O( Proof. First, rewrite Rπ(T) by expanding the terms inside the expectations. We add and subtract the sum PT t=1 f(θt; Dt) inside the expectation: t=1 f(θt; D) T min θ Rd f(θ; D) t=1 f(θt; D) t=1 f(θt; Dt) + t=1 f(θt; Dt) T min θ Rd f(θ; D) We also add and subtract the term minθ Rd PT t=1 f(θ; Dt) inside the expectation. 
Then, we have the following, t=1 f(θt; D) t=1 f(θt; Dt) + t=1 f(θt; Dt) T min θ Rd f(θ; D) t=1 f(θt; D) t=1 f(θt; Dt) t=1 f(θt; Dt) min θ Rd t=1 f(θ; Dt) t=1 f(θ; Dt) T min θ Rd f(θ; D) Since the second term of the right-hand side in above equation is equal the online cumulative regret Rπ online(T), we can rewrite Rπ(T) as: Rπ(T) = Rπ online(T) t=1 f(θt; D) t=1 f(θt; Dt) t=1 f(θ; Dt) T min θ Rd f(θ; D) Now, let us consider each term in detail. Published as a conference paper at ICLR 2025 Bound for the term (17). Recall the expression of f(θ; D) and ft; = f(θ; Dt): f(θ; D) = 1 i=1 ℓ(θ; xi, yi), f(θ; Dt) = 1 ℓ(θ; xj, yj) where Jt is the set of indices in the subset dataset (mini-batch) at iteration t, Dt D. For any θ Rd, we have E[f(θ; Dt)] = E ℓ(θ; xj, yj) j Jt E ℓ(θ; xj, yj) ℓ(θ; xi, yi) npi,t pi,t = 1 i=1 ℓ(θ; xi, yi) = f(θ; D). Note that, by linearity of expectation, we can interchange the expectation and the summation. Since E[f(θ; Dt)] = f(θ; D), we have for the term (17) as: t=1 f(θt; D) t=1 f(θt; Dt) t=1 [f(θt; D) f(θt; Dt)] t=1 E[f(θt; D) f(θt; Dt)] = 0 Bound for the term (18). Let θ be the parameter that minimizes the cumulative loss over the full dataset D, i.e, θ arg minθ Rd f(θ; D). Since θ is optimal for the full dataset, we have: min θ Rd f(θ; D) = f(θ ; D) Similarly, denote the optimal parameter for the cumulative regret for mini-batch datasets by θ t := arg minθ Rd PT t=1 f(θ; Dt). Given these notations, we can write the term (18) as: t=1 f(θ; Dt) T min θ Rd f(θ; D) t=1 f(θ t ; Dt) T f(θ ; D) We can add and subtract the term PT t=1 f(θ ; Dt) inside the expectation. t=1 f(θ t ; Dt) T f(θ ; D) t=1 f(θ t ; Dt) t=1 f(θ ; Dt) t=1 f(θ ; Dt) T f(θ ; D) Note that E[f(θ ; Dt)] = f(θ ; D) holds as we have shown when bounding the term (17). 
By the linearity of expectation, we have t=1 f(θ ; Dt) t=1 E[f(θ ; Dt)] = T f(θ ; D) Since E h PT t=1 f(θ ; Dt) T f(θ ; D) i = 0 holds, the term (18) reduces to t=1 (f(θ t ; Dt) f(θ ; Dt)) t=1 (ft(θ t ) ft(θ )) Published as a conference paper at ICLR 2025 By the convexity of ft, we have: ft(θ t ) ft(θ ) g T t (θ t θ ) Therefore, t=1 (ft(θ t ) ft(θ )) t=1 g T t (θ t θ ) Using bounded gradients assumption (Assumption 1), i.e, gt L/γ (Proof in Lemma 9), and Cauchy-Schwarz inequality (Lemma 4), we have t=1 g T t (θ t θ ) t=1 E[ gt θ t θ ] (L/γ) t=1 E[ θ t θ ] Recall the parameter update rule, θt+1 θt αtmt/( ˆvt + ϵ). Then θ t+1 θ θ t θ + αt mt/( p ˆvt + ϵ) (19) Now, we claim that mt is bounded. The update rule for the first moment estimate: mt β1,tmt 1 + (1 β1,t)gt Then, the expression for mt is: k=1 (1 β1,k) where β1,t = β1λt 1 with β1 < 1 and λ < 1. Note that gk is bounded by L/γ for all k. This implies that: k=1 |1 β1,k| k=1 |1 β1λk 1| r=k+1 β1λr 1 k=1 βt k 1 λ t(t 1) k(k 1) L γ(1 β1) The last inequality is due to Lemma 5. Therefore, the step size in Eq.(19) is bounded by: αt mt ˆvt + ϵ αt L ϵγ(1 β1) = αL We use the fact that αt = α/ t. By summing over T iterations, we obtain t=1 E[ θ t θ ] αL ϵγ(1 β1) The last inequality is by Lemma 6. Finally, we get t=1 E[ θ t θ ] 2αL2 T ϵγ2(1 β1) = O( In summary, the cumulative regret Rπ(T) is decomposed by the following: Rπ(T) = Rπ online(T) + (17) + (18) where (17) = 0 and (18) = O( T). Thus, this completes the proof of Lemma 11, saying Rπ(T) = Rπ online(T) + O( Published as a conference paper at ICLR 2025 Now, we prove the main Theorem 1. Proof. From Lemma 11, we have shown that the cumulative regret Rπ(T) can be decomposed into the online regret Rπ online(T) with the additional sub-linear terms. Hence, we are left to bound the cumulative online regret Rπ online(T). 
Recall the first key lemma (Lemma 1): Rπ online(T) ρ1d v u u t 1 n2K Recall also the second key lemma (Lemma 2): Let we denote M := minpt PT t=1 Ept h P j Jt gj,t 2 (pj,t)2 i . Then by Lemma 2, we have where C > 0 is a constant. By plugging above equation to Lemma 1, we obtain Rπ online(T) ρ1d We use the fact that b in the second inequality and we define ρ4 := ρ2 Now, we should consider M. Using the tower property, we can express M as, (pj,t)2 | pt (pi,t)2 pi,t For this minimization problem, it can be shown that for every iteration t, the optimal distribution p t is proportional to the gradient norm of individual example. Formally speaking, for any t, the optimal solution p t to the problem arg minpt PT t=1 Ept h Pn i=1 gi,t 2 i is (pj,t) = gj,t Pn i=1 gi,t for all j [n]. By plugging this solution, Published as a conference paper at ICLR 2025 By plugging M to the online regret bound expression, Rπ online(T) ρ1d By Assumption 1, gi,t L for i [n] and t [T]. Then, the second term in the right-hand side of above inequality is bounded by Lρ2 d T, which diminishes by the first term that have order of O(d T). Hence, the online regret Rπ online(T) after T iterations is, Rπ online(T) O(d Finally, by Lemma 11, we can bound the cumulative regret using the bound of the online regret as Rπ(T) = Rπ online(T) + O( This completes the proof of Theorem 1. Published as a conference paper at ICLR 2025 C Proof for Convergence Rate when using Uniform Sampling To compare the convergence rate between using uniform sampling and bandit sampling, we will now prove the following Theorem 2. It is important to note that Theorem 2 includes an additional condition Assumption 3 which was not present in Theorem 1. This assumption plays a key role in distinguishing the results between these two theorems. Theorem 2. Suppose Assumptions 1,2, and 3 hold. The convergence rate for Adam X (variant of Adam) using uniform sampling is given by: Proof. 
We start the proof from the first key lemma (Lemma 1): Lemma 1. Suppose Assumptions 1-2 hold. Adam CB (Algorithm 1) with a mini-batch of size K, which is formed dynamically by distribution pt, achieves the following upper-bound for the cumulative online regret Rπ online(T) over T iterations, Rπ online(T) ρ1d v u u t 1 n2K where ρ1, ρ2, and ρ3 are defined as follows: ρ1 = D2L 2αγ(1 β1)2 , ρ2 = α 1 + ln T (1 β1)2 1 β2(1 η), ρ3 = dβ1D2L 2αγ(1 β1)2(1 λ)2 Note that d is the dimension of parameter space and the inputs of Algorithm 1 follows these conditions: (a) αt = α t, (b) β1, β2 [0, 1), β1,t := β1λt 1 for all t [T], λ (0, 1), (c) η = β1/ β2 1, and (d) γ [0, 1). Consider the second term in the right-hand side of Eq.(20), (pj,t)2 | pt (pi,t)2 pi,t The tower property is used in the first equality. Since Pn i=1 gi,t 2 pi,t is independent to j Jt, the mini-batch size K is multiplied in the last equality. Therefore, we can express the cumulative online regret Rπ online(T) as: Rπ online(T) ρ1d In the case when we select samples uniformly, we can set the probability distribution pt to satisfy pi,t = 1/n for all t [T] and i [n]. By plugging it, we obtain Rπ online(T) ρ1d i=1 gi,t 2 # Published as a conference paper at ICLR 2025 Now, recall Assumption 3: Assumption 3. There exists σ > 0 such that Var( gi,t ) σ2 for all i [n] and t [T] i=1 gi,t 2 # i=1 gi,t 2 + σ2 Therefore, the online regret bound Rπ online(T) for uniform sampling is, Rπ online(T) = O(d Applying the fact that b, we obtain, Rπ online(T) = O(d By Assumption 1, gi,t L for i [n] and t [T]. Then, the second term in the right-hand side of above inequality is bounded by O( d T), which diminishes by the first term that have order of O(d T). Hence, the online regret Rπ online(T) after T iterations is given by Rπ online(T) = O(d Finally, by Lemma 11, we can bound the cumulative regret using the online regret, which completes the regret analysis for uniform sampling. 
$$R^\pi(T) = R^\pi_{\text{online}}(T) + O(\sqrt{T}).$$

D Correction of Adam BS (Liu et al., 2020)

This section presents the corrected analysis for Adam BS (Liu et al., 2020). We use Algorithm 4 and Algorithm 5 for the modified Adam BS.

Algorithm 4: (Corrected) Adam with Bandit Sampling (Adam BS)
Input: learning rates $\{\alpha_t\}_{t=1}^T$, decay rates $\{\beta_{1,t}\}_{t=1}^T$, $\beta_2$, batch size $K$, exploration parameter $\gamma \in [0, 1)$
Initialize: model parameters $\theta_0$; first moment estimate $m_0 \leftarrow 0$; second moment estimates $v_0 \leftarrow 0$, $\hat{v}_0 \leftarrow 0$; sample weights $w_{i,0} \leftarrow 1$ for all $i \in [n]$
1: for $t = 1$ to $T$ do
2:   Compute the sampling distribution $p_t$: for all $j \in [n]$, $p_{j,t} = (1 - \gamma)\,\frac{w_{j,t-1}}{\sum_{i=1}^n w_{i,t-1}} + \frac{\gamma}{n}$
3:   Select a mini-batch $D_t := \{(x_j, y_j)\}_{j \in J_t}$ by sampling with replacement from $p_t$
4:   Compute the unbiased gradient estimate $g_t$ with respect to the mini-batch $D_t$ using Eq.(7)
5:   $m_t \leftarrow \beta_{1,t} m_{t-1} + (1 - \beta_{1,t}) g_t$
6:   $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
7:   $\hat{v}_1 \leftarrow v_1$; $\hat{v}_t \leftarrow \max\big\{ \frac{(1 - \beta_{1,t})^2}{(1 - \beta_{1,t-1})^2} \hat{v}_{t-1},\; v_t \big\}$ if $t \ge 2$
8:   $\theta_{t+1} \leftarrow \theta_t - \alpha_t m_t / (\sqrt{\hat{v}_t} + \epsilon)$
9:   $w_t \leftarrow$ Weight-Update($w_{t-1}, p_t, J_t, \{g_{j,t}\}_{j \in J_t}, \gamma$) (Algorithm 5)

Algorithm 5: (Corrected) Weight-Update for Adam BS
Input: $w_{t-1}$, $p_t$, $J_t$, $\{g_{j,t}\}_{j \in J_t}$, and $\gamma \in [0, 1)$
1: for $j = 1$ to $n$ do
2:   Compute the loss $\ell_{j,t} = \frac{p_{\min}^2}{L^2} \cdot \frac{\|g_{j,t}\|^2}{(p_{j,t})^2 + L^2}$ if $j \in J_t$; otherwise, $\ell_{j,t} = 0$
3:   Compute the unbiased loss estimate $\hat{\ell}_{j,t} = \frac{\ell_{j,t} \sum_{k=1}^K \mathbb{I}(j = J_t^k)}{K p_{j,t}}$
4:   Update the sample weight $w_{j,t} \leftarrow w_{j,t-1} \exp\big( \gamma \hat{\ell}_{j,t} / n \big)$
5: return $w_t$

At iteration $t \in [T]$, Adam BS chooses a mini-batch $D_t = \{(x_j, y_j)\}_{j \in J_t}$ of size $K$ according to the probability distribution $p_t$, with replacement. We denote by $J_t$ the set of indices for the mini-batch $D_t$. The algorithm then receives the loss of the mini-batch, treating the losses of all chosen samples as the single loss $\frac{1}{K} \sum_{j \in J_t} \ell_{j,t}$, where each $\ell_{j,t} \in [0, 1]$. The unbiased estimate of the loss $\ell_{j,t}$ is
$$\hat{\ell}_{j,t} = \frac{\ell_{j,t} \sum_{k=1}^K \mathbb{I}(j = J_t^k)}{K p_{j,t}}.$$

We have the following key lemma concerning the rate of convergence of Adam BS.

Lemma 12 (Corrected version of Lemma 1 in Liu et al. (2020)). Suppose Assumptions 1-2 hold.
If we set $\gamma = \min\big\{1, \sqrt{\frac{n\ln n}{(e-1)T}}\big\}$, the weight update rule (Algorithm 5) following Adam BS (Algorithm 4) implies
$$L_{\text{MIN}}(T) - \mathbb{E}[L_{\text{EXP3}}(T)] \le 2.63\,K\sqrt{nT\ln n}.$$

Proof. We set $\ell_{j,t} = \frac{p_{\min}^2}{2L^2}\big(\frac{\|g_{j,t}\|^2}{(p_{j,t})^2} + \frac{L^2}{p_{\min}^2}\big)$ in Algorithm 5. Since $\|g_{i,t}\| \le L$ and $p_{i,t} \ge p_{\min}$ for all $t \in [T]$, $i \in [n]$ by Assumption 1, we have $\ell_{i,t} \in [0,1]$.

We use the following simple facts, which are immediately derived from the definitions:
$$\sum_{i=1}^{n} p_{i,t}\,\hat{\ell}_{i,t} = \frac{1}{K}\sum_{j\in J_t}\ell_{j,t} =: \ell^{J_t}_t, \qquad (21)$$
$$\sum_{i=1}^{n} p_{i,t}\,(\hat{\ell}_{i,t})^2 = \sum_{i=1}^{n}\hat{\ell}_{i,t}\cdot\ell_{i,t}\,\frac{\sum_{k=1}^{K}\mathbb{I}(i = J_t^k)}{K} \le \sum_{i=1}^{n}\hat{\ell}_{i,t}. \qquad (22)$$
Let $W_t := \sum_{i=1}^{n} w_{i,t}$. Then, for any $t \in [T]$,
$$\frac{W_t}{W_{t-1}} = \sum_{i=1}^{n}\frac{w_{i,t-1}}{W_{t-1}}\exp\big(\tfrac{\gamma}{n}\hat{\ell}_{i,t}\big).$$
The last equality is by the weight update rule in Algorithm 5. From the probability computation in Algorithm 4, we have
$$\frac{w_{i,t-1}}{W_{t-1}} = \frac{p_{i,t} - \gamma/n}{1-\gamma}.$$
Thus, since $p_{i,t} \ge \gamma/n$, we obtain the following bound:
$$\frac{\gamma}{n}\,\hat{\ell}_{i,t} = \frac{\gamma}{n}\,\ell_{i,t}\,\frac{\sum_{k=1}^{K}\mathbb{I}(i = J_t^k)}{K\,p_{i,t}} \le 1.$$
By the fact that $e^x \le 1 + x + (e-2)x^2$ for all $x \in [0,1]$, taking $x = \frac{\gamma}{n}\hat{\ell}_{i,t}$, we have
$$\frac{W_t}{W_{t-1}} \le \sum_{i=1}^{n}\frac{w_{i,t-1}}{W_{t-1}}\Big(1 + \frac{\gamma}{n}\hat{\ell}_{i,t} + (e-2)\big(\tfrac{\gamma}{n}\hat{\ell}_{i,t}\big)^2\Big) \le 1 + \frac{\gamma/n}{1-\gamma}\sum_{i=1}^{n}p_{i,t}\hat{\ell}_{i,t} + \frac{(e-2)(\gamma/n)^2}{1-\gamma}\sum_{i=1}^{n}p_{i,t}(\hat{\ell}_{i,t})^2 \le 1 + \frac{\gamma/n}{1-\gamma}\,\ell^{J_t}_t + \frac{(e-2)(\gamma/n)^2}{1-\gamma}\sum_{i=1}^{n}\hat{\ell}_{i,t}.$$
The last inequality uses Eq.(21) and Eq.(22). Taking logarithms and using the fact that $\ln(1+x) \le x$ for all $x > -1$ gives
$$\ln\frac{W_t}{W_{t-1}} \le \frac{\gamma/n}{1-\gamma}\,\ell^{J_t}_t + \frac{(e-2)(\gamma/n)^2}{1-\gamma}\sum_{i=1}^{n}\hat{\ell}_{i,t}.$$
By summing over $t$, we obtain
$$\ln\frac{W_T}{W_0} \le \frac{\gamma/n}{1-\gamma}\sum_{t=1}^{T}\ell^{J_t}_t + \frac{(e-2)(\gamma/n)^2}{1-\gamma}\sum_{t=1}^{T}\sum_{i=1}^{n}\hat{\ell}_{i,t}.$$
On the other hand, for any action $j$,
$$\ln\frac{W_T}{W_0} \ge \frac{\gamma}{n}\sum_{t=1}^{T}\hat{\ell}_{j,t} - \ln n.$$
From combining the results,
$$\sum_{t=1}^{T}\ell^{J_t}_t \ge (1-\gamma)\sum_{t=1}^{T}\hat{\ell}_{j,t} - \frac{n\ln n}{\gamma} - (e-2)\,\frac{\gamma}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\hat{\ell}_{i,t}.$$
We next take the expectation of both sides with respect to the probability distribution $p_t$, and since $\mathbb{E}_{p_t}[\hat{\ell}_{j,t}] = \ell_{j,t}$, we have
$$\mathbb{E}\Big[\sum_{t=1}^{T}\ell^{J_t}_t\Big] \ge (1-\gamma)\sum_{t=1}^{T}\ell_{j,t} - \frac{n\ln n}{\gamma} - (e-2)\,\frac{\gamma}{n}\sum_{t=1}^{T}\sum_{i=1}^{n}\ell_{i,t}.$$
Since $j$ was chosen arbitrarily, we can choose the best $J^*_t$ for every iteration $t$. Let $L_{\text{MIN}}(T) := \sum_{t=1}^{T}\sum_{j\in J^*_t}\ell_{j,t}$ and $L_{\text{EXP3}}(T) := \sum_{t=1}^{T}\sum_{j\in J_t}\ell_{j,t}$.
Summing over $j \in J^*_t$, and using the fact that $\sum_{t=1}^{T}\sum_{i=1}^{n}\ell_{i,t} \le \frac{n\,L_{\text{MIN}}(T)}{K}$, we have the following statement:
$$\mathbb{E}[L_{\text{EXP3}}(T)] \ge (1-\gamma)L_{\text{MIN}}(T) - \frac{Kn\ln n}{\gamma} - (e-2)\gamma L_{\text{MIN}}(T).$$
Then, we get the following:
$$L_{\text{MIN}}(T) - \mathbb{E}[L_{\text{EXP3}}(T)] \le (e-1)\gamma L_{\text{MIN}}(T) + \frac{Kn\ln n}{\gamma}.$$
Using the fact that $L_{\text{MIN}}(T) \le TK$ and choosing the input parameter $\gamma = \min\big\{1, \sqrt{\frac{n\ln n}{(e-1)T}}\big\}$, we obtain
$$L_{\text{MIN}}(T) - \mathbb{E}[L_{\text{EXP3}}(T)] \le 2K\sqrt{(e-1)\,nT\ln n} \le 2.63\,K\sqrt{nT\ln n}.$$
Therefore, accounting for the scaling factor of the losses, we have:
$$\frac{2L^2}{p_{\min}^2}\big(L_{\text{MIN}}(T) - \mathbb{E}[L_{\text{EXP3}}(T)]\big) = O\Big(\frac{KL^2}{p_{\min}^2}\sqrt{nT\ln n}\Big).$$

Theorem 3 (Corrected version of Theorem 4 in Liu et al. (2020)). Suppose Assumptions 1-2 hold. The convergence rate for (corrected) Adam BS using bandit sampling is given by:
$$R^{\pi}(T) = O\Big(d\sqrt{T} + \frac{d}{n^{3/4}}\,(T\ln n)^{1/4}\Big).$$

Proof. From Lemma 11, we have shown that the cumulative regret $R^{\pi}(T)$ can be decomposed into the online regret $R^{\pi}_{\text{online}}(T)$ plus additional sub-linear terms. Hence, we are left to bound the cumulative online regret $R^{\pi}_{\text{online}}(T)$. Recall the first key lemma (Lemma 1):
$$R^{\pi}_{\text{online}}(T) \le \rho_1 d\sqrt{T} + \rho_2\, d\sqrt{\frac{1}{n^2 K}\sum_{t=1}^{T}\mathbb{E}\Big[\sum_{j\in J_t}\frac{\|g_{j,t}\|^2}{(p_{j,t})^2}\Big]} + \rho_3.$$
We can apply Lemma 1 to Adam BS as well as to Adam CB, since both Adam BS and Adam CB follow the same model parameter update rule. However, we use the corrected lemma (Lemma 12) for Adam BS, rather than applying the key lemma (Lemma 2) used for Adam CB. Recall Lemma 12.

Let us denote $M := \min_{p_t}\sum_{t=1}^{T}\mathbb{E}_{p_t}\big[\sum_{j\in J_t}\frac{\|g_{j,t}\|^2}{(p_{j,t})^2}\big]$. Then, by Lemma 12, we have
$$\sum_{t=1}^{T}\mathbb{E}\Big[\sum_{j\in J_t}\frac{\|g_{j,t}\|^2}{(p_{j,t})^2}\Big] \le M + CK\sqrt{nT\ln n},$$
where $C > 0$ is a constant. Plugging the above into Lemma 1, we obtain
$$R^{\pi}_{\text{online}}(T) \le \rho_1 d\sqrt{T} + \rho_2\, d\sqrt{\frac{M}{n^2 K} + \frac{C\sqrt{nT\ln n}}{n^2}} + \rho_3 \le \rho_1 d\sqrt{T} + \rho_2\, d\sqrt{\frac{M}{n^2 K}} + \rho_5\,\frac{d}{n}\,(nT\ln n)^{1/4} + \rho_3.$$
We use the fact that $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$ in the second inequality, and we define $\rho_5 := \rho_2\sqrt{C}$. Now, we consider $M$. Using the tower property and applying the optimal solution for $p_t$ at each iteration, we can express $M$ as
$$M = K\sum_{t=1}^{T}\mathbb{E}\Big[\Big(\sum_{i=1}^{n}\|g_{i,t}\|\Big)^2\Big].$$
This follows the same argument as in the proof of Theorem 1 (see Appendix B.4). Then, by plugging $M$ into the online regret bound,
$$R^{\pi}_{\text{online}}(T) \le \rho_1 d\sqrt{T} + \rho_2\, d\sqrt{\frac{1}{n^2}\sum_{t=1}^{T}\mathbb{E}\Big[\Big(\sum_{i=1}^{n}\|g_{i,t}\|\Big)^2\Big]} + \rho_5\,\frac{d}{n}\,(nT\ln n)^{1/4} + \rho_3.$$
By Assumption 1, $\|g_{i,t}\| \le L$ for $i \in [n]$ and $t \in [T]$.
Then, the second term on the right-hand side of the above inequality is bounded by $L\rho_2 d\sqrt{T}$, which is absorbed into the first term of order $O(d\sqrt{T})$. Hence, the online regret $R^{\pi}_{\text{online}}(T)$ after $T$ iterations is
$$R^{\pi}_{\text{online}}(T) = O\Big(d\sqrt{T} + \frac{d}{n^{3/4}}\,(T\ln n)^{1/4}\Big).$$
Finally, by Lemma 11, we can bound the cumulative regret using the bound of the online regret as
$$R^{\pi}(T) = R^{\pi}_{\text{online}}(T) + O(\sqrt{T}) = O\Big(d\sqrt{T} + \frac{d}{n^{3/4}}\,(T\ln n)^{1/4}\Big).$$
This completes the proof of Theorem 3.

E Additional Algorithm

E.1 DepRound Algorithm

Algorithm 6: DepRound
Input: natural number $K (< n)$, sample distribution $p := (p_1, p_2, \dots, p_n)$ with $\sum_{i=1}^{n} p_i = K$
Output: subset of $[n]$ with $K$ distinct elements
1 while there is an $i$ with $0 < p_i < 1$ do
2   Choose distinct $i, j$ with $0 < p_i < 1$ and $0 < p_j < 1$
3   Set $\alpha = \min\{1-p_i,\, p_j\}$ and $\beta = \min\{p_i,\, 1-p_j\}$
4   Update $p_i$ and $p_j$ as: $(p_i, p_j) \leftarrow (p_i+\alpha,\, p_j-\alpha)$ with probability $\frac{\beta}{\alpha+\beta}$, or $(p_i, p_j) \leftarrow (p_i-\beta,\, p_j+\beta)$ with probability $\frac{\alpha}{\alpha+\beta}$
5 return $\{i : p_i = 1,\ 1 \le i \le n\}$

The DepRound (Dependent Rounding; Gandhi et al., 2006) algorithm is used to select a subset of elements from a set while maintaining certain probabilistic properties. It ensures that the sum of probabilities is preserved and that elements are chosen with the correct marginal probabilities.

F More on Numerical Experiments

F.1 Details on Experimental Setup

We compared our method, Adam CB, with Adam, Adam X, and the corrected Adam BS. The experiments measured training loss and test loss, averaged over five runs with different random seeds, and included 1-sigma error bars for reliability. Throughout all experiments, identical hyper-parameters were used without any tuning, as shown in Table 2.
Table 2: Hyper-parameters used for experiments
Learning rate $\alpha_t$: 0.001
Exponential decay rates for momentum $\beta_{1,1}$, $\beta_2$: 0.9, 0.999
Decay rate $\lambda$ for $\beta_{1,t}$ (for the convergence guarantee): 1 - 1e-8
$\epsilon$ for non-zero division: 1e-8
Loss function: cross-entropy
Batch size $K$: 128
Exploration parameter $\gamma$: 0.4
Number of epochs: 10

We trained MLP models on the MNIST, Fashion MNIST, and CIFAR-10 datasets. The detailed architectures of the MLP models for each dataset are provided in Table 3.

Table 3: MLP architecture for MNIST/Fashion MNIST (left) and CIFAR-10 (right)
MNIST/Fashion MNIST:
Flatten: (N, 28x28x1) -> (N, 784)
Dense + ReLU: (N, 784) -> (N, 512)
Dense + ReLU: (N, 512) -> (N, 256)
Dense: (N, 256) -> (N, 10)
CIFAR-10:
Flatten: (N, 32x32x3) -> (N, 3072)
Dense + ReLU: (N, 3072) -> (N, 512)
Dense + ReLU: (N, 512) -> (N, 256)
Dense: (N, 256) -> (N, 10)

We also trained CNN models on the same datasets. The detailed architectures of the CNN models for each dataset are presented in Table 4.

Table 4: CNN architecture for MNIST/Fashion MNIST (left) and CIFAR-10 (right)
MNIST/Fashion MNIST:
Conv + ReLU: (N, 1, 28, 28) -> (N, 32, 28, 28)
MaxPool: (N, 32, 28, 28) -> (N, 32, 14, 14)
Conv + ReLU: (N, 32, 14, 14) -> (N, 64, 14, 14)
MaxPool: (N, 64, 14, 14) -> (N, 64, 7, 7)
Flatten: (N, 64, 7, 7) -> (N, 3136)
Dense: (N, 3136) -> (N, 128)
Dense + Softmax: (N, 128) -> (N, 10)
CIFAR-10:
Conv + ReLU: (N, 3, 32, 32) -> (N, 64, 32, 32)
MaxPool: (N, 64, 32, 32) -> (N, 64, 16, 16)
Conv + ReLU: (N, 64, 16, 16) -> (N, 128, 16, 16)
MaxPool: (N, 128, 16, 16) -> (N, 128, 8, 8)
Conv + ReLU: (N, 128, 8, 8) -> (N, 256, 8, 8)
MaxPool: (N, 256, 8, 8) -> (N, 256, 4, 4)
Flatten: (N, 256, 4, 4) -> (N, 4096)
Dense: (N, 4096) -> (N, 512)
Dense + Softmax: (N, 512) -> (N, 10)

Table 5: VGG architecture for MNIST/Fashion MNIST (left) and CIFAR-10 (right)
MNIST/Fashion MNIST:
Conv + ReLU: (N, 1, 28, 28) -> (N, 64, 28, 28)
Conv + ReLU: (N, 64, 28, 28) -> (N, 64, 28, 28)
MaxPool: (N, 64, 28, 28) -> (N, 64, 14, 14)
Conv + ReLU: (N, 64, 14, 14) -> (N, 128, 14, 14)
Conv + ReLU: (N, 128, 14, 14) -> (N, 128, 14, 14)
MaxPool: (N, 128, 14, 14) -> (N, 128, 7, 7)
Conv + ReLU: (N, 128, 7, 7) -> (N, 256, 7, 7)
Conv + ReLU: (N, 256, 7, 7) -> (N, 256, 7, 7)
Conv + ReLU: (N, 256, 7, 7) -> (N, 256, 7, 7)
MaxPool: (N, 256, 7, 7) -> (N, 256, 3, 3)
Flatten: (N, 256, 3, 3) -> (N, 2304)
Dense: (N, 2304) -> (N, 512)
Dense: (N, 512) -> (N, 512)
Dense: (N, 512) -> (N, 10)
CIFAR-10:
Conv + ReLU: (N, 3, 32, 32) -> (N, 64, 32, 32)
Conv + ReLU: (N, 64, 32, 32) -> (N, 64, 32, 32)
MaxPool: (N, 64, 32, 32) -> (N, 64, 16, 16)
Conv + ReLU: (N, 64, 16, 16) -> (N, 128, 16, 16)
Conv + ReLU: (N, 128, 16, 16) -> (N, 128, 16, 16)
MaxPool: (N, 128, 16, 16) -> (N, 128, 8, 8)
Conv + ReLU: (N, 128, 8, 8) -> (N, 256, 8, 8)
Conv + ReLU: (N, 256, 8, 8) -> (N, 256, 8, 8)
Conv + ReLU: (N, 256, 8, 8) -> (N, 256, 8, 8)
MaxPool: (N, 256, 8, 8) -> (N, 256, 4, 4)
Flatten: (N, 256, 4, 4) -> (N, 4096)
Dense: (N, 4096) -> (N, 512)
Dense: (N, 512) -> (N, 512)
Dense: (N, 512) -> (N, 10)

We also evaluated the AMSGrad optimizer and the corrected Adam BS algorithm (Algorithm 4) on the CIFAR-10 dataset using both MLP and CNN models. The results are presented in Figures 3 and 4. From these plots, it is evident that our Adam CB algorithm outperforms the other Adam-based algorithms. To further assess performance, we conducted experiments using the VGG model, a larger architecture than the MLP and CNN models. The detailed structure of the VGG architecture is provided in Table 5, and the results are shown in Figure 5.
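As a rough sanity check on the scale of the architectures above, the parameter counts of the dense stacks in Table 3 can be computed directly. The following sketch is ours (the helper name `mlp_param_count` is not from the paper); it counts only the weights and biases of the Dense layers listed in Table 3, under the standard convention that a Dense layer from fan_in to fan_out units has fan_in*fan_out weights plus fan_out biases:

```python
# Parameter count for a fully connected stack: each Dense layer
# mapping fan_in -> fan_out contributes fan_in*fan_out weights
# plus fan_out biases; Flatten layers contribute no parameters.
def mlp_param_count(layer_sizes):
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out
    return total

# MLP from Table 3 for MNIST/Fashion MNIST: 28x28x1 -> 512 -> 256 -> 10
mnist_mlp = mlp_param_count([28 * 28 * 1, 512, 256, 10])   # 535,818 parameters
# MLP from Table 3 for CIFAR-10: 32x32x3 -> 512 -> 256 -> 10
cifar_mlp = mlp_param_count([32 * 32 * 3, 512, 256, 10])   # 1,707,274 parameters
print(mnist_mlp, cifar_mlp)
```

Both MLPs are thus well under two million parameters, which puts the comparison with the much larger ResNet-18 and ConvNeXt models in Section F.2 in perspective.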
Figure 3: Comparison of Adam-based optimizations on the MLP model ((a) MNIST, (b) Fashion MNIST, (c) CIFAR-10; training and test loss vs. epochs for Adam, Adam X, Adam BS, Adam BS (corrected), AMSGrad, and Adam CB (ours)).

Figure 4: Comparison of Adam-based optimizations on the CNN model (same datasets and methods as Figure 3).

Figure 5: Comparison of Adam-based optimizations on the VGG model (same datasets and methods as Figure 3).

F.2 Additional Experiments

To further evaluate the effectiveness of our proposed method, we conducted additional experiments using logistic regression, ResNet-18 (He et al., 2016), ConvNeXt-Base (Liu et al., 2022), and ConvNeXt-Large (Liu et al., 2022) networks.

Figure 6: Comparison of Adam-based optimizations on the logistic regression model (MNIST; Adam, Adam BS, and Adam CB (ours)).

Figure 7: Comparison of Adam-based optimizations on the ResNet-18 model ((a) MNIST, (b) Fashion MNIST, (c) CIFAR-10).

Figure 8: Comparison of Adam-based optimizations on the ConvNeXt-Base model ((a) MNIST, (b) Fashion MNIST, (c) CIFAR-10).

Figure 9: Comparison of Adam-based optimizations on the ConvNeXt-Large model ((a) MNIST, (b) Fashion MNIST, (c) CIFAR-10).

The logistic regression model was employed to assess the performance of our algorithm in convex optimization settings. For general non-convex optimization, we tested our method on the ResNet-18, ConvNeXt-Base, and ConvNeXt-Large models. Notably, ResNet-18 (11.4 million parameters), ConvNeXt-Base (89 million parameters), and ConvNeXt-Large (198 million parameters) are substantially larger architectures than the simple MLP and CNN models evaluated in the previous section. These experiments demonstrate the scalability and efficiency of our algorithm on larger, more complex models. In all experiments, our proposed algorithm, Adam CB, consistently outperformed existing methods, reaffirming its effectiveness across both convex and non-convex optimization tasks and on models of varying complexity.

G When L is not known

Algorithm 7: Weight-Update (with unknown L)
Input: $w_{t-1}$, $p_t$, $J_t$, $\{g_{j,t}\}_{j\in J_t}$, $S_{\text{null},t}$, $\gamma \in [0,1)$, $L_{t-1}$
1 Set $L_t \leftarrow \max\big(L_{t-1}, \max_{j\in J_t}\|g_{j,t}\|\big)$
2 for $j = 1$ to $n$ do
3   Compute the loss $\ell_{j,t} = \frac{p_{\min}^2}{2L_t^2}\big(\frac{\|g_{j,t}\|^2}{(p_{j,t})^2} + \frac{L_t^2}{p_{\min}^2}\big)$ if $j \in J_t$; otherwise $\ell_{j,t} = 0$
4   if $j \notin S_{\text{null},t}$ then
5     $w_{j,t} \leftarrow w_{j,t-1}\exp\big(K\gamma\ell_{j,t}/n\big)$
6 return $w_t$, $L_t$

Lemma 13. (Lemma 9 when $L$ is unknown) For all $t \ge 1$, we have $\sqrt{\hat{v}_t} \le \frac{L_t}{\gamma(1-\beta_1)}$, where $\hat{v}_t$ is as in Adam CB (Algorithm 1).

Proof. The argument follows the same reasoning as presented in Lemma 9, with the modification that $L$ is replaced by $L_t$, reflecting the condition that $\|g_{i,t}\| \le L_t$ for all $i \in [n]$ at any $t$.

Lemma 14. (Lemma 1 when $L$ is unknown) Suppose Assumptions 1-2 hold.
Adam CB (Algorithm 1) with a mini-batch of size $K$, formed dynamically from the distribution $p_t$, achieves the following upper bound on the cumulative online regret $R^{\pi}_{\text{online}}(T)$ over $T$ iterations:
$$R^{\pi}_{\text{online}}(T) \le \rho'_1 d\sqrt{T} + \rho'_2\, d\sqrt{\frac{1}{n^2 K}\sum_{t=1}^{T}\mathbb{E}\Big[\sum_{j\in J_t}\frac{\|g_{j,t}\|^2}{(p_{j,t})^2}\Big]} + \rho'_3,$$
where $\rho'_1$, $\rho'_2$, and $\rho'_3$ are defined as follows:
$$\rho'_1 = \frac{D^2 L_T}{2\alpha\gamma(1-\beta_1)^2}, \quad \rho'_2 = \frac{\alpha\sqrt{1+\ln T}}{(1-\beta_1)^2\sqrt{1-\beta_2}\,(1-\eta)}, \quad \rho'_3 = \frac{d\beta_1 D^2 L_T}{2\alpha\gamma(1-\beta_1)^2(1-\lambda)^2}.$$
Note that $d$ is the dimension of the parameter space and the inputs of Algorithm 1 satisfy the following conditions: (a) $\alpha_t = \alpha/\sqrt{t}$, (b) $\beta_1, \beta_2 \in [0,1)$, $\beta_{1,t} := \beta_1\lambda^{t-1}$ for all $t \in [T]$, $\lambda \in (0,1)$, (c) $\eta = \beta_1/\sqrt{\beta_2} \le 1$, and (d) $\gamma \in [0,1)$.

Proof. The proof is the same as that of Lemma 1 up to bounding the terms (13), (14), and (15).

Bound for the term (13). Following the same reasoning as in Lemma 1, we have
$$\sum_{t=1}^{T}\sum_{u=1}^{d}\mathbb{E}\Big[\frac{\sqrt{\hat{v}_{t,u}}}{2\alpha_t(1-\beta_{1,t})}\Big((\theta_{t,u}-\theta_{*,u})^2 - (\theta_{t+1,u}-\theta_{*,u})^2\Big)\Big] \le \frac{D^2\sqrt{T}}{2\alpha(1-\beta_{1,T})}\sum_{u=1}^{d}\mathbb{E}\big[\sqrt{\hat{v}_{T,u}}\big] \le \frac{d\,D^2 L_T\sqrt{T}}{2\alpha\gamma(1-\beta_1)^2},$$
where the last inequality is by Lemma 13.

Bound for the term (14). Nothing changes here.

Bound for the term (15). Following the same reasoning as in Lemma 1, we obtain
$$\sum_{t=2}^{T}\sum_{u=1}^{d}\mathbb{E}\Big[\frac{\beta_{1,t}\sqrt{\hat{v}_{t-1,u}}}{2\alpha_{t-1}(1-\beta_1)}(\theta_{*,u}-\theta_{t,u})^2\Big] \le \frac{D^2}{2\alpha(1-\beta_1)}\sum_{u=1}^{d}\mathbb{E}\Big[\sum_{t=2}^{T}\beta_{1,t}\sqrt{(t-1)\,\hat{v}_{t-1,u}}\Big].$$
Therefore, from Lemma 13, we obtain
$$\sum_{t=2}^{T}\sum_{u=1}^{d}\mathbb{E}\Big[\frac{\beta_{1,t}\sqrt{\hat{v}_{t-1,u}}}{2\alpha_{t-1}(1-\beta_1)}(\theta_{*,u}-\theta_{t,u})^2\Big] \le \frac{d\,D^2}{2\alpha\gamma(1-\beta_1)^2}\,\mathbb{E}\Big[\sum_{t=2}^{T}\beta_{1,t}\,L_t\sqrt{t}\Big]. \qquad (23)$$
Since $L_t$ is a running max, $\{L_t\}_{t=1}^{T}$ is a non-decreasing sequence, i.e., $L_1 \le L_2 \le \cdots \le L_T$. Thus, the inequality (23) becomes
$$\sum_{t=2}^{T}\sum_{u=1}^{d}\mathbb{E}\Big[\frac{\beta_{1,t}\sqrt{\hat{v}_{t-1,u}}}{2\alpha_{t-1}(1-\beta_1)}(\theta_{*,u}-\theta_{t,u})^2\Big] \le \frac{d\,D^2 L_T}{2\alpha\gamma(1-\beta_1)^2}\sum_{t=2}^{T}\beta_1\lambda^{t-1}\sqrt{t} \le \frac{d\,D^2 L_T}{2\alpha\gamma(1-\beta_1)^2}\sum_{t=2}^{T}\beta_1 t\lambda^{t-1} \le \frac{d\,D^2 L_T}{2\alpha\gamma(1-\beta_1)^2}\cdot\frac{\beta_1}{(1-\lambda)^2},$$
where the first inequality is from the fact that $\sqrt{t} \le t$, and the last inequality is from Lemma 5. Thus, the bound for the term (15) is
$$\sum_{t=2}^{T}\sum_{u=1}^{d}\mathbb{E}\Big[\frac{\beta_{1,t}\sqrt{\hat{v}_{t-1,u}}}{2\alpha_{t-1}(1-\beta_1)}(\theta_{*,u}-\theta_{t,u})^2\Big] \le \frac{d\beta_1 D^2 L_T}{2\alpha\gamma(1-\beta_1)^2(1-\lambda)^2}.$$
We have now bounded the three terms (13), (14), and (15).
Hence,
$$R^{\pi}_{\text{online}}(T) \le \frac{d\,D^2 L_T\sqrt{T}}{2\alpha\gamma(1-\beta_1)^2} + \frac{\alpha\sqrt{1+\ln T}}{(1-\beta_1)^2\sqrt{1-\beta_2}\,(1-\eta)}\sum_{u=1}^{d}\mathbb{E}\big[\|g_{1:T,u}\|\big] + \frac{d\beta_1 D^2 L_T}{2\alpha\gamma(1-\beta_1)^2(1-\lambda)^2}.$$
Thus, we can express $R^{\pi}_{\text{online}}(T)$ as
$$R^{\pi}_{\text{online}}(T) \le \rho'_1 d\sqrt{T} + \rho'_2\sum_{u=1}^{d}\mathbb{E}\big[\|g_{1:T,u}\|\big] + \rho'_3,$$
where $\rho'_1$, $\rho'_2$, and $\rho'_3$ are defined as the following:
$$\rho'_1 = \frac{D^2 L_T}{2\alpha\gamma(1-\beta_1)^2}, \quad \rho'_2 = \frac{\alpha\sqrt{1+\ln T}}{(1-\beta_1)^2\sqrt{1-\beta_2}\,(1-\eta)}, \quad \rho'_3 = \frac{d\beta_1 D^2 L_T}{2\alpha\gamma(1-\beta_1)^2(1-\lambda)^2}.$$
The subsequent steps of the proof are the same as in Lemma 1. Note that, by Assumption 1, $L_T$ is always less than or equal to the theoretical upper bound $L$ of the maximum gradient norm across all iterations. Hence, we have $\rho'_1 \le \rho_1$, $\rho'_2 = \rho_2$, and $\rho'_3 \le \rho_3$. This implies that Lemma 1 holds even when $L$ is not known.

Lemma 15. (Lemma 2 when $L$ is unknown) Suppose Assumptions 1-2 hold. If we set $\gamma$ as in Lemma 2, the batch selection (Algorithm 2) and the weight update rule (Algorithm 7) following Adam CB (Algorithm 1) imply the same bound as in Lemma 2.

Proof. The proof is the same as that of Lemma 2. However, at the last step, where we rescale the losses, the scaling factor involves $L_t$ in place of $L$:
$$\frac{L_t^2}{p_{\min}^2}\big(L_{\text{MIN-K}}(T) - \mathbb{E}[L_{\text{EXP3-K}}(T)]\big). \qquad (24)$$
Since $L_t$ is a running max, $\{L_t\}_{t=1}^{T}$ is a non-decreasing sequence, i.e., $L_1 \le L_2 \le \cdots \le L_T$. Hence, Eq.(24) becomes
$$\frac{L_T^2}{p_{\min}^2}\big(L_{\text{MIN-K}}(T) - \mathbb{E}[L_{\text{EXP3-K}}(T)]\big) \le \frac{2.63\,L_T^2}{p_{\min}^2}\sqrt{nT\ln n}.$$
By Assumption 1, $L_T$ is always less than or equal to $L$, which implies $L_T = O(1)$. This completes the proof of Lemma 15.

Lemma 15 implies that Lemma 2 holds even when $L$ is not known.

Theorem 4. (Regret bound of Adam CB (Theorem 1) when $L$ is unknown) Suppose Assumptions 1-2 hold, and we run Adam CB for a total of $T$ iterations with $\alpha_t = \alpha/\sqrt{t}$ and with $\beta_{1,t} := \beta_1\lambda^{t-1}$, $\lambda \in (0,1)$. Then, the cumulative regret of Adam CB (Algorithm 1) with batch size $K$ is upper-bounded by the same bound as in Theorem 1.

Proof. The overall proof is similar to the proof of Theorem 1 (when $L$ is known) detailed in Appendix B.4. The only difference arises when bounding the term Eq.(18) in Lemma 11:
$$(18) \le \frac{L_T}{\gamma}\sum_{t=1}^{T}\mathbb{E}\big[\|\theta_{t+1} - \theta_t\|\big] \le \frac{2\alpha L_T^2\sqrt{T}}{\epsilon\gamma^2(1-\beta_1)}.$$
By Assumption 1, $L_T$ is always less than or equal to the upper bound $L$ of the maximum gradient norm across all iterations, which implies $L_T = O(1)$.
Therefore, we have $(18) = O(\sqrt{T})$. This implies that Lemma 11 still holds. Since both Lemma 1 and Lemma 2 hold even when $L$ is not known, by Lemma 14 and Lemma 15, we complete the proof of Theorem 4 by following the same proof process as in Theorem 1.
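The running-max construction of Appendix G is mechanical enough to sketch in code. The following is an illustrative paraphrase of the weight update with unknown $L$, not the paper's implementation: the function name and argument layout are ours, the $S_{\text{null},t}$ filtering is omitted for brevity, and the loss scaling is one reading of the garbled formula in Algorithm 7 (chosen so that the loss stays in $[0,1]$ whenever $\|g_{j,t}\| \le L_t$ and $p_{j,t} \ge p_{\min}$):

```python
import math

def weight_update_unknown_L(w, p, batch_idx, grad_norms, gamma, L_prev, p_min, n, K):
    """One round of an Appendix-G-style weight update with a running max L_t.

    w: list of n sample weights; p: sampling probabilities p_{i,t};
    batch_idx: set of indices in the mini-batch J_t;
    grad_norms: dict mapping j in J_t to ||g_{j,t}||.
    Illustrative sketch only; the exact loss constant and the
    S_null filtering of Algorithm 7 are not reproduced here.
    """
    # Step 1: update the running max over observed gradient norms,
    # so L_t never decreases across iterations (L_1 <= L_2 <= ... <= L_T).
    L_t = max(L_prev, max(grad_norms[j] for j in batch_idx))
    w_new = list(w)
    for j in range(n):
        if j in batch_idx:
            # Scaled loss, assumed form: (p_min^2 / (2 L_t^2)) * (||g||^2/p^2 + L_t^2/p_min^2),
            # which lies in [0, 1] when ||g_{j,t}|| <= L_t and p[j] >= p_min.
            ell = (p_min ** 2 / (2 * L_t ** 2)) * (
                grad_norms[j] ** 2 / p[j] ** 2 + L_t ** 2 / p_min ** 2
            )
        else:
            ell = 0.0
        # Exponential-weights update, as in line 5 of Algorithm 7.
        w_new[j] = w[j] * math.exp(K * gamma * ell / n)
    return w_new, L_t
```

Samples observed with large gradient norms receive larger weights and are therefore drawn more often in later rounds, while the non-decreasing sequence $\{L_t\}$ is exactly the running max used in Lemmas 13-15 to replace the unknown bound $L$.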