# INTERACTING CONTOUR STOCHASTIC GRADIENT LANGEVIN DYNAMICS

Published as a conference paper at ICLR 2022

Wei Deng¹ ², Siqi Liang¹, Botao Hao³, Guang Lin¹, Faming Liang¹
¹Purdue University  ²Morgan Stanley  ³DeepMind
fmliang@purdue.edu; weideng056@gmail.com

ABSTRACT

We propose an interacting contour stochastic gradient Langevin dynamics (ICSGLD) sampler, an embarrassingly parallel multiple-chain contour stochastic gradient Langevin dynamics (CSGLD) sampler with efficient interactions. We show that ICSGLD can be theoretically more efficient than a single-chain CSGLD with an equivalent computational budget. We also present a novel random-field function, which facilitates the estimation of the self-adapting parameters in big data and enables free exploration of modes. Empirically, we compare the proposed algorithm with popular benchmark methods for posterior sampling. The numerical results show a great potential of ICSGLD for large-scale uncertainty estimation tasks.

1 INTRODUCTION

Stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011) has achieved great success in the simulation of high-dimensional systems for big data problems. However, it mixes quickly only when the energy landscape is simple, e.g., when the local energy wells are shallow and not well separated. To improve its convergence on problems with complex energy landscapes, various strategies have been proposed, such as momentum augmentation (Chen et al., 2014; Ding et al., 2014), Hessian approximation (Ahn et al., 2012; Li et al., 2016), high-order numerical schemes (Chen et al., 2015; Li et al., 2019b), and cyclical learning rates (Izmailov et al., 2018; Maddox et al., 2019; Zhang et al., 2020b). Despite their asymptotic guarantees in Bayesian inference (Vollmer et al., 2016) and non-convex optimization (Zhang et al., 2017), it remains difficult to achieve compelling empirical results on pathologically complex deep neural networks (DNNs).

To simulate from distributions with complex energy landscapes, e.g., those with a multitude of modes separated by high energy barriers, an emerging trend is to run multiple chains, where interactions between chains can potentially accelerate the convergence of the simulation. For example, Song et al. (2014) and Futami et al. (2020) showed theoretical advantages of appropriate interactions in ensemble/population simulations. Other multiple-chain methods include particle-based nonlinear Markov (Vlasov) processes (Liu & Wang, 2016; Zhang et al., 2020a) and replica exchange methods (also known as parallel tempering) (Deng et al., 2021a). However, the particle-based methods incur an expensive kernel matrix computation when a large number of particles is used (Liu & Wang, 2016); similarly, naïvely extending replica exchange methods to population chains leads to a long waiting time to swap between non-neighboring chains (Syed et al., 2021). Therefore, how to conduct interactions between chains while maintaining the scalability of the algorithm is the key to the success of parallel stochastic gradient MCMC algorithms.

In this paper, we propose an interacting contour stochastic gradient Langevin dynamics (ICSGLD) sampler, a pleasingly parallel extension of contour stochastic gradient Langevin dynamics (CSGLD) (Deng et al., 2020b) with efficient interactions.
The proposed algorithm requires minimal communication cost in that each chain shares with the others only its estimate of the marginal energy likelihood. As a result, the interacting mechanism improves the convergence of the simulation, while the minimal communication between chains allows the algorithm to be naturally adapted to distributed computing with little overhead. For the single-chain CSGLD algorithm, despite the theoretical advantages shown in Deng et al. (2020b), estimation of the marginal energy likelihood remains challenging for big data problems with a wide energy range, which jeopardizes the empirical performance of the class of importance sampling methods (Gordon et al., 1993; Doucet et al., 2001; Wang & Landau, 2001; Liang et al., 2007; Andrieu et al., 2010; Deng et al., 2020b) in big data applications. To resolve this issue, we resort to a novel interacting random-field function based on multiple chains, which yields an ideal variance reduction and a more robust estimation. As such, we greatly facilitate the estimation of the marginal energy likelihood and thereby accelerate the simulation of notoriously complex distributions. To summarize, the algorithm has the following main contributions:

- We propose a scalable interacting importance sampling method for big data problems with minimal communication cost.
- A novel random-field function is derived to tackle the incompatibility of the class of importance sampling methods with big data problems.
- Theoretically, we study the local stability of a non-linear mean-field system and justify regularity properties of the solution of Poisson's equation. We also prove the asymptotic normality of the stochastic approximation process in mini-batch settings and show that ICSGLD is asymptotically more efficient than single-chain CSGLD with an equivalent computational budget.
- Our proposed algorithm achieves appealing mode exploration with a fixed learning rate on the MNIST dataset and obtains remarkable performance in large-scale uncertainty estimation tasks.

2 PRELIMINARIES

2.1 STOCHASTIC GRADIENT LANGEVIN DYNAMICS

A standard sampling algorithm for big data problems is SGLD (Welling & Teh, 2011), a numerical scheme of a stochastic differential equation in mini-batch settings:

$$x_{k+1} = x_k - \epsilon_k \frac{N}{n}\nabla_x \widetilde U(x_k) + \sqrt{2\tau\epsilon_k}\, w_k, \qquad (1)$$

where $x_k \in \mathcal X \subseteq \mathbb R^d$, $\epsilon_k$ is the learning rate at iteration $k$, $N$ denotes the total number of data points, $\tau$ is the temperature, and $w_k$ is a standard Gaussian vector of dimension $d$. In particular, $\frac{N}{n}\nabla_x \widetilde U(x)$ is an unbiased stochastic gradient estimator based on a mini-batch $B$ of size $n$, and $\frac{N}{n}\widetilde U(x)$ is an unbiased estimator of the exact energy function $U(x)$. Under mild conditions on $U$, $x_k$ is known to converge weakly to the unique invariant distribution $\pi(x) \propto e^{-U(x)/\tau}$.

2.2 CONTOUR STOCHASTIC GRADIENT LANGEVIN DYNAMICS

Despite its theoretical guarantees, SGLD can converge exponentially slowly when $U(x)$ is non-convex and exhibits high energy barriers. To remedy this issue, CSGLD (Deng et al., 2020b) exploits the flat-histogram idea and proposes to simulate from a flattened density with much lower energy barriers,

$$\varpi_{\Psi_\theta}(x) \propto \frac{\pi(x)}{\Psi_\theta^{\zeta}(U(x))}, \qquad (2)$$

where $\zeta$ is a hyperparameter and $\Psi_\theta(u)$ is a piecewise continuous function defined on a partition $\{u_i\}_{i=0}^m$ of the energy space,

$$\Psi_\theta(u) = \sum_{i=1}^{m} \theta(i-1)\, e^{(\log\theta(i) - \log\theta(i-1))\frac{u - u_{i-1}}{\Delta u}}\, 1_{u_{i-1} < u \le u_i}.$$
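To make the update in Eq. (1) concrete, the following is a minimal NumPy sketch of an SGLD pass on a toy Gaussian target; the target $U(x) = \|x\|^2/2$, the mini-batch noise model, and the step sizes are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def sgld_step(x, grad_mini, eps, tau, N, n, rng):
    """One SGLD update (Eq. 1): x <- x - eps*(N/n)*grad_U_tilde(x) + sqrt(2*tau*eps)*w."""
    noise = rng.normal(size=x.shape)
    return x - eps * (N / n) * grad_mini(x) + np.sqrt(2.0 * tau * eps) * noise

# Toy example: sample from pi(x) ∝ exp(-U(x)/tau) with U(x) = ||x||^2 / 2, pretending
# U decomposes over N data points so a mini-batch gradient is (n/N)*x plus noise.
rng = np.random.default_rng(0)
N, n, tau, eps = 1000, 100, 1.0, 1e-3
grad_mini = lambda x: (n / N) * x + 0.01 * rng.normal(size=x.shape)  # unbiased estimator
x, samples = rng.normal(size=2), []
for k in range(5000):
    x = sgld_step(x, grad_mini, eps, tau, N, n, rng)
    samples.append(x.copy())
print(np.var(np.array(samples), axis=0))  # roughly tau = 1 per coordinate
```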
4 CONVERGENCE PROPERTIES

To study the theoretical properties of ICSGLD, we first show a local stability property that is well suited to big data problems, and then present the asymptotic normality of the stochastic approximation process in mini-batch settings, which eventually yields the desired result that ICSGLD is asymptotically more efficient than a single-chain CSGLD with an equivalent computational cost.

4.1 LOCAL STABILITY FOR NON-LINEAR MEAN-FIELD SYSTEMS IN BIG DATA

The first obstacle in the theoretical study is to approximate the components of $\theta$ corresponding to the high-energy regions. To get around this issue, the random-field function $\widetilde H(\theta, x)$ in (8) is adopted to estimate a different target $\theta_\star := \theta_*^{1/\zeta}$. As detailed in Lemma 3 in the supplementary material, the mean-field equation is now formulated as

$$h_i(\theta) \propto \theta_\star^{\zeta}(i) - \big(\theta(i)\, C_\theta\big)^{\zeta} + \text{perturbations}, \qquad (10)$$

where $C_\theta = \widetilde Z_{\zeta,\theta}\,/\,\widetilde Z_{\zeta,\theta_\star}$ and $\widetilde Z_{\zeta,\theta} = \sum_{k=1}^m \theta^{\zeta-1}(k)$. We see that (10) may not be linearly stable as in Deng et al. (2020b). Although the solution of the mean-field system $h(\theta) = 0$ is still unique, there may exist unstable invariant subspaces, which leads us to consider local properties. For a properly initialized $\theta$, which can be achieved by pre-training the model long enough with SGLD, the mean value theorem implies a linear property in a local region, $h_i(\theta) \propto \theta_\star(i) - \theta(i) + \text{perturbations}$. Combining the perturbation theory (Vanden-Eijnden, 2001), we obtain the following stability result:

Lemma 1 (Local stability; informal version of Lemma 3) Assume Assumptions A1-A4 (given in the supplementary material) hold. For any properly initialized $\theta$, we have

$$\langle h(\theta), \theta - \hat\theta_\star\rangle \le -\varphi\,\|\theta - \hat\theta_\star\|^2,$$

where $\hat\theta_\star = \theta_\star + O\big(\sup_x \mathrm{Var}(\xi_n(x)) + \epsilon + \frac{1}{m}\big)$, $\varphi > 0$, $\epsilon$ denotes the learning rate, $\xi_n(x)$ denotes the noise in the stochastic energy estimator of batch size $n$, and $\mathrm{Var}(\cdot)$ denotes the variance.

By justifying the drift conditions of the adaptive transition kernel and the relevant smoothness properties, we prove the existence and regularity of the solution of Poisson's equation in Lemma 6 in the supplementary material. With these tools, we can control the fluctuations of the stochastic approximation and eventually obtain the L² convergence.

Lemma 2 (L² convergence rate; informal version of Lemma 7) Assume Assumptions A1-A5 hold. Then $\theta_k$ converges to $\hat\theta_\star$, where $\hat\theta_\star = \theta_\star + O\big(\sup_x \mathrm{Var}(\xi_n(x)) + \epsilon + \frac{1}{m}\big)$, such that

$$E\big[\|\theta_k - \hat\theta_\star\|^2\big] = O(\omega_k).$$

The result differs from Theorem 1 of Deng et al. (2020b) in that the biased fixed point $\hat\theta_\star$, instead of $\theta_\star$, is treated as the equilibrium of the continuous system, which considerably simplifies the proof. Similar techniques have been adopted by Durmus & Moulines (2017) and Xu et al. (2018). Although the global stability of Deng et al. (2020b) may be sacrificed when $\zeta \neq 1$ in Eq. (8), $\theta_\star = \theta_*^{1/\zeta}$ is much easier to estimate numerically for any $i$ with $0 < \theta_*(i) \ll 1$, given a large $\zeta > 1$.

4.2 ASYMPTOTIC NORMALITY

To study the asymptotic behavior of $\omega_k^{-1/2}(\theta_k - \hat\theta_\star)$, where $\hat\theta_\star$ is the equilibrium point with $\hat\theta_\star = \theta_\star + O\big(\mathrm{Var}(\xi_n(x)) + \epsilon + \frac{1}{m}\big)$, we consider a fixed step size $\omega$ in the SA step for ease of explanation. Let $\bar\theta_t$ denote the solution of the mean-field system in continuous time (with $\bar\theta_0 = \theta_0$), and rewrite the single-chain SA step (7) as

$$\theta_{k+1} - \bar\theta_{(k+1)\omega} = \theta_k - \bar\theta_{k\omega} + \omega\big(H(\theta_k, x_{k+1}) - H(\bar\theta_{k\omega}, x_{k+1})\big) + \omega\big(H(\bar\theta_{k\omega}, x_{k+1}) - h(\bar\theta_{k\omega})\big) - \big(\bar\theta_{(k+1)\omega} - \bar\theta_{k\omega} - \omega h(\bar\theta_{k\omega})\big).$$

Further, we set $\widetilde\theta_{k\omega} := \omega^{-1/2}(\theta_k - \bar\theta_{k\omega})$.
Then the stochastic approximation differs from the mean-field system in that

$$\widetilde\theta_{(k+1)\omega} = \omega^{1/2}\sum_{i=0}^{k}\underbrace{\big(H(\theta_i, x_{i+1}) - H(\bar\theta_{i\omega}, x_{i+1})\big)}_{\text{I: perturbations}} + \omega^{1/2}\sum_{i=0}^{k}\underbrace{\big(H(\bar\theta_{i\omega}, x_{i+1}) - h(\bar\theta_{i\omega})\big)}_{\text{II: martingale } M_i} + \text{remainder}$$

$$\approx \omega\sum_{i=0}^{k} h_\theta(\bar\theta_{i\omega})\,\widetilde\theta_{i\omega} + \omega^{1/2}\sum_{i=0}^{k} M_i \approx \int_0^{(k+1)\omega} h_\theta(\bar\theta_s)\,\widetilde\theta_s\, ds + \int_0^{(k+1)\omega} R^{1/2}(\bar\theta_s)\, d W_s,$$

where $h_\theta(\theta) := \frac{d}{d\theta} h(\theta)$ is a matrix, $W \in \mathbb R^m$ is a standard Brownian motion, the last term follows from a certain central limit theorem (Benveniste et al., 1990), and $R$ denotes the covariance matrix of the random-field function, i.e., $R(\theta) := \sum_{k=-\infty}^{\infty} \mathrm{Cov}_\theta\big(H(\theta, x_k), H(\theta, x_0)\big)$. We expect the weak convergence of $U_k$ to the stationary distribution of the diffusion

$$d U_t = h_\theta(\bar\theta_t)\, U_t\, dt + R^{1/2}(\bar\theta_t)\, d W_t, \qquad (11)$$

where $U_t = \omega_t^{-1/2}(\theta_t - \hat\theta_\star)$. Given that $\bar\theta_t$ converges to $\hat\theta_\star$ sufficiently fast and $h_\theta$ is locally linear, the diffusion (11) resembles an Ornstein-Uhlenbeck process and admits the solution

$$U_t \approx e^{t h_\theta(\hat\theta_\star)} U_0 + \int_0^t e^{(t-s)\, h_\theta(\hat\theta_\star)}\, R^{1/2}(\hat\theta_\star)\, d W_s.$$

Then we have the following theorem, whose formal proof is given in Section C.3.

Theorem 1 (Asymptotic normality) Assume Assumptions A1-A5 (given in the supplementary material) hold. We have the weak convergence

$$\omega_k^{-1/2}(\theta_k - \hat\theta_\star) \Rightarrow N(0, \Sigma), \quad \text{where } \Sigma = \int_0^{\infty} e^{t h_\theta}\, R\, e^{t h_\theta^\top}\, dt,\ \ h_\theta = h_\theta(\hat\theta_\star).$$

4.3 INTERACTING PARALLEL CHAINS ARE MORE EFFICIENT

For clarity, we denote an estimate of $\theta_\star$ based on ICSGLD with $P$ interacting parallel chains by $\theta_k^{P}$, and denote the estimate based on a single long chain of CSGLD by $\theta_{kP}$. Note that Theorem 1 holds for any step size $\omega_k = O(k^{-\alpha})$ with $\alpha \in (0.5, 1]$. If we simply run a single-chain CSGLD algorithm for $P$ times as many iterations, Theorem 1 yields $\omega_{kP}^{-1/2}(\theta_{kP} - \hat\theta_\star) \Rightarrow N(0, \Sigma)$. As to ICSGLD, since the covariance $\Sigma$ relies on $R$, which depends on the covariance of the martingale $\{M_i\}_{i\ge 1}$, the conditional independence of $x^{(1)}, x^{(2)}, \ldots, x^{(P)}$ naturally results in an efficient variance reduction, as stated below.

Corollary 1 (Asymptotic normality for ICSGLD) Assume the same assumptions as in Theorem 1. For ICSGLD with $P$ interacting chains, we have the weak convergence

$$\omega_k^{-1/2}\big(\theta_k^P - \hat\theta_\star\big) \Rightarrow N(0, \Sigma/P).$$

That is, under a similar computational budget, we have

$$\frac{\big\|\mathrm{Var}\big(\theta_{kP} - \hat\theta_\star\big)\big\|_F}{\big\|\mathrm{Var}\big(\theta_k^P - \hat\theta_\star\big)\big\|_F} \approx \frac{\omega_{kP}}{\omega_k/P} \approx P^{1-\alpha} \ge 1.$$

Corollary 2 (Efficiency) Given a decreasing step size $\omega_k = O(k^{-\alpha})$, where $0.5 < \alpha < 1$, ICSGLD is asymptotically more efficient than single-chain CSGLD with an equivalent training cost. For example, with $P = 5$ chains and $\alpha = 0.6$, the single-chain variance is roughly $5^{0.4} \approx 1.9$ times larger.

In practice, slowly decreasing step sizes are often preferred in stochastic algorithms for better non-asymptotic performance (Benveniste et al., 1990).
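As a side note on Theorem 1, the asymptotic covariance $\Sigma = \int_0^\infty e^{t h_\theta} R\, e^{t h_\theta^\top} dt$ is exactly the solution of the continuous-time Lyapunov equation $h_\theta \Sigma + \Sigma h_\theta^\top + R = 0$ whenever $h_\theta$ is stable. A minimal numerical check, with made-up stand-ins for $h_\theta$ and $R$ (not quantities from the paper):

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_lyapunov

# Hypothetical stable Jacobian h_theta and martingale covariance R (illustrative values).
h = np.array([[-1.0, 0.2],
              [0.0, -0.5]])
R = np.array([[1.0, 0.1],
              [0.1, 0.5]])

# Sigma = \int_0^inf e^{t h} R e^{t h^T} dt solves  h @ Sigma + Sigma @ h.T = -R.
Sigma = solve_continuous_lyapunov(h, -R)

# Cross-check against the integral definition by a Riemann sum.
dt, T = 0.01, 40.0
Sigma_quad = sum(expm(t * h) @ R @ expm(t * h).T for t in np.arange(0.0, T, dt)) * dt
print(np.abs(Sigma - Sigma_quad).max())  # small, up to discretization error
```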
5 EXPERIMENTS

5.1 LANDSCAPE EXPLORATION ON MNIST VIA THE SCALABLE RANDOM-FIELD FUNCTION

This section shows how the novel random-field function (8) facilitates the exploration of multiple modes on the MNIST dataset, while standard methods, such as stochastic gradient descent (SGD) and SGLD, get stuck in a few local modes. To simplify the experiments, we choose a large batch size of 2500 and only pick the first five classes, namely the digits 0 to 4. The learning rate is fixed to 1e-6 and the temperature is set to 0.1. We see from Figure 2(a) that both SGD and SGLD lead to fast decreasing losses. By contrast, ICSGLD yields fluctuating losses that traverse freely between high-energy and low-energy regions: as the particles linger in a local region, the penalty for re-visiting it keeps increasing until a negative learning rate is injected to encourage exploration.

Figure 2: Visualization of mode exploration on an MNIST example under different algorithms. Panels: (a) training losses (in thousands) over 500 epochs for ICSGLD, SGLD, and SGD; (b) SGD trajectory; (c) SGLD trajectory; (d) ICSGLD trajectory.

We conducted a singular value decomposition (SVD) on the first two coordinates to visualize the trajectories: we first choose a domain that includes all the coordinates, then recover the parameters from the grid points with truncated values in the other dimensions, and finally fine-tune the parameters and present the approximate losses of the trajectories in Figure 2(b-d). We see that the SGD trajectories get stuck in a local region; SGLD exploits a larger region but is still quite limited in its exploration; ICSGLD, instead, first converges to a local region and then escapes it once it has over-visited that region. This shows the strength of ICSGLD in the simulation of complex multi-modal distributions. More experimental details are presented in Section D.1 of the supplementary material.

* The random-field function of Deng et al. (2020b) requires an extra perturbation term, as discussed in Section D4 of the supplementary material of Deng et al. (2020b); it is therefore not practically appealing in big data.
* Data augmentation implicitly leads to a more concentrated posterior (Wenzel et al., 2020; Aitchison, 2021).

5.2 SIMULATIONS OF MULTI-MODAL DISTRIBUTIONS

This section shows the acceleration effect of ICSGLD via a group of simulation experiments on a multi-modal distribution. The baselines include popular Monte Carlo methods such as CSGLD, SGLD, cyclical SGLD (cycSGLD), replica exchange SGLD (reSGLD), and the particle-based SVGD. The target multi-modal density is presented in Figure 3(a). Figure 3(b-g) displays the empirical performance of all the tested methods: the vanilla SGLD with 5 parallel chains (×P5) undoubtedly performs the worst in this example and fails to quantify the weights of the modes correctly; the single-chain cycSGLD with 5 times as many iterations (×T5) improves the performance but is still not accurate enough; reSGLD (×P5) and SVGD (×P5) perform well, while the latter is quite costly in computation; ICSGLD (×P5) not only traverses freely over the rugged energy landscape, but also yields the most accurate approximation to the ground-truth distribution. By contrast, CSGLD (×T5) performs worse than ICSGLD and overestimates the weights on the left side. For the detailed setups, the study of convergence speed, and the runtime analysis, we refer interested readers to Section D.2 in the supplementary material.

Figure 3: Empirical behavior on a simulation dataset. Panels include cycSGLD, reSGLD, and ICSGLD; Figure 3(c) and 3(f) show simulations based on a single chain with 5 times as many iterations (×T5), and the others run 5 parallel chains (×P5).

5.3 DEEP CONTEXTUAL BANDITS ON MUSHROOM TASKS

This section evaluates ICSGLD on the contextual bandit problem based on the UCI Mushroom data set, as in Riquelme et al. (2018). The mushrooms are assumed to arrive sequentially, and the agent takes an action at each time step based on past feedback. Our goal is to minimize the cumulative regret, which measures the difference between the cumulative reward obtained by the proposed policy and that of the optimal policy. We evaluate Thompson sampling (TS) based on a variety of approximate inference methods for posterior sampling.
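For reference, the following is a toy sketch of the TS loop on a two-armed Bernoulli bandit, with an exact conjugate Beta posterior standing in for the approximate samplers evaluated below; the arm probabilities and horizon are made up for illustration.

```python
import numpy as np

# Toy Thompson sampling on a 2-armed Bernoulli bandit with exact Beta posteriors.
# The paper replaces the exact posterior draw with approximate samplers
# (ConstSGD, Dropout, pSGLD, pCSGLD, pICSGLD); the loop structure is the same.
rng = np.random.default_rng(0)
true_p = np.array([0.4, 0.6])            # hypothetical arm success probabilities
alpha, beta = np.ones(2), np.ones(2)     # Beta(1, 1) prior per arm
regret = 0.0
for t in range(2000):
    theta = rng.beta(alpha, beta)        # one posterior draw per arm
    a = int(np.argmax(theta))            # act greedily under the sampled model
    r = rng.random() < true_p[a]         # observe a Bernoulli reward
    alpha[a] += r
    beta[a] += 1 - r                     # conjugate posterior update
    regret += true_p.max() - true_p[a]   # expected regret of the chosen arm
print("cumulative regret:", regret)
```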
We choose an ϵ-greedy policy (EpsGreedy) based on the RMSProp optimizer with a decaying learning rate (Riquelme et al., 2018) as a baseline. Two variational methods, namely stochastic gradient descent with a constant learning rate (ConstSGD) (Mandt et al., 2017) and Monte Carlo Dropout (Dropout) (Gal & Ghahramani, 2016), are compared for approximating the posterior distribution. For the sampling algorithms, we include preconditioned SGLD (pSGLD) (Li et al., 2016), preconditioned CSGLD (pCSGLD) (Deng et al., 2020b), and preconditioned ICSGLD (pICSGLD). Note that all the algorithms run 4 parallel chains with averaged outputs (×P4), except that pCSGLD runs a single chain with 4 times the computational budget (×T4). For more details, we refer readers to Section D.3 in the supplementary material.

Figure 4: Cumulative regret on the mushroom task (EpsGreedy ×P4, ConstSGD ×P4, Dropout ×P4, pCSGLD ×T4, pSGLD ×P4, and pICSGLD ×P4 over 2000 steps).

Figure 4 shows that EpsGreedy ×P4 tends to explore too much over a long horizon, as expected; ConstSGD ×P4 and Dropout ×P4 perform poorly in the beginning but eventually outperform EpsGreedy ×P4 owing to the inclusion of uncertainty for exploration, although that uncertainty appears inadequate due to the nature of variational inference. By contrast, pSGLD ×P4 significantly outperforms the variational methods by incorporating preconditioners within an exact sampling framework (SGLD). As the only algorithm run in a single-chain manner, pCSGLD ×T4 yields the worst performance early on due to the inefficiency of learning the self-adapting parameters; fortunately, it slightly outperforms pSGLD ×P4 in the later phase with the help of the well-estimated self-adapting parameters. pICSGLD ×P4, by contrast, optimizes the shared self-adapting parameters with all chains simultaneously, which in turn greatly benefits the simulation of the posterior. As a result, pICSGLD ×P4 consistently shows the lowest regret except in the very early period. This demonstrates the superiority of the interaction mechanism in learning the self-adapting parameters and thereby accelerating the simulation.

5.4 UNCERTAINTY ESTIMATION

This section evaluates the quality of our algorithm in uncertainty quantification. For model architectures, we use residual networks (ResNet) (He et al., 2016) and a wide ResNet (WRN) (Zagoruyko & Komodakis, 2016); we choose 20-, 32-, and 56-layer ResNets (denoted by ResNet20, etc.) and a WRN-16-8 network, a 16-layer WRN that is 8 times wider than ResNet16. We train the models on CIFAR100 and report the test accuracy (ACC) and test negative log-likelihood (NLL) based on 5 trials with standard errors. For the out-of-distribution prediction performance, we test the well-trained models via Brier scores (Brier)* on the Street View House Numbers (SVHN) dataset. Due to the wide adoption of momentum stochastic gradient descent (M-SGD), we use stochastic gradient Hamiltonian Monte Carlo (SGHMC) (Chen et al., 2014) as the baseline sampling algorithm and denote the interacting contour SGHMC by ICSGHMC. In addition, we include several high-performing baselines, such as SGHMC with cyclical learning rates (cycSGHMC) (Zhang et al., 2020b), SWAG based on cyclical learning rates with 10 cycles (cycSWAG) (Maddox et al., 2019), and variance-reduced replica exchange SGHMC (reSGHMC) (Deng et al., 2021a). For a fair comparison, ICSGHMC also conducts variance reduction on the energy function to alleviate the bias.
Moreover, a large $\zeta = 3\times 10^6$ is selected, which only induces mild gradient multipliers, ranging from -1 to 2, to penalize over-visited partitions. We don't include SVGD (Liu & Wang, 2016) and SPOS (Zhang et al., 2020a) for scalability reasons. A batch size of 256 is selected. We run 4 parallel processes (×P4) with 500 epochs for M-SGD, reSGHMC, and ICSGHMC, and run cycSGHMC and cycSWAG for 2000 epochs (×T4) on a single process with 10 cycles. Refer to Section D.4 of the supplementary material for the detailed settings.

Table 1: Uncertainty estimation on CIFAR100 and SVHN.

| Method | ResNet20 ACC (%) | ResNet20 NLL | ResNet20 Brier | ResNet32 ACC (%) | ResNet32 NLL | ResNet32 Brier |
|---|---|---|---|---|---|---|
| cycSGHMC ×T4 | 75.41 ± 0.10 | 8437 ± 30 | 2.91 ± 0.13 | 77.93 ± 0.17 | 7658 ± 19 | 3.29 ± 0.13 |
| cycSWAG ×T4 | 75.46 ± 0.11 | 8419 ± 26 | 2.78 ± 0.12 | 77.91 ± 0.15 | 7656 ± 22 | 3.19 ± 0.14 |
| M-SGD ×P4 | 76.01 ± 0.12 | 8175 ± 25 | 2.58 ± 0.08 | 78.41 ± 0.12 | 7501 ± 23 | 2.77 ± 0.15 |
| reSGHMC ×P4 | 76.15 ± 0.16 | 8196 ± 27 | 2.73 ± 0.10 | 78.57 ± 0.07 | 7454 ± 15 | 3.04 ± 0.09 |
| ICSGHMC ×P4 | 76.34 ± 0.15 | 8076 ± 31 | 2.54 ± 0.14 | 78.72 ± 0.16 | 7406 ± 29 | 2.76 ± 0.15 |

| Method | ResNet56 ACC (%) | ResNet56 NLL | ResNet56 Brier | WRN-16-8 ACC (%) | WRN-16-8 NLL | WRN-16-8 Brier |
|---|---|---|---|---|---|---|
| cycSGHMC ×T4 | 81.23 ± 0.19 | 6770 ± 59 | 3.18 ± 0.08 | 82.98 ± 0.03 | 6384 ± 11 | 2.17 ± 0.05 |
| cycSWAG ×T4 | 81.14 ± 0.11 | 6744 ± 55 | 3.06 ± 0.09 | 83.05 ± 0.04 | 6359 ± 14 | 2.04 ± 0.07 |
| M-SGD ×P4 | 81.03 ± 0.14 | 6847 ± 22 | 2.86 ± 0.08 | 82.57 ± 0.07 | 6821 ± 21 | 1.77 ± 0.06 |
| reSGHMC ×P4 | 81.11 ± 0.16 | 6915 ± 40 | 2.92 ± 0.12 | 82.72 ± 0.08 | 6452 ± 19 | 1.92 ± 0.04 |
| ICSGHMC ×P4 | 81.51 ± 0.18 | 6630 ± 38 | 2.88 ± 0.09 | 83.12 ± 0.10 | 6338 ± 36 | 1.83 ± 0.06 |

Table 1 shows that the vanilla ensemble results of M-SGD ×P4 surprisingly outperform cycSGHMC ×T4 and cycSWAG ×T4 on medium-sized models such as ResNet20 and ResNet32, and show very good performance on out-of-distribution samples in Brier scores. We suspect that the parallel implementation (×P4) provides isolated initializations with less correlated samples; by contrast, cycSGHMC ×T4 and cycSWAG ×T4 explore the energy landscape contiguously, implying a risk of staying near the original region. reSGHMC ×P4 shows a remarkable performance overall, but occasionally exhibits a large variance; this indicates that the swaps become insufficient when multiple processes are included. When it comes to WRN-16-8, cycSWAG ×T4 shows a marvelous result and a large improvement over the other baselines. We conjecture that cycSWAG is less dependent on hyperparameter tuning, thus leading to better performance on larger models. We don't report CSGHMC ×P4 since it becomes quite unstable during the training of the ResNet56 and WRN-16-8 models and produces mediocre results. As to ICSGHMC ×P4, it consistently performs remarkably well in both ACC and NLL, and comparably to M-SGD ×P4 in Brier scores. Code is available at github.com/WayneDW/Interacting-Contour-Stochastic-Gradient-Langevin-Dynamics.

6 CONCLUSION

We have proposed ICSGLD as an efficient algorithm for sampling from distributions with complex energy landscapes and shown theoretically that ICSGLD is indeed more efficient than single-chain CSGLD for slowly decreasing step sizes. To the best of our knowledge, this is the first interacting importance sampling algorithm that adapts to big data problems without scalability concerns. ICSGLD has been compared with numerous state-of-the-art baselines on various tasks, and the remarkable results indicate its promising future in big data applications.

*The Brier score measures the mean squared error between the predictive and actual probabilities.

ACKNOWLEDGMENTS

Liang's research was supported in part by the grants DMS-2015498, R01-GM117597, and R01-GM126089.
Lin acknowledges the support from NSF (DMS-1555072, DMS-2053746, and DMS-2134209), BNL Subcontract 382247, and DE-SC0021142.

REFERENCES

Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring. In Proc. of the International Conference on Machine Learning (ICML), 2012.

Sungjin Ahn, Babak Shahbaba, and Max Welling. Distributed Stochastic Gradient MCMC. In Proc. of the International Conference on Machine Learning (ICML), 2014.

Laurence Aitchison. A Statistical Theory of Cold Posteriors in Deep Neural Networks. In Proc. of the International Conference on Learning Representation (ICLR), 2021.

C. Andrieu, E. Moulines, and P. Priouret. Stability of Stochastic Approximation under Verifiable Conditions. SIAM J. Control Optim., 44(1):283–312, 2005.

Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov Chain Monte Carlo Methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3), 2010.

Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive Algorithms and Stochastic Approximations. Berlin: Springer, 1990.

Elmar Bittner, Andreas Nussbaumer, and Wolfhard Janke. Make Life Simple: Unleash the Full Power of the Parallel Tempering Algorithm. Physical Review Letters, 101:130603, 2008.

Changyou Chen, Nan Ding, and Lawrence Carin. On the Convergence of Stochastic Gradient MCMC Algorithms with High-order Integrators. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2278–2286, 2015.

Changyou Chen, Nan Ding, Chunyuan Li, Yizhe Zhang, and Lawrence Carin. Stochastic Gradient MCMC with Stale Gradients. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), 2014.

Wei Deng, Qi Feng, Liyao Gao, Faming Liang, and Guang Lin. Non-Convex Learning via Replica Exchange Stochastic Gradient MCMC. In Proc. of the International Conference on Machine Learning (ICML), 2020a.

Wei Deng, Guang Lin, and Faming Liang. A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions. In Advances in Neural Information Processing Systems (NeurIPS), 2020b.

Wei Deng, Qi Feng, Georgios Karagiannis, Guang Lin, and Faming Liang. Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction. In Proc. of the International Conference on Learning Representation (ICLR), 2021a.

Wei Deng, Yi-An Ma, Zhao Song, Qian Zhang, and Guang Lin. On Convergence of Federated Averaging Langevin Dynamics. arXiv:2112.05120, 2021b.

Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian Sampling using Stochastic Gradient Thermostats. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3203–3211, 2014.

Arnaud Doucet, Nando de Freitas, and Neil Gordon. Sequential Monte Carlo Methods in Practice. Springer Science & Business Media, 2001.

Alain Durmus and Éric Moulines. Non-asymptotic Convergence Analysis for the Unadjusted Langevin Algorithm. Annals of Applied Probability, 27:1551–1587, 2017.

David J. Earl and Michael W. Deem. Parallel Tempering: Theory, Applications, and New Perspectives. Phys. Chem. Chem. Phys., 7:3910–3916, 2005.
Murat A. Erdogdu, Lester Mackey, and Ohad Shamir. Global Non-convex Optimization with Discretized Diffusions. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

G. Fort, E. Moulines, and P. Priouret. Convergence of Adaptive and Interacting Markov Chain Monte Carlo Algorithms. Annals of Statistics, 39:3262–3289, 2011.

G. Fort, B. Jourdain, E. Kuhn, T. Lelièvre, and G. Stoltz. Convergence of the Wang-Landau Algorithm. Math. Comput., 84(295):2297–2327, 2015.

Futoshi Futami, Issei Sato, and Masashi Sugiyama. Accelerating the Diffusion-based Ensemble Sampling by Non-reversible Dynamics. In Proc. of the International Conference on Machine Learning (ICML), 2020.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proc. of the International Conference on Machine Learning (ICML), 2016.

Charles J. Geyer. Markov Chain Monte Carlo Maximum Likelihood. Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pp. 156–163, 1991.

Neil J. Gordon, David J. Salmond, and Adrian F. M. Smith. Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation. IEE Proceedings F (Radar and Signal Processing), 140(2), 1993.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Pavel Izmailov, Dmitry Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging Weights Leads to Wider Optima and Better Generalization. In Proc. of the Conference on Uncertainty in Artificial Intelligence (UAI), 2018.

K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the Best Multi-stage Architecture for Object Recognition? In Proc. of the International Conference on Computer Vision (ICCV), pp. 2146–2153, September 2009.

Helmut G. Katzgraber, Simon Trebst, David A. Huse, and Matthias Troyer. Feedback-Optimized Parallel Tempering Monte Carlo. Journal of Statistical Mechanics: Theory and Experiment, P03018, 2008.

Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin. Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks. In Proc. of the National Conference on Artificial Intelligence (AAAI), pp. 1788–1794, 2016.

Chunyuan Li, Changyou Chen, Yunchen Pu, Ricardo Henao, and Lawrence Carin. Communication-Efficient Stochastic Gradient MCMC for Neural Networks. In Proc. of the National Conference on Artificial Intelligence (AAAI), 2019a.

Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the Convergence of FedAvg on Non-IID Data. In Proc. of the International Conference on Learning Representation (ICLR), 2020.

Xuechen Li, Denny Wu, Lester Mackey, and Murat A. Erdogdu. Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond. In Advances in Neural Information Processing Systems (NeurIPS), pp. 7746–7758, 2019b.

Faming Liang, Chuanhai Liu, and Raymond J. Carroll. Stochastic Approximation in Monte Carlo Computation. Journal of the American Statistical Association, 102:305–320, 2007.

Qiang Liu and Dilin Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. A Simple Baseline for Bayesian Uncertainty in Deep Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic Gradient Descent as Approximate Bayesian Inference. Journal of Machine Learning Research, 18:1–35, 2017.

J. C. Mattingly, A. M. Stuart, and D. J. Higham. Ergodicity for SDEs and Approximations: Locally Lipschitz Vector Fields and Degenerate Noise. Stochastic Processes and their Applications, 101:185–232, 2002.

Jonathan C. Mattingly, Andrew M. Stuart, and M. V. Tretyakov. Convergence of Numerical Time-Averaging and Stationary Measures via Poisson Equations. SIAM Journal on Numerical Analysis, 48:552–577, 2010.

Mariane Pelletier. Weak Convergence Rates for Stochastic Approximation with Application to Multiple Targets and Simulated Annealing. Annals of Applied Probability, 8:10–44, 1998.

Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex Learning via Stochastic Gradient Langevin Dynamics: a Nonasymptotic Analysis. In Proc. of Conference on Learning Theory (COLT), June 2017.

Tom Rainforth, Christian A. Naesseth, Fredrik Lindsten, Brooks Paige, Jan-Willem van de Meent, Arnaud Doucet, and Frank Wood. Interacting Particle Markov Chain Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), 2016.

Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian Bandits Showdown. In Proc. of the International Conference on Learning Representation (ICLR), 2018.

Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. Annals of Mathematical Statistics, 22:400–407, 1951.

Gareth O. Roberts and Richard L. Tweedie. Exponential Convergence of Langevin Distributions and Their Discrete Approximations. Bernoulli, 2(4):341–363, 1996.

Issei Sato and Hiroshi Nakagawa. Approximation Analysis of Stochastic Gradient Langevin Dynamics by Using Fokker-Planck Equation and Ito Process. In Proc. of the International Conference on Machine Learning (ICML), 2014.

Qifan Song, Mingqi Wu, and Faming Liang. Weak Convergence Rates of Population versus Single-Chain Stochastic Approximation MCMC Algorithms. Advances in Applied Probability, 46:1059–1083, 2014.

Robert H. Swendsen and Jian-Sheng Wang. Replica Monte Carlo Simulation of Spin-Glasses. Physical Review Letters, 57:2607–2609, 1986.

Saifuddin Syed, Alexandre Bouchard-Côté, George Deligiannidis, and Arnaud Doucet. Non-Reversible Parallel Tempering: a Scalable Highly Parallel MCMC Scheme. Journal of the Royal Statistical Society, Series B, 2021.

Yee Whye Teh, Alexandre Thiéry, and Sebastian Vollmer. Consistency and Fluctuations for Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17:1–33, 2016.

Eric Vanden-Eijnden. Introduction to Regular Perturbation Theory. Slides, 2001. URL https://cims.nyu.edu/~eve2/reg_pert.pdf.

Sebastian J. Vollmer, Konstantinos C. Zygalakis, and Yee Whye Teh. Exploration of the (Non-)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17(159):1–48, 2016.

Fugao Wang and David P. Landau. Efficient, Multiple-range Random Walk Algorithm to Calculate the Density of States. Physical Review Letters, 86:2050–2053, 2001.

T. Weinhart, A. Singh, and A. R. Thornton. Perturbation Theory & Stability Analysis. Slides, 2010.

Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proc. of the International Conference on Machine Learning (ICML), pp. 681–688, 2011.
Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How Good is the Bayes Posterior in Deep Neural Networks Really? In Proc. of the International Conference on Machine Learning (ICML), 2020.

Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. In Proceedings of the British Machine Vision Conference (BMVC), pp. 87.1–87.12, September 2016.

Jianyi Zhang, Ruiyi Zhang, Lawrence Carin, and Changyou Chen. Stochastic Particle-Optimization Sampling and the Non-Asymptotic Convergence Theory. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, 2020a.

Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning. In Proc. of the International Conference on Learning Representation (ICLR), 2020b.

Yuchen Zhang, Percy Liang, and Moses Charikar. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. In Proc. of Conference on Learning Theory (COLT), pp. 1980–2022, 2017.

Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random Erasing Data Augmentation. arXiv e-prints, 2017.

SUPPLEMENTARY MATERIAL

We summarize the supplementary material as follows: Section A provides the preliminary knowledge of stochastic approximation; Section B shows a local stability condition that adapts to high losses; Section C proves the main asymptotic normality result for the stochastic approximation process, which naturally yields the conclusion that interacting contour stochastic gradient Langevin dynamics (ICSGLD) is more efficient than the analogous single chain with slowly decreasing step sizes; Section D details the experimental settings.

A PRELIMINARIES

A.1 STOCHASTIC APPROXIMATION

Given a random-field function $\widetilde H(\theta, x)$, the stochastic approximation (SA) algorithm (Benveniste et al., 1990) solves the mean-field equation $h(\theta) = 0$ in the analysis of adaptive algorithms, where

$$h(\theta) = \int_{\mathcal X} \widetilde H(\theta, x)\, \varpi_\theta(dx) = 0,$$

$x \in \mathcal X \subseteq \mathbb R^d$, $\theta \in \Theta \subseteq \mathbb R^m$, and $\varpi_\theta(x)$ is a distribution that depends on the self-adapting parameter $\theta$. Given a transition kernel $\Pi_\theta(x, A)$ for Borel subsets $A \subseteq \mathcal X$, the algorithm iterates:

(1) Simulate $x_{k+1} \sim \Pi_{\theta_k}(x_k, \cdot)$, whose invariant distribution is $\varpi_{\theta_k}(\cdot)$;
(2) Optimize $\theta_{k+1} = \theta_k + \omega_{k+1} \widetilde H(\theta_k, x_{k+1})$.

Compared with the standard Robbins-Monro algorithm (Robbins & Monro, 1951), the algorithm simulates $x$ from a transition kernel $\Pi_\theta(\cdot, \cdot)$ instead of from the distribution $\varpi_\theta(\cdot)$ directly. In other words, $\widetilde H(\theta_k, x_{k+1}) - h(\theta_k)$ is not a martingale difference but rather Markov state-dependent noise.
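A minimal numerical illustration of the two-step scheme above, with the transition kernel replaced by an exact draw (i.e., a plain Robbins-Monro iteration); the target and step sizes are made up for illustration.

```python
import numpy as np

# Robbins-Monro: solve h(theta) = E_x[H_tilde(theta, x)] = 0 where x ~ N(mu, 1)
# and H_tilde(theta, x) = x - theta, so the root is theta = mu.
rng = np.random.default_rng(1)
mu = 2.5                          # unknown target (illustrative)
theta = 0.0
for k in range(1, 20001):
    x = rng.normal(mu, 1.0)       # step (1): simulate x (here: an exact draw)
    omega = 1.0 / k**0.6          # step size omega_k = O(k^-alpha), alpha in (0.5, 1]
    theta += omega * (x - theta)  # step (2): theta <- theta + omega * H_tilde
print(theta)                      # approximately 2.5
```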
A.2 POISSON'S EQUATION

The stochastic approximation algorithm generates a sequence $\{(x_k, \theta_k)\}_{k=1}^{\infty}$ on the product space $\mathcal X \times \Theta$, which is an inhomogeneous Markov chain and requires the tool of Poisson's equation to study convergence:

$$\mu_\theta(x) - \Pi_\theta \mu_\theta(x) = \widetilde H(\theta, x) - h(\theta),$$

where $\mu_\theta(\cdot)$ is a function on $\mathcal X$. The solution $\mu_\theta(x)$ of Poisson's equation exists and takes the following form when the series below converges:

$$\mu_\theta(x) = \sum_{k \ge 0} \Pi_\theta^k\big(\widetilde H(\theta, x) - h(\theta)\big), \quad \text{where } \Pi_\theta^k\big(\widetilde H(\theta, x) - h(\theta)\big) = \int \big(\widetilde H(\theta, y) - h(\theta)\big)\, \Pi_\theta^k(x, dy).$$

To ensure such convergence, Benveniste et al. (1990) imposed the following regularity conditions on the solution $\mu_\theta(\cdot)$ of Poisson's equation: there exist a Lyapunov function $V: \mathcal X \to [1, \infty)$ and a positive constant $C > 0$ such that for all $\theta, \theta' \in \Theta$,

$$\|\Pi_\theta \mu_\theta(x)\| \le C V(x), \quad \|\Pi_\theta \mu_\theta(x) - \Pi_{\theta'} \mu_{\theta'}(x)\| \le C \|\theta - \theta'\| V(x), \quad E[V(x)] < \infty, \qquad (12)$$

where a common choice of the Lyapunov function is $V(x) = 1 + \|x\|^2$ (Teh et al., 2016; Vollmer et al., 2016).

A.3 GAUSSIAN DIFFUSIONS

Consider a stochastic linear differential equation

$$d U_t = h_\theta(\bar\theta_t)\, U_t\, dt + R^{1/2}(\bar\theta_t)\, d W_t, \qquad (13)$$

where $U$ is an $m$-dimensional random vector, $h_\theta := \frac{d}{d\theta} h(\theta)$, $R(\theta) := \sum_{k=-\infty}^{\infty} \mathrm{Cov}_\theta(H(\theta, x_k), H(\theta, x_0))$ is a positive definite matrix that depends on $\theta(\cdot)$, and $W \in \mathbb R^m$ is a standard Brownian motion. Given a large enough $t$ such that $\bar\theta_t$ converges to a fixed point $\hat\theta_\star$ sufficiently fast, we may write the diffusion associated with Eq. (13) as

$$U_t \approx e^{t h_\theta(\hat\theta_\star)} U_0 + \int_0^t e^{(t-s)\, h_\theta(\hat\theta_\star)}\, R^{1/2}(\hat\theta_\star)\, d W_s. \qquad (14)$$

Suppose that the matrix $h_\theta(\hat\theta_\star)$ is negative definite; then $U_t$ converges in distribution to a Gaussian variable with

$$E[U_t] = e^{t h_\theta(\hat\theta_\star)}\, U_0, \qquad \mathrm{Var}(U_t) = \int_0^t e^{u h_\theta(\hat\theta_\star)}\, R\, e^{u h_\theta^\top(\hat\theta_\star)}\, du.$$

The main goal of this supplementary file is to establish the Gaussian approximation of the process $\omega_k^{-1/2}(\theta_k - \hat\theta_\star)$ by the solution of Eq. (14) for a proper step size $\omega_k$; the advantage of the interacting mechanism then follows naturally.

B STABILITY AND CONVERGENCE ANALYSIS

As required by the algorithm, we update $P$ contour stochastic gradient Langevin dynamics (CSGLD) chains simultaneously. For notation, we denote the particle of the $p$-th chain at iteration $k$ by $x_k^{(p)} \in \mathcal X \subseteq \mathbb R^d$ and the joint state of the $P$ parallel particles at iteration $k$ by $x_k^{\otimes P} := \big(x_k^{(1)}, x_k^{(2)}, \ldots, x_k^{(P)}\big) \in \mathcal X^{\otimes P} \subseteq \mathbb R^{dP}$. We also denote the learning rate and step size at iteration $k$ by $\epsilon_k$ and $\omega_k$, respectively. We denote by $N(0, I_{dP})$ a standard $dP$-dimensional Gaussian vector and by $\zeta$ a positive hyperparameter.

B.1 ICSGLD ALGORITHM

First, we introduce the interacting contour stochastic gradient Langevin dynamics (ICSGLD) with $P$ parallel chains:

(S1) Simulate $x_{k+1}^{\otimes P} = x_k^{\otimes P} - \epsilon_k \nabla_x \widetilde L\big(x_k^{\otimes P}, \theta_k\big) + N(0, 2\epsilon_k \tau I_{dP})$;
(S2) Optimize $\theta_{k+1} = \theta_k + \omega_{k+1} \widehat H\big(\theta_k, x_{k+1}^{\otimes P}\big)$,

where $\nabla_x \widetilde L(x^{\otimes P}, \theta) := \big(\nabla_x \widetilde L(x^{(1)}, \theta), \nabla_x \widetilde L(x^{(2)}, \theta), \ldots, \nabla_x \widetilde L(x^{(P)}, \theta)\big)$ and $\nabla_x \widetilde L(x, \theta)$ is the stochastic adaptive gradient

$$\nabla_x \widetilde L(x, \theta) = \frac{N}{n}\underbrace{\Big[1 + \frac{\zeta\tau}{\Delta u}\Big(\log\theta\big(J_{\widetilde U}(x)\big) - \log\theta\big((J_{\widetilde U}(x) - 1) \vee 1\big)\Big)\Big]}_{\text{gradient multiplier}}\,\nabla_x \widetilde U(x). \qquad (15)$$

In particular, the interacting random-field function is

$$\widehat H\big(\theta_k, x_{k+1}^{\otimes P}\big) = \frac{1}{P}\sum_{p=1}^{P} \widetilde H\big(\theta_k, x_{k+1}^{(p)}\big), \qquad (16)$$

where each random-field function $\widetilde H(\theta, x) = \big(\widetilde H_1(\theta, x), \ldots, \widetilde H_m(\theta, x)\big)$ follows

$$\widetilde H_i(\theta, x) = \theta\big(J_{\widetilde U}(x)\big)\Big(1_{i = J_{\widetilde U}(x)} - \theta(i)\Big), \quad i = 1, 2, \ldots, m. \qquad (17)$$

Here $J_{\widetilde U}(x)$ denotes the index $i \in \{1, 2, \ldots, m\}$ such that $u_{i-1} < \frac{N}{n}\widetilde U(x) \le u_i$ for a set of energy partitions $\{u_i\}_{i=0}^m$, and $\widetilde U(x) = \sum_{i \in B} U_i(x)$, where $U_i$ denotes the negative log-posterior based on the $i$-th data point and $B$ denotes a mini-batch of data of size $n$. Note that the stochastic energy estimator $\widetilde U(x)$ results in a biased estimate of the partition index $J_{\widetilde U}(x)$ due to the non-linear transformation. To remove this bias asymptotically with respect to the learning rate $\epsilon_k$, we may consider a variance-reduced energy estimator $\widetilde U_{\mathrm{VR}}(x)$ following Deng et al. (2021a),

$$\widetilde U_{\mathrm{VR}}(x) = \frac{N}{n}\sum_{i \in B}\Big(U_i(x) - U_i\big(x^{q\lfloor k/q \rfloor}\big)\Big) + \sum_{i=1}^{N} U_i\big(x^{q\lfloor k/q \rfloor}\big),$$

where the control variate $x^{q\lfloor k/q \rfloor}$ is updated every $q$ iterations.
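Before turning to implementation details, a compact NumPy sketch of steps (S1)-(S2) may help. It assumes a generic gradient/energy oracle standing in for the $(N/n)$-scaled stochastic estimators; the step-size schedule, partition bounds, double-well target, and the simplex projection at the end are illustrative choices of this sketch, not the paper's settings.

```python
import numpy as np

def icsgld(grad_U, energy_U, x0, m, u_min, u_max, zeta, tau, eps, n_iter, P, rng):
    """Sketch of (S1)-(S2): P chains share one self-adapting vector theta."""
    du = (u_max - u_min) / m
    x = np.stack([x0 + rng.normal(size=np.shape(x0)) for _ in range(P)])  # P particles
    theta = np.full(m, 1.0 / m)
    for k in range(1, n_iter + 1):
        omega = min(0.01, 10.0 / (k**0.6 + 100.0))   # illustrative step-size schedule
        H_hat = np.zeros(m)
        for p in range(P):                           # (S1): Langevin step per chain
            J = int(np.clip((energy_U(x[p]) - u_min) // du, 0, m - 1))
            mult = 1.0 + zeta * tau * (np.log(theta[J]) - np.log(theta[max(J - 1, 0)])) / du
            x[p] = x[p] - eps * mult * grad_U(x[p]) \
                   + np.sqrt(2.0 * eps * tau) * rng.normal(size=np.shape(x0))
            e_J = np.zeros(m); e_J[J] = 1.0
            H_hat += theta[J] * (e_J - theta)        # random-field function, Eq. (17)
        theta = theta + omega * H_hat / P            # (S2) with Eq. (16)
        theta = np.clip(theta, 1e-10, None); theta /= theta.sum()  # safeguard of this sketch
    return x, theta

# Toy usage on a double-well energy U(x) = (x^2 - 1)^2 (illustrative):
rng = np.random.default_rng(0)
U = lambda z: float((z[0]**2 - 1.0)**2)
gU = lambda z: 4.0 * z * (z**2 - 1.0)
xs, theta = icsgld(gU, U, np.zeros(1), m=20, u_min=0.0, u_max=4.0,
                   zeta=0.75, tau=1.0, eps=1e-3, n_iter=5000, P=5, rng=rng)
print(theta.round(3))
```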
Compared with the naïve parallelism of CSGLD, a key feature of the ICSGLD algorithm lies in the joint estimation of the interacting random-field function $\widehat H(\theta, x^{\otimes P})$ in Eq. (16) for the same mean-field function $h(\theta)$.

B.1.1 DISCUSSION OF THE HYPERPARAMETERS

The most important hyperparameter is $\zeta$. A well-tuned $\zeta$ usually leads to a small, or even slightly negative, learning rate in low-energy regions so as to avoid local traps. Theoretically, $\zeta$ affects the L² convergence rate hidden in the big-O notation of Lemma 3. The other hyperparameters are easy to tune. For example, the ResNet models yield full losses ranging from 10,000 to 60,000 after warm-up, so we evenly partition the sample space into 200 subregions according to the energy, without tuning; and since the SA optimization is nearly convex, tuning $\{\omega_k\}$ is much easier than tuning $\{\epsilon_k\}$ for non-convex learning.

B.1.2 DISCUSSION OF DISTRIBUTED COMPUTING AND COMMUNICATION COST

In shared-memory settings, the implementation is trivial and the details are omitted. In distributed-memory settings, $\theta_{k+1}$ is updated by the central node as follows:

- The $p$-th worker conducts the sampling step (S1) and sends its index $J_{\widetilde U}(x_{k+1}^{(p)})$ to the central node;
- The central node aggregates the indices from all workers and updates $\theta_k$ based on (S2);
- The central node sends $\theta_{k+1}$ back to each worker.

We emphasize that we do not communicate the model parameters $x \in \mathbb R^d$, but only the self-adapting parameter $\theta \in \mathbb R^m$, where $m \ll d$. For example, WRN-16-8 has 11M parameters (40 MB), while $\theta$ can be set to dimension 200 with a size of 4 KB; hence the communication cost is not a big issue. Moreover, the theoretical advantage still holds if the communication frequency is slightly reduced.

B.1.3 SCALABILITY TO BIG DATA

Recall that the adaptive sampler follows

$$x_{k+1} = x_k - \epsilon_{k+1}\frac{N}{n}\underbrace{\Big[1 + \frac{\zeta\tau}{\Delta u}\Big(\log\theta_k\big(J_{\widetilde U}(x_k)\big) - \log\theta_k\big((J_{\widetilde U}(x_k) - 1) \vee 1\big)\Big)\Big]}_{\text{gradient multiplier}}\,\nabla_x \widetilde U(x_k) + \sqrt{2\tau\epsilon_{k+1}}\, w_{k+1}.$$

The key to the success of (I)CSGLD is to generate sufficiently strong bouncy moves (a negative gradient multiplier) to escape local traps. To this end, $\zeta$ can be tuned to generate proper bouncy moves. Take the CIFAR100 experiments for example: the self-adjusting mechanism fails if the gradient multiplier uniformly equals 1, and a too-small value such as $\zeta = 1$ leads to exactly this issue; the mechanism works only if we choose a large enough $\zeta$, such as 3e6, to generate the desired negative gradient multipliers in over-visited regions.

However, with $\zeta = $ 3e6, the original stochastic approximation (SA) update proposed in Deng et al. (2020b) follows

$$\theta_{k+1}(i) = \theta_k(i) + \omega_{k+1}\,\underbrace{\theta_k^{\zeta}\big(J_{\widetilde U}(x_{k+1})\big)}_{\text{essentially } 0 \text{ for } \zeta \gg 1}\Big(1_{i = J_{\widetilde U}(x_{k+1})} - \theta_k(i)\Big).$$

Since $\theta(i) < 1$ for any $i \in \{1, \ldots, m\}$, $\theta^{\zeta}(i)$ is essentially 0 for such a large $\zeta$, which means that the original SA fails to optimize when $\zeta$ is large. The limited choice of $\zeta$ therefore inevitably limits the scalability to big data problems. Our newly proposed SA scheme

$$\theta_{k+1}(i) = \theta_k(i) + \omega_{k+1}\,\underbrace{\theta_k\big(J_{\widetilde U}(x_{k+1})\big)}_{\text{independent of } \zeta}\Big(1_{i = J_{\widetilde U}(x_{k+1})} - \theta_k(i)\Big)$$

is more independent of $\zeta$ and converges to the much smoother equilibrium $\theta_*^{1/\zeta}$ instead of $\theta_*$, where $\theta_*(i) = \int_{\mathcal X_i}\pi(x)\, dx$ is the probability mass of the $i$-th energy partition (the energy PDF). As such, although linear stability is sacrificed, the resulting algorithm is far more scalable. For example, estimating $e^{-10{,}000/\zeta}$ is numerically much easier than estimating $e^{-10{,}000}$ for a large $\zeta$ such as 10,000, where energies of order 10,000 are induced by the high losses of deep neural networks trained on big data.
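The last point can be checked in two lines (the energy value is illustrative):

```python
import numpy as np

# With energy U = 10,000 (typical DNN loss scale), the raw weight e^{-U} underflows
# double precision, while the zeta-th root e^{-U/zeta} stays comfortably representable.
U, zeta = 10_000.0, 10_000.0
print(np.exp(-U))         # 0.0 -- underflows (smallest normal double ~1e-308)
print(np.exp(-U / zeta))  # 0.3678794... = e^-1, perfectly representable
```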
B.2 ASSUMPTIONS

A long-standing difficulty for stochastic approximation is establishing the stability property; a practical remedy is to study $\Theta$ on a fixed compact set.

Assumption A1 (Compactness) The space $\Theta$ is compact and $\inf_{\Theta}\theta(i) > 0$ for any $i \in \{1, 2, \ldots, m\}$.

For weaker assumptions, we refer readers to Theorem 3.2 of Fort et al. (2015), where a recurrence property is proved for the Metropolis-based Wang-Landau algorithm, which eventually establishes that the estimates return to a desired compact set often enough.

Next, we lay out the smoothness assumption, which is standard in the convergence analysis of SGLD; see, e.g., Mattingly et al. (2010), Raginsky et al. (2017), and Xu et al. (2018).

Assumption A2 (Smoothness) $U(x)$ is $M$-smooth; that is, there exists a positive constant $M$ such that for all $x, x' \in \mathcal X$,

$$\|\nabla_x U(x) - \nabla_x U(x')\| \le M \|x - x'\|. \qquad (19)$$

In addition, we assume a dissipativity condition to ensure the geometric ergodicity of the dynamical system. This assumption is also crucial for verifying the properties of the solution of Poisson's equation. Similar assumptions have been made in Mattingly et al. (2002), Raginsky et al. (2017), and Xu et al. (2018).

Assumption A3 (Dissipativity) There exist constants $\underline m > 0$ and $b \ge 0$ such that for all $x \in \mathcal X$ and any $\theta \in \Theta$,

$$\langle \nabla_x L(x, \theta), x\rangle \le b - \underline m \|x\|^2. \qquad (20)$$

To further establish a bounded second moment on $x \in \mathcal X$ with respect to a proper Lyapunov function $V(x)$, we impose the following condition on the gradient noise:

Assumption A4 (Gradient noise) The stochastic gradient in mini-batch settings is an unbiased estimator, i.e., $E[\nabla_x \widetilde U(x_k) - \nabla_x U(x_k)] = 0$; furthermore, for some positive constants $M$ and $B$,

$$E\big[\|\nabla_x \widetilde U(x_k) - \nabla_x U(x_k)\|^2\big] \le M^2 \|x\|^2 + B^2,$$

where $E[\cdot]$ acts on the distribution of the noise in the stochastic gradient $\nabla_x \widetilde U(x_k)$.

B.3 LOCAL STABILITY VIA THE SCALABLE RANDOM-FIELD FUNCTION

Now we are ready to present our first result. Lemma 3 establishes a local stability condition for the non-linear mean-field system of ICSGLD, which implies convergence of $\theta_k$ to a unique fixed point that adapts to a wide energy range under mild assumptions.

Lemma 3 (Local stability; restatement of Lemma 1) Assume Assumptions A1-A4 hold. Given a small enough learning rate $\epsilon$, a large enough $m$ and batch size $n$, and any $\theta \in \widetilde\Theta$, where $\widetilde\Theta$ is a small neighborhood of $\theta_\star$ that contains $\hat\theta_\star$, we have

$$\langle h(\theta), \theta - \hat\theta_\star\rangle \le -\varphi\, \|\theta - \hat\theta_\star\|^2,$$

where $\hat\theta_\star = \theta_\star + O(\varepsilon)$ with $\varepsilon = O\big(\sup_x \mathrm{Var}(\xi_n(x)) + \epsilon + \frac{1}{m}\big)$,

$$\theta_\star = \Bigg(\frac{\big(\int_{\mathcal X_1}\pi(x)\,dx\big)^{1/\zeta}}{\sum_{k=1}^m \big(\int_{\mathcal X_k}\pi(x)\,dx\big)^{1/\zeta}},\ \ldots,\ \frac{\big(\int_{\mathcal X_m}\pi(x)\,dx\big)^{1/\zeta}}{\sum_{k=1}^m \big(\int_{\mathcal X_k}\pi(x)\,dx\big)^{1/\zeta}}\Bigg),$$

$\varphi = \inf_\theta \min_i \widehat Z_{\zeta,\theta}^{-1}(i) - O(\varepsilon) > 0$, where $\widehat Z_{\zeta,\theta}(i)$ is defined below Eq. (28), $\xi_n(x)$ denotes the noise in the energy estimator $\widetilde U(x)$ of batch size $n$, and $\mathrm{Var}(\cdot)$ denotes the variance.

Proof The random-field function $\widetilde H_i(\theta, x) = \theta(J_{\widetilde U}(x))\big(1_{i = J_{\widetilde U}(x)} - \theta(i)\big)$, based on the stochastic energy estimator $\widetilde U(x)$, is a biased estimator of $H_i(\theta, x) = \theta(J(x))\big(1_{i = J(x)} - \theta(i)\big)$ for any $i \in \{1, 2, \ldots, m\}$, where $J(\cdot)$ is the exact energy partition index. By Lemma 4, the bias caused by the stochastic energy is of order $O(\mathrm{Var}(\xi_n(x)))$. Now we compute the mean-field function $h(\theta)$ with respect to the measure $\varpi_\theta(x)$ simulated by SGLD:

$$h_i(\theta) = \int_{\mathcal X} \widetilde H_i(\theta, x)\,\varpi_\theta(x)\,dx = \int_{\mathcal X} H_i(\theta, x)\,\varpi_\theta(x)\,dx + O\big(\mathrm{Var}(\xi_n(x))\big)$$

$$= \int_{\mathcal X} H_i(\theta, x)\Big[\underbrace{\varpi_{\widetilde\Psi_\theta}(x)}_{I_1} + \underbrace{\big(\varpi_{\Psi_\theta}(x) - \varpi_{\widetilde\Psi_\theta}(x)\big)}_{I_2:\ \text{piecewise approximation}} + \underbrace{\big(\varpi_\theta(x) - \varpi_{\Psi_\theta}(x)\big)}_{I_3:\ \text{numerical discretization}}\Big]\,dx + O\big(\mathrm{Var}(\xi_n(x))\big),$$

where $\varpi_\theta$ is the invariant measure simulated via SGLD, which approximates $\varpi_{\Psi_\theta}(x)$.
Here $\varpi_{\Psi_\theta}(x) \propto \frac{\pi(x)}{\Psi_\theta^{\zeta}(U(x))}$ and $\varpi_{\widetilde\Psi_\theta}(x) \propto \frac{\pi(x)}{\widetilde\Psi_\theta^{\zeta}(U(x))}$ are two invariant measures, where $\Psi_\theta(u)$ and $\widetilde\Psi_\theta(u)$ are the piecewise continuous and piecewise constant functions

$$\Psi_\theta(u) = \sum_{k=1}^m \theta(k-1)\, e^{(\log\theta(k) - \log\theta(k-1))\frac{u - u_{k-1}}{\Delta u}}\, 1_{u_{k-1} < u \le u_k}, \qquad \widetilde\Psi_\theta(u) = \sum_{k=1}^m \theta(k)\, 1_{u_{k-1} < u \le u_k},$$

respectively. The desired conclusion $\varphi > 0$ then follows from the compactness assumption A1 and a small enough $\varepsilon = O\big(\sup_x \mathrm{Var}(\xi_n(x)) + \epsilon + \frac{1}{m}\big)$.

Remark 1 The newly proposed random-field function Eq. (17) may sacrifice global stability by inducing the approximately linear mean-field system Eq. (28) instead of a linearly stable system (see formula (15) in Deng et al. (2020b)). The advantage, however, is that this mechanism greatly facilitates the estimation of $\theta_\star$. We emphasize that the original energy probabilities of the partitions, $\big\{\int_{\mathcal X_k}\pi(x)\,dx\big\}_{k=1}^m$ (Deng et al., 2020b), may be very difficult to estimate in big data problems. By contrast, the estimation of $\big\{\big(\int_{\mathcal X_k}\pi(x)\,dx\big)^{1/\zeta}\big\}_{k=1}^m$ becomes much easier given a proper $\zeta > 0$.

Technical lemmas

Lemma 4 The stochastic energy estimator $\widetilde U(x)$ leads to a controllable bias in the random-field function:

$$\big|E[\widetilde H_i(\theta, x)] - H_i(\theta, x)\big| = O\big(\mathrm{Var}(\xi_n(x))\big),$$

where the expectation $E[\cdot]$ is taken with respect to the random noise in the stochastic energy estimator $\widetilde U(\cdot)$.

Proof Denote the noise in the stochastic energy estimator by $\xi(x)$, so that $\widetilde U(\cdot) = U(\cdot) + \xi(\cdot)$. Recall that $\widetilde H_i(\theta, x) = \theta(J_{\widetilde U}(x))\big(1_{i = J_{\widetilde U}(x)} - \theta(i)\big)$ and that $J_{\widetilde U}(x) \in \{1, 2, \ldots, m\}$ satisfies $u_{J_{\widetilde U}(x) - 1} < \frac{N}{n}\widetilde U(x) \le u_{J_{\widetilde U}(x)}$ for a set of energy partitions $\{u_i\}_{i=0}^m$. We can interpret $\widetilde H_i(\theta, x)$ as a non-linear transformation $\Phi$ that maps $\widetilde U(x)$ to $(0, 1)$; similarly, $H_i(\theta, x)$ maps $U(x)$ to $(0, 1)$. The bias of the random-field function is then upper bounded as follows:

$$\big|E[\widetilde H_i(\theta, x)] - H_i(\theta, x)\big| = \Big|\int \big(\Phi(U(x) + \xi(x)) - \Phi(U(x))\big)\, d\mu(\xi(x))\Big|$$

$$= \Big|\int \Big(\xi_n(x)\,\Phi'(U(x)) + \frac{\xi_n(x)^2}{2}\,\Phi''(u)\Big)\, d\mu(\xi_n(x))\Big| = \Big|\Phi''(u)\,\int \frac{\xi_n(x)^2}{2}\, d\mu(\xi_n(x))\Big| = O\big(\mathrm{Var}(\xi_n(x))\big),$$

where $\mu(\xi(x))$ is the probability measure associated with $\xi(x)$; the second equality follows from a Taylor expansion at some energy $u$; the third equality follows because the stochastic energy estimator is unbiased, so the first-order term $\int \xi_n(x)\,\Phi'(U(x))\, d\mu(\xi_n(x))$ vanishes; $\Phi'(U(x)) = O\big(\frac{\theta(J(x)) - \theta(J(x)-1)}{\Delta u}\big)$ is clearly bounded by the definition of $\theta$, and a similar conclusion applies to $\Phi''(\cdot)$; the last bound follows by applying the Cauchy-Schwarz inequality.

(Footnote: a small change of $\theta$ does not significantly affect the perturbations caused by Eq. (24), Eq. (25), and $\mathrm{Var}(\xi_n(\cdot))$.)

B.4 CONVERGENCE OF THE SELF-ADAPTING PARAMETERS

The following is a restatement of Lemma 3.2 of Raginsky et al. (2017), which holds for any $\theta$ in the compact space $\Theta$.

Lemma 5 (Uniform L² bounds) Assume Assumptions A1, A3, and A4 hold. Given a small enough learning rate, the second moment is bounded: $\sup_{k \ge 1} E[\|x_k\|^2] < \infty$.

The following lemma justifies the regularity properties of Poisson's equation, which is crucial for controlling the perturbations of the stochastic approximation process. A first version was given in Lemma B2 of Deng et al. (2020b); here we give a more detailed proof by utilizing the Lyapunov function $V(x) = 1 + \|x\|^2$ and Lemma 5.

Lemma 6 (Solution of Poisson's equation) Assume Assumptions A1-A4 hold. There is a solution $\mu_\theta(\cdot)$ on $\mathcal X$ to the Poisson's equation

$$\mu_\theta(x) - \Pi_\theta \mu_\theta(x) = \widetilde H(\theta, x) - h(\theta). \qquad (30)$$

Furthermore, there exists a constant $C$ such that for all $\theta, \theta' \in \Theta$,

$$E[\|\Pi_\theta \mu_\theta(x)\|] \le C, \qquad E[\|\Pi_\theta \mu_\theta(x) - \Pi_{\theta'} \mu_{\theta'}(x)\|] \le C \|\theta - \theta'\|. \qquad (31)$$

Proof The existence and the regularity property of the solution of Poisson's equation can be used to control the perturbations. The key of the proof lies in verifying the drift conditions proposed in Section 6 of Andrieu et al. (2005).
(DRI) By the smoothness assumption A2, $U(x)$ is continuously differentiable almost everywhere. By the dissipativity assumption A3 and Theorem 2.1 of Roberts & Tweedie (1996), the discrete dynamical system is irreducible and aperiodic. Now consider the Lyapunov function $V = 1 + \|x\|^2$ and any compact subset $K \subseteq \Theta$; the drift conditions are verified as follows.

(DRI1) Given small enough learning rates $\{\epsilon_k\}_{k \ge 1}$, the smoothness assumption A2, and the dissipativity assumption A3, applying Corollary 7.5 of Mattingly et al. (2002) yields the minorization condition for the CSGLD algorithm; i.e., there exist $\eta > 0$, a measure $\nu$ with $\nu(C) = 1$, and a set $C$ such that

$$P_{\theta \in K}(x, A) \ge \eta\, \nu(A) \quad \forall A \subseteq \mathcal X,\ x \in C, \qquad \text{(I)}$$

where $P_\theta(x, y) := \frac{1}{(4\pi\epsilon)^{d/2}}\, E\Big[e^{-\frac{\|y - x + \epsilon \nabla_x \widetilde L(x, \theta)\|^2}{4\epsilon}}\ \Big|\ x\Big]$ denotes the transition kernel of CSGLD with parameter $\theta \in K$ and learning rate $\epsilon$, and the expectation is taken over the adaptive gradient $\nabla_x \widetilde L(x, \theta)$ in Eq. (15). Using Assumptions A1-A4, we can prove the uniform L² upper bound by following Lemma 3.2 of Raginsky et al. (2017). Further, by Theorem 7.2 of Mattingly et al. (2002), there exist $\alpha \in (0, 1)$ and $\beta \ge 0$ such that

$$P_{\theta \in K}\, V(x) \le \alpha V(x) + \beta. \qquad \text{(II)}$$

With the Lyapunov function $V = 1 + \|x\|^2$ and the constant $\kappa = \alpha + \beta$, it follows that

$$P_{\theta \in K}\, V(x) \le \kappa V(x). \qquad \text{(III)}$$

We have now verified the first drift condition (DRI1) by checking conditions (I), (II), and (III).

(DRI2) Next, we check the boundedness and Lipschitz conditions on the random-field function $\widetilde H(\theta, x)$, whose components are $\widetilde H_i(\theta, x) = \theta(J_{\widetilde U}(x))\big(1_{i = J_{\widetilde U}(x)} - \theta(i)\big)$. Recalling $V = 1 + \|x\|^2$, the compactness assumption A1 directly yields

$$\sup_{\theta \in K \subseteq [0,1]^m} \|\widetilde H(\theta, x)\| \le \sqrt m\, V(x). \qquad \text{(IV)}$$

For any $\theta_1, \theta_2 \in K$ and a fixed $x \in \mathcal X$, it suffices to verify the $i$-th component, namely the index that maximizes $|\theta_1(i) - \theta_2(i)|$. Writing $j := J_{\widetilde U}(x)$,

$$\big|\widetilde H_i(\theta_1, x) - \widetilde H_i(\theta_2, x)\big| = \big|\theta_1(j)\big(1_{i=j} - \theta_1(i)\big) - \theta_2(j)\big(1_{i=j} - \theta_2(i)\big)\big|$$
$$\le |\theta_1(j) - \theta_2(j)| + |\theta_1(j)\theta_1(i) - \theta_2(j)\theta_2(i)|$$
$$\le |\theta_1(j) - \theta_2(j)| + \theta_1(j)\,|\theta_1(i) - \theta_2(i)| + |\theta_1(j) - \theta_2(j)|\,\theta_2(i)$$
$$\le 3\,|\theta_1(i) - \theta_2(i)|, \qquad \text{(V)}$$

where the last inequality holds since $\theta(i) \in (0, 1]$ for any $i \le m$.

(DRI3) We proceed to verify the smoothness of the transition kernel $P_\theta(x, y)$ with respect to $\theta$. For any $\theta_1, \theta_2 \in K$ and fixed $x$ and $y$, we have

$$|P_{\theta_1}(x, y) - P_{\theta_2}(x, y)| = \frac{1}{(4\pi\epsilon)^{d/2}}\,\Big|E\big[e^{-\frac{\|y - x + \epsilon\nabla_x\widetilde L(x,\theta_1)\|^2}{4\epsilon}}\big] - E\big[e^{-\frac{\|y - x + \epsilon\nabla_x\widetilde L(x,\theta_2)\|^2}{4\epsilon}}\big]\Big|$$
$$\lesssim \Big|\|y - x + \epsilon\nabla_x\widetilde L(x, \theta_1)\|^2 - \|y - x + \epsilon\nabla_x\widetilde L(x, \theta_2)\|^2\Big| \lesssim \|\nabla_x\widetilde L(x, \theta_1) - \nabla_x\widetilde L(x, \theta_2)\| \lesssim \|\theta_1 - \theta_2\|,$$

where the first inequality (up to a finite constant) follows from $|e^{-x} - e^{-y}| \le |x - y|$ for any $x, y$ in a compact space, and the last inequality follows from the definition of the adaptive gradient in Eq. (15) together with $|\log(x) - \log(y)| \lesssim |x - y|$ under the compactness assumption A1.

For $f: \mathcal X \to \mathbb R^d$, define the norm $\|f\|_V = \sup_{x \in \mathcal X} \frac{\|f(x)\|}{V(x)}$. Following the same technique as in Liang et al. (2007) (page 319), we can verify the last drift condition:

$$\|P_{\theta_1} f - P_{\theta_2} f\|_V \le C\, \|f\|_V\, \|\theta_1 - \theta_2\|, \quad \forall f \in \mathcal L_V := \{f: \mathcal X \to \mathbb R^d,\ \|f\|_V < \infty\}. \qquad \text{(VI)}$$

Having verified conditions (I), (II), and (VI), we are now able to establish the drift conditions proposed in Section 6 of Andrieu et al. (2005).

Before we present the L² convergence of $\theta_k$, we make some extra assumptions on the learning rate and step size.

Assumption A5 (Learning rate and step size) The learning rate $\{\epsilon_k\}_{k \in \mathbb N}$ is a positive non-increasing sequence of real numbers satisfying

$$\lim_{k \to \infty} \epsilon_k = 0, \qquad \sum_{k=1}^{\infty} \epsilon_k = +\infty.$$

The step size $\{\omega_k\}_{k \in \mathbb N}$ is a positive non-increasing sequence of real numbers such that

$$\lim_{k \to \infty} \omega_k = 0, \qquad \sum_{k=1}^{\infty} \omega_k = +\infty, \qquad \sum_{k=1}^{\infty} \omega_k^2 < +\infty. \qquad (32)$$
A practical strategy is to set $\omega_k := O(k^{-\alpha})$, which satisfies the above conditions for any $\alpha \in (0.5, 1]$.

The following is an application of Theorem 24 (page 246) of Benveniste et al. (1990), given the stability condition (Lemma 3).

Lemma 7 (L² convergence rate; restatement of Lemma 2) Assume Assumptions A1-A5 hold. For any $\theta_0 \in \widetilde\Theta \subseteq \Theta$, a large $m$, small learning rates $\{\epsilon_k\}_{k=1}^{\infty}$, and step sizes $\{\omega_k\}_{k=1}^{\infty}$, $\{\theta_k\}_{k=0}^{\infty}$ converges to $\hat\theta_\star$, where $\hat\theta_\star = \theta_\star + O\big(\sup_x \mathrm{Var}(\xi_n(x)) + \sup_{k \ge k_0}\epsilon_k + \frac{1}{m}\big)$ for some $k_0$, such that

$$E\big[\|\theta_k - \hat\theta_\star\|^2\big] = O(\omega_k).$$

The theoretical novelty is that we treat the biased $\hat\theta_\star$ as the equilibrium of the continuous system, instead of analyzing how far we are from $\theta_\star$ in all respects as in Theorem 1 of Deng et al. (2020b). This enables us to apply Theorem 24 (page 246) directly. The result can nevertheless be interpreted as a special case of Theorem 1 of Deng et al. (2020b), except that there are no perturbation terms and the equilibrium is $\hat\theta_\star$ instead of $\theta_\star$.

C GAUSSIAN APPROXIMATION

C.1 PRELIMINARY: SUFFICIENT CONDITIONS FOR WEAK CONVERGENCE

To formally prove the asymptotic normality of the stochastic approximation process $\omega_k^{-1/2}(\theta_k - \hat\theta_\star)$, we first lay out a preliminary result (Theorem 1 of Pelletier (1998)) that provides sufficient conditions for the weak convergence.

Lemma 8 (Sufficient conditions) Consider a stochastic algorithm

$$\theta_{k+1} = \theta_k + \omega_{k+1}\, h(\theta_k) + \omega_{k+1}\, \widetilde\nu_{k+1} + \omega_{k+1}\, e_{k+1},$$

where $\widetilde\nu_{k+1}$ denotes a perturbation and $e_{k+1}$ a random noise. Given the three conditions (C1), (C2), and (C3) defined below, we have the desired weak convergence

$$\omega_k^{-1/2}(\theta_k - \hat\theta_\star) \Rightarrow N(0, \Sigma), \qquad (33)$$

where $\Sigma = \int_0^{\infty} e^{t h_\theta}\, R\, e^{t h_\theta^\top}\, dt$, $R$ denotes the limiting covariance of the martingale, $\lim_k E[e_{k+1} e_{k+1}^\top \mid \mathcal F_k]$, $\mathcal F_k$ is the $\sigma$-algebra of the events up to iteration $k$, $h_\theta = h_\theta(\hat\theta_\star) + \hat\xi I$, and $\hat\xi = \lim_k \frac{\omega_k^{0.5} - \omega_{k+1}^{0.5}}{\omega_k^{1.5}}$.

(C1) There exist an equilibrium point $\hat\theta_\star$ and a stable matrix $h_\theta := h_\theta(\hat\theta_\star) \in \mathbb R^{m \times m}$ such that for any $\theta \in \{\theta: \|\theta - \hat\theta_\star\| \le \widetilde M\}$ for some $\widetilde M > 0$, the mean-field function $h: \mathbb R^m \to \mathbb R^m$ satisfies

$$\|h(\theta) - h_\theta(\theta - \hat\theta_\star)\| \lesssim \|\theta - \hat\theta_\star\|^2.$$

(C2) The step size $\omega_k$ decays at a rate $\alpha \in (0, 1]$ such that $\omega_k = O(k^{-\alpha})$.

(C3) Assumptions on the disturbances: there exist constants $\widetilde M > 0$ and $\widetilde\alpha > 2$ such that

$$E[e_{k+1} \mid \mathcal F_k]\, 1_{\{\|\theta - \hat\theta_\star\| \le \widetilde M\}} = 0, \qquad \text{(I1)}$$
$$\sup_k E\big[\|e_{k+1}\|^{\widetilde\alpha} \mid \mathcal F_k\big]\, 1_{\{\|\theta - \hat\theta_\star\| \le \widetilde M\}} < \infty, \qquad \text{(I2)}$$
$$E\big[\omega_k^{-1}\|\widetilde\nu_{k+1}\|^2\big]\, 1_{\{\|\theta - \hat\theta_\star\| \le \widetilde M\}} \to 0, \qquad \text{(II)}$$
$$E\big[e_{k+1} e_{k+1}^\top \mid \mathcal F_k\big]\, 1_{\{\|\theta - \hat\theta_\star\| \le \widetilde M\}} \to R. \qquad \text{(III)}$$

For example, $\hat\xi = 0$ if $\omega_k = O(k^{-\alpha})$ with $\alpha \in (0.5, 1)$, and $\hat\xi = \frac{1}{2 k_0}$ if $\omega_k = \frac{k_0}{k}$.

Remark 2 By the definition of the mean-field function $h(\theta)$ in Eq. (26), it is easy to verify condition (C1). Moreover, Assumption A5 fulfills condition (C2). The proof then hinges on the verification of condition (C3).

C.2 PRELIMINARY: CONVERGENCE OF THE COVARIANCE ESTIMATORS

In particular, to verify the condition $E[e_{k+1} e_{k+1}^\top \mid \mathcal F_k]\, 1_{\{\|\theta - \hat\theta_\star\| \le \widetilde M\}} \to R$, we study the convergence of the empirical sample mean $E[f(x_k)]$, for a test function $f$, to the posterior expectation $\bar f = \int_{\mathcal X} f(x)\, \varpi_{\hat\theta_\star}(x)\, dx$. Poisson's equation is often used to characterize the fluctuation between $f(x)$ and $\bar f$:

$$\mathcal L g(x) = f(x) - \bar f, \qquad (34)$$

where $\mathcal L$ refers to an infinitesimal generator and $g(x)$ denotes the solution of the Poisson's equation. Similar to the proof of Lemma 6, the existence of the solution of the Poisson's equation has been established in Mattingly et al. (2002) and Vollmer et al. (2016). Moreover, the perturbations of $E[f(x_k)] - \bar f$ are properly bounded given the regularity properties of $g(x)$, whose 0th-, 1st-, and 2nd-order regularity has been established in Erdogdu et al. (2018).
The following result helps us identify the convergence of the covariance estimators; it is adapted from Theorem 5 of Chen et al. (2015) with decreasing learning rates $\{\epsilon_k\}_{k \geq 1}$. The gradient biases from Theorem 2 of Chen et al. (2015) are also included to handle the adaptive biases.

Lemma 9 (Convergence of the Covariance Estimators) Suppose Assumptions A1-A5 hold. For any $\theta_0 \in \widetilde{\Theta} \subset \Theta$, a large $m$, small learning rates $\{\epsilon_k\}_{k=1}^\infty$, step sizes $\{\omega_k\}_{k=1}^\infty$, and any bounded function $f$, we have

$$\mathbb{E}[f(x_k)] - \int_X f(x)\, \varpi_{\widehat{\theta}_\star}(x)\, dx \to 0,$$

where $\varpi_{\widehat{\theta}_\star}(x) \propto \frac{\pi(x)}{\widehat{\theta}_\star^{\zeta}(J(x))}$ is the invariant measure simulated via SGLD that approximates $\varpi_{\Psi_{\widehat{\theta}_\star}}(x)$, and $J(x) = \sum_{i=1}^m i\, 1_{u_{i-1} < U(x) \leq u_i}$ denotes the index of the subregion that the energy $U(x)$ falls into.

Proof We study the single-chain CSGLD and reformulate the adaptive algorithm as follows:

$$x_{k+1} = x_k - \epsilon_k \nabla_x \widetilde{L}(x_k, \theta_k) + N(0, 2\epsilon_k \tau I) = x_k - \epsilon_k \big(\nabla_x \widetilde{L}(x_k, \widehat{\theta}_\star) + \Upsilon(x_k, \theta_k)\big) + N(0, 2\epsilon_k \tau I),$$

where $\nabla_x \widetilde{L}(x, \theta)$ is the adaptive gradient defined in Section B.1, namely $\nabla_x \widetilde{L}(x, \theta) = \frac{N}{n}\Big[1 + \frac{\zeta\tau}{\Delta u}\big(\log\theta(\widetilde{J}(x)) - \log\theta\big((\widetilde{J}(x)-1) \vee 1\big)\big)\Big] \nabla_x \widetilde{U}(x)$, and the bias term is given by $\Upsilon(x_k, \theta_k) = \nabla_x \widetilde{L}(x_k, \theta_k) - \nabla_x \widetilde{L}(x_k, \widehat{\theta}_\star)$. Then, by Jensen's inequality and Lemma 7, we have

$$\mathbb{E}[\|\Upsilon(x_k, \theta_k)\|] = \mathbb{E}\big[\|\nabla_x \widetilde{L}(x_k, \theta_k) - \nabla_x \widetilde{L}(x_k, \widehat{\theta}_\star)\|\big] \lesssim \mathbb{E}[\|\theta_k - \widehat{\theta}_\star\|] \leq \sqrt{\mathbb{E}[\|\theta_k - \widehat{\theta}_\star\|^2]} = O(\sqrt{\omega_k}). \qquad (35)$$

Combining Eq.(35) and Theorem 5 of Chen et al. (2015), we have

$$\Big|\mathbb{E}[f(x_k)] - \int_X f(x)\, \varpi_{\widehat{\theta}_\star}(x)\, dx\Big| = O\bigg(\frac{1}{\sum_{i=1}^k \epsilon_i} + \frac{\sum_{i=1}^k \omega_i\, \mathbb{E}[\|\Upsilon(x_i, \theta_i)\|]}{\sum_{i=1}^k \omega_i} + \frac{\sum_{i=1}^k \epsilon_i^2}{\sum_{i=1}^k \epsilon_i}\bigg) \to 0,$$

where the last argument directly follows from the conditions on the learning rates and step sizes in Assumption A5.

C.3 ASYMPTOTIC NORMALITY

We now verify condition (C3) for the stochastic approximation process.

(I) For the martingale difference noise $e_{k+1} = \mu_{\theta_k}(x_{k+1}) - \Pi_{\theta_k}\mu_{\theta_k}(x_k)$ with mean 0, we have

$$\mathbb{E}[e_{k+1} | \mathcal{F}_k] = 0, \quad \sup_{k \geq 0} \mathbb{E}\big[\|e_{k+1}\|^{\widetilde{\alpha}} | \mathcal{F}_k\big] < \infty. \qquad (\mathrm{I})$$

(II) By the definition of $h(\theta_k)$ in Eq.(26), we can easily check that $h(\theta_k)$ is Lipschitz continuous in a neighborhood of $\widehat{\theta}_\star$. Combining Eq.(31), we have

$$\|h(\theta_k) - h(\bar{\theta}_k)\| = O(\|\theta_k - \bar{\theta}_k\|) = O(\|\omega_{k+1} \Pi_{\theta_k}\mu_{\theta_k}(x_k)\|) = O(\omega_{k+1}).$$

Then $\mathbb{E}[\|\widetilde{\nu}_{k+1}\|] \leq C\|\theta_k - \bar{\theta}_k\| + O(\omega_{k+2}) = O(\omega_{k+1})$ by the step size condition Eq.(32). In what follows, we can verify

$$\omega_k^{-1}\, \mathbb{E}\big[\|h(\theta_k) - h(\bar{\theta}_k)\|^2\big] = O(\omega_k) \to 0. \qquad (\mathrm{II})$$

(III) For the martingale difference noise $e_{k+1}$, we have

$$\mathbb{E}[e_{k+1} e_{k+1}^\top | \mathcal{F}_k] = \mathbb{E}\big[\mu_{\theta_k}(x_{k+1})\mu_{\theta_k}(x_{k+1})^\top | \mathcal{F}_k\big] - \Pi_{\theta_k}\mu_{\theta_k}(x_k)\, \Pi_{\theta_k}\mu_{\theta_k}(x_k)^\top.$$

Denote $\mathbb{E}[e_{k+1} e_{k+1}^\top | \mathcal{F}_k]$ by a function $f(x_k)$. Applying Lemma 9, we have

$$\mathbb{E}[e_{k+1} e_{k+1}^\top | \mathcal{F}_k] = \mathbb{E}[f(x_k)] \to \int f(x)\, \varpi_{\widehat{\theta}_\star}(x)\, dx =: R, \qquad (\mathrm{III})$$

where $R := R(\widehat{\theta}_\star)$ and $R(\theta)$ is also equivalent to $\sum_{k=-\infty}^{\infty} \mathrm{Cov}_\theta\big(H(\theta, x_k), H(\theta, x_0)\big)$.

Having conditions (C1), (C2), and (C3) verified, we apply Lemma 8 and obtain the weak convergence of $\bar{\theta}_k$:

$$\omega_k^{-1/2}(\bar{\theta}_k - \widehat{\theta}_\star) \Rightarrow N(0, \Sigma),$$

where $\Sigma = \int_0^\infty e^{t h_\theta} R\, e^{t h_\theta^\top} dt$ and $h_\theta = \nabla_\theta h(\widehat{\theta}_\star) + \widehat{\xi} I$, $\widehat{\xi} = \lim_{k\to\infty}\frac{\omega_k^{0.5} - \omega_{k+1}^{0.5}}{\omega_{k+1}^{1.5}}$. Considering the definition $\bar{\theta}_k = \theta_k + \omega_{k+1}\Pi_{\theta_k}\mu_{\theta_k}(x_k)$ and the fact that $\mathbb{E}[\|\Pi_{\theta_k}\mu_{\theta_k}(x_k)\|]$ is uniformly bounded by Eq.(31), we have $\omega_k^{1/2}\Pi_{\theta_k}\mu_{\theta_k}(x_k) \to 0$ in probability. By Slutsky's theorem, we eventually have the desired result

$$\omega_k^{-1/2}(\theta_k - \widehat{\theta}_\star) \Rightarrow N(0, \Sigma),$$

where the step size $\omega_k$ decays with an order $\alpha \in (0.5, 1]$ such that $\omega_k = O(k^{-\alpha})$.

D MORE ON EXPERIMENTS

D.1 MODE EXPLORATION ON MNIST VIA THE SCALABLE RANDOM-FIELD FUNCTION

For the network structure, we follow Jarrett et al. (2009) and choose a standard convolutional neural network (CNN). The CNN has two convolutional (conv) layers and two fully-connected (FC) layers. The two conv layers have 32 and 64 feature maps, respectively; the FC layers both have 50 hidden nodes, and the network has 5 outputs. A large batch size of 2500 is selected to reduce both the gradient noise and the stochastic approximation bias. We fix $\zeta = 3\mathrm{e}4$ and a weight decay of 25. For simplicity, we choose 100,000 partitions with energy bandwidth $\Delta u = 10$. The step size follows $\omega_k = \min\{0.01, \frac{1}{k^{0.6} + 100}\}$.
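To make the architecture concrete, here is a minimal PyTorch sketch of the CNN described above; the kernel sizes, pooling, and our reading of "two FC layers with 50 hidden nodes" as two 50-unit hidden layers before a 5-way output are assumptions, since the text does not specify them.

```python
import torch.nn as nn

class MNISTCNN(nn.Module):
    """Sketch: two conv layers with 32 and 64 feature maps, two 50-unit
    FC layers, and 5 outputs. Kernel sizes and pooling are assumptions."""
    def __init__(self, num_outputs=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),   # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),   # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 50), nn.ReLU(),
            nn.Linear(50, 50), nn.ReLU(),
            nn.Linear(50, num_outputs),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```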
D.2 SIMULATIONS OF MULTI-MODAL DISTRIBUTIONS

The target density function is given by $\pi(x) \propto \exp(-U(x))$, where $x = (x_1, x_2)$ and

$$U(x) = 0.2(x_1^2 + x_2^2) - 2\big(\cos(2\pi x_1) + \cos(2\pi x_2)\big).$$

We also include a regularization term $L(x) = 1_{(x_1^2 + x_2^2) > 20}\big((x_1^2 + x_2^2) - 20\big)$. This design leads to a highly multi-modal distribution with 25 isolated modes; Figure 5 shows the contour and the 3-D plot of the target density.

Figure 5: Target density (contour and 3-D plots).

ICSGLD and the baseline algorithms are applied to this example. For ICSGLD, we set $\epsilon_k = 3\mathrm{e}{-3}$, $\tau = 1$, $\zeta = 0.75$, and a total of $8\mathrm{e}4$ iterations. Besides, we partition the sample space into 100 subregions with bandwidth $\Delta u = 0.125$ and set $\omega_k = \min\{3\mathrm{e}{-3}, \frac{1}{k^{0.6} + 100}\}$. For comparison, we run the baseline algorithms under similar settings. For CSGLD, we run a single process with 5 times the time budget, with all other settings the same as those used by ICSGLD. For reSGLD, we run five parallel chains with learning rates 0.001, 0.002, ..., 0.005 and temperatures 1, 2, ..., 5, respectively; we estimate the correction every 100 iterations, fix the initial correction at 30, and choose the same step size for the stochastic approximation as in ICSGLD. For SGLD, we run five chains in parallel with a learning rate of $3\mathrm{e}{-3}$ and a temperature of 1. For cycSGLD, we run a single chain with 5 times the time budget, an initial learning rate of $1\mathrm{e}{-2}$, and 10 cycles. For the particle-based SVGD, we run five chains in parallel; for each chain, we initialize 100 particles drawn from a uniform distribution over a rectangle, and the learning rate is set to $3\mathrm{e}{-3}$.

To compare the convergence rates in terms of running steps and time between ICSGLD and the other algorithms, we repeat each experiment 20 times and report the mean and standard error over the 20 trials. Note that all the algorithms are run with 5 parallel chains (×P5), except that cycSGLD and CSGLD are run as a single chain with 5 times the time budget (×T5); their steps and running time are scaled accordingly.

Figure 6: Estimated KL divergence versus time steps for ICSGLD and the baseline methods; experiments are repeated 20 times.

Figure 6 shows that the vanilla SGLD×P5 converges the slowest among the five algorithms due to the lack of a mechanism to escape local traps; cycSGLD×T5 slightly alleviates that problem by adopting cyclical learning rates; reSGLD×P5 greatly accelerates the computations by utilizing high-temperature chains for exploration and low-temperature chains for exploitation, but the large correction term inevitably slows down the convergence; ICSGLD×P5 converges faster than all the others, and the noisy energy estimators only induce a bias for the latent variables without significantly affecting the convergence rate. For the particle-based SVGD method, more particles require expensive computations while fewer particles lead to a crude approximation; therefore, we do not show the convergence of SVGD and only compare the Monte Carlo methods.

D.3 DEEP CONTEXTUAL BANDITS ON MUSHROOM TASKS

In the UCI Mushroom data set, each mushroom is either edible or poisonous. Eating an edible mushroom yields a reward of 5; eating a poisonous mushroom has a 50% chance to yield a reward of -35 and yields a reward of 5 otherwise; eating nothing yields a reward of 0. All the agents use the same architecture: we fit a two-layer neural network with 100 neurons per layer and ReLU activation functions.
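For reference, the reward structure just described is simple enough to write down directly; the following Python sketch (function and variable names are ours) encodes it.

```python
import numpy as np

rng = np.random.default_rng(0)

def mushroom_reward(eat, is_poisonous):
    """Mushroom bandit reward: eating an edible mushroom gives +5;
    eating a poisonous one gives -35 with probability 0.5 and +5
    otherwise; eating nothing gives 0."""
    if not eat:
        return 0.0
    if not is_poisonous:
        return 5.0
    return -35.0 if rng.random() < 0.5 else 5.0

# Expected reward of eating a poisonous mushroom: 0.5*(-35) + 0.5*5 = -15,
# so the oracle policy eats only edible mushrooms.
```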
The input of the network is a feature vector of dimension 22 (the context), and there are 2 outputs, representing the predicted rewards for eating and not eating a mushroom. The mean squared loss is adopted for training the models. We initialize with 1024 data points and keep a data buffer of size 4096 as the training proceeds; the mini-batch size is set to 512. To adapt to online scenarios, we train the models after every 20 new observations.

We choose an ϵ-greedy policy (EpsGreedy) based on the RMSprop optimizer with a decaying learning rate (Riquelme et al., 2018) as a baseline. Two variational methods, namely stochastic gradient descent with a constant learning rate (ConstSGD) (Mandt et al., 2017) and Monte Carlo Dropout (Dropout) (Gal & Ghahramani, 2016), are compared to approximate the posterior distribution. For the sampling algorithms, we include preconditioned SGLD (pSGLD) (Li et al., 2016), preconditioned CSGLD (pCSGLD) (Deng et al., 2020b), and preconditioned ICSGLD (pICSGLD). Note that all the algorithms run 4 parallel chains with averaged outputs (×P4), except that pCSGLD runs a single chain with 4 times the computational budget (×T4). In particular, for the two contour algorithms, we set $\zeta = 20$ and choose a constant step size for the stochastic approximation to fit the time-varying posterior distributions.

We report the experimental setups for each algorithm. Similar to Table 2 of Riquelme et al. (2018), the inclusion of advanced techniques may change the optimal settings of the hyperparameters; nevertheless, we try to report the best setup for each individual algorithm. We train each algorithm for 2000 steps. For each step, we are given 20 random mushrooms; we train the models 16 iterations per step for the parallel algorithms (×P4) and train pCSGLD×T4 64 iterations per step. EpsGreedy decays the learning rate by a factor of 0.999 every step; by contrast, all the other methods use a fixed learning rate. RMSprop adopts a regularizer of 0.001 and a learning rate of 0.01 to learn the preconditioners. Dropout uses a 50% dropout rate, and each subprocess simulates 5 models for predictions. For the two importance sampling (IS) algorithms, we partition the energy space into $m = 100$ subregions and set the energy bandwidth $\Delta u$ to 10. We fix the hyperparameter $\zeta = 20$. The step sizes for pICSGLD×P4 and pCSGLD×T4 are chosen as 0.03 and 0.006, respectively. A proper regularizer is adopted for the low importance weights. See Table 2 for details.

Table 2: Details of the experimental setups.

Algorithm     | Learning rate | Temperature | RMSprop | IS  | Train | Dropout   | ϵ-greedy
EpsGreedy×P4  | 5e-7 (0.999)  | 0           | Yes     | No  | 16    | No        | 0.3%
ConstSGD×P4   | 1e-6          | 0           | No      | No  | 16    | No        | No
Dropout×P4    | 1e-6          | 0           | No      | No  | 16    | Yes (50%) | No
pCSGLD×T4     | 5e-8          | 0.3         | Yes     | Yes | 64    | No        | No
pSGLD×P4      | 3e-7          | 0.3         | Yes     | No  | 16    | No        | No
pICSGLD×P4    | 3e-7          | 0.3         | Yes     | Yes | 16    | No        | No

D.4 UNCERTAINTY ESTIMATION

All the algorithms, excluding M-SGD×P4, choose a temperature of 0.0003. We use various data augmentation techniques, such as random flipping, cropping, and random erasing (Zhong et al., 2017); this leads to a much more concentrated posterior and requires a very low temperature. We run the parallel algorithms for 500 epochs (×P4) and the single-chain algorithms for 2000 epochs (×T4). The initial learning rate is 2e-6 (in Bayesian settings), which corresponds to the standard 0.1 for the averaged data likelihood. We train cycSGHMC×T4 and MultiSWAG×T4 based on cosine learning rates with 10 cycles, where the learning rate in the last 15% of each cycle is fixed at a constant value (a sketch of this schedule is given below).
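The cyclical cosine schedule with a frozen tail can be sketched as follows; the exact parameterization (cycle length and freezing rule) is our assumption rather than the paper's specification.

```python
import math

def cyclical_cosine_lr(step, total_steps, num_cycles=10,
                       lr_init=2e-6, flat_frac=0.15):
    # Position within the current cycle, in [0, 1).
    cycle_len = total_steps // num_cycles
    pos = (step % cycle_len) / cycle_len
    # Cosine decay from lr_init, frozen at a constant value during the
    # last 15% of each cycle (the sampling/snapshot phase).
    pos = min(pos, 1.0 - flat_frac)
    return 0.5 * lr_init * (1.0 + math.cos(math.pi * pos))
```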
MultiSWAG simulates 10 random models at the end of each cycle. M-SGD×P4 follows the same cosine learning rate strategy with one cycle. reSGHMC×P4 proposes swaps between neighboring chains and requires a fixed correction of 4000 for ResNet20, 32, and 56 and a correction of 1000 for WRN-16-8; its learning rate is annealed at epochs 250 and 375 by a factor of 0.2, and ICSGHMC×P4 applies the same learning rate schedule. We choose $m = 200$, with $\Delta u = 200$ for ResNet20, 32, and 56 and $\Delta u = 60$ for WRN-16-8. Proper regularizations may be applied to the importance weights and the gradient multipliers for training deep neural networks.

Variance reduction (Deng et al., 2021a) is applied only to reSGHMC×P4 and ICSGHMC×P4, because they are the only two algorithms that require accurate estimations of the energy. We update the control variates only every 2 epochs in the last 100 epochs, which maintains a reasonable training time and achieves a larger reduction of variance thanks to the small learning rate in that phase. The other algorithms yield a worse performance when variance reduction is applied to the gradients.

D.5 EMPIRICAL VALIDATION OF REDUCED VARIANCE

To compare the θ's learned from ICSGLD and CSGLD, we simulate from the Gaussian mixture distribution $0.4\, N(-6, 1) + 0.6\, N(4, 1)$, where $N(u, v)$ denotes a Gaussian distribution with mean $u$ and standard deviation $v$. We fix $\zeta = 0.9$ and $\Delta u = 1$. We run ICSGLD for 1,000,000 iterations based on 10 interacting parallel chains and run CSGLD for 10,000,000 iterations using a single chain; we refer to them as ICSGLD×P10 and CSGLD×T10, respectively. The rest of the settings follow the experimental setup in Section 4.1 of Deng et al. (2020a). To measure the variance of the estimates, we repeat the experiments 10 times and present the mean and two standard deviations for both CSGLD×T10 and ICSGLD×P10 in Figure 7.

Figure 7: Estimates of the density of states (PDF in energy) for ICSGLD×P10 v.s. CSGLD×T10, together with the ground truth θ*.

The results indicate that both estimates of $\theta^{\zeta}$ (by CSGLD and ICSGLD) converge to the equilibrium that approximates the ground truth of the density of states. Notably, ICSGLD×P10 yields a significantly smaller variance than CSGLD×T10 under the same computational budget. This shows the clear advantage of ICSGLD (many interacting short runs) over CSGLD (a single long run) in tackling the large-variance issue of importance sampling.
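For reference, the ground truth of the density of states for this one-dimensional mixture can be obtained by numerical integration; the sketch below (grid range and helper names are ours) partitions the energy with bandwidth $\Delta u = 1$ as in the text and computes the probability mass of each energy subregion.

```python
import numpy as np
from scipy.stats import norm

# Target from Section D.5: pi(x) = 0.4 N(-6, 1) + 0.6 N(4, 1).
def pdf(x):
    return 0.4 * norm.pdf(x, -6.0, 1.0) + 0.6 * norm.pdf(x, 4.0, 1.0)

def energy(x):
    return -np.log(pdf(x))

# Probability mass of each energy subregion (bandwidth du = 1) on a grid;
# this approximates the ground-truth theta* shown in Figure 7.
xs = np.linspace(-15.0, 15.0, 300001)
dx = xs[1] - xs[0]
u = energy(xs)
edges = np.arange(u.min(), u.max() + 1.0, 1.0)  # du = 1
idx = np.digitize(u, edges)                     # subregion index of each x
theta_star = np.array([pdf(xs[idx == j]).sum() * dx
                       for j in np.unique(idx)])
print(theta_star[:5], theta_star.sum())         # masses sum to ~1
```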