# Langevin Quasi-Monte Carlo

Sifan Liu
Department of Statistics, Stanford University, Stanford, CA 94305
sfliu@stanford.edu

Abstract: Langevin Monte Carlo (LMC) and its stochastic gradient versions are powerful algorithms for sampling from complex high-dimensional distributions. To sample from a distribution with density $\pi(\theta) \propto \exp(-U(\theta))$, LMC iteratively generates the next sample by taking a step along the negative gradient $-\nabla U$ with added Gaussian perturbations. Expectations with respect to the target distribution $\pi$ are estimated by averaging over LMC samples. In ordinary Monte Carlo, it is well known that the estimation error can be substantially reduced by replacing independent random samples with quasi-random samples such as low-discrepancy sequences. In this work, we show that the estimation error of LMC can likewise be reduced by using quasi-random samples. Specifically, we propose to use completely uniformly distributed (CUD) sequences with a certain low-discrepancy property to generate the Gaussian perturbations. Under smoothness and convexity conditions, we prove that LMC with a low-discrepancy CUD sequence achieves a smaller error than standard LMC. The theoretical analysis is supported by compelling numerical experiments, which demonstrate the effectiveness of our approach.

1 Introduction

Sampling from probability distributions is a crucial task in both statistics and machine learning. When the target distribution does not permit exact sampling, researchers often rely on Markov chain Monte Carlo (MCMC) methods, which simulate a Markov chain that converges to the target distribution as its stationary distribution. Recently, MCMC samplers based on discretizing the continuous-time Langevin diffusion have become popular, owing to their ease of implementation and ability to handle stochastic gradients (Welling and Teh, 2011).

The primary focus of this work is the quality of samples generated by Langevin Monte Carlo (LMC) algorithms for estimating the expectation $\mathbb{E}_{\theta \sim \pi}[f(\theta)]$ of an integrand $f$ by sample averages. In Bayesian inference, the target distribution $\pi$ is typically the posterior distribution, and computing the posterior expectation, posterior variance, or confidence intervals is of great interest. In post-selection inference, the target distribution $\pi$ is the distribution conditioned on the selection event, and computing the selection-adjusted p-value is the main task; LMC has been widely used in this problem as well (Markovic and Taylor, 2016; Shi et al., 2022). In all these situations, the accuracy of the sample average estimator is critical and affects the downstream data analysis.

In traditional Monte Carlo sampling, it is well known that using quasi-Monte Carlo (QMC) samples instead of independent and identically distributed (i.i.d.) random samples can lead to significant error reduction. It is therefore natural to ask whether QMC techniques can also improve Langevin Monte Carlo sampling. In this work, we introduce the Langevin quasi-Monte Carlo (LQMC) algorithm, which replaces the i.i.d. random inputs of the LMC algorithm with quasi-random numbers.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: Scatter plots of 251 points generated from the Mersenne Twister 19937 (left) and 251 points generated from a linear congruential generator (LCG) of period 251 (right).
Points from an entire period of a pseudo-random number generator (right) fill the unit square more evenly than the same number of points from a PRNG with a much larger period (left).

These quasi-random numbers are carefully designed to sample the target distribution more evenly, leading to improved estimation accuracy. Not every quasi-Monte Carlo point set is suitable for simulating Markov chains, however. Suppose the Markov chain is driven by a sequence of uniform random vectors in the unit cube. A sufficient condition is that the driving sequence be completely uniformly distributed (CUD). In our implementation of the driving sequence, we use an entire period of a pseudo-random number generator (PRNG). While modern computer simulations typically rely on PRNGs with a very large period, such as the Mersenne Twister with period $2^{19937}-1$, our approach runs through the entire period of a PRNG with a relatively small period inside the LMC algorithm. The advantage of consuming an entire period is that the resulting points are more evenly distributed, which is desirable for numerical integration; Figure 1 illustrates this balancing property.

The main contributions of this paper are threefold. First, we propose a novel technique for using quasi-random numbers in Langevin-type algorithms; it applies to a wide range of such algorithms by simply substituting a sequence of quasi-random numbers for the i.i.d. random numbers. The quasi-random numbers are constructed similarly to usual PRNGs, so no extra computational cost is incurred. Second, we evaluate the proposed LQMC algorithm in a variety of numerical experiments, demonstrating that it can reduce the mean squared error (MSE) of traditional LMC by a factor ranging from 2 to 500, depending on the problem. Finally, we provide a theoretical analysis showing that LQMC reduces the Monte Carlo part of the error from $O(n^{-1/2})$ to $O(n^{-1+\delta})$ for any $\delta > 0$ when the Markov chain is strongly contracting and the integrand $f$ is sufficiently regular. This error reduction matches the usual improvement of quasi-Monte Carlo over plain Monte Carlo.

The rest of the paper is organized as follows. Section 2 provides background on LMC and QMC, followed by a review of related work. Section 3 describes the LQMC algorithm and its implementation details. Section 4 presents theoretical guarantees for the proposed method. Finally, Section 5 provides empirical results comparing LQMC with the standard LMC algorithm.

2 Background

This section provides some background on Langevin Monte Carlo and quasi-Monte Carlo.

2.1 Langevin Monte Carlo

Suppose we want to sample from the target distribution $\pi(\theta) \propto \exp(-U(\theta))$, where $\theta \in \mathbb{R}^d$ and $U$ is known as the potential function. LMC algorithms are based on the Euler-Maruyama discretization of the Langevin diffusion $\theta(t)$, which satisfies the stochastic differential equation

$$\mathrm{d}\theta(t) = -\nabla U(\theta(t))\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t, \qquad (1)$$

where $\{W_t\}_{t \ge 0}$ is a $d$-dimensional standard Brownian motion. Under mild technical conditions, the Langevin diffusion $\theta(t)$ has $\pi$ as its unique invariant distribution (Roberts and Tweedie, 1996). With a discretization step size $h$, LMC updates the sample $\theta_k$ by

$$\theta_{k+1} = \theta_k - h\,\nabla U(\theta_k) + \sqrt{2h}\,\xi_k, \qquad \xi_k \overset{\mathrm{iid}}{\sim} N(0, I_d). \qquad (2)$$

In many applications, we are interested in computing the expectation $\mu := \mathbb{E}_{\theta \sim \pi}[f(\theta)]$ for some $\pi$-integrable function $f$.
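For concreteness, the following is a minimal NumPy sketch of the update (2) on a toy Gaussian target; the function and variable names are our own illustration, not part of the paper.

```python
import numpy as np

def lmc(grad_U, theta0, h, n, seed=0):
    """Unadjusted LMC: theta_{k+1} = theta_k - h * grad_U(theta_k) + sqrt(2h) * xi_k."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    samples = np.empty((n, theta.size))
    for k in range(n):
        xi = rng.standard_normal(theta.size)  # xi_k ~ N(0, I_d), i.i.d. across iterations
        theta = theta - h * grad_U(theta) + np.sqrt(2 * h) * xi
        samples[k] = theta
    return samples

# Toy target: pi = N(0, I_2), so U(theta) = ||theta||^2 / 2 and grad U(theta) = theta.
samples = lmc(grad_U=lambda th: th, theta0=np.zeros(2), h=0.1, n=10_000)
print(samples.mean(axis=0))  # sample-average estimate of E[theta] = 0
```

Averaging a test function $f$ over the trajectory gives the estimator of $\mu$ discussed next.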
The LMC estimator of $\mu$ is the sample average $\hat{\mu} = \frac{1}{n}\sum_{k=1}^n f(\theta_k)$, where $n$ is the number of iterations. Teh et al. (2016) provide an asymptotic bias-variance decomposition of the MSE of the weighted average $\sum_{k=1}^n h_k f(\theta_k) \big/ \sum_{k=1}^n h_k$, where $h_k$ is the step size used at the $k$-th iteration, and show that the optimal step size scales as $h_k \asymp k^{-1/3}$, leading to an MSE of order $O(n^{-2/3})$. Vollmer et al. (2016) generalize this result to the non-asymptotic setting with a constant step size $h$. They show that the MSE is of order $O(h^2 + \frac{1}{nh})$, where $h^2$ corresponds to the squared bias and $\frac{1}{nh}$ corresponds to the variance.

2.2 Quasi-Monte Carlo

QMC is an alternative to Monte Carlo for numerical integration and is well known for achieving much higher accuracy. QMC is primarily designed to numerically evaluate the integral $\mu = \int_{[0,1]^d} f(u)\,\mathrm{d}u$. It estimates $\mu$ by taking points $u_i \in [0,1]^d$ and forming the estimator $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n f(u_i)$. Unlike Monte Carlo, which takes the $u_i$ to be i.i.d., QMC constructs the point set $\{u_i\}_{i=1}^n$ to minimize the star discrepancy

$$D_n^* = D_n^*(u_1, \ldots, u_n) = \sup_{a \in [0,1]^d} \left| \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{u_i \in [0,a)\} - \prod_{j=1}^d a_j \right|. \qquad (3)$$

The star discrepancy measures the uniformity of the point set by comparing the fraction of points inside $[0,a)$ with the volume $\prod_{j=1}^d a_j$, taking the supremum over all rectangles in $[0,1]^d$ anchored at 0. QMC can generate points with $D_n^* = O(n^{-1}(\log n)^{d-1})$; for this reason QMC point sets are also known as low-discrepancy sequences. Commonly used QMC points include the Sobol' sequence (Sobol', 1967), Niederreiter's sequence (Niederreiter, 1987), Halton's sequence, and lattice rules. For a comprehensive survey, we refer to the monograph of Dick and Pillichshammer (2010). If the integrand $f$ has bounded variation in the sense of Hardy and Krause, $\|f\|_{\mathrm{HK}} < \infty$, then the Koksma-Hlawka inequality (see e.g. Dick and Pillichshammer (2010)) bounds the integration error by

$$|\hat{\mu} - \mu| \le D_n^* \,\|f\|_{\mathrm{HK}} = O(n^{-1}(\log n)^{d-1}). \qquad (4)$$

While the Koksma-Hlawka inequality shows that QMC is asymptotically better than plain Monte Carlo, it does not provide a practical way to estimate the error, and integrands may have infinite Hardy-Krause variation. Randomization techniques can be applied to QMC to address both problems; common choices include random shifts (Cranley and Patterson, 1976) and scrambling (Owen, 1995). For randomized QMC (RQMC) samples $u_1, \ldots, u_n$, each $u_i \sim \mathrm{Unif}([0,1]^d)$ individually, yet with probability 1 the points collectively retain the low-discrepancy property. The error can then be estimated from multiple independent random replicates. For sufficiently smooth $f$, the scrambled Sobol' sequence has variance $O(n^{-3}(\log n)^{d-1})$ (Owen, 1997a,b).

2.3 Related work

The first attempt to apply quasi-random numbers to the simulation of stochastic differential equations was made by Hofmann and Mathé (1997). They showed that if a numerical scheme is weakly convergent with i.i.d. samples, then using completely uniformly distributed (CUD) sequences also leads to consistent estimation; they also demonstrated that certain low-discrepancy sequences are not suitable for simulating SDEs. There have also been efforts to apply QMC to MCMC. Owen and Tribble (2005) applied CUD sequences to a Metropolis algorithm and showed that the method is consistent for problems with finite state spaces. Chen et al. (2011) generalized the consistency result to continuous state spaces under the assumption that the Markov chain is a contraction. More recently, Dick et al.
(2016) and Dick and Rudolf (2014) proved that there exist constructions of the driving sequence $\{u_k\}_{k \ge 1}$ such that the discrepancy between the empirical distribution of the MCMC samples and the target distribution is bounded by $O(n^{-1/2}(\log n)^{1/2})$, the same rate achieved by random inputs. Another line of work applying QMC to Markov chains is array-RQMC, proposed by L'Ecuyer et al. (2008). Array-RQMC runs multiple Markov chains in parallel, and each iteration involves a sophisticated reordering of the states so that low discrepancy across the chains is maintained. Empirically, it achieves significantly smaller estimation error than usual MCMC, but theoretical guarantees remain a challenging open problem.

There has been growing interest in using QMC techniques in various machine learning tasks, such as variational inference (Buchholz et al., 2018; Liu and Owen, 2021), policy learning and evaluation (Arnold et al., 2022), reinforcement learning with evolution strategies (Choromanski et al., 2019; Rowland et al., 2018), compression of large datasets (Dick and Feischl, 2021), example selection in stochastic gradient descent (SGD) (Lu et al., 2021), and deep learning for solving partial differential equations (Longo et al., 2021).

Numerous efforts have been devoted to improving LMC and stochastic gradient Langevin dynamics (SGLD). To overcome the instability of the Euler-Maruyama discretization, various numerical schemes have been proposed, including higher-order integrators (Chen et al., 2015), underdamped LMC (Cheng et al., 2018), and stochastic Runge-Kutta diffusion (Li et al., 2019). For SGLD, variance reduction techniques such as SAGA and SVRG (Dubey et al., 2016) and control variates (Baker et al., 2019) have been proposed. LMC also provides a useful perspective for optimization, as demonstrated by the analyses in Chen et al. (2016); Dalalyan (2017); Raginsky et al. (2017); Xu et al. (2018); Erdogdu et al. (2018). Our contribution is orthogonal to all the aforementioned work: our algorithm only modifies the random numbers fed into the sampler, so it can be combined with these methods without interference.

3 QMC for LMC

In the LMC algorithm, we can think of the Markov chain as being driven by a sequence of uniform vectors $u_k$ in the unit cube $[0,1]^d$. For instance, the Gaussian perturbation can be represented as $\xi_k = \Phi^{-1}(u_k)$, where $\Phi^{-1}$ denotes the inverse Gaussian CDF applied element-wise to $u_k$. If a stochastic gradient is employed, the randomness associated with the stochastic gradient can also be expressed through uniform variables. We can therefore write the transition of the Markov chain as $\theta_{k+1} = \psi(\theta_k, u_{k+1})$. In typical computer experiments, the $u_k$ are not truly i.i.d. but are deterministic pseudo-random numbers. In this section, we describe an alternative way of generating the numbers $u_k$, carefully constructed to yield more accurate sample averages.

The idea is to use point sets that are more evenly distributed, such as QMC points, which can bring significant improvement in ordinary Monte Carlo estimation. However, caution is required when using QMC points to simulate an SDE like (1): correlation between successive QMC samples may introduce undesired behavior in the Markov chain, as demonstrated in (Tribble, 2007, Section 3.2). To avoid dependence among successive values, we require that the blocks of points $(v_i, v_{i+1}, \ldots, v_{i+d-1})$ be uniformly distributed for every block length $d$.
This notion of uniformity is formally known as being completely uniformly distributed (CUD; Korobov (1948)), which we define next. We say an infinite sequence $\{u_i\}_{i=1}^\infty \subseteq [0,1]^d$ is uniformly distributed on $[0,1]^d$ if the star discrepancy $D^*(\{u_i\}_{i=1}^n)$ goes to 0 as $n \to \infty$, where the star discrepancy is defined in Equation (3).

Definition 3.1 (Completely uniformly distributed (CUD) sequence). An infinite sequence $\{v_i\}_{i=0}^\infty \subseteq [0,1]$ is called completely uniformly distributed if, for every positive integer $d$, the sequence $\{(v_k, \ldots, v_{k+d-1})\}_{k=0}^\infty \subseteq \mathbb{R}^d$ is uniformly distributed on $[0,1]^d$. A triangular array $v_n = (v_{n,1}, \ldots, v_{n,N_n})$ is called array-CUD if, for every positive integer $d$,
$$D^*\big((v_{n,1}, \ldots, v_{n,d}),\, (v_{n,2}, \ldots, v_{n,d+1}),\, \ldots,\, (v_{n,N_n-d+1}, \ldots, v_{n,N_n})\big) \to 0 \quad \text{as } n \to \infty,\ N_n \to \infty.$$

In other words, the successive $d$-tuples of a CUD sequence are uniformly distributed in the $d$-dimensional unit cube for every positive dimension $d$. Now we are ready to present the main algorithm.

3.1 LQMC algorithm

Let $\{v_i\}_{i=0}^\infty$ be a CUD sequence, and let $u_k = (v_{kd}, \ldots, v_{(k+1)d-1}) \in \mathbb{R}^d$ be the $k$-th non-overlapping $d$-tuple of the sequence ($k \ge 0$). A CUD sequence is typically constructed deterministically. It can be further randomized by the Cranley-Patterson rotation (i.e. random shift; Cranley and Patterson (1976)),
$$u_k \leftarrow u_k + \Delta \bmod 1, \qquad \Delta \sim \mathrm{Unif}([0,1]^d),$$
which shifts each coordinate of $u_k$ by a uniform random number, modulo 1. Each $u_k$ is then uniformly distributed on $[0,1]^d$, and applying the inverse Gaussian CDF coordinate-wise gives $\Phi^{-1}(u_k) \sim N(0, I_d)$. In Langevin-type algorithms, we set $\xi_k = \Phi^{-1}(u_k)$ and use $\xi_k$ as the Gaussian perturbation in the $k$-th iteration. Specifically, each iteration takes the form
$$\theta_{k+1} = \theta_k - h\,\nabla U(\theta_k) + \sqrt{2h}\,\Phi^{-1}(u_{k+1}), \qquad k \ge 0,$$
so the transition map is $\psi(\theta, u) = \theta - h\,\nabla U(\theta) + \sqrt{2h}\,\Phi^{-1}(u)$. In practice, we can only run finitely many iterations. In the following, we describe how to construct a finite CUD sequence and feed it into the LMC algorithm.

3.2 Construction of CUD sequences

A finite CUD (array-CUD) sequence is often implemented by running through an entire period of a pseudo-random number generator with a small period (Tribble, 2007). Other constructions of CUD sequences exist; interested readers can refer to Levin (1999). We propose to use the linear-feedback shift register (LFSR) given in Chen (2011), because it has demonstrated good performance and its computational cost is comparable to other commonly used PRNGs. The binary LFSR (Tausworthe generator; Tausworthe (1965)) of order $m$ updates the state bits $b_i \in \{0,1\}$ recursively by
$$b_i = \sum_{j=0}^{m-1} a_j b_{i-m+j} \bmod 2, \qquad i \ge m,$$
with the initial states $b_0, b_1, \ldots, b_{m-1}$ pre-specified. The $m$-tuple $(b_i, b_{i+1}, \ldots, b_{i+m-1}) \in \mathrm{GF}(2)^m$ can take only $2^m$ distinct values, and if some $m$-tuple is all zero then every subsequent $b_i$ must be zero. Hence the period of the sequence $\{b_i\}_{i \ge 0}$ is at most $n = 2^m - 1$; moreover, the period equals $2^m - 1$ exactly when the characteristic polynomial $x^m + a_{m-1}x^{m-1} + \cdots + a_1 x + a_0$ is a primitive polynomial over $\mathrm{GF}(2)$ (Niederreiter, 1992, Lemma 9.1). Given the states $\{b_i\}_{i \ge 0}$ and an offset $s > 0$ with $\gcd(s, 2^m - 1) = 1$, the outputs are
$$v_i = \sum_{j=0}^{m-1} b_{si+j}\, 2^{-j-1}, \qquad i = 0, 1, \ldots, 2^m - 2.$$
That is, for each $i$ we take the $m$-tuple $(b_{si+j})_{0 \le j \le m-1}$ as the binary digits of $v_i$. Arranging the outputs into non-overlapping $d$-tuples yields the matrix
$$\begin{pmatrix} v_0 & v_1 & \cdots & v_{d-1} \\ v_d & v_{d+1} & \cdots & v_{2d-1} \\ \vdots & \vdots & & \vdots \end{pmatrix} \qquad (5)$$
whose $k$-th row is $u_k$. If $\gcd(d, n) > 1$, we take $d'$ to be the smallest integer greater than $d$ that is co-prime with $n$, and we create the matrix in (5) similarly but with $d'$ columns.
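The sketch below implements this construction end to end: a full period of a small Tausworthe LFSR, non-overlapping $d$-tuples with a Cranley-Patterson rotation, and the resulting LQMC iteration. The degree $m = 10$, the tap positions, the all-ones initial state, and the offset $s = 1$ are illustrative choices of ours (the taps correspond to a primitive polynomial of degree 10, giving period $2^{10} - 1 = 1023$); they are not the specific generator of Chen (2011).

```python
import numpy as np
from scipy.stats import norm

def lfsr_cud(m=10, s=1):
    """One full period of a binary LFSR mapped to [0, 1).

    Uses the recurrence b_i = b_{i-10} XOR b_{i-3} (characteristic polynomial
    x^10 + x^7 + 1, primitive over GF(2)), so the period is n = 2^10 - 1 = 1023.
    The offset s must satisfy gcd(s, 2^m - 1) = 1.
    """
    n = 2**m - 1
    b = np.zeros(s * n + m, dtype=np.uint8)
    b[:m] = 1                                    # any nonzero initial state works
    for i in range(m, len(b)):
        b[i] = b[i - 10] ^ b[i - 3]
    w = 2.0 ** -np.arange(1, m + 1)              # binary weights 2^{-1}, ..., 2^{-m}
    return np.array([b[s * i : s * i + m] @ w for i in range(n)])

def lqmc(grad_U, theta0, h, d, m=10, seed=0):
    """Langevin iteration driven by the LFSR sequence instead of i.i.d. uniforms."""
    rng = np.random.default_rng(seed)
    v = lfsr_cud(m=m)
    n = len(v) // d
    u = v[: n * d].reshape(n, d)                 # non-overlapping d-tuples, as in Sec. 3.1
    u = (u + rng.uniform(size=d)) % 1.0          # Cranley-Patterson rotation
    u = np.clip(u, 1e-12, 1 - 1e-12)             # guard the inverse CDF against 0 and 1
    theta = np.array(theta0, dtype=float)
    samples = np.empty((n, d))
    for k in range(n):
        xi = norm.ppf(u[k])                      # xi_k = Phi^{-1}(u_k)
        theta = theta - h * grad_U(theta) + np.sqrt(2 * h) * xi
        samples[k] = theta
    return samples

# Toy usage on a standard Gaussian target in d = 2.
samples = lqmc(grad_U=lambda th: th, theta0=np.zeros(2), h=0.1, d=2)
```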
In the LQMC algorithm, we take $u_k$ to be the $k$-th row of the matrix (5) but use only its first $d$ coordinates.

Algorithm 1 may appear restricted by its fixed number of iterations, $n = 2^m - 1$. In practice, however, the LQMC algorithm can be started with an initial value of $m$; if the chain has not converged after $2^m - 1$ iterations, one can continue it with another freshly generated LFSR, possibly of larger period. This allows the number of iterations to be adjusted based on the convergence of the chain. Additionally, if a burn-in period is required, one can first run the algorithm with an LFSR of small period to serve as the burn-in stage and then continue with a larger LFSR. Furthermore, running multiple chains with independent random shifts is embarrassingly parallel. We present the algorithm in the form of the basic LMC algorithm with accurate gradients and a constant learning rate; as noted previously, other Langevin-type algorithms can use the CUD sequence directly by substituting the LFSR sequence for the usual pseudo-random numbers.

4 Theoretical guarantee

Here we study the estimation error $\big| n^{-1}\sum_{k=1}^n f(\theta_k) - \pi(f) \big|$ of LQMC for a test function $f$ that is 1-Lipschitz and bounded. As a first attempt at proving a convergence rate for QMC within LMC, we impose the relatively strong conditions of smoothness and convexity.

Assumption 1. The potential function $U$ is $L$-smooth,
$$\|\nabla U(\theta) - \nabla U(\theta')\|_2 \le L\,\|\theta - \theta'\|_2 \qquad \forall\, \theta, \theta',$$
and $M$-strongly convex,
$$U(\theta') \ge U(\theta) + \nabla U(\theta)^\top (\theta' - \theta) + \frac{M}{2}\|\theta' - \theta\|_2^2 \qquad \forall\, \theta, \theta'.$$

We also assume a constant step size $h$. While LMC with vanishing step sizes converges weakly to the target distribution, a constant step size is often used in practice (Vollmer et al., 2016; Brosse et al., 2018), and with a constant step size we can derive a non-asymptotic error bound for LQMC. Assumption 1 implies that if the step size satisfies $h \le \frac{2}{L+M}$, then the transition map $\psi$ is a strong contraction with parameter $\rho = 1 - hM$, i.e.,
$$\|\psi(\theta, u) - \psi(\theta', u)\|_2 = \|\theta - \theta' - h(\nabla U(\theta) - \nabla U(\theta'))\|_2 \le \rho\,\|\theta - \theta'\|_2; \qquad (6)$$
see e.g. Lemma 2 of Dalalyan and Karagulyan (2019). The strong contraction implies that if we start two chains from $\theta$ and $\theta'$ and use the same random numbers at every step, the two chains merge exponentially fast. In other words, the state $\theta_k$ depends mostly on the most recent iterations and quickly forgets the distant past. Formally, let $w_k^{(\ell)} = (u_k, \ldots, u_{k-\ell+1})$ denote the random numbers used in the most recent $\ell$ steps. Define the $\ell$-step transition by $\theta_k = \psi^\ell(\theta_{k-\ell}, w_k^{(\ell)})$, and let $\bar f_\ell(w_k^{(\ell)})$ denote the value of $f(\theta_k)$ marginalized over $\theta_{k-\ell} \sim \pi$, i.e.,
$$\bar f_\ell(w_k^{(\ell)}) = \int f\big(\psi^\ell(x, w_k^{(\ell)})\big)\,\pi(\mathrm{d}x).$$
Thus $\bar f_\ell(w_k^{(\ell)})$ depends only on the most recent $\ell$ iterations. By the strong contraction, $|\bar f_\ell(w_k^{(\ell)}) - f(\theta_k)|$ decays exponentially fast in $\ell$. For large $\ell$, the estimation error of $n^{-1}\sum_{k=1}^n f(\theta_k)$ is therefore close to the error of $\frac{1}{n-\ell}\sum_{k=\ell+1}^n \bar f_\ell(w_k^{(\ell)})$, which can be viewed as a $d\ell$-dimensional numerical integration scheme based on the point set $\{w_k^{(\ell)}\}_{k=\ell+1}^n$. By leveraging the discrepancy bound of the LFSR sequence and assuming that $\bar f_\ell$ has bounded variation in the sense of Hardy and Krause, we can derive an error bound using the Koksma-Hlawka inequality (4). We now state the main error bound; the detailed proof is in Appendix A.

Theorem 4.1 (Error bound of LQMC). Let Assumption 1 hold. Take a step size $h \le \frac{2}{L+M}$, set $\rho = 1 - hM$ and $\ell = \lceil \frac{1}{2}\log_\rho h \rceil$, and let $\theta_1, \ldots, \theta_n$ be the output of Algorithm 1 run for $n$ iterations with step size $h$.
Assume the LFSR sequence $\{v_i\}_{i \ge 0}$ in use has period $n = 2^m - 1$ and offset $s$, with $\gcd(m, n) = \gcd(d\ell, n) = 1$. If $\bar f_\ell$ has bounded variation in the sense of Hardy and Krause, then as $n \to \infty$,
$$\left| \frac{1}{n}\sum_{k=1}^n f(\theta_k) - \pi(f) \right| \le C_1 n^{-1+\delta} + C_2 h^{1/2}, \qquad \forall\, \delta > 0.$$
Here $\delta$ hides poly-logarithmic factors $(\log n)^d$, $C_1$ depends on $d$, $\ell$, and $\|\bar f_\ell\|_{\mathrm{HK}}$, and $C_2 = \frac{3\sqrt{2}}{2}\frac{L}{M}\sqrt{d} + \max_{0 \le k \le n}\|\theta_k\| + \mathbb{E}_\pi[\|\theta\|]$.

The upper bound consists of two terms. The first term is the numerical integration error, which arises from the discrepancy of the point set used in the integration scheme. By using low-discrepancy CUD sequences, we reduce this numerical integration error from the standard rate $O(n^{-1/2})$ to the faster rate $O(n^{-1+\delta})$ for any $\delta > 0$. Note, however, that with a constant step size $h$ the bias term (the second term) does not vanish. This bias term includes not only the discretization error of the Langevin diffusion but also the difference between $f(\theta_k)$ and its truncated version $\bar f_\ell(w_k^{(\ell)})$. Consequently, the bias term in our analysis is larger than that of Vollmer et al. (2016), who employ different techniques and assumptions based on the Poisson equation. The theorem's assumption of finite Hardy-Krause variation is a common requirement in QMC error bounds and can be challenging to verify in practice. Basu and Owen (2016) provide sufficient conditions for $f \circ \psi^\ell$ to have finite HK variation, which require the $\ell$-step transition $\psi^\ell$ to be sufficiently smooth. In the next section, we assess the practical performance of the proposed LQMC algorithm through numerical experiments.

5 Numerical experiments

To comprehensively evaluate the algorithm, we consider both convex and non-convex potentials, both low-dimensional and high-dimensional state spaces, both accurate and stochastic gradients, both smooth and discontinuous integrands, as well as different learning rate schedules. Additional numerical results can be found in Appendix B.

Figure 2: Bayesian logistic regression with accurate gradients (top) and stochastic gradients (bottom). The panels plot MSE against the number of iterations ($2^{10}$ to $2^{16}$) for the test functions $f(x) = x_j$, $f(x) = x_j^2$, and $f(x) = 1\{x_j > 0\}$.

5.1 Bayesian logistic regression

We first consider the Bayesian logistic model
$$y_i \mid x_i \sim \mathrm{Bernoulli}\big((1 + \exp(-x_i^\top \beta))^{-1}\big), \quad 1 \le i \le N, \qquad \beta \sim N(0, I_d).$$
We take $N = 20$ and $d = 10$. The features $x_i$ are generated from $N(0, \Sigma)$ with $\Sigma_{ij} = 2^{-|i-j|}$; the coefficients $\beta$ and the responses $y_i$ are generated from the same model. We consider the test functions $f(x) = x_j$, $x_j^2$, and $1\{x_j > 0\}$ for $j = 1, \ldots, d$. The step size $h$ is fixed at 0.001. We compute the MSE of the estimators based on the usual LMC and the proposed LQMC with CUD sequences, and report the MSE averaged over all coordinates and 20 random replicates. The expectations $\mathbb{E}[f]$ have no closed form, so the ground truth is estimated using the high-accuracy estimator proposed in He et al. (2023) with a scrambled Sobol' sequence and a very large sample size.

In Figure 2 (top panel), we present a log-log plot of the MSE against the number of iterations. Across all three test functions, LQMC reduces the MSE by a factor ranging from 4 to 8. As the number of iterations increases, the LQMC curve reaches a plateau.
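For completeness, here is a short sketch of the potential gradient used in this experiment: $U(\beta)$ is the negative log-posterior of the logistic model under the $N(0, I_d)$ prior. The data-generation code mirrors the setup described above, with variable names of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 10
# Features x_i ~ N(0, Sigma) with Sigma_ij = 2^{-|i-j|}.
Sigma = 2.0 ** -np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
X = rng.multivariate_normal(np.zeros(d), Sigma, size=N)
beta_true = rng.standard_normal(d)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def grad_U(beta):
    """U(beta) = -sum_i log p(y_i | x_i, beta) + ||beta||^2 / 2, hence
    grad U(beta) = X^T (sigmoid(X beta) - y) + beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))  # predicted probabilities sigmoid(x_i^T beta)
    return X.T @ (p - y) + beta          # likelihood term plus Gaussian prior term
```

This `grad_U` can be passed directly to either the LMC or the LQMC sketch above.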
The plateau can be attributed to the discretization error inherent in the unadjusted LMC, which cannot be reduced by running more iterations. In the bottom panel of Figure 2, we increase the number of observations to $N = 100$ and use stochastic gradient estimation in the Langevin algorithm: at each iteration, the gradient is estimated from a random subset of 10 observations. The results show that LQMC still provides a sizable improvement when $n$ is below $2^{14}$; once $n$ surpasses $2^{14}$, the LQMC curve flattens again. The improvement achieved by LQMC in this scenario is less pronounced than in the previous example, primarily because of the noise in the gradient estimates.

5.2 Bayesian linear regression

We next try a higher-dimensional example with Bayesian linear regression. The model is
$$y_i \sim N(x_i^\top \beta,\ \sigma^2 = 4^{-1}), \quad 1 \le i \le N, \qquad \beta \sim N(0, I).$$
We take $d = 100$ and $N = 20$, generate $x_i \in \mathbb{R}^d$ as in the logistic regression example, and leave the test functions and step size unchanged. The posterior distribution of $\beta$ has the closed form
$$N\!\left( \Big(\tfrac{X^\top X}{\sigma^2} + I\Big)^{-1} \tfrac{X^\top Y}{\sigma^2},\ \Big(\tfrac{X^\top X}{\sigma^2} + I\Big)^{-1} \right).$$
The results are shown in Figure 3. Even in 100 dimensions, LQMC still brings a substantial improvement over LMC in terms of MSE. In particular, for the integrand $f(x) = x_j$, LQMC achieves a reduction in MSE of approximately 500-fold compared to LMC.

Figure 3: Bayesian linear regression in 100 dimensions. MSE against the number of iterations ($2^{10}$ to $2^{16}$).

Figure 4: Crossed random effects model, with a constant step size of $10^{-4}$ (left), a constant step size of $10^{-2}$ (middle), and decreasing $h$ (right).

5.3 A hierarchical Bayesian model

We consider a hierarchical Bayesian model known as the crossed random effects model:
$$Y_{ij} \sim N(\mu + a_i + b_j, 1), \quad 1 \le i \le I,\ 1 \le j \le J,$$
$$\mu \sim N(0,1), \qquad a_i \overset{\mathrm{iid}}{\sim} N(0, \sigma_a^2), \qquad b_j \overset{\mathrm{iid}}{\sim} N(0, \sigma_b^2), \qquad \log(\sigma_a^2), \log(\sigma_b^2) \overset{\mathrm{iid}}{\sim} N(0,1).$$
The goal is to sample from the posterior distribution of $(\mu, a, b, \log(\sigma_a^2), \log(\sigma_b^2))$, which has dimension $d = I + J + 3$. We take $I = 3$ and $J = 5$ and consider the test functions $f(x) = x_j$ ($1 \le j \le d$). The ground truth for $\mathbb{E}[f(x)]$ is estimated by the Metropolis-adjusted Langevin algorithm (MALA) with a large sample size. We compare the performance of the LQMC algorithm under three step-size schedules: a constant step size of $10^{-4}$, a constant step size of $10^{-2}$, and decreasing step sizes $h_k = c_0 (c_1 + k)^{-1/3}$, where $c_0$ and $c_1$ are chosen so that the step size decreases from $10^{-2}$ to $10^{-4}$ over the course of the algorithm. The exponent $-1/3$ in the decreasing schedule is recommended by Teh et al. (2016). The results of these comparisons are presented in Figure 4.

In the small step size case (left panel), the errors of LMC and LQMC are initially comparable for small $n$: the algorithm converges slowly, so the error is dominated by the bias. As $n$ increases, the improvement of LQMC becomes evident. In the large step size case (middle panel), the MSE of LQMC is consistently smaller than that of LMC even for small $n$, because the algorithm converges faster to the target distribution with a larger step size $h$, making the improvement of LQMC more pronounced. Interestingly, in this particular example, using decreasing step sizes yields accuracy similar to using a constant step size of $10^{-4}$.
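One way to pick $c_0$ and $c_1$ so that $h_k = c_0 (c_1 + k)^{-1/3}$ sweeps from $10^{-2}$ down to $10^{-4}$ over $n$ iterations is to solve the two endpoint equations; the paper does not spell out its exact constants, so the closed form below is our reconstruction.

```python
import numpy as np

def step_sizes(n, h_first=1e-2, h_last=1e-4):
    """Solve h_1 = c0*(c1+1)^(-1/3) and h_n = c0*(c1+n)^(-1/3) for c0, c1.

    Dividing the two equations gives (c1 + n) / (c1 + 1) = (h_first / h_last)^3,
    which is linear in c1.
    """
    r = (h_first / h_last) ** 3
    c1 = (n - r) / (r - 1.0)
    c0 = h_first * (c1 + 1.0) ** (1.0 / 3.0)
    k = np.arange(1, n + 1)
    return c0 * (c1 + k) ** (-1.0 / 3.0)

h = step_sizes(2**16)
print(h[0], h[-1])  # approximately 1e-2 and 1e-4
```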
It is worth noting that the MSE of LMC does not decrease at the $n^{-2/3}$ rate of Teh et al. (2016). This is because each curve in the plot does not show the accuracy at iteration $k$ within a single run; rather, it shows the accuracy achieved after completing all $n$ iterations of the algorithm, for different values of $n$.

Figure 5: Double-well potential. MSE against the number of iterations ($2^{15}$ to $2^{23}$); the panels include the test function $f(x) = 1\{x > 0\}$.

5.4 Nonconvex potential

Finally, we investigate the double-well potential $U(x) = \frac{1}{2}x^2 - \log(1 + x^2)$ from Pagès and Panloup (2018). By symmetry, $\mathbb{E}[x] = 0$ and $\mathbb{E}[1\{x > 0\}] = 0.5$; the second moment $\mathbb{E}[x^2]$ is computed by Gaussian quadrature. The results are shown in Figure 5. Since the potential has two separate local minima, the Langevin algorithm takes longer to explore the space sufficiently and converge to the target distribution. Once converged, the improvement of LQMC over LMC remains significant.

Acknowledgments and Disclosure of Funding

The author thanks Prof. Art Owen for helpful conversations. This work was partially funded by NSF grant DMS-2152780 and the Stanford Data Science Scholars program.

References

Arnold, S. M., L'Ecuyer, P., Chen, L., Chen, Y.-F., and Sha, F. (2022). Policy learning and evaluation with randomized quasi-Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 1041-1061. PMLR.

Baker, J., Fearnhead, P., Fox, E. B., and Nemeth, C. (2019). Control variates for stochastic gradient MCMC. Statistics and Computing, 29:599-615.

Basu, K. and Owen, A. B. (2016). Transformations and Hardy-Krause variation. SIAM Journal on Numerical Analysis, 54(3):1946-1966.

Brosse, N., Durmus, A., and Moulines, E. (2018). The promises and pitfalls of stochastic gradient Langevin dynamics. Advances in Neural Information Processing Systems, 31.

Buchholz, A., Wenzel, F., and Mandt, S. (2018). Quasi-Monte Carlo variational inference. In International Conference on Machine Learning, pages 668-677. PMLR.

Chen, C., Carlson, D., Gan, Z., Li, C., and Carin, L. (2016). Bridging the gap between stochastic gradient MCMC and stochastic optimization. In Artificial Intelligence and Statistics, pages 1051-1060. PMLR.

Chen, C., Ding, N., and Carin, L. (2015). On the convergence of stochastic gradient MCMC algorithms with high-order integrators. Advances in Neural Information Processing Systems, 28.

Chen, S. (2011). Consistency and convergence rate of Markov chain quasi Monte Carlo with examples. PhD thesis, Stanford University.

Chen, S., Dick, J., and Owen, A. (2011). Consistency of Markov chain quasi-Monte Carlo on continuous state spaces. The Annals of Statistics, 39(2):673-701.

Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I. (2018). Underdamped Langevin MCMC: A non-asymptotic analysis. In Conference on Learning Theory, pages 300-323. PMLR.

Choromanski, K., Pacchiano, A., Parker-Holder, J., and Tang, Y. (2019). Structured Monte Carlo sampling for nonisotropic distributions via determinantal point processes. arXiv preprint arXiv:1905.12667.

Cranley, R. and Patterson, T. N. (1976). Randomization of number theoretic methods for multiple integration. SIAM Journal on Numerical Analysis, 13(6):904-914.

Dalalyan, A. (2017). Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Conference on Learning Theory, pages 678-689. PMLR.

Dalalyan, A. S. and Karagulyan, A. (2019).
User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278-5311.

Dick, J. and Feischl, M. (2021). A quasi-Monte Carlo data compression algorithm for machine learning. Journal of Complexity, 67:101587.

Dick, J. and Pillichshammer, F. (2010). Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, Cambridge.

Dick, J. and Rudolf, D. (2014). Discrepancy estimates for variance bounding Markov chain quasi-Monte Carlo. Electronic Journal of Probability, 19(105):1-24.

Dick, J., Rudolf, D., and Zhu, H. (2016). Discrepancy bounds for uniformly ergodic Markov chain quasi-Monte Carlo. Annals of Applied Probability, 26(5):3178-3205.

Dubey, K. A., J Reddi, S., Williamson, S. A., Poczos, B., Smola, A. J., and Xing, E. P. (2016). Variance reduction in stochastic gradient Langevin dynamics. Advances in Neural Information Processing Systems, 29.

Erdogdu, M. A., Mackey, L., and Shamir, O. (2018). Global non-convex optimization with discretized diffusions. Advances in Neural Information Processing Systems, 31.

He, Z., Zheng, Z., and Wang, X. (2023). On the error rate of importance sampling with randomized quasi-Monte Carlo. SIAM Journal on Numerical Analysis, 61(2):515-538.

Hofmann, N. and Mathé, P. (1997). On quasi-Monte Carlo simulation of stochastic differential equations. Mathematics of Computation, 66(218):573-589.

Korobov, N. (1948). On functions with uniformly distributed fractional parts. In Dokl. Akad. Nauk SSSR, volume 62, pages 21-22.

L'Ecuyer, P., Lécot, C., and Tuffin, B. (2008). A randomized quasi-Monte Carlo simulation method for Markov chains. Operations Research, 56(4):958-975.

Levin, M. B. (1999). Discrepancy estimates of completely uniformly distributed and pseudorandom number sequences. International Mathematics Research Notices, 1999(22):1231-1251.

Li, X., Wu, Y., Mackey, L., and Erdogdu, M. A. (2019). Stochastic Runge-Kutta accelerates Langevin Monte Carlo and beyond. Advances in Neural Information Processing Systems, 32.

Liu, S. and Owen, A. B. (2021). Quasi-Monte Carlo quasi-Newton in variational Bayes. The Journal of Machine Learning Research, 22(1):11043-11065.

Longo, M., Mishra, S., Rusch, T. K., and Schwab, C. (2021). Higher-order quasi-Monte Carlo training of deep neural networks. SIAM Journal on Scientific Computing, 43(6):A3938-A3966.

Lu, Y., Meng, S. Y., and De Sa, C. (2021). A general analysis of example-selection for stochastic gradient descent. In International Conference on Learning Representations.

Markovic, J. and Taylor, J. (2016). Bootstrap inference after using multiple queries for model selection. arXiv preprint arXiv:1612.07811.

Niederreiter, H. (1987). Point sets and sequences with small discrepancy. Monatshefte für Mathematik, 104:273-337.

Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods. SIAM.

Owen, A. B. (1995). Randomly permuted (t, m, s)-nets and (t, s)-sequences. In Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, pages 299-317, New York. Springer-Verlag.

Owen, A. B. (1997a). Monte Carlo variance of scrambled net quadrature. SIAM Journal on Numerical Analysis, 34(5):1884-1910.

Owen, A. B. (1997b). Scrambled net variance for integrals of smooth functions. Annals of Statistics, 25(4):1541-1562.

Owen, A. B. and Tribble, S. D. (2005). A quasi-Monte Carlo Metropolis algorithm. Proceedings of the National Academy of Sciences, 102(25):8844-8849.

Pagès, G. and Panloup, F. (2018).
Weighted multilevel Langevin simulation of invariant measures. Annals of Applied Probability, 28(6):3358-3417.

Raginsky, M., Rakhlin, A., and Telgarsky, M. (2017). Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674-1703. PMLR.

Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341-363.

Rowland, M., Choromanski, K. M., Chalus, F., Pacchiano, A., Sarlos, T., Turner, R. E., and Weller, A. (2018). Geometrically coupled Monte Carlo sampling. Advances in Neural Information Processing Systems, 31.

Shi, J., Liu, C., and Mackey, L. (2022). Sampling with mirrored Stein operators. In International Conference on Learning Representations.

Sobol', I. M. (1967). On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 7(4):784-802.

Tausworthe, R. C. (1965). Random numbers generated by linear recurrence modulo two. Mathematics of Computation, 19(90):201-209.

Teh, Y. W., Thiery, A. H., and Vollmer, S. J. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17.

Tribble, S. D. (2007). Markov chain Monte Carlo algorithms using completely uniformly distributed driving sequences. PhD thesis, Stanford University.

Vollmer, S. J., Zygalakis, K. C., and Teh, Y. W. (2016). Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1):5504-5548.

Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681-688.

Xu, P., Chen, J., Zou, D., and Gu, Q. (2018). Global convergence of Langevin dynamics based algorithms for nonconvex optimization. Advances in Neural Information Processing Systems, 31.

A.1 Proof of Theorem 4.1

We start by decomposing the error $\big|\frac{1}{n}\sum_{k=1}^n f(\theta_k) - \pi(f)\big|$ into three parts:
$$\left|\frac{1}{n}\sum_{k=1}^n f(\theta_k) - \pi(f)\right| \le \left|\frac{1}{n}\sum_{k=\ell+1}^n \big(f(\theta_k) - \bar f_\ell(w_k^{(\ell)})\big)\right| + \left|\frac{1}{n}\sum_{k=\ell+1}^n \bar f_\ell(w_k^{(\ell)}) - \pi(f)\right| + \frac{2\ell\,\|f\|_\infty}{n} =: (\mathrm{I}) + (\mathrm{II}) + \frac{2\ell\,\|f\|_\infty}{n}.$$

We first upper bound (I).

Lemma 1 (Upper bound on (I); adapted from Lemma 6.1.4 of Chen (2011)). If the transition map $\psi$ is a contraction with parameter $\rho$ and $f$ is 1-Lipschitz, then
$$\big|\bar f_\ell(w_k^{(\ell)}) - f(\theta_k)\big| \le \Big( \max_{0 \le i \le n} \|\theta_i\| + \mathbb{E}_\pi[\|\theta\|] \Big)\rho^\ell.$$

Proof of Lemma 1. Note that
$$\big|\bar f_\ell(w_k^{(\ell)}) - f(\theta_k)\big| \le \int \big|f(\psi^\ell(\theta, w_k^{(\ell)})) - f(\psi^\ell(\theta_{k-\ell}, w_k^{(\ell)}))\big|\,\pi(\mathrm{d}\theta) \le \int \big\|\psi^\ell(\theta, w_k^{(\ell)}) - \psi^\ell(\theta_{k-\ell}, w_k^{(\ell)})\big\|\,\pi(\mathrm{d}\theta) \le \rho \int \big\|\psi^{\ell-1}(\theta, w_{k-1}^{(\ell-1)}) - \psi^{\ell-1}(\theta_{k-\ell}, w_{k-1}^{(\ell-1)})\big\|\,\pi(\mathrm{d}\theta) \le \cdots \le \rho^\ell \int \|\theta - \theta_{k-\ell}\|\,\pi(\mathrm{d}\theta) \le \rho^\ell \Big( \max_{0 \le i \le n} \|\theta_i\| + \mathbb{E}_\pi[\|\theta\|] \Big).$$

To bound (II), note that $\frac{1}{n-\ell}\sum_{k=\ell+1}^n \bar f_\ell(w_k^{(\ell)})$ is estimating
$$\mathbb{E}\big[\bar f_\ell(w^{(\ell)})\big] = \int f\big(\psi^\ell(\theta, w^{(\ell)})\big)\,\pi(\mathrm{d}\theta)\,\mathrm{d}w^{(\ell)} =: \pi P^\ell(f),$$
where $\pi P^\ell$ denotes the distribution of the $\ell$-step state $\theta_\ell$ started from $\theta_0 \sim \pi$. So we have the further decomposition
$$(\mathrm{II}) \le \big|\pi(f) - \pi P^\ell(f)\big| + \left|\frac{1}{n-\ell}\sum_{k=\ell+1}^n \bar f_\ell(w_k^{(\ell)}) - \pi P^\ell(f)\right| + \frac{\ell\,\|f\|_\infty}{n-\ell} =: (\mathrm{II})' + (\mathrm{II})'' + \frac{\ell\,\|f\|_\infty}{n-\ell}.$$
The first term (II)' is due to the discretization in time; the second term (II)'' is the numerical integration error. To bound (II)', we use the following result.

Lemma 2 (Upper bound on the discretization error (II)'). Under Assumption 1, we have for $f$ 1-Lipschitz,
$$\big|\pi(f) - \pi P^\ell(f)\big| \le \frac{3\sqrt{2}}{2}\,\frac{L}{M}\,h^{1/2}\sqrt{d}.$$

Proof of Lemma 2. Let $\theta(t)$ be the continuous-time Langevin diffusion with $\theta(0) = \theta_0 \sim \pi$, coupled with the LMC chain through $W_{t_{k+1}} - W_{t_k} = \sqrt{h}\,\xi_{k+1}$, where $\xi_{k+1} \overset{\mathrm{iid}}{\sim} N(0, I_d)$ and $t_k = kh$.
So we have
$$\theta(t_{k+1}) = \theta(t_k) - \int_{t_k}^{t_{k+1}} \nabla U(\theta(s))\,\mathrm{d}s + \sqrt{2}\,(W_{t_{k+1}} - W_{t_k}), \qquad \theta_{k+1} = \theta_k - h\,\nabla U(\theta_k) + \sqrt{2h}\,\xi_{k+1}.$$
Combining the previous two equations gives
$$\theta(t_{k+1}) - \theta_{k+1} = \theta(t_k) - \theta_k - h\big[\nabla U(\theta(t_k)) - \nabla U(\theta_k)\big] - \int_{t_k}^{t_{k+1}} \big[\nabla U(\theta(s)) - \nabla U(\theta(t_k))\big]\,\mathrm{d}s.$$
Let $\Delta_k = \theta(t_k) - \theta_k$. The last display reads
$$\Delta_{k+1} = \Delta_k - h\big[\nabla U(\theta_k + \Delta_k) - \nabla U(\theta_k)\big] - \int_{t_k}^{t_{k+1}} \big[\nabla U(\theta(s)) - \nabla U(\theta(t_k))\big]\,\mathrm{d}s.$$
By the contraction property (6) in the main paper, $\big\|\Delta_k - h[\nabla U(\theta_k + \Delta_k) - \nabla U(\theta_k)]\big\| \le \rho\,\|\Delta_k\|$. Taking expectations and using the $L$-smoothness of $U$, we have
$$\mathbb{E}[\|\Delta_{k+1}\|] \le \rho\,\mathbb{E}[\|\Delta_k\|] + L \int_{t_k}^{t_{k+1}} \mathbb{E}[\|\theta(s) - \theta(t_k)\|]\,\mathrm{d}s.$$
By Lemma 3 of Dalalyan and Karagulyan (2019), $\mathbb{E}[\|\nabla U(\theta)\|_2^2] \le Ld$, so $\mathbb{E}[\|\nabla U(\theta)\|] \le \sqrt{\mathbb{E}[\|\nabla U(\theta)\|_2^2]} \le \sqrt{Ld}$. Because $\theta(t)$ is a stationary process,
$$\int_{t_k}^{t_{k+1}} \mathbb{E}[\|\theta(s) - \theta(t_k)\|]\,\mathrm{d}s = \int_0^h \mathbb{E}[\|\theta(t) - \theta(0)\|]\,\mathrm{d}t \le \int_0^h \left( \int_0^t \mathbb{E}[\|\nabla U(\theta(s))\|]\,\mathrm{d}s + \sqrt{2}\,\mathbb{E}[\|W_t\|] \right)\mathrm{d}t \le \int_0^h \Big( t\sqrt{Ld} + \sqrt{2t}\,\mathbb{E}[\|\xi_1\|] \Big)\,\mathrm{d}t,$$
where $\mathbb{E}[\|\xi_1\|] = \sqrt{2}\,\Gamma(d/2 + 1/2)/\Gamma(d/2) \le \sqrt{d}$. Thus, using $Lh \le 2$,
$$\int_{t_k}^{t_{k+1}} \mathbb{E}[\|\theta(s) - \theta(t_k)\|]\,\mathrm{d}s \le \frac{1}{2} L^{1/2} h^2 d^{1/2} + \frac{2\sqrt{2}}{3} h^{3/2} d^{1/2} \le \frac{\sqrt{2}}{2} h^{3/2} d^{1/2} + \frac{2\sqrt{2}}{3} h^{3/2} d^{1/2} \le \frac{3\sqrt{2}}{2} h^{3/2} d^{1/2}.$$
Denote $r = \frac{3\sqrt{2}}{2} L h^{3/2} \sqrt{d}$. So
$$\mathbb{E}[\|\Delta_{k+1}\|] \le \rho\,\mathbb{E}[\|\Delta_k\|] + r \le \rho^{k+1}\,\mathbb{E}[\|\Delta_0\|] + \frac{r}{1 - \rho} = \frac{3\sqrt{2}}{2}\,\frac{L}{M}\,h^{1/2}\sqrt{d},$$
since $\Delta_0 = 0$ and $1 - \rho = hM$. Therefore, for any $k \ge 1$,
$$\big|\pi(f) - \pi P^k(f)\big| = \big|\mathbb{E}[f(\theta(t_k))] - \mathbb{E}[f(\theta_k)]\big| \le \mathbb{E}\big[|f(\theta(t_k)) - f(\theta_k)|\big] \le \frac{3\sqrt{2}}{2}\,\frac{L}{M}\,h^{1/2}\sqrt{d}.$$

If we use a noisy gradient $\hat g(\theta_k) = \nabla U(\theta_k) + e_k$, where the noise $e_k$ has mean zero and bounded variance $\mathbb{E}[\|e_k\|_2^2] \le \sigma^2$, then an extra term $2h\sigma$ appears in Lemma 2. As $\sigma^2$ is usually expected to be proportional to the dimension, this additional term is of the same order as the other term.

Theorem A.1 (Theorem 9.8 of Niederreiter (1992)). Let $v_0, v_1, \ldots$ be an LFSR sequence with offset $s$ and period $n = 2^m - 1$ satisfying $\gcd(m, n) = 1$. Then the point set $\{u_i\}_{i=0}^{n-1} \subseteq [0,1]^d$ with $u_i = (v_i, v_{i+1}, \ldots, v_{i+d-1})$ has, on average, star discrepancy $O(n^{-1}(\log n)^{d+1} \log\log n)$, with an implied constant depending only on $d$, where the average is taken over all primitive polynomials over $\mathrm{GF}(2)$ of degree $m$.

Proof of Theorem 4.1. The error on the left-hand side is bounded by
$$(\mathrm{I}) + (\mathrm{II})' + (\mathrm{II})'' + \frac{4\ell\,\|f\|_\infty}{n}.$$
Lemma 1 shows that $(\mathrm{I}) \le \big(\max_{0 \le i \le n}\|\theta_i\| + \mathbb{E}_\pi[\|\theta\|]\big)\rho^\ell$.