# Online Sampling from Log-Concave Distributions

Holden Lee (Duke University), Oren Mangoubi (Worcester Polytechnic Institute), Nisheeth K. Vishnoi (Yale University)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Given a sequence of convex functions $f_0, f_1, \ldots, f_T$, we study the problem of sampling from the Gibbs distribution $\pi_t \propto e^{-\sum_{k=0}^t f_k}$ for each epoch $t$ in an online manner. Interest in this problem derives from applications in machine learning, Bayesian statistics, and optimization where, rather than obtaining all the observations at once, one constantly acquires new data and must continuously update the distribution. Our main result is an algorithm that generates roughly independent samples from $\pi_t$ for every epoch $t$ and, under mild assumptions, makes $\mathrm{polylog}(T)$ gradient evaluations per epoch. All previous results imply a bound on the number of gradient or function evaluations which is at least linear in $T$. Motivated by real-world applications, we assume that the functions are smooth, that their associated distributions have a bounded second moment, and that their minimizer drifts in a bounded manner, but we do not assume they are strongly convex. In particular, our assumptions hold for online Bayesian logistic regression when the data satisfy natural regularity properties, giving a sampling algorithm with updates that are poly-logarithmic in $T$. In simulations, our algorithm achieves accuracy comparable to an algorithm specialized to logistic regression. Key to our algorithm is a novel stochastic gradient Langevin dynamics Markov chain with a carefully designed variance reduction step and constant batch size. Technically, lack of strong convexity is a significant barrier to analysis and, here, our main contribution is a martingale exit time argument that shows our Markov chain remains in a ball of radius roughly poly-logarithmic in $T$ for enough time to reach within $\varepsilon$ of $\pi_t$.

1 Introduction

In this paper, we study the following online sampling problem:

Problem 1.1. Consider a sequence of convex functions $f_0, f_1, \ldots, f_T : \mathbb{R}^d \to \mathbb{R}$ for some $T \in \mathbb{N}$, and let $\varepsilon > 0$. At each epoch $t \in \{1, \ldots, T\}$, the function $f_t$ is given to us, so that we have oracle access to the gradients of the first $t+1$ functions $f_0, f_1, \ldots, f_t$. The goal at each epoch $t$ is to generate a sample from the distribution $\pi_t(x) \propto e^{-\sum_{k=0}^t f_k(x)}$ with fixed total-variation (TV) error $\varepsilon$. The samples at different time steps should be almost independent.

Various versions of this problem have been considered in the literature, with applications in Bayesian statistics, optimization, and theoretical computer science; see [NR17, DDFMR00, ADH10] and references therein. If $f$ is convex, then the distribution $p \propto e^{-f}$ is log-concave; this captures a large class of useful distributions such as the Gaussian, exponential, Laplace, Dirichlet, gamma, beta, and chi-squared distributions. We give some settings where online sampling can be used:

Online posterior sampling. In Bayesian statistics, the goal is to infer the probability distribution (the posterior) of a parameter, based on observations; however, rather than obtaining all the observations at once, one constantly acquires new data and must continuously update the posterior distribution, rather than only after all data is collected. Suppose $p_0 \propto e^{-f_0}$ for a given prior distribution, and samples $y_t$ drawn from the conditional distribution $p(y_t \mid \theta, y_1, \ldots, y_{t-1})$ arrive in a streaming manner. By Bayes's rule, letting $p_t(\theta) := p(\theta \mid y_1, \ldots, y_t)$ be the posterior distribution and $e^{-f_t(\theta)} \propto p(y_t \mid \theta, y_1, \ldots, y_{t-1})$, we have the following recursion: $p_t(\theta) \propto p_{t-1}(\theta)\, p(y_t \mid \theta, y_1, \ldots, y_{t-1})$. Hence, $p_t(\theta) \propto e^{-\sum_{k=0}^t f_k(\theta)}$. The goal is to sample from $p_t(\theta)$ for each $t$. This fits the setting of Problem 1.1 if $p_0$ and all updates $p(y_t \mid \theta, y_1, \ldots, y_{t-1})$ are log-concave. One practical application is online logistic regression; logistic regression is a common model for binary classification. Another is inference for Gaussian processes, which are used in many Bayesian models because of their flexibility, and where stochastic gradient Langevin algorithms have been applied [FE15]. A third application is latent Dirichlet allocation (LDA), often used for document classification [BNJ03]. As new documents are published, it is desirable to update the distribution of topics without excessive re-computation.¹

Optimization. One online optimization method is to sample a point from the exponential of the (weighted) negative loss ([CBL06, HAK07], Lemma 10 in [NR17]). There are settings such as online logistic regression where the only known way to achieve optimal regret is a Bayesian sampling approach [FKL+18], with lower bounds known for the naive convex optimization approach [HKL14].

Reinforcement learning (RL). Thompson sampling [RVRK+18, DFE18] solves RL problems by maximizing the expected reward at each period with respect to a sample from the Bayesian posterior for the environment parameters, reducing it to the online posterior sampling problem.

In all of these applications, because a sample is needed at every epoch $t$, it is desirable to have a fast online sampling algorithm. In particular, the ultimate goal is to design an algorithm for Problem 1.1 such that the number of gradient evaluations is almost constant at each epoch $t$, so that the computational requirements at each epoch do not increase over time. This is challenging because at epoch $t$, one has to incorporate information from all $t+1$ functions $f_0, \ldots, f_t$ in roughly $O(1)$ time.

Our main contribution is an algorithm for Problem 1.1 that computes $\tilde{O}_T(1)$ gradients per epoch, under mild assumptions on the functions.² All previous rigorous results (even with comparable assumptions) imply a bound on the number of gradient or function evaluations which is at least linear in $T$; see Table 1. Our assumptions are motivated by real-world considerations and hold in the setting of online Bayesian logistic regression when the data vectors satisfy natural regularity properties. In the offline setting, our result also implies the first algorithm to sample from a $d$-dimensional log-concave distribution $\pi \propto e^{-\sum_{t=1}^T f_t}$ where the $f_t$'s are not assumed strongly convex and the total number of gradient evaluations is roughly $T\log(T) + \mathrm{poly}(d)$, instead of the $T \cdot \mathrm{poly}(d)$ implied by prior works (Table 1).

A natural approach to online sampling is to design a Markov chain with the right steady-state distribution [NR17, DMM19, DCWY18, CFM+18]. The main difficulty is that running a step of a Markov chain that incorporates all previous functions takes time $\Omega(t)$ at epoch $t$; all previous algorithms with provable guarantees suffer from this. To overcome this, one must use stochasticity, for example, sample a subset of the previous functions. However, this fails because of the large variance of the gradient. Our result relies on a stochastic gradient Langevin dynamics (SGLD) Markov chain with a carefully designed variance reduction step and fixed batch size.
We emphasize that we do not assume that the functions $f_t$ are strongly convex. This is important for applications such as logistic regression. Even if the negative log-prior $f_0$ is strongly convex, we cannot obtain the same bounds by using existing results on strongly convex $f$, because the bounds depend on the condition number of $\sum_{t=0}^T f_t$, which grows as $T$. Lack of strong convexity is a technical barrier to analyzing our Markov chain and, here, our main contribution is a martingale exit time argument that shows that our Markov chain is constrained to a ball of radius roughly $1/\sqrt{t}$ for time that is sufficient for it to reach within $\varepsilon$ of $\pi_t$.

¹Note that LDA requires sampling from non-log-concave distributions. Our algorithm can be used for non-log-concave distributions, but our theoretical guarantees are only for log-concave distributions.
²The subscript $T$ in $\tilde{O}_T$ means that we only show the dependence on the parameters $t, T$, and exclude dependence on non-$T,t$ parameters such as the dimension $d$, the sampling accuracy $\varepsilon$, and the regularity parameters $C, D, L$ which we define in Section 2.1.

2 Our algorithm and results

2.1 Assumptions

Denote by $\mathcal{L}(Y)$ the distribution of a random variable $Y$. For any two probability measures $\mu, \nu$, denote the 2-Wasserstein distance by $W_2(\mu, \nu) := \inf_{(X,Y) \in \Gamma(\mu,\nu)} (\mathbb{E}[\|X - Y\|^2])^{1/2}$, where $\Gamma(\mu, \nu)$ denotes the set of all possible couplings of random vectors $(\hat{X}, \hat{Y})$ with marginals $\hat{X} \sim \mu$ and $\hat{Y} \sim \nu$. For every $t \in \{0, \ldots, T\}$, define $F_t := \sum_{k=0}^t f_k$, and let $x^*_t$ be a minimizer of $F_t(x)$ on $\mathbb{R}^d$. For any $x \in \mathbb{R}^d$, let $\delta_x$ be the Dirac delta distribution centered at $x$. We make the following assumptions:

Assumption 1 (Smoothness/Lipschitz gradient (with constants $L_0, L > 0$)). For all $1 \le t \le T$ and $x, y \in \mathbb{R}^d$, $\|\nabla f_t(y) - \nabla f_t(x)\| \le L\|x - y\|$. For $t = 0$, $\|\nabla f_0(y) - \nabla f_0(x)\| \le L_0\|x - y\|$.

We allow $f_0$ to satisfy our assumptions with a different parameter value, since in Bayesian applications $f_0$ models a "prior" which has different scaling from $f_1, f_2, \ldots, f_T$.

Assumption 2 (Bounded second moment with exponential concentration (with constants $A, k > 0$, $c \ge 0$)). For all $0 \le t \le T$ and all $s \ge 0$, $\mathbb{P}_{X \sim \pi_t}\big(\|X - x^*_t\| \ge s/\sqrt{t+c}\big) \le A e^{-ks}$.

Note that Assumption 2 implies a bound on the second moment, $m_2^{1/2} := (\mathbb{E}_{x \sim \pi_t}\|x - x^*_t\|^2)^{1/2} \le C/\sqrt{t+c}$ for $C := (2 + 1/k)\log(A/k^2)$. For conciseness, we write bounds in terms of this parameter $C$.³

Assumption 3 (Drift of mode (with constants $D \ge 0$, $c \ge 0$)). For all $0 \le t \le \tau \le T$ such that $\tau \in [t, \max\{2t, 1\}]$, $\|x^*_\tau - x^*_t\| \le D/\sqrt{t+c}$.

Assumption 2 says that the data is informative enough that the current distribution $\pi_t$ (posterior) concentrates near the mode $x^*_t$ as $t$ increases. The $1/\sqrt{t}$ decrease in the second moment is what one would expect based on central limit theorems such as the Bernstein-von Mises theorem. Assumption 2 is a weaker condition than strong convexity: if the $f_t$'s are $\alpha$-strongly convex, then $\pi_t(x) \propto e^{-\sum_{k=0}^t f_k(x)}$ concentrates to within roughly $1/\sqrt{\alpha(t+1)}$ of its mode; however, many distributions satisfy Assumption 2 without being strongly log-concave. For instance, posterior distributions used in Bayesian logistic regression satisfy Assumption 2 under natural conditions on the data, but are not strongly log-concave with comparable parameters (Section 2.4). Hence, together Assumptions 1 and 2 are a weaker condition than strong convexity and gradient Lipschitzness, the typical assumptions under which the offline algorithm is analyzed. Similar to the typical assumptions, our assumptions avoid the ill-conditioned case when the distribution becomes more concentrated in one direction than another as the number of functions $t$ increases.

Assumption 3 is typically satisfied in the setting where the $f_t$'s are iid. This is the case when we observe iid random variables and define functions $f_t$ based on them, as will be the case for our application to Bayesian logistic regression (Problem 2.2). To help with intuition, note that Assumption 3 is satisfied for the problem of Gaussian mean estimation: the mode is the same as the mean, and the assumption reduces to the fact that a random walk drifts on the order of $\sqrt{t}$, and hence the mean of the posterior drifts by $O_T(1/\sqrt{t})$, after $t$ time steps. We need this assumption because our algorithm uses cached gradients computed $\Theta(t)$ time steps ago, and in order for the past gradients to be close in value to the gradient at the current point, the points where the gradients were last calculated should be at distance $O_T(1/\sqrt{t})$ from the current point. We give a simple example where the assumptions hold (Appendix G of the supplement). In Section 2.4 we show these assumptions hold for functions arising in online Bayesian logistic regression; unlike previous work on related techniques [NDH+17, CFM+18], our assumptions are weak enough to hold in such applications, as they do not require $f_0, \ldots, f_T$ to be strongly convex.
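As a concrete illustration (a worked example of our own, separate from the one in Appendix G of the supplement), the assumptions can be checked directly for Gaussian mean estimation with a standard Gaussian prior and i.i.d. observations $y_1, \ldots, y_T \sim N(\theta_0, I_d)$.

```latex
% Illustration only (not from the paper): Gaussian mean estimation.
\[
  f_0(x) = \tfrac{1}{2}\|x\|^2, \qquad
  f_k(x) = \tfrac{1}{2}\|x - y_k\|^2 \ (k \ge 1), \qquad
  \pi_t(x) \propto e^{-\sum_{k=0}^{t} f_k(x)} .
\]
% Assumption 1: each f_k is 1-smooth, so L_0 = L = 1.
% Completing the square, \pi_t = N(x_t^*, (t+1)^{-1} I_d) with mode
\[
  x_t^* = \frac{1}{t+1}\sum_{k=1}^{t} y_k ,
\]
% so \|X - x_t^*\|\sqrt{t+1} is the norm of a standard d-dimensional Gaussian,
% which has sub-Gaussian tails; Assumption 2 holds with c = 1 and constants
% A, k (hence C) depending only on d.
% Assumption 3: for \tau \in [t, 2t], both x_\tau^* and x_t^* are sample means of
% i.i.d. N(\theta_0, I_d) draws (up to the prior's O(1/t) shrinkage), so w.h.p.
\[
  \|x_\tau^* - x_t^*\| \le \|x_\tau^* - \theta_0\| + \|x_t^* - \theta_0\|
  = O\!\big(\sqrt{d/t}\,\big),
\]
% matching the D/\sqrt{t+c} drift bound with D = O(\sqrt{d}).
```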
2.2 Algorithm for online sampling

At every epoch $t = 1, \ldots, T$, given gradient access to the functions $f_0, \ldots, f_t$, Algorithm 2 generates a point $X^t$ approximately distributed according to $\pi_t \propto e^{-\sum_{k=0}^t f_k(x)}$. It does so by running SAGA-LD (Algorithm 1), with a step size $\eta_t$ that decreases with the epoch, and a given number of steps $i_{\max}$.

³Having a bounded second moment suffices to obtain (weaker) polynomial bounds (by replacing the use of the concentration inequality with Chebyshev's inequality). We use this slightly stronger condition because exponential concentration improves the dependence on $\varepsilon$, and is typically satisfied in practice.

Our main Theorem 2.1 says that for each sample to have fixed TV error $\varepsilon$, at each epoch the number of steps $i_{\max}$ only needs to be poly-logarithmic in $T$. Algorithm 1 makes the following update at each step of the SGLD Markov chain $X_i$, for a certain choice of stochastic gradient $g_i$ with $\mathbb{E}[g_i] = \sum_{k=0}^t \nabla f_k(X_i)$:
$$X_{i+1} = X_i - \eta_t g_i + \sqrt{2\eta_t}\,\xi_i, \qquad \xi_i \sim N(0, I_d). \qquad (1)$$
Key to our algorithm is the construction of the variance-reduced stochastic gradient $g_i$. It is constructed by taking the sum of the cached gradients at previous points in the chain and correcting it with a batch of constant size $b$. This variance reduction is only effective when the points where the cached gradients were computed stay within $\tilde{O}_T(1/\sqrt{t})$ of the current mode $x^*_t$. Algorithm 2 ensures that this holds with high probability by resetting to the sample at the previous power of 2 if the sample has drifted too far.

The step size $\eta_t$ is determined by an input parameter $\eta_0 > 0$. We set $\eta_t = \eta_0/(t+c)$ for the following reason: Assumption 2 says that the variance of the target distribution $\pi_t$ decreases at the rate $C^2/(t+c)$, and we want to ensure that the variance of each step of Langevin dynamics decreases at roughly the same rate. With the step size $\eta_t = \eta_0/(t+c)$, the Markov chain can travel across a sub-level set containing most of the probability measure of $\pi_t$ in roughly the same number $i_{\max} = \tilde{O}_T(1)$ of steps at each epoch $t$. We take the acceptance radius to be $\mathcal{C}_0 = 2.5(C_1 + D)$, where $C_1$ is given by (65) in the supplement, and show that with good probability this choice of $\mathcal{C}_0$ ensures $\|X^{t-1} - X^{t_0}\| \le 4(C_1+D)/\sqrt{t+c}$ in Algorithm 2.
Note that in practice, one need not know the values of the regularity constants in Assumptions 1-3, but can instead use heuristics to tune the Markov chain's parameters.

Algorithm 1 SAGA-LD
Input: gradient oracles $\nabla f_k$ for $k \in \{0, \ldots, t\}$, step size $\eta > 0$, batch size $b \in \mathbb{N}$, number of steps $i_{\max}$, initial point $X_0$, cached gradients $G_k = \nabla f_k(u_k)$ for some points $u_k$, and $s = \sum_{k=1}^t G_k$.
Output: $X_{i_{\max}}$
1: for $i$ from 0 to $i_{\max} - 1$ do
2:   (Sample batch) Sample with replacement a (multi)set $S$ of size $b$ from $\{1, \ldots, t\}$.
3:   (Calculate gradients) For each $k \in S$, let $G_k^{\mathrm{new}} = \nabla f_k(X_i)$.
4:   (Variance-reduced gradient estimate) Let $g_i = \nabla f_0(X_i) + s + \frac{t}{b}\sum_{k \in S}(G_k^{\mathrm{new}} - G_k)$.
5:   (Langevin step) Let $X_{i+1} = X_i - \eta g_i + \sqrt{2\eta}\,\xi_i$ where $\xi_i \sim N(0, I)$.
6:   (Update sum) Update $s \leftarrow s + \sum_{k \in \mathrm{set}(S)}(G_k^{\mathrm{new}} - G_k)$.
7:   (Update gradients) For each $k \in S$, update $G_k \leftarrow G_k^{\mathrm{new}}$.
8: end for

2.3 Result in the online setting

In this section we give our main result for the online sampling problem; for additional results on the offline sampling problem, see Appendix A in the supplement.

Theorem 2.1 (Online variance-reduced SGLD). Suppose that $f_0, \ldots, f_T : \mathbb{R}^d \to \mathbb{R}$ are (weakly) convex and satisfy Assumptions 1-3 with $c = L_0/L$. Let $C = (2 + 1/k)\log(A/k^2)$. Then there exist parameters $b = 9$, $\eta_0 = \tilde{\Theta}\big(\tfrac{1}{L^2\log^6(T)(C+D)^2 d}\big)$, and $i_{\max} = \tilde{O}\big((C+D)^2\log^2(T)\big)$, such that at each epoch $t$, Algorithm 2 generates an $\varepsilon$-approximate independent sample $X^t$ from $\pi_t$.⁴ The total number of gradient evaluations required at each epoch $t$ is polynomial in $d, L, C, D, \varepsilon^{-1}$ and $\log(T)$. Here, $\tilde{\Theta}$ and $\tilde{O}$ hide polylogarithmic factors in $d, L, C, D, \varepsilon^{-1}$ and $\log(T)$. Note that the dependence of $i_{\max}$ on $\varepsilon$ is $i_{\max} = \tilde{O}_\varepsilon(\mathrm{poly}(1/\varepsilon))$.

See Section B.4 in the supplement for the proof of Theorem 2.1. Note that the algorithm needs to know the parameters, but bounds on them are enough. Previous results all imply a bound on the number of gradient or function evaluations⁵ at each epoch which is at least linear in $T$. Our result is the first to obtain bounds on the number of gradient evaluations which are poly-logarithmic, rather than linear, in $T$ at each epoch.

⁴See Definition B.1 in the supplement for the formal definition. Necessarily, $\|\mathcal{L}(X^t) - \pi_t\|_{TV} \le \varepsilon$.
⁵In our setting a gradient can be computed with at worst $2d$ function evaluations. In many applications (including logistic regression) gradient evaluation takes the same number of operations as function evaluation.

Algorithm 2 Online SAGA-LD
Input: $T \in \mathbb{N}$ and gradient oracles for functions $f_t : \mathbb{R}^d \to \mathbb{R}$ for all $t \in \{0, \ldots, T\}$, where only the gradient oracles $\nabla f_0, \ldots, \nabla f_t$ are available at epoch $t$; an initial point $X^0 \in \mathbb{R}^d$.
Input: step size $\eta_0 > 0$, batch size $b > 0$, $i_{\max} > 0$, constant offset $c$, acceptance radius $\mathcal{C}_0$.
Output: At each epoch $t$, a sample $X^t$
1: Set $s = 0$.  ▷ Initial gradient sum
2: for epoch $t = 1$ to $T$ do
3:   Set $t_0 = 2^{\lfloor \log_2(t-1) \rfloor}$ if $t > 1$, and $t_0 = 0$ if $t = 1$.  ▷ The previous power of 2
4:   if $\|X^{t-1} - X^{t_0}\| \le \mathcal{C}_0/\sqrt{t+c}$ then $X^t_0 \leftarrow X^{t-1}$  ▷ If the previous sample hasn't drifted too far, use it as a warm start
5:   else $X^t_0 \leftarrow X^{t_0}$  ▷ If the previous sample has drifted too far, reset to the sample at time $t_0$
6:   end if
7:   Set $G_t \leftarrow \nabla f_t(X^t_0)$
8:   Set $s \leftarrow s + G_t$.
9:   For all gradients $G_k = \nabla f_k(u_k)$ which were last updated at time $t/2$, replace them by $\nabla f_k(X^t_0)$ and update $s$ accordingly.
10:  Draw $i_t$ uniformly from $\{1, \ldots, i_{\max}\}$.
11:  Run Algorithm 1 with step size $\eta_0/(t+c)$, batch size $b$, number of steps $i_t$, initial point $X^t_0$, and precomputed gradients $G_k$ with sum $s$. Keep track of when the gradients are updated.
12:  Return the output $X^t = X^t_{i_t}$ of Algorithm 1.
13: end for

We are able to do better by exploiting the sum structure of $\sum_{k=0}^t f_k$ and the fact that the $\pi_t$ evolve slowly. See Section 4 for a detailed comparison.
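To make the update in Algorithm 1 concrete, the following is a minimal Python sketch of the SAGA-LD inner loop. It is an illustration rather than the authors' implementation: the interface (`grad_fs`, a list of per-function gradient oracles with the prior's gradient at index 0, and plain NumPy arrays for the cached gradients and their running sum) is our own choice, and the reset logic of Algorithm 2 is omitted.

```python
import numpy as np

def saga_ld(grad_fs, eta, b, i_max, x0, cached_grads, grad_sum, rng=None):
    """Sketch of SAGA-LD (Algorithm 1): variance-reduced SGLD with cached gradients.

    grad_fs:      list of gradient oracles [grad_f0, grad_f1, ..., grad_ft]
    cached_grads: array of shape (t, d); row k-1 holds G_k = grad_fk(u_k)
    grad_sum:     current value of s = sum_{k=1}^t G_k
    """
    rng = rng or np.random.default_rng()
    t = len(grad_fs) - 1              # number of non-prior functions f_1, ..., f_t
    x = x0.copy()
    G = cached_grads.copy()
    s = grad_sum.copy()
    for _ in range(i_max):
        # (Sample batch) a multiset S of size b from {1, ..., t}, with replacement
        batch = rng.integers(1, t + 1, size=b)
        # (Calculate gradients) fresh gradients at the current point X_i
        fresh = np.array([grad_fs[k](x) for k in batch])
        # (Variance-reduced gradient estimate)
        #   g_i = grad_f0(X_i) + s + (t / b) * sum_{k in S} (grad_fk(X_i) - G_k)
        g = grad_fs[0](x) + s + (t / b) * (fresh - G[batch - 1]).sum(axis=0)
        # (Langevin step) X_{i+1} = X_i - eta * g_i + sqrt(2 eta) * xi_i
        x = x - eta * g + np.sqrt(2.0 * eta) * rng.standard_normal(x.shape)
        # (Update sum and cached gradients) each distinct k in the batch counted once,
        # using the gradient already computed at the pre-step point X_i
        for k in set(batch.tolist()):
            new_gk = fresh[np.where(batch == k)[0][0]]
            s = s + (new_gk - G[k - 1])
            G[k - 1] = new_gk
    return x, G, s
```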
2.4 Application to Bayesian logistic regression

Next, we show that Assumptions 1-3, and therefore Theorem 2.1, hold in the setting of online Bayesian logistic regression when the data satisfy certain regularity properties. Logistic regression is a fundamental and widely used model in Bayesian statistics [AC93]. It has served as a model problem for methods in scalable Bayesian inference [WT11, HCB16, CB19, CB18], of which online sampling is one approach. Additionally, sampling from the logistic regression posterior is the key step in the optimal algorithm for online logistic regret minimization [FKL+18].

In Bayesian logistic regression, one models the data $(u_t \in \mathbb{R}^d, y_t \in \{-1, 1\})$ as follows: there is some unknown $\theta_0 \in \mathbb{R}^d$ such that given $u_t$ (the "independent variable"), for all $t \in \{1, \ldots, T\}$ the dependent variable $y_t$ follows a Bernoulli distribution with success probability $\phi(u_t^\top \theta_0)$ ($y_t = 1$ with probability $\phi(u_t^\top \theta_0)$ and $y_t = -1$ otherwise), where $\phi(x) := 1/(1+e^{-x})$. The problem we consider is:

Problem 2.2 (Bayesian logistic regression). Suppose the $y_t$'s are generated from the $u_t$'s as Bernoulli random variables with success probability $\phi(u_t^\top \theta_0)$. At every epoch $t \in \{1, \ldots, T\}$, after observing $(u_k, y_k)_{k=1}^t$, return a sample from the posterior distribution⁶ $\hat{\pi}_t(\theta) \propto e^{-\sum_{k=0}^t \hat{f}_k(\theta)}$, where $\hat{f}_0(\theta) := \|\theta\|^2/2$ and $\hat{f}_k(\theta) := -\log[\phi(y_k u_k^\top \theta)]$.

We show that Algorithm 2 succeeds for Bayesian logistic regression under reasonable conditions on the data-generating distribution, namely, that inputs are bounded and we see data in all directions.⁷

Theorem 2.3 (Online Bayesian logistic regression). Suppose that for some $B, M, \sigma > 0$, we have $\|\theta_0\| \le B$ and that the $u_t \sim P_u$ are iid, where $P_u$ is a distribution satisfying the following: for $u \sim P_u$, (1) $\|u\| \le M$ ("bounded") and (2) $\mathbb{E}_u[u u^\top \mathbf{1}_{|u^\top \theta_0| \le 2}] \succeq \sigma I_d$ ("restricted covariance matrix is bounded away from 0"). Then for the functions $\hat{f}_0, \ldots, \hat{f}_T$ in Problem 2.2, and any $\varepsilon > 0$, there exist parameters $L$, $\log(A)$, $k^{-1}$, $D = \mathrm{poly}(M, \sigma^{-1}, B, d, \varepsilon^{-1}, \log(T))$ such that Assumptions 1, 2, and 3 hold for all $t$ with probability at least $1 - \varepsilon$. Therefore, Algorithm 2 gives $\varepsilon$-approximate samples from $\hat{\pi}_t$ for $t \in [1, T]$ with $\mathrm{poly}(M, \sigma^{-1}, B, d, \varepsilon^{-1}, \log(T))$ gradient evaluations at each epoch.

⁶Here we use a Gaussian prior, but this can be replaced by any $e^{-f_0}$ where $f_0$ is strongly convex and smooth.
⁷For simplicity, we state the result (Theorem 2.3) in the case where the input variables $u$ are iid, but note that the result holds more generally (see Lemma E.1 in the supplement for a more general statement of our result).

In Section 5 we show that in numerical simulations, our algorithm achieves competitive accuracy with the same runtime compared to an algorithm specialized to logistic regression, the Pólya-Gamma sampler. However, the Pólya-Gamma sampler has two drawbacks: its running time at each epoch scales linearly in $t$ (our algorithm scales as $\mathrm{polylog}(t)$), and it is unknown whether Pólya-Gamma attains TV error $\varepsilon$ in time polynomial in $1/\varepsilon$, $t$, $d$, and the other problem parameters.
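As an illustration of how Problem 2.2 plugs into the sampler, the snippet below builds gradient oracles for $\nabla \hat{f}_k$ under the stated model (Gaussian prior, labels $y_k \in \{-1, +1\}$); it is a sketch with helper names of our choosing, compatible with the SAGA-LD sketch above.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function phi

def make_logistic_grad_oracles(U, y):
    """Gradient oracles for hat-f_0(theta) = ||theta||^2 / 2 (Gaussian prior) and
    hat-f_k(theta) = -log phi(y_k <u_k, theta>), k = 1, ..., t, with y_k in {-1, +1}.

    U: array of shape (t, d) of inputs u_k; y: array of shape (t,) of labels."""
    def grad_f0(theta):
        return theta                                 # gradient of ||theta||^2 / 2
    def make_grad_fk(k):
        u_k, y_k = U[k - 1], y[k - 1]
        def grad_fk(theta):
            z = y_k * u_k.dot(theta)
            return -(1.0 - expit(z)) * y_k * u_k     # -(1 - phi(y_k u_k^T theta)) y_k u_k
        return grad_fk
    return [grad_f0] + [make_grad_fk(k) for k in range(1, len(y) + 1)]
```

At epoch $t$, one would pass the first $t+1$ oracles (with the corresponding cached gradients) to the SAGA-LD sketch; when a new pair $(u_t, y_t)$ arrives, the list simply grows by one oracle.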
3 Proof overview for online problem

For the online problem, information-theoretic constraints require us to use "information" from at least $\Omega(t)$ gradients to sample with fixed TV error at the $t$-th epoch (see Appendix H). Thus, to use only $\tilde{O}_T(1)$ gradients at each epoch, we must reuse gradient information from past epochs.

We accomplish this by reusing gradients computed at points in the Markov chain, including points at past epochs. This saves a factor of $T$ over naive SGLD, but only if we can show that these past points in the chain track the distribution's mode, and that our chain stays close to the mode (Lemma B.2 in the supplement). The distribution is concentrated to $O_T(1/\sqrt{t})$ at the $t$-th epoch (Assumption 2), and we need the Markov chain to stay within $\tilde{O}_T(1/\sqrt{t})$ of the mode. The bulk of the proof (Lemma B.3 in the supplement) is to show that with high probability (w.h.p.) the chain stays within this ball. Once we establish that the Markov chain stays close, we combine our bounds with existing results on SGLD from [DMM19] to show that we only need $\tilde{O}_T(1)$ steps per epoch (Lemma B.6). Finally, an induction with careful choice of constants finishes the proof (Theorem 2.1). Details of each of these steps follow.

Bounding the variance of the stochastic gradient (see Lemma B.2). We reduce the variance of our stochastic gradient by using the gradient evaluated at a past point $u_k$ and estimating the difference in the gradients between our current point $X^t_i$ and the past point $u_k$. Using the $L$-Lipschitz property (Assumption 1) of the gradients, we show that the variance of this stochastic gradient is bounded by $\frac{t^2 L^2}{b}\max_k \|X^t_i - u_k\|^2$. To obtain this bound, observe that the individual components $\{\nabla f_k(X^t_i) - \nabla f_k(u_k)\}_{k \in S}$ of the stochastic gradient $g^t_i$ have variance at most $t^2 L^2 \max_k \|X^t_i - u_k\|^2$ by the Lipschitz property; averaging over a batch saves a factor of $b$. For the number of gradient evaluations to stay nearly constant at each step, increasing the batch size is not a viable option for decreasing the stochastic gradient's variance. Rather, showing that $\|X^t_i - u_k\|$ decreases as $\|X^t_i - u_k\| = \tilde{O}_T(1/\sqrt{t})$ implies that the variance of our stochastic gradient decreases at each epoch at the desired rate.
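For concreteness, the calculation behind this bound can be written out as follows (a short sketch in the notation of Algorithm 1, making the argument above explicit).

```latex
% With S a batch of b indices drawn uniformly with replacement from {1, ..., t} and
%   g_i = \nabla f_0(X_i) + s + (t/b) \sum_{k \in S} (\nabla f_k(X_i) - G_k),
% each rescaled summand t(\nabla f_K(X_i) - G_K), K uniform on {1,...,t}, is an
% unbiased estimate of \sum_{k=1}^t (\nabla f_k(X_i) - G_k), so g_i is unbiased for
% \nabla F_t(X_i), and averaging b independent copies gives
\[
  \operatorname{Var}(g_i)
  \le \frac{t^2}{b}\,\max_{1 \le k \le t}\big\|\nabla f_k(X_i) - \nabla f_k(u_k)\big\|^2
  \le \frac{t^2 L^2}{b}\,\max_{1 \le k \le t}\|X_i - u_k\|^2 ,
\]
% using Assumption 1 in the last step.  Keeping \max_k \|X_i - u_k\| at scale
% \tilde O_T(1/\sqrt{t}) is thus exactly what makes a constant batch size b viable.
```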
Bounding the escape time from a ball where the stochastic gradient has low variance (see Lemma B.3). Our main challenge is to bound the distance $\|X_i - u_k\|$. Because we do not assume strong convexity, we cannot use the proof techniques of past papers analyzing variance-reduced SGLD methods. [CFM+18, NDH+17] used strong convexity to show that w.h.p. the Markov chain does not travel too far from its initial point, implying a bound on the variance of their stochastic gradients. Unfortunately, many important applications, including logistic regression, lack strong convexity. To deal with the lack of strong convexity, we instead use a martingale exit time argument to show that the Markov chain remains inside a ball of radius $r = \tilde{O}_T(1/\sqrt{t})$ w.h.p. for a time $i_{\max}$ large enough for the Markov chain to reach a point within TV distance $\varepsilon$ of the target distribution. Towards this end, we would like to bound the distance $\|X^t_i - x^*_t\|$ from the current state of the Markov chain to the mode by $\tilde{O}_T(1/\sqrt{t})$, and to bound $\|x^*_t - u_k\|$ by $\tilde{O}_T(1/\sqrt{t})$. Together, this allows us to bound the distance $\|X^t_i - u_k\| = \tilde{O}_T(1/\sqrt{t})$. We can then use our bound on $\|X^t_i - u_k\| = \tilde{O}_T(1/\sqrt{t})$ together with Lemma B.2 to bound the variance of the stochastic gradient by roughly $\tilde{O}_T(1/t)$.

Bounding $\|x^*_t - u_k\|$. Since $u_k$ is a point of the Markov chain, possibly at a previous epoch $\tau$, roughly speaking we can bound this distance inductively by using bounds obtained at the previous epoch (Lemma B.6). Noting that $u_k = X^\tau_i$ for some $i \le i_{\max}$, we use our bound $\|u_k - x^*_\tau\| = O_T(1/\sqrt{\tau}) = O_T(1/\sqrt{t})$ obtained at the previous epoch $\tau$, together with Assumption 3, which says that $\|x^*_\tau - x^*_t\| = O_T(1/\sqrt{t})$, to bound $\|x^*_t - u_k\| = O_T(1/\sqrt{t})$.

Bounding $\|X^t_i - x^*_t\|$. To bound the distance $\Delta_i := \|X^t_i - x^*_t\|$ to the mode, we would like to bound the increase $\Delta_{i+1} - \Delta_i$ at each step $i$ of the Markov chain. Unfortunately, the expected increase in the distance $\|X^t_i - x^*_t\|$ is much larger when the Markov chain is close to the mode than when it is far away from the mode, making it difficult to get a tight bound on the increase in the distance at each step. To get around this problem, we instead use a martingale exit time argument on $\|X^t_i - x^*_t\|^2$, the squared distance from the current state of the Markov chain to the mode. The advantage of using the squared distance is that the expected increase in the squared distance due to the Gaussian noise term $\sqrt{2\eta_t}\,\xi_i$ in the Markov chain update rule (Equation (1)) is the same regardless of the position of the chain, allowing us to obtain tighter bounds on the increase regardless of the Markov chain's current position. We then use weak convexity to bound the component of the increase in $\|X^t_i - x^*_t\|^2$ that is due to the gradient term $\eta_t g_i$, and apply Azuma's martingale concentration inequality to bound the exit time from the ball, showing the chain remains at distance roughly $\tilde{O}_T(1/\sqrt{t})$ from the mode.
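The one-step expansion driving this exit-time argument can be sketched as follows (our paraphrase of the idea above, with $x^*_t$ the mode and $F_t = \sum_{k=0}^t f_k$).

```latex
% One step of (1): X_{i+1} = X_i - \eta_t g_i + \sqrt{2\eta_t}\,\xi_i, with \xi_i ~ N(0, I_d):
\[
  \|X_{i+1} - x^*_t\|^2
  = \|X_i - x^*_t\|^2
    - 2\eta_t \langle g_i, X_i - x^*_t \rangle
    + \eta_t^2 \|g_i\|^2
    + 2\sqrt{2\eta_t}\,\langle \xi_i,\; X_i - \eta_t g_i - x^*_t \rangle
    + 2\eta_t \|\xi_i\|^2 .
\]
% Taking conditional expectations: E[g_i] = \nabla F_t(X_i), and (weak) convexity of F_t
% gives <\nabla F_t(X_i), X_i - x^*_t> >= 0; the cross term with \xi_i has mean zero; and
% E[2\eta_t \|\xi_i\|^2] = 2\eta_t d regardless of where the chain is.  The remaining
% zero-mean fluctuations are then controlled with Azuma's inequality, bounding the exit
% time from the ball of radius \tilde O_T(1/\sqrt{t}).
```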
Bounding the TV error (Lemma B.6). We now show that if $u_k$ is close to $x^*_\tau$, then $X^t$ will be a good sample from $\pi_t$. More precisely, we show that if at epoch $t$ the Markov chain starts at $X^t_0$ such that $\|X^t_0 - x^*_t\| \le R/\sqrt{t+c}$ ($R$ to be chosen later), then $\|\mathcal{L}(X^t) - \pi_t\|_{TV} \le O(\varepsilon/\log_2(T))$. To do this, we use two bounds: a bound on the Wasserstein distance between the initial point $X^t_0$ and the target density $\pi_t$, and a bound on the variance of the stochastic gradient. We then plug these bounds into Corollary 18 of [DMM19] (reproduced as Theorem B.4 in the supplementary material) to show that $i_{\max} = \tilde{O}_{\varepsilon,T}(\mathrm{poly}(1/\varepsilon))$ steps per epoch are sufficient to obtain a bound of $\varepsilon$ on the TV error.

Bounding the number of gradient evaluations at each epoch. Working out constants, we see that $i_{\max} = \mathrm{poly}(d, L, C, D, \varepsilon^{-1}, \log(T))$ suffices to obtain TV error $\varepsilon$ at each epoch. A constant batch size suffices, so the total number of evaluations is $O(i_{\max} b) = \mathrm{poly}(d, L, C, D, \varepsilon^{-1}, \log(T))$.

4 Related work

Online convex optimization. Our motivation for studying the online sampling problem comes partly from the successes of online (convex) optimization [Haz16]. In online convex optimization, one chooses a point $x_t \in K$ at each step and suffers a loss $f_t(x_t)$, where $K$ is a compact convex set and $f_t : K \to \mathbb{R}$ is a convex function [Zin03]. The aim is to minimize the regret compared to the best point in hindsight, $\mathrm{Regret}_T = \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x)$. The same offline convex optimization algorithms such as gradient descent and Newton's method can be adapted to the online setting [Zin03, HAK07].

Online sampling. To the best of our knowledge, all previous algorithms with provable guarantees in our setting require computation time that grows polynomially with $t$. This is because any Markov chain taking all previous data into account needs $\Theta(t)$ gradient (or function) evaluations per step. On the other hand, there are many streaming algorithms that are used in practice which lack provable guarantees, or which rely on properties of the data (such as compressibility [HCB16, CB19]). The most relevant theoretical work in our direction is [NR17]. The authors consider a changing log-concave distribution on a convex body, and show that under certain conditions, they can use the previous sample as a warm start and only take a constant number of steps of their Dikin walk chain at each stage. They consider the online sampling problem in the more general setting where the distribution is restricted to a convex body. However, [NR17] do not achieve optimal results in our setting, since they do not separately consider the case when $F_t = \sum_{k=0}^t f_k$ has a sum structure and therefore require $\Omega(t)$ function evaluations at epoch $t$. Moreover, they do not consider how concentration properties of the distribution translate into more efficient sampling. When the $f_t$ are linear, they need $O_T(1)$ steps and $O_T(t)$ evaluations per epoch. However, in the general convex setting with smooth $f_t$'s, they need $O_T(t)$ steps per epoch and $O_T(t^2)$ evaluations per epoch.

There are many other online sampling methods and other approaches to estimating changing distributions used in practice. The Laplace approximation, perhaps the simplest, approximates the posterior distribution with a Gaussian [BDT16]; however, most distributions cannot be well-approximated by Gaussians. Stochastic gradient Langevin dynamics [WT11] can be used in an online setting; however, it suffers from large variance, which we address in this work. The particle filter [DMHW+12, GDM+17] is a general algorithm to track changing distributions. Another approach (besides sampling) is variational inference, which has also been considered in an online setting [WPB11, BBW+13].

Variance reduction techniques. Variance reduction techniques for SGLD were initially proposed in [DRW+16] for sampling from a fixed distribution $\pi \propto e^{-\sum_{t=0}^T f_t}$. [DRW+16] propose two variance-reduced SGLD techniques, CV-ULD and SAGA-LD. CV-ULD re-computes the full gradient $\nabla F$ at an anchor point every $r$ steps and updates the gradient at intermediate steps by subsampling the difference in the gradients between the current point and the anchor point. SAGA-LD, on the other hand, keeps track of when each gradient $\nabla f_t$ was computed, and updates individual gradients with respect to when they were last computed. [CFM+18] show that CV-ULD can sample in the offline setting in roughly $T + (d^2/\varepsilon)(L/m)^6$ gradient evaluations, and that SAGA-LD can sample in $T + T^{3/2}(1 + L_H)$ evaluations, where $L_H$ is the Lipschitz constant of the Hessian of $\log(\pi)$.⁸

⁸The bounds of [CFM+18] are given for sampling within a specified Wasserstein error, not TV error. The bounds we give here are the number of gradient evaluations one would need if one samples with Wasserstein error $\tilde{\varepsilon}$ which roughly corresponds to TV error $\varepsilon$; roughly, one requires $\tilde{\varepsilon} = O(\varepsilon/\sqrt{T})$ to sample with TV error $\varepsilon$.

| Algorithm | Oracle calls per epoch | Other assumptions |
|---|---|---|
| Online Dikin walk [NR17, §5.1] | $O_T(T)$ | strong convexity; bounded ratio of densities |
| Langevin [DMM19, DCWY18] | $O_T(T)$ | |
| SGLD [DMM19] | $O_T(T)$ | |
| SAGA-LD [CFM+18] | $O_T(T)$ | strong convexity; Lipschitz Hessian |
| CV-ULD [CFM+18] | $O_T(T)$ | strong convexity |
| This work | $\mathrm{polylog}(T)$ | bounded second moment; bounded drift of minimizer |

Table 1: Bounds on the number of gradient (or function) evaluations required by different algorithms to solve the online sampling problem. A Lipschitz gradient is assumed for all algorithms. [NR17] analyzed the online Dikin walk for a different setting where the target has compact support; here we give the result one should obtain for support $\mathbb{R}^d$, where it reduces to the ball walk. Thus it is possible the assumptions we give for the online Dikin walk can be weakened. Note that the number of gradient or function evaluations for the basic Langevin and SGLD algorithms and the online Dikin walk depends multiplicatively on $T$ (i.e., $T \cdot \mathrm{poly}(d, L, \text{other parameters})$), while the number of evaluations for the variance-reduced SGLD methods depends only additively on $T$ (i.e., $T + \mathrm{poly}(d, L, \text{other parameters})$).

5 Simulations

We test our algorithm against other sampling algorithms on a synthetic dataset for logistic regression. The dataset consists of $T = 1000$ data points in dimension $d = 20$.
We compare the marginal accuracies of the algorithms. The data is generated as follows. First, $\theta \sim N(0, I_d)$ and $b \sim N(0, 1)$ are randomly generated. For each $1 \le t \le T$, a feature vector $x_t \in \mathbb{R}^d$ and output $y_t \in \{0, 1\}$ are generated by drawing each coordinate $x_{t,i}$ from a Bernoulli distribution (with parameter chosen so that the expected sparsity is $s$), and
$$y_t \sim \mathrm{Bernoulli}\big(\phi(\theta^\top x_t + b)\big), \qquad (3)$$
where the sparsity is $s = 5$ in our simulations and $\phi(x) = \frac{1}{1+e^{-x}}$ is the logistic function. We chose $x_t \in \{0, 1\}^d$ because in applications, features are often indicators.

The algorithms are tested in an online setting as follows. At epoch $t$, each algorithm has access to $x_s, y_s$ for $s \le t$, and attempts to generate a sample from the posterior distribution $p_t(\theta, b) \propto p_0(\theta, b) \prod_{s=1}^t \phi(\theta^\top x_s + b)^{y_s}\,(1 - \phi(\theta^\top x_s + b))^{1 - y_s}$; the time is limited to 0.1 seconds. We estimate the quality of the samples at $t = T = 1000$ by saving the state of the algorithm at $t = T - 1$ and re-running it 1000 times to collect 1000 samples. We replicate this entire simulation 8 times, and the marginal accuracies of the runs are given in Figure 1.

The marginal accuracy (MA) is a heuristic to compare the accuracy of samplers (see e.g. [DMS17], [FOW11] and [CR+17]). The marginal accuracy between the measure $\mu$ of a sample and the target $\pi$ is $\mathrm{MA}(\mu, \pi) := 1 - \frac{1}{d}\sum_{i=1}^d \|\mu_i - \pi_i\|_{TV}$, where $\mu_i$ and $\pi_i$ are the marginal distributions of $\mu$ and $\pi$ for the coordinate $x_i$. Since MALA is known to sample from the correct stationary distribution for the class of distributions analyzed in this paper, we let $\pi$ be the estimate of the true distribution obtained from 1000 samples generated by running MALA for a long time (1000 steps). We estimate the TV distance by the TV distance between the histograms when the bin widths are 0.25 times the sample standard deviation of the corresponding coordinate of $\pi$.
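For reference, the marginal accuracy described above can be estimated from two sets of samples roughly as in the following sketch (our own code, not the experiment scripts, matching the histogram-based TV estimate with bin width 0.25 times the per-coordinate standard deviation of the reference samples).

```python
import numpy as np

def marginal_accuracy(samples_mu, samples_pi, bin_factor=0.25):
    """Estimate MA(mu, pi) = 1 - (1/d) * sum_i TV(mu_i, pi_i) from samples.

    samples_mu, samples_pi: arrays of shape (n_mu, d) and (n_pi, d).
    Per-coordinate TV is estimated between histograms whose bin width is
    bin_factor times the sample standard deviation of the reference coordinate."""
    d = samples_pi.shape[1]
    tv_sum = 0.0
    for i in range(d):
        a, b = samples_mu[:, i], samples_pi[:, i]
        width = bin_factor * b.std()
        lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
        edges = np.arange(lo, hi + width, width) if width > 0 else np.array([lo, hi + 1.0])
        p, _ = np.histogram(a, bins=edges)
        q, _ = np.histogram(b, bins=edges)
        p = p / p.sum()
        q = q / q.sum()
        tv_sum += 0.5 * np.abs(p - q).sum()   # TV distance between the two histograms
    return 1.0 - tv_sum / d
```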
Figure 1: Marginal accuracies of 5 different sampling algorithms on online logistic regression, with $T = 1000$ data points, dimension $d = 20$, and time 0.1 seconds per epoch, averaged over 8 runs. SGLD and online Laplace perform much worse and are not pictured.

| Algorithm | Mean marginal accuracy |
|---|---|
| SGLD | 0.442 |
| Online Laplace | 0.571 |
| MALA | 0.901 |
| Pólya-Gamma | 0.921 |
| Online SAGA-LD (our algorithm) | 0.921 |
| Full Laplace | 0.924 |

We compare our online SAGA-LD algorithm with SGLD, the full and online Laplace approximations, Pólya-Gamma, and MALA. The Laplace method approximates the target distribution with a multivariate Gaussian distribution. Here, one first finds the mode of the target distribution using a deterministic optimization technique and then computes the Hessian $\nabla^2 F_t$ of the negative log-posterior at the mode. The inverse of this Hessian is the covariance matrix of the Gaussian. In the online version of the algorithm, given in [CL11], to speed up optimization, only a quadratic approximation (with diagonal Hessian) of the log-posterior is maintained. The Pólya-Gamma chain [DFE18] is a Markov chain specialized to sample from the posterior for logistic regression. Note that, in contrast, our algorithm works more generally for any smooth probability distribution over $\mathbb{R}^d$. Our results show that our online SAGA-LD algorithm is competitive with the best samplers for logistic regression, namely the Pólya-Gamma Markov chain and the full Laplace approximation.

We note that the full Laplace approximation requires optimizing a sum of $t$ functions, which has runtime that scales linearly with $t$ at each epoch, while our method scales only as $\mathrm{polylog}(t)$.

The parameters are as follows. The step size at epoch $t$ is $\frac{0.1}{1+0.5t}$ for MALA, $\frac{0.01}{1+0.5t}$ for SGLD, and $\frac{0.05}{1+0.5t}$ for online SAGA-LD. A smaller step size must be used with SGLD because of the increased variance. For MALA, a larger step size can be used because the Metropolis-Hastings acceptance step ensures the stationary distribution is correct. The batch size for SGLD and online SAGA-LD is 64. The step sizes $\eta_0$ were chosen by hand from testing various values in the range from 0.001 to 1.0. We found the reset step of our online SAGA-LD algorithm, and the random number of steps, to be unnecessary in practice, so the results are reported for our online SAGA-LD algorithm without these features. The experiments were run on Fujitsu CX2570 M2 servers with dual 14-core 2.4 GHz Intel Xeon E5-2680 v4 processors with 384 GB RAM running the Springdale distribution of Linux.

6 Conclusion and future work

In this paper we obtain logarithmic-in-$T$ bounds at each epoch when sampling from a sequence of log-concave distributions $\pi_t \propto e^{-\sum_{k=0}^t f_k}$, improving on previous results which are linear in $T$ in the online setting. Since we do not assume the $f_t$'s are strongly convex, we also obtain bounds which have an improved dependence on $T$ for a wider range of applications, including Bayesian logistic regression. While our assumption of Lipschitz gradients requires the target to have full support on $\mathbb{R}^d$, one can also consider extending our $\mathrm{polylog}(T)$ bounds to log-densities supported on a compact set. Restricting the distribution to have compact support can cause the target distribution's covariance matrix to become increasingly ill-conditioned as the number of functions $t$ increases. To overcome this, we could modify our algorithm by including an adaptive pre-conditioner which changes along with the target distribution.

Acknowledgments

This research was partially supported by NSF CCF-1908347 and SNSF 200021_182527 grants.

References

[AC93] James H. Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, 1993.

[ADH10] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.

[AWBR09] Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

[BBW+13] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735, 2013.

[BDT16] Rina Foygel Barber, Mathias Drton, and Kean Ming Tan. Laplace approximation in high-dimensional Bayesian regression. In Statistical Analysis for High-Dimensional Data, pages 15–36. Springer, 2016.

[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[CB18] Trevor Campbell and Tamara Broderick. Bayesian coreset construction via greedy iterative geodesic ascent. In International Conference on Machine Learning, pages 697–705, 2018.

[CB19] Trevor Campbell and Tamara Broderick.
Automated scalable Bayesian inference via Hilbert coresets. The Journal of Machine Learning Research, 20(1):551–588, 2019.

[CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[CFM+18] Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 764–773, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[CL11] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

[CR+17] Nicolas Chopin, James Ridgway, et al. Leave Pima Indians alone: binary regression as a benchmark for Bayesian computation. Statistical Science, 32(1):64–87, 2017.

[DCWY18] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Proceedings of the 2018 Conference on Learning Theory, PMLR 75, 2018.

[DDFMR00] Arnaud Doucet, Nando De Freitas, Kevin Murphy, and Stuart Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 176–183. Morgan Kaufmann Publishers Inc., 2000.

[DFE18] Bianca Dumitrascu, Karen Feng, and Barbara E. Engelhardt. PG-TS: Improved Thompson sampling for logistic contextual bandits. In Advances in Neural Information Processing Systems, 2018.

[DMHW+12] Pierre Del Moral, Peng Hu, Liming Wu, et al. On the concentration properties of interacting particle processes. Foundations and Trends in Machine Learning, 3(3-4):225–389, 2012.

[DMM19] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.

[DMS17] Alain Durmus, Eric Moulines, and Eero Saksman. On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166, 2017.

[DRW+16] Kumar Avinava Dubey, Sashank J. Reddi, Sinead A. Williamson, Barnabas Poczos, Alexander J. Smola, and Eric P. Xing. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pages 1154–1162, 2016.

[DSCW18] Chris De Sa, Vincent Chen, and Wing Wong. Minibatch Gibbs sampling on large graphical models. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1165–1173, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[FE15] Maurizio Filippone and Raphael Engler. Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the unbiased linear system solver (ULISSE). In International Conference on Machine Learning, pages 1015–1024, 2015.

[FKL+18] Dylan J. Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, and Karthik Sridharan. Logistic regression: The importance of being improper. Proceedings of Machine Learning Research, 75:1–42, 2018.

[FOW11] Christel Faes, John T. Ormerod, and Matt P. Wand. Variational Bayesian inference for parametric and nonparametric regression with missing data. Journal of the American Statistical Association, 106(495):959–971, 2011.

[GDM+17] François Giraud, Pierre Del Moral, et al.
Nonasymptotic analysis of adaptive and annealed Feynman-Kac particle models. Bernoulli, 23(1):670–709, 2017.

[GLR18] Rong Ge, Holden Lee, and Andrej Risteski. Simulated tempering Langevin Monte Carlo II: An improved proof using soft Markov chain decomposition. arXiv preprint arXiv:1812.00793, 2018.

[HAK07] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[Haz16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[HCB16] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.

[HKL14] Elad Hazan, Tomer Koren, and Kfir Y. Levy. Logistic regression: Tight bounds for stochastic and online optimization. In Conference on Learning Theory, pages 197–209, 2014.

[KM15] Vladimir Koltchinskii and Shahar Mendelson. Bounding the smallest singular value of a random matrix without concentration. International Mathematics Research Notices, 2015(23):12991–13008, 2015.

[Men14] Shahar Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25–39, 2014.

[NDH+17] Tigran Nagapetyan, Andrew B. Duncan, Leonard Hasenclever, Sebastian J. Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1706.02692, 2017.

[Nic12] Richard Nickl. Statistical theory. Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, 2012.

[NR17] Hariharan Narayanan and Alexander Rakhlin. Efficient sampling from time-varying log-concave distributions. The Journal of Machine Learning Research, 18(1):4017–4045, 2017.

[RVRK+18] Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

[WPB11] Chong Wang, John Paisley, and David Blei. Online variational inference for the hierarchical Dirichlet process. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 752–760, 2011.

[WT11] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

[Zin03] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.