Journal of Machine Learning Research 18 (2017) 1-33. Submitted 1/15; Revised 11/16; Published 2/17.

# On Perturbed Proximal Gradient Algorithms

Yves F. Atchadé (yvesa@umich.edu), University of Michigan, 1085 South University, Ann Arbor, MI 48109, United States

Gersende Fort (gersende.fort@telecom-paristech.fr), LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 46 rue Barrault, 75013 Paris, France

Éric Moulines (eric.moulines@polytechnique.edu), CMAP, INRIA XPOP, École Polytechnique, 91128 Palaiseau, France

Editor: Léon Bottou

**Abstract.** We study a version of the proximal gradient algorithm for which the gradient is intractable and is approximated by Monte Carlo methods (and in particular Markov chain Monte Carlo). We derive conditions on the step size and the Monte Carlo batch size under which convergence is guaranteed; both increasing and constant batch sizes are considered. We also derive non-asymptotic bounds for an averaged version. Our results cover both biased and unbiased Monte Carlo approximations. To support our findings, we discuss the inference of a sparse generalized linear model with random effects and the problem of learning the edge structure and parameters of sparse undirected graphical models.

**Keywords:** proximal gradient methods; stochastic optimization; Monte Carlo approximations; perturbed majorization-minimization algorithms.

## 1. Introduction

This paper deals with statistical optimization problems of the form

$$\text{(P)} \qquad \min_{\theta \in \mathbb{R}^d} F(\theta)\,, \qquad \text{with } F = f + g\,.$$

This problem occurs in a variety of statistical and machine learning settings, where $f$ is a measure of fit depending implicitly on some observed data and $g$ is a regularization term that imposes structure on the solution. Typically, $f$ is a differentiable function with a Lipschitz gradient, whereas $g$ might be non-smooth (typical examples include sparsity-inducing penalties).
**H1** The function $g : \mathbb{R}^d \to [0, +\infty]$ is convex, not identically $+\infty$, and lower semicontinuous. The function $f : \mathbb{R}^d \to \mathbb{R}$ is convex, continuously differentiable on $\mathbb{R}^d$, and there exists a finite non-negative constant $L$ such that, for all $\theta, \theta' \in \mathbb{R}^d$,

$$\|\nabla f(\theta) - \nabla f(\theta')\| \le L\, \|\theta - \theta'\|\,,$$

where $\nabla f$ denotes the gradient of $f$. We denote by $\Theta$ the domain of $g$: $\Theta \overset{\text{def}}{=} \{\theta \in \mathbb{R}^d : g(\theta) < \infty\}$.

©2017 Yves Atchadé, Gersende Fort and Éric Moulines. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/15-038.html.

**H2** The set $\operatorname{argmin}_{\theta \in \Theta} F(\theta)$ is a non-empty subset of $\Theta$.

In this paper, we focus on the case where $f + g$ and $\nabla f$ are both intractable. This setting has not been widely considered despite the considerable importance of such models in statistics and machine learning. Intractable likelihood problems naturally occur, for example, in inference for Bayesian networks (e.g. learning the edge structure and the parameters of an undirected graphical model), regression with latent variables or random effects, missing data, etc. In such applications, $f$ is the negated log-likelihood of a conditional Gibbs measure $\pi_\theta$ known only up to a normalization constant, and the gradient of $f$ is typically expressed as a very high-dimensional integral with respect to the associated Gibbs measure, $\nabla f(\theta) = \int H_\theta(x)\,\pi_\theta(\mathrm{d}x)$. Of course, this integral cannot be computed in closed form and must be approximated. Most often, some form of Monte Carlo integration (such as Markov chain Monte Carlo, or MCMC) is the only option.

To cope with problems where $f + g$ is intractable and possibly non-smooth, various methods have been proposed. Some of these works focused on stochastic sub-gradient and mirror descent algorithms; see Nemirovski et al. (2008); Duchi et al. (2011); Cotter et al. (2011); Lan (2012); Juditsky and Nemirovski (2012a,b).
Other authors have proposed algorithms based on proximal operators to better exploit the smoothness of $f$ and the properties of $g$; see e.g. Combettes and Wajs (2005); Hu et al. (2009); Xiao (2010); Juditsky and Nemirovski (2012a,b). The current paper focuses on the proximal gradient algorithm (see e.g. Beck and Teboulle (2010); Combettes and Pesquet (2011); Parikh and Boyd (2013) for literature reviews and further references). The proximal map (Moreau, 1962) associated to $g$ is defined for $\gamma > 0$ and $\theta \in \mathbb{R}^d$ by

$$\operatorname{Prox}_{\gamma,g}(\theta) \overset{\text{def}}{=} \operatorname{argmin}_{\vartheta \in \Theta} \left\{ g(\vartheta) + \frac{1}{2\gamma}\, \|\vartheta - \theta\|^2 \right\}\,. \tag{1}$$

Note that under H1 there exists a unique point $\vartheta$ minimizing the right-hand side of (1) for any $\theta \in \mathbb{R}^d$ and $\gamma > 0$. The proximal gradient algorithm is an iterative algorithm which, given an initial value $\theta_0 \in \Theta$ and a sequence of positive step sizes $\{\gamma_n, n \in \mathbb{N}\}$, produces a sequence of parameters $\{\theta_n, n \in \mathbb{N}\}$ as follows:

**Algorithm 1 (Proximal gradient algorithm)** Given $\theta_n$, compute

$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1},g}\left(\theta_n - \gamma_{n+1}\, \nabla f(\theta_n)\right)\,. \tag{2}$$

When $\gamma_n = \gamma$ for all $n$, the iterates $\{\theta_n, n \in \mathbb{N}\}$ of Algorithm 1 converge to a point $\theta^\star$ which is a fixed point of the proximal-gradient map

$$T_\gamma(\theta) \overset{\text{def}}{=} \operatorname{Prox}_{\gamma,g}\left(\theta - \gamma\, \nabla f(\theta)\right)\,. \tag{3}$$

More generally, under H1 and H2, when $\gamma_n \in (0, 2/L]$ and $\inf_n \gamma_n > 0$, it is known that the iterates $\{\theta_n, n \in \mathbb{N}\}$ defined by (2) converge to a point in the set $\mathcal{L}$ of the solutions of (P), which coincides with the set of fixed points of the mapping $T_\gamma$ for any $\gamma \in (0, 2/L)$:

$$\mathcal{L} \overset{\text{def}}{=} \operatorname{argmin}_{\theta \in \Theta} F(\theta) = \{\theta \in \Theta : \theta = T_\gamma(\theta)\} \tag{4}$$

(see e.g. Combettes and Wajs, 2005, Theorem 3.4 and Proposition 3.1(iii)). Since $\nabla f(\theta)$ is intractable, the gradient $\nabla f(\theta_n)$ at the $n$-th iteration is replaced by an approximation $H_{n+1}$:

**Algorithm 2 (Perturbed proximal gradient algorithm)** Let $\theta_0 \in \Theta$ be the initial solution and $\{\gamma_n, n \in \mathbb{N}\}$ be a sequence of positive step sizes. For $n \ge 1$, given $(\theta_0, \ldots, \theta_n)$, construct an approximation $H_{n+1}$ of $\nabla f(\theta_n)$ and compute

$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1},g}\left(\theta_n - \gamma_{n+1} H_{n+1}\right)\,. \tag{5}$$

We provide in Theorem 2 sufficient conditions on the perturbation $\eta_{n+1} = H_{n+1} - \nabla f(\theta_n)$ for the convergence of the perturbed proximal gradient sequence given by (5). We then consider an averaging scheme of the perturbed proximal gradient algorithm: given non-negative weights $\{a_n, n \in \mathbb{N}\}$, Theorem 3 provides a non-asymptotic bound on the deviation between $\sum_{k=1}^n a_k F(\theta_k) / \sum_{k=1}^n a_k$ and the minimum of $F$. Our results complement and extend Rosasco et al. (2014); Nitanda (2014); Xiao and Zhang (2014).

We then consider the case where the gradient $\nabla f(\theta) = \int_{\mathsf{X}} H_\theta(x)\,\pi_\theta(\mathrm{d}x)$ is defined as an expectation (see H3 in Section 3). In this case, at each iteration $\nabla f(\theta_n)$ is approximated by a Monte Carlo average $H_{n+1} = m_{n+1}^{-1} \sum_{j=1}^{m_{n+1}} H_{\theta_n}(X_{n+1}^{(j)})$, where $m_{n+1}$ is the size of the Monte Carlo batch and $\{X_{n+1}^{(j)}, 1 \le j \le m_{n+1}\}$ is the Monte Carlo batch. Two different settings are covered. In the first setting, the samples $\{X_{n+1}^{(j)}, 1 \le j \le m_{n+1}\}$ are conditionally independent and identically distributed (i.i.d.) with distribution $\pi_{\theta_n}$; in this case, the conditional expectation of $H_{n+1}$ given all the past iterations, denoted by $\mathbb{E}[H_{n+1} \mid \mathcal{F}_n]$ (see Section 3), is equal to $\nabla f(\theta_n)$. In the second setting, the Monte Carlo batch $\{X_{n+1}^{(j)}, 1 \le j \le m_{n+1}\}$ is produced by running an MCMC algorithm; in this case, the conditional distribution of $X_{n+1}^{(j)}$ given the past is no longer exactly equal to $\pi_{\theta_n}$, which implies that $\mathbb{E}[H_{n+1} \mid \mathcal{F}_n] \ne \nabla f(\theta_n)$.

Theorem 4 (resp. Theorem 6) establishes the convergence of the sequence $\{\theta_n, n \in \mathbb{N}\}$ when the batch size $m_n$ is fixed (resp. increases with the number of iterations $n$). When the Monte Carlo batch $\{X_{n+1}^{(j)}, 1 \le j \le m_{n+1}\}$ is i.i.d. conditionally on the past, the two theorems essentially say that, with probability one, $\{\theta_n, n \in \mathbb{N}\}$ converges to an element of the set of minimizers $\mathcal{L}$ as soon as $\sum_n \gamma_n = +\infty$ and $\sum_n \gamma_{n+1}^2 / m_{n+1} < \infty$.
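To make Algorithm 2 concrete, here is a minimal sketch (not the authors' code; the objective, constants and function names are all illustrative) on a toy problem where $g = \lambda \|\cdot\|_1$, so that $\operatorname{Prox}_{\gamma,g}$ is coordinatewise soft-thresholding, and $f(\theta) = \|\theta - b\|^2/2$, whose gradient $\nabla f(\theta) = \theta - b$ is the expectation of $H_\theta(X) = X - b$ under $\pi_\theta = \mathcal{N}(\theta, I)$ and is estimated by an i.i.d. Monte Carlo batch as in the first setting above.

```python
import numpy as np

def prox_l1(theta, t):
    # Prox_{gamma, lam*||.||_1}: coordinatewise soft-thresholding at level t
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

def perturbed_prox_gradient(b, lam, n_iter=4000, m=10, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(b)
    for n in range(1, n_iter + 1):
        gamma = 0.5 / n ** 0.7                        # decreasing step size, fixed batch size m
        X = theta + rng.standard_normal((m, b.size))  # X^{(j)} ~ N(theta, I), i.i.d. batch
        H = (X - b).mean(axis=0)                      # Monte Carlo estimate of grad f(theta) = theta - b
        theta = prox_l1(theta - gamma * H, gamma * lam)
    return theta

b = np.array([2.0, -0.3, 0.0, 1.5])
lam = 0.5
theta_hat = perturbed_prox_gradient(b, lam)
# for this toy objective, the exact minimizer of 0.5*||theta-b||^2 + lam*||theta||_1
# is the soft-thresholding of b at level lam
print(theta_hat)
```

With a fixed batch size the iterates keep fluctuating at a scale set by $\gamma_n$, which is why the step size must decrease; increasing $m_n$ under a fixed step size is the other regime analyzed below.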
Hence, one can choose either a fixed step size $\gamma_n = \gamma$ and a batch size $\{m_n, n \in \mathbb{N}\}$ increasing at least linearly (up to a logarithmic factor), or a decreasing step size and a fixed batch size $m_n = m$. When $\{X_{n+1}^{(j)}, 1 \le j \le m_{n+1}\}$ is produced by an MCMC algorithm (under appropriate assumptions), our theorems essentially say that the same convergence result holds if $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_{n+1}^2 < \infty$ when $m_n = m$ is constant across iterations, or if $\sum_n \gamma_{n+1}/m_{n+1} < \infty$ when the batch size is increased.

Theorems 4 and 6 also provide non-asymptotic bounds for the difference $\Delta_n = \sum_{k=1}^n a_k F(\theta_k)/\sum_{k=1}^n a_k - \min F$ in $L^q$-norm for $q \ge 1$. When the batch size $m_{n+1}$ increases linearly with the iteration index while the step size $\gamma_{n+1}$ is held constant, $\Delta_n = O(\ln n / n)$. We recover (up to a logarithmic factor) the rate of the exact proximal gradient algorithm. If we now compare the complexity of the algorithms in terms of the number of simulations $N$ needed (and not the number of iterations), the error bound decreases like $O(N^{-1/2})$. The same error bound can be achieved by choosing a fixed batch size and a decreasing step size $\gamma_n = O(1/\sqrt{n})$.

In Section 4, these results are illustrated on the problem of estimating a high-dimensional discrete graphical model. In Section 5, we consider a high-dimensional random effect logistic regression model. All the proofs are postponed to Section 6.

## 2. Perturbed proximal gradient algorithms

The key property used to study the behavior of the perturbed proximal gradient sequence is the following elementary lemma, which might be seen as a deterministic version of the Robbins-Siegmund lemma (see e.g. Polyak, 1987, Lemma 11, Chapter 2). It replaces in our analysis (Combettes, 2001, Lemma 3.1) for quasi-Fejér sequences and modified Fejér monotone sequences (see Lin et al. (2015)). Compared to the Robbins-Siegmund lemma, the sequence $(\xi_n)_n$ is not assumed to be non-negative.
When applied in the stochastic context as in Section 3, the fact that the result is purely deterministic and deals with signed perturbations $\xi_n$ allows more flexibility in the study of the dynamics.

**Lemma 1** Let $\{v_n, n \in \mathbb{N}\}$ and $\{\chi_n, n \in \mathbb{N}\}$ be non-negative sequences and $\{\xi_n, n \in \mathbb{N}\}$ be such that $\sum_n \xi_n$ exists. If for any $n \ge 0$, $v_{n+1} \le v_n - \chi_{n+1} + \xi_{n+1}$, then $\sum_n \chi_n < \infty$ and $\lim_n v_n$ exists.

*Proof* See Section 6.2.1.

Applied with $v_n = \|\theta_n - \theta^\star\|^2$ for some $\theta^\star \in \mathcal{L}$, this lemma is the key result for the proof of the following theorem, which provides sufficient conditions on the step-size sequence $\{\gamma_n, n \in \mathbb{N}\}$ and on the approximation error

$$\eta_{n+1} \overset{\text{def}}{=} H_{n+1} - \nabla f(\theta_n) \tag{6}$$

for the sequence $\{\theta_n, n \in \mathbb{N}\}$ to converge to a point $\theta^\star$ in the set $\mathcal{L}$ of the minimizers of $F$. Denote by $\langle \cdot, \cdot \rangle$ the usual inner product on $\mathbb{R}^d$ associated to the norm $\|\cdot\|$.

**Theorem 2** Assume H1 and H2. Let $\{\theta_n, n \in \mathbb{N}\}$ be given by Algorithm 2 with step sizes satisfying $\gamma_n \in (0, 1/L]$ for any $n \ge 1$ and $\sum_n \gamma_n = +\infty$. If the following series converge,

$$\sum_{n \ge 0} \gamma_{n+1} \left\langle T_{\gamma_{n+1}}(\theta_n), \eta_{n+1} \right\rangle\,, \qquad \sum_{n \ge 0} \gamma_{n+1}\, \eta_{n+1}\,, \qquad \sum_{n \ge 0} \gamma_{n+1}^2\, \|\eta_{n+1}\|^2\,, \tag{7}$$

then there exists $\theta^\star \in \mathcal{L}$ such that $\lim_n \theta_n = \theta^\star$.

*Proof* See Section 6.2.2.

Theorem 2 applied with $\eta_{n+1} = 0$ provides sufficient conditions for the convergence of Algorithm 1 to $\mathcal{L}$: the algorithm converges as soon as $\gamma_n \in (0, 1/L]$ and $\sum_n \gamma_n = +\infty$. Sufficient conditions for the convergence of $\{\theta_n, n \in \mathbb{N}\}$ are also provided in Combettes and Wajs (2005). When applied to our setting, (Combettes and Wajs, 2005, Theorem 3.4) requires $\sum_n \|\eta_{n+1}\| < \infty$ and $\inf_n \gamma_n > 0$, which for instance cannot accommodate the fixed Monte Carlo batch-size stochastic algorithms considered in this paper. The same limitation applies to the analysis of the stochastic quasi-Fejér iterations (see Combettes and Pesquet (2015a)), which in our particular case requires $\sum_n \gamma_{n+1} \|\eta_{n+1}\| < \infty$. These conditions are weakened in Theorem 2.
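As a sanity check of the exact regime ($\eta_{n+1} \equiv 0$), the following sketch (a toy Lasso problem with illustrative sizes, not taken from the paper) runs Algorithm 1 with $\gamma_n = 1/L$ and verifies two facts used above: $F(\theta_n)$ decreases along the iterations, and the limit is a fixed point of the proximal-gradient map $T_\gamma$ of (3) for other admissible values of $\gamma$ as well, as stated in (4).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)
lam = 0.5
L = np.linalg.norm(A.T @ A, 2)           # Lipschitz constant of grad f

def grad_f(theta):
    return A.T @ (A @ theta - b)

def prox_l1(theta, t):
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

def T(theta, gamma):                      # proximal-gradient map, cf. (3)
    return prox_l1(theta - gamma * grad_f(theta), gamma * lam)

def F(theta):
    return 0.5 * np.sum((A @ theta - b) ** 2) + lam * np.sum(np.abs(theta))

theta = np.zeros(8)
values = [F(theta)]
for _ in range(2000):
    theta = T(theta, 1.0 / L)             # Algorithm 1 with gamma_n = 1/L
    values.append(F(theta))

# the limit is a fixed point of T_gamma for any gamma in (0, 2/L), cf. (4)
residual = np.linalg.norm(theta - T(theta, 0.7 / L))
print(residual)
```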
However, in all fairness, we should mention that, unlike the present work, Combettes and Wajs (2005) and Combettes and Pesquet (2015a) deal with infinite-dimensional problems, which raises additional technical difficulties, and study algorithms that include a relaxation parameter. Furthermore, in the case where $\eta_n \equiv 0$, larger values of the step size $\gamma_n$ are allowed ($\gamma_n \in (0, 2/L]$).

Let $\{a_0, \ldots, a_n\}$ be non-negative real numbers. Theorem 3 provides a control of the weighted sum $\sum_{k=1}^n a_k (F(\theta_k) - \min F)$.

**Theorem 3** Assume H1 and H2. Let $\{\theta_n, n \in \mathbb{N}\}$ be given by Algorithm 2 with $\gamma_n \in (0, 1/L]$ for any $n \ge 1$. For any non-negative weights $\{a_0, \ldots, a_n\}$, any $\theta^\star \in \mathcal{L}$ and any $n \ge 1$,

$$\sum_{k=1}^n a_k \{F(\theta_k) - \min F\} \le U_n(\theta^\star)\,,$$

where $T_\gamma$ and $\eta_n$ are given by (3) and (6) respectively and

$$U_n(\theta^\star) \overset{\text{def}}{=} \frac{1}{2} \sum_{k=1}^n \left( \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \right) \|\theta_{k-1} - \theta^\star\|^2 + \frac{a_0}{2\gamma_0} \|\theta_0 - \theta^\star\|^2 - \sum_{k=1}^n a_k \left\langle T_{\gamma_k}(\theta_{k-1}) - \theta^\star, \eta_k \right\rangle + \sum_{k=1}^n a_k \gamma_k \|\eta_k\|^2\,. \tag{8}$$

*Proof* See Section 6.2.3.

When applied with $\eta_n \equiv 0$, Theorem 3 gives an explicit bound on the difference $\Delta_n = A_n^{-1} \sum_{j=1}^n a_j F(\theta_j) - \min F$, where $A_n = \sum_{k=1}^n a_k$, for the (exact) proximal gradient sequence $\{\theta_n, n \in \mathbb{N}\}$ given by Algorithm 1. When the sequence $\{a_n/\gamma_n, n \ge 1\}$ is non-decreasing, (8) shows that $\Delta_n = O(a_n A_n^{-1} \gamma_n^{-1})$. Taking $a_k = 1$ for any $k \ge 0$ provides a bound on the cumulative regret. When $a_k = 1$ and $\gamma_k = 1/L$ for any $k \ge 0$, (Schmidt et al., 2011, Proposition 1) provides a bound of order $O(1)$ under the assumption that $\sum_n \|\eta_{n+1}\| < \infty$. Using the inequality $|\langle T_{1/L}(\theta_k) - \theta^\star, \eta_{k+1} \rangle| \le \|\theta_k - \theta^\star\|\, \|\eta_{k+1}\|$ (see Lemma 9), the upper bound $U_n(\theta^\star)$ in (8) is also $O(1)$. When $a_n = \gamma_n$ for any $n \ge 0$, then $\sup_n U_n(\theta^\star) < \infty$ under the assumption that the series

$$\sum_n \gamma_n \left\langle T_{\gamma_n}(\theta_{n-1}) - \theta^\star, \eta_n \right\rangle\,, \qquad \sum_n \gamma_n^2 \|\eta_n\|^2$$

converge. In this case, we have

$$\frac{\sum_{k=1}^n \gamma_k F(\theta_k)}{\sum_{k=1}^n \gamma_k} - \min F = O\left( \Big( \sum_{k=1}^n \gamma_k \Big)^{-1} \right)\,.$$

Consider the weighted averaged sequence $\{\bar\theta_n, n \in \mathbb{N}\}$ defined by

$$\bar\theta_n \overset{\text{def}}{=} A_n^{-1} \sum_{k=1}^n a_k \theta_k\,. \tag{9}$$

Under H1 and H2, $F$ is convex, so that $F(\bar\theta_n) \le A_n^{-1} \sum_{k=1}^n a_k F(\theta_k)$. Therefore, Theorem 3 also provides convergence rates for $F(\bar\theta_n) - \min F$.

## 3. Stochastic proximal gradient algorithm

In this section, it is assumed that $H_{n+1}$ is a Monte Carlo approximation of $\nabla f(\theta_n)$, where $\nabla f(\theta)$ satisfies the following assumption:

**H3** For all $\theta \in \Theta$,

$$\nabla f(\theta) = \int_{\mathsf{X}} H_\theta(x)\, \pi_\theta(\mathrm{d}x)\,, \tag{10}$$

for some probability measure $\pi_\theta$ on a measurable space $(\mathsf{X}, \mathcal{X})$ and an integrable function $(\theta, x) \mapsto H_\theta(x)$ from $\Theta \times \mathsf{X}$ to $\mathbb{R}^d$.

Note that $\mathsf{X}$ is not necessarily a topological space, even if, in many applications, $\mathsf{X} \subseteq \mathbb{R}^d$. Assumption H3 holds in many problems (see Section 4 and Section 5). To approximate $\nabla f(\theta)$, several options are available. Of course, when the dimension of the state space $\mathsf{X}$ is small to moderate, it is always possible to perform numerical integration using either Gaussian quadratures or low-discrepancy sequences. Another possibility is to approximate these integrals analytically: nested Laplace approximations have been considered recently, for example in Schelldorfer et al. (2014), and further developed in Ogden (2015). Such approximations necessarily introduce some bias, which might be difficult to control. In addition, these techniques are not applicable when the dimension of the state space $\mathsf{X}$ becomes large. In this paper, we rather consider some form of Monte Carlo approximation.

When sampling from $\pi_\theta$ is doable, an obvious choice is to use a naive Monte Carlo estimator, which amounts to sampling a batch $\{X_{n+1}^{(j)}, 1 \le j \le m_{n+1}\}$ independently of the past values of the parameters $\{\theta_j, j \le n\}$ and of the past draws, i.e. independently of the $\sigma$-algebra

$$\mathcal{F}_n \overset{\text{def}}{=} \sigma\left( \theta_0,\ X_k^{(j)},\ 0 \le k \le n,\ 0 \le j \le m_k \right)\,. \tag{11}$$

We then form

$$H_{n+1} = m_{n+1}^{-1} \sum_{j=1}^{m_{n+1}} H_{\theta_n}(X_{n+1}^{(j)})\,.$$

Conditionally on $\mathcal{F}_n$, $H_{n+1}$ is an unbiased estimator of $\nabla f(\theta_n)$. The batch size $m_{n+1}$ can either be fixed across iterations or increase with $n$ at a certain rate. In the first case, the approximation error of $H_{n+1}$ does not vanish; in the second case, it does.
The fixed batch-size case is closely related to Robbins-Monro stochastic approximation (the error is mitigated by letting the step size $\gamma_n \to 0$); the increasing batch-size case is related to Monte Carlo assisted optimization; see for example Geyer (1994). The situation that we face in Section 4 and Section 5 is more complicated, because direct sampling from $\pi_\theta$ is not an option. Nevertheless, it is fairly easy to construct a Markov kernel $P_\theta$ with invariant distribution $\pi_\theta$: Markov chain Monte Carlo (MCMC) provides a set of principled tools to sample from complex distributions over large-dimensional spaces. In this case, conditionally on the past, $\{X_{n+1}^{(j)}, 1 \le j \le m_{n+1}\}$ is a realization of a Markov chain with transition kernel $P_{\theta_n}$ started from $X_n^{(m_n)}$ (the last sample drawn in the previous minibatch).

Recall that a Markov kernel $P$ is a map on $\mathsf{X} \times \mathcal{X}$ taking values in $[0, 1]$ such that, for any $x \in \mathsf{X}$, $P(x, \cdot)$ is a probability measure on $\mathcal{X}$, and for any $A \in \mathcal{X}$, $x \mapsto P(x, A)$ is measurable. Furthermore, if $P$ is a Markov kernel on $\mathsf{X}$, we denote by $P^k$ the $k$-th iterate of $P$, defined recursively as $P^0(x, A) \overset{\text{def}}{=} \mathbb{1}_A(x)$ and $P^k(x, A) \overset{\text{def}}{=} \int P^{k-1}(x, \mathrm{d}z) P(z, A)$ for $k \ge 1$. The kernel $P$ acts on probability measures: for any probability measure $\mu$ on $\mathcal{X}$, $\mu P$ is the probability measure defined by

$$\mu P(A) \overset{\text{def}}{=} \int \mu(\mathrm{d}x)\, P(x, A)\,, \qquad A \in \mathcal{X}\,;$$

and $P$ acts on positive measurable functions: for a measurable function $f : \mathsf{X} \to \mathbb{R}_+$, $Pf$ is the function defined by

$$Pf(x) \overset{\text{def}}{=} \int f(y)\, P(x, \mathrm{d}y)\,.$$

We refer the reader to Meyn and Tweedie (2009) for the definitions and basic properties of Markov chains.

In this Markovian setting, it is possible to consider both the fixed batch case and the increasing batch case. From a mathematical standpoint, the fixed batch case is trickier, because $H_{n+1}$ is no longer an unbiased estimator of $\nabla f(\theta_n)$: the bias $B_n$, defined by

$$B_n \overset{\text{def}}{=} \mathbb{E}\left[ H_{n+1} \mid \mathcal{F}_n \right] - \nabla f(\theta_n) = m_{n+1}^{-1} \sum_{j=1}^{m_{n+1}} \mathbb{E}\left[ H_{\theta_n}(X_{n+1}^{(j)}) \,\middle|\, \mathcal{F}_n \right] - \nabla f(\theta_n) = m_{n+1}^{-1} \sum_{j=1}^{m_{n+1}} P_{\theta_n}^j H_{\theta_n}(X_n^{(m_n)}) - \nabla f(\theta_n)\,, \tag{12}$$

does not vanish.
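On a finite state space these objects are just matrices and vectors, which makes the bias (12) easy to visualize. The sketch below (an arbitrary two-state kernel, unrelated to any model in the paper) treats $P$ as a stochastic matrix, so that $\mu P$ is a row-vector product and $Pf$ a matrix-vector product, and evaluates the bias of the Monte Carlo average $m^{-1} \sum_{j=1}^m P^j H(x)$ relative to $\pi(H)$; geometric ergodicity as in (13) makes this bias of order $O(1/m)$.

```python
import numpy as np

# a Markov kernel on a two-point state space, written as a stochastic matrix
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
H = np.array([1.0, -1.0])      # a function H on the state space

# stationary distribution pi: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()
pi_H = pi @ H                   # stationary expectation pi(H)

def bias(x, m):
    # m^{-1} sum_{j=1}^m (P^j H)(x) - pi(H), cf. (12) for a frozen kernel
    Pj, total = np.eye(len(H)), 0.0
    for _ in range(m):
        Pj = Pj @ P
        total += (Pj @ H)[x]
    return total / m - pi_H

# the bias shrinks as the batch size grows, at rate O(1/m)
for m in (1, 10, 100):
    print(m, bias(0, m))
```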
When $m_n = m$ is small, the bias can be quite large, and the way the bias is mitigated in the algorithm requires substantial mathematical developments, which are not covered by the results currently available in the literature (see e.g. Combettes and Pesquet (2015a); Rosasco et al. (2014); Combettes and Pesquet (2015b); Rosasco et al. (2015); Lin et al. (2015)). To capture these two different situations in a common unifying framework, we assume:

**H4** $H_{n+1}$ is a Monte Carlo approximation of the expectation $\nabla f(\theta_n)$:

$$H_{n+1} = m_{n+1}^{-1} \sum_{j=1}^{m_{n+1}} H_{\theta_n}(X_{n+1}^{(j)})\,;$$

for all $n \ge 0$, conditionally on the past, $\{X_{n+1}^{(j)}, 1 \le j \le m_{n+1}\}$ is a Markov chain started from $X_n^{(m_n)}$ and with transition kernel $P_{\theta_n}$ (we set $X_0^{(m_0)} = x_\star \in \mathsf{X}$). For all $\theta \in \Theta$, $P_\theta$ is a Markov kernel with invariant distribution $\pi_\theta$.

For a measurable function $V : \mathsf{X} \to [1, \infty)$, a signed measure $\mu$ on the $\sigma$-field $\mathcal{X}$ of $\mathsf{X}$, and a function $f : \mathsf{X} \to \mathbb{R}$, define

$$|f|_V \overset{\text{def}}{=} \sup_{x \in \mathsf{X}} \frac{|f(x)|}{V(x)}\,, \qquad \|\mu\|_V \overset{\text{def}}{=} \sup_{f,\, |f|_V \le 1} \left| \int f\, \mathrm{d}\mu \right|\,.$$

**H5** There exist $\lambda \in (0, 1)$, $b < \infty$, $p \ge 2$ and a measurable function $W : \mathsf{X} \to [1, +\infty)$ such that

$$\sup_{\theta \in \Theta} |H_\theta|_W < \infty\,, \qquad \sup_{\theta \in \Theta} P_\theta W^p \le \lambda W^p + b\,.$$

In addition, for any $\ell \in (0, p]$, there exist $C < \infty$ and $\rho \in (0, 1)$ such that, for any $x \in \mathsf{X}$,

$$\sup_{\theta \in \Theta} \left\| P_\theta^n(x, \cdot) - \pi_\theta \right\|_{W^\ell} \le C \rho^n\, W^\ell(x)\,. \tag{13}$$

Sufficient conditions for the uniform-in-$\theta$ ergodic behavior (13) are given e.g. in (Fort et al., 2011, Lemma 2.3), in terms of aperiodicity, irreducibility and minorization conditions on the kernels $\{P_\theta, \theta \in \Theta\}$. Examples of MCMC kernels $P_\theta$ satisfying this assumption can be found in (Andrieu and Moulines, 2006, Proposition 12), (Saksman and Vihola, 2010, Proposition 15), (Fort et al., 2011, Proposition 3.1), (Schreck et al., 2013, Proposition 3.2), (Allassonnière and Kuhn, 2015, Proposition 1), and (Fort et al., 2015, Proposition 3.1).

The proof of the results below consists in verifying the conditions of Theorem 2 with the error term $\eta_{n+1} = m_{n+1}^{-1} \sum_{j=1}^{m_{n+1}} H_{\theta_n}(X_{n+1}^{(j)}) - \nabla f(\theta_n)$.
If the approximation is unbiased in the sense that $\mathbb{E}[\eta_{n+1} \mid \mathcal{F}_n] = 0$, then $\{\eta_n, n \in \mathbb{N}\}$ is a martingale increment sequence. In all other cases, we decompose $\eta_{n+1}$ as the sum of a martingale increment term and a remainder term. When the batch size $\{m_n, n \in \mathbb{N}\}$ is increasing, the martingale increment can be set to $\eta_{n+1} - \mathbb{E}[\eta_{n+1} \mid \mathcal{F}_n]$, and the remainder term $\mathbb{E}[\eta_{n+1} \mid \mathcal{F}_n]$ will be shown to be vanishingly small. When the batch size $\{m_n, n \in \mathbb{N}\}$ is constant, $\mathbb{E}[\eta_{n+1} \mid \mathcal{F}_n]$ does not vanish; a more subtle definition of the martingale increment is then needed, introducing the Poisson equation for Markov chains (see Proposition 19 in Section 6).

### 3.1 Monte Carlo approximation with fixed batch size

We first study the case where $m_n = m$ for any $n \in \mathbb{N}$. Theorem 4 provides sufficient conditions for the convergence towards the limiting set $\mathcal{L}$, and a bound on $\sum_{k=1}^n a_k \{F(\theta_k) - \min F\}$. Consider the following assumption:

**H6** (i) There exists a constant $C$ such that, for any $\theta, \theta' \in \Theta$,

$$|H_\theta - H_{\theta'}|_W + \sup_x \frac{\|P_\theta(x, \cdot) - P_{\theta'}(x, \cdot)\|_W}{W(x)} + \|\pi_\theta - \pi_{\theta'}\|_W \le C\, \|\theta - \theta'\|\,.$$

(ii) $\sup_{\gamma \in (0, 1/L]} \sup_{\theta \in \Theta} \gamma^{-1} \|\operatorname{Prox}_{\gamma,g}(\theta) - \theta\| < \infty$.

(iii) $\sum_n |\gamma_{n+1} - \gamma_n| < \infty$.

Assumption H6-(i) requires a Lipschitz regularity in the parameter $\theta$ of the Markov kernel $P_\theta$, which, for MCMC algorithms, is inherited under mild additional conditions from the Lipschitz regularity in $W$-norm of the target distribution. Such conditions have been worked out for general families of MCMC kernels, including Metropolis-Hastings dynamics, Gibbs samplers, and hybrid MCMC algorithms; see for example Proposition 12 in Andrieu and Moulines (2006), the proof of Theorem 3.4 in Fort et al. (2011), Lemmas 4.6 and 4.7 in Fort et al. (2015), and the references therein. It is a classical assumption when studying stochastic approximation with conditionally Markovian dynamics (see e.g. Benveniste et al. (1990), Andrieu et al. (2005), Fort et al. (2014)).
We prove in Proposition 11 that when $g$ is proper, convex and Lipschitz on $\Theta$, then H6-(ii) is satisfied. In particular, if $\Theta$ is a closed convex set, H6-(ii) is satisfied by the Lasso or fused Lasso penalty. If $\Theta$ is a compact convex set, then H6-(ii) is satisfied by the elastic-net penalty. For a random variable $Y$, denote $\|Y\|_{L^q} = (\mathbb{E}[|Y|^q])^{1/q}$.

**Theorem 4** Assume $\Theta$ is bounded. Let $\{\theta_n, n \ge 0\}$ be given by Algorithm 2 with $\gamma_n \in (0, 1/L]$ for any $n \ge 0$. Assume H1-H5, $m_n = m \ge 1$ and, if the Monte Carlo approximation is biased, assume also H6.

(i) Assume that $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_n^2 < \infty$. With probability one, there exists $\theta^\star \in \mathcal{L}$ such that $\lim_n \theta_n = \theta^\star$.

(ii) For any $q \in (1, p/2]$ there exists a constant $C$ such that, for any non-negative numbers $\{a_0, \ldots, a_n\}$,

$$\left\| \sum_{k=1}^n a_k \{F(\theta_k) - \min F\} \right\|_{L^q} \le C \left( \frac{a_n}{\gamma_n} + \Big( \sum_{k=1}^n a_k^2 \Big)^{1/2} + \sum_{k=1}^n a_k \gamma_k + \upsilon \sum_{k=1}^n |a_k - a_{k-1}| \right)\,,$$

$$\sum_{k=1}^n a_k \left\{ \mathbb{E}[F(\theta_k)] - \min F \right\} \le C \left( \frac{a_n}{\gamma_n} + \sum_{k=1}^n a_k \gamma_k + \upsilon \sum_{k=1}^n |a_k - a_{k-1}| \right)\,,$$

where $\upsilon = 0$ if the Monte Carlo approximation is unbiased and $\upsilon = 1$ otherwise.

*Proof* The proof is postponed to Section 6.3.

When $a_n = 1$ and $\gamma_n = (n+1)^{-1/2}$, Theorem 4 provides a rate for $n^{-1} \sum_{k=1}^n F(\theta_k) - \min F$ as $n \to \infty$; an upper bound $O(\ln n / \sqrt{n})$ can be obtained from Theorem 4 by choosing $a_n = \gamma_n = (n+1)^{-1/2}$.

### 3.2 Monte Carlo approximation with increasing batch size

The key property used to discuss the asymptotic behavior of the algorithm is the following result.

**Proposition 5** Assume H3, H4 and H5. There exists a constant $C$ such that, with probability one, for any $n \ge 0$,

$$\left\| \mathbb{E}\left[ \eta_{n+1} \mid \mathcal{F}_n \right] \right\| \le C\, m_{n+1}^{-1}\, W(X_n^{(m_n)})\,, \qquad \mathbb{E}\left[ \|\eta_{n+1}\|^p \,\middle|\, \mathcal{F}_n \right] \le C\, m_{n+1}^{-p/2}\, W^p(X_n^{(m_n)})\,.$$

*Proof* The first inequality follows from (12) and (13). The second one is established in (Fort and Moulines, 2003, Proposition 12).

**Theorem 6** Assume $\Theta$ is bounded. Let $\{\theta_n, n \ge 0\}$ be given by Algorithm 2 with $\gamma_n \in (0, 1/L]$ for any $n \ge 0$. Assume H1-H5.

(i) Assume $\sum_n \gamma_n = +\infty$, $\sum_n \gamma_{n+1}^2 m_{n+1}^{-1} < \infty$ and, if the approximation is biased, $\sum_n \gamma_{n+1} m_{n+1}^{-1} < \infty$. With probability one, there exists $\theta^\star \in \mathcal{L}$ such that $\lim_n \theta_n = \theta^\star$.
(ii) For any $q \in (1, p/2]$, there exists a constant $C$ such that, for any non-negative numbers $\{a_0, \ldots, a_n\}$,

$$\left\| \sum_{k=1}^n a_k \{F(\theta_k) - \min F\} \right\|_{L^q} \le C \left( \frac{a_n}{\gamma_n} + \Big( \sum_{k=1}^n a_k^2 m_k^{-1} \Big)^{1/2} + \sum_{k=1}^n a_k \gamma_k m_k^{-1} + \upsilon \sum_{k=1}^n a_k m_k^{-1} \right)\,,$$

$$\sum_{k=1}^n a_k \left\{ \mathbb{E}[F(\theta_k)] - \min F \right\} \le C \left( \frac{a_n}{\gamma_n} + \sum_{k=1}^n a_k \gamma_k m_k^{-1} + \upsilon \sum_{k=1}^n a_k m_k^{-1} \right)\,,$$

where $\upsilon = 0$ if the Monte Carlo approximation is unbiased and $\upsilon = 1$ otherwise.

*Proof* See Section 6.4.

Theorem 6 shows that, by choosing a fixed step size $\gamma_n = \gamma$, a linearly increasing batch size $m_n \propto n$ and uniform weights $a_n = 1$,

$$\left\| n^{-1} \sum_{k=1}^n F(\theta_k) - \min F \right\|_{L^q} = O\left( \frac{\ln n}{n} \right) \qquad \text{as } n \to \infty\,.$$

Note that this is the rate after $n$ iterations of the stochastic proximal gradient algorithm, but $\sum_{k=1}^n m_k = O(n^2)$ Monte Carlo samples. Therefore, expressed in terms of the total number of simulations $N$, the rate of convergence is $O(\ln N / \sqrt{N})$.

## 4. Application to network structure estimation

To illustrate the algorithm, we consider the problem of fitting discrete graphical models in a setting where the number of nodes in the graph is large compared to the sample size. Let $\mathsf{X}$ be a non-empty finite set and $p \ge 1$ an integer. We consider a graphical model on $\mathsf{X}^p$ with joint probability mass function

$$f_\theta(x_1, \ldots, x_p) = \frac{1}{Z_\theta} \exp\left( \sum_{k=1}^p \theta_{kk} B_0(x_k) + \sum_{1 \le j < k \le p} \theta_{jk} B(x_j, x_k) \right)\,,$$

where $\theta$ is a symmetric $p \times p$ matrix of parameters and $Z_\theta$ is the normalizing constant. The matrix $\theta$ is estimated by penalized maximum likelihood; the gradient of the log-likelihood involves an expectation under $f_\theta$, which we approximate by MCMC (here, a Gibbs sampler). Since $f_\theta(x) > 0$ for all $(x, \theta) \in \mathsf{X}^p \times \Theta$ and $\theta \mapsto f_\theta(x)$ is continuously differentiable, the assumptions H4, H5 and H6(i)-(ii) automatically hold with $W \equiv 1$. We should point out that the Gibbs sampler is a generic algorithm that in some cases is known to mix poorly; whenever possible, we recommend the use of specialized problem-specific MCMC algorithms with better mixing properties.

**Illustrative example.** We consider the particular case where $\mathsf{X} = \{1, \ldots, M\}$, $B_0(x) = 0$ and $B(x, y) = \mathbb{1}\{x = y\}$, which corresponds to the well-known Potts model. We report in this section some simulation results showing the performance of the stochastic proximal gradient algorithm. We use $M = 20$, $N = 250$ and $p \in \{50, 100, 200\}$.
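A single-site Gibbs sampler for this model resamples each $x_k$ from its conditional distribution, which only involves the interactions $\theta_{jk}$ with the other sites. The sketch below (a deliberately tiny instance with $M = 2$ states and $p = 3$ nodes and an arbitrary $\theta$, so that the exact distribution can be enumerated; it is not the simulation code of this section) checks the Gibbs estimate of the probability that the first two sites agree against exact enumeration, taking $B_0 \equiv 0$ as in the Potts example.

```python
import itertools
import numpy as np

M, p = 2, 3                               # states {0,...,M-1}, p nodes
theta = np.array([[0.0, 0.8, -0.5],
                  [0.8, 0.0, 0.3],
                  [-0.5, 0.3, 0.0]])      # symmetric interactions theta_jk (B_0 = 0)

def logpot(x):
    # unnormalized log-probability: sum_{j<k} theta_jk 1{x_j = x_k}
    return sum(theta[j, k] * (x[j] == x[k])
               for j in range(p) for k in range(j + 1, p))

# exact P(x_0 = x_1) by enumeration of the M^p configurations
configs = list(itertools.product(range(M), repeat=p))
w = np.array([np.exp(logpot(x)) for x in configs])
w /= w.sum()
exact = sum(wi for wi, x in zip(w, configs) if x[0] == x[1])

def gibbs_sweep(x, rng):
    # one systematic-scan Gibbs sweep: resample each site from its conditional
    for k in range(p):
        logp = np.array([sum(theta[j, k] * (x[j] == v) for j in range(p) if j != k)
                         for v in range(M)])
        prob = np.exp(logp - logp.max())
        x[k] = rng.choice(M, p=prob / prob.sum())

rng = np.random.default_rng(0)
x = np.zeros(p, dtype=int)
for _ in range(500):                      # burn-in
    gibbs_sweep(x, rng)
est, n = 0.0, 20000
for _ in range(n):
    gibbs_sweep(x, rng)
    est += (x[0] == x[1]) / n
print(est, exact)
```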
We generate the true matrix $\theta_{\mathrm{true}}$ such that it has, on average, $p$ non-zero elements below the diagonal, which are simulated from a uniform distribution on $(-4, -1) \cup (1, 4)$. All the diagonal elements are set to 0. By trial and error, we set the regularization parameter to $\lambda = 2.5\sqrt{\log(p)/n}$ for all the simulations. We implement Algorithm 2, drawing samples from a Gibbs sampler to approximate the gradient. We compare the following two versions of Algorithm 2:

1. **Solver 1:** a version with a fixed Monte Carlo batch size $m_n = 500$ and a decreasing step size $\gamma_n = 25/\sqrt{n}$;
2. **Solver 2:** a version with an increasing Monte Carlo batch size $m_n = 500 + n^{1.2}$ and a fixed step size $\gamma_n = 25$.

We run Solver 2 for $N_{\mathrm{iter}} = 5p$ iterations, where $p \in \{50, 100, 200\}$ is as above, and we set the number of iterations of Solver 1 so that both solvers draw approximately the same number of Monte Carlo samples. For stability of the results, we repeat each solver 30 times and average the sample paths.

We evaluate the convergence of each solver by computing the relative error $\|\theta_n - \theta_\star\| / \|\theta_\star\|$ along the iterations, where $\theta_\star$ denotes the value returned by the solver at its last iteration. Note that we compare the optimizer output to $\theta_\star$, not $\theta_{\mathrm{true}}$. Ideally, we would like to compare the iterates to the solution of the optimization problem; however, in the present setting a solution is not available in closed form (and there could be more than one solution). Furthermore, whether the solution of the optimization problem approaches $\theta_{\mathrm{true}}$ is a complicated statistical question² that is beyond the scope of this work. The relative errors are presented in Figure 1 and suggest that, when measured as a function of the resources used, Solver 1 and Solver 2 have roughly the same convergence rate. We also compute the statistic

$$F_n \overset{\text{def}}{=} \frac{2\, \mathrm{Sen}_n\, \mathrm{Prec}_n}{\mathrm{Sen}_n + \mathrm{Prec}_n}\,,$$

which measures the recovery of the sparsity structure of $\theta_\star$ along the iterations.
In this definition, $\mathrm{Sen}_n$ is the sensitivity and $\mathrm{Prec}_n$ is the precision, defined as

$$\mathrm{Sen}_n = \frac{\sum_{i<j} \mathbb{1}\{|\theta_{n,ij}| > 0\}\, \mathbb{1}\{|\theta_{\star,ij}| > 0\}}{\sum_{i<j} \mathbb{1}\{|\theta_{\star,ij}| > 0\}}\,, \qquad \mathrm{Prec}_n = \frac{\sum_{i<j} \mathbb{1}\{|\theta_{n,ij}| > 0\}\, \mathbb{1}\{|\theta_{\star,ij}| > 0\}}{\sum_{i<j} \mathbb{1}\{|\theta_{n,ij}| > 0\}}\,.$$

² This depends heavily on $n$, $p$ and the actual true matrix $\theta_{\mathrm{true}}$, and also heavily on the choice of the regularization parameter $\lambda$.

Figure 1: Relative errors plotted as a function of computing time for Solver 1 and Solver 2.

Figure 2: Statistic $F_n$ plotted as a function of computing time for Solver 1 and Solver 2.

The values of $F_n$ are presented in Figure 2 as a function of computing time. The figure shows that, for both solvers, the sparsity structure of $\theta_n$ converges very quickly towards that of $\theta_\star$. We also note that Figure 2 seems to suggest that Solver 2 tends to produce solutions with a slightly more stable sparsity structure than Solver 1 (less variance on the red curves). Whether such subtle differences truly exist between the two algorithms (a diminishing step size with fixed Monte Carlo batch size versus a fixed step size with increasing Monte Carlo batch size) is an interesting question. Our analysis does not deal with the sparsity structure of the solutions and hence cannot offer any explanation.

## 5. A non-convex example: high-dimensional logistic regression with random effects

We numerically investigate the extension of our results to a situation where the assumptions H2 and H3 hold but H1 is not in general satisfied, and the domain $\Theta$ is not bounded. The numerical study below shows that the conclusions reached in Section 2 and Section 3 provide useful information to tune the design parameters of the algorithms.

### 5.1 The model

We model binary responses $\{Y_i\}_{i=1}^N \in \{0, 1\}$ as $N$ conditionally independent realizations of a random effect logistic regression model,

$$Y_i \mid U \overset{\text{ind.}}{\sim} \operatorname{Ber}\left( s(x_i^\top \beta + \sigma z_i^\top U) \right)\,, \qquad 1 \le i \le N\,, \tag{17}$$

where $x_i \in \mathbb{R}^p$ is the vector of covariates, $z_i \in \mathbb{R}^q$ is a (known) loading vector, $\operatorname{Ber}(\alpha)$ denotes the Bernoulli distribution with parameter $\alpha \in [0, 1]$, and $s(x) = e^x/(1 + e^x)$ is the cumulative distribution function of the standard logistic distribution. The random effect $U$ is assumed to be standard Gaussian, $U \sim \mathcal{N}_q(0, I)$. The log-likelihood of the observations at $\theta = (\beta, \sigma) \in \mathbb{R}^p \times (0, \infty)$ is given by

$$\ell(\theta) = \log \int \prod_{i=1}^N s(x_i^\top \beta + \sigma z_i^\top u)^{Y_i} \left( 1 - s(x_i^\top \beta + \sigma z_i^\top u) \right)^{1 - Y_i} \phi(u)\, \mathrm{d}u\,, \tag{18}$$

where $\phi$ is the density of an $\mathbb{R}^q$-valued standard Gaussian random vector. The number of covariates $p$ is possibly larger than $N$, but only a very small number of these covariates are relevant, which suggests using the elastic-net penalty

$$\lambda \left( \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right)\,, \tag{19}$$

where $\lambda > 0$ is the regularization parameter, $\|\beta\|_r = (\sum_{i=1}^p |\beta_i|^r)^{1/r}$ and $\alpha \in [0, 1]$ controls the trade-off between the $\ell_1$ and the $\ell_2$ penalties. In this example,

$$g(\theta) = \lambda \left( \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right) + \mathbb{1}_{(0, +\infty)}(\sigma)\,, \tag{20}$$

where $\mathbb{1}_A(x) = +\infty$ if $x \notin A$ and $0$ otherwise. Define the conditional log-likelihood of $Y = (Y_1, \ldots, Y_N)$ given $U$ (the dependence upon $Y$ is omitted) by

$$\ell_c(\theta \mid u) = \sum_{i=1}^N \left\{ Y_i \left( x_i^\top \beta + \sigma z_i^\top u \right) - \ln\left( 1 + \exp\left( x_i^\top \beta + \sigma z_i^\top u \right) \right) \right\}\,,$$

and the conditional distribution of the random effect $U$ given the observations $Y$ and the parameter $\theta$ by

$$\pi_\theta(u) = \exp\left( \ell_c(\theta \mid u) - \ell(\theta) \right) \phi(u)\,. \tag{21}$$
The Fisher identity implies that the gradient of the log-likelihood (18) is given by

$$\nabla \ell(\theta) = \int \nabla_\theta \ell_c(\theta \mid u)\, \pi_\theta(u)\, \mathrm{d}u = \int \sum_{i=1}^N \left( Y_i - s(x_i^\top \beta + \sigma z_i^\top u) \right) \begin{pmatrix} x_i \\ z_i^\top u \end{pmatrix} \pi_\theta(u)\, \mathrm{d}u\,. \tag{22}$$

The Hessian of the log-likelihood $\ell$ is given by (see e.g. Mc Lachlan and Krishnan, 2008, Chapter 3)

$$\nabla^2 \ell(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla^2_\theta \ell_c(\theta \mid U) \right] + \operatorname{Cov}_{\pi_\theta}\left( \nabla_\theta \ell_c(\theta \mid U) \right)\,,$$

where $\mathbb{E}_{\pi_\theta}$ and $\operatorname{Cov}_{\pi_\theta}$ denote the expectation and the covariance with respect to the distribution $\pi_\theta$, respectively. Since

$$\nabla^2_\theta \ell_c(\theta \mid u) = -\sum_{i=1}^N s(x_i^\top \beta + \sigma z_i^\top u) \left( 1 - s(x_i^\top \beta + \sigma z_i^\top u) \right) \begin{pmatrix} x_i \\ z_i^\top u \end{pmatrix} \begin{pmatrix} x_i \\ z_i^\top u \end{pmatrix}^\top$$

and $\sup_{\theta \in \Theta} \int \|u\|^2 \pi_\theta(u)\, \mathrm{d}u < \infty$ (see Section A), $\nabla^2 \ell(\theta)$ is bounded on $\Theta$. Hence $\nabla \ell$ satisfies the Lipschitz condition of H1 (the convexity part of H1, however, does not hold in general).

### 5.2 Numerical application

Assumption H3 is satisfied with $\pi_\theta$ given by (21) and

$$H_\theta(u) = \sum_{i=1}^N \left( Y_i - s(x_i^\top \beta + \sigma z_i^\top u) \right) \begin{pmatrix} x_i \\ z_i^\top u \end{pmatrix}\,.$$

The distribution $\pi_\theta$ is sampled using the MCMC sampler proposed in Polson et al. (2013), based on data augmentation. We write $\nabla \ell(\theta) = \int_{\mathbb{R}^q \times \mathbb{R}^N} H_\theta(u)\, \pi_\theta(u, w)\, \mathrm{d}u\, \mathrm{d}w$, where $\pi_\theta(u, w)$ is defined for $u \in \mathbb{R}^q$ and $w = (w_1, \ldots, w_N) \in \mathbb{R}^N$ by means of the Polya-Gamma distribution. In this expression, $\pi_{\mathrm{PG}}(\cdot\,; c)$ is the density of the Polya-Gamma distribution on the positive real line with parameter $c$, given by

$$\pi_{\mathrm{PG}}(w; c) = \cosh(c/2) \exp\left( -w c^2/2 \right) \rho(w)\, \mathbb{1}_{\mathbb{R}_+}(w)\,,$$

where $\rho(w) \propto \sum_{k \ge 0} (-1)^k (2k+1) \exp\left( -(2k+1)^2/(8w) \right) w^{-3/2}$ (see Biane et al., 2001, Section 3.1). Thus, we have

$$\pi_\theta(u, w) = C_\theta\, \phi(u) \prod_{i=1}^N \exp\left( \sigma (Y_i - 1/2) z_i^\top u - w_i \left( x_i^\top \beta + \sigma z_i^\top u \right)^2 / 2 \right) \rho(w_i)\, \mathbb{1}_{\mathbb{R}_+}(w_i)\,,$$

where $\ln C_\theta = -N \ln 2 - \ell(\theta) + \sum_{i=1}^N (Y_i - 1/2)\, x_i^\top \beta$. This target distribution can be sampled using a Gibbs algorithm: given the current value $(u_t, w_t)$ of the chain, the next point is obtained by sampling $u_{t+1}$ from the conditional distribution of $u$ given $w_t$, and $w_{t+1}$ from the conditional distribution of $w$ given $u_{t+1}$. In the present case, these conditional distributions are given respectively by

$$\pi_\theta(u \mid w) = \mathcal{N}_q\left( \mu_\theta(w); \Gamma_\theta(w) \right)\,, \qquad \pi_\theta(w \mid u) = \prod_{i=1}^N \pi_{\mathrm{PG}}\left( w_i; |x_i^\top \beta + \sigma z_i^\top u| \right)\,,$$

with

$$\Gamma_\theta(w)^{-1} = I + \sigma^2 \sum_{i=1}^N w_i z_i z_i^\top\,, \qquad \mu_\theta(w) = \sigma\, \Gamma_\theta(w) \sum_{i=1}^N \left( Y_i - 1/2 - w_i x_i^\top \beta \right) z_i\,. \tag{23}$$
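Polson et al. (2013, Algorithm 1) gives an exact Polya-Gamma sampler. For illustration only, the sketch below instead truncates the well-known infinite-series representation of $\operatorname{PG}(1, c)$ as a weighted sum of independent exponentials; this is an approximation controlled by the truncation level $K$ (an assumption of this sketch, not the sampler used in the paper), and it is checked against the closed-form mean $\mathbb{E}[\operatorname{PG}(1, c)] = \tanh(c/2)/(2c)$.

```python
import numpy as np

def polya_gamma_approx(c, size, K=200, rng=None):
    # Truncated series representation of PG(1, c):
    #   w = (1 / (2 pi^2)) * sum_{k>=1} g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    # with g_k i.i.d. Exp(1). Truncating at K terms yields approximate draws.
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, K + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2)
    g = rng.exponential(1.0, size=(size, K))
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

rng = np.random.default_rng(4)
c = 1.7
w = polya_gamma_approx(c, size=20000, rng=rng)
mean_exact = np.tanh(c / 2.0) / (2.0 * c)    # E[PG(1, c)] = tanh(c/2) / (2c)
print(w.mean(), mean_exact)
```

In a full Gibbs sweep for this model, such draws of $w \mid u$ would alternate with an exact Gaussian draw of $u \mid w$ from $\mathcal{N}_q(\mu_\theta(w), \Gamma_\theta(w))$ as given in (23).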
(23)

Exact samples from these conditional distributions can be obtained (see (Polson et al., 2013, Algorithm 1) for sampling from a Polya-Gamma distribution). It has been shown by Choi and Hobert (2013) that the Polya-Gamma Gibbs sampler is uniformly ergodic. Hence H5 is satisfied with W ≡ 1. Checking H6 is also straightforward.

We test the algorithms with N = 500, p = 1000 and q = 5. We generate the N × p covariate matrix X column-wise, by sampling a stationary ℝN-valued autoregressive model with parameter ρ = 0.8 and Gaussian noise √(1−ρ²) N_N(0, I). We generate the vector of regressors βtrue from the uniform distribution on [1, 5] and randomly set 98% of the coefficients to zero. The variance of the random effect is set to σ² = 0.1. We consider a repeated measurement setting so that zi = e_{⌈iq/N⌉}, where {ej, j ≤ q} is the canonical basis of ℝq and ⌈·⌉ denotes the upper integer part. With such a simple expression for the random effect, we are able to approximate the value F(θ) in order to illustrate the theoretical results obtained in this paper. We use the Lasso penalty (α = 1 in (19)) with λ = 30.

We first illustrate the ability of Monte Carlo proximal gradient algorithms to find a minimizer of F. We compare the Monte Carlo proximal gradient algorithm (i) with fixed batch size: γn = 0.01/√n and mn = 275 (Algo 1); γn = 0.5/n and mn = 275 (Algo 2); and (ii) with increasing batch size: γn = γ = 0.005 and mn = 200 + n (Algo 3); γn = γ = 0.001 and mn = 200 + n (Algo 4); and γn = 0.05/√n and mn = 270 + n (Algo 5). Each algorithm is run for 150 iterations. The batch sizes {mn, n ≥ 0} are chosen so that, after 150 iterations, each algorithm has used approximately the same number of Monte Carlo samples. We denote by β∞ the value obtained at iteration 150. A path of the relative error ∥βn − β∞∥/∥β∞∥ is displayed on Figure 3 [right] for each algorithm; a path of the sensitivity Sen_n and of the precision Prec_n (see section 4 for the definitions) is displayed on Figure 4.
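The comparison above can be mimicked on a toy problem where, as in the paper, the gradient is only available as an expectation and is replaced by a Monte Carlo average. The sketch below is ours and purely illustrative (it is not the paper's model or code): f(θ) = E[(θ − Z)²]/2 with Z ~ N(µ, 1), so ∇f(θ) = θ − µ; the penalty is the Lasso g(θ) = λ|θ|, whose proximal map is soft-thresholding (the α = 1 case of (20)); and an increasing batch size mn = 200 + n is paired with a constant step size, as in Algo 3/4.

```python
import math
import random

def soft(x, t):
    """Proximal map of t*|.| (soft-thresholding)."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def mc_prox_grad(mu, lam, n_iter=200, gamma=0.5, seed=0):
    """Perturbed proximal gradient iteration
        theta_{n+1} = prox_{gamma,g}(theta_n - gamma * H_{n+1}),
    where H_{n+1} = theta_n - mean(Z_1, ..., Z_m) is a Monte Carlo estimate
    of grad f(theta_n) = theta_n - mu, with increasing batch m_n = 200 + n."""
    rng = random.Random(seed)
    theta = 0.0
    for n in range(1, n_iter + 1):
        m = 200 + n
        grad_est = theta - sum(rng.gauss(mu, 1.0) for _ in range(m)) / m
        theta = soft(theta - gamma * grad_est, gamma * lam)
    return theta

# The exact minimizer of E[(theta - Z)^2]/2 + lam*|theta| is soft(mu, lam).
```

With µ = 2 and λ = 0.5, the iterates settle near the exact minimizer soft(2, 0.5) = 1.5, up to Monte Carlo fluctuations whose size shrinks with the batch size, in line with the fixed-step/increasing-batch regime analyzed in section 3.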
All these sequences are plotted versus the total number of Monte Carlo samples up to iteration n. These plots show that with a fixed batch size (Algo 1 or Algo 2), the best convergence is obtained with a step size decreasing as O(1/√n); for an increasing batch size (Algo 3 to Algo 5), it is better to choose a fixed step size. These findings are consistent with the results in section 3. On Figure 3 [left], we report on the bottom row the indices j such that βtrue,j is non-null and, on the rows above, the indices j such that β∞,j given by Algo 1 to Algo 5 is non-null.

We now study the convergence of {F(θn), n ∈ ℕ}, where θn is obtained by one of the algorithms described above. We repeat 50 independent runs of each algorithm and estimate E[F(θn)] by the empirical mean over these runs. On Figure 5 [left], n ↦ F(θn) is displayed for several runs of Algo 1 and Algo 3. The figure shows

Figure 3: [left] The support of the sparse vector β∞ obtained by Algo 1 to Algo 5; for comparison, the support of βtrue is on the bottom row. [right] Relative error along one path of each algorithm as a function of the total number of Monte Carlo samples.

Figure 4: The sensitivity Sen_n [left] and the precision Prec_n [right] along a path, versus the total number of Monte Carlo samples up to time n.

Figure 5: [left] n ↦ F(θn) for several independent runs.
[right] E[F(θn)] versus the total number of Monte Carlo samples up to iteration n.

that all the paths have the same limiting value, which is approximately 311; we observed the same behavior on the 50 runs of each algorithm. On Figure 5 [right], we report the Monte Carlo estimate of E[F(θn)] versus the total number of Monte Carlo samples used up to iteration n, for the best strategies in the fixed batch size case (Algo 1) and in the increasing batch size case (Algo 3 and Algo 4).

6.1 Preliminary lemmas

Lemma 7 Assume that g is lower semi-continuous and convex. For θ, θ′ ∈ Θ and γ > 0,

g( Proxγ,g(θ) ) − g(θ′) ≤ −γ⁻¹ ⟨ Proxγ,g(θ) − θ , Proxγ,g(θ) − θ′ ⟩ .  (24)

For any γ > 0 and for any θ, θ′ ∈ Θ,

∥Proxγ,g(θ) − Proxγ,g(θ′)∥² + ∥( Proxγ,g(θ) − θ ) − ( Proxγ,g(θ′) − θ′ )∥² ≤ ∥θ − θ′∥² .  (25)

Proof See (Bauschke and Combettes, 2011, Propositions 4.2, 12.26 and 12.27).

Lemma 8 Assume H1 and let γ ∈ (0, 1/L]. Then for all θ, θ′ ∈ Θ,

2γ ( F(Proxγ,g(θ)) − F(θ′) ) ≤ −∥Proxγ,g(θ) − θ′∥² + 2 ⟨ Proxγ,g(θ) − θ′ , θ + γ∇f(θ′) − θ′ ⟩ .  (26)

If in addition f is convex, then for all θ, θ′, ξ ∈ Θ,

2γ ( F(Proxγ,g(θ)) − F(θ′) ) ≤ −∥Proxγ,g(θ) − θ′∥² − 2 ⟨ Proxγ,g(θ) − θ′ , ξ − γ∇f(ξ) − θ ⟩ + ∥θ′ − ξ∥² .  (27)

Proof Since ∇f is Lipschitz, the descent lemma implies that for any γ ≤ 1/L,

f(p) − f(θ′) ≤ ⟨ ∇f(θ′) , p − θ′ ⟩ + (1/(2γ)) ∥p − θ′∥² .  (28)

This inequality applied with p = Proxγ,g(θ), combined with (24), yields (26). When f is convex, f(ξ) + ⟨∇f(ξ), θ′ − ξ⟩ − f(θ′) ≤ 0, which, combined again with (24) and with (28) applied with (p, θ′) ← (Proxγ,g(θ), ξ), yields the result.

Lemma 9 Assume H1. Then for any γ > 0 and θ, θ′ ∈ Θ,

∥θ − γ∇f(θ) − θ′ + γ∇f(θ′)∥ ≤ (1 + γL) ∥θ − θ′∥ ,  (29)

∥Tγ(θ) − Tγ(θ′)∥ ≤ (1 + γL) ∥θ − θ′∥ .  (30)

If in addition f is convex, then for any γ ∈ (0, 2/L],

∥θ − γ∇f(θ) − θ′ + γ∇f(θ′)∥ ≤ ∥θ − θ′∥ ,  (31)

∥Tγ(θ) − Tγ(θ′)∥ ≤ ∥θ − θ′∥ .  (32)

Proof (30) and (32) follow from (29) and (31), respectively, by the Lipschitz property of the proximal map Proxγ,g (see Lemma 7); (29) follows directly from the Lipschitz property of ∇f. It remains to prove (31).
Since f is a convex function with Lipschitz-continuous gradient, (Nesterov, 2004, Theorem 2.1.5) shows that, for all θ, θ′ ∈ Θ,

L⁻¹ ∥∇f(θ) − ∇f(θ′)∥² ≤ ⟨ ∇f(θ) − ∇f(θ′) , θ − θ′ ⟩ .

The result follows.

Lemma 10 Assume H1. Set Sγ(θ) := Proxγ,g(θ − γH) and η := H − ∇f(θ). For any θ ∈ Θ and γ > 0,

∥Tγ(θ) − Sγ(θ)∥ ≤ γ ∥η∥ .  (33)

Proof We have Tγ(θ) − Sγ(θ) = Proxγ,g(θ − γ∇f(θ)) − Proxγ,g(θ − γH), and (33) follows from Lemma 7.

6.2 Proof of section 2

6.2.1 Proof of Lemma 1

Set wn = vn + Σ_{k≥n+1} ξk + M with M := −inf_n Σ_{k≥n+1} ξk, so that inf_n wn ≥ 0. Then

0 ≤ wn+1 ≤ vn − χn+1 + ξn+1 + Σ_{k≥n+2} ξk + M = wn − χn+1 .

{wn, n ∈ ℕ} is non-negative and non-increasing; therefore it converges. Furthermore, 0 ≤ Σ_{k=0}^n χk ≤ w0, so that Σ_n χn < ∞. The convergence of {wn, n ∈ ℕ} also implies the convergence of {vn, n ∈ ℕ}. This concludes the proof.

6.2.2 Proof of Theorem 2

Let θ⋆ ∈ L, which is not empty by H2; note that F(θ⋆) = min F. By (27) applied with θ ← θn − γn+1 Hn+1, ξ ← θn, θ′ ← θ⋆ and γ ← γn+1,

∥θn+1 − θ⋆∥² ≤ ∥θn − θ⋆∥² − 2γn+1 ( F(θn+1) − min F ) − 2γn+1 ⟨ θn+1 − θ⋆ , ηn+1 ⟩ .

We write θn+1 − θ⋆ = θn+1 − Tγn+1(θn) + Tγn+1(θn) − θ⋆. By Lemma 10, ∥θn+1 − Tγn+1(θn)∥ ≤ γn+1 ∥ηn+1∥, so that

− ⟨ θn+1 − θ⋆ , ηn+1 ⟩ ≤ γn+1 ∥ηn+1∥² − ⟨ Tγn+1(θn) − θ⋆ , ηn+1 ⟩ .

Hence

∥θn+1 − θ⋆∥² ≤ ∥θn − θ⋆∥² − 2γn+1 ( F(θn+1) − min F ) + 2γn+1² ∥ηn+1∥² − 2γn+1 ⟨ Tγn+1(θn) − θ⋆ , ηn+1 ⟩ .  (34)

Under (7) and (34), Lemma 1 shows that Σ_n γn ( F(θn) − min F ) < ∞ and that lim_n ∥θn − θ⋆∥ exists. This implies that sup_n ∥θn∥ < ∞. Since Σ_n γn = +∞, there exists a subsequence {θφn, n ∈ ℕ} such that lim_n F(θφn) = min F. The sequence {θφn, n ≥ 0} being bounded, we can assume without loss of generality that there exists θ∞ ∈ ℝd such that lim_n θφn = θ∞. Let us prove that θ∞ ∈ L. Since g is lower semi-continuous on Θ, lim inf_n g(θφn) ≥ g(θ∞), so that θ∞ ∈ Θ. Since F is lower semi-continuous on Θ, we have

min F = lim inf_n F(θφn) ≥ F(θ∞) ≥ min F ,

showing that F(θ∞) = min F. By (34), for any m and any n ≥ φm,

∥θn+1 − θ∞∥² ≤ ∥θφm − θ∞∥² − 2 Σ_{k=φm}^{n} γk+1 { ⟨ Tγk+1(θk) − θ∞ , ηk+1 ⟩ − γk+1 ∥ηk+1∥² } .

For any ϵ > 0, there exists m such that the right-hand side is upper bounded by ϵ.
Hence, for any n ≥ φm, ∥θn+1 − θ∞∥² ≤ ϵ, which proves the convergence of {θn, n ∈ ℕ} to θ∞.

6.2.3 Proof of Theorem 3

Let θ⋆ ∈ L; note that F(θ⋆) = min F. We first apply (27) with θ ← θj − γj+1 Hj+1, ξ ← θj, θ′ ← θ⋆ and γ ← γj+1:

F(θj+1) − min F ≤ (2γj+1)⁻¹ ( ∥θj − θ⋆∥² − ∥θj+1 − θ⋆∥² ) − ⟨ θj+1 − θ⋆ , ηj+1 ⟩ .

Multiplying both sides by aj+1 gives

aj+1 ( F(θj+1) − min F ) ≤ (aj+1/(2γj+1)) ∥θj − θ⋆∥² − (aj+1/(2γj+1)) ∥θj+1 − θ⋆∥² − aj+1 ⟨ θj+1 − θ⋆ , ηj+1 ⟩ .

Summing from j = 0 to n − 1 gives

(an/(2γn)) ∥θn − θ⋆∥² + Σ_{j=1}^{n} aj { F(θj) − min F } ≤ Σ_{j=1}^{n−1} ( aj+1/(2γj+1) − aj/(2γj) ) ∥θj − θ⋆∥² − Σ_{j=1}^{n} aj ⟨ θj − θ⋆ , ηj ⟩ + (a1/(2γ1)) ∥θ0 − θ⋆∥² .  (35)

We decompose ⟨θj − θ⋆, ηj⟩ as follows:

⟨ θj − θ⋆ , ηj ⟩ = ⟨ θj − Tγj(θj−1) , ηj ⟩ + ⟨ Tγj(θj−1) − θ⋆ , ηj ⟩ .

By Lemma 10, we get |⟨ θj − Tγj(θj−1) , ηj ⟩| ≤ γj ∥ηj∥², which concludes the proof.

6.3 Proof of Section 3.1

The proof of Theorem 4 is given in the case m = 1; we simply denote by Xn the sample Xn^{(1)}. The proof for the case m > 1 can be adapted from the proof below, by substituting for the functions Hθ(x) and W(x) the functions

Hθ(x1, …, xm) = m⁻¹ Σ_{k=1}^m Hθ(xk) ,  W(x1, …, xm) = m⁻¹ Σ_{k=1}^m W(xk) ;

and for the kernel Pθ and its invariant measure πθ the kernel and measure

Pθ(x1, …, xm ; B) = ∫⋯∫ Pθ(xm, dy1) ∏_{k=2}^m Pθ(yk−1, dyk) 1_B(y1, …, ym) ,

πθ(B) = ∫⋯∫ πθ(dy1) ∏_{k=2}^m Pθ(yk−1, dyk) 1_B(y1, …, ym) ,

for any (x1, …, xm) ∈ X^m and B ∈ X^{⊗m}.

6.3.1 Preliminary results

Proposition 11 Assume that g is proper, convex and Lipschitz on Θ with Lipschitz constant K. Then, for all θ ∈ Θ and γ > 0,

∥Proxγ,g(θ) − θ∥ ≤ Kγ .  (36)

Proof For all θ ∈ Θ, we get by Lemma 7

0 ≤ γ⁻¹ ∥θ − Proxγ,g(θ)∥² ≤ g(θ) − g(Proxγ,g(θ)) ≤ K ∥θ − Proxγ,g(θ)∥ .

Proposition 12 Assume H1, H2 and that Θ is bounded. Then

sup_{γ∈(0,1/L]} sup_{θ∈Θ} ∥Tγ(θ)∥ < ∞ .

If in addition H6-(ii) holds, then there exists a constant C such that for any θ, θ′ ∈ Θ and γ, γ′ ∈ (0, 1/L],

∥Tγ(θ) − Tγ′(θ′)∥ ≤ C ( γ + γ′ + ∥θ − θ′∥ ) .

Proof Let θ⋆ be such that, for any γ > 0, θ⋆ = Tγ(θ⋆) (such a point exists by H2 and (4)). We write Tγ(θ) = (Tγ(θ) − θ⋆) + θ⋆. By Lemma 9, for any θ ∈ Θ and γ ∈ (0, 1/L],

∥Tγ(θ) − θ⋆∥ ≤ 2 ∥θ − θ⋆∥ ≤ 2 ∥θ∥ + 2 ∥θ⋆∥ .

This concludes the proof of the first statement.
We write Tγ(θ) − Tγ′(θ′) = Tγ(θ) − Tγ′(θ) + Tγ′(θ) − Tγ′(θ′). By Lemma 7,

∥Tγ′(θ) − Tγ′(θ′)∥ ≤ ∥θ − θ′ − γ′∇f(θ) + γ′∇f(θ′)∥ ≤ ∥θ − θ′∥ + 2γ′ sup_{θ∈Θ} ∥∇f(θ)∥ .

By H1 and since Θ is bounded, sup_{θ∈Θ} ∥∇f(θ)∥ < ∞. In addition, using again Lemma 7,

∥Tγ(θ) − Tγ′(θ)∥ ≤ (γ + γ′) sup_{θ∈Θ} ∥∇f(θ)∥ + ∥Proxγ,g(θ) − Proxγ′,g(θ)∥ .

We conclude by using

∥Proxγ′,g(θ) − Proxγ,g(θ)∥ ≤ ∥Proxγ′,g(θ) − θ∥ + ∥θ − Proxγ,g(θ)∥ ≤ (γ + γ′) sup_{γ∈(0,1/L]} sup_{θ∈Θ} γ⁻¹ ∥Proxγ,g(θ) − θ∥ .

Lemma 13 Assume H5 and H6-(i). (i) There exists a measurable function (θ, x) ↦ Ĥθ(x) such that sup_{θ∈Θ} ∥Ĥθ∥_W < ∞ and, for any (θ, x) ∈ Θ × X,

Ĥθ(x) − Pθ Ĥθ(x) = Hθ(x) − ∫ Hθ(y) πθ(dy) .  (37)

(ii) There exists a constant C such that, for any θ, θ′ ∈ Θ, ∥Pθ Ĥθ − Pθ′ Ĥθ′∥_W ≤ C ∥θ − θ′∥.

Proof See (Fort et al., 2011, Lemma 4.2).

Lemma 14 Assume H4 and H5. Then sup_n E[W^p(Xn)] < ∞.

Proof The conditional distribution of Xj given the past Fj−1 is Pθj−1(Xj−1, ·). Therefore, we write E[W^p(Xn)] = E[ E[ W^p(Xn) | Fn−1 ] ] = E[ Pθn−1 W^p(Xn−1) ]. We then use the drift inequality to obtain E[W^p(Xn)] ≤ λ E[W^p(Xn−1)] + b. The proof then follows from a trivial induction.

Lemma 15 Assume H1, H6-(ii) and that Θ is bounded. There exists a constant C such that, w.p.1, for all n ≥ 0, ∥θn+1 − θn∥ ≤ C γn+1 ( 1 + ∥ηn+1∥ ).

Proof We write θn+1 − θn = θn+1 − Proxγn+1,g(θn) + Proxγn+1,g(θn) − θn. Since, by Lemma 7, θ ↦ Proxγ,g(θ) is Lipschitz for any γ > 0, we get

∥θn+1 − Proxγn+1,g(θn)∥ = ∥Proxγn+1,g( θn − γn+1 ηn+1 − γn+1 ∇f(θn) ) − Proxγn+1,g(θn)∥ ≤ γn+1 ( ∥ηn+1∥ + ∥∇f(θn)∥ ) ≤ γn+1 ( ∥ηn+1∥ + sup_{θ∈Θ} ∥∇f(θ)∥ ) .

By H1, w.p.1, sup_{θ∈Θ} ∥∇f(θ)∥ < ∞; hence, there exists C1 such that, w.p.1, for all n ≥ 0, ∥θn+1 − Proxγn+1,g(θn)∥ ≤ C1 γn+1 ( 1 + ∥ηn+1∥ ). Finally, under H6-(ii), there exists a constant C2 such that, w.p.1,

sup_n γn+1⁻¹ ∥Proxγn+1,g(θn) − θn∥ ≤ sup_{γ∈(0,1/L]} sup_{θ∈Θ} γ⁻¹ ∥Proxγ,g(θ) − θ∥ ≤ C2 .

This concludes the proof.

Lemma 16 Assume H1, H4, H5 and that Θ is bounded. There exists a constant C such that, w.p.1, for all n ≥ 0, ∥ηn+1∥ ≤ C W(Xn+1).

Proof By H4 and H5, ∥ηn+1∥ ≤ ( sup_{θ∈Θ} |Hθ|_W ) W(Xn+1) + sup_{θ∈Θ} ∥∇f(θ)∥. The result follows since f is Lipschitz by H1 (so that ∇f is bounded on the bounded set Θ), and since W ≥ 1.
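Lemma 13 is the Poisson-equation device that handles the biased (MCMC) case: the noise is rewritten through a function Ĥθ solving (37). For a toy finite-state chain, such a solution can be computed directly as Ĥ = Σ_{n≥0} Pⁿ(H − π(H)). The sketch below does this for a hypothetical two-state chain (the matrix and function are ours, chosen only for illustration):

```python
def poisson_solution(P, H, n_terms=200):
    """Solve the Poisson equation (37), Hhat - P*Hhat = H - pi(H), for a
    two-state Markov chain with transition matrix P and function H, via the
    truncated series Hhat = sum_{n>=0} P^n (H - pi(H)) (geometric convergence)."""
    p01, p10 = P[0][1], P[1][0]
    pi = (p10 / (p01 + p10), p01 / (p01 + p10))  # stationary distribution
    piH = pi[0] * H[0] + pi[1] * H[1]
    term = [H[0] - piH, H[1] - piH]              # centered function
    hhat = [0.0, 0.0]
    for _ in range(n_terms):
        hhat = [hhat[0] + term[0], hhat[1] + term[1]]
        term = [P[i][0] * term[0] + P[i][1] * term[1] for i in (0, 1)]
    return hhat, piH
```

One can then check (37) numerically: with P = [[0.9, 0.1], [0.2, 0.8]] and H = (1, 3), the vector Ĥ − PĤ agrees with H − π(H) to machine precision, since the truncation error decays like the second eigenvalue of P raised to the number of retained terms.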
6.3.2 Proof of Theorem 4

The proof of the almost-sure convergence consists in verifying the assumptions of Theorem 2. Let us start with the proof that, almost-surely, Σ_n γn+1² ∥ηn+1∥² < ∞. This property is a consequence of Lemma 17 applied with an ← γn². It remains to prove that, almost-surely, the series

Σ_n γn ηn  and  Σ_n γn+1 ⟨ Tγn+1(θn) , ηn+1 ⟩

converge; note that they are both of the form Σ_n γn+1 Aγn+1(θn) ηn+1 with, respectively, Aγ(θ) equal to the identity matrix and Aγ(θ) = Tγ(θ). In the case where the Monte Carlo approximation is unbiased, we apply Proposition 18 with an ← γn and Aγ(θ) equal to the identity matrix and obtain the almost-sure convergence of Σ_n γn ηn; we then apply Proposition 18 with an ← γn and Aγ(θ) = Tγ(θ) and obtain the almost-sure convergence of Σ_n γn+1 ⟨Tγn+1(θn), ηn+1⟩ — note that, by Proposition 12, Tγ(θ) satisfies the assumptions on Aγ(θ). In the case where the Monte Carlo approximation is biased, the steps are the same, except that we use Proposition 19 instead of Proposition 18. For the control of the moments, we use Theorem 3, and again Lemma 17 and Proposition 18 in the unbiased case (or Proposition 19 in the biased case).

Lemma 17 Assume H1, H4, H5 and that Θ is bounded. (i) If ak ≥ 0 and Σ_{k≥1} ak < ∞, then with probability one Σ_{n≥1} an ∥ηn∥² < ∞. (ii) For any q ∈ [1, p/2], there exists a constant C such that, for any non-negative numbers {a1, …, an},

∥ Σ_{k=1}^{n} ak ∥ηk∥² ∥_{Lq} ≤ C Σ_{k=1}^{n} ak .

Proof We write E[ Σ_{n≥0} an+1 ∥ηn+1∥² ] = Σ_{n≥0} an+1 E[∥ηn+1∥²]. By Lemma 14 and Lemma 16, sup_n ∥ηn+1∥_{L2} < ∞, so the right-hand side is finite. By the Minkowski inequality, since ak ≥ 0,

∥ Σ_{k=0}^{n−1} ak+1 ∥ηk+1∥² ∥_{Lq} ≤ ( Σ_{k=0}^{n−1} ak+1 ) sup_n ∥ηn∥²_{L2q} .

The supremum is finite by Lemma 14 and Lemma 16.

Proposition 18 Assume H1, H3, H4, H5, that Θ is bounded and that the Monte Carlo approximation is unbiased. Let {an, n ∈ ℕ} be a deterministic positive sequence and let {Aγ(θ), γ ∈ (0, 1/L], θ ∈ Θ} be deterministic matrices such that

sup_{γ∈(0,1/L]} sup_{θ∈Θ} ∥Aγ(θ)∥ < ∞ .  (38)

(i) If Σ_{n≥0} an² < ∞, then the series Σ_{n≥0} an+1 Aγn+1(θn) ηn+1 converges P-a.s.
(ii) For any q ∈ (1, p/2], there exists a constant C such that

∥ Σ_{k=0}^{n} ak+1 Aγk+1(θk) ηk+1 ∥_{Lq} ≤ C ( Σ_{k=0}^{n} ak+1² )^{1/2} .

Proof Since θn ∈ Fn, we have E[ an+1 Aγn+1(θn) ηn+1 | Fn ] = 0, thus showing that {Mn = Σ_{k=0}^{n} ak+1 Aγk+1(θk) ηk+1, n ∈ ℕ} is a martingale. This martingale converges almost-surely if S = Σ_{n≥0} an+1² ∥Aγn+1(θn)∥² ∥ηn+1∥² < ∞ P-a.s. (see e.g. (Hall and Heyde, 1980, Theorem 2.17)). Using (38) and Lemma 17, S < ∞ P-a.s. Consider now the Lq-moment of Mn. We apply (Hall and Heyde, 1980, Theorem 2.10): for any q ∈ (1, p/2], there exists a constant C such that, for any n ≥ 0,

∥ Σ_{k=0}^{n} ak+1 Aγk+1(θk) ηk+1 ∥_{Lq} ≤ C ( Σ_{k=0}^{n} ∥ ak+1 Aγk+1(θk) ηk+1 ∥²_{Lq} )^{1/2} .

Lemma 14 and Lemma 16 imply that sup_n ∥ηn+1∥_{Lq} < ∞; we then conclude with (38).

Proposition 19 Assume H1, H3–H6 and that Θ is bounded. Let {an, n ≥ 0} be a positive sequence and let {Aγ(θ), γ ∈ (0, 1/L], θ ∈ Θ} be deterministic matrix-valued functions such that there exists a constant CA such that, for any γ, γ′ ∈ (0, 1/L] and θ, θ′ ∈ Θ,

sup_{γ∈(0,1/L]} sup_{θ∈Θ} ∥Aγ(θ)∥ < ∞ ,  ∥Aγ(θ) − Aγ′(θ′)∥ ≤ CA ( γ + γ′ + ∥θ − θ′∥ ) .  (39)

(i) If Σ_n an γn < ∞, Σ_n an² < ∞ and Σ_n |an+1 − an| < ∞, then the series Σ_{n≥0} an+1 Aγn+1(θn) ηn+1 converges P-a.s.

(ii) For any q ∈ (1, p/2], there exists a constant C such that

∥ Σ_{k=0}^{n} ak+1 Aγk+1(θk) ηk+1 ∥_{Lq} ≤ C ( ( Σ_{k=0}^{n} ak+1² )^{1/2} + a1 + an+1 + Σ_{k=1}^{n} |ak+1 − ak| + Σ_{k=0}^{n} ak+1 γk+1 ) .

Proof (i) By H4 and Lemma 13-(i), we write

ηn+1 = Ĥθn(Xn+1) − Pθn Ĥθn(Xn) + { Pθn Ĥθn(Xn) − Pθn+1 Ĥθn+1(Xn+1) } + { Pθn+1 Ĥθn+1(Xn+1) − Pθn Ĥθn(Xn+1) } .

We prove successively that, w.p.1, the following series converge:

Σ_{n≥0} an+1 Aγn+1(θn) ( Ĥθn(Xn+1) − Pθn Ĥθn(Xn) ) ,  (40)

Σ_{n≥0} an+1 Aγn+1(θn) ( Pθn Ĥθn(Xn) − Pθn+1 Ĥθn+1(Xn+1) ) ,  (41)

Σ_{n≥0} an+1 Aγn+1(θn) ( Pθn+1 Ĥθn+1(Xn+1) − Pθn Ĥθn(Xn+1) ) .  (42)

Proof [Proof of (40)] By H4, {Ĥθn(Xn+1) − Pθn Ĥθn(Xn), n ∈ ℕ} is a martingale-increment sequence w.r.t. the filtration {Fn, n ≥ 0}. The proof is along the same lines as the proof of Proposition 18, upon noting that, by Lemma 13 and H5, there exists C such that, w.p.1, for all n ≥ 0,

∥ Ĥθn(Xn+1) − Pθn Ĥθn(Xn) ∥ ≤ C { W(Xn+1) + W(Xn) } .
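Proposition 18(i) says that square-summable weights turn the noise into a convergent martingale. A toy illustration of this effect (ours, not the paper's setting): with ak = 1/k and i.i.d. standard Gaussian noise εk, the partial sums of Σ_k ak εk stabilize, and the fluctuation of the tail beyond index N has standard deviation of order N^{−1/2}.

```python
import random

def weighted_noise_series(n_terms, seed=1):
    """Partial sums of sum_k a_k * eps_k with a_k = 1/k (so sum a_k^2 < inf)
    and eps_k iid N(0, 1): a toy version of the martingale in Proposition 18(i).
    In the paper, eta_{n+1} is the Monte Carlo error on the gradient."""
    rng = random.Random(seed)
    s, out = 0.0, []
    for k in range(1, n_terms + 1):
        s += rng.gauss(0.0, 1.0) / k
        out.append(s)
    return out
```

Running it, the partial sums after index 5,000 and after index 20,000 differ only by a quantity whose standard deviation is about (Σ_{k>5000} k⁻²)^{1/2} ≈ 0.014, in contrast with the unweighted random walk, which diverges.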
Proof [Proof of (41)] The sum is equal to Σ_{n≥0} Δn+1 Pθn Ĥθn(Xn) with Δn+1 = an+1 Aγn+1(θn) − an Aγn(θn−1). On the one hand, by Lemma 13 and H5, there exists C such that, w.p.1, for all n ≥ 0, ∥Pθn Ĥθn(Xn)∥ ≤ C W(Xn). On the other hand, by (39), Lemma 15 and Lemma 16, there exists C such that, a.s., for all n ≥ 0,

∥Δn+1∥ ≤ C ( |an+1 − an| + an (γn + γn+1) W(Xn) ) .

By Lemma 14, sup_n E[W²(Xn)] < ∞. Therefore, by (39) and the assumptions on {an, n ≥ 0}, we have Σ_n E[ ∥Δn+1∥ ∥Pθn Ĥθn(Xn)∥ ] < ∞, which concludes the proof.

Proof [Proof of (42)] By (39) and Lemma 13, there exists a constant C such that, w.p.1, for any n,

∥ Aγn+1(θn) ( Pθn+1 Ĥθn+1(Xn+1) − Pθn Ĥθn(Xn+1) ) ∥ ≤ C ∥θn+1 − θn∥ W(Xn+1) .

By Lemma 15 and Lemma 16, there exists a constant C such that, w.p.1, for all n ≥ 0, ∥θn+1 − θn∥ W(Xn+1) ≤ C γn+1 W²(Xn+1). From Lemma 14 and the assumptions on {an, n ≥ 0}, Σ_n an+1 γn+1 E[W²(Xn+1)] < ∞, from which (42) follows.

(ii) We start from the same decomposition of ηn+1 into three terms. The first one is a martingale increment and, following the same lines as in the proof of Proposition 18, we obtain

∥ Σ_{k=0}^{n} ak+1 Aγk+1(θk) ( Ĥθk(Xk+1) − Pθk Ĥθk(Xk) ) ∥_{Lq} ≤ C ( Σ_{k=0}^{n} ak+1² )^{1/2} .

For the second term, we write

Σ_{k=0}^{n} ak+1 Aγk+1(θk) ( Pθk Ĥθk(Xk) − Pθk+1 Ĥθk+1(Xk+1) ) = a1 Aγ1(θ0) Pθ0 Ĥθ0(X0) − an+1 Aγn+1(θn) Pθn+1 Ĥθn+1(Xn+1) + Σ_{k=1}^{n} Δk+1 Pθk Ĥθk(Xk) .

By the Minkowski inequality, it is easily seen that there exists a constant C such that

∥ Σ_{k=0}^{n} ak+1 Aγk+1(θk) ( Pθk Ĥθk(Xk) − Pθk+1 Ĥθk+1(Xk+1) ) ∥_{Lq} ≤ C ( a1 + an+1 + Σ_{k=1}^{n} ( |ak+1 − ak| + ak (γk + γk+1) ) ) .

Finally, for the last term, following the same computations as above, we have by the Minkowski inequality

∥ Σ_{k=0}^{n} ak+1 Aγk+1(θk) ( Pθk+1 Ĥθk+1(Xk+1) − Pθk Ĥθk(Xk+1) ) ∥_{Lq} ≤ C Σ_{k=0}^{n} ak+1 γk+1 .

6.4 Proof of Theorem 6

We write ηn+1 = Bn + (ηn+1 − Bn), where Bn is given by (12). Observe that {ηn+1 − Bn, n ∈ ℕ} is a martingale-increment sequence. Sufficient conditions for the almost-sure convergence of a martingale and for the control of its Lq-moments can be found in (Hall and Heyde, 1980, Theorems 2.10 and 2.17). The proof then follows from Proposition 5 and Lemma 14.
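Appendix B below differentiates t ↦ ∫ Bij(x) πt(dx) inside an exponential family, using the classical identity "derivative of the mean of the sufficient statistic = its covariance". A minimal numerical sanity check of this identity in the scalar two-state case (a toy example of ours, not the paper's graphical model):

```python
import math

def check_expfam_cov(theta=0.3, h=1e-5):
    """For the two-state exponential family pi_t(x) proportional to exp(t*B(x))
    on {0, 1} with B(x) = x, compare the numerical derivative of t -> E_t[B]
    with Var_t(B): the (scalar) covariance identity used in Appendix B."""
    B = [0.0, 1.0]
    def mean(t):
        w = [math.exp(t * b) for b in B]
        return sum(b * wi for b, wi in zip(B, w)) / sum(w)
    num = (mean(theta + h) - mean(theta - h)) / (2 * h)  # central difference
    m = mean(theta)
    var = m * (1.0 - m)  # variance of a Bernoulli(m) since B takes values 0/1
    return num, var
```

Here E_t[B] is the logistic function s(t), whose derivative s(t)(1 − s(t)) is exactly the variance, so the two returned numbers agree up to the O(h²) finite-difference error.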
Appendix A. Proofs of Section 5

By the Cauchy–Schwarz inequality, it holds that

( ∫ exp( ℓc(θ|u)/2 ) φ(u) du )² ≤ ∫ exp( ℓc(θ|u) ) φ(u) du

and

( ∫ exp( ℓc(θ|u) ) ∥u∥² φ(u) du )² ≤ ∫ exp( ℓc(θ|u)/2 ) φ(u) du · ∫ exp( 3ℓc(θ|u)/2 ) ∥u∥⁴ φ(u) du ,

which implies that

∫ ∥u∥² πθ(u) du = ∫ exp( ℓc(θ|u) ) ∥u∥² φ(u) du / ∫ exp( ℓc(θ|v) ) φ(v) dv ≤ ( ∫ exp( 3ℓc(θ|u)/2 ) ∥u∥⁴ φ(u) du )^{1/2} / ( ∫ exp( ℓc(θ|u)/2 ) φ(u) du )^{3/2} .

Since exp( ℓc(θ|u) ) ≤ 1 and ∫ ∥u∥⁴ φ(u) du = q(2 + q), we have

∫ ∥u∥² πθ(u) du ≤ √( q(2 + q) ) / ( ∫ exp( ℓc(θ|u)/2 ) φ(u) du )^{3/2} .

Appendix B. Proof of Section 4

For θ, ϑ ∈ Θ, the (i, j)-th entry of the matrix ∇ℓ(θ) − ∇ℓ(ϑ) is given by

( ∇ℓ(θ) − ∇ℓ(ϑ) )ij = ∫_{Xp} Bij(x) πϑ(dx) − ∫_{Xp} Bij(x) πθ(dx) .

For t ∈ [0, 1],

πt(dz) := ( exp ⟨ B(z) , tϑ + (1−t)θ ⟩ / ∫ exp ⟨ B(x) , tϑ + (1−t)θ ⟩ µ(dx) ) µ(dz)

defines a probability measure on Xp. It is straightforward to check that

( ∇ℓ(θ) − ∇ℓ(ϑ) )ij = ∫ Bij(x) π1(dx) − ∫ Bij(x) π0(dx) ,

and that t ↦ ∫ Bij(x) πt(dx) is differentiable, with derivative

(d/dt) ∫ Bij(x) πt(dx) = ∫ Bij(x) ⟨ B(x) − ∫ B(z) πt(dz) , ϑ − θ ⟩ πt(dx) = Covπt( Bij(X) , ⟨ B(X) , ϑ − θ ⟩ ) ,

where the covariance is taken assuming that X ~ πt. Hence

|( ∇ℓ(θ) − ∇ℓ(ϑ) )ij| = | ∫₀¹ Covπt( Bij(X) , ⟨ B(X) , ϑ − θ ⟩ ) dt | ≤ osc(Bij) ( Σ_{k≤l} osc²(Bkl) )^{1/2} ∥θ − ϑ∥₂ .

This implies the inequality (16).

Acknowledgments: We are grateful to George Michailidis for very helpful discussions. This work is partly supported by NSF grant DMS-1228164.

References

S. Allassonnière and E. Kuhn. Convergent Stochastic Expectation Maximization algorithm with efficient sampling in high dimension. Application to deformable template model estimation. Comput. Stat. Data An., 91:4–19, 2015.

C. Andrieu and E. Moulines. On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab., 16(3):1462–1505, 2006.

C. Andrieu, E. Moulines, and P. Priouret. Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim., 44(1):283–312, 2005.

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9:485–516, 2008.

H.
Bauschke and P.L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2011. ISBN 978-1-4419-9466-0. With a foreword by Hédy Attouch.

A. Beck and M. Teboulle. Gradient-based algorithms with applications to signal recovery problems. In Convex Optimization in Signal Processing and Communications, pages 42–88. Cambridge Univ. Press, Cambridge, 2010.

A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990.

P. Biane, J. Pitman, and M. Yor. Probability laws related to the Jacobi theta and Riemann zeta functions, and Brownian excursions. Bull. Amer. Math. Soc. (N.S.), 38(4):435–465 (electronic), 2001. ISSN 0273-0979.

H.M. Choi and J.P. Hobert. The Polya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic. Electronic Journal of Statistics, 7:2054–2064, 2013.

P.L. Combettes. Quasi-Fejérian analysis of some optimization algorithms. In Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, pages 115–152. Elsevier Science, 2001.

P.L. Combettes and J.C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, volume 49 of Springer Optim. Appl., pages 185–212. Springer, New York, 2011.

P.L. Combettes and J.C. Pesquet. Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim., 25(2):1221–1248, 2015a.

P.L. Combettes and J.C. Pesquet. Stochastic approximations and perturbations in forward-backward splitting for monotone operators. Technical report, arXiv:1507.07095v1, 2015b.

P.L. Combettes and V. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168–1200, 2005.

A. Cotter, O. Shamir, N.
Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1647–1655. Curran Associates, Inc., 2011.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, 2011. ISSN 1532-4435.

M. Ekeberg, C. Lövkvist, Y. Lan, M. Weigt, and E. Aurell. Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Phys. Rev. E, 87:012707, 2013.

G. Fort and E. Moulines. Convergence of the Monte Carlo expectation maximization for curved exponential families. Ann. Statist., 31(4):1220–1259, 2003. ISSN 0090-5364.

G. Fort, E. Moulines, and P. Priouret. Convergence of adaptive and interacting Markov chain Monte Carlo algorithms. Ann. Statist., 39(6):3262–3289, 2011. ISSN 0090-5364.

G. Fort, E. Moulines, M. Vihola, and A. Schreck. Convergence of Markovian stochastic approximation with discontinuous dynamics. Technical report, arXiv math.ST 1403.6803, 2014.

G. Fort, B. Jourdain, E. Kuhn, T. Lelièvre, and G. Stoltz. Convergence of the Wang-Landau algorithm. Mathematics of Computation, 84:2297–2327, 2015.

C.J. Geyer. On the convergence of Monte Carlo maximum likelihood calculations. J. Roy. Statist. Soc. Ser. B, 56(1):261–274, 1994.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Joint structure estimation for categorical Markov networks. Technical report, Univ. of Michigan, 2010.

P. Hall and C.C. Heyde. Martingale Limit Theory and its Application. Academic Press, 1980.

H. Höfling and R. Tibshirani. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. J. Mach. Learn. Res., 10:883–906, 2009.

C. Hu, W. Pan, and J.T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C.K.I. Williams, and A.
Culotta, editors, Advances in Neural Information Processing Systems, pages 781–789, 2009.

A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning, pages 121–146. MIT Press, Boston, 2012a.

A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, II: Utilizing problem's structure. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning, pages 149–181. MIT Press, Boston, 2012b.

H. Kamisetty, S. Ovchinnikov, and D. Baker. Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proceedings of the National Academy of Sciences, 2013. doi: 10.1073/pnas.1314045110.

G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133(1-2, Ser. A):365–397, 2012. ISSN 0025-5610.

J. Lin, L. Rosasco, S. Villa, and D.X. Zhou. Modified Fejér sequences and applications. Technical report, arXiv:1510.04641v1 math.OC, 2015.

G.J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, 2nd edition, 2008.

S. Meyn and R.L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, Cambridge, second edition, 2009. ISBN 978-0-521-73182-9. With a prologue by Peter W. Glynn.

J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Sér. A Math., 255:2897–2899, 1962.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2008. ISSN 1052-6234.

Y.E. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.

A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q.
Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1574–1582. Curran Associates, Inc., 2014.

H.E. Ogden. A sequential reduction method for inference in generalized linear mixed models. Electron. J. Statist., 9(1):135–152, 2015.

N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.

N.G. Polson, J.G. Scott, and J. Windle. Bayesian inference for logistic models using Polya-Gamma latent variables. J. Am. Stat. Assoc., 108(504):1339–1349, 2013.

B.T. Polyak. Introduction to Optimization. Optimization Software, New York, 1987.

P. Ravikumar, M.J. Wainwright, and J.D. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Ann. Statist., 38(3):1287–1319, 2010.

C.P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, 2nd edition, 2005.

L. Rosasco, S. Villa, and B.C. Vu. Convergence of a stochastic proximal gradient algorithm. Technical report, arXiv:1403.5075v3, 2014.

L. Rosasco, S. Villa, and B.C. Vu. A stochastic inertial forward-backward splitting algorithm for multi-variate monotone inclusions. Technical report, arXiv:1507.00848v1, 2015.

E. Saksman and M. Vihola. On the ergodicity of the adaptive Metropolis algorithm on unbounded domains. Ann. Appl. Probab., 20(6):2178–2203, 2010.

J. Schelldorfer, L. Meier, and P. Bühlmann. GLMMLasso: an algorithm for high-dimensional generalized linear mixed models using ℓ1-penalization. J. Comput. Graph. Statist., 23(2):460–477, 2014.

M.W. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In NIPS, pages 1458–1466, 2011. See also the technical report INRIA-00618152.

A. Schreck, G. Fort, and E. Moulines. Adaptive equi-energy sampler: convergence and illustration. ACM Transactions on Modeling and Computer Simulation (TOMACS), 23(1):Art. 5, 2013.

J. Shao. Mathematical Statistics.
Springer Texts in Statistics, 2003.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11:2543–2596, 2010. ISSN 1532-4435.

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim., 24:2057–2075, 2014.

L. Xue, H. Zou, and T. Cai. Non-concave penalized composite likelihood estimation of sparse Ising models. Ann. Statist., 40:1403–1429, 2012.