# concentration_analysis_of_multivariate_elliptic_diffusions__92dbee07.pdf

Journal of Machine Learning Research 24 (2023) 1-38 Submitted 6/22; Revised 3/23; Published 4/23

Concentration analysis of multivariate elliptic diﬀusions

Lukas Trottner trottner@math.au.dk Department of Mathematics Aarhus University Aarhus, Denmark

Cathrine Aeckerle-Willems aeckerle@uni-mannheim.de Department of Economics University of Mannheim Mannheim, Germany

Claudia Strauch strauch@math.au.dk Department of Mathematics Aarhus University Aarhus, Denmark

Editor: Gabor Lugosi

We prove concentration inequalities and associated PAC bounds for both continuousand discrete-time additive functionals for possibly unbounded functions of multivariate, nonreversible diﬀusion processes. Our analysis relies on an approach via the Poisson equation allowing us to consider a very broad class of subexponentially ergodic, multivariate diﬀusion processes. These results add to existing concentration inequalities for additive functionals of diﬀusion processes which have so far been only available for either bounded functions or for unbounded functions of processes from a signiﬁcantly smaller class. We demonstrate the power of these exponential inequalities by two examples of very diﬀerent areas. Considering a possibly high-dimensional, parametric, nonlinear drift model under sparsity constraints we apply the continuous-time concentration results to validate the restricted eigenvalue condition for Lasso estimation, which is fundamental for the derivation of oracle inequalities. The results for discrete additive functionals are applied for an investigation of the unadjusted Langevin MCMC algorithm for sampling of moderately heavy tailed densities π. In particular, we provide PAC bounds for the sample Monte Carlo estimator of integrals π(f) for polynomially growing functions f that quantify suﬃcient sample and step sizes for approximation within a prescribed margin with high probability.

Keywords: Concentration inequality, elliptic diﬀusions, Lasso estimation, MCMC, PAC bounds

1. Introduction

Concentration inequalities for additive functionals belong to the fundamental probabilistic tools in statistics and related areas such as statistical learning and reinforcement learning since they allow exact quantiﬁcation of the deviation of estimators from a given target. In particular, concentration inequalities for independent data such as Hoeﬀding, Bernstein and Mc Diarmid inequalities are of central importance for deriving PAC guarantees in classiﬁ-

2023 Lukas Trottner, Cathrine Aeckerle-Willems and Claudia Strauch.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0666.html.

Trottner, Aeckerle-Willems and Strauch

cation and regression contexts (see, e.g., Devroye et al. (1996); Wainwright (2019)). While such questions have been well understood for decades in classical settings for independent or strongly mixing data see also the recent investigations of Bernstein and Hoeﬀding inequalities and related applications in statistical learning for Markov chains with spectral gap in Jiang et al. (2018); Fan et al. (2021) the general picture for additive functionals of diﬀusion processes is less clear. Particularly when it comes to unbounded functionals, whose deviation properties around their ergodic mean are fundamentally important in a multitude of applications, useful results are rather scarce. Important achievements in this direction can be found in Cattiaux and Guillin (2008); Gao et al. (2014), where for a restricted class of reversible diﬀusion processes exponential inequalities are derived by means of functional inequalities. While these results are mathematically elegant and explicitly quantify the contribution of the asymptotic variance, they come at the price of structural constraints on the diﬀusion coeﬃcients which are hard to verify and often inappropriate for speciﬁc applications.

The goal of this paper is therefore to derive usable exponential concentration inequalities for unbounded functionals, both for continuous as well as discrete multivariate diﬀusion data, under comparatively weak assumptions on the coeﬃcients and the speed of ergodicity. With our particular focus on applications, we translate these inequalities into PAC bounds for the approximation task and demonstrate their usefulness in speciﬁc high-dimensional applications to (i) penalized drift estimation under sparsity constraints, where we extend results for the classical Ornstein Uhlenbeck model in Ga ıﬀas and Matulewicz (2019); Cio lek et al. (2020) to more ﬂexible parametrized models with relaxed ergodicity assumptions, and (ii) performance guarantees for unadjusted Langevin MCMC algorithms for heavy-tailed target sampling, which is a setting that substantially diﬀers from the related pioneering work Dalalyan (2017); Durmus and Moulines (2017) for strongly log-concave targets. Here, for a given quantity of interest π and a sample based estimator bπt with t T where T = [0, ) or T = N0 for some sampling distance > 0, depending on whether continuous or discrete data is available , we say that bπt satisﬁes an (ε, δ)-PAC bound for t T(ε, δ) T, given ε > 0, δ (0, 1), if

t T(ε, δ), P |bπt π| ε 1 δ,

i.e., given a sample length of at least T(ε, δ), bπt approximates the target π within an εmargin with probability at least 1 δ. Such results are statistically much more insightful than upper bounds on the mean deviation, which do not reveal detailed information on the distribution of the loss.

In our particular context, the objectives are exponential inequalities and associated PAC bounds of sample mean estimators of the quantity π = µ(f) := R f(x) µ(dx), where µ is the stationary distribution of a subexponentially ergodic elliptic diﬀusion X and f is a polynomially growing function. That is, we provide an in-depth analysis of the deviations around π of bπt = t 1/2Gt(f), where

0 f(Xs) ds, (1)

Concentration analysis of elliptic diffusions

given continuous data (Xs)0 s t, and of bπn = (n ) 1/2Gn, (f), where

Gn, (f) := 1

k=1 f(Xk ) , (2)

given discrete data (Xk )k=1,...,n, as well as their burned-in versions. Since our speciﬁc framework is what sets this paper apart from related studies such as Gao et al. (2014), we will now introduce both the class of processes we are working with as well as the Poisson equation and its solution studied in Pardoux and Veretennikov (2001), which is at the heart of our theoretical analysis based on martingale approximation.

Basic framework Consider a d-dimensional elliptic diﬀusion that is given as the weak solution to the SDE d Xt = b(Xt) dt + σ(Xt) d Wt, (3)

where b: Rd Rd is a locally Lipschitz drift vector such that b(x) 1 + x q for some q 0 and σ: Rd Rd d is a uniformly continuous, bounded and locally Lipschitz d d-matrix-valued function such that a := σσ is uniformly elliptic, i.e.,

a(x)η/ η , η/ η λ, x Rd, η Rd \ {0},

for some constant λ > 0. We denote by (X, (Px)x Rd) the Markovian weak solution of (3) such that under Px the process X solves (3) with initial condition X0 = x and has continuous paths almost surely. Note that X has the Feller property, cf. (Stroock and Varadhan, 2006, Corollary 11.1.5), and is therefore Borel right such that it falls into the general framework for stability of Markov processes. Without loss of generality, we may assume that there exists a family of shift operators (θt)t 0 for X, that is, Xt θs = Xt+s for any s, t 0. Let λ , λ+, Λ be the tightest constants such that, for any x = 0,

0 < λ a(x)x/ x , x/ x λ+, tr(a(x))/d Λ,

where our assumptions guarantee that such constants always exist since we may always choose Λ = d 1 supx Rd tr(a(x)) < , λ = λ and λ+ = supx Rd σ(x) 2 < . Our subsequent analysis substantially relies on the following growth condition on the drift,

(A (q)) if x M0, then b(x), x/ x r x q,

where q [ 1, 1), M0 0, r > 0. For q = 0, this condition equals the standard ergodicity condition in many recent investigations of multivariate diﬀusion processes exploiting the exponential β-mixing property. As will be discussed in Section 2, the case q > 0 corresponds to a subexponential ergodic behaviour of the diﬀusion. Our approach to deviation inequalities is driven by the martingale approximation technique, which has been employed for the same purpose in the literature under more restrictive structural assumptions. Aeckerle-Willems and Strauch (2021) study concentration inequalities in the context of scalar exponentially ergodic diﬀusions in the regime q = 0 with polynomially growing drift, and in Nickl and Ray (2020), multivariate diﬀusions with unit

Trottner, Aeckerle-Willems and Strauch

diﬀusion matrix and periodic Lipschitz drift are considered. Galtchouk and Pergamenshchikov (2007) essentially treat the scalar dissipative case with q = 1. All of these papers put a special emphasis on uniformicity of the concentration inequalities with respect to the diﬀusion coeﬃcients in order to apply them to statistical minimax estimation problems. Moreover, the martingale approximation is employed in Mattingly et al. (2010) for providing L2 convergence guarantees of Monte Carlo estimators for well-behaved SDEs on the torus based on samples obtained by numerical approximation schemes. Central to the martingale approximation technique is the existence of a solution to the Poisson equation Lu = f, (4)

for appropriate functions f where, given u L1 loc(Rd) having weak partial derivatives up to second order belonging to L1 loc(Rd),

Lu(x) = b(x), u(x) + 1

2tr a(x)D2u(x) , x Rd,

is a second order local operator. Note that, on the domain C2 0(Rd), L is the inﬁnitesimal generator of the diﬀusion process. In the scalar case, (4) has an explicit C2-solution, which is used in Aeckerle-Willems and Strauch (2021) to obtain sup-norm moment bounds for empirical processes that are uniform over a class of SDE coeﬃcients. Such results can then be employed for minimax optimal sup-norm adaptive drift estimation as demonstrated in Aeckerle-Willems and Strauch (2022). For multivariate diﬀusions, such explicit solutions are not obtainable in general such that one needs to deal with the Poisson equation in a more abstract manner. In Pardoux and Veretennikov (2001), the authors demonstrate that in our framework, for any f : Rd R such that |f(x)| L(1+ x η) for some ﬁnite constants L > 0, η 0, there exists a solution u[f] T p>1 W2,p loc (Rd) that is unique in the local Sobolev space W2,p loc (Rd) for any p > d. This solution is given as

u[f](x) = Z

0 Ex[ f(Xt)] dt, x Rd,

i.e., u[f](x) is expressed as the potential of f under Px. Therefore, for such f we denote

L 1[f](x) := Z

0 Ex[ f(Xt)] dt, x Rd,

such that LL 1[f] = f, λ-a.e., where λ denotes the Lebesgue measure on Rd. The Sobolev regularity of L 1[f] is an essential property for our purposes, since it allows us to apply the It o Krylov formula for martingale approximation. This approach will enable us to conclude the desired deviation inequalities from moment bounds for the martingale approximation.

Outline and main results In Section 2, we collect some essential known facts on the subexponentially ergodic nature of the diﬀusion X implied by the drift condition (A (q)) and put them into a form suited to our needs. In Section 3.1, we present our ﬁrst main result, the concentration inequality for the continuous-time scaled additive functional Gt(f) for polynomially growing f (Theorem 3) which is based on our derivation of the martingale

Concentration analysis of elliptic diffusions

approximation Gt(f) and bounds on the solution to the Poisson equation and its gradient going back to Pardoux and Veretennikov (2001). We translate these inequalities into stationary and non-stationary PAC bounds in Corollary 8 and 9, respectively. In Section 3.2, we then proceed to derive explicit deviation bounds in terms of the sampling frequency and number of observations n for the discrete scaled additive functional Gn, (f) by combining Theorem 3 with an approximation argument, see Theorem 11. As for the continuous data, we use this result to infer PAC bounds for the sample mean estimator and its burn-in version. In Section 4.1, we apply the continuous-time results to the problem of estimating the coeﬃcients in a possibly high-dimensional drift model of the form bθ0 = PN j=1 θjψj, θ0 = (θ1, ..., θN) RN, given a dictionary (ψj)1 j N of Lipschitz continuous functions ψj : Rd Rd under sparsity constraints on the coeﬃcients via a Lasso approach. Our concentration inequality is the key to showing that the central restricted eigenvalue condition is in place, which then in turn yields oracle inequalities in line with those well known in the classical regression context. Finally, Section 4.2 is devoted to an application of our discrete deviation results, where we study the convergence properties of the unadjusted Langevin algorithm for moderately heavy-tailed target distributions π, in terms of suﬃcient sample and step size conditions for sampling within an ε-margin in total variation as well as for ensuring an (ε, δ)-PAC bound of the sample Monte Carlo estimator of a given target integral π(f), again for polynomially bounded functions f.

2. Subexponential ergodicity of the diﬀusion

We now give an exact quantiﬁcation of the stability of X, which underlies the arguments from Pardoux and Veretennikov (2001) and also plays the central technical role in our approach. For details on terms from Markov stability theory such as petite sets or Harris recurrence, we refer to Douc et al. (2009). Deﬁne q+ = q 0. For q (0, 1), choose ι = ι(q) > 0 small enough such that r > ιλ+(1 q)/2. In this framework, it was shown in (Douc et al., 2009, Proposition 5.1, Theorem 5.4) as a reﬁnement of results in Malyshkin (2000) that X possesses a unique invariant distribution µ and that there exists some constant C(q) such that, for Vq(x) := exp(ι x 1 q), we have Px(Xt ) µ TV C(q)Vq(x)(1 + t)

2q 1+q e (ι t)(1 q)/(1+q), x Rd, t 0, (5)

with ι = ι (q) := ι(1+q)/(1 q)(1 + q)(r λ+ι(1 q)/2) and ν TV := sup f 1|ν(f)| for a signed ﬁnite measure ν. Thus, for q (0, 1), X is subexponentially ergodic. In case q [ 1, 0], X is exponentially ergodic, i.e., for some constants C(0), ι and ι (not explicitly related to the constants C(q), ι(q), ι (q) from above), Px(Xt ) µ TV C(0)eι x e ι t, x Rd, t 0, (6)

cf. (Pardoux and Veretennikov, 2001, Proposition 1). Moreover, (Douc et al., 2009, Theorem 5.3) and (Pardoux and Veretennikov, 2001, Proposition 1) establish that Vq+ L1(µ) (here, V0( ) = exp(ι )), i.e.,

Eµ[Vq+(X0)] = Z

Rd exp(ι x 1 q+) µ(dx) < . (7)

Trottner, Aeckerle-Willems and Strauch

It will be central for us to trade oﬀsubexponential ergodicity at a slower temporal rate with a less punishing penalty function. Let e Vα(x) := 1 + x α for α 0. Then, for ζ > 0 and γ > 2(1 + ζ), Proposition 1 in Pardoux and Veretennikov (2001) demonstrates that Px(Xt ) µ TV C(γ, ζ)e Vγ(x)(1 + t) (1+ζ), x Rd, t 0,

i.e., polynomial convergence with polynomial penalty function whose degree depends on the degree of the temporal rate that can be freely chosen. We will need to make use of polynomial convergence with respect to a stronger norm than the total variation norm considered above. To this end, let H1, H2 be a pair of Young functions on R+, which are in particular invertible and satisfy

xy H1(x) + H2(y), x, y 0, (8)

and let I be the family of pairs of inverse Young functions augmented by (1, Id) and (Id, 1). The prototypical example for such pairs are H1(x) = xp/p, H2(x) = xq/q with p, q conjugate H older exponents such that 1/p + 1/q = 1, in which case (8) is simply Young s inequality. More generally, one may pair any convex function with its Legendre transform to obtain (8). Following earlier work on discrete and continuous-time Markov models Douc et al. (2004, 2008); Jarner and Roberts (2002); Tuominen and Tweedie (1994); Fort and Roberts (2005), such inverse Young functions are used in Douc et al. (2009) for subgeometrically ergodic Markov models to quantify the trade-oﬀbetween speed of convergence and strength of the underlying f-norm, which we introduce next. For a measurable function f 1 and a signed measure ν on (Rd, B(Rd)), its f-norm is deﬁned by

ν f := sup |g| f |ν(g)|.

In particular, TV = 1. Let us also deﬁne the δ-delayed ﬁrst hitting time of a set B B(Rd) by τB(δ) = inf{t δ : Xt B}, for δ 0, with τB = τB(0). Moreover, we say that B with µ(B) > 0 is accessible, since µ is a maximal irreducibility measure of X. Also note that µ as the invariant distribution of a Feller process is maximal Harris, such that in particular for any accessible set B we have Px(τB(δ) < ) = 1 for any x Rd, δ 0. With the techniques from Douc et al. (2009), we obtain the following result on polynomial f-norm convergence and modulated moments, whose proof is given in Appendix A. This explicit ergodicity result is central both for our derivation of the concentration inequality for continuous data and for its subsequent discrete extension. It will turn out that appropriate choices for the pairing of Young functions to optimize the trade-oﬀbetween convergence rate and strength of the f-norm will be essential when dealing with polynomially bounded test functions. We therefore truly need the full generality of the statement, which underlines the power of the approach in Douc et al. (2009) for concrete applications.

Proposition 1 Let γ 1 + q and q ( 1, 1). Then, there exist functions rγ,q(t) (1 + t)(γ (1+q))/(1+q) and fγ,q e Vγ (1+q) such that, for any pair of inverse Young functions Ψ = (Ψ1, Ψ2) I and some constant C(Ψ), we have

(Ψ1(rγ,q(t)) 1) Px(Xt ) µ 1 Ψ2 fγ,q C(Ψ)e Vγ(x), x Rd, t 0, (9)

Concentration analysis of elliptic diffusions

and, for any accessible set B B(Rd) and any δ > 0, there exists a constant c(Ψ) > 0 such that

Exh Z τB(δ)

0 Ψ1(rγ,q(t))Ψ2(fγ,q(Xt)) dt i c(Ψ)e Vγ(x), x Rd. (10)

Moreover, if q = 1 and γ > 0, then, for any α (0, rγ), there exist a function rα(t) exp( αt) and fγ e Vγ such that (1) and (9) are true with fγ,q and rγ,q replaced by fγ and rα, respectively.

Let us note that, for any η 0,

sup t 0 Ex e Xt η c(η) 1 + x η , x Rd,

where e Xt = Xτ 1(t) for the time change τ(t) := R t 0 σ (Xs)Xs/ Xs 2 ds, cf. (Pardoux and Veretennikov, 2001, Proposition 1). Setting Ψ1 = 1 and Ψ2 = Id in (9), it follows for the process on its unchanged time scale that, for any η > 0,

sup t 0 Ex Xt η C(η)(1 + x η+1+q), x Rd. (11)

3. Concentration of additive diﬀusion functionals

Recall the deﬁnition of the scaled additive functionals Gt(f) and Gn, (f) from (1) and (2), respectively. Motivated by the existence of a regular solution to the Poisson equation for polynomially bounded functions, we study deviations of Gt(f) and its discrete version Gn, (f) for functions f belonging to the function class F(η, L) given by

F(η, L) := ef µ( ef) : ef G(η, L) ,

for G(η, L) := f : Rd R : |f(x)| L(1 + x η), x Rd},

for some ﬁnite constant L > 0, η 0. There is a vast amount of literature on concentration inequalities for path integrals of general Markov processes. The most powerful results are generally established under the assumption of functional inequalities such as Poincar e or log-Sobolev. However, the elliptic diﬀusions considered in this paper generally do not satisfy such rather strong functional inequalities. In this regard, Cattiaux and Guillin (2008) establish concentration inequalities for bounded functionals under a so-called weak Poincar e inequality, which is demonstrated to be equivalent to an α-mixing assumption on the process, cf. (Cattiaux and Guillin, 2008, Proposition 3.4). Recall that a stationary Markov process (Yt)t 0 with natural ﬁltration (Ft)t 0 and initial distribution ν is said to be α-mixing if the mixing coeﬃcient αν(t) := sups 0 sup A Fs,B Fs+t|Pν(A B) Pν(A)Pν(B)| tends to zero as t . It follows from (5), (6) and (7) that the stationary β-mixing coeﬃcient β(t) := R

Rd Px(Xt ) µ TV µ(dx) of our diﬀusion process satisﬁes

β(t) c exp ι t(1 q+)/(1+q+) ,

for any ι (0, ι ) and some constant c depending on ι , i.e., the stationary diﬀusion X is subexponentially β-mixing. Consequently, using the well-known fact that αµ(t) β(t), (Cattiaux and Guillin, 2008, Proposition 3.9) yields the following result for bounded f.

Trottner, Aeckerle-Willems and Strauch

Theorem 2 (Cattiaux and Guillin, 2008, Proposition 3.9) For ι (0, ι ), deﬁne

c(q, ι ) := 1

1/(1 q+) (1 q+)ι

(1+q+)/(2(1 q+)) . (12)

For any such ι , there exists a constant c > 0 such that, for all f F(0, L) and (u, t) R2 + such that c(1 + q+)(1 q+) (1 q+)/2 u < c(q, ι ) t /

t 1 q+, (13)

it holds Pµ |Gt(f)| > 2L c(q, ι ) 1u

1 1 q+ + t 1/2 2e u.

In the above result, the restriction on u in (13) is explained by the proof technique that makes use of general moment bounds for discrete α-mixing sequences from Rio (2017). This approach requires the integral to be divided into a ﬁnite number of blocks with a carefully chosen length that determines the degree of mixing of the block sequence. In the following, we add to this result by allowing polynomially growing integrands f. It is well-known that dropping the boundedness assumption poses major challenges in deriving concentration inequalities, some of which have been elegantly solved in Gao et al. (2014) for symmetric Markov processes satisfying (strong) functional inequalities. It should also be noted that in (Cattiaux and Guillin, 2008, Section 3.2) some arguments are provided how conclusions for unbounded integrands f can be drawn from Theorem 2 by employing a truncation technique. However, there appears to be a gap in the proposed strategy, which prevents it from being applicable for u > 0 such that u/

t is small. Since our ultimate focus is on applications of our concentration inequalities to the inference of PAC bounds for t 1/2Gt(f), we do not further pursue an approach relying on discrete mixing results, but employ a diﬀerent technique that is embedded more naturally in the continuous framework.

3.1 Continuous observations

Our main result for continuous observations is the following exponential concentration bound for polynomially bounded functions.

Theorem 3 There exists a constant W, depending on q, η and the diﬀusion coeﬃcients b and σ, such that, for any p 2, t > 0 and f F(η, L), we have

Gt(f) Lp(Pµ) LWp

1 2 + η+q +q+1

1 q+ . (14)

As a consequence, for any t > 0,

Pµ |Gt(f)| > e LWu

1 2 + η+q +q+1

1 q+ e u, u 2. (15)

The proof will be given by combining a sequence of technical lemmas that we develop in the following. An interpretation of the result will be stated later in Remark 7 since this requires making explicit reference to the proof. The ﬁrst result that we need are bounds on the Lp-norms of the invariant measure µ which are implied by its subexponential tails.

Concentration analysis of elliptic diffusions

Lemma 4 For all p 1, it holds

Eµ X0 p 1/p cq+p1/(1 q+),

cq+ = ee/2+(1 q+)/12((1 q+)ιe) 1/(1 q+) s

2π 1 q+ Eµ[Vq+(X0)].

Proof Let aq+ = ((1 q+)ιe) 1/(1 q+). Using Markov s inequality, it follows that, for any u 1, Pµ X0 e1/(1 q+)aq+u Eµ[Vq+(X0)] exp u1 q+/(1 q+) ,

with Eµ[Vq+(X0)] < due to (7). The assertion now follows from (Foucart and Rauhut, 2013, Proposition 7.13).

Next, we state the martingale approximation of the additive functional Gt(f) for polynomially bounded f with the help of the It o Krylov formula, which extends the usual It o formula for diﬀusion processes from functions with C2-regularity to functions with slightly weaker Sobolev regularity. This is necessary in light of the regularity of the solution to the Poisson equation that we described in Section 2.

Lemma 5 For any f F(η, L), we have a decomposition

t Mt(f) + 1

where (Mt(f))t 0 is a continuous square-integrable Pµ-martingale and both f 7 M (f) and f 7 R (f) are linear. Moreover, there exists a global constant c 1 such that, for any p 1, t 0 Eµ |Mt(f)|p 1/p cλ1/2 + p1/2

t L 1[f] Lp 2(µ), (16)

and Eµ |Rt(f)|p 1/p 2 L 1[f] Lp(µ). (17)

Proof For any p 1, L 1[f] W2,p loc (Rd), the coeﬃcients b, σ are locally bounded, σσ

is uniformly positive deﬁnite and (7) guarantees Eµ[ R t 0 b(Xs) 2 ds] < since b e Vq . Thus, we can apply the It o Krylov formula (cf. (Krylov, 2009, Theorem 2.10.1)) to obtain the Pµ-a.s. identities

L 1[f](Xt) = L 1[f](X0) + Z t

L 1[f](Xs) σ(Xs) d Ws + Z t

0 LL 1[f](Xs) ds

= L 1[f](X0) + Z t

L 1[f](Xs) σ(Xs) d Ws + Z t

0 f(Xs) ds, t 0.

Here, the second equality follows from LL 1[f] = f, λ-a.e., and Px(Xt ) λ for any (t, x) (0, ) Rd, which implies

0 LL 1[f](Xs) ds Z t

0 f(Xs) ds i Z t

Rd|LL 1[f](y) f(y)|ps(x, y) dy ds = 0.

Trottner, Aeckerle-Willems and Strauch

Thus, R t 0 LL 1[f](Xs) dt = R t 0 f(Xs) ds Px-a.s. for any x Rd and hence also Pµ-a.s. follows. Consequently,

0 f(Xs) ds = 1

t Mt(f) + 1

t Rt(f), t 0,

Mt(f) = Z t

L 1[f](Xs) σ(Xs) d Ws, t 0,

and Rt(f) = L 1[f](Xt) L 1[f](X0), t 0.

The square-integrable martingale property of M (f) follows from

L 1[f](x) e Vη+q +q+1(x), x Rd,

which is demonstrated in the proof of Lemma 6, such that (7) implies that Z t

0 Eµ L 1[f](Xs) σ(Xs) 2 ds tλ+µ e V2(η+q +q+1) < , t 0.

The bound (17) is now an immediate consequence of stationarity under Pµ and Minkowski s inequality. Moreover, using the Burkholder Davis Gundy inequality for continuous martingales started in 0 in the form given in (Barlow and Yor, 1982, Proposition 4.2), it follows that, for some c 1 and any p 1,

Eµ |Mt(f)|p cppp/2Eµ M (f) p/2 t

= cppp/2Eµh Z t

0 σ (Xs) L 1[f](Xs) 2 ds p/2i

cppp/2λp/2 + Eµh Z t

0 L 1[f](Xs) 2 ds p/2i .

Consequently, for p 2, ﬁrst using Jensen s inequality and then Fubini together with stationarity gives

Eµ |Mt(f)|p cppp/2λp/2 + tp/2Eµh1

0 L 1[f](Xs) p ds i

= cppp/2λp/2 + tp/2 L 1[f] p Lp(µ),

and hence Eµ |Mt(f)|p 1/p cλ1/2 + p1/2

t L 1[f] Lp(µ), t 0.

In case p [1, 2), we get from (18) with another application of Jensen s inequality and Fubini

Eµ |Mt(f)|p cppp/2λp/2 + Eµh1

L 1[f](Xs) 2 ds ip/2

= cppp/2λp/2 + tp/2 L 1[f] p L2(µ).

Concentration analysis of elliptic diffusions

Thus, for any p 1, (16) follows.

In view of (16) and (17), to exploit the martingale approximation we need concrete bounds on the solution of the Poisson equation L 1[f] and its gradient L 1[f]. This is the content of the next lemma, which can essentially be obtained from combining Lemma 4 with the Sobolev estimates from Pardoux and Veretennikov (2001) and Bogachev et al. (2018). For later reference and some clariﬁcation concerning the role of the drift growth, we give a full proof that simpliﬁes some arguments from Pardoux and Veretennikov (2001) thanks to Proposition 1.

Lemma 6 Let p 1. There exist constants U(q, η), V(q, q , η) (independent of p) such that, for any f = ef µ( ef) F(η, L),

L 1[f] Lp(µ) LU(q, η)p

and L 1[f] Lp(µ) LV(q, q , η)p

1 q+ . (20)

Proof By a slight adjustment to the proof of Proposition 4.1 in Bogachev et al. (2018)1, we obtain for any r > d L 1[f](x) 1 + sup y B(x,1) |b(y)| L 1[f] Lr(B(x,1)) + f Lr(B(x,1)). (21)

Therefore, using H older s inequality and the growth condition on the drift b, L 1[f] Lp(µ) (1 + q ) L2p(µ) L 1[f] Lr(B( ,1)) L2p(µ) + f Lr(B( ,1)) Lp(µ).

If q > 1, let γ > 2(1 + q). Then we can calculate as in the proof of (Pardoux and Veretennikov, 2001, Theorem 1) to obtain

|L 1[f](x)| Z

Rd| ef(y)||pt(x, y) ρ(y)| dy dt

Rd|pt(x, y) ρ(y)| dy 1/2 Z

Rd| ef(y)|2(pt(x, y) + ρ(y)) dy 1/2 dt

Pt(x, ) µ TV 1/2 Z

Rd| ef(y)|2(pt(x, y) + ρ(y)) dy 1/2 dt

LC (γ, q)(e Vγ(x))1/2 Z

0 (1 + t) γ (1+q)

Rd e Vη(y)2(pt(x, y) + ρ(y)) dy 1/2 dt

LC(η, γ, q)(e Vγ(x))1/2(1 + x η+(1+q)/2) Z

1 t (γ (1+q))/(1+q) dt

= LC (η, γ, q)e Vη+(1+q+γ)/2(x),

1. as the authors point out, the gradient bounds derived in (Pardoux and Veretennikov, 2001, Theorem 1) are only valid in case of bounded drift

Trottner, Aeckerle-Willems and Strauch

where we used Cauchy Schwarz for the second inequality and (9) for the third inequality. The last inequality arises from (7) and (11). A similar calculation, using exponential ergodicity with arbitrary polynomial penalty e Vγ in case q = 1 with any γ > 0, shows that the above estimate remains valid for q = 1. By the strong Markov property, we have for any R > 0 and τR := τB(0,R),

L 1[f](x) = Ex L 1[f](XτR) + Exh Z τR

0 f(Xt) dt i .

By the above, u is locally bounded and thus the ﬁrst term is bounded for any R > 0. For the second term, we can employ the It o formula argument from (Pardoux and Veretennikov, 2001, Theorem 2) to improve this bound to |L 1[f]| Le Vη+1+q. Alternatively, apart from the case η = 0, q = 1, we may simply note that, by setting Ψ1 = 1 and Ψ2 = Id, (10) yields that for any δ > 0

0 f(Xt) dt i Exh Z τR(δ)

0 |f(Xt)| dt i LC(η, q, R, δ)e Vη+1+q(x), x Rd.

Here we used that |f| Le Vη and that B(0, R) is accessible by λ-irreducibility of X implied by uniform positive deﬁniteness of σσ , cf. (Stramer and Tweedie, 1997, Theorem 2.3). Thus, |L 1[f](x)| CLe Vη+1+q(x), x Rd,

follows for some constant C depending on η and q. Consequently, for any p 1, Lemma 4 yields

L 1[f] Lp(µ) CL e Vη+1+q Lp(µ) CLcη+q+1 q p

|e Vη+q+1(x+y)| 2η+q(e Vη+q+1(x)+ e Vη+q+1(y)) 2η+q(2+ e Vη+q+1(x)), x Rd, y B(0, 1),

it also follows that

L 1[f] Lr(B( ,1)) L2p(µ) CL2η+q+1 e Vη+q+1 L2p(µ) LC2η+q+1cη+q+1 q (2p)

Moreover, | ef(x + y)| L2η(2 + e V (x)) for y B(0, 1) implies

f Lr B( ,1) Lp(µ) L2η 2 + (1 + λ(B(0, 1))) e Vη Lp(µ)

L2η(1 λ(B(0, 1)))(1 + cη q+pη/(1 q+)),

such that (22) allows us to conclude that

L 1[f] Lp(µ) LV(q, q , η)p

We are now ready to infer Theorem 3 from the previous results.

Concentration analysis of elliptic diffusions

Proof [Proof of Theorem 3] The moment bounds (14) are an immediate consequence of the combined statements of Lemma 5 and Lemma 6. By Markov s inequality, (14) implies (15).

Remark 7 It would be desirable that the concentration rate provided by Theorem 3 matches the rate in Theorem 2 for the bounded case η = 0 and the rates for polynomially growing integrands for scalar, exponentially ergodic diﬀusions with at most linear drift (i.e., d = 1, q = 0, q = 1) from (Aeckerle-Willems and Strauch, 2021, Proposition 7). The reason for the gap in the rate can be traced down to the Sobolev estimates (21), where the gradient L 1[f] is bounded in terms of L 1[f]. In contrast, the strategy in Aeckerle-Willems and Strauch (2021), see also Galtchouk and Pergamenshchikov (2007), works in the other direction. That is, by exploiting the explicit solution of the Poisson equation in d = 1, tight pointwise bounds on the gradient L 1[f] are established ﬁrst, which are then used to bound the remainder term L 1[f](Xt) L 1[f](X0) = R Xt X0 L 1[f](x) dx in the martingale approximation. Such a strategy is not feasible in the multivariate setting since L 1[f] is not explicitly known. Improving our concentration result Theorem 3 would therefore require tighter estimates on the solution of the Poisson equation and its gradient than those that can be achieved with the ideas from Pardoux and Veretennikov (2001). This is a challenging and interesting question for future research.

Stationary and non-stationary PAC bounds As an immediate consequence of Theorem 2 and Theorem 3, we can derive the following quantitative version of the ergodic theorem and a stationary PAC bound for (sub-) geometric diﬀusions. Let us deﬁne the rate function

ς(η, q, q ) :=

( 1 1 q+ , if η = 0,

1 2 + η+q +q+1

1 q+ , if η > 0,

and the sample length function

c(q,ι ) 1(log(2/δ))1/(1 q+)+1

1 ε/(2L) 2 , if η = 0, e LW(log(1/δ)) 1 2 + η+q +q+1

2 , if η > 0,

with c(q, ι ) deﬁned in (12) and W denoting the constant established in Theorem 3.

Corollary 8 Let f G(η, L), ε > 0 and δ (0, 1) such that δ < 2 exp( c(1 + q+)(1 q+) (1 q+)/2) if η = 0 and δ < e 2 if η > 0. Then, for t Ψ(ε, δ), it holds

0 f(Xs) ds µ(f) ε 1 δ. (23)

Moreover, for any increasing sequence (tn)n N R+ with infn N(tn tn 1) > 0, it holds for any δ0 > 0

lim n tn(log tn) (ς(η,q,q )+δ0) 1

0 f(Xs) ds µ(f) = 0, Pµ-a.s. (24)

Trottner, Aeckerle-Willems and Strauch

If f is bounded, we even have, for any eq (q+, 1),

t(log t) 1 1 eq 1

0 f(Xs) ds µ(f) = 0, Pµ-a.s.

Proof The ﬁrst two assertions immediately follow from Theorem 2 and Theorem 3. For any ε > 0 and δ > 0, there exists t(ε) e such that, for any t t(ε), we have

L{(e W) (c(q, ι ) 1 + 1)}(2 log t) δ0 1 q+ ε.

Consequently, by Theorem 2 and Theorem 3, it follows that for

t(2 log t) (ς(η,q,q )+δ0) 1

0 f(Xs) ds µ(f)

and t t(ε) such that, in case η = 0, additionally 2 log t c(1 + q+)(1 q+) (1 q+)/2,

Pµ(|Ut| > ε) Pµ |Gt(f)| > L{(e W) (c(q, ι ) 1 + 1)}(2 log t)ς(η,q,q ) t 2.

Thus, Pµ(|Ut| > ε) 1[0,t(ε)) + t 21[t(ε), ) =: gε(t), t > 0.

Since gε L1(R+) and is decreasing, it follows for a := infn N(tn+1 tn) > 0

tn gε(t) dt X

m n (tm+1 tm)gε(tm+1) a X

m n+1 gε(tm) a X

m n+1 Pµ(|Utm| > ε).

Hence, for any ε > 0, P n N Pµ(|Utn| > ε) < such that Borel Cantelli implies that limn Utn = 0, Pµ-a.s., which gives (24). This argument is borrowed from the proof of Lemma 3.1 in Bosq (1997). By the same lemma, it follows from the above that we even have convergence along any sequence (etn)n N, Pµ-a.s., provided that the map t 7 Ut is uniformly continuous Pµ-a.s. This can be easily veriﬁed when f is bounded (see, e.g., the proof of Proposition 4.3 in Bosq (1997)), which proves the last assertion.

To get a non-stationary PAC bound, we consider the burn-in sample average

Hv,t(f) := 1

t Gt(f) θv = 1

v f(Xs) ds, t > 0, v 0,

with burn-in length v. Our naming convention follows the MCMC literature, where a standard procedure of dealing with non-stationary simulation procedures is to run the simulation algorithm for a certain amount of time before collecting samples, which is usually referred to as the burn-in.

Corollary 9 Let ε > 0, δ (0, 1) such that δ < 2e 2 if η > 0 and δ < 4 exp( c(1+q+)(1 q+) (1 q+)/2) if η = 0. Let also ν be some probability distribution such that Vq+ L1(ν).

Concentration analysis of elliptic diffusions

Choose some ι (0, ι ), and deﬁne C := C(q+)c(ι ) Vq+ L1(ν), where c(ι ) is some constant such that

t 1 : (1 + t)

2q+ 1+q+ e (ι t)(1 q+)/(1+q+) c(ι )e (ι t)(1 q+)/(1+q+).

Then, for t Ψ(ε, δ/2) and burn-in length v 1 (log(2C/δ))(1+q+)/(1 q+)/ι , we have, for any f G(η, L), Pν(|Hv,t(f) µ(f)| ε) 1 δ.

Proof Under the given assumptions, (5), (6) imply

Px(Xt ) µ TV C Vq+(x) Vq+ L1(ν) e (ι t)(1 q+)/(1+q+). (25)

Deﬁne g(y) := Py(|t 1 R t 0 f(Xs) ds µ(f)| > ε). By the Markov property, (25) and the magnitude of the burn-in v, for any x Rd,

Px(|Hv,t(f) µ(f)| > ε) Pµ 1

0 f(Xs) ds µ(f) > ε

= Ex[g(Xv)] µ(g) Px(Xv ) µ TV C Vq+(x) Vq+ L1(ν) e (ι v)(1 q+)/(1+q+)

Vq+(x) Vq+ L1(ν)

Pν(|Hv,t(f) µ(f)| > ε) Pµ 1

0 f(Xs) ds µ(f) > ε

Px(|Hv,t(f) µ(f)| > ε) Pµ 1

0 f(Xs) ds µ(f) > ε ν(dx) δ

Consequently, using t Ψ(ε, δ/2) and (23), it follows by the triangle inequality that Pν(|Hv,t(f) µ(f)| > ε) δ.

3.2 Discrete observations

We now derive concentration inequalities for discrete observations from our continuous observation results by using the approximation strategy from Galtchouk and Pergamenshchikov (2013). In Galtchouk and Pergamenshchikov (2013), only bounded functions f and scalar exponentially ergodic diﬀusions in the quite strong regime (A (q)) with q = 1 are considered, which in particular implies sub-Gaussian tails of the invariant density. We demonstrate how this can be extended to the multivariate case for unbounded functions f under less restrictive ergodicity assumptions. For this purpose, the following technical key result from Galtchouk and Pergamenshchikov (2013) is of central importance.

Trottner, Aeckerle-Willems and Strauch

Lemma 10 (Galtchouk and Pergamenshchikov, 2013, Proposition A.1) Let a ﬁltered probability space (Ω, F, (Fj)j=1,...,n, P) a random vector (Xj)j=1,...,n be given, such that for all j {1, . . . , n}, Xj is Fj-measurable and in Lp(P) for some p 2. Then, for

bj,n(p) := E h |Xj|

k=j |E[Xk|Fj]| p/2i 2/p , j = 1, . . . , n,

j=1 Xj Lp(P) 2p

j=1 bj,n(p) 1/2 .

Let = n (0, 1] be some ﬁxed sampling distance, and suppose that we have partial observations (X k)k=1,...,n of the subexponentially ergodic diﬀusion process X satisfying the coeﬃcient assumptions from Section 2. Recall that

Gn, (f) = 1

denotes the discretized version of the scaled additive functional Gn (f). Then, for ﬁxed f = ef µ( ef), we may write

Gn, (f) = Gn (f) + 1

n An, , (26)

with discretization error

(k 1) ( ef(Xk ) ef(Xt)) dt.

With our results from Section 3, it is now clear that we must analyze the concentration of An, around 0 to obtain concentration inequalities for the discrete additive functional Gn, (f). To do so for unbounded functionals ef, we exploit polynomial f-norm convergence from Proposition 1.

Theorem 11 Let η1, η2, η3 0 and ef G(η1, L) W2,p loc (Rd), p d, with ef L2d loc(Rd) such that ef(x) 1 + x η2 and, for all i, j = 1, . . . , n, | xi,xj ef(x)| 1 + x η3. Deﬁne α = α(q , η2, η3) := (q + η2) η3. In case q > 1, let eγ > 1 + q, r > 1 such that eγ (1 + q) > r(α (1 + q)/(r 1)). If q = 1, set eγ = α. Then, for f = ef µ( ef), there exists a constant D that is independent of n, p, such that, for any p 2,

Gn, (f) Lp(Pµ) D n 3/2 + p

max{(eγ+2α+1 q+)/2,η2+1 q+}

1 2 + η+q +q+1

1 q+ =: Φ(n, , p).

Consequently, Pµ(|Gn, (f)| > eΦ(n, , u)) e u, u 2.

Concentration analysis of elliptic diffusions

Proof Let us write tk = k and a b if a Cb for some constant C independent of p, n, . The It o Krylov formula gives for t [0, tk]

ef(Xk ) ef(Xt) = Z tk

t L ef(Xs) ds + Z tk

t ef(Xs) σ(Xs) d Ws

= µ(L ef)(t tk) + Φk(t) + ωk(t),

where Φk(t) := R tk t φ(Xs) ds for φ(y) = L ef(y) µ(L ef) and ωk(t) := R tk t ef(Xs) σ(Xs) d Ws. Thus, setting Xk = R tk tk 1 Φk(t) dt and χk = R tk tk 1 ωk(t) dt, we have

An, = µ(L ef)n 2

k=1 χk. (27)

The polynomial bounds on the gradient and the Hessian of ef together with supx Rd σ(x) < and b(x) 1 + x q imply that |L ef(x)| 1 + x α for α := (q + η2) η3. Suppose ﬁrst q > 1, and let eγ > 1+q and r > 1 such that eγ (1+q) > r(α (1+q)/(r 1)). This implies that (eγ (1 + q))/r > α and (eγ (1 + q))/(s(1 + q)) > 1 for s = r/(r 1). Thus, if we choose the inverse Young functions Ψ1(x) = (sx)1/s and Ψ2(x) = (rx)1/r, it follows for feγ,q(x) = x eγ (1+q) that |L ef| 1 Ψ2 feγ,q. Proposition 1 then yields

Px(Xt ) µ 1 Ψ2 feγ,q e Veγ(x)(1 + t) (eγ (1+q))/(s(1+q)), x Rd, t 0. (28)

Let F = (Ft)t 0 be the natural ﬁltration of (Xt)t 0. Then, using the Markov property and (28), we obtain for t > u

|Eµ[φ(Xt)|Fu]| = |EXu[φ(Xt u)]| = |EXu[L ef(Xt u)] µ(L ef)|

PXu(Xt u ) µ 1 Ψ2 feγ,q e Veγ(Xu)(1 + (t u)) (eγ (1+q))/(s(1+q)).

For k > j, this gives

Eµ[Xk|Ftj] e Veγ(Xtj) Z tk

t (1 + (u tj)) (eγ (1+q))/(s(1+q)) du dt

e Veγ(Xtj) 2(1 + (k 1 j) ) (eγ (1+q))/(s(1+q)).

Hence, for j < n,

k=j+1 |Eµ[Xk|Ftj]| e Veγ(Xtj) 2 Z

0 (1 + t) (eγ (1+q))/(s(1+q)) dt = e Veγ(Xtj) s(1+q) eγ (1+q)(1+s),

where we used that (eγ (1+q))/(s(1+q)) > 1. Consequently, letting bj,n(p) be the functional from Lemma 10, it follows for j < n from the Cauchy Schwarz inequality, stationarity and Lemma 4

bj,n(p) Xj 2 L2p(Pµ)

k=j+1 |Eµ[Xk|Ftj]| Lp(Pµ) Xj 2 L2p(Pµ) e Veγ(X0) Lp(Pµ)

Trottner, Aeckerle-Willems and Strauch

eγ 1 q+ Xj 2 Lp(Pµ).

In case q = 1, we simply observe that Proposition 1 implies that there exists β > 0 such that Px(Xt ) µ e Vα e Vα(x) exp( βt) and hence, proceeding as above, we end up with

bj,n(p) Xj 2 L2p(Pµ) e Vα(X0) Lp(Pµ) p

α 1 q+ Xj 2 Lp(Pµ).

Now, stationarity under Pµ, H older s inequality together with Fubini and Lemma 4 yield

Xj p Lp(µ) = Eµh Z

t φ(Xs) ds dt pi 2(p 1) Z

t Eµ |φ(Xs)|p ds dt

= 2p φ(X0) p Lp(Pµ) cp 2pppα/(1 q+),

for some constant c > 0. Thus, we obtain

bj,n(p) 3p(eγ+2α)/(1 q+),

and hence by Lemma 10

k=1 Xk Lp(Pµ)

n 3p(eγ+2α+1 q+)/(2(1 q)). (29)

Let us now treat Pn k=1 χk. As in the proof of Lemma 5, we obtain by the Burkholder Davis Gundy inequality, (Barlow and Yor, 1982, Proposition 4.2) and Lemma 4 that

Eµ[|ωk|p] cppp/2λp + p/2 ef p Lp(µ) C(η2)pλp + p/2pp/2+pη2/(1 q+).

Therefore, with H older s inequality,

Eµ[|χk|p] p 1 Z tk

tk 1 Eµ[|ωk(t)|p] dt C(η2)pλp + 3p/2pp/2+pη2/(1 q+).

Let bχ n,p be the functional from Lemma 10 with respect to (χk)k=1,...,n and (Ftk)k=1,...,n, and note that, for k > j and t [tj, tk], Eµ[ωk(t)|Ftj] = 0 since ( R t 0 ef(Xs) σ(Xs) d Ws)t 0 is an F-martingale. Thus,

bχ j,n(p) = Eµ[|χj|p]2/p 3p(2η2+(1 q+))/(1 q+).

Consequently, by Lemma 10,

k=1 χk Lp(Pµ) n 3/2p(η2+1 q+)/(1 q+). (30)

Taking into account that µ(|L ef|) < , (27), (29) and (30) imply that 1

n An Lp(Pµ) n 3/2 + p

max{(eγ+2α+1 q+)/2,η2+1 q+}

Plugging this bound into (26) and using Theorem 3, it follows that there exists some constant D that is independent of n, p, such that, for p 1, Gn, (f) Lp(Pµ) D n 3/2 + p

max{(eγ+2α+1 q+)/2,η2+1 q+}

1 2 + η+q +q+1

Markov s inequality now yields the asserted concentration inequality.

Concentration analysis of elliptic diffusions

PAC bounds Similarly to Corollary 8 and Corollary 9, we can derive PAC bounds for the discrete ergodic average and its burn-in version. The proof is identical and therefore omitted.

Corollary 12 Let η1, η2, η3 0 and f G(η1, L) W2,p loc (Rd), p d, with f L2d loc(Rd) such that f(x) 1 + x η2 and, for all i, j = 1, . . . , d, | xi,xjf(x)| 1 + x η3. Deﬁne α, eγ as in Theorem 11, and denote

ϱ = ϱ(α, η2, eγ, q) := max{(eγ + 2α + 1 q+)/2, η2 + 1 q+}

eς = eς(η1, q, q ) := 1

2 + η + q + q + 1

For ε > 0, δ (0, e 2), suppose that < ε/(3e D) and

n Ψ( , ε, δ) := 1

3e D max (log(1/δ))ϱ, (log(1/δ))eς

k=1 f(Xk ) µ(f) ε 1 δ.

Moreover, let the discrete burn-in estimator be given by

Hm,n, (f) := 1

n Gn, (f) θm = 1

k=m+1 f(Xk ).

Then, given the constants ι , C from Corollary 9 and some initial distribution ν such that

Vq+ L1(ν), for any n Ψ( , ε, δ/2) and burn-in length m 1 1(log(2C/δ))

1+q+ 1 q+ /ι , it holds Pν |Hm,n, (f) µ(f)| ε 1 δ.

4. Applications

We now demonstrate the usefulness of our probabilistic results in two concrete applications. While exponential inequalities are important for a multitude of statistical problems (e.g., in the context of adaptive nonparametric estimation or for the veriﬁcation of uniform convergence results), we will focus in Section 4.1 on the analysis of a high-dimensional diﬀusion model under sparsity constraints, which in particular necessitates the use of inequalities for unbounded functions. Speciﬁcally, we will see that Theorem 3 allows us to derive nonasymptotic error bounds for penalised estimators, which, to the best of our knowledge, are so far only available for Ornstein Uhlenbeck processes. In Section 4.2, we use our discrete concentration results from Section 3.2 to derive explicit convergence guarantees for an MCMC algorithm designed to sample from target densities with subexponential tails.

Trottner, Aeckerle-Willems and Strauch

4.1 Lasso estimation for parametrized drift coeﬃcients

As opposed to the now very well understood high-dimensional discrete models (cf., e.g., B uhlmann and van de Geer (2011) or Wainwright (2019)), for which a wealth of estimation algorithms including corresponding theoretical results are available, there are still few in-depth studies of estimation problems for high-dimensional continuous-time processes. Important references in this context are Ga ıﬀas and Matulewicz (2019) and Cio lek et al. (2020), who investigate drift estimation in a high-dimensional Ornstein Uhlenbeck (OU) model under sparsity constraints. A remarkable feature is that the restricted eigenvalue property, which usually has to be veriﬁed explicitly in discrete models such as linear regression, is already implied by the ergodicity in the speciﬁed diﬀusion model. This ﬁnding is based on the use of suﬃciently sharp probabilistic tools in the form of concentration inequalities suited to the model: while Ga ıﬀas and Matulewicz (2019) provide a proof based on functional inequalities allowing to cover only the reversible case, Cio lek et al. (2020) use Malliavin calculus methods to show that the restricted eigenvalue property is satisﬁed in the general ergodic OU case. At the same time, they point out (cf. their Remark 4.4) that other mathematical methods are needed for proving such concentration phenomena in more general diﬀusion models. Motivated by the considerations in Pokern et al. (2009), we outline in this section how our results from Section 3.1 can be used to study more general high-dimensional diﬀusion models. Suppose that the data XT = (Xt)0 t T has been generated by the following It o SDE, d Xt = b0(Xt) dt + σ0(Xt) d Wt, (31)

W = (Wt)t 0 a standard d-dimensional Brownian motion. The diﬀusion matrix σ0 is assumed to be known and we wish to estimate the drift vector b0. Suppose that both σ0 and b0 are globally Lipschitz, that σ0 is bounded, that a0 := σ0σ 0 is uniformly elliptic, i.e.,

λ , λ+ > 0 x, η Rd : λ η 2 η, a0(x)η λ+ η 2,

and that the drift condition

(L (q)) there exists M0, r > 0 such that

x > M0 : b0(x), x/ x r x q

is satisﬁed for some q [ 1, 1), such that the process falls into the ergodic framework of Section 2. Denote by µ0 and ρ0 its invariant measure and the corresponding invariant density, respectively. We further assume that XT is the stationary solution, i.e., X0 µ0, and denote P := Pµ. Denote by Pb the law of Y T , where Y T = (Yt)0 t T is the strong solution to the SDE d Yt = b(Yt) dt + σ0(Yt) d Wt, Y0 = X0. Then, the Radon Nikodym derivative of Pb with respect to P0 is given as

d Pb d P0 (XT ) = exp 1

0 b (Xt)a 1 0 (Xt)b(Xt) dt + Z T

0 b (Xt)a 1 0 (Xt) d Xt

Concentration analysis of elliptic diffusions

(see (Liptser and Shiryaev, 2001, Section 7.6.4)). Given the data XT , one can derive the time scaled negative of the log likelihood functional for the unknown drift b. This functional is given by

b (Xt)a 1 0 (Xt)b(Xt) dt 2b (Xt)a 1 0 (Xt) d Xt . (32)

The log likelihood function (32) for b is unbounded from below in general if the data is ﬁnite, T < . However, letting T , (32) tends to a functional whose unique minimizer is b0. More precisely, it is shown in Lemma 6.1 in Pokern et al. (2009) that LT (b) converges a.s. towards the functional

1 2b (x)a 1 0 (x)(b(x) b0(x)) ρ0(x) dx.

In order to regularize (32), Pokern et al. (2009) suggest to assume a parametric structure of the drift coeﬃcient. For the class of generalised OU processes fulﬁlling the linear SDE

d Xt = AXt dt + σ d Wt, t 0, (33)

A and σ some d d-matrices and W a d-dimensional Brownian motion, this assumption is obviously satisﬁed. A more general, but still treatable class of processes is obtained as follows: Given a system (ψj)1 j N of Lipschitz continuous basis functions ψj : Rd Rd, introduce

V := n bθ( ) =

j=1 θjψj( ), θ RNo .

Let ψ( ) := (ψ1( ), . . . , ψN( )) be the dictionary matrix and deﬁne for x Rd, Ψ(x) := (σ 1 0 (x)ψ(x)) σ 1 0 (x)ψ(x). Let us also deﬁne the matrices

0 Ψ(Xs) ds = (ψij,T )1 i,j N and Ψ := E[Ψ(X0)] = (ψij, )1 i,j N

with entries

ψi(Xs), a 1 0 (Xs)ψj(Xs) ds,

ψi(x), a 1 0 (x)ψj(x) ρ0(x) dx, i, j = 1, . . . , N.

We impose the following assumptions on the dictionary:

(L 1) There exist L > 0 and η [0, 1] such that the maximal eigenvalue of Ψ(x) satisﬁes

λmax(Ψ(x)) L(1 + x 2η), x Rd;

(L 2) the random matrix ΨT is positive deﬁnite P-a.s.

Trottner, Aeckerle-Willems and Strauch

Assumption (L 1) allows a maximal polynomial drift of order η of the basis functions, where η [0, 1] is consistent with their assumed Lipschitz continuity. Assumption (L 2) is a necessary technical condition on the positive semideﬁnite matrix ΨT that we need for the penalized MLE to be well-deﬁned. It can be veriﬁed given suﬃcient smoothness of the dictionary, see Example 1. Moreover, since by stationarity Ψ = E[ΨT ], (L 2) implies that Ψ is positive deﬁnite. Therefore, if we denote the minimal eigenvalue of Ψ by λmin(Ψ ) =: e , then e > 0. Let us also set D := maxi=1,...,N ψii, . We now give an example of a dictionary that satisﬁes the above assumptions and can be used to model drifts satisfying the drift condition (L (q)).

Example 1 Let

Ei = 11+T(i d2Ti/d2U)/d U,1+(i 1) mod d, i = 1, . . . , nd2,

where 1k,l is the d d matrix whose (k, l)-th entry is 1 and all other entries are 0, and Tx U = max{z Z : z < x}. Set then, for eqi [ 1, 1) and eαi > 0,

ψi(x) = Eix(eαi + x ) (eqi+1), where eqi = eqj and eαi = eαj if Ti/d2U = Tj/d2U,

which is nothing else but saying that any b V can be written as

i=1 θiψi(x) =

i=1 Ai(θ)x(αi + x ) (qi+1), x Rd,

where N = nd2,

(Ai(θ))k,l = θ(i 1)d2+(k 1)d+l, i = 1, . . . , n and k, l = 1, . . . , d,

and qi = eq1+d2Ti/d2U, αi = eα1+d2Ti/d2U. Suppose that qi < qj for i > j, the matrices Ai(θ0) corresponding to the true value θ0 are symmetric, and that there exists k0 {1, . . . , n} such that λmax(Ak0(θ0)) < 0 and, for all k0 < k n, it holds λmax(Ak(θ0)) = 0. Then, it follows from the Courant Fischer theorem that, for any x = 0,

bθ0(x), x/ x =

i=1 x (αi + x ) (1+qi) x/ x , Ai(θ0)x/ x

i=1 x (αi + x ) (1+qi)λmax(Ai(θ0)).

This implies that there exists M0, c > 0 such that, for r = cλmax(Ak0(θ0)) > 0, the drift condition (L (q)) is satisﬁed for q = qk0. Let µ be the invariant distribution of the associated diﬀusion X. Also note that (L 1) holds for η = ( q1)+. To see that (L 2) is satisﬁed, note ﬁrst that, for any θ = 0, there exists some j {1, . . . , d} such that x 7 (bθ(x))j is analytic and not identical to zero on Rd \ {0}. Consequently, (bθ( )j) 1({0}) \ {0} = {x = 0 : (ψ(x)θ)j = 0} is contained in a countable union of smooth manifolds of dimension d 1, i.e., in a countable union of smooth hypersurfaces. Assume now that ΨT is not positive deﬁnite a.s. Since θ ΨT θ = R T 0 (σ 1 0 ψ)(Xs)θ 2 ds

Concentration analysis of elliptic diffusions

and, moreover, the matrix is positive semideﬁnite, the paths of X are continuous and Pµ(X0 = 0) = 0, this implies that there exists a measurable set Ω0 {X0 = 0} with Pµ(Ω0) > 0 such that, for any ω Ω0, the whole path (Xs(ω))s [0,T] is contained in (bθ( )j) 1({0}) for some θ = 0 and j {1, . . . , d}. It follows from above that on Ω0, the process stays in some smooth hypersurface for a strictly positive amount of time. Such path behaviour is however impossible a.s. for an elliptic diﬀusion process. Thus, ΨT must be positive deﬁnite P-a.s.

Note that the above example includes the OU models investigated in Ga ıﬀas and Matulewicz (2019); Cio lek et al. (2020) as a special case. Under Pb0, the above parametrisation yields the functional

σ 1 0 bθ (Xt) d Wt Z T

σ 1 0 (bθ bθ0)(Xt) 2 dt + Z T

σ 1 0 bθ0 (Xt) 2 dt

2θ ΨT θ θ h,

h denoting the vector with components

ψi(Xs), a 1 0 (Xs) d Xs , i = 1, . . . , N.

Using almost sure positive deﬁniteness of ΨT , it follows that on a set of full P-measure, the MLE is the unique minimizer of LT ( ), given by

bθMLE := Ψ 1 T h.

While this approach yields a well-deﬁned estimator, the MLE will perform quite inaccurately in high-dimensional settings. Our concern is to investigate the estimation of bθ in the large N/large T regime. More precisely, we want to study the statistical properties of penalized estimators bθT , deﬁned as

bθT = arg minθ RN {LT (θ) + λ θ 1}, (34)

λ > 0 some regularisation parameter. Strictly speaking, since positive deﬁniteness of ΨT holds only a.s., this estimator may only be well deﬁned in an almost sure sense, but by an appropriate restriction of the underlying probability space we can and will assume that it is well-deﬁned everywhere without loss of generality. Denote

θ1 θ2 2 L2 := 1

σ 1 0 (bθ1 bθ2)(Xt) 2 dt = (θ1 θ2) ΨT (θ1 θ2), θ1, θ2 RN.

Then, for any θ RN,

bθT θ0 2 L2 θ θ0 2 L2 + 2

σ 1 0 bbθT bθ (Xt) d Wt + 2λ θ 1 bθT 1

Trottner, Aeckerle-Willems and Strauch

In order to obtain error bounds for the Lasso estimator bθT , the martingale part appearing on the rhs of (35) needs to be controlled which is usually done by means of Bernstein s inequality for continuous martingales. Another important part of the derivation of error bounds is the veriﬁcation of the restricted eigenvalue condition which in our setting amounts in showing that

inf θ S1(s),η S2(s,θ) θ η 2 L2 θ η 2 is bounded away from 0 with high probability,

where, for θ 0 := P i 1{θi =0}, ﬁxed c0 > 0 and Is(θ) denoting a set of coordinates of s largest elements of θ,

C(s, c0) := n ζ RN : ζ 1 (1 + c0) ζ|Is(ζ) 1

S1(s) := θ RN : θ 0 = s and S2(s, θ) := η RN : θ η C(s, c0) .

To start with, we will demonstrate how our previous general developments can be used to verify these assumptions. In fact, our error bounds for the Lasso estimator formulated below are based on the following direct application of Theorem 3.

Lemma 13 There exists a constant W > 0 such that, for any vectors ζ RN with ζ 1 and R 2/

Ψ ΨT ζ > R exp

κ(q,η) , where κ(q, η) := 2(1 q+) 6η + 2q + 3 q+ .

Proof Observe ﬁrst that it suﬃces to prove the lemma for ζ = 1. Fix any such ζ and set efζ(x) = ζ Ψ(x)ζ and fζ = efζ µ0( efζ). By assumption (L 1), we have for any x Rd

| efζ(x)| = σ 1 0 (x)ψ(x)ζ 2 σ 1 0 (x)ψ(x) 2 = λmax(Ψ(x)) L(1 + x 2η).

Moreover, using σ0(x) = σ0(x) ,

max i=1,...,N ψi(x) p

λ+ max i=1,...,n σ 1 0 (x)ψi(x) = p

λ+ max i=1,...,N σ 1 0 (x)ψ(x)ei

λ+ σ 1 0 (x)ψ(x) p

Lλ+(1 + x η),

such that bθ0(x) p

Lλ+ θ0 1(1 + x η) follows. Consequently, Theorem 3 implies that there exists some constant W independent of ζ such that

Ψ ΨT ζ > R = P T 1/2|GT (fζ)| > R exp

We are now ready to verify the restricted eigenvalue property and state deviation bounds for the martingale term.

Concentration analysis of elliptic diffusions

Proposition 14 (a) For any ε0 (0, 1) and T T0(ε0, s, c0, LW), it holds

P inf θ S1(s),η S2(s,θ) θ η 2 L2 θ η 2 e

T0(ε0, s, c0, c) := n log 212s d ed

2s 2s log ε0 o 2 κ(q,η) 182(c0 + 2)2e2c2

(b) For s, c0 > 0, deﬁne the event

E (s, c0) := inf θ η C(s,c0) θ η 2 L2 θ η 2 e

max i=1,...,N ψii,T D + e

sup θ =η RN

1 T R T 0 σ 1 0 bθ bη (Xt) d Wt θ η 1 λ

Then, for any ε0 (0, 1), T T0(ε0

3 , s, c0, LW) and

it holds P(E (s, c0)) 1 ε0.

Proof Introduce K(s) := ζ RN \ {0} : ζ 0 s . Using Lemmata F.1 and F.3 of Basu and Michailidis (2015), it follows

sup ζ C(s,c0)

Ψ ΨT ζ ζ 2 3(c0 + 2) sup ζ K(2s)

Furthermore, for any subset E RN,

inf ζ E ζ 2 L2 ζ 2 = inf ζ E: ζ 1 ζ ΨT ζ

and for ζ = 0,

ζ 2 L2 ζ 2 = ζ Ψ ζ

Ψ ΨT ζ ζ 2 λmin(Ψ ) ζ

Ψ ΨT ζ ζ 2 .

The proof of Lemma F.2 in Basu and Michailidis (2015) allows to deduce from (36) that, for any R 2/

sup ζ K(s), ζ 1

Ψ ΨT ζ > 3R

Trottner, Aeckerle-Willems and Strauch

P inf ζ C(s,c0) ζ 2 L2 ζ 2 > e

sup ζ C(s,c0), ζ 1

sup ζ K(2s), ζ 1

Ψ ΨT ζ e 6(c0 + 2)

Te 18(c0 + 2)e LW

resulting in the asserted condition on the sample size T. For proving part (b), note ﬁrst that the relation max i=1,...,N

ψii,T ψii, > e

sup ζ C(s,c0)

in particular implies that, for T T0(ε0

3 , s, c0, LW),

P max i=1,...,N ψii,T > D + e

It remains to control the deviation of the martingale term. Given θ, η RN, we write

σ 1 0 (bθ bη) (Xt) d Wt = 2(θ η) (ε1,T , . . . , εN,T ) ,

σ 1 0 ψi (Xs) d Ws, i = 1, . . . , N. (38)

Note that the quadratic variation of this continuous martingale is given by

σ 1 0 ψi (Xs) σ 1 0 ψi (Xs) ds

ψi, a 1 0 ψi (Xs) ds = 1

Using Bernstein s inequality for continuous martingales and taking into account the condition on λ, we thus obtain for for T T0(ε0

3 , s, c0, LW)

sup θ =η RN

2 T R T 0 σ 1 0 (bθ bη) (Xt) d Wt θ η 1 > λ

2(θ η) (ε1,T , . . . , εN,T )

θ η 1 > λ, max i=1,...,N ψii,T < D + e

+ P max i=1,...,N ψii,T > D + e

Concentration analysis of elliptic diffusions

i=1 P 2|εi,T | > λ, 2εi T 4

On the basis of the given key deviation inequalities, the machinery of high-dimensional statistics now allows the derivation of oracle inequalities. Our proof strategy follows the one developed in Cio lek et al. (2020).

Theorem 15 (oracle inequality) Assume that we are given a continuous record of observations of the solution of (31), where b0 = bθ0 V with θ0 0 s0. Fix γ > 0 and ε0 (0, 1), and consider the Lasso estimator bθT introduced in (34). Then, for

and T T0 ε0

3 , s0, 3 + 4

γ , LW , (39)

with probability at least 1 ε0, we have

bθT θ0 2 L2 (1 + γ) inf θ RN: θ 0 s0

θ θ0 2 L2 + 9(2 + γ)2

2γ(1 + γ)e θ 0λ2 . (40)

Furthermore, for any λ fulﬁlling (39) and T T0 ε0

3 , s0, 3, LW , with probability at least 1 ε0, bθT θ0 2 L2 λ2 18s0

By specifying λ as proposed in (39), the previous result implies that an upper bound of order (s0 log N)/T on the squared L2 risk of the Lasso estimator bθT holds with high probability. In particular, in terms of the rate of convergence, our techniques give the same results as the concentration inequalities tailored to the speciﬁc OU model used in Ga ıﬀas and Matulewicz (2019) and Cio lek et al. (2020), respectively. Proof [Proof of Theorem 15] Recall the deﬁnition of the event E (s, c0) according to (37), and let s s0. On E (s, c0), the basic inequality (35) implies for any θ RN that

bθT θ0 2 L2 + λ bθT θ 1 θ θ0 2 L2 + 2λ θ 1 bθT 1 + bθT θ 1

θ θ0 2 L2 + 4λ bθT |supp(θ) θ 1. (42)

Assume now that θ RN fulﬁlls θ 0 s0, and consider the event

4λ bθT |supp(θ) θ 1 > γ θ θ0 2 L2. (43)

If this does not occur, (40) holds true since we obtain from (42) that bθT θ0 2 L2 (1 + γ) θ θ0 2 L2.

Trottner, Aeckerle-Willems and Strauch

Otherwise, if (43) holds, we have on E (s, c0) that

λ bθT θ 1 θ θ0 2 L2 + 4λ bθT |supp(θ) θ 1

bθT |supp(θ) θ 1 + 4λ bθT |supp(θ) θ 1

i.e., bθT θ C(s, c0) for the choice c0 = 3 + 4

γ , such that, after using Cauchy Schwarz,

bθT |supp(θ) θ 1 bθT θ p

Summing up,

bθT θ0 2 L2 θ θ0 2 L2 + 3λ

bθT θ0 L2 + θ θ0 L2 .

Applying the Young inequalities

bθT θ0 L2 ax

2 + bθT θ0 2 L2 1 2ax, θ θ0 L2 ax

2 + θ θ0 2 L2 1 2ax,

with a = (2 + γ)/(2γ) and x = 3λ p

2 θ 0/e , we ﬁnally obtain

bθT θ0 2 L2 (1 + γ) θ θ0 2 L2 + 9(2 + γ)2

2γ(1 + γ)e λ2 θ 0

For the proof of (41), note that, taking θ = θ0, (42) implies that, on E (s0, c0),

bθT θ0 2 L2 + λ bθT θ0 1 4λ bθT |supp(θ0) θ0 1.

Now, since bθT θ0 C(s0, 3) on E (s0, c0),

bθT θ0 2 L2 3λ bθT |supp(θ0) θ0 1 3λ r

which already gives the asserted inequality.

Remark 16 While our results are non-asymptotic, we do face a restriction in that the constant W appearing in the lower bound for the required sample size (see (39)) is not explicit. However, it appears to be very demanding to work out explicit constants in a general framework. In the spirit of Pokern et al. (2009), our arguments could also be carried out for the more restricted class of reversible diﬀusion processes by assuming a parametric form of the potential function and then considering a parametrized drift function bθ(x) = 1

2 div(a0(x)) 1

2a0(x) Vθ(x) for Vθ V. Although functional inequalities (e.g., of Poincar e-type) are applicable in this reversible framework, the control of the constants involved still constitutes a fundamental challenge.

Concentration analysis of elliptic diffusions

We conclude this section by brieﬂy categorising the results and sketching potential future research. Note ﬁrst that Theorem 2.7 in Dexheimer and Strauch (to appear) provides a lower bound on the Frobenius norm for the estimation of the matrix A in the d-dimensional OU model (33) with σ = Id over the class of s0-sparse matrices. Translating the number of parameters into our framework, the lower bound is of order s0 log(N/s0)/T. Compared to the upper bound of order (s0 log N)/T, there is thus only a logarithmic gap, appearing in this very form also in Ga ıﬀas and Matulewicz (2019) and Cio lek et al. (2020). As demonstrated in Dexheimer and Strauch (to appear) in the context of drift estimation for L evy-driven OU processes, the key to eliminating the logarithmic gap lies in a reﬁned deviation inequality for the stochastic error (in our context speciﬁed as εi,T as deﬁned in (38)). In fact, the combination of concentration inequalities in the sense of Lemma 13 (which is a rather straightforward consequence of our general Theorem 3) with the techniques from Section 3.2 of Dexheimer and Strauch (to appear) can be expected to allow the derivation of minimax optimal penalized estimators also for general diﬀusion models.

4.2 MCMC for moderately heavy-tailed targets

In general, Markov chain Monte Carlo (MCMC) is a collective term for algorithms relying on ergodicity of Markov chains that are (i) easy to simulate and (ii) speciﬁcally designed such that their invariant distribution approximates a given target density, for which samples are to be obtained. These algorithms have a long and rich history. At this point, we cannot give a detailed account of the literature which would do justice to the ﬁeld, but only want to point out its fundamental importance in connected areas such as Bayesian optimization or inverse problems in high dimensional contexts, where the posterior distribution becomes the target. Other than the fundamental theoretical work in Dalalyan (2017); Durmus and Moulines (2017) that will be discussed below, we refer to Dalalyan and Karagulyan (2019); Durmus et al. (2019); Durmus and Moulines (2019); Erdogdu et al. (2018); Erdogdu and Hosseinzadeh (2021); Teh et al. (2016); Vollmer et al. (2016) for some recent contributions that motivated our study. Our particular interest lies on the so called Unadjusted Langevin Algorithm (ULA), which we describe next. Suppose that we are given a target density π exp( U(x)) for some continuously diﬀerentiable function U : Rd R, which is usually referred to as the potential. Let us also assume that U is L-Lipschitz continuous such that the (unadjusted or overdamped) Langevin diﬀusion d Xt = U(Xt) dt +

2 d Wt, t 0,

has a unique strong solution, which is a Feller Markov process with invariant distribution

π(dx) = 1 R

Rd exp( U(y)) dy exp( U(x)) dx, x Rd.

To obtain samples with approximate distribution π and to approximate integrals π(f) for π-integrable functions f via the corresponding Monte Carlo estimator, in practice one needs to discretize the SDE to make simulation procedures feasible. The ULA uses a simple Euler discretization scheme as numerical SDE approximation, where the Euler discretization with step size is the Markov chain given by the stochastic diﬀerence equation

ϑ( ) n+1 = ϑ( ) n U(ϑ( ) n ) +

2 ξn+1, n N0, ϑ( ) 0 d= X0,

Trottner, Aeckerle-Willems and Strauch

where (ξn)n N is a sequence of i.i.d. standard normal random variables on Rd independent of ϑ( ) 0 . Sampling such a chain is computationally eﬃcient, provided the gradient U can be cheaply evaluated. By considering the time-inhomogeneous Markov process given as the strong solution to the SDE

d Z( ) t = b(Z( ), t) dt +

2 d Wt, t 0, Z( ) 0 = X0,

with non-anticipatory drift

k=0 U(zk )1[k ,(k+1) )(t), (z, t) C(R+, Rd) R+,

it is straightforward to show that the laws of (ϑ( ) n )n N0 and (Z( ) n )n N0 coincide. It has been observed in the literature Dalalyan (2017); Durmus and Moulines (2017) that for potentials U that are either strongly convex i.e., π is strongly log-concave or that are convex and superexponential outside some ball, explicit requirements on the step length and sample size n can be formulated to guarantee sampling with ε-precision in total variation or Wasserstein distance. For strongly log-concave densities, the natural connection to the gradient descent in a convex setting is pointed out in Dalalyan (2017). We now apply our previous results to obtain PAC bounds and related suggestions for sample size n, burn-in m and discretization step for the ULA Monte Carlo estimator of polynomially growing functions for moderately heavy-tailed target densities π such that

ι > 0, q (0, 1) such that Z

Rd exp(ι x 1 q) π(dx) < .

As follows from (7), this is the case if U satisﬁes (A (q)) with q (0, 1), i.e., there exists some M0, r > 0 such that

(U (q)) U(x), x/ x r x q, x M0.

This setting diﬀers substantially from the (strongly) convex setting in Dalalyan (2017); Durmus and Moulines (2017), whose assumptions imply (U (q)) with q [ 1, 0) and therefore, in particular, require the targets to have exponential moments, i.e., light tails. Heavy-tailed target densities implied by our assumption q (0, 1) become relevant, e.g., in Bayesian inverse problems with heavy-tailed noise or prior. As the following result demonstrates, the Euler discretized Markov chain ϑ( ) under (U (q)) inherits the subexponential ergodic behaviour from the original Langevin diﬀusion X, provided that U does not grow too fast. The proof is a straightforward application of the results from Douc et al. (2004) which is the discrete-time counterpart to Douc et al. (2009) and is postponed to Appendix A. Let (Px)x Rd be a family of probability measures such that ϑ( ) is started in x under Px.

Proposition 17 Let q (0, 1). Suppose that U satisﬁes (U (q)) and, moreover, for some M1 > 0, U(x) x β for β (1 q)/2. Then, for any > 0 in case β < (1 q)/2 or any r in case β = (1 q)/2, there exists an invariant probability measure π( ) for the chain ϑ( ) and there are constants c = c(q, ) > 0 and ec = ec(q, ) such that, for fq(x) (1 + x ) 2q exp(ec x 1 q) and rq(n) n 2q/(1+q) exp(cn(1 q)/(1+q)), we have for any x Rd and pairs of inverse Young functions (Ψ1, Ψ2) I

lim n Ψ1(rq(n)) Px(ϑ( ) n ) π( ) Ψ2 fq = 0.

Concentration analysis of elliptic diffusions

This convergence behaviour is in line and in fact states more precisely the ﬁndings from (Roberts and Tweedie, 1996, Section 3) for the ULA in d = 1 for the model class of sub Weibull distributions. It is vital to note that π and π( ) do not coincide, so even if the ULA converges at subgeometric rate for ﬁxed step size , we need to choose appropriately small to obtain useful approximations. We make this tuning parameter choice precise in the following. Typical potentials satisfying (U (q)) such as U(x) x 1 q outside some ball centered around 0 are not convex, and their gradient may converge at inﬁnity. In fact, if we have lim x U(x) = 0, then (Roberts and Tweedie, 1996, Theorem 2.4) implies that the Langevin diﬀusion X is not exponentially ergodic. Hence, we cannot expect π to have exponentially decaying tails. Therefore, in contrast to the usually encountered potentials exhibiting some degree of convexity, it is quite natural for our purposes to assume that U is bounded under (U (q)) for q (0, 1). This makes it easy to prove the following result quantifying convergence of ULA to the target π and the performance of the ULA Monte Carlo estimator with burn-in m,

Hϑ( ) m,n, (f) := 1

k=m+1 f ϑ( ) k ,

based on our results from Section 3.2 and the Girsanov argument underlying the total variation convergence result from Dalalyan (2017) for strongly convex potentials. Denote by Px,n X and Px,n Z( ) the laws of (Xt)t [0, n] and (Z( ) t )t [0,n ], respectively, under Px.

Proposition 18 Suppose that U C1(Rd) has an L-Lipschitz continuous and bounded gradient that satisﬁes (U (q)) for some q (0, 1).

(i) For any (0, 1] and initial distribution ν such that Vq L1(ν), it holds for any n N, Pν ϑ( ) n π TV Cν(Vq) exp (ι n )(1 q)/(1+q)

(1 + U( ) 2 )d L2n 2

for some constant C > 0 and ι (0, ι(1+q)/(1 q)(1 + q)(r ι(1 q))) for some ι < r/(1 q).

(ii) Let η1, η2, η3 0, f G(η1, L) W2,p loc (Rd), p d, with f L2d loc(Rd) such that f(x) 1 + x η2 and for all i, j = 1, . . . , d, | xi,xjf(x)| 1 + x η3. Let also C, ι , α, eγ, eς = eς(η1, q, 0), ϱ = ϱ(α, η2, eγ, q) be the constants from Corollary 12, adapted to the speciﬁc parameters of the Langevin diﬀusion. Then, for satisfying both < min{1, ε/(3e D), (log(1/δ))eς ϱ} and

2(1 + U( ) )d L2 (log(4/δ))2eς + ε2 2 + (log(4C/δ))(1+q)/(1 q)/ι ,

sample size n = n( , ε, δ) = Ψ( , ε, δ/4)

Trottner, Aeckerle-Willems and Strauch

and burn-in m = m( , ε, δ) = 1(log(4C/δ))(1+q)/(1 q)/ι ,

it holds for any initial distribution ν such that Vq L1(ν) that

Pν Hϑ( ) m,n, (f) π(f) ε 1 δ.

Proof As in the proof of Lemma 2 in Dalalyan (2017), see also Dalalyan and Tsybakov (2012), using L-Lipschitz continuity of U and Girsanov s theorem, it follows that the Kullback Leibler divergence of Px,n X wrt Px,n Z( ) fulﬁlls

KL Px,n X Px,n Z( ) L2 3

k=0 Ex U(Z( ) k ) 2 + d L2n 2

Thus, using U( )

d U( ) and Pinsker s inequality, it follows

Px(Xn ) Px ϑ( ) n TV Px,n X Px,n Z( ) TV

(1 + U( ) 2 )d L2n 2

(45) By triangle inequality, subexponential convergence in (5) with the parameters adapted to the Langevin diﬀusion and (45), we immediately obtain (44). Moreover, for given as in part (ii), the choice n = n( , ε, δ) and m = m( , ε, δ), if we deﬁne

gε((xt)t [0,(n+m) ]) = 1(ε, ) 1

k=m+1 (f(xk ) π(f)) ,

for a path (xt)t [0,(n+m) ] C([0, (n + m) ], Rd), it follows from (45) that

|Pν(|Hm,n, (f) π(f)| > ε) Pν(|Hϑ( ) n,m, (f) π(f)| > ε)|

= Eν gε (Xt)t [0,(n+m) ] Eν gε (Z( ) t )t [0,(n+m) ]

Px,(n+m) X Px,(n+m) Z( ) TV ν(dx)

(1 + U( ) 2 )d L2(n + m) 2

(1 + U( ) 2 )d L2 2

2 + (log(4/δ))2eς

ε2 + (log(4C/δ))(1+q)/(1 q)/ι 1/2

Statement (ii) now follows from Corollary 12 and triangle inequality.

The above result gives lower bounds on the required step length, sample size and burn-in for an ε-precise integral approximation of π(f) with probability at least 1 δ for polynomially bounded f with polynomially bounded weak derivative and Hessian. These are summarized in Table 1. An obvious application of this result are explicit ﬁnite sample guarantees for MCMC moment approximations of the target π.

Concentration analysis of elliptic diffusions

step length sample size n burn-in m

ε-prec. sampling ε2 d(log(C/ε))(1+q)/(1 q))

d(log(C/ε))2(1+q)/(1 q)

(ε, δ)-PAC bound (δε)2

d(log(1/δ))2(η1+(q+3)/2)/(1 q)

d D2(log(1/δ))(4(η1+(q+3)/2))/(1 q)

d(log(1/δ))2(η1+q+2)/(1 q)

Table 1: Order of suﬃcient sampling frequency , sample size n and burn-in m for (ε, δ)- PAC bounds and sampling within ε-TV margin

Remark 19 It should be noted that the exact dimensional dependence of D is not clear, which, similarly to the previous section, is an eﬀect of unspeciﬁed constants in the ergodicity and Sobolev bounds used for the derivation of the concentration inequalities. Overcoming this issue is highly non-trivial and subject of ongoing research eﬀorts. In contrast, the convex, respectively strongly convex, settings in Durmus and Moulines (2017); Dalalyan (2017) give rise to Poincar e, respectively log-Sobolev, inequalities with explicit constants such that the investigated required number of iterations for sampling within an ε-margin in total variation can be made explicit in terms of the dimension in these papers. According to the above, the simulation grid should be signiﬁcantly ﬁner and the sample size signiﬁcantly larger to obtain exact PAC-guarantees compared to the case when one would just be interested in sampling with ε-precision in total variation. Here, the dependence on the level ε is a natural correspondence to the sample sizes (and hence necessary number of gradient evaluations) found in Dalalyan (2017); Durmus and Moulines (2017).

Our results yield explicit and useful guarantees for a sampling scenario that is quite diﬀerent from what is usually encountered in the theoretical MCMC literature. Still, we expect that the dependence of ( , n, m) on δ for the PAC bounds can be improved in the sense that the δ2-dependency is likely too strict. Its occurrence is explained by our strategy to control the total variation distance between the law of the Langevin diﬀusion X and its numerical approximation Z( ) in terms of their KL-divergence using Pinsker s inequality. This leads to a suboptimal bound on the total variation distance, causing the additional dependence on δ2. We are not aware of any other approaches in the MCMC literature to control this loss on the path level. This issue can be possibly circumvented by deriving concentration inequalities for Hϑ( ) m,n, (f) around its mean directly and lift these to

concentration inequalities of Hϑ( ) m,n, (f) around the target π(f) by establishing appropriate bias estimates. This is the strategy pursued in (Durmus and Moulines, 2015, Proposition 18) an earlier preprint version of Durmus and Moulines (2017) where the authors infer a suﬃcient sample size n d log(1/δ)/ε4 and sampling frequency ε2/d for the ULA MC estimator of the integral π(f) for a strictly log-concave density (in particular, q = 1) and bounded f. Since we focus on applications that can be treated with our theoretical results from Section 3.2, we do not push further the issue of improving our bounds in the setting of a heavy-tailed target π and unbounded integrands f. Instead, we leave it open as an interesting question for future research.

Trottner, Aeckerle-Willems and Strauch

Acknowledgments

CS and LT gratefully acknowledge ﬁnancial support of Carlsberg Foundation Young Researcher Fellowship grant CF20-0640 Exploring the potential of nonparametric modelling of complex systems via SPDEs .

Appendix A. Remaining proofs

Proof [Proof of Proposition 1] By (Douc et al., 2009, Proposition 1), every compact set is petite and any skeleton chain is irreducible. Moreover, if we let V C2(Rd) such that V = x γ for x M0 and V 1, and we can show that LV is locally bounded and

LV (x) φ V (x)(1 + o(1)), x M0, (46)

for φ(x) = rγx(γ (1+q))/γ which is increasing, diﬀerentiable and concave on (0, ), it will follow from (Douc et al., 2009, Theorem 3.4) that, for any ε (0, 1), the condition D(Cε, V, φε, aε) is satisﬁed for φε = (1 ε)φ, Cε = B(0, Mε) for Mε M0 large enough and aε = sup x Mε|LV (x) + φε V (x)|. This then implies the result using Theorem 3.2 and Proposition 4.6 from Douc et al. (2009). (Note that in the notation of Douc et al. (2009), f = φε V fγ,q, H 1 φε (t) = (1 + (1 + q)(1 ε)t/γ)γ/(1+q) for q ( 1, 1) and H 1 φε (t) = exp( rγ(1 ε)t) for q = 1, hence r (t) = φε H 1 φε (t) rγ,q(t).) Since b, σ are locally bounded and L is a local operator, it is immediate that LV is locally bounded as well. Further, for x M0, (A (q)) implies

b(x), V (x) = γ x γ 1 b(x), x/ x rγ x γ 1 q = φ V (x),

and the assumptions on the diﬀusion matrix yield

|tr a(x)D2V (x) | =

i,j=1 ai,j(x) 1{i=j}γ x γ 2 + γ(γ 2)xixj x γ 4

(Λγd + γ|γ 2|λ+) x γ 2 = o(φ V (x)).

This gives (46) and therefore the result.

Proof [Proof of Proposition 17] Let P ( )(x, B) = Px(ϑ( ) n B), (x, B) Rd B(Rd), be the transition kernel of the Markov chain ϑ( ). Since P ( )(x, ) = N(x h U(x), 2h Id), it follows from classical Meyn Tweedie arguments (cf. (Hansen, 2003, Theorem 3.1) for the precise statement) that P ( ) is an aperiodic and λ-irreducible Markov kernel and that all compact sets are small and hence petite. Let Φ(x) = x U(x) such that we may write ϑ( ) n+1 = Φ(ϑ( ) n ) +

2 ξn+1. By our assumptions on the gradient U, we can choose M M0 M1 1 large enough such that, for x M, we have

Φ(x) 2 = x 2 2 x, U(x) + 2 U(x) 2

x 2 2 r x 1 q + 2 x 2β

x 2 1 r x (1+q)

Concentration analysis of elliptic diffusions

2 x (1+q) 2.

Hence, Assumption 3.4 from Douc et al. (2004) is fulﬁlled with R0 = M, ρ = 1+q, r = r /2. Moreover, since the noise (ξn)n N is i.i.d. Gaussian, Assumption 3.3 from Douc et al. (2004) is satisﬁed for any z0 > 0 and γ0 = 1. It thus follows from (Douc et al., 2004, Theorem 3.3) that their central drift condition D(φ, V, C) holds for φ(v) = cv(1 + log v) 2q/(1 q),

V (x) = ez x 1 q and the compact set C = B(0, f M), for some c, z > 0 and f M M. Consequently, for Hφ(t) = R t 1 1/φ(v) dv, we have rq φ H 1 φ and fq φ V for appropriate choices of the constants c(q, ), ec(q, ). Since P ( ) is irreducible and aperiodic and C is petite, we can now apply (Douc et al., 2004, Theorem 2.8) to prove the claim.

C. Aeckerle-Willems and C. Strauch. Concentration of scalar ergodic diﬀusions and some statistical implications. Ann. Inst. Henri Poincar e Probab. Stat., 57(4):1857 1887, 2021.

C. Aeckerle-Willems and C. Strauch. Sup-norm adaptive drift estimation for multivariate nonreversible diﬀusions. Ann. Statist., 50(6):3484 3509, 2022.

M. T. Barlow and M. Yor. Semimartingale inequalities via the Garsia-Rodemich-Rumsey lemma, and applications to local times. J. Functional Analysis, 49(2):198 229, 1982.

S. Basu and G. Michailidis. Regularized estimation in sparse high-dimensional time series models. Ann. Statist., 43(4):1535 1567, 2015.

V. I. Bogachev, M. R ockner, and S. V. Shaposhnikov. The Poisson equation and estimates for distances between stationary distributions of diﬀusions. J. Math. Sci. (N.Y.), 232(3, Problems in mathematical analysis. No. 92 (Russian)):254 282, 2018.

D. Bosq. Parametric rates of nonparametric estimators and predictors for continuous time processes. Ann. Statist., 25(3):982 1000, 1997.

P. B uhlmann and S. van de Geer. Statistics for high-dimensional data. Springer Series in Statistics. Springer, Heidelberg, 2011. Methods, theory and applications.

P. Cattiaux and A. Guillin. Deviation bounds for additive functionals of Markov processes. ESAIM Probab. Stat., 12:12 29, 2008.

G. Cio lek, D. Marushkevych, and M. Podolskij. On Dantzig and Lasso estimators of the drift in a high dimensional Ornstein-Uhlenbeck model. Electron. J. Stat., 14(2):4395 4420, 2020.

A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and logconcave densities. J. R. Stat. Soc. Ser. B. Stat. Methodol., 79(3):651 676, 2017.

A. S. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Process. Appl., 129(12):5278 5311, 2019.

Trottner, Aeckerle-Willems and Strauch

A. S. Dalalyan and A. B. Tsybakov. Sparse regression learning by aggregation and Langevin Monte-Carlo. J. Comput. System Sci., 78(5):1423 1443, 2012.

L. Devroye, L. Gy orﬁ, and G. Lugosi. A probabilistic theory of pattern recognition, volume 31 of Applications of Mathematics (New York). Springer-Verlag, New York, 1996.

N. Dexheimer and C. Strauch. On Lasso and Slope drift estimators for L evy-driven Ornstein Uhlenbeck processes. Bernoulli, to appear.

R. Douc, G. Fort, E. Moulines, and P. Soulier. Practical drift conditions for subgeometric rates of convergence. Ann. Appl. Probab., 14(3):1353 1377, 2004.

R. Douc, A. Guillin, and E. Moulines. Bounds on regeneration times and limit theorems for subgeometric Markov chains. Ann. Inst. Henri Poincar e Probab. Stat., 44(2):239 257, 2008.

R. Douc, G. Fort, and A. Guillin. Subgeometric rates of convergence of f-ergodic strong Markov processes. Stochastic Process. Appl., 119(3):897 923, 2009.

A. Durmus and E. Moulines. Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm. ar Xiv:1507.05021v1, 2015.

A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551 1587, 2017.

A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854 2882, 2019.

A. Durmus, S. Majewski, and B. a. Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. J. Mach. Learn. Res., 20:Paper No. 73, 46, 2019.

M. A. Erdogdu and R. Hosseinzadeh. On the convergence of langevin monte carlo: The interplay between tail growth and smoothness. In M. Belkin and S. Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 1776 1822. PMLR, 15 19 Aug 2021.

M. A. Erdogdu, L. Mackey, and O. Shamir. Global non-convex optimization with discretized diﬀusions. Advances in Neural Information Processing Systems, 31, 2018.

J. Fan, B. Jiang, and Q. Sun. Hoeﬀding s inequality for general Markov chains and its applications to statistical learning. J. Mach. Learn. Res., 22:Paper No. 139, 35, 2021.

G. Fort and G. O. Roberts. Subgeometric ergodicity of strong Markov processes. Ann. Appl. Probab., 15(2):1565 1589, 2005.

S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing. Applied and Numerical Harmonic Analysis. Birkh auser/Springer, New York, 2013.

S. Ga ıﬀas and G. Matulewicz. Sparse inference of the drift of a high-dimensional Ornstein Uhlenbeck process. J. Multivariate Anal., 169:1 20, 2019.

Concentration analysis of elliptic diffusions

L. Galtchouk and S. Pergamenshchikov. Uniform concentration inequality for ergodic diffusion processes. Stochastic Process. Appl., 117(7):830 839, 2007.

L. Galtchouk and S. Pergamenshchikov. Uniform concentration inequality for ergodic diffusion processes observed at discrete times. Stochastic Process. Appl., 123(1):91 109, 2013.

F. Gao, A. Guillin, and L. Wu. Bernstein-type concentration inequalities for symmetric Markov processes. Theory Probab. Appl., 58(3):358 382, 2014.

N. R. Hansen. Geometric ergodicity of discrete-time approximations to multivariate diﬀusions. Bernoulli, 9(4):725 743, 2003.

S. F. Jarner and G. O. Roberts. Polynomial convergence rates of Markov chains. Ann. Appl. Probab., 12(1):224 247, 2002.

B. Jiang, Q. Sun, and J. Fan. Bernstein s inequality for general markov chains, 2018. ar Xiv:1805.10721.

N. V. Krylov. Controlled diﬀusion processes, volume 14 of Stochastic Modelling and Applied Probability. Springer-Verlag, Berlin, 2009. Translated from the 1977 Russian original by A. B. Aries, Reprint of the 1980 edition.

R. S. Liptser and A. N. Shiryaev. Statistics of random processes. I, volume 5 of Applications of Mathematics (New York). Springer-Verlag, Berlin, expanded edition, 2001. General theory, Stochastic Modelling and Applied Probability.

M. N. Malyshkin. Subexponential estimates for the rate of convergence to the invariant measure for stochastic diﬀerential equations. Teor. Veroyatnost. i Primenen., 45(3):489 504, 2000.

J. C. Mattingly, A. M. Stuart, and M. V. Tretyakov. Convergence of numerical timeaveraging and stationary measures via Poisson equations. SIAM J. Numer. Anal., 48(2): 552 577, 2010.

R. Nickl and K. Ray. Nonparametric statistical inference for drift vector ﬁelds of multidimensional diﬀusions. Ann. Statist., 48(3):1383 1408, 2020.

E. Pardoux and A. Y. Veretennikov. On the Poisson equation and diﬀusion approximation. I. Ann. Probab., 29(3):1061 1085, 2001.

Y. Pokern, A. M. Stuart, and E. Vanden-Eijnden. Remarks on drift estimation for diﬀusion processes. Multiscale Model. Simul., 8(1):69 95, 2009.

E. Rio. Asymptotic theory of weakly dependent random processes, volume 80 of Probability Theory and Stochastic Modelling. Springer, Berlin, 2017.

G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341 363, 1996.

Trottner, Aeckerle-Willems and Strauch

O. Stramer and R. L. Tweedie. Existence and stability of weak solutions to stochastic diﬀerential equations with non-smooth coeﬃcients. Statist. Sinica, 7(3):577 593, 1997.

D. W. Stroock and S. R. S. Varadhan. Multidimensional diﬀusion processes. Classics in Mathematics. Springer-Verlag, Berlin, 2006. Reprint of the 1997 edition.

Y. W. Teh, A. H. Thiery, and S. J. Vollmer. Consistency and ﬂuctuations for stochastic gradient Langevin dynamics. J. Mach. Learn. Res., 17:Paper No. 7, 33, 2016.

P. Tuominen and R. L. Tweedie. Subgeometric rates of convergence of f-ergodic Markov chains. Adv. in Appl. Probab., 26(3):775 798, 1994.

S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. J. Mach. Learn. Res., 17:Paper No. 159, 45, 2016.

M. J. Wainwright. High-dimensional statistics, volume 48 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2019. A nonasymptotic viewpoint.