# robust_stochastic_optimization_via_gradient_quantile_clipping__3c0dfc67.pdf

Published in Transactions on Machine Learning Research (10/2024)

Robust Stochastic Optimization via Gradient Quantile Clipping

Ibrahim Merad imerad@lpsm.paris LPSM, UMR 8001 Université Paris Cité, Paris, France

Stéphane Gaïffas stephane.gaiffas@lpsm.paris LPSM, UMR 8001 Université Paris Cité, Paris, France DMA, École normale supérieure

Reviewed on Open Review: https: // openreview. net/ forum? id= HCRk V3kx HW

We introduce a clipping strategy for Stochastic Gradient Descent (SGD) which uses quantiles of the gradient norm as clipping thresholds. We prove that this new strategy provides a robust and efficient optimization algorithm for smooth objectives (convex or non-convex), that tolerates heavy-tailed samples (including infinite variance) and a fraction of outliers in the data stream akin to Huber contamination. Our mathematical analysis leverages the connection between constant step size SGD and Markov chains and handles the bias introduced by clipping in an original way. For strongly convex objectives, we prove that the iteration converges to a concentrated distribution and derive high probability bounds on the final estimation error. In the non-convex case, we prove that the limit distribution is localized on a neighborhood with low gradient. We propose an implementation of this algorithm using rolling quantiles which leads to a highly efficient optimization procedure with strong robustness properties, as confirmed by our numerical experiments.

1 Introduction

Stochastic gradient descent (SGD) (Robbins & Monro, 1951) is the core optimization algorithm at the origin of most stochastic optimization procedures (Kingma & Ba, 2014; Defazio et al., 2014; Johnson & Zhang, 2013). SGD and its variants are ubiquitously employed in machine learning in order to train most models (Kushner & Yin, 2003; Benveniste et al., 2012; Lan, 2020; Shalev-Shwartz et al., 2007; Bottou et al., 2018; Ma et al., 2018). The convergence properties of SGD are therefore subjects of major interest.

Early studies of SGD convergence generally relied on strong assumptions such as bounded domain (Shalev Shwartz et al., 2009) or uniformly bounded gradient variance (Rakhlin et al., 2011) and obtained error bounds in expectation. With the recent resurgence of interest for robust statistics (Hsu & Sabato, 2016; Diakonikolas et al., 2019; Lecué & Lerasle, 2017; Prasad et al., 2018), variants of SGD based on clipping are shown to be robust to heavy-tailed gradients (Gorbunov et al., 2020; Tsai et al., 2022), where the gradient samples are only required to have a finite variance. The latter requirement has been further weakened to the existence of a q-th moment for some q > 1 in (Sadiev et al., 2023; Nguyen et al., 2023). In this paper, we go further and show that another variant of clipped SGD with proper thresholds is robust both to heavy tails and outliers in the data stream.

Robust statistics appeared in the 60s with the pioneering works of Huber, Tukey and others (Tukey, 1960; Huber, 1992; 1972; Rousseeuw & Hubert, 2011; Hampel, 1971). More recently, the field found new momentum thanks to a series of works about robust scalar mean estimation (Catoni, 2012; Alon et al., 1996; Jerrum

Published in Transactions on Machine Learning Research (10/2024)

et al., 1986; Lugosi & Mendelson, 2021) and the more challenging multidimensional case (Hopkins, 2020; Catoni & Giulini, 2018; Lugosi & Mendelson, 2019; Minsker, 2015; Cherapanamjeri et al., 2019; Depersin & Lecué, 2022; Lei et al., 2020; Diakonikolas et al., 2020). These paved the way to the elaboration of a host of robust learning algorithms (Holland & Ikeda, 2019; Prasad et al., 2018; Lecué & Lerasle, 2017; Liu et al., 2020; Pensia et al., 2020) which have to date overwhelmingly focused on the batch learning setting. We consider the setting of streaming stochastic optimization (Bottou & Cun, 2003; Bottou & Lecun, 2005; Mc Mahan et al., 2013), which raises an additional difficulty coming from the fact that algorithms can see each sample only once and must operate under an O(d) memory and complexity constraint for d-dimensional optimization problems. A limited number of papers (Tsai et al., 2022; Nazin et al., 2019; Diakonikolas et al., 2022) propose theoretical guarantees for robust algorithms learning from streaming data.

This work introduces such an algorithm that learns from data on the fly and is robust both to heavy tails and outliers, with minimal computational overhead and sound theoretical guarantees.

We consider the problem of minimizing a smooth objective

min θ Rd L(θ) := Eζ[ℓ(θ, ζ)] (1)

using observations G(θ, ζt) of the unknown gradient L(θ), based on samples (ζt)t 0 received sequentially that include corruptions with probability η < 1/2. Formulation (1) is common to numerous machine learning problems where ℓis a loss function evaluating the fit of a model with parameters θ on a sample ζ, the expectation E is w.r.t the unknown uncorrupted sample distribution.

We introduce quantile-clipped SGD (QC-SGD) which uses the iteration

θt+1 =θt αθtβG(θt, ζt) with αθt =min 1, τθt G(θt, ζt)

where β > 0 is a constant step size and αθt is the clipping factor with threshold chosen as the p-th quantile τθt = Qp( e G(θt, ζt) ) with e G(θt, ζt) an uncorrupted sample of L(θt) and p (0, 1) (details will follow). Quantiles are a natural choice of clipping threshold which allows to handle heavy tails (Rothenberg et al., 1964; Bloch, 1966) and corrupted data. For instance, the trimmed mean offers a robust and computationally efficient estimator of a scalar expectation (Lugosi & Mendelson, 2021). Since the quantile Qp( e G(θt, ζt) ) is non-observable, we introduce a method based on rolling quantiles in Section 5 which keeps the procedure O(d) both memory and complexity-wise. The main benefit of QC-SGD 2 is to grant robustness to the presence of a proportion η < 1/2 of corruptions in the stream of gradient samples. This could not be achieved by previous clipped SGD methods (Gorbunov et al., 2020; Tsai et al., 2022; Sadiev et al., 2023; Nguyen et al., 2023). We also show that iteration (2) is adaptive to heavy-tailed gradient variance and converges to a limit distribution with strong concentration properties.

Contributions. Our main contributions are as follows:

For small enough η and well-chosen p, we show that, whenever the optimization objective is smooth and strongly convex, QC-SGD converges geometrically to a limit distribution such that the deviation around the optimum achieves the optimal dependence on η.

In the non-corrupted case η = 0 and with a strongly convex objective, we prove that a coordinated choice of β and p ensures that the limit distribution is sub-Gaussian with constant of order O( β). In the corrupted case η > 0, the limit distribution is sub-exponential.

For a smooth objective (non-convex) whose gradient satisfies an identifiability condition, we prove that the total variation distance between QC-SGD iterates and its limit distribution vanishes sublinearly. In this case, the limit distribution is such that the deviation of the objective gradient is optimally controlled in terms of η.

Finally, we provide experiments to demonstrate that QC-SGD can be easily and efficiently implemented by estimating Qp( e G(θt, ζt) ) with rolling quantiles. In particular, we show that the iteration is indeed robust to heavy tails and corruption on multiple stochastic optimization tasks.

Published in Transactions on Machine Learning Research (10/2024)

Our theoretical results are derived thanks to a modelling through Markov chains and hold under an Lq assumption on the gradient distribution with q > 1.

Related works. Convergence in distribution of the Markov chain generated by constant step size SGD, relatively to the Wasserstein metric, was first established in (Dieuleveut et al., 2020). Another geometric convergence result was derived in (Yu et al., 2021) for non-convex, non-smooth, but quadratically growing objectives, where a convergence statement relatively to a weighted total variation distance is given and a CLT is established. These papers do not consider robustness to heavy tails or outliers. Early works proposed stochastic optimization and parameter estimation algorithms which are robust to a wide class of noise of distributions (Martin & Masreliez, 1975; Polyak & Tsypkin, 1979; 1981; Price & Vande Linde, 1979; Stanković & Kovačević, 1986; Chen et al., 1987; Chen & Gao, 1989; Nazin et al., 1992), where asymptotic convergence guarantees are stated for large sample sizes. Initial evidence of the robustness of clipped SGD to heavy tails was given by (Zhang et al., 2020) who obtained results in expectation. Subsequent works derived highconfidence sub-Gaussian performance bounds under a finite variance assumption (Gorbunov et al., 2020; Tsai et al., 2022) and later under an Lq assumption (Sadiev et al., 2023; Nguyen et al., 2023) with q > 1. A similar SGD clipping scheme to (2) is presented in (Seetharaman et al., 2020), however, in contrast to our work, they do not consider the robust setting and focus on experimental study while we also provide theoretical guarantees.

Robust versions of Stochastic Mirror Descent (SMD) are introduced in (Nazin et al., 2019; Juditsky et al., 2023). For a proper choice of the mirror map, SMD is shown to handle infinite variance gradients without any explicit clipping (Nemirovskij & Yudin, 1983; Vural et al., 2022). Finally, (Diakonikolas et al., 2022) study heavy-tailed and outlier robust streaming estimation algorithms of the expectation and covariance. On this basis, robust algorithms for linear and logistic regression are derived. However, the involved filtering procedure is hard to implement in practice and no numerical evaluation of the considered approach is proposed.

Agenda. In Section 2 we set notations, state the assumptions required by our theoretical results and provide some necessary background on continuous state Markov chains. In Section 3, we state our results for strongly convex objectives including geometric ergodicity of QC-SGD (Theorem 1), characterizations of the limit distribution and deviation bounds on the final estimate. In Section 4, we remove the convexity assumption and obtain a weaker ergodicity result (Theorem 2) and characterize the limit distribution in terms of the deviations of the objective gradient. Finally, we present a rolling quantile procedure in Section 5 and demonstrate its performance through a few numerical experiments on synthetic and real data.

2 Preliminaries

The model parameter space is Rd endowed with the Euclidean norm , B(Rd) is the Borel σ-algebra of Rd

and we denote by M1(Rd) the set of probability measures over Rd. We assume throughout the paper that the objective L is smooth. Assumption 1. The objective L is L-Lipschitz-smooth, namely

L(θ ) L(θ) + L(θ), θ θ + L

with L < + for all θ, θ Rd.

The results from Section 3 below use the following Assumption 2. The objective L is µ-strongly convex, namely

L(θ ) L(θ) + L(θ), θ θ + µ

with µ > 0 for all θ, θ Rd.

An immediate consequence of Assumption 2 is the existence of a unique minimizer θ = arg minθ Rd L(θ). The next assumption formalizes our corruption model.

Published in Transactions on Machine Learning Research (10/2024)

Assumption 3 (η-corruption). The gradients (G(θt, ζt))t 0 used in Iteration (2) are sampled as G(θt, ζt) = Ut q G(θt) + (1 Ut) e G(θt, ζt) where Ut are i.i.d Bernoulli random variables with parameter η < 1/2, q G(θt) DO(θt) with DO(θt) an arbitrary distribution and e G(θt, ζt) DI(θt) follows the true gradient distribution and is independent from the past given θt.

Assumption 3 is an online analog of the Huber contamination model (Huber, 1965; 1992) where corruptions occur with probability η and where the distribution of corrupted samples DO(θt) (outliers) is not fixed and may depend on the current iterate θt. On the other hand, DI(θt) denotes the distribution of inliers. This notation and dichotomy between inliers and outliers follows the example of (Lecué & Lerasle, 2017; Lecué & Lerasle, 2019). Assumption 3 corresponds to additive contamination (Diakonikolas & Kane, 2023, Section 1.2.2) where corruptions are only added to the data. A more general TV-contamination model allowing for true samples to be adversely removed is used in (Diakonikolas et al., 2022). Note however, that the latter mainly focuses on mean estimation. Note also that additive contamination remains realistic since it accounts for invalid entries occurring in the data stream even if it doesn t support entries being targeted and censored. The next assumption requires the true gradient distribution to be unbiased and diffuse.

Assumption 4. For all θ, non-corrupted gradient samples e G(θ, ζ) DI(θ) are such that

e G(θ, ζ) = L(θ) + εθ, (3)

where εθ is a centered noise E[εθ|θ] = 0 with distribution δνθ,1 + (1 δ)νθ,2 where δ > 0 and νθ,1, νθ,2 are distributions over Rd such that νθ,1 admits a density hθ w.r.t. the Lebesgue measure satisfying

inf ω R hθ(ω) > κ(R) > 0

for all R > 0, where κ( ) is independent of θ.

In addition to the unbiased property, Assumption 4 imposes that the noise distribution be expressible as the combination of two components, one of which must be diffuse with density satisfying a minorization inequality. Note that this is a weak constraint since it is satisfied, for example, as soon as the noise εθ admits a density w.r.t. Lebesgue s measure which is positive everywhere. This condition is similar to (Yu et al., 2021, Assumption 2.3) since both find their origin in Markov chain minorization conditions (Meyn & Tweedie, 1993, Section 5.2). These ensure that a chain properly explores its state space and are a common way to prove Markov chain convergence (Rosenthal, 1995b;a; Douc et al., 2004; Meyn & Tweedie, 1994; Baxendale, 2005). Our last assumption formalizes the requirement of a finite moment for the gradient error.

Assumption 5. There is q > 1 such that for e G(θ, ζ) DI(θ), we have

E εθ q | θ 1/q = E e G(θ, ζ) L(θ) q | θ 1/q Aq θ θ + Bq (4)

for all θ Rd, where Aq, Bq > 0. When L is not strongly convex, we further assume that Aq = 0.

The bound (4) captures the case of arbitrarily high noise magnitude through the dependence on θ θ . This is consistent with convex optimization problems with L-Lipschitz-smooth objectives (Assumption 1) where the norm of the gradient L(θ) is bounded by L θ θ . Assumption 5 improves upon the conditions used in (Gorbunov et al., 2020; Tsai et al., 2022; Gorbunov et al., 2023; Nguyen et al., 2023) since these either required a uniformly constant upperbound (independent of θ) or only considered the case q = 2 (finite variance). For non-strongly convex L, we require Aq = 0 since θ may not exist.

Definition 1. If X is a real random variable, we say that X is K-sub-Gaussian for K > 0 if

E exp(λ2X2) eλ2K2 for |λ| 1/K. (5)

We say that X is K-sub-exponential for K > 0 if

E exp(λ|X|) exp(λK) for all 0 λ 1/K. (6)

Published in Transactions on Machine Learning Research (10/2024)

The convergence results presented in this paper use the following formalism of continuous state Markov chains. Given a step size β > 0 and a quantile p (0, 1), we denote by Pβ,p the Markov transition kernel governing the Markov chain (θt)t 0 generated by QC-SGD, so that

P(θt+1 A | θt) = Pβ,p(θt, A)

for t 0 and A B(Rd). The transition kernel Pβ,p acts on probability distributions ν M1(Rd) through the mapping ν νPβ,p which is defined, for all A B(Rd), by νPβ,p(A) = R

A Pβ,p(θ, A)dν(θ) = P(θt+1 A|θt ν). For n 1, we similarly define the multi-step transition kernel P n β,p which is such that P n β,p(θt, A) = P(θt+n A | θt) and acts on probability distributions ν M1(Rd) through νP n β,p = (νPβ,p)P n 1 β,p . Finally, we define the total variation (TV) norm of a signed measure ν as

2 ν TV = sup f:|f| 1

Z f(θ)ν(dθ) = sup A B(Rd) ν(A) inf A B(Rd) ν(A). (7)

In particular, we recover the TV distance between ν1, ν2 M1(Rd) as

d TV(ν1, ν2) = ν1 ν2 TV = sup A B(Rd)

ν1(A) ν2(A) .

The second equality reflects the fact that the TV distance between two probability measures corresponds to the largest absolute difference between the probabilities they assign to the same event. The TV distance is a broadly used metric to quantify the convergence of Markov chains (Levin & Peres, 2017; Baxendale, 2005; Meyn & Tweedie, 1993; Rosenthal, 1995a) besides the Wasserstein distance (Dieuleveut et al., 2020).

In the next section, we will prove that the Markov chain defined by iteration (2) converges to unique invariant distribution in TV distance. This convergence mode will allow us to extrapolate the properties of the limit distribution on the iterates θt and thus derive non-asymptotic concentration bounds for them, see Corollaries 2 and 1 below.

3 Strongly Convex Objectives

We are ready to state our convergence result for the stochastic optimization of a strongly convex objective using QC-SGD with η-corrupted samples.

Theorem 1 (Geometric ergodicity). Let Assumptions 1-5 hold and assume there is a quantile p [η, 1 η] such that κ := (1 η)pµ ηL (1 p) 1

q Aq(1 p(1 η)) > 0. (8)

Then, for a step size β satisfying

4 κ µ2 + 24ηL2 + 28A2q 2 µ + L, (9)

the Markov chain (θt)t 0 generated by QC-SGD with parameters β and p converges geometrically to a unique invariant measure πβ,p: for any initial θ0 Rd, there is ρ < 1 and M < such that after T iterations δθ0P T β,p πβ,p TV MρT 1 + θ0 θ 2 ,

where δθ0 is the Dirac measure located at θ0.

The proof of Theorem 1 is given in Appendix D.3 and relies on the geometric ergodicity result of (Meyn & Tweedie, 1993, Chapter 15) for Markov chains with a geometric drift property. A similar result for quadratically growing objectives was established by (Yu et al., 2021) and convergence w.r.t. Wasserstein s metric was shown in (Dieuleveut et al., 2020) assuming gradient co-coercivity. However, robustness was not considered in these works. Theorem 1 establishes the iteration s convergence to a unique invariant measure πβ,p. The properties of this limit distribution will be explored in the sequel. The restriction p [η, 1 η] comes from the consideration that other quantiles are not estimable in the event of η-corruption. Condition (8) is

Published in Transactions on Machine Learning Research (10/2024)

best interpreted for the choice p = 1 η in which case it translates into η1 1/q O(µ/(L + Aq)) implying that it is verified for η small enough within a limit fixed by the problem conditioning. A similar condition with q = 2 appears in (Diakonikolas et al., 2022, Theorem E.9) which uses a finite variance assumption.

When (8) is satisfied, one clearly has that κ = O(µ). Considering q = 2 for simplicity and taking the maximum allowed corruption rate in this case η = O(µ2/(L + Aq)2) leads to an upperbound on the step-size β of order O(µ/(µ2 + A2 q) 1/L). While the condition β = O(1/L) is standard in smooth optimization, the additional condition in terms of Aq ensures that the noise introduced to the iteration by the gradient samples does not cause it to diverge.

The constants M and ρ controlling the geometric convergence speed in Theorem 1 depend on the parameters β, p and the initial θ0. Among choices fulfilling the convergence conditions, it is straightforward that greater step size β and θ0 closer to θ lead to faster convergence. However, the dependence in p is more intricate and should be evaluated through the resulting value of κ. We provide a more detailed discussion about the value of ρ in Appendix C.

The choice p = 1 η appears to be ideal since it leads to optimal deviation of the invariant distribution around the optimum θ which is the essence of our next statement.

Proposition 1. Assume the same as in Theorem 1 and condition (8) with the choice p = 1 η. For step size β satisfying (9), q 2, and additionally:

β η2 2/q/κ, (10)

for θ πβ,1 η, we have the following upper bound:

E θ θ 2 6η1 1/q Bq

Proposition 1 is proven in Appendix D.4. An analogous result holds for q (1, 2) but requires a different proof and can be found in Appendix D.5. Proposition 1 may be compared to (Yu et al., 2021, Theorem 3.1) which shows that the asymptotic estimation error can be reduced arbitrarily using a small step size. However, this is impossible in our case since we consider corrupted gradients. The performance of Proposition 1 is best discussed in the specific context of linear regression where gradients are given as G(θ, (X, Y )) = X(X θ Y ) for samples X, Y Rd R such that Y = X θ + ϵ with ϵ a centered noise. In this case, a finite moment of order k for the data implies order k/2 for the gradient corresponding to an η1 2/k rate in Proposition 1. Since Assumption 5 does not include independence of the noise ϵ from X, this corresponds to the negatively correlated moments assumption of (Bakshi & Prasad, 2021) being unsatisfied. Consequently, Proposition 1 is information-theoretically optimal in η based on (Bakshi & Prasad, 2021, Corollary 4.2). Nonetheless, the dimension dependence through Bq remains poor since we have Bq

d in general because the Euclidean norm is used in Assumption 5. This dimension dependence may be improvable by using the quantiles sup v =1 Qp | e G(θt, ζt), v | as clipping thresholds and adapting ideas from (Catoni & Giulini, 2018) in the analysis. However, exploring this method is beyond our scope as the involved estimations for all v = 1 would be excessively sample hungry and computationally heavy for stochastic optimization.If the gradient is sub-Gaussian with constant K, we would have Bq K q for q 1 (see (Vershynin, 2018) for a reference), in which case, the choice q = log(1/η) recovers the optimal rate in η p

log(1/η) for the Gaussian case.

We now turn to showing strong concentration properties for the invariant distribution πβ,p. For this purpose, we restrict the optimization to a bounded and convex set Θ Rd and replace Iteration (2) by the projected iteration θt+1 = ΠΘ θt αθtβG(θt, ζt) , (11)

where ΠΘ is the projection onto Θ. Assuming that the latter contains the optimum θ Θ, one can check that the previous results continue to hold thanks to the inequality

ΠΘ(θ) θ = ΠΘ(θ) ΠΘ(θ ) θ θ ,

which results from the convexity of Θ. The restriction of the optimization to a bounded set allows us to uniformly bound the clipping threshold τθ, which is indispensable for the following result.

Published in Transactions on Machine Learning Research (10/2024)

Proposition 2. In the setting of Theorem 1, consider projected QC-SGD (11) and let τ = supθ Θ τθ, D = diam(Θ) the diameter of Θ and Bq = Aq D + Bq.

Consider the non-corrupted case η = 0 and set the quantile p such that p 1 (βµ)

q 2(q 1) . Then, for θ πβ,p, the variable θ θ is sub-Gaussian in the sense of Definition 1 with constant

2β(B 2 q + τ 2) pµ .

Consider the corrupted case η > 0, and set the quantile p [η, 1 η] such that Inequality (8) holds. Then, for θ πβ,p, the variable θ θ is sub-exponential in the sense of Definition 1 with constant

K = 7τ + (1 p)1 1/q Bq

The proof can be found in Appendix D.6. The strong concentration properties given by Proposition 2 for the invariant distribution appear to be new. Still, the previous result remains asymptotic in nature. High confidence deviation bounds for an iterate θt can be derived by leveraging the convergence in Total Variation distance given by Theorem 1 leading to the following result. Corollary 1. In the setting of Proposition 2, in the absence of corruption η = 0, after T iterations, for δ > 0, we have

P θT θ > 4 q

B 2 q + τ 2 s

2β log(e/δ)

δ + ρT M 1 + θ0 θ 2 .

Choosing a smaller step size β in Corollary 1 allows to improve the deviation bound. However, this comes at the cost of weaker confidence because of slower convergence due to a greater ρ. See Appendix C for a discussion including a possible compromise. Corollary 1 may be compared to the results of (Gorbunov et al., 2020; Tsai et al., 2022; Sadiev et al., 2023; Nguyen et al., 2023) which correspond to β 1/T and have a similar dependence on the dimension through the gradient variance. Although their approach is also based on gradient clipping, they use different thresholds and proof methods. In the presence of corruption, the invariant distribution is not sub-Gaussian. This can be seen by considering the following toy Markov chain:

( αXt + ξ w.p. 1 η Xt + τ w.p. η

where α < 1, τ > 0 are constants and ξ is a positive random noise. Using similar methods to the proof of Theorem 1, one can show that (Xt)t 0 converges (for any initial X0) to an invariant distribution whose moments can be shown to grow linearly, indicating a sub-exponential distribution and excluding a sub Gaussian one. We provide additional details for the underlying argument in Appendix D.7. For the corrupted case, the sub-exponential property stated in Proposition 2 holds with a constant K of order τ/µ, which is not satisfactory and leaves little room for improvement due to the inevitable bias introduced by corruption. Therefore, we propose the following procedure in order to obtain a high confidence estimate, similarly to Corollary 1.

Algorithm 1 uses ideas from (Hsu & Sabato, 2016) (see also (Minsker, 2015; Juditsky et al., 2023)) and combines the collection of weak estimators θ(i) T

i JNK (only satisfying L2 bounds) into a strong one with

sub-exponential deviation. This is done by picking θ(i) T which is such that the median of its distances to other estimators r(i) N/2 is minimal. The aggregated estimator bθ satisfies the high probability bound given in the next result. Corollary 2. Assume the same as in Theorem 1 and Proposition 1. Consider bθ given by Algorithm 1, with the assumption that the gradient sample sets used for each θ(n) T

n JNK in Equation (12) are independent. For δ > 0, if N 16 log(1/δ) and T satisfies

T N log(15M(1 + θ0 θ 2))/ log(1/ρ),

Published in Transactions on Machine Learning Research (10/2024)

Algorithm 1: Aggregation of cycling iterates

Input: Step size β > 0, quantile index p (0, 1), initial parameter θ0 Θ, horizon T and number of concurrent iterates N 1.

1 Optimize multiple parameters θ(1) t , . . . , θ(N) t starting from a common θ0 = θ(n) 0 for n JNK =: {1, . . . , N} and T steps t = 0, . . . , T using the following cycling iteration:

( θ(n) t αθ(n) t βG θ(n) t , ζt if t n 1 mod N,

θ(n) t otherwise. (12)

2 Compute the pairwise distances ri,j = θ(i) T θ(j) T for i, j JNK.

3 For i JNK, let r(i) RN + be the vector ri,: := [ri,1, . . . , ri,N] sorted in non decreasing order.

4 Compute the aggregated estimator as bθ = θ(bi) T with bi = arg mini JNK r(i) N/2 .

5 return bθ

then, with probability at least 1 δ, we have

bθ θ 27η1 1

q Bq κ . (13)

We obtain a high confidence version of the bound in expectation previously stated in Proposition 1. As argued before, the above bound depends optimally on η. Similar bounds to (13) are obtained for q = 2 in (Diakonikolas et al., 2022) for streaming mean estimation, linear and logistic regression. Their results enjoy better dimension dependence but are less general than ours since we handle the case q (1, 2) and consider strongly convex objectives more broadly. In addition, our results further extend to non-convex objectives as detailed in the next section. Finally, the implementation of the algorithm in (Diakonikolas et al., 2022) is not straightforward whereas our method is quite easy to use (see Section 5).

4 Non-Convex Objectives

In this section, we drop Assumption 2 and consider the optimization of possibly non-convex objectives. Consequently, the existence of a unique optimum θ and the quadratic growth of the objective are no longer guaranteed. This motivates us to use a uniform version of Assumption 5 with Aq = 0 since the gradient is no longer assumed coercive and its deviation moments can be taken as bounded. In this context, we obtain the following weaker (compared to Theorem 1) ergodicity result for QC-SGD.

Theorem 2 (Ergodicity). Let Assumptions 1, 3, 4 and 5 hold with Aq = 0 (uniformly bounded moments) and let L be an objective such that infθ L(θ) > is finite. Let (θt)t 0 be the Markov chain generated by QC-SGD with step size β and quantile p [η, 1 η]. Assume that p and β are such that 3p(1 η)/4 > Lβ +η and that the subset of Rd given by

L(θ) 2 B2 q (1 p) 2

q (Lβ + 2η2) + 2η2 2

p(1 η)(3p(1 η)/4 Lβ η)

is bounded. Then, for any initial θ0 Rd, there exists M < + such that after T iterations

δθ0P T β,p πβ,p TV M

where πβ,p is a unique invariant measure and where δθ0 is the Dirac measure located at θ0.

The proof is given in Appendix D.10 and uses ergodicity results from (Meyn & Tweedie, 1993, Chapter 13). Theorem 2 provides convergence conditions for an SGD Markov chain on a smooth objective in a robust setting. We are unaware of anterior results of this kind in the literature. Condition (14) requires that the

Published in Transactions on Machine Learning Research (10/2024)

set where the true gradient norm is smaller than the estimation error is bounded. This aims to exclude the possibility that the iteration gets trapped within this set and keep using unreliable gradient estimates causing it to diverge. The result is stronger when the upperbound in (14) is smaller. Note that setting p close to 1 η increases the clipping threshold and the estimation error as a consequence, making this condition harder to satisfy. On the other hand, using β = O(1/L) and a more conservative value of p makes the upperbound of order O(B2 q) and condition 14 easier to satisfy. Observe that, for no corruption (η = 0), the condition is always fulfilled for some β and p. Note also that without strong convexity (Assumption 2), convergence occurs at a slower sublinear rate which is consistent with the optimization rate expected for a smooth objective (see (Bubeck, 2015, Theorem 3.3)).

As previously, we complement Theorem 1 with a characterization of the invariant distribution.

Proposition 3. Under the conditions of Theorem 2, assume that the choice p = 1 η is such that the set (14) is bounded. For step size β η2/L, the stationary measure θ πβ,1 η satisfies

E L(θ) 2 5η2 2

q B2 q p(1 η) 3p(1 η)/4 Lβ η . (16)

The statement of Proposition 3 is clearly less informative than Propositions 1 and 2 since it only pertains to the gradient rather than, for example, the excess risk. This is due to the weaker assumptions that do not allow to relate these quantities. Still, the purpose remains to find a critical point and is achieved up to O(η1 1/q) precision according to this result. Due to corruption, the estimation error on the gradient cannot be reduced beyond Ω(η1 1/q) (Prasad et al., 2020; Hopkins & Li, 2018; Diakonikolas & Kane, 2019). Therefore, one may draw a parallel with a corrupted mean estimation task, in which case, the previous rate is, in fact, information-theoretically optimal.

5 Implementation and Numerical Experiments

The use of the generally unknown quantile Qp( e G(θt, ζt) ) in QC-SGD constitutes the main obstacle to its implementation. For strongly convex objectives, one may use a proxy such as a θt θref + b with positive a, b and θref Rd an approximation of θ serving as reference point. This choice is consistent with Assumptions 1 and 5, see Lemma 2 in Appendix D. For instances of Problem (1) defined with an

Algorithm 2: Rolling QC-SGD

Input: Step size β > 0, quantile index p (0, 1), initial parameter θ0 Rd, τunif > 0, buffer B of size S and horizon T.

1 Fill B with S 1 values equal to τunif.

2 for t = 0 . . . T 1 do

3 Draw a sample G(θt, ζt) and add G(θt, ζt) to B.

4 b Qp p S rank element of B.

5 θt+1 θt βclip(G(θt, ζt), b Qp)

6 Delete the oldest value in B.

7 return θT

asymptotically linear function ℓsuch as the logistic, hinge or Huber s loss, a constant threshold can be used since the gradient is a priori uniformly bounded, implying the same for the quantiles of its deviations. In practice, we propose a simpler and more direct approach: we use a rolling quantile procedure, described in Algorithm 2. The latter stores the values ( G(θt j, ζt j) )1 j S in a buffer of size S N and replaces Qp( e G(θt, ζt) ) in QC-SGD by an estimate b Qp which is the p S -th order statistic in the buffer. Note that only the norms of previous gradients are stored in the buffer, limiting the memory overhead to O(S). The computational cost of b Qp can also be kept to O(S) per iteration thanks to a bookkeeping procedure (see Appendix B).

Published in Transactions on Machine Learning Research (10/2024)

Note that, since Algorithm 2 uses the corrupted samples G(θt, ζt) rather than the true ones e G(θt, ζt) to estimate the quantiles, a more conservative upperbound of roughly p 1 2η should be respected when an estimate of η is available. Otherwise, one may default to p = 1/2 as an initial guess and adapt based on performance. In practice, our experiments show that relatively low values within p [0.1, 0.2] are best for strongly convex objectives while higher values are affordable in other cases. See Appendix B for more details.

We implement this procedure for a few tasks and compare its performance with relevant baselines. We do not include a comparison with (Diakonikolas et al., 2022) whose procedure has no implementation we are aware of and is difficult to use in practice. Indeed, the algorithm in question heavily depends on several problem parameters and involves a filtering procedure which requires multiple passes on large data minibatches making it impractical for the streaming setting. Moreover, a number of special methods are required to mitigate the costs of the matrix operations needed in the original procedure making the algorithm s implementation even more involved.

Our experiments on synthetic data consider an infinite horizon, dimension d = 128, and a constant step size for all methods.

Linear regression. We consider least-squares linear regression and compare RQC-SGD with Huber s estimator (Huber, 1973) and clipped SGD (designated as CClip(λ)) with three clipping levels λσmax

d for λ {0.8, 1.0, 1.2} where σmax is a fixed data scaling factor. These thresholds provide a rough estimate of the gradient norm near the optimum θ . We generate covariates X and labels Y both heavy-tailed and corrupted. Corruption in the data stream is generated according to Assumption 3 with outliers represented either by aberrant values or fake samples Y = X θfake + ϵ using a false parameter θfake, see Appendix B for further details on data generation and fine tuning of the Huber parameter. All methods are run with constant step size and averaged results over 100 runs are displayed on Figure 1 (top row).

As anticipated, Huber s loss function is not robust to corrupted covariates. In contrast, using gradient clipping allows convergence to meaningful estimates. Although this holds true for a constant threshold, Figure 1 shows it may considerably slow the convergence if started away from the optimum. In addition, the clipping level also affects the final estimation precision and requires tuning. Both of the previous issues are well addressed by RQC-SGD whose adaptive clipping level allows fast progress of the optimization and accurate convergence towards a small neighborhood of the optimum.

Logistic regression. We test the same methods on logistic regression. Huber s baseline is represented by the modified Huber loss (also known as quadratic SVM (Zhang, 2004)). We generate data similarly to the previous task except for the labels which follow Y Bernoulli(σ(X θ )) with σ the sigmoid function. Corrupted labels are either uninformative, flipped or obtained with a fake θfake (see details in Appendix B). Results are displayed on the bottom row of Figure 1.

As previously, Huber s estimator performs poorly with corruption. However, constant clipping appears to be better suited when the gradient is bounded, so that the optimization is less affected by its underestimation. We observe, nonetheless, that a higher clipping level may lead to poor convergence properties, even at a low corruption rate. Note also that the constant levels we use are based on prior knowledge about the data distribution and would have to be fine tuned in practice. Meanwhile, the latter issue is well addressed by quantile clipping. Finally, notice that no algorithm truly approaches the true solution for this task. This reflects the difficulty of improving upon Proposition 3 which only states convergence to a neighborhood where the objective gradient is comparable to the estimation error in magnitude.

Classification with shallow networks. Finally, we evaluate the performance on the task of training a single hidden layer neural network classifier on some real datasets which corresponds to a non-convex optimization problem. To handle multiclass data, we use the cross entropy loss and replace Huber s baseline with plain SGD for simplicity. We define constant clipping baselines using thresholds given by the quantiles of order p = 0.25, 0.5, and 0.75 of the norms of a batch of gradients at the beginning of the optimisation. Due to the greater sensitivity to corruption observed in this case, we set η = 0.02 and use p = 0.9 for RQC-SGD.

Published in Transactions on Machine Learning Research (10/2024)

0 10000 20000 0.1

0 10000 20000

0 10000 20000

0 10000 20000

0 10000 20000

0 10000 20000

RQC-SGD HUBER CClip( = 0.8) CClip( = 1.0) CClip( = 1.2)

Figure 1: Evolution of θt θ on the tasks of linear regression (top row) and logistic regression (bottom row) averaged over 100 runs at increasing corruption levels (error bars represent half the standard deviation). Estimators based on Huber s loss are strongly affected by data corruption. SGD with constant clipping thresholds is robust but slow to converge for linear regression and requires tuning for better final precision. RQC-SGD combines fast convergence with good final precision thanks to its adaptive clipping strategy.

0 400000 800000

RQC_SGD SGD CClip(p=0.25) CClip(p=0.5) CClip(p=0.75) 0 25000 50000 75000 100000 1.4

2.6 sensorless

0.00 0.25 0.50 0.75 1.00

1.6 covtype

Figure 2: Evolution of the test loss (y-axis) against iteration t (x-axis) for the training of a single hidden layer network on different real world classification datasets (average over 20 runs). We observe more consistent and stable objective decrease for RQC-SGD whereas constant clipping baselines are slower and may fail to converge.

We train all methods with one sample per iteration using equal step sizes and evaluate them through the test loss. We provide further results and experimental details in Appendix B. Results are displayed on Figure 2.

Unsurprisingly, standard SGD is not robust to corrupted samples and, while using a constant clipping level helps keep the optimisation on track, the experiments show that careful tuning may sometimes be necessary to prevent divergence. On the other hand, the adaptive clipping levels used by RQC-SGD allow to make the iteration faster and more resilient to corruption. This leads to an optimization path with a more consistent

Published in Transactions on Machine Learning Research (10/2024)

decrease of the objective. Moreover, we also observe that RQC-SGD allows for a better control of the asymptotic variance of the optimized parameter compared to constant clipping.

6 Conclusion

We introduced a new clipping strategy for SGD and proved that it defines a stochastic optimization procedure which is robust to both heavy tails and outliers in the data stream. We also provided an efficient rolling quantile procedure to implement it and demonstrated its performance through numerical experiments on synthetic and real data. Future research directions include improving the dimension dependence in our bounds, possibly by using sample rejection rules or by considering stochastic mirror descent (Nemirovskij & Yudin, 1983; Beck & Teboulle, 2003) clipped with respect to a non Euclidean norm. This may also procure robustness to higher corruption rates. Another interesting research track is the precise quantification of the geometric convergence speed of the Markov chain generated by constant step size SGD on a strongly convex objective.

Acknowledgments

This research is supported by the Agence Nationale de la Recherche as part of the Investissements d avenir program (reference ANR-19-P3IA-0001; PRAIRIE 3IA Institute).

Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pp. 20 29, 1996.

Ainesh Bakshi and Adarsh Prasad. Robust linear regression: Optimal rates in polynomial time. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 102 115, 2021.

Martyna Bator. Dataset for Sensorless Drive Diagnosis. UCI Machine Learning Repository, 2015. DOI: https://doi.org/10.24432/C5VP5F.

Peter H Baxendale. Renewal theory and computable convergence rates for geometrically ergodic Markov chains. The Annals of Applied Probability, 15(1B):700 738, 2005.

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167 175, 2003.

Witold Bednorz. The Kendall s Theorem and its Application to the Geometric Ergodicity of Markov Chains. ar Xiv preprint ar Xiv:1301.1481, 2013.

Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive Algorithms and Stochastic Approximations, volume 22. Springer Science & Business Media, 2012.

Kenneth S Berenhaut and Robert Lund. Geometric renewal convergence rates from hazard rates. Journal of Applied Probability, 38(1):180 194, 2001.

Jock Blackard. Covertype. UCI Machine Learning Repository, 1998. DOI: https://doi.org/10.24432/C50K5N.

Joseph K Blitzstein and Jessica Hwang. Introduction to probability. Crc Press, 2019.

D. A. Bloch. A note on the estimation of the location parameter of the cauchy distribution. Journal of the American Statistical Association, 61:852 855, 1966.

Léon Bottou and Yann Cun. Large scale online learning. In S. Thrun, L. Saul, and B. Schölkopf (eds.), Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003. URL https://proceedings.neurips.cc/paper_files/paper/2003/file/ 9fb7b048c96d44a0337f049e0a61ff06-Paper.pdf.

Published in Transactions on Machine Learning Research (10/2024)

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223 311, 2018.

Léon Bottou and Yann Lecun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21:137 151, 03 2005. doi: 10.1002/asmb.538.

Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231 357, 2015.

Hervé Cardot, Peggy Cénac, and Pierre-André Zitt. Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli, 19(1):18 43, 2013. doi: 10.3150/11-BEJ390. URL https://doi.org/10.3150/11-BEJ390.

Hervé Cardot, Peggy Cénac, and Antoine Godichon-Baggioni. Online estimation of the geometric median in Hilbert spaces: Nonasymptotic confidence balls. The Annals of Statistics, 45(2):591 614, 2017. doi: 10.1214/16-AOS1460. URL https://doi.org/10.1214/16-AOS1460.

Olivier Catoni. Challenging the empirical mean and empirical variance: A deviation study. Annales de l Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148 1185, 2012. doi: 10.1214/11-AIHP454. URL https://doi.org/10.1214/11-AIHP454.

Olivier Catoni and Ilaria Giulini. Dimension-free pac-bayesian bounds for the estimation of the mean of a random vector. ar Xiv preprint ar Xiv:1802.04308, 2018.

Han-fu Chen and AI-JUN Gao. Robustness analysis for stochastic approximation algorithms. Stochastics and Stochastic Reports, 26(1):3 20, 1989.

Han-Fu Chen, Lei Guo, and Ai-Jun Gao. Convergence and robustness of the robbins-monro algorithm truncated at randomly varying bounds. Stochastic Processes and their Applications, 27:217 231, 1987.

Yeshwanth Cherapanamjeri, Nicolas Flammarion, and Peter L. Bartlett. Fast mean estimation with subgaussian rates. In Conference on Learning Theory, pp. 786 806. PMLR, 2019.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27, 2014.

Jules Depersin and Guillaume Lecué. Robust sub-Gaussian estimation of a mean vector in nearly linear time. The Annals of Statistics, 50(1):511 536, 2022. doi: 10.1214/21-AOS2118. URL https://doi.org/ 10.1214/21-AOS2118.

Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional robust statistics. ar Xiv preprint ar Xiv:1911.05911, 2019.

Ilias Diakonikolas and Daniel M Kane. Algorithmic high-dimensional robust statistics. Cambridge university press, 2023.

Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48 (2):742 864, 2019.

Ilias Diakonikolas, Daniel M Kane, and Ankit Pensia. Outlier robust mean estimation with subgaussian rates via stability. Advances in Neural Information Processing Systems, 33:1830 1840, 2020.

Ilias Diakonikolas, Daniel M Kane, Ankit Pensia, and Thanasis Pittas. Streaming algorithms for highdimensional robust statistics. In International Conference on Machine Learning, pp. 5061 5117. PMLR, 2022.

Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. The Annals of Statistics, 48(3):1348 1382, 2020. doi: 10.1214/ 19-AOS1850. URL https://doi.org/10.1214/19-AOS1850.

Published in Transactions on Machine Learning Research (10/2024)

R. Douc, E. Moulines, and Jeffrey S. Rosenthal. Quantitative bounds on convergence of timeinhomogeneous Markov chains. The Annals of Applied Probability, 14(4):1643 1665, 2004. doi: 10.1214/105051604000000620. URL https://doi.org/10.1214/105051604000000620.

Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Advances in Neural Information Processing Systems, 33:15042 15053, 2020.

Eduard Gorbunov, Abdurakhmon Sadiev, Marina Danilova, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richtárik. High-probability convergence for composite and distributed stochastic minimization and variational inequalities with heavy-tailed noise. ar Xiv preprint ar Xiv:2310.01860, 2023.

Frank R Hampel. A general qualitative definition of robustness. The Annals of Mathematical Statistics, 42 (6):1887 1896, 1971.

Abdelhakim Hannousse and Salima Yahiouche. Web page phishing detection. Mendeley Data, 2, 2020.

Matthew Holland and Kazushi Ikeda. Better generalization with less data using robust gradient descent. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2761 2770. PMLR, 09 15 Jun 2019. URL https://proceedings.mlr.press/v97/holland19a.html.

Samuel B. Hopkins. Mean estimation with sub-Gaussian rates in polynomial time. The Annals of Statistics, 48(2):1193 1213, 2020. doi: 10.1214/19-AOS1843. URL https://doi.org/10.1214/19-AOS1843.

Samuel B Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1021 1034, 2018.

Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543 582, 2016.

Peter J Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, pp. 1753 1758, 1965.

Peter J Huber. The 1972 wald lecture robust statistics: A review. The Annals of Mathematical Statistics, 43(4):1041 1067, 1972.

Peter J. Huber. Robust Regression: Asymptotics, Conjectures and Monte Carlo. The Annals of Statistics, 1(5):799 821, 1973. doi: 10.1214/aos/1176342503. URL https://doi.org/10.1214/aos/1176342503.

Peter J Huber. Robust estimation of a location parameter. Breakthroughs in statistics: Methodology and distribution, pp. 492 518, 1992.

Daniel C Jerison. Quantitative convergence rates for reversible Markov chains via strong random times. ar Xiv preprint ar Xiv:1908.06459, 2019.

Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169 188, 1986.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.

Anatoli Juditsky, Andrei Kulunchakov, and Hlib Tsyntseus. Sparse recovery by reduced variance stochastic approximation. Information and Inference: A Journal of the IMA, 12(2):851 896, 2023.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.

Harold J. Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer New York, NY, 2003.

Published in Transactions on Machine Learning Research (10/2024)

Guanghui Lan. First-Order and Stochastic Optimization Methods for Machine Learning, volume 1. Springer, 2020.

Guillaume Lecué and Matthieu Lerasle. Learning from mom s principles: Le cam s approach. Stochastic Processes and their applications, 129(11):4385 4410, 2019.

Guillaume Lecué and Matthieu Lerasle. Robust machine learning by median-of-means: Theory and practice. The Annals of Statistics, 48, 11 2017. doi: 10.1214/19-AOS1828.

Zhixian Lei, Kyle Luh, Prayaag Venkat, and Fred Zhang. A fast spectral algorithm for mean estimation with sub-Gaussian rates. In Conference on Learning Theory, pp. 2598 2612. PMLR, 2020.

David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.

Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. In International Conference on Artificial Intelligence and Statistics, pp. 411 421. PMLR, 2020.

Gábor Lugosi and Shahar Mendelson. Sub-gaussian estimators of the mean of a random vector. The Annals of Statistics, 47(2):783 794, 2019.

Gábor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: The optimality of trimmed mean. The Annals of Statistics, 49(1):393 410, 2021. doi: 10.1214/20-AOS1961. URL https://doi. org/10.1214/20-AOS1961.

Robert B Lund, Sean P Meyn, and Richard L Tweedie. Computable exponential convergence rates for stochastically ordered Markov processes. The Annals of Applied Probability, 6(1):218 237, 1996.

Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning, pp. 3325 3334. PMLR, 2018.

Rainer Martin and Carl Masreliez. Robust estimation via stochastic approximation. IEEE Transactions on Information Theory, 21(3):263 271, 1975. doi: 10.1109/TIT.1975.1055386.

H Brendan Mc Mahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222 1230, 2013.

Sean P Meyn and Richard L Tweedie. Markov Chains and Stochastic Stability. Springer London, 1993.

Sean P Meyn and Robert L Tweedie. Computable bounds for geometric convergence rates of Markov chains. The Annals of Applied Probability, pp. 981 1011, 1994.

Stanislav Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308 2335, 2015. doi: 10.3150/14-BEJ645. URL https://doi.org/10.3150/14-BEJ645.

Alexander V Nazin, Boris T Polyak, and Alexandre B Tsybakov. Optimal and robust kernel algorithms for passive stochastic approximation. IEEE Transactions on Information Theory, 38(5):1577 1583, 1992.

Alexander V Nazin, Arkadi S Nemirovsky, Alexandre B Tsybakov, and Anatoli B Juditsky. Algorithms of robust stochastic optimization based on mirror descent method. Automation and Remote Control, 80: 1607 1627, 2019.

Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience publication. Wiley, 1983. ISBN 9780471103455. URL https: //books.google.fr/books?id=6ULv AAAAMAAJ.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2014. ISBN 1461346916.

Published in Transactions on Machine Learning Research (10/2024)

Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Le Nguyen. High probability convergence of clipped-SGD under heavy-tailed noise. ar Xiv preprint ar Xiv:2302.05437, 2023.

Ankit Pensia, Varun Jog, and Po-Ling Loh. Robust regression with covariate filtering: Heavy tails and adversarial contamination. ar Xiv preprint ar Xiv:2009.12976, 2020.

Boris Teodorovich Polyak and Yakov Zalmanovich Tsypkin. Adaptive estimation algorithms: convergence, optimality, stability. Automation and Remote Control, 40(3):378 389, 1979.

Boris Teodorovich Polyak and Yakov Zalmanovich Tsypkin. Robust pseudogradient adaptation algorithms. Automation and Remote Control, 41(10):1404 1409, 1981.

Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82, 2018.

Adarsh Prasad, Sivaraman Balakrishnan, and Pradeep Ravikumar. A robust univariate mean estimator is all you need. In International Conference on Artificial Intelligence and Statistics, pp. 4034 4044. PMLR, 2020.

E Price and V Vande Linde. Robust estimation using the robbins-monro stochastic approximation algorithm. IEEE Transactions on Information Theory, 25(6):698 704, 1979.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. ar Xiv preprint ar Xiv:1109.5647, 2011.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400 407, 1951.

Gareth O Roberts and Richard L Tweedie. Rates of convergence of stochastically monotone and continuous time Markov models. Journal of Applied Probability, 37(2):359 373, 2000.

Byron Roe. Mini Boo NE particle identification. UCI Machine Learning Repository, 2010. DOI: https://doi.org/10.24432/C5QC87.

Jeffrey S Rosenthal. Convergence rates for Markov chains. Siam Review, 37(3):387 405, 1995a.

Jeffrey S Rosenthal. Minorization conditions and convergence rates for markov chain monte carlo. Journal of the American Statistical Association, 90(430):558 566, 1995b.

Thomas J. Rothenberg, Franklin M. Fisher, and C. B. Tilanus. A note on estimation from a cauchy sample. Journal of the American Statistical Association, 59(306):460 463, 1964. doi: 10.1080/01621459.1964. 10482170. URL https://www.tandfonline.com/doi/abs/10.1080/01621459.1964.10482170.

Peter J Rousseeuw and Mia Hubert. Robust statistics for outlier detection. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):73 79, 2011.

Abdurakhmon Sadiev, Marina Danilova, Eduard Gorbunov, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richtárik. High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance. In International Conference on Machine Learning, pp. 29563 29648. PMLR, 2023.

Prem Seetharaman, Gordon Wichern, Bryan Pardo, and Jonathan Le Roux. Autoclip: Adaptive gradient clipping for source separation networks. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1 6. IEEE, 2020.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In Proceedings of the 24th International Conference on Machine Learning, pp. 807 814, 2007.

Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In Conference on Learning Theory, volume 2, pp. 5, 2009.

Published in Transactions on Machine Learning Research (10/2024)

Srdjan S Stanković and Branko D Kovačević. Analysis of robust stochastic approximation algorithms for process identification. Automatica, 22(4):483 488, 1986.

Che-Ping Tsai, Adarsh Prasad, Sivaraman Balakrishnan, and Pradeep Ravikumar. Heavy-tailed streaming statistical estimation. In International Conference on Artificial Intelligence and Statistics, pp. 1251 1282. PMLR, 2022.

John Wilder Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, pp. 448 485, 1960.

Andrew V Uzilov, Joshua M Keegan, and David H Mathews. Detection of non-coding rnas on the basis of predicted secondary structure formation free energy change. BMC bioinformatics, 7(1):1 30, 2006.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge university press, 2018.

Nuri Mert Vural, Lu Yu, Krishna Balasubramanian, Stanislav Volgushev, and Murat A Erdogdu. Mirror descent strikes again: Optimal stochastic convex optimization under infinite noise variance. In Conference on Learning Theory, pp. 65 102. PMLR, 2022.

Lu Yu, Krishnakumar Balasubramanian, Stanislav Volgushev, and Murat A Erdogdu. An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. Advances in Neural Information Processing Systems, 34:4234 4248, 2021.

Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383 15393, 2020.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 116, 2004.

Published in Transactions on Machine Learning Research (10/2024)

Supplementary Material Robust Stochastic Optimization via Gradient Quantile Clipping

A Additional experimental results

0 100000 200000 0.55

RQC_SGD SGD CClip(p=0.25) CClip(p=0.5) CClip(p=0.75) 0 10000 20000 30000 0.60

0.75 phishing

Figure 3: Evolution of the test loss (y-axis) against iteration t (x-axis) for the training of a single hidden layer network on additional real world classification datasets (average over 20 runs).

Classification with shallow networks. We performed the same experiment using two additional datasets. The results are displayed on Figure 3 and corroborate our statements in the main paper.

Expectation estimation. We estimate the expectation of a random vector X by minimizing the objective L(θ) = 1

2 θ θ 2 with θ = E[X] using a stream of both corrupted and heavy-tailed samples, see Appendix B for details. We run RQC-SGD (Algorithm 2) and compare it to an online version of geometric and coordinatewise Median-Of-Means (GMOM and CMOM (Cardot et al., 2017; 2013)) which use block sample means to minimize an L1 objective (see Appendix B). Although these estimators are a priori not robust to η-corruption, we ensure that their estimates are meaningful by limiting η to 4% and using blocks of 10 samples. Thus, blocks are corrupted with probability < 1/2 so that the majority contains only true samples. Figure 4 displays the evolution of θt θ for each method averaged over 100 runs for increasing η and constant step size. We also display a single run for η = 0.04. We observe that RQC-SGD is only weakly affected by the increasing corruption whereas the performance of GMOM and CMOM quickly degrades with η, leading to unstable estimates.

B Experimental details

As previously mentioned, the dimension is set to d = 128 in our experiments with synthetic data. We also set σmin = 1 and σmax = 5 as minimum and maximum scaling factors. For all tasks and algorithms, the optimization starts from θ0 = 0.

Bookkeeping in RQC-SGD The buffer in Algorithm 2 stores values in sorted order along with their ages . The most recent and oldest values have ages 0 and S 1 respectively. At each iteration, a new gradient is received, all ages are incremented and the oldest value is replaced by the new one with age 0. The latter is then sorted using one iteration of insertion sort. The estimate b Qp is retrieved at each iteration as the value at position p S .

Published in Transactions on Machine Learning Research (10/2024)

0 2000 4000 6000 8000 0

0 2000 4000 6000 8000

0 2000 4000 6000 8000

RQC-SGD CMOM GMOM

Figure 4: Evolution of θt θ (y-axis) against iteration t (x-axis) for the expectation estimation task, averaged over 100 runs at different corruption levels η (bands widths correspond to the standard deviation of the 100 runs). For η = 0.04, the evolution on a single run is also displayed. We observe good performance for RQC-SGD for increasing η while CMOM and GMOM are more sensitive.

B.1 Mean estimation

Data generation We compute a matrix Σ = (AA + A A)/2 where A Rd d is a random matrix with i.i.d centered Gaussian entries with variance 1/d sampled once and for all. We generate true samples as X = 1 +ΣV where V is a vector of i.i.d symmetrized Pareto random variables with parameter 2 and 1 Rd

denotes the vector with all entries equal to 1.

We draw corrupted samples as q X = 10q V 100 1 where q V is a vector of i.i.d symmetrized Pareto variables with parameter 1.5. We use step size β = 10 3.

GMOM and CMOM The geometric and coordinatewise Median-Of-Means estimators (GMOM and CMOM) optimize the following objectives respectively:

E θ XNb 2 and E θ XNb 1,

where XNb is the average of Nb independent copies of X. The block size is set to Nb = 10 in the whole experiment. The above objectives are optimized by computing samples of XNb in a streaming fashion so that one step is made for each Nb samples. In order to compensate for this inefficiency we multiply the step size by Nb for both GMOM and CMOM. For GMOM, we additionally multiply the step size by

d in order to compensate the normalization included in the gradient formula.

RQC-SGD For mean estimation, we implement RQC-SGD (Algorithm 2) with buffer size S = 100, p = 0.2 and τunif = 10.

B.2 Linear regression

Data generation We choose the true parameter θ by independently sampling its coordinate uniformly in the interval [ 5, +5]. The true covariates are sampled as X = ΣV where Σ is a diagonal matrix with entries sampled uniformly in the interval [σmin, σmax] (once and for all) and V is a vector of i.i.d symmetrized Pareto random variables with parameter 2. The labels are sampled as Y = X θ +ϵ where ϵ is a symmetrized Pareto random variable with parameter 2.

The corrupted samples are obtained according to one of the following possibilities with equal probability:

X = 1000(maxi Σii)v + W where v is a fixed unit vector and W is a standard Gaussian vector and Y Bernoulli(1/2).

X = 1000(maxi Σii)V with V a unit norm random vector with uniform distribution and Y = 1000(Z + U) where Z is a random sign and U is uniform over [ 1/5, 1/5].

Published in Transactions on Machine Learning Research (10/2024)

X = 10V with V a random vector with i.i.d entries following a symmetrized Log-normal distribution and Y = X θfake + ϵ with θfake a fake parameter drawn similarly to θ once and for all and ϵ a standard Gaussian variable.

We use step size β = 10 3.

Huber parameter In order to tune the parameter δ of Huber s loss function, we proceed as follows:

For each corruption level η, we consider 10 candidate values δj = 10j/2 5 for 0 j < 10.

For each candidate δj, we train 250 estimators bθ(i) δj

i J250K using 1000 samples each.

We choose bj for which the average 1 250 P

i J250K bθ(i) δj θ is minimal and use bδ = δbj as parameter.

RQC-SGD For linear regression, we run RQC-SGD with buffer size S = 100 and τunif = 10. The quantile value was set to p = 0.1 for η {0.02, 0.06} and p = 0.05 for η = 0.1.

B.3 Logistic regression

Data generation The true parameter θ and covariates X are chosen similarly to linear regression. Given X, the label Y is set to +1 with probability σ(X θ ) where σ is the sigmoid function σ(x) = (1 + e x) 1

and to 1 otherwise.

The corrupted covariates are determined similarly to linear regression while the labels are set as follows in each respective case:

Y is set to +1 or 1 with equal probability.

Y = sign(X θ ).

Y = sign(X θfake) with θfake a fake parameter drawn similarly to θ once and for all.

We use step size β = 6 10 3.

Huber parameter The same procedure is used to tune the parameter of the modified Huber loss as for linear regression.

RQC-SGD For logistic regression, we run RQC-SGD with buffer size S = 100 and τunif = 10. The quantile value was set to p = 1 η 0.1 for η = 0.02 and p = 1 η 0.05 otherwise.

B.4 Single hidden layer neural network classifier

We train a single hidden layer neural network classifier with 100 hidden neurones for all datasets. We use one sample per iteration and step size β = 10 2 for all methods.

As previously, RQC-SGD is run with buffer size S = 100 and τunif = 10. The quantile value was set to p = 0.9. We compute the gradient norms over a batch of samples of size S at the beginning of the optimization and use the quantiles of order p = 0.25, 0.5 and 0.75 as the clipping level for the constant clipping baselines.

Data We used publicly available datasets for our experiments. We provide details about their characteristics and sources in Table 1.

We use a 10% share of each dataset as a test set in order to compute the test loss plotted in Figures 2 and 3. We also ensure the test set contains at least 5000 elements. Optimization is run using the remaining train set which is corrupted as specified next. The results are averaged over 20 runs for each datasets.

Published in Transactions on Machine Learning Research (10/2024)

Dataset # Samples # Features # Classes Source Codrna (Uzilov et al., 2006) 488,565 8 2 Open ML Sensorless (Bator, 2015) 58,509 48 11 UCI Covtype (Blackard, 1998) 581,012 52 7 scikit-learn Miniboone (Roe, 2010) 130,065 50 2 UCI Phishing (Hannousse & Yahiouche, 2020) 11,430 87 2 Kaggle

Table 1: Main characteristics of the data sets used in experiments, including number of samples, number of features, number of classes and sources.

Data corruption We corrupt train data samples at each iteration uniformly at random with probability η = 0.02.

Although we run the optimization using one sample per iteration, the datasets we use are available offline so that we have a data matrix denoted X Rn (d+1) whose last column represents the labels. This corresponds to n samples and d features.

For each feature j Jd K, we compute bµj and bσj the empirical mean and standard deviation respectively. We also sample a random unit vector u of size d and introduce corruption as follows:

For the label column, we introduce corruption by changing the value uniformly at random among the other possible modalities.

For features, we introduce corruption by replacing the original values with one of the following possibilities with equal probability:

a vector ξ sampled coordinatewise according to ξj = rj+1000 bσjν where rj is a value randomly picked in the column X ,j and ν is a sample from the Student distribution with 2.1 degrees of freedom. a vector ξ sampled coordinatewise according to ξj = bµj + 1000 bσjuj + z where z is a standard Gaussian. a vector ξ sampled according to ξ = bµ + 1000 bσ w where w is a uniformly sampled unit vector.

C Geometric convergence speed and relation to step size

The geometric Markov chain convergence stated in Theorem 1 occurs at a speed determined by the contraction factor ρ which mainly depends on the step size β and quantile p defining the iteration. Therefore, an explicit formulation of this dependency is necessary to precisely quantify the convergence speed. This question is lightly touched upon in (Yu et al., 2021) whose Proposition 2.1 is an analogous SGD ergodicity result. Like Theorem 1, the latter relies on the Markov chain theory presented in (Meyn & Tweedie, 1993). It is argued in (Yu et al., 2021) that a vanishing step size β 0 causes ρ to be close to one, leading to slow convergence but with smaller bias. However, these considerations remain asymptotic and do not quite address the convergence speed issue.

More generally, the precise estimation of the factor ρ goes back to the evaluation of the convergence speed of a Markov chain satisfying a geometric drift property. Near optimal results exist for chains with particular properties such as stochastic order (Lund et al., 1996; Roberts & Tweedie, 2000), reversibility (Jerison, 2019) or special assumptions on the renewal distribution (Berenhaut & Lund, 2001). Unfortunately, such properties do not hold for SGD. Let (θt)t 0 be a Markov chain satisfying the drift property:

( (1 λ)V (θ) for θ / C b for θ C

with λ (0, 1), b < + , V a real function such that V (θ) 1 for all θ and C a (bounded) small set (see (Meyn & Tweedie, 1993, Chapter 5)). Then, based on the available literature (Baxendale, 2005; Bednorz, 2013),

Published in Transactions on Machine Learning Research (10/2024)

(θt)t 0 converges as in Theorem 1 with ρ 1 λ3, the latter estimation being unimprovable without further information on (θt)t 0 (see the discussion following (Baxendale, 2005, Theorem 3.2)). For the specific setting of Theorem 1 (and more generally for SGD by setting p = 1), this only yields an excessively pessimistic estimate ρ 1 (pβµ)3 (17)

whereas it is reasonable to conjecture that ρ 1 pβµ. The suboptimality of (17) is felt in the uncorrupted case in Proposition 2 and Corollary 1 where one is tempted to set β of order 1/T, with T the horizon, reducing the bias to O(1/

T). However, this results in an unacceptable sample cost of order T 3 before convergence occurs. On the other hand, assuming the estimate ρ 1 pβµ holds, using a step size of order log(T)/T allows to combine fast convergence and near optimal statistical performance. Finally, note that in the corrupted case, the optimal statistical rate is O(η1 1/q) so that striking such a compromise is unnecessary.

D.1 Preliminary lemmas

Lemma 1. Grant Assumptions 1 and 2. For any θ, θ Rd and β 2 µ+L we have :

θ β L(θ) (θ β L(θ )) 2 (1 βµ)2 θ θ 2 (18)

Proof. For β 2 µ+L, we have:

θ β L(θ) (θ β L(θ )) 2

= θ θ 2 2β θ θ , L(θ) L(θ ) + β2 L(θ) L(θ ) 2

(1 β2µL) θ θ 2 β(2 β(µ + L)) L(θ) L(θ ), θ θ )

(1 β2µL) θ θ 2 β(2 β(µ + L))µ θ θ 2

= (1 β2µL 2βµ + β2µ(µ + L)) θ θ 2

= (1 βµ)2 θ θ 2,

where we used the inequalities : L(θ) L(θ ) 2 (µ + L) L(θ) L(θ ), θ θ µL θ θ 2 (19)

µ θ θ 2 L(θ) L(θ ), θ θ , (20)

valid for all θ, θ . Inequality (19) is stated, for example, in (Nesterov, 2014, Theoerem 2.1.12) (see also (Bubeck, 2015, Lemma 3.11) and (20) is just a characterization of strong convexity (see for instance (Nesterov, 2014, Theorem 2.1.9)).

Note that Lemma 1 is a simple generalization of the usual contraction property in strongly-convex optimization where θ is usually taken as θ . In that case, one has L(θ ) = 0. An example of such a result is given by (Bubeck, 2015, Theorem 3.10). A similar inequality to Lemma 1 was previously obtained in (Dieuleveut et al., 2020, Proposition 2) where convergence of SGD as Markov chain was studied.

In the sequel we will write gradient samples G(θ, ζ) and e G(θ, ζ) simply as G(θ) and e G(θ) respectively in order to lighten notation. The following lemma will be needed in the proof of Theorem 1.

Lemma 2. Let Assumptions 4 and 5 hold. Let θ Rd be fixed and let e G(θ) DI(θ) be a non corrupted gradient sample. Choosing the clipping threshold as τθ = Qp( e G(θ) ) for some p (0, 1) and denoting

αθ = min 1, τθ G(θ) the clipping factor and its average αθ = E αθ|θ, G(θ)= e G(θ) DI(θ) we have:

E[αθ e G(θ)] αθ L(θ) (1 p)1 1/q Aq θ θ + Bq , (21)

Published in Transactions on Machine Learning Research (10/2024)

τθ L(θ) + Qp( εθ )

L(θ) + (1 p) 1/q Aq θ θ + Bq . (22)

If Assumption 5 holds with q 2 then we also have

E αθ e G(θ) E[αθ e G(θ)] 2 Aq θ θ + Bq 2 + 5(1 p)τ 2 θ . (23)

Proof. We condition on the event that the sample G(θ) is not corrupted i.e. G(θ)= e G(θ) DI(θ). Noticing that 1 e G(θ) τθ = 1 1 e G(θ) >τθ and using the equality E[ e G(θ)] = L(θ), we find :

E[αθ e G(θ)] αθ L(θ) = E (αθ αθ) e G(θ) L(θ)

= E (1 αθ) e G(θ) L(θ) 1 e G(θ) τθ + E (αθ αθ) e G(θ) L(θ) 1 e G(θ) >τθ

= (αθ 1)E e G(θ) L(θ) 1 e G(θ) >τθ αθE e G(θ) L(θ) 1 e G(θ) >τθ

+ E τθ/ e G(θ) e G(θ) L(θ) 1 e G(θ) >τθ

= E 1 τθ/ e G(θ) e G(θ) L(θ) 1 e G(θ) >τθ .

Using our choice of τθ and Hölder s inequality, we find : E[αθ e G(θ)] αθ L(θ) E 1 e G(θ) >τθ 1 τθ/ e G(θ) e G(θ) L(θ)

E 1 e G(θ) >τθ e G(θ) L(θ)

(1 p)1 1/q E e G(θ) L(θ) q 1/q

(1 p)1 1/q Aq θ θ + Bq ,

where the second step corresponds to the inequality 1 τθ/ e G(θ) 1 under the event e G(θ) > τθ . Note also that we used Assumption 5 since θ is fixed. Inequality (21) is now proven. To show (22), we first write the inequality:

τθ = Qp( e G(θ) ) = Qp( L(θ) + εθ ) L(θ) + Qp( εθ ),

which holds since e G(θ) is a positive random variable. Further, using Assumption 5, we have:

1 p = P εθ > Qp( εθ ) E[ εθ q]

Qp( εθ )q Aq θ θ + Bq

It only remains to take the q-th root and plug the obtained bound on Qp( εθ ) back above to obtain (22).

To show (23), we define the event E = { e G(θ) τθ} and denote E its complement such that P(E) = p = 1 P(E). We write

E αθ e G(θ) E[αθ e G(θ)] 2 = p E αθ e G(θ) E[αθ e G(θ)] 2|E

+ (1 p)E αθ e G(θ) E[αθ e G(θ)] 2|E

p E αθ e G(θ) E[αθ e G(θ)] 2|E + 4(1 p)τ 2 θ

= p E αθ e G(θ) E[αθ e G(θ)|E] 2|E

+ p E[αθ e G(θ)|E] E[αθ e G(θ)] 2 + 4(1 p)τ 2 θ

= p E αθ e G(θ) E[αθ e G(θ)|E] 2|E

+ p (1 p)E[αθ e G(θ)|E] (1 p)E[αθ e G(θ)|E] 2 + 4(1 p)τ 2 θ

p E αθ e G(θ) E[αθ e G(θ)|E] 2|E + 4p(1 p)2τ 2 θ + 4(1 p)τ 2 θ

p E αθ e G(θ) E[αθ e G(θ)|E] 2|E + 5(1 p)τ 2 θ

Published in Transactions on Machine Learning Research (10/2024)

where we used the identity E[αθ e G(θ)] = p E[αθ e G(θ)|E]+(1 p)E[αθ e G(θ)|E] and the inequalities αθ e G(θ) τθ and p(1 p) 1/4 which hold in all cases. In addition, we have

p E αθ e G(θ) E[αθ e G(θ)|E] 2|E = p E e G(θ) E[ e G(θ)|E] 2|E E e G(θ) E[ e G(θ)] 2

E e G(θ) L(θ) q 2/q Aq θ θ + Bq 2.

The first inequality is obtained by applying Lemma 3 to each coordinate of e G(θ) while conditioning on θ. The second one uses Jensen s inequality and the third results from Assumption 5.

The statement of Lemma 2 appears to be novel and is enabled by the specific choice of the gradient norm quantile as clipping threshold.

Lemma 3. Let Y be a real random variable and E an event, then we have the inequality

P(E)E[(Y E[Y |E])2|E] E(Y E[Y ])2.

Proof. Define the conditional variance of a real random variable Y w.r.t. another variable Y as Var(Y |X) = E[(Y E[Y |X])2|X].

By Eve s law(see for instance (Blitzstein & Hwang, 2019, Theorem 9.5.4)) we have the identity

Var(Y ) = E[Var(Y |X)] + Var(E[Y |X]),

which entails the inequality Var(Y ) E[Var(Y |X)]. Applying the latter with X = 1E yields

E(Y E[Y ])2 = Var(Y ) P(E)E[(Y E[Y |E])2|E] + (1 P(E))E[(Y E[Y |E])2|E],

which implies the result.

We show the geometric ergodicity of the SGD Markov chain (θt)t 0 by relying on (Meyn & Tweedie, 1993, Theorem 15.0.1). We will show that the following function :

V (θ) := 1 + θ θ 2,

satisfies a geometric drift property. We define the action of the transition kernel Pβ,p on real integrable functions f over Rd through:

Pβ,pf(θ) = Z f(θ )Pβ,p(θ, dθ ) = E f(θ αθβG(θ)) .

We also define the variation operator:

f(θ) := Pβ,pf(θ) f(θ).

In many of our proofs, we will make use of the following adjustable bound.

Fact 1. For any real numbers a, b and positive ϵ, we have the inequality

2ab a2ϵ + b2/ϵ.

D.2 Preliminary properties for Markov chains

In order to prove Theorem 1, we set up the formalism we need from (Meyn & Tweedie, 1993) through the following definitions.

Definition 2 (Small set). A set C B(Rd) is called a small set if there exists an m > 0, and a non-trivial measure ν on B(Rd) such that for all θ C and B B(Rd),

P m β,p(θ, B) ν(B).

When this is the case, we say that C is (m, ν)-small.

Published in Transactions on Machine Learning Research (10/2024)

Minorization properties such as the one above are useful for proving the convergence of Markov chains. We define an analogous one via the notion of sampled chain.

Definition 3 (Sampled chain). Let a be a probability measure on Z+ i.e. such that a(n) 0 for n 0 and P n=0 a(n) = 1. Then the transition kernel for the sampled Markov chain w.r.t. the distribution a is

n=0 P n β,p(θ, A)a(n) for x Rd, A B(Rd).

We define the notion of a petite set for a sampled Markov chain.

Definition 4 (Petite set). Let a be a probability measure on Z+ defining a sampled chain. A set C B(Rd) is called a (a, ν)-petite if Ka(θ, B) ν(B),

for all θ C and B B(Rd) where ν is a non-trivial measure on B(Rd).

Note that an (m, ν)-small set is also (a, ν)-petite with a = δm. Finally, we define the norm of a measure w.r.t. a potential function.

Definition 5 (f-norm). Let f : Rd R be a function such that f 1 and let ν be a signed measure on Rd. We define the f-norm of ν as

ν f = sup g:|g| f |ν(g)| = sup g:|g| f

Z g(θ)ν(dθ) .

In particular, we have ν f ν TV.

D.3 Proof of Theorem 1

We will assume in this proof that q 2. The case q (1, 2) will be treated in Appendix D.5.

First, we define τ = infθ τθ. Note that Assumption 4 excludes the existence of θ such that e G(θ) = 0 almost surely, therefore, we have τ > 0.

Thanks to Assumption 4 and conditioning on θt = θ Rd for t 0, the distribution of θt+1 has a strictly positive density at least on a ball of radius βτ around θt. This implies that the chain is aperiodic since Pβ,p(θt, Wθt) > 0 for any neighborhood Wθt contained in the previous ball. Moreover, by induction, the distribution of θt+m has positive density at least on a ball of radius mβτ around θt. Thus for m high enough we have P m β,p(θt, A) > 0 for any set A with non zero Lebesgue measure. It follows that the Markov chain is irreducible w.r.t. Lebesgue s measure and is thus ψ-irreducible (see (Meyn & Tweedie, 1993, Chapter 4)).

For fixed θ, condition (9) implies β < 2 µ+L and using Lemma 1 and denoting αθ and αθ as in Lemma 2, we find:

Pβ,p θ θ 2 = E θ βαθG(θ) θ 2

ηE θ θ + βτθ 2 + (1 η)E θ αθβ e G(θ) θ 2

ηE θ θ + βτθ 2 + (1 η)E θ αθβ L(θ) θ 2

2β θ αθβ L(θ) θ , αθ e G(θ) αθ L(θ) + β2 αθ e G(θ) αθ L(θ) 2

ηE θ θ + βτθ 2 + (1 η)E θ αθβ L(θ) θ

+ β E[αθ e G(θ)] αθ L(θ) 2 + β2 Aq θ θ + Bq 2 + 5β2(1 p)τ 2 θ ,

where the last step uses that E αθ e G(θ) αθ L(θ) 2 = E αθ e G(θ) E[αθ e G(θ)] 2 + E[αθ e G(θ)] αθ L(θ) 2

and Lemma 2.

Published in Transactions on Machine Learning Research (10/2024)

Using Lemmas 1 and 2 and Assumption 5 and grouping the terms by powers of θ θ , we arrive at

Pβ,p θ θ 2 η + (1 η)(1 αθβµ)2 E θ θ 2+

2βE θ θ ητθ + (1 η)(1 αθβµ) E[αθ e G(θ)] αθ L(θ) +

β2 ητ 2 θ + (1 η) E[αθ e G(θ)] αθ L(θ) 2 + (Aq θ θ + Bq)2 + 5(1 p)τ 2 θ

E A θ θ 2 + 2βB θ θ + β2C

(A + βκ)E θ θ 2 + βB2

κ + β2C, (24)

where we used Fact 1 and defined the quantities A, B, C which may be bounded as follows

A = η + (1 η)(1 αθβµ)2 + 2β(η(L + Aq(1 p) 1/q) + (1 η)(1 αθβµ)(1 p) 1/q Aq)

+ β2((η + 5(1 p))(L + (1 p) 1/q Aq)2 + 4A2 q)

1 2β (1 η)αθµ η(L + Aq(1 p) 1/q) (1 η)(1 αθβµ)(1 p)1 1/q Aq +

β2 (1 η)(αθµ)2 + 2(η + 5(1 p)) L + (1 p) 1

q Aq 2 + 4A2 q

1 2β (1 η)pµ ηL (1 p) 1

q Aq 1 p(1 η) + β2 µ2 + 24ηL2 + 28A2 q

= 1 2βκ + β2 µ2 + 24ηL2 + 28A2 q ,

B = η(1 p) 1/q + (1 η)(1 αθβµ)(1 p)1 1/q Bq

(1 p) 1/q Bq(η + (1 p))

C = 4 + 2(η + 5(1 p))(1 p) 2/q B2 q

The above bounds use the simple properties p αθ 1, 0 η 1, 1 αθβµ 1 and (a + b)2 2a2 + 2b2

for all a, b.

Thanks to our choice of β, we get that A+βκ < 1. It is now easy to check that V (θ) = 1+ θ θ 2 satisfies the contraction

Pβ,p V (θ) (A + βκ) | {z }

V (θ) + 1 (A + βκ) + βB2

κ + C | {z }

We now define the set C = θ Rd , V (θ) 2eb/(1 eλ) for which we have:

2 V (θ) + eb1θ C. (25)

For such C, let C = diam(C) be its diameter and set m C = l C

m . As previously mentioned in the beginning of the proof, conditioning on θt = θ, the distribution of θt+m admits a positive density at least over a ball of radius mβτ around θ i.e. there exists h+m θ (ω) 0 satisfying

h+m θ (ω) > 0 for θ ω < mβτ and P m β,p(θ, A) Z

A h+m θ (ω)dω for A B(Rd).

We then let h C(ω) = infθ C h+m C θ (ω) and define the measure νC by:

A C h C(θ)dθ.

The above measure is non trivial since our choice of m C ensures that h C defines a non zero density at least on C. It follows that for all θ0 C, we have the following minorization property:

P m C β,p (θ0, A) νC(A) for all A B(Rd), (26)

Published in Transactions on Machine Learning Research (10/2024)

which implies that the set C is (m C, νC)-small and also (δm C, νC)-petite as a result. We have previously shown that the Markov chain (θt) is irreducible and aperiodic. Moreover, we have exhibited a petite set C (via (26)) towards which the geometric drift property (25) holds. Thus, condition (iii) of (Meyn & Tweedie, 1993, Theorem 15.0.1) is fulfilled.

By the latter result, it follows that the chain (θt) admits a unique invariant probability measure πβ,p and there exist r > 1 and M < such that: X

t 0 rt Pβ,p(θ0, ) πβ,p V MV (θ0). (27)

Taking ρ = r 1 and using that ν V ν TV for any signed measure ν concludes the proof.

D.4 Proof of Proposition 1

We consider the case q 2. Since the distribution πβ,p is invariant by the transition kernel Pβ,p, we can deduce that for θ πβ,p, we have

E θ θ 2 = E θ βαθG(θ) θ 2.

From here, we can follow similar computations to those leading to (24) in the proof of Theorem 1. By setting p = 1 η and using that q 2, we additionally have the bounds B 3η1 1/q Bq and C 16B2 q.

Hence, we have that:

E θ θ 2 (A + βκ)E θ θ 2 + βB2

= E θ θ 2 βB2

κ(1 A βκ) + β2C 1 A βκ. (28)

Note that (9) entails 1 A βκ βκ β2(µ2 + 24ηL2 + 28A2 q) 3βκ/4. (29)

Plugging this into (28) and using (10), we find

E θ θ 2 12η2 2/q B2 q κ2 + 64βB2 q 3κ 34η2 2/q B2 q κ2

which implies the result.

D.5 The case q (1, 2)

Similar results hold for the case q (1, 2) but require a different proof given below.

Proposition 4. Let Assumptions 1-5 hold with q (1, 2) and let QC-SGD be run with quantile p [η, 1 η]. Assume that κ := (1 η)pµ qηL q(η(1 p) 1/q + (1 p)1 1/q)Aq > 0, (30)

and take a step-size satisfying

86(L + Aq)q

1 q 1 . (31)

then the generated Markov chain (θt)t converges geometrically to a unique invariant measure πβ,1 η as in Theorem 1. In addition, for p = 1 η, β η/κ and θ πβ,1 η, we have

E θ θ q 128 η1 1/q Bq

Published in Transactions on Machine Learning Research (10/2024)

Proof. We first need to prove that convergence holds as stated in Theorem 1. We establish the properties of irreducibility, aperiodicity and exhibit a petite similarly as before. However, the demonstration of a geometric drift property such as (25) requires a different approach.

In this proof, we will use the following inequalities valid for all positive x, y and ε and q (1, 2)

(x + y)q 2q 1(xq + yq), (32)

(x + y)q xq + qxq 1y + yq, (33)

(a consequence of the inequality (1 + a)q 1 + qa + aq for a > 0),

(x + y)q 1 xq 1 + yq 1, (34)

q (y/ε)q/(q 1) (35)

which is a consequence of Young s inequality applied to the pair xε and y/ε with exponent q and its conjugate. We first write

Pβ,p θ θ q = E θ βαθG(θ) θ q ηE θ θ + βτθ q + (1 η)E θ αθβ e G(θ) θ q

Defining the notation Eθ[ ] = E[ |θ], we have

E θ αθβ e G(θ) θ q = E θ αθβ L(θ) θ 2

2β θ αθβ L(θ) θ , αθ e G(θ) αθ L(θ) + β2 αθ e G(θ) αθ L(θ) 2 q/2

E θ αθβ L(θ) θ 2 2β θ αθβ L(θ) θ , αθ e G(θ) αθ L(θ) +

β2( αθ e G(θ) Eθ[αθ e G(θ)] 2 + 2 αθ e G(θ) Eθ[αθ e G(θ)], Eθ[αθ e G(θ)] αθ L(θ) +

Eθ[αθ e G(θ)] αθ L(θ) 2) q/2

E θ αθβ L(θ) θ 2 2β θ αθβ L(θ) θ , αθ e G(θ) αθ L(θ) +

β2(2 αθ e G(θ) Eθ[αθ e G(θ)], Eθ[αθ e G(θ)] αθ L(θ) + Eθ[αθ e G(θ)] αθ L(θ) 2) q/2

+ βq E αθ e G(θ) Eθ[αθ e G(θ)] q

E θ αθβ L(θ) θ 2 2β θ αθβ L(θ) θ , Eθ[αθ e G(θ)] αθ L(θ) +

β2 Eθ[αθ e G(θ)] αθ L(θ) 2 q/2 + βq E αθ e G(θ) Eθ[αθ e G(θ)] q

E θ αθβ L(θ) θ + β Eθ[αθ e G(θ)] αθ L(θ) q + βq E αθ e G(θ) Eθ[αθ e G(θ)] q,

where we used (34), conditioned on θ, then used Jensen s inequality for Eθ and a Cauchy-Schwarz inequality. We focus on the last term. Defining the event E = { e G(θ) τθ} such that P(E) = p, we have

E αθ e G(θ) Eθ[αθ e G(θ)] q = (1 p)E[ αθ e G(θ) Eθ[αθ e G(θ)] q|E]

+ p E[ αθ e G(θ) Eθ[αθ e G(θ)] q|E]

(1 p)(2τθ)q + p E[ αθ e G(θ) Eθ[αθ e G(θ)] q|E].

In addition, using (34) twice, we find

E[ αθ e G(θ) Eθ[αθ e G(θ)] q|E] 2q 1 E[ αθ e G(θ) αθ L(θ) q|E]+

αθ L(θ) Eθ[αθ e G(θ)] q

2q 1 2q 1E[ e G(θ) L(θ) q|E] + 2q 1(1 αθ)q L(θ) q+

αθ L(θ) Eθ[αθ e G(θ)] q .

Published in Transactions on Machine Learning Research (10/2024)

Therefore, from the two previous displays and using Assumption 1, Lemma 2 and the inequality p E[ e G(θ) L(θ) q|E] E[ e G(θ) L(θ) q], we get

E αθ e G(θ) Eθ[αθ e G(θ)] q (1 p)(2τθ)q + 22q 2p(1 αθ)q Lq θ θ q

+ (22q 2 + 2q 1p(1 p)q 1)(Aq θ θ + Bq)q.

We also have

(1 p)τ q θ (1 p)((L + (1 p) 1/q Aq) θ θ + (1 p) 1/q Bq)q

((L + Aq) θ θ + Bq)q.

Plugging into the previous display and simplifying leads to

E αθ e G(θ) Eθ[αθ e G(θ)] q 22q 2((L + Aq) θ θ + Bq)q + 22q 2p(1 αθ)q(L θ θ )q

+ (22q 2 + 2q 1p(1 p)q 1)(Aq θ θ + Bq)q

22q 1((L + Aq) θ θ + Bq)q + (22q 2 + 2q 1p(1 p)q 1)(Aq θ θ + Bq)q

14((L + Aq) θ θ + Bq)q. (36)

Using these inequalities along with Lemma 2 to bound E θ θ q, we find that

Pβ,p θ θ q ηE θ θ + βτθ q + (1 η)βq E αθ e G(θ) Eθ[αθ e G(θ)] q

+ (1 η)E θ αθβ L(θ) θ + β Eθ[αθ e G(θ)] αθ L(θ) q

ηE θ θ q + βqτ q θ + qβ θ θ q 1τθ + (1 η)E (1 αθβµ)q θ θ q

+ βq Eθ[αθ e G(θ)] αθ L(θ) q + qβ θ θ q 1 Eθ[αθ e G(θ)] αθ L(θ)

+ (1 η)βq E αθ e G(θ) Eθ[αθ e G(θ)] q

E (1 (1 η)αθβµ) θ θ q + qβE θ θ q 1 ητθ

+ (1 η) Eθ[αθ e G(θ)] αθ L(θ) + βq ητ q θ + (1 η) Eθ[αθ e G(θ)] αθ L(θ) q + αθ e G(θ) Eθ[αθ e G(θ)] q

1 β((1 η)pµ qη(L + (1 p) 1/q Aq) q(1 η)(1 p)1 1/q Aq)

+ βq2q 1(η(L + (1 p) 1/q Aq)q + (1 η)14(L + Aq)q

+ (1 η)(1 p)q 1Aq q) E θ θ q + qβE θ θ q 1 (1 η)(1 p)1 1/q Bq

+ η(1 p) 1/q Bq + βq2q 1 η(1 p) 1Bq q + (1 η)(1 p)q 1Bq q + 14Bq q

AE θ θ q + qβBE θ θ q 1 + βq2q 1Cq (37)

where the second inequality uses Lemma 1 and (33) twice, the third inequality rearranges according to powers of θ θ , the fourth one applies Lemma 2 and (36) and the last inequality defines the factors A, B and C. Since p 1 η, these satisfy:

A 1 βκ + βq2q 1(16)(L + Aq)q, B 2Bq and C 24/q Bq

Applying (35), we have for all ε > 0 that

BE θ θ q 1 ε

q q 1 E θ θ q

q/(q 1) + B/ε)q

q κ E θ θ q

where the second inequality corresponds to the choice ε = κ 8(q 1) q 1

q . Plugging back into (37) and using

condition (31) ensures that A+βκ /8 < 1. Defining V (θ) = 1+ θ θ q, we get that satisfies the contraction

Pβ,p V (θ) (A + βκ /8) | {z }

V (θ) + 1 (A + βκ /8) + βBq κ

1 q + βq2q 1Cq

Published in Transactions on Machine Learning Research (10/2024)

From here, we use the same argument as the previous proof of Theorem 1 to deduce that the Markov chain converges to a unique invariant distribution πβ,p. Now, since convergence occurs to a limit distribution πβ,p, we have, as before, for θ πβ,p the identity

E θ θ q = E θ βαθG(θ) θ q.

By setting p = 1 η, we can show that in fact B 2η1 1/q Bq. Plugging back into (37) and using (38) and (31), we find

E θ θ q 1 (7/8)βκ + 32βq(L + Aq)q E θ θ q + β 2η1 1/q Bq q κ

1 q + 32βq Bq q

1 (3/8)βκ E θ θ q + β 2η1 1/q Bq q κ

1 q + 32βq Bq q

Now using the condition β η/κ and rearranging the inequality, we finally arrive at

q 2(8(q 1))q 1 + 32

128 η1 1/q Bq

which is the desired result.

D.6 Proof of Proposition 2

We use the invariance of πβ,p by the transition kernel Pβ,p. For real λ, this implies the equality

E exp λ2 θ θ 2 = E exp λ2 θ αθβG(θ) θ 2 . (39)

We then write:

θ αθβG(θ) θ 2 = θ αθβ L(θ) θ 2 + β2 αθG(θ) αθ L(θ) 2

2β θ αθβ L(θ) θ , αθG(θ) E[αθG(θ)] + E[αθG(θ)] αθ L(θ) .

We also have the inequality

αθG(θ) αθ L(θ) 2 2 αθG(θ) E[αθG(θ)] 2 + 2 E[αθG(θ)] αθ L(θ) 2.

Conditioning upon θ, we have that αθG(θ) E[αθG(θ)] 2τθ 2τ. Moreover, the vector αθG(θ) E[αθG(θ)] is centered, therefore it is sub-Gaussian with constant 2τ (see (Vershynin, 2018, Proposition 2.5.2)). Still conditioning on θ, using these two properties and a Cauchy-Schwarz inequality, we find:

E exp λ2 2β2 αθG(θ) E[αθG(θ)] 2 2β θ αθβ L(θ) θ , αθG(θ) E[αθG(θ)]

exp 8λ2β2τ 2 + 16λ4β2τ 2 θ αθβ L(θ) θ 2 .

Putting everything together in (39), we get:

E exp(λ2 θ θ 2) E exp λ2 (1 + 16λ2β2τ 2) θ αθβ L(θ) θ 2 + 8β2τ 2

+ 2β2 E[αθG(θ)] αθ L(θ) 2 + 2β θ αθβ L(θ) θ E[αθG(θ)] αθ L(θ)

E exp λ2 (1 + 16λ2β2τ 2 + ϵ) θ αθβ L(θ) θ 2 + 8β2τ 2

+ 2β2(1 + 1/(2ϵ)) E[αθG(θ)] αθ L(θ) 2 ,

Published in Transactions on Machine Learning Research (10/2024)

where we used Fact 1. Now, recalling that αθ p, we set ϵ = αθβµ/2 and restrict λ to λ (4τ p

E exp λ2 θ θ 2 E exp λ2 (1 + 16λ2β2τ 2 + ϵ) θ αθβ L(θ) θ 2 + 8β2τ 2

+ 2β2(1 + 1/(2ϵ)) E[αθG(θ)] αθ L(θ) 2

E exp λ2 (1+αθβµ) θ αθβ L(θ) θ 2 + 8β2τ 2 + 4β

E[αθG(θ)] α L(θ) 2

E exp λ2 (1 + αθβµ)(1 αθβµ)2 θ θ 2 + 8β2τ 2 + 4β

αθµ (1 p)1 1/q Bq 2

E exp λ2 (1 αθβµ)(1 (αθβµ)2) θ θ 2 + 4β2(2τ 2 + B 2 q/p) ,

where we used Lemma 1, inequality (21) from Lemma 2 (recall that η = 0 in this context) and the imposed bound relating β and p.

Finally, using Jensen s inequality and the fact that αθ p 1/2 we get:

E exp λ2 θ θ 2 exp 8λ2β2

τ 2 + B 2 q

αβµ + (αβµ)2 (αβµ)3

τ 2 + B 2 q

which concludes the first part of the proof. We now consider the corrupted case η > 0. Let λ > 0 and write :

E exp λ θ θ = E exp λ θ αθβG(θ) θ

ηE exp λ θ αθβ q G(θ) θ + (1 η)E exp λ θ αθβ e G(θ) θ

ηE exp λ( θ θ + βτθ) + (1 η)E exp λ θ αθβ L(θ) θ

+ λβ αθ e G(θ) E[αθ e G(θ)] + E[αθ e G(θ)] αθ L(θ)

ηE exp λ θ θ eλβτ

+ (1 η)E exp λ(1 pβµ) θ θ exp λβ(2τ + (1 p)1 1/q Bq) ,

where we used Lemma 1, the inequality αθ p, the inequality αθ e G(θ) E[αθ e G(θ)] 2τθ 2τ and (21) from Lemma 2. Using Hölder s inequality, this leads to :

E exp λ θ θ 1 η 1 ηeλβτ

1/(pβµ) exp λ(2τ + (1 p)1 1/q Bq)/(pµ)

Now we use the inequality log 1 η 1 ηeλβτ βτλ/ log(1/η)2

1 βτλ/ log(1/η) valid for λ 0 which leads to:

1 η 1 ηeλβτ

1/(pβµ) exp 2λτ pµ log(1/η)2

for 0 λ log(1/η)

Using that η < 1/2, we find that for 0 λ log(1/η)

2βτ , the following inequality holds :

E[exp(λ θ θ )] exp λ

2τ 1 + 1 log(1/η)2

+ (1 p)1 1/q Bq

pµ 7τ + (1 p)1 1/q Bq .

Noticing that β 1

µ allows to finish the proof.

D.7 Unimprovability of the sub-exponential property for η > 0

We consider the Markov chain

( αXt + ξ w.p. 1 η Xt + τ w.p. η

Published in Transactions on Machine Learning Research (10/2024)

Assuming that the distribution of the noise ξ has a density, one can show that the chain is aperiodic and satisfies a minorization property as in the proof of Theorem 1 (see Appendix D.3).

Defining V (x) = 1 + x, we can show that V verifies a geometric drift property similar to (25). Consequently, Theorem 15.0.1 of (Meyn & Tweedie, 1993) applies to the chain (Xt)t 0 and implies that it converges geometrically to a limit distribution π analogously to the claim of Theorem 1.

We denote M k k = E|X|k the absolute moments of X for k 1 and show that Mk = Ω(k) (we merely provide a sketch and do not attempt to explicitly compute the involved constants). For X π following the invariant measure, using the recursion defining Xt and the positivity of ξ, it is easy to establish the inequality for k 1

(1 η)(1 αk)M k k η

τ j M k j k j

where one may use the convention that M0 = 1. We now postulate the induction hypothesis Mj Cj up to j = k 1 for some k > 1 and C > 0. Using Stirling s formula, we find:

(1 η)(1 αk)

τ j C(k j) k j =

k! j!(k j)!τ j C(k j) k j

k j(k j) kkτ j C(k j) k j

jj(k j)k j =

where denotes an inequality up to a universal constant and we took the term j = 1 in the last step. It is only left to set C small enough such that τη C(1 η)(1 α) 1 in order to finish the induction. It follows that Mk = Ω(k) implying that π may be sub-exponential but cannot be sub-Gaussian since that would require Mk = O(

k) (see (Vershynin, 2018, Chapter 2) for a reference).

D.8 Proof of Corollary 1

We need the following lemma.

Lemma 4. Let X be a real sub-Gaussian random variable with constant K then, with probability at least δ, we have : |X| K p

Proof. Using Chernoff s method, we find for t > 0 and λ > 0:

P |X| > t = P λ2X2 > λ2t2 = P exp(λ2X2) > exp(λ2t2)

E exp(λ2X2)e λ2t2 exp λ2(K2 t2) .

Choosing λ=1/K, we have exp 1 (t/K)2 δ t K p

log(e/δ) and the result follows.

By Theorem 1, the Markov chain (θt)t 0 is geometrically converging to the invariant distribution πβ,p w.r.t. the Total Variation distance so that for any event E B(Rd), we have: P θT E Pθ πβ,p θ E MρT V (θ0). (40)

Proposition 2 states that, in the absence of corruption, for θ πβ,p, the variable θ θ is sub-Gaussian

with constant K = 4 q

2β(B2q+τ 2)

pµ . It is only left to combine this conclusion with Lemma 4 in order to obtain the claimed bound.

Published in Transactions on Machine Learning Research (10/2024)

D.9 Proof of Corollary 2

We assume without loss of generality that T is a multiple of N. Note that according to the assumptions, the estimators θ(n) T for n JNK are independent and for each n. For positive ϵ < 1 define the events

En := θ(n) T θ η 1 1

20 κϵ . We first assume that PN n=1 1En > N/2 then there exists i JNK such that

r(bi) N/2 = θ(bi) T θ(i ) T θ(bi) T θ + θ(i ) T θ 2η1 1

Moreover, among the N/2 estimators closest to θ(bi) T , at least one of them θ(i ) T satisfies θ(i ) T θ

20 κϵ thus we find :

bθ θ = θ(bi) T θ θ(bi) T θ(i ) T + θ(i ) T θ

r(bi) N/2 + η1 1

20 κϵ 3η1 1

20 κϵ . (41)

Notice that setting ϵ = 1/2 immediately yields (13). We now show that PN n=1 1En > N/2 happens with high probability. Thanks to Theorem 1 and Proposition 1, we have:

P(En) ϵ2 + MρT/N 1 + θ0 θ 2

Consequently, the variables 1En are stochastically dominated by Bernoulli variables with parameter ϵ2+ϵ so that their sum is stochastically dominated by a Binomial random variable S := Bin(N, ϵ2 +ϵ ). We compute :

n=1 1En < N/2 = P N X

n=1 1En > N/2 P S ES > N/2 (ϵ2 + ϵ )N

exp 2N(1/2 ϵ2 ϵ )2

exp 2N(1/2 1/4 MρT/N(1 + θ0 θ 2))2

exp 2N(1/4 1/15)2 exp log(1/δ) = δ

where we used Hoeffding s inequality, the choice ϵ = 1/2 and the fact that our condition on T implies MρT/N(1 + θ0 θ 2))2 1/15. The last inequalities result from our condition on N.

D.10 Proof of Theorem 2

As previously done in the proof of Theorem 1, we show that the Markov chain is aperiodic. Note that since L has finite lower bound infθ L, we can replace it with 1 + L(θ) infθ L and assume it is positive in the rest of the proof without loss of generality. We will now show that it satisfies a drift property. Let θ Θ be fixed, using Assumptions 1 and 3, we have :

E L θ αθβG(θ) L(θ) E h β L(θ), αθG(θ) + Lβ2

E h η β L(θ), αθ q G(θ) + Lβ2

αθ q G(θ) 2 +

(1 η) β L(θ), αθ e G(θ) + Lβ2

αθ e G(θ) 2 i

Published in Transactions on Machine Learning Research (10/2024)

Note that we have αθ e G(θ) , αθ q G(θ) τθ, therefore

E L θ αθβG(θ) L(θ) E h η β L(θ), αθ q G(θ) +

(1 η) β L(θ), αθ e G(θ) i + Lβ2τ 2 θ 2 .

Further, we write

E L(θ), αθ e G(θ) = L(θ), E[αθ e G(θ)] αθ L(θ) + αθ L(θ)

= αθ L(θ) 2 + L(θ), E[αθ e G(θ)] αθ L(θ) .

Plugging back and using Cauchy-Schwartz and the inequality αθ q G(θ) τθ leads to

E L θ αθβG(θ) L(θ) ηβ L(θ), E[αθ q G(θ)] β(1 η)αθ L(θ) 2

β(1 η) L(θ), E[αθ e G(θ)] αθ L(θ) + Lβ2τ 2 θ 2 ηβτθ L(θ) β(1 η)αθ L(θ) 2+

β(1 η) L(θ) E[αθ e G(θ)] αθ L(θ) + Lβ2τ 2 θ 2 .

We now use the inequalities τθ L(θ) + Qp( εθ ) (see Lemma 2) and (a + b)2 2a2 + 2b2, to find that

E L θ αθβG(θ) L(θ) β L(θ) 2 (1 η)αθ Lβ η + Lβ2Qp( εθ )2+

β L(θ) ηQp( εθ ) + (1 η) E[αθ e G(θ)] αθ L(θ) .

Next, letting ϵ > 0, and using Fact 1 gives

L(θ) ηQp( εθ ) + (1 η) E[αθ e G(θ)] αθ L(θ) ϵ L(θ) 2/2+ ηQp( εθ ) + (1 η) E[αθ e G(θ)] αθ L(θ) 2/(2ϵ).

We now plug back with the choice ϵ = p(1 η)/2 and use that αθ p to find

E L θ αθβG(θ) L(θ) β L(θ) 2 3p(1 η)/4 Lβ η + Lβ2Qp( εθ )2+

β ηQp( εθ ) + (1 η) E[αθ e G(θ)] αθ L(θ) 2

Finally, we use Lemma 2 to bound the terms Qp( εθ ) and E[αθ e G(θ)] αθ L(θ) leading to

E L θ αθβG(θ) L(θ) β L(θ) 2 3p(1 η)/4 Lβ η + Lβ2(1 p) 2

βB2 q η(1 p) 1

q + (1 η)η1 1

β L(θ) 2 3p(1 η)/4 Lβ η +

βB2 q (1 p) 2

q (Lβ + 2η2) + 2η2 2

p(1 η) . (42)

By assumption, we have 3p(1 η)/4 Lβ η > 0. Define the quantity ξ = B2 q (1 p) 2

q (Lβ+2η2)+2η 2 2

the set C := n θ Rd : 1

2 L(θ) 2 ξ 3p(1 η)/4 Lβ η o . By assumption, C is bounded and it is clear that the

right hand side in (42) is negative outside C. Define the function V (θ) = L(θ)/(βξ), which is positive and satisfies: V (θ) 1 + 2 1θ C. (43)

Published in Transactions on Machine Learning Research (10/2024)

In addition, we show similarly to Theorem 1 that the set C is small and, therefore, also petite according to the definitions of (Meyn & Tweedie, 1993, Chapter 5). Since V is everywhere finite and bounded on C (because the latter is compact), the conditions of (Meyn & Tweedie, 1993, Theorem 11.3.4) are fulfilled implying that the chain is Harris recurrent.

We have shown that the Markov chain verifies the fourth condition of (Meyn & Tweedie, 1993, Theorem 13.0.1). This allows us to conclude that the Markov chain is ergodic i.e. we have for any initial measure λ that λP t πβ,p TV 0 and the following sum is finite X

λP t β,p πβ,p TV < .

In addition, by (Meyn & Tweedie, 1993, Proposition 13.3.2) the terms in the above sum are non-increasing which implies that λP t β,p πβ,p TV = O(t 1) and the result follows.

D.11 Proof of Proposition 3

By Theorem 2, the assumptions imply that the Markov chain (θt)t 0 converges to an invariant distribution πβ,p. For θ πβ,1 η, by invariance of πβ,1 η, we have that the variables L(θ αθβG(θ)) and L(θ) are identically distributed. Taking the expectation w.r.t. θ, this implies the identity E[L(θ αθβG(θ))] = E[L(θ)]. Plugging into Inequality (42) from the proof of Theorem 2, we find

E L(θ) 2 B2 q (1 p) 2

q (Lβ + 2η2) + 2η2 2

p(1 η) 3p(1 η)/4 Lβ η

B2 q 3(1 p) 2

q η2 + 2η2 2

p(1 η) 3p(1 η)/4 Lβ η

p(1 η) 3p(1 η)/4 Lβ η

where we used the choices β η2

L and p = 1 η.