# Anytime Guarantees under Heavy-Tailed Data

Matthew J. Holland, Osaka University (matthew-h@ar.sanken.osaka-u.ac.jp)

Under data distributions which may be heavy-tailed, many stochastic gradient-based learning algorithms are driven by feedback queried at points with almost no performance guarantees of their own. Here we explore a modified anytime online-to-batch mechanism which, for smooth objectives, admits high-probability error bounds while requiring only lower-order moment bounds on the stochastic gradients. Using this conversion, we can derive a wide variety of anytime robust procedures, for which the task of performance analysis can be effectively reduced to regret control, meaning that existing regret bounds (for the bounded gradient case) can be robustified and leveraged in a straightforward manner. As a direct takeaway, we obtain an easily implemented stochastic gradient-based algorithm for which all queried points formally enjoy sub-Gaussian error bounds, and which in practice shows noteworthy gains on real-world data applications.

Introduction

The ultimate goal of many learning tasks can be formulated as a minimization problem:

$$\min_{h} R(h), \quad \text{s.t. } h \in \mathcal{H}. \tag{1}$$

What characterizes this as a learning problem is that $R$ (henceforth called the true objective) is unknown to the learner, who must choose from the hypothesis class $\mathcal{H}$ a final candidate based only on incomplete and noisy (stochastic) feedback related to $R$ (Haussler 1992; Vapnik 1999). One of the most ubiquitous and well-studied feedback mechanisms is the stochastic gradient oracle (Hazan 2016; Nemirovsky and Yudin 1983; Shalev-Shwartz 2012), in which the learner generates a sequence of candidates $(h_t)$ based on a sequence of random sub-gradients $(G_t)$, which are unbiased in the following sense:

$$\mathbb{E}\left[G_t \mid G_{[t-1]}\right] \in \partial R(h_t), \quad \text{for all } t \ge 1. \tag{2}$$

Here $\partial R(h)$ denotes the sub-differential of $R$ evaluated at $h$, and we denote sub-sequences by $G_{[t]} := (G_1, \ldots, G_t)$. (More strictly speaking, for each $t$, this inclusion holds almost surely over the random draw of $G_{[t]}$, and the conditional expectation is that of $G_t$ conditioned on the sigma-algebra generated by $G_{[t-1]}$; see Ash and Doléans-Dade (2000, Ch. 5-6) for additional background on probabilistic foundations.)

Our problem of interest is that of efficiently minimizing $R(\cdot)$ over $\mathcal{H}$ when the noisy feedback is potentially heavy-tailed, i.e., when for all steps $t$ it is unknown whether the distribution of $G_t$ is congenial in the sub-Gaussian sense, or heavy-tailed in the sense of having infinite or undefined higher-order moments (Chen, Su, and Xu 2017). By "efficiently," we mean procedures with performance guarantees (high-probability error bounds) on par with the case in which the learner knows a priori that the feedback is sub-Gaussian (Devroye et al. 2016; Nazin et al. 2019). Recently, notable progress has been made on this front, with a common theme of making principled modifications (e.g., truncation, data splitting plus validation, etc.) to the raw feedback $(G_t)$ before passing it to a more traditional stochastic gradient-based update, in order to achieve sub-Gaussian bounds (with optimal dependence on the confidence level) while assuming just finite variance (Davis et al. 2019; Gorbunov, Danilova, and Gasnikov 2020; Nazin et al. 2019).
Here we focus on two key limitations of the current state of the art: (a) many robust learning algorithms only have such guarantees when $R$ is strongly convex (Chen, Su, and Xu 2017; Davis et al. 2019; Holland and Ikeda 2019); and (b) without strong convexity, sub-Gaussian guarantees are unavailable for the iterates $(h_t)$ being queried in (2), only for a running average of these iterates (Gorbunov, Danilova, and Gasnikov 2020; Nazin et al. 2019). While there exist general-purpose anytime online-to-batch conversions to ensure that the points being queried have guarantees (Cutkosky 2019), even the most refined conversions either require bounded gradients or hold only in expectation (Joulani et al. 2020), meaning that under potentially heavy-tailed gradients, a direct anytime conversion based on existing results fails to achieve the desired guarantees.

In this paper, in order to address the issues described above, we introduce a modified mechanism for making the anytime conversion (Algorithm 1), which is both easy to implement and robust to the underlying data distribution. More concretely, assuming only that $R$ is convex and smooth, and that the raw gradients have finite variance, we obtain martingale concentration guarantees for truncated gradients queried at a moving average (Lemma 2), which lets us reduce the problem of obtaining error bounds to that of regret control (see the section on anytime robust algorithm design), substantially broadening the domain to which the anytime conversion of Cutkosky (2019) can be applied. Regret control for online learning algorithms (under bounded gradients) is a well-studied problem, and in the section on anytime robust learning algorithms we show that existing well-known regret bounds can be readily modified to utilize the control offered by Lemma 2. In particular, we look at vanilla FTRL (Lemma 4), mirror descent (Lemma 5), and AO-FTRL (Theorem 8), giving us anytime robust analogues to results given in expectation by Joulani et al. (2020). As a natural takeaway, we obtain a stochastic gradient-based procedure for which all queried points have sub-Gaussian error bounds (Corollary 7), a methodological improvement over the averaging scheme of Nazin et al. (2019), which we also demonstrate empirically to have substantial practical benefits. Our results are stated with a high degree of generality (they hold on any reflexive Banach space), and taken together are suggestive of an appealing general-purpose learning strategy.

Outline of the paper

In the Preliminaries section we set out our basic notation, describe the basic principle underlying anytime conversions, and highlight the impact it has on the excess risk. Our general-purpose robustified anytime algorithm design is described in detail in the section on anytime robust algorithm design, where we establish control over the gradient estimation error, and show how this leads to a robust analogue of existing regret-based excess error bounds, valid under potentially heavy-tailed gradients. We then put these general results to work in the section on anytime robust learning algorithms, where we illustrate how our techniques apply to several important classes of online learning algorithms, yielding variants which are both anytime and robust. In the empirical analysis section, we carry out experiments using real-world benchmark datasets to explore the practical utility of anytime feedback. Finally, we note that formal definitions of key concepts, support lemmas, and detailed proofs of all results in the main text are provided in the appendix (supplementary materials).
Preliminaries

Terms and Notation

The underlying space. For the underlying hypothesis class $\mathcal{H}$, we shall assume $\mathcal{H} \subseteq \mathcal{V}$, where $(\mathcal{V}, \|\cdot\|)$ is a normed linear space. For any normed space $(\mathcal{V}, \|\cdot\|)$, we denote by $\mathcal{V}^*$ the usual dual of $\mathcal{V}$, namely the set of all continuous linear functionals on $\mathcal{V}$. As is traditional, the dual norm is defined $\|f\|_* := \sup\{|f(v)| : \|v\| \le 1,\ v \in \mathcal{V}\}$ for $f \in \mathcal{V}^*$. We denote the distance from a point $v$ to a set $A \subseteq \mathcal{V}$ by $\mathrm{dist}(v; A) := \inf\{\|v - u\| : u \in A\}$. We use the notation $\langle \cdot, \cdot \rangle$ to represent the coupling function between $\mathcal{V}^*$ and $\mathcal{V}$, namely $\langle v^*, v\rangle := v^*(v)$ for each $v^* \in \mathcal{V}^*$ and all $v \in \mathcal{V}$; when $\mathcal{V}$ is a Hilbert space this coincides with the usual inner product. We denote the extended real line by $\overline{\mathbb{R}}$.

Convexity and smoothness. We say that a function $f : \mathcal{V} \to \overline{\mathbb{R}}$ is convex if for all $0 < \alpha < 1$ and $u, v \in \mathcal{V}$, we have $f(\alpha u + (1-\alpha)v) \le \alpha f(u) + (1-\alpha)f(v)$. The effective domain of $f$ is defined $\mathrm{dom}\, f := \{u \in \mathcal{V} : f(u) < \infty\}$. A convex function $f : \mathcal{V} \to \overline{\mathbb{R}}$ is said to be proper if $-\infty < f$ and $\mathrm{dom}\, f \neq \emptyset$. For any proper convex function $f$, the sub-differential of $f$ at $h \in \mathcal{V}$ is $\partial f(h) := \{v^* \in \mathcal{V}^* : f(u) \ge f(h) + \langle v^*, u - h\rangle,\ \forall u \in \mathcal{V}\}$. For readability, we will sometimes make statements involving multi-valued functions; for example, the statement $\langle \partial f(h), u\rangle = a$ is equivalent to the statement $\langle v^*, u\rangle = a$ for all $v^* \in \partial f(h)$. When we say a certain point $h^*$ is a stationary point of $f$ on $\mathcal{H}$, we mean that $h^* \in \mathcal{H}$ and $0 \in \partial f(h^*)$. If the convex function $f$ happens to be (Gateaux) differentiable at some $h \in \mathcal{H}$, then the sub-differential contains a unique element, $\partial f(h) = \{\nabla f(h)\}$, the gradient of $f$ at $h$. When we say that $f$ is $\lambda$-smooth on some open convex set $U \subseteq \mathcal{V}$, we mean that $\|\nabla f(h) - \nabla f(h')\|_* \le \lambda\|h - h'\|$ for all $h, h' \in U$. For any sub-differentiable function $f$, we write $D_f(u; v) := f(u) - f(v) - \langle \partial f(v), u - v\rangle$; when $f$ happens to be convex and differentiable, this becomes the usual Bregman divergence induced by $f$.

Miscellaneous notation. For indexing purposes, we denote the set of all positive integers no greater than $k$ by $[k] := \{1, \ldots, k\}$. We denote $\alpha_{1:t} := \sum_{i=1}^{t}\alpha_i$ for any integer $t \ge 1$, using the convention $\alpha_{1:0} := 0$ as needed. We also denote sub-sequences in a similar fashion, with $\alpha_{[t]} := (\alpha_1, \ldots, \alpha_t)$; this applies not only to $(\alpha_t)$, but also to $(h_t)$, $(G_t)$, and other sequences used throughout the paper. Indicator functions (i.e., Bernoulli random variables) are typically denoted $I\{\text{event}\}$.

Anytime Conversions

As preparation, we start with almost no assumptions on the learning algorithm or feedback-generating process. Let $(h_t)$ be an arbitrary sequence of candidates, henceforth referred to as the ancillary iterates. Letting $(\alpha_t)$ be a sequence of positive weights, we consider the corresponding main iterates $(\bar{h}_t)$, defined for all $t \ge 1$ as

$$\bar{h}_t = \mathrm{Weighting}\left[h_t; \bar{h}_{[t-1]}\right] := \frac{1}{\alpha_{1:t}}\sum_{i=1}^{t}\alpha_i h_i. \tag{3}$$

As a starting point, we note that the excess error of the weighted main iterates can be expressed in a convenient fashion.

Lemma 1 (Anytime lemma). Let $\mathcal{V}$ be a linear space, and let $R : \mathcal{V} \to \overline{\mathbb{R}}$ be sub-differentiable. Let $(h_t)$ be an arbitrary sequence with $h_t \in \mathrm{dom}\, R$, and let $(\bar{h}_t)$ be generated via (3). Then we have

$$R(\bar{h}_T) - R(h^*) = \frac{1}{\alpha_{1:T}}\left(\sum_{t=1}^{T}\alpha_t\left[\left\langle \partial R(\bar{h}_t),\, h_t - h^*\right\rangle - D_R(h^*; \bar{h}_t)\right] - \sum_{t=1}^{T-1}\alpha_{1:t}\, D_R(\bar{h}_t; \bar{h}_{t+1})\right)$$

for any reference point $h^* \in \mathrm{dom}\, R$ and $T \ge 1$.

The above equality is a slight generalization of the anytime online-to-batch inequality introduced by Cutkosky (2019) and sharpened by Joulani et al. (2020); it follows by direct manipulations utilizing little more than the definition of $D_R$. The key point of Lemma 1 is that we can obtain control over the main iterates $R(\bar{h}_T)$ using an ideal quantity that depends directly on $\bar{h}_t$, rather than simply $h_t$, as is typical of traditional online-to-batch conversions (Cesa-Bianchi, Conconi, and Gentile 2004). This is important because it opens the door to new stochastic feedback processes, driven by the main iterates, rather than the ancillary ones. In other words, we want feedback that provides an estimate of some element of $\partial R(\bar{h}_t)$, rather than $\partial R(h_t)$. When $R$ is convex, we have $D_R \ge 0$, and the subtracted terms can be utilized to sharpen our guarantees once we have regret bounds, as will be discussed in the technical appendices.
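Before moving on, here is a minimal numpy sketch of the weighting sub-routine (3), in the spirit of the Python implementation used later in the experiments: it maintains the main iterate incrementally, without storing the whole ancillary sequence. The function and variable names (e.g., update_main_iterate) are illustrative and not taken from the original.

```python
import numpy as np

def update_main_iterate(h_bar_prev, h_t, alpha_t, alpha_sum_prev):
    """Incrementally compute the weighted average in (3):
    h_bar_t = (sum_{i<=t} alpha_i * h_i) / alpha_{1:t}."""
    alpha_sum = alpha_sum_prev + alpha_t
    if h_bar_prev is None:   # t = 1: the main iterate is just h_1
        return h_t.copy(), alpha_sum
    h_bar = h_bar_prev + (alpha_t / alpha_sum) * (h_t - h_bar_prev)
    return h_bar, alpha_sum

# Example: uniform weights alpha_t = 1 recover the running mean of (h_t).
rng = np.random.default_rng(0)
h_bar, alpha_sum = None, 0.0
for t in range(1, 6):
    h_t = rng.normal(size=3)   # stand-in for the ancillary iterate at step t
    h_bar, alpha_sum = update_main_iterate(h_bar, h_t, 1.0, alpha_sum)
```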
Anytime Robust Algorithm Design

In Algorithm 1, we give a summary of the modified online-to-batch conversion that we utilize throughout the rest of the paper. Essentially, we start with an arbitrary online learning algorithm $\mathcal{A}$, query the potentially heavy-tailed stochastic feedback after averaging the iterates, and process the raw gradients in a robust fashion before updating. In the following paragraphs, we describe the details of these steps.

Algorithm 1: Anytime robust online-to-batch conversion.
  Inputs: weights $(\alpha_t)$, thresholds $(c_t)$, algorithm $\mathcal{A}$, initial point $h_1$, max iterations $T$.
  Initialize $\bar{h}_1 = h_1$.
  For $t \in [T-1]$:
    Obtain stochastic gradient $G_t$ at $\bar{h}_t$, satisfying (4).
    Set $\hat{G}_t = \mathrm{Process}[G_t; c_t]$ following (6)-(7).
    Ancillary update: $h_{t+1} = \mathcal{A}(h_t)$.
    Main update: $\bar{h}_{t+1} = \mathrm{Weighting}[h_{t+1}; \bar{h}_{[t]}]$, as in (3).
  Return: $\bar{h}_T$.

Raw feedback process. Let $(G_t)$ denote a sequence of stochastic gradients $G_t \in \mathcal{V}^*$, which are conditionally unbiased in the sense that we have

$$\mathbb{E}_{[t-1]} G_t := \mathbb{E}\left[G_t \mid G_{[t-1]}\right] \in \partial R(\bar{h}_t) \tag{4}$$

for all $t \ge 1$, recalling our notation $G_{[t]} := (G_1, \ldots, G_t)$. We emphasize to the reader that (4) differs from the traditional assumption (2) in terms of the points at which the sub-differential is being evaluated ($\bar{h}_t$ rather than $h_t$). As is traditional in the literature (Nazin et al. 2019; Nguyen et al. 2018), we shall also assume a uniform bound on the conditional variance, namely that for all $t \ge 1$, we have

$$\mathbb{E}_{[t-1]}\left\|G_t - \mathbb{E}_{[t-1]} G_t\right\|_*^2 \le \sigma^2 < \infty. \tag{5}$$

We will not assume anything else about the underlying distribution of $(G_t)$; as such, the gradients clearly may be unbounded or heavy-tailed in the sense of having infinite or undefined higher-order moments. In this setting, while one could naively use the raw sequence $(G_t)$ as-is, since we have made extremely weak assumptions on the underlying distribution, it is always possible for heavy-tailed data to severely destabilize the learning process (Brownlees, Joly, and Lugosi 2015; Chen, Su, and Xu 2017; Lecué, Lerasle, and Mathieu 2018). As such, it is desirable to process the raw gradients in a statistically principled manner, such that the processed output provides useful feedback to be passed directly to $\mathcal{A}$.

Overall design for robust feedback. A simple and popular approach to dealing with heavy-tailed random vectors is norm-based truncation (Catoni and Giulini 2017; Nazin et al. 2019). As with Nazin et al. (2019), we process the raw gradients as follows:

$$\mathrm{Process}[G; c] := \begin{cases}\tilde{g}, & \text{if } \|G - \tilde{g}\|_* > c,\\ G, & \text{else.}\end{cases} \tag{6}$$

Here $c > 0$ is a threshold, and the point $\tilde{g} \in \mathcal{V}^*$ used in this sub-routine is an anchor in the dual space, associated with some primal anchor $\tilde{h} \in \mathcal{H}$ assumed to satisfy

$$\mathbb{P}\left\{\mathrm{dist}\left(\tilde{g}; \partial R(\tilde{h})\right) > \tilde{\varepsilon}\sigma\right\} \le \delta. \tag{7}$$

We discuss settings of the anchor points and the critical thresholds in the following paragraphs. To concisely summarize the robust feedback that we use: instead of naively using $(G_t)$ as feedback for $\mathcal{A}$, we will pass $(\hat{G}_t)$, defined by $\hat{G}_t := \mathrm{Process}[G_t; c_t]$, based on a sequence of thresholds $(c_t)$. The anchors $\tilde{g}$ and $\tilde{h}$ remain fixed throughout the learning process.
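As a concrete illustration of the Process sub-routine (6), here is a minimal sketch that takes the Euclidean norm to play the role of the dual norm (the relevant case for the SGD experiments later); the function name process and its arguments are illustrative.

```python
import numpy as np

def process(G, g_anchor, c):
    """Norm-based truncation (6): return the dual anchor g_anchor when the raw
    gradient G lies outside the ball of radius c around it, else G itself."""
    return g_anchor.copy() if np.linalg.norm(G - g_anchor) > c else G

# Per-step usage inside Algorithm 1: G_hat_t = process(G_t, g_anchor, c_t)
```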
Determining the anchor points. Let us first assume a primal anchor $\tilde{h} \in \mathcal{H}$ is given; the key requirement we need to fulfill is (7), which asks that the dual anchor $\tilde{g}$ we choose be $\tilde{\varepsilon}\sigma$-close to some sub-gradient of $R$ at $\tilde{h}$ with probability no less than $1 - \delta$. Achieving this under our weak assumptions on the stochastic gradient distribution is straightforward using modern robust mean estimation techniques. Perhaps the simplest example is generalized median-of-means, as follows. Under our assumption (4), we can obtain $m$ unbiased stochastic gradients $G_1, \ldots, G_m$ at $\tilde{h}$, such that $\mathbb{E}_{[i-1]} G_i \in \partial R(\tilde{h})$ for all $i \in [m]$. Assuming $k$ divides $m$, partition these points into $k$ equal-sized subsets, compute the empirical means $\bar{G}_1, \ldots, \bar{G}_k$ of each subset, and set the dual anchor as $\tilde{g} = \mathrm{GeoMed}\{\bar{G}_1, \ldots, \bar{G}_k\}$, where the sub-routine GeoMed refers to the geometric median of these points, a convex program for which many efficient algorithms are known (Vardi and Zhang 2000; Cohen et al. 2016). It is well known that under this setting of $\tilde{g}$, if we set $k = \lceil 8\log(\delta^{-1})\rceil$, then (7) holds with $\tilde{\varepsilon}$ on the order of $\sqrt{(1 + 8\log(\delta^{-1}))/m}$. See Lugosi and Mendelson (2019) for additional background on this and other robust vector mean estimators.

Finally, we emphasize that the choice of primal anchor $\tilde{h} \in \mathcal{H}$ is arbitrary, in that the theoretical guarantees to be discussed shortly hold regardless of which $\tilde{h}$ we choose, given that the dual anchor is selected in the manner just described. In practice, the difference between good and bad primal anchors lies in the bias incurred by the truncation (6). More concretely, considering the ideal case where $\tilde{h}$ is a stationary (interior) point, we have $0 \in \partial R(\tilde{h})$, and thus simply thresholding the norm of $G_t$ is sufficient. On the other hand, when $\tilde{h}$ is far from a stationary point, $\tilde{g}$ may have a large norm. If the learning process finds a good $\bar{h}_t$ such that $\|G_t\|_*$ tends to be small, then $\|G_t - \tilde{g}\|_*$ may be large, even though we probably want to use $G_t$ as-is without truncation. This bias must be dealt with by proper threshold settings, which we describe in the next paragraph.

Threshold settings under smooth objectives. Let us further assume that $R$ is $\lambda$-smooth, still leaving $\mathcal{A}$ abstract. In this case, the sub-differential is simply $\partial R(h) = \{\nabla R(h)\}$, and so the error that we focus on is naturally that of the approximation $\hat{G}_t \approx \nabla R(\bar{h}_t)$, for $t \in [T]$. With $(\hat{G}_t)$ generated as described in Algorithm 1, direct inspection shows us that

$$\hat{G}_t - \nabla R(\bar{h}_t) = (G_t - \tilde{g})(1 - I_t) + \tilde{g} - \nabla R(\bar{h}_t), \tag{8}$$

where $I_t$ is the Bernoulli random variable defined by $I_t := I\{\|G_t - \tilde{g}\|_* > c_t\}$. The right-hand side of (8) has two terms we need to control. The first term is clearly bounded above in dual norm by $c_t$, considering the truncation event. As for the second term, a smooth risk makes it easy to establish control in primal distance terms. More explicitly, we have

$$\left\|\tilde{g} - \nabla R(\bar{h}_t)\right\|_* \le \left\|\tilde{g} - \nabla R(\tilde{h})\right\|_* + \left\|\nabla R(\tilde{h}) - \nabla R(\bar{h}_t)\right\|_* \le \tilde{\varepsilon}\sigma + \lambda\left\|\tilde{h} - \bar{h}_t\right\| \tag{9}$$

with probability no less than $1 - \delta$, where the latter inequality follows from $\lambda$-smoothness and the anchor property (7). Taking (8) and (9) together, we readily obtain

$$\left\|\hat{G}_t - \nabla R(\bar{h}_t)\right\|_* \le \left\|G_t - \tilde{g}\right\|_*(1 - I_t) + \left\|\tilde{g} - \nabla R(\bar{h}_t)\right\|_* \le c_t + \tilde{\varepsilon}\sigma + \lambda\left\|\tilde{h} - \bar{h}_t\right\| \tag{10}$$

on an event of probability at least $1 - \delta$. This inequality suggests an obvious choice for the threshold $c_t$ that keeps the preceding upper bound tidy:

$$c_t = \tilde{\varepsilon}\sigma + \lambda\left\|\tilde{h} - \bar{h}_t\right\| + c_0, \quad t \in [T]. \tag{11}$$

Here $c_0 > 0$ is a positive parameter used to control the degree of bias incurred due to truncation.
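The following sketch illustrates one way to compute the dual anchor by generalized median-of-means as described above, together with the threshold rule (11). Weiszfeld's algorithm is used here as one standard solver for the geometric median (the text cites Vardi and Zhang 2000 and Cohen et al. 2016 for alternatives); all names are illustrative, and this is a sketch under those assumptions rather than the authors' implementation.

```python
import numpy as np

def geo_med(points, iters=100, tol=1e-8):
    """Approximate geometric median via Weiszfeld's algorithm.
    points has shape (k, d); one of many possible solvers for this convex program."""
    z = points.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(points - z, axis=1), tol)  # avoid division by zero
        w = 1.0 / d
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z

def dual_anchor(grads, delta):
    """Median-of-means: split the m gradients (rows of grads) into roughly
    k = ceil(8*log(1/delta)) blocks, average within blocks, then take the
    geometric median of the block means."""
    m = grads.shape[0]
    k = int(min(m, max(1, np.ceil(8.0 * np.log(1.0 / delta)))))
    means = np.stack([b.mean(axis=0) for b in np.array_split(grads, k)])
    return geo_med(means)

def threshold(eps_sigma, lam, h_anchor, h_bar_t, c0):
    """Threshold rule (11): c_t = eps*sigma + lambda*||h_tilde - h_bar_t|| + c0."""
    return eps_sigma + lam * np.linalg.norm(h_anchor - h_bar_t) + c0
```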
Estimation error under smooth objectives. Using the thresholding strategy described in the preceding paragraph, one can obtain sub-linear bounds on the weighted gradient error terms, as the next result shows.

Lemma 2. Let $R$ be convex and $\lambda$-smooth. Let $\mathcal{H} \subseteq \mathrm{dom}\, R$ be convex with diameter $\Delta < \infty$. Given a confidence parameter $0 < \delta < 1$ and a number of iterations $T \ge \log(\delta^{-1})(\tilde{\varepsilon}\sigma)^2$, running Algorithm 1 with thresholds $(c_t)$ as in (11), with $c_0 = \max\{\lambda\Delta, \sigma\sqrt{T/\log(\delta^{-1})}\} + \tilde{\varepsilon}\sigma$, and weights $(\alpha_t)$ such that $\mathbb{E}_{[t-1]}\alpha_t = \alpha_t$ almost surely, it follows that

$$\sum_{t=1}^{T}\alpha_t \sup_{h, h' \in \mathcal{H}}\left\langle \hat{G}_t - \nabla R(\bar{h}_t),\, h - h'\right\rangle \le \max\{q_\delta(T), r_\delta(T)\}$$

with probability no less than $1 - 2\delta$, where $q_\delta(T)$ and $r_\delta(T)$ are deterministic quantities, each the sum of a multiple of $\sqrt{\sum_{t=1}^{T}\alpha_t^2}$ and a multiple of $\max_{t\in[T]}\alpha_t$, with factors depending only on $\sigma$, $\lambda$, $\Delta$, and $\log(\delta^{-1})$.

The main benefit of this lemma is that it holds under very weak assumptions on the stochastic gradients. The main limitations are that the feasible set must have a finite diameter, and that prior knowledge of $T$ and other factors is used for thresholding.

A general strategy. Let us define the regret incurred by Algorithm 1 after $T$ steps by

$$\mathrm{Regret}(T; \mathcal{A}) := \sum_{t=1}^{T}\alpha_t\left\langle \hat{G}_t,\, h_t - h^*\right\rangle, \tag{12}$$

where the reference point $h^* \in \mathrm{dom}\, R$ is left implicit in the notation. This weighted linear regret is somewhat special, since the losses (i.e., $h \mapsto \alpha_t\langle \hat{G}_t, h\rangle$) are evaluated on the ancillary sequence $(h_t)$, but they are defined in terms of potentially biased stochastic gradients which depend on the main sequence $(\bar{h}_t)$. With this notion of regret in hand, note that from Lemma 1 we immediately have the following expression:

$$\alpha_{1:T}\left(R(\bar{h}_T) - R(h^*)\right) = \mathrm{Regret}(T; \mathcal{A}) + \sum_{t=1}^{T}\alpha_t\left\langle \hat{G}_t - \nabla R(\bar{h}_t),\, h^* - h_t\right\rangle - \sum_{t=1}^{T}\alpha_t D_R(h^*; \bar{h}_t) - \sum_{t=1}^{T-1}\alpha_{1:t}\, D_R(\bar{h}_t; \bar{h}_{t+1}). \tag{13}$$

This identity offers us a nice starting point for analyzing a wide class of anytime robust algorithms, since the second term on the right-hand side can clearly be controlled using Lemma 2. It just remains to seek out regret bounds for different choices of the underlying algorithm $\mathcal{A}$ which are sub-linear, up to error terms that are amenable to Lemma 2. We give several concrete examples in the next section. To close this section, by combining our notion of regret with the preceding lemma, we can obtain a robust analogue of Cutkosky (2019, Thm. 1), which is valid under unbounded, heavy-tailed stochastic gradients.

Corollary 3. Under the assumptions of Lemma 2, for any reference point $h^* \in \mathcal{H}$, we have

$$R(\bar{h}_T) - R(h^*) \le \frac{1}{\alpha_{1:T}}\left[\mathrm{Regret}(T; \mathcal{A}) + \max\{q_\delta(T), r_\delta(T)\} - B_T\right]$$

with probability no less than $1 - 2\delta$, where $B_T \ge 0$ denotes the sum of all the Bregman divergence terms given in (13).

Anytime Robust Learning Algorithms

Thus far, the underlying algorithm object $\mathcal{A}$ used in Algorithm 1 has been left abstract. In this section, we illustrate how (13) can be utilized for important classes of algorithms, by obtaining regret bounds that are sub-linear up to error terms that can be controlled using Lemma 2. Our running assumptions are that $(\mathcal{V}, \|\cdot\|)$ is a reflexive Banach space, $\mathcal{H} \subseteq \mathcal{V}$ is convex and closed, $R$ is sub-differentiable, and the sequence $(\hat{G}_t)$ driven by $(\bar{h}_t)$ is precisely as in Algorithm 1.

Anytime Robust FTRL

Here we consider the setting in which $\mathcal{A}$ is implemented using a form of follow-the-regularized-leader (FTRL). Letting $(\psi_t)$ be a sequence of regularizer functions $\psi_t : \mathcal{V} \to \overline{\mathbb{R}}$, we are interested in the ancillary sequence $(h_t)$ generated by

$$h_{t+1} = \mathcal{A}(h_t) \in \operatorname*{arg\,min}_{h \in \mathcal{H}}\left\{\sum_{i=1}^{t}\alpha_i\left\langle \hat{G}_i, h\right\rangle + \psi_{t+1}(h)\right\}. \tag{14}$$

The initial value is set using an extra regularizer $\psi_1$, with $h_1 \in \arg\min_{h\in\mathcal{H}}\psi_1(h)$. We proceed assuming that the sequence $(h_t)$ exists, but we do not require the minimizer in (14) to be unique.
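For intuition, here is a minimal sketch of the update (14) in the special case of quadratic regularizers $\psi_{t+1}(h) = (\kappa_{t+1}/2)\|h\|_2^2$ with $\mathcal{H}$ a Euclidean ball, where the arg min has a closed form (project the scaled negative gradient sum onto $\mathcal{H}$). The regularizer choice and all names are illustrative, not prescribed by the text.

```python
import numpy as np

def project_ball(h, radius):
    """Euclidean projection onto H = {h : ||h||_2 <= radius}."""
    norm = np.linalg.norm(h)
    return h if norm <= radius else (radius / norm) * h

def ftrl_step(grad_sum, alpha_t, G_hat_t, kappa_next, radius):
    """One FTRL update (14) with psi_{t+1}(h) = (kappa_{t+1}/2) * ||h||_2^2.

    For this regularizer, the constrained arg min is obtained by projecting
    -(1/kappa_{t+1}) * sum_{i<=t} alpha_i * G_hat_i onto H.
    """
    grad_sum = grad_sum + alpha_t * G_hat_t   # running weighted gradient sum
    h_next = project_ball(-grad_sum / kappa_next, radius)
    return h_next, grad_sum
```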
Lemma 4. Let $\mathcal{A}$ be implemented as in (14), assuming that for each step $t \ge 1$, the regularizer $\psi_t$ is $\kappa_t$-strongly convex. Then, for any reference point $h^* \in \mathcal{H}$, we have

$$\mathrm{Regret}(T; \mathcal{A}) \le \psi_T(h^*) - \psi_1(h_1) + \sum_{t=1}^{T}\left[\psi_t(h_{t+1}) - \psi_{t+1}(h_{t+1})\right] + \sum_{t=1}^{T}\left[\frac{\alpha_t^2\left\|\nabla R(\bar{h}_t)\right\|_*^2}{2\kappa_t} + \alpha_t\left\langle \nabla R(\bar{h}_t) - \hat{G}_t,\, h_{t+1} - h_t\right\rangle\right]. \tag{15}$$

This lemma is a natural anytime robust analogue of standard FTRL regret bounds (Orabona 2020, Sec. 7.8). While the above bound holds as long as $R$ is sub-differentiable, in the special case where $R$ is smooth, the final sum on the right-hand side of (15) is amenable to direct application of Lemma 2, as desired. Combining this with (13), one can immediately derive excess risk bounds for the output of Algorithm 1 under this FTRL-type implementation, for a wide variety of regularization strategies.

Anytime Robust SMD

Next we consider the closely related setting in which $\mathcal{A}$ is implemented using a form of stochastic mirror descent (SMD). Assuming $\mathcal{H} \subseteq \mathcal{V}$ is bounded, closed, and convex, let $\Phi : \mathcal{V} \to \mathbb{R}$ be a differentiable and strictly convex function. Let $\mathcal{A}$ generate $(h_t)$ based on the update

$$h_{t+1} = \mathcal{A}(h_t) = \operatorname*{arg\,min}_{h \in \mathcal{H}}\left\{\beta_t\left\langle \hat{G}_t, h\right\rangle + D_\Phi(h; h_t)\right\}. \tag{16}$$

The function $D_\Phi$ is the Bregman divergence induced by $\Phi$; see the appendix for more detailed background. The step sizes $(\beta_t)$ are assumed positive, but can be set freely.

Lemma 5. Let $\mathcal{A}$ be implemented as in (16), with $\Phi$ chosen to be $\kappa$-strongly convex on $\mathcal{H}$. Then, for any reference point $h^* \in \mathcal{H}$ and all $t \ge 1$, the weighted per-step regret $\alpha_t\langle \hat{G}_t, h_t - h^*\rangle$ is bounded above by a telescoping difference of divergences, $D_\Phi(h^*; h_t) - D_\Phi(h^*; h_{t+1})$ (scaled by the weights and step sizes), plus a gradient error term of the form $\langle \nabla R(\bar{h}_t) - \hat{G}_t,\, h_{t+1} - h_t\rangle$.

This lemma can be interpreted easily as an anytime robust analogue of traditional regret bounds for SMD (e.g., Orabona 2020, Lem. 6.7). It can be combined with (13) and Lemma 2 to obtain the following guarantee.

Theorem 6 (Anytime robust mirror descent). Under the setting of Lemmas 2 and 5, denote the diameter of $\mathcal{H}$ with respect to $D_\Phi$ by $\Delta_\Phi := \sup_{h, h' \in \mathcal{H}} D_\Phi(h; h') < \infty$. Setting the weight and step-size sequences such that $\beta_t/\beta_{t-1} \le \alpha_t/\alpha_{t-1}$ and $\beta_t \le \kappa/\lambda$, we have that for any $h^*$ which is a stationary point of $R$ on $\mathcal{H}$, the inequality

$$R(\bar{h}_T) - R(h^*) \le \frac{1}{\alpha_{1:T}}\left[\frac{\Delta_\Phi}{\beta_T} + \max\{q_\delta(T), r_\delta(T)\}\right]$$

holds with probability no less than $1 - 2\delta$.

In contrast with Nazin et al. (2019), who query at the ancillary iterates, the preceding high-probability error bounds effectively give us sub-Gaussian guarantees for all points used to query stochastic gradients. As an important special case, consider the setting where $\mathcal{V}$ is a Euclidean space, and the underlying norm used is the $\ell_2$ norm $\|\cdot\|_2$. In this case, it is easy to verify that setting $\Phi(u) = \|u\|_2^2/2$ (with $\kappa = 1$), the update (16) amounts to

$$h_{t+1} = \mathcal{A}(h_t) = \Pi_{\mathcal{H}}\left[h_t - \beta_t \hat{G}_t\right], \tag{17}$$

where $\Pi_{\mathcal{H}}[\cdot]$ denotes projection onto $\mathcal{H}$; that is, anytime robust stochastic gradient descent. These settings lead us to the following corollary.

Corollary 7 (Anytime robust SGD). Consider $\mathcal{A}$ implemented using (17), with weights $\alpha_t = 1$ and step sizes $\beta_t \le 1/\lambda$ for all $t \in [T]$. Then, with probability no less than $1 - 2\delta$, the excess error $R(\bar{h}_T) - R(h^*)$ is bounded by a quantity of order

$$\sigma\Delta\sqrt{\frac{\log(\delta^{-1})}{T}} + \frac{\lambda\Delta^2\log(\delta^{-1})}{T},$$

i.e., a sub-Gaussian high-probability rate in both $T$ and $\delta$.

Anytime Robust AO-FTRL

In this sub-section, we consider the case in which $\mathcal{A}$ is implemented using an adaptive optimistic FTRL (AO-FTRL) procedure, namely updating as

$$h_{t+1} = \mathcal{A}(h_t) \in \operatorname*{arg\,min}_{h \in \mathcal{H}}\left\{\alpha_{t+1}\left\langle \widetilde{G}_t, h\right\rangle + \sum_{i=1}^{t}\alpha_i\left\langle \hat{G}_i, h\right\rangle + \sum_{i=0}^{t}\varphi_i(h)\right\}. \tag{18}$$

Here $(\varphi_t)$ is a sequence of regularizers that is now summed over for later notational convenience. Recalling the FTRL update (14), the AO-FTRL update is almost the same, save for the presence of $\widetilde{G}_t$ at each step $t$, with the interpretation that it provides a prediction of the loss that will be incurred in the following step, i.e., $\widetilde{G}_t \approx \hat{G}_{t+1}$.
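Analogously to the FTRL sketch above, the AO-FTRL update (18) with quadratic regularizers and a Euclidean-ball feasible set also has a closed form. The sketch below takes the optimistic hint as an explicit argument (per the text, the previous processed gradient is one natural choice); all names and the regularizer choice are illustrative assumptions.

```python
import numpy as np

def ao_ftrl_step(grad_sum, alpha_t, G_hat_t, alpha_next, hint, kappa_next, radius):
    """One AO-FTRL update (18) with cumulative regularizer
    psi_{t+1}(h) = (kappa_{t+1}/2)*||h||_2^2 and H a Euclidean ball.

    grad_sum holds sum_{i<t} alpha_i * G_hat_i on entry; hint plays the role
    of tilde-G_t, the prediction of the next processed gradient.
    """
    grad_sum = grad_sum + alpha_t * G_hat_t   # now sum_{i<=t} alpha_i * G_hat_i
    score = grad_sum + alpha_next * hint      # optimistic linear term added
    h_next = -score / kappa_next              # unconstrained arg min
    norm = np.linalg.norm(h_next)
    if norm > radius:                         # Euclidean projection onto H
        h_next = (radius / norm) * h_next
    return h_next, grad_sum
```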
Theorem 8. Let Algorithm 1 be run under the assumptions of Lemma 2, with $\mathcal{A}$ implemented as in (18), setting $\widetilde{G}_t = \hat{G}_{t-1}$ for each $t > 1$. In addition, let each $\varphi_t(\cdot)$ be convex and non-negative, and, denoting the regularizer partial sums by $\psi_t(\cdot) := \sum_{i=0}^{t-1}\varphi_i(\cdot)$, let each $\psi_t$ be $\kappa_t$-strongly convex, with weights set such that $(\lambda/\kappa_t)\alpha_t^2 \le \alpha_{1:(t-1)}$ for $t > 1$. Then, for any $h^* \in \mathcal{H}$ we have

$$R(\bar{h}_T) - R(h^*) \le \frac{1}{\alpha_{1:T}}\left[\frac{\alpha_1^2}{2\kappa_1}\left\|\nabla R(\bar{h}_1) - \widetilde{G}_1\right\|_*^2 + \sum_{t=1}^{T}\left[\varphi_{t-1}(h^*) - \varphi_{t-1}(h_t)\right] + 2\max\{q_\delta(T), r_\delta(T)\}\right]$$

with probability no less than $1 - 4\delta$.

This theorem can be considered a robust, high-probability analogue of the results in expectation given by Joulani et al. (2020, Thm. 3). As such, it can be readily combined with existing regularization techniques (Joulani et al. 2020, Sec. 4) to achieve the same rates (in $T$) under potentially heavy-tailed noise, with minimal computational overhead.

Empirical Analysis

In this section we complement the preceding theoretical analysis with an application of the proposed learning strategy to real-world benchmark datasets. The practical utility of various gradient truncation mechanisms has already been well-studied in the literature (Chen, Su, and Xu 2017; Prasad et al. 2018; Lecué, Lerasle, and Mathieu 2018; Holland and Ikeda 2019), and thus our chief point of interest here is whether, and when, the feedback scheme utilized in Algorithm 1 can outperform the traditional feedback mechanism given by (2), under a convex, differentiable true objective. Put more succinctly, the key question is: is there a practical benefit to querying at points with guarantees?

Experimental setup. Considering the context of key related work (Gorbunov, Danilova, and Gasnikov 2020; Nazin et al. 2019), we focus on averaged SGD as our baseline, and consider several real-world classification datasets of varying size, using standard multi-class logistic regression as our model (additional details for all the datasets used are provided in the appendix). We test three different learning procedures: averaged SGD using traditional feedback (2) (denoted SGD-Ave), anytime robust SGD precisely as in Algorithm 1 and Corollary 7 (denoted Anytime-Robust-SGD), and finally anytime SGD without the robustification sub-routine Process (denoted Anytime-SGD).

At a high level, for each dataset of interest, we run multiple independent randomized trials, and for each trial, we run the methods of interest for multiple epochs (i.e., multiple passes over the data), recording the on-sample (training) and off-sample (testing) performance at the end of each epoch. As a simple and lucid example that implies a convex objective, we use the multi-class logistic loss under a linear model; for a dataset with $k$ distinct classes, each predictor returns precisely $k$ scores, computed as a linear combination of the input features. Thus with $k$ classes and $d_{\mathrm{in}}$ input features, the total dimensionality is $d = k d_{\mathrm{in}}$. For these experiments we run 10 independent trials. Everything is implemented by hand in Python (ver. 3.8), making significant use of the numpy library (ver. 1.20); a public repository including all experimental code has been published at https://github.com/feedbackward/anytime. For each method and each trial, the dataset is randomly shuffled before being split into training and testing subsets. If $n$ is the size of a given dataset, then the training set is of size $n_{\mathrm{tr}}$ (80% of $n$, rounded), and the test set is of size $n - n_{\mathrm{tr}}$. Within each trial, for each epoch, the training data is also randomly shuffled. For all methods, the step size in update (17) is fixed at $\beta_t = 2/\sqrt{n_{\mathrm{tr}}}$ for all steps $t$; this setting is appropriate for the Anytime-* methods due to Corollary 7, and also for SGD-Ave based on standard results such as Nemirovski et al. (2009, Sec. 2.3). The gradients $(G_t)$ are obtained by direct computation of the logistic loss gradients, averaged over a mini-batch of size 8; this size was set arbitrarily for speed and stability, and no other mini-batch values were tested.
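To tie the pieces together, here is a minimal end-to-end sketch of the Anytime-Robust-SGD loop under settings matching the description above and the further details that follow (step size $2/\sqrt{n_{\mathrm{tr}}}$, mini-batch size 8, uniform weights, empirical-mean dual anchor at $\tilde{h} = h_1$, fixed threshold $\sqrt{n_{\mathrm{tr}}/\log(\delta^{-1})}$). The data here are synthetic stand-ins for the benchmark datasets, mini-batches are sampled with replacement rather than by shuffled epochs, and the projection onto $\mathcal{H}$ is omitted, so this is an illustration rather than the exact experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(scores):
    z = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def logistic_grad(W, X, y):
    """Multi-class logistic loss gradient averaged over a mini-batch.
    W has shape (k, d_in); X has shape (batch, d_in); y holds class indices."""
    P = softmax(X @ W.T)                    # predicted class probabilities
    P[np.arange(len(y)), y] -= 1.0          # P - one_hot(y)
    return P.T @ X / len(y)                 # shape (k, d_in)

def process(G, g_anchor, c):
    """Truncation sub-routine (6), with the Frobenius norm as the dual norm."""
    return g_anchor if np.linalg.norm(G - g_anchor) > c else G

# Synthetic stand-in data; the experiments use benchmark datasets instead.
n_tr, d_in, k = 1000, 20, 3
X = rng.uniform(0.0, 1.0, size=(n_tr, d_in))     # features scaled to [0, 1]
y = rng.integers(0, k, size=n_tr)

beta = 2.0 / np.sqrt(n_tr)                       # fixed step size, as in the text
batch = 8                                        # mini-batch size, as in the text
delta = 0.05
c = np.sqrt(n_tr / np.log(1.0 / delta))          # fixed threshold, as in the text

W = rng.uniform(-0.05, 0.05, size=(k, d_in))     # ancillary iterate h_1
W_bar = W.copy()                                 # main iterate, h_bar_1 = h_1
g_anchor = logistic_grad(W, X, y)                # dual anchor: empirical mean at h_tilde = h_1

T = 200
for t in range(1, T):
    idx = rng.integers(0, n_tr, size=batch)      # sampling with replacement (simplified)
    G = logistic_grad(W_bar, X[idx], y[idx])     # query at the MAIN iterate, per (4)
    G_hat = process(G, g_anchor, c)
    W = W - beta * G_hat                         # ancillary update (17); projection onto H omitted
    W_bar = W_bar + (W - W_bar) / (t + 1)        # main update (3) with alpha_t = 1
```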
Furthermore, for each method and each trial, the initial value $h_1$ is randomly generated in a dimension-wise fashion from the uniform distribution on the interval $[-0.05, 0.05]$. All raw input features are normalized to the unit interval $[0, 1]$ in a per-feature fashion. We do not use any regularization, for any method being tested.

Details for Anytime-Robust-SGD. First, as a simple choice of anchors $\tilde{h}$ and $\tilde{g}$, we set $\tilde{h} = h_1$ and estimate $\tilde{g}$ using the empirical mean on the training data set; this is meant to provide a transparent baseline for what is possible without any fine-tuning. As discussed in the section on anytime robust algorithm design, a natural refinement is to set aside a small subset of the data and dedicate it to mean estimation (i.e., computation of the dual anchor); see Lugosi and Mendelson (2019) for several examples of well-known practical robust high-dimensional mean estimators. As for the thresholds $(c_t)$ used in the Process sub-routine, we set $c_t = \sqrt{n_{\mathrm{tr}}/\log(\delta^{-1})}$ for all $t$, with a confidence level of $\delta = 0.05$ fixed throughout.

Results and discussion. Our results are summarized in Figure 1, which plots the average training and test losses. For each trial, losses are averaged over the relevant data subset (training or test), and these average losses are themselves averaged over all trials to obtain the values plotted here.

[Figure 1: Training and test loss versus epoch number, averaged over all trials, for each method; curves show Anytime-Robust-SGD-Ave, Anytime-SGD-Ave, and SGD-Ave (train and test). The eight individual plot titles correspond to dataset names, including emnist_balanced and fashion_mnist.]

The impact of using feedback with guarantees is immediate; in all cases, we see a notable boost in learning efficiency. This positive effect holds essentially uniformly across the datasets used, with no hyperparameter tuning. For CIFAR-10, we observe that the robustified version performs worse than vanilla anytime averaged SGD; this looks to be due to the simple $\tilde{h} = h_1$ setting, and can be readily mitigated by updating $\tilde{h}$ after one pass over the data. It is reasonable to conjecture that if we were to shift to more complex non-linear models, the resulting lack of convexity in the objective might introduce a trade-off between the stability encouraged by Algorithm 1 and the benefits of parameter-space exploration that are incidental to the noisier gradients arising under (2).

Future Directions

From a technical perspective, the most salient directions moving forward are strengthening the robust estimation sub-routines to reduce the amount of prior knowledge required, and potentially extending the methodology to cover non-smooth $R$. The requirement of a bounded domain can be removed (in the non-anytime setting) by using a more sophisticated update procedure (Gorbunov, Danilova, and Gasnikov 2020), and extending insights of this nature to refine or modify the procedure used to obtain Lemma 2 is of natural interest.
In most learning tasks of interest, the variance of the underlying feedback distribution may change significantly over the course of learning, and an adaptive strategy for setting $(c_t)$ is of interest both for strengthening formal guarantees and for improving efficiency and stability in practice.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 19K20342, and by JST ACT-X Grant Number JPMJAX200O.

References

Ash, R. B.; and Doléans-Dade, C. A. 2000. Probability and Measure Theory. Academic Press, 2nd edition.
Brownlees, C.; Joly, E.; and Lugosi, G. 2015. Empirical risk minimization for heavy-tailed losses. Annals of Statistics, 43(6): 2507-2536.
Catoni, O.; and Giulini, I. 2017. Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv preprint arXiv:1712.02747.
Cesa-Bianchi, N.; Conconi, A.; and Gentile, C. 2004. On the Generalization Ability of On-Line Learning Algorithms. IEEE Transactions on Information Theory, 50(9): 2050-2057.
Chen, Y.; Su, L.; and Xu, J. 2017. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. In Proceedings of the ACM on Measurement and Analysis of Computing Systems. ACM.
Cohen, M. B.; Lee, Y. T.; Miller, G.; Pachocki, J.; and Sidford, A. 2016. Geometric median in nearly linear time. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, 9-21.
Cutkosky, A. 2019. Anytime online-to-batch, optimism and acceleration. In 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, 1446-1454.
Davis, D.; Drusvyatskiy, D.; Xiao, L.; and Zhang, J. 2019. Robust stochastic optimization with the proximal point method. arXiv preprint arXiv:1907.13307v3.
Devroye, L.; Lerasle, M.; Lugosi, G.; and Oliveira, R. I. 2016. Sub-Gaussian mean estimators. Annals of Statistics, 44(6): 2695-2725.
Gorbunov, E.; Danilova, M.; and Gasnikov, A. 2020. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. arXiv preprint arXiv:2005.10785.
Haussler, D. 1992. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1): 78-150.
Hazan, E. 2016. Introduction to Online Convex Optimization. Foundations and Trends in Optimization, 2(3-4): 157-325.
Holland, M. J.; and Ikeda, K. 2019. Better generalization with less data using robust gradient descent. In 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research.
Joulani, P.; Raj, A.; György, A.; and Szepesvári, C. 2020. A Simpler Approach to Accelerated Stochastic Optimization: Iterative Averaging Meets Optimism. In 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, 4984-4993.
Lecué, G.; Lerasle, M.; and Mathieu, T. 2018. Robust classification via MOM minimization. arXiv preprint arXiv:1808.03106v1.
Lugosi, G.; and Mendelson, S. 2019. Robust multivariate mean estimation: the optimality of trimmed mean. arXiv preprint arXiv:1907.11391v1.
Nazin, A. V.; Nemirovsky, A. S.; Tsybakov, A. B.; and Juditsky, A. B. 2019. Algorithms of robust stochastic optimization based on mirror descent method. Automation and Remote Control, 80(9): 1607-1627.
Nemirovski, A.; Juditsky, A.; Lan, G.; and Shapiro, A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4): 1574-1609.
Nemirovsky, A. S.; and Yudin, D. B. 1983. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience.
Nguyen, L. M.; Nguyen, P. H.; van Dijk, M.; Richtárik, P.; Scheinberg, K.; and Takáč, M. 2018. SGD and Hogwild! convergence without the bounded gradients assumption. arXiv preprint arXiv:1802.03801v2.
Orabona, F. 2020. A Modern Introduction to Online Learning. arXiv preprint arXiv:1912.13213v3.
Prasad, A.; Suggala, A. S.; Balakrishnan, S.; and Ravikumar, P. 2018. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485.
Shalev-Shwartz, S. 2012. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2): 107-194.
Vapnik, V. N. 1999. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer, 2nd edition.
Vardi, Y.; and Zhang, C.-H. 2000. The multivariate L1-median and associated data depth. Proceedings of the National Academy of Sciences, 97(4): 1423-1426.