# stochastic_optimization_under_distributional_drift__714aef62.pdf

Journal of Machine Learning Research 24 (2023) 1-56 Submitted 11/21; Revised 10/22; Published 2/23

Stochastic Optimization under Distributional Drift

Joshua Cutler jocutler@uw.edu Department of Mathematics University of Washington Seattle, WA 98195-4322, USA

Dmitriy Drusvyatskiy ddrusv@uw.edu Department of Mathematics University of Washington Seattle, WA 98195-4322, USA

Zaid Harchaoui zaid@uw.edu Department of Statistics University of Washington Seattle, WA 98195-4322, USA

Editor: Alekh Agarwal

We consider the problem of minimizing a convex function that is evolving according to unknown and possibly stochastic dynamics, which may depend jointly on time and on the decision variable itself. Such problems abound in the machine learning and signal processing literature, under the names of concept drift, stochastic tracking, and performative prediction. We provide novel non-asymptotic convergence guarantees for stochastic algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability. The eﬃciency estimates we obtain clearly decouple the contributions of optimization error, gradient noise, and time drift. Notably, we identify a low drift-to-noise regime in which the tracking eﬃciency of the proximal stochastic gradient method beneﬁts signiﬁcantly from a step decay schedule. Numerical experiments illustrate our results. Keywords: stochastic gradient, stochastic tracking, concept drift, performative prediction, high-probability bounds

1. Introduction

Stochastic optimization underpins much of machine learning theory and practice. Signiﬁcant progress has been made over the last two decades in the ﬁnite-time analysis of stochastic approximation algorithms (Bottou, 2003; Bottou and Bousquet, 2007; Bottou, 2012; Srebro et al., 2011; Agarwal et al., 2014; Lang et al., 2019; Bach and Moulines, 2011; Sra et al., 2011; Nemirovski et al., 2009). The predominant assumption in much of the work on stochastic optimization for machine learning is that the distribution generating the data is ﬁxed throughout the run of the process. There is no shortage of problems, however, where this assumption is grossly violated. There are two main sources of such distributional shifts. The ﬁrst is temporal, wherein the distribution varies slowly in time due to reasons that are independent of the learning process. This setting is often called dynamic stochastic approximation (Dupaˇc, 1965) and is the basis for adaptive algorithms for stochastic track-

c 2023 Joshua Cutler, Dmitriy Drusvyatskiy, Zaid Harchaoui.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1410.html.

Cutler, Drusvyatskiy, Harchaoui

ing (Benveniste et al., 1990). The second common source is due to a feedback mechanism, wherein the distribution generating the data may depend on, or react to, the decisions made by the learner. This setting has been a subject of increased interest recently in the context of strategic classiﬁcation and performative prediction (Mendler-D unner et al., 2020; Drusvyatskiy and Xiao, 2022). In this work, we present ﬁnite-time eﬃciency estimates in expectation and with high probability for the tracking error of the proximal stochastic gradient method under time drift. The results are presented in a single framework that encompasses both the purely temporal and the decision-dependent time drift. Our results concisely explain the interplay between the learning rate, the noise variance in the gradient oracle, and the strength of the time drift. While conventional wisdom and previous work recommend the use of a constant step size under time drift, we identify a low drift-to-noise regime in which tracking eﬃciency beneﬁts signiﬁcantly from a step size schedule that geometrically decays to a critical step size . Setting the stage, consider the sequence of stochastic optimization problems

min x ϕt(x) := ft(x) + rt(x) (1)

indexed by time t N. In typical machine learning and signal processing settings, the function ft corresponds to an average loss that varies in time, while the regularizer rt models constraints or promotes structure (e.g., sparsity) in the variable x. Two examples are worth highlighting. The ﬁrst is a classical problem in signal processing related to stochastic tracking (Kushner and Yin, 1997; Sayed, 2003), wherein the learning algorithm aims to track over time a moving target driven by an unknown stochastic process. The second example is the concept drift phenomenon in online learning (Hazan and Seshadhri, 2009; Zhang et al., 2018), wherein the true hypothesis may be changing over time. The main goal of a learning algorithm for problem (1) is to generate a sequence of points {xt} that minimize some natural performance metric. To make progress, we impose the standard assumption that each function ft is µ-strongly convex with L-Lipschitz continuous gradient, while each regularizer rt is proper, closed, and convex. The online proximal stochastic gradient method (PSG) naturally applies to the sequence of problems (1). At each iteration t, the method simply takes the step

xt+1 = proxηtrt(xt ηtgt),

where the vector gt is an unbiased estimator of the true gradient of ft at xt, the step size (learning rate) ηt > 0 is user-speciﬁed, and proxηtrt( ) is the proximal map of the scaled regularizer ηtrt. In this work, we analyze two types of tracking error for PSG: the squared distance xt x t 2 and the suboptimality gap ϕt(ˆxt) ϕt(x t ). Here, x t denotes the minimizer of the function ϕt which may evolve stochastically in time, and ˆxt denotes a weighted average of iterates up to time t. We next outline the main results of the paper; the results in Sections 1.1 and 1.2 below appeared in a preliminary version of this paper at Neur IPS (Cutler et al., 2021).

Stochastic Optimization under Distributional Drift

1.1 Tracking the Minimizer

We begin with a simple bound on distance tracking of the constant-step PSG:

E xt x t 2 (1 µη)t x0 x 0 2 | {z } optimization

µ |{z} noise

| {z } drift

Here η (0, 1/2L] is the constant step size used by PSG, σ2 upper-bounds the variance of the stochastic gradient, and 2 upper-bounds the minimizer variations E x t x t+1 2; the symbol indicates an inequality that holds up to an absolute constant factor, i.e., up to multiplying the upper bound by a positive numerical constant independent of the problem parameters. Inequality (2) asserts that the tracking error E xt x t 2 decays linearly in time t, until it reaches the noise + drift error ησ2/µ + ( /µη)2. Notice that the noise + drift error cannot be made arbitrarily small by tuning η. This is perfectly in line with intuition: a step size η that is too small prevents the algorithm from catching up with the minimizers x t . We note that the individual error terms due to the optimization and noise are classically known to be tight for PSG; tightness of the drift term is proved by Madden et al. (2021, Theorem 3.2). Though the estimate (2) is likely known, we were unable to ﬁnd a precise reference in this generality. Letting t tend to inﬁnity in (2), the optimization error tends to zero, leaving only the noise + drift term. Optimizing this remaining term over η, it is natural to deﬁne the asymptotic distance tracking error of PSG and the corresponding optimal learning rate as

E := min η (0,1/2L]

and η := min

( 1 2L, 2 2

Two regimes of variation are brought to light: the high drift-to-noise regime /σ p

µ/16L3, and the low drift-to-noise regime /σ < p

µ/16L3. The high drift-to-noise regime is uninteresting from the viewpoint of stochastic optimization because in this case the optimal learning rate η 1/L is as large as in the deterministic setting (here, the symbol indicates an equality that holds up to an absolute constant factor). In contrast, the low drift-to-noise regime is interesting because it necessitates using a smaller learning rate η ( 2/µσ2)1/3

that exhibits a nontrivial scaling with the problem parameters. Consequently, for the rest of the introduction we focus on the low drift-to-noise regime. A central question is to ﬁnd a learning rate schedule that achieves a tracking error E xt x t 2 that is within a constant factor of E in the shortest possible time. The simplest strategy is to execute PSG with the constant learning rate η . Then a direct application of (2) yields the eﬃciency estimate E xt x t 2 E in time t (σ2/µ2E) log( x0 x 0 2/E). This eﬃciency estimate can be signiﬁcantly improved by gradually decaying the learning rate using a step decay schedule , wherein the algorithm is implemented in epochs with the new learning rate chosen to be the midpoint between the current learning rate and η . Such schedules are well known to improve eﬃciency in the static (stationary objective) setting, as was discovered by Ghadimi and Lan (2013), and can be used here. The end result is an algorithm that produces a point xt satisfying

E xt x t 2 E in time t L

µ log x0 x 0 2

Cutler, Drusvyatskiy, Harchaoui

This eﬃciency estimate is remarkably similar to that in the static setting (Ghadimi and Lan, 2013), with E playing the role of the target accuracy ε. An elementary computation shows that (3) improves the constant learning rate eﬃciency estimate when E is small, e.g., when E x0 x 0 2/e2, where e denotes Euler s number. The eﬃciency estimate (3) is a baseline guarantee for PSG with step decay. Since the result is stated in terms of the expected tracking error E xt x t 2, it is only meaningful if the entire algorithm can be repeated from scratch multiple times on the same problem.1

However, there is no shortage of situations in which a learning algorithm is operating in real time and the time drift is irreversible; in such settings, the algorithm may only be executed once. These situations call for eﬃciency estimates that hold with high probability, rather than only in expectation. With this in mind, we show that under mild light-tail assumptions, PSG with step decay produces a point xt satisfying xt x t 2 E log (1/δ) with probability at least 1 δ in the same order of iterations as in (3). The proof follows closely the probabilistic techniques developed by Harvey et al. (2019) for bounding moment generating functions.

1.2 Tracking the Minimum Value

The results outlined so far have focused on tracking the minimizer x t ; stronger guarantees may be obtained for tracking the minimum value ϕ t . To this end, we require stronger assumptions on the variation of the functions ft beyond control on the minimizer drift x t x t+1 2. Similar in spirit to the measure of cumulative gradient variation in the dynamic online learning literature (e.g., see Jadbabaie et al., 2015), we will be concerned with the gradient drift Gi,t := sup x fi(x) ft(x)

and assume the bound E[G2 i,t/µ2] 2|i t|2 for all times i and t. Thus, the second moment of the gradient drift is assumed to grow at most quadratically in the time horizon. Assuming henceforth that the regularizers rt r are identical for all times t, this condition on the gradient drift implies the weaker assumption E x t x t+1 2 2 used in Section 1.1. Analogous to (2), we show that PSG generates a point ˆxt (an average iterate) satisfying

E ϕt(ˆxt) ϕ t 1 µη

2 t ϕ0(x0) ϕ 0

| {z } optimization

+ ησ2 |{z} noise + 2

µη2 |{z} drift

This estimate again decouples nicely into three terms, signifying the error due to optimization, gradient noise, and time drift. Taking the limit as t tends to inﬁnity, we obtain the asymptotic function gap tracking error G := µE. Similar to (3), we show that PSG with step decay produces a point ˆxt satisfying

E ϕt(ˆxt) ϕ t G in time t L

µ log ϕ0(x0) ϕ 0 G

1. Speciﬁcally, error bounds holding in expectation yield concentration inequalities for the average of i.i.d. errors arising from executing a stochastic algorithm multiple times from scratch on the same problem (e.g., via Chebyshev s inequality); if the algorithm cannot be executed in this fashion to generate i.i.d. errors, then alternative high-probability error bounds are called for.

Stochastic Optimization under Distributional Drift

Again, the similarity to the static setting (Ghadimi and Lan, 2013), with G playing the role of a target accuracy, is striking. We then provide a high-probability extension of this estimate: under mild light-tail assumptions, PSG with step decay produces a point ˆxt satisfying ϕt(ˆxt) ϕ t G log(1/δ) with probability at least 1 δ in the same order of iterations as in (4) up to a factor of log log(1/δ). The proofs are based on the generalized Freedman inequality of Harvey et al. (2019) a remarkably ﬂexible tool for analyzing stochastic gradient-type algorithms.

1.3 Extension to Decision-Dependent Problems with Time Drift

We have so far focused on stochastic optimization problems that undergo a temporal shift. A primary reason for this phenomenon in machine learning, and data science more broadly, is that data distributions often evolve in time independently of the learning process. Recent literature, on the other hand, highlights a diﬀerent source of distributional shift due to decision-dependent or performative eﬀects. Namely, the distribution generating the data in iteration t may depend on, or react to, the current decision xt. For example, deployment of a classiﬁer by a learning system, when made public, often causes the population to adapt their attributes in order to increase the likelihood of being positively labeled a process called gaming . Even when the population is agnostic to the classiﬁer, the decisions made by the learning system (e.g., loan approval) may inadvertently alter the proﬁle of the population (e.g., credit score). The goal of the learning system therefore is to ﬁnd a classiﬁer that generalizes well under the response distribution. Recent research in strategic classiﬁcation (Hardt et al., 2016; Br uckner et al., 2012; Bechavod et al., 2021; Dalvi et al., 2004) and performative prediction (Perdomo et al., 2020; Mendler-D unner et al., 2020) has highlighted the prevalence of this phenomenon. Combining time-dependence and decision-dependence yields a class of problems (1) where the loss function ft(x) takes the special form ft(x) = Eξ D(t,x)ℓ(x, ξ). Here D(t, x) is a distribution that depends on both time t and the decision variable x. Thus for any ﬁxed time t, the problem (1) becomes the performative risk problem considered by Perdomo et al. (2020) and Mendler-D unner et al. (2020). Following this line of work, instead of tracking the true minimizer of ϕt typically a challenging task we will settle for tracking the equilibrium points xt. These are the points satisfying

xt argmin x E ξ D(t, xt)ℓ(x, ξ) + r(x).

Equilibrium points are sure to exist and are unique under mild Lipschitzness and strong convexity assumptions. We refer the reader to Perdomo et al. (2020) for a compelling motivation for considering such equilibrium points. The problem of tracking equilibrium points is yet again an instance of (1), but now with the diﬀerent function ft(x) = Eξ D(t, xt)ℓ(x, ξ) induced by the equilibrium distributions. The PSG algorithm is not directly applicable here since the learner cannot typically sample from D(t, xt) directly. Instead, a natural algorithm for this problem class draws in each iteration t a sample ξt from the current distribution D(t, xt) and declares xt+1 = proxηtr(xt ηt ℓ(xt, ξt)). Notice that the sample gradient ℓ(xt, ξt) is a biased estimator of the true gradient ft(xt) = Eξ D(t, xt) ℓ(xt, ξ) because ξt is sampled from the wrong distribution. Nonetheless, as pointed out by Drusvyatskiy and Xiao (2022), the gradient bias is small for any ﬁxed time, decaying linearly with the distance

Cutler, Drusvyatskiy, Harchaoui

to xt. Using this perspective, we show that all guarantees for PSG in the time-dependent setting naturally extend to this biased PSG algorithm for tracking equilibrium points, with essentially no loss in eﬃciency.

1.4 Related Work

Our current work ﬁts within the broader literature on stochastic tracking, online optimization with dynamic regret, high-probability guarantees in stochastic optimization, and performative prediction. We now survey the most relevant literature in these areas.

Stochastic tracking. Stochastic optimization with time drift was considered soon after the Robbins-Monro approach for stochastic optimization was introduced; see Kushner and Yin (1997) for a survey. Early results can be traced back to Dupaˇc (1965) in sequential estimation and Gaivoronskii (1978) in stochastic optimization; see also Fujita and Fukao (1972), Ruppert (1979), Tsypkin and Nikolic (1971), Tsypkin and Polyak (1992), and Uosaki (1974). Stochastic algorithms have also been extensively studied as adaptive algorithms for stochastic tracking (Tsypkin and Nikolic, 1971; Kushner and Yin, 1997; Benveniste et al., 1990), for their ability to indeed track parameters under time drift. Most works have focused on the so-called least mean-squares (LMS) algorithm and its variants, which can be viewed as a stochastic gradient method on a least-squares loss-based objective. Other stochastic algorithms that have been studied in these settings with a larger cost per iteration include recursive least-squares and related Kalman ﬁltering algorithms (Guo and Ljung, 1995). Recent works have revisited these methods from a more modern viewpoint (Besbes et al., 2015; Wilson et al., 2019; Madden et al., 2021). In particular, the paper of Madden et al. (2021) focuses on (accelerated) gradient methods for deterministic tracking problems, while Wilson et al. (2019) present a framework for online stochastic gradient methods with parameter estimation. The work of Besbes et al. (2015) analyzes the dynamic regret of stochastic algorithms for time-varying problems, focusing both on lower and upper complexity bounds. Though the proof techniques in our paper share many aspects with those available in the literature, the results we obtain are distinct. In particular, the guarantees (3) and (4) for PSG with step decay, along with their high-probability variants, are new to the best of our knowledge.

Online optimization with dynamic regret. Online optimization for sequences of convex objectives with domain X has been studied through the lens of adaptive regret (Hazan and Seshadhri, 2009; Daniely et al., 2015) and dynamic regret (Besbes et al., 2015; Zinkevich, 2003; Zhang et al., 2018; Zhao et al., 2020; Jadbabaie et al., 2015; Mokhtari et al., 2016). The adaptive regret

sup [r,s] [T]

t=r ft(xt) inf x X

reads as the maximum static regret over any contiguous time interval; more relevant to our analysis is the dynamic regret

ft(xt) ft(x t ) ,

Stochastic Optimization under Distributional Drift

which reads as the cumulative diﬀerence between the instantaneous loss and the minimum loss. More generally, one can consider the dynamic regret against an arbitrary comparator sequence {ut}T t=1 in X, given by

Reg T (u1, . . . , u T ) =

ft(xt) ft(ut) .

Jadbabaie et al. (2015) apply an adaptive step size strategy to an optimistic mirror descent algorithm, thereby obtaining a comprehensive dynamic regret guarantee for Reg T in terms of the cumulative loss variation VT = PT t=2 supx X |ft(x) ft 1(x)|, the cumulative gradient variation DT = PT t=1 ft(xt) Mt 2 using a causally predictable sequence Mt available to the algorithm prior to time t (e.g., Mt = ft 1(xt 1)), and the cumulative minimizer variation C T = PT t=2 x t x t 1 . Under strong convexity, Mokhtari et al. (2016) show that online projected gradient descent satisﬁes Reg T O(1 + C T ). For convex losses with bounded domain X, Zhang et al. (2018) present an adaptive online gradient method that achieves an optimal dynamic regret bound in terms of the cumulative comparator variation CT = PT t=2 ut ut 1 , namely, Reg T (u1, . . . , u T ) O p

T(1 + CT ) . This last guarantee is enhanced by Zhao et al. (2020) through exploiting smoothness to replace the time horizon T by problem-dependent quantities that are at most O(T) but often much smaller in easy problems. As is standard in the dynamic online optimization literature, the preceding works assume that either the losses ft(x) or their gradients ft(x) are uniformly bounded in both t and x, and a priori knowledge of these uniform bounds is required for the aforementioned guarantees. In contrast, we take great care to make no such uniform boundedness assumptions and we work instead with bounded second moments or light tails of minimizer or gradient drift, which we allow to evolve stochastically. Furthermore, we only assume stochastic gradient access, and the presence of stochasticity in the drift and the gradient noise requires guarantees that hold both in expectation and with high probability. Our bounds depend on a characteristic quantity of the problem diﬃculty encapsulating the drift and the noise level and hence delineate two regimes depending on the drift-to-noise ratio. In general, regret bounds do not entail last-iterate bounds of the type presented in our work.

High-probability guarantees in stochastic optimization. A large part of our work revolves around high-probability guarantees for stochastic optimization. Classical references on the subject in static settings and for minimizing regret in online optimization include the work of Bartlett et al. (2008), Hazan and Kale (2014), Lan (2012), and Rakhlin et al. (2012). There exists a variety of techniques for establishing high-probability guarantees based on Freedman s inequality and doubling tricks (e.g., see Bartlett et al., 2008; Hazan and Kale, 2014). A more recent line of work by Harvey et al. (2019) establishes a generalized Freedman inequality that is custom-tailored for analyzing stochastic gradient-type methods and results in the best known high-probability guarantees. Our arguments closely follow the paradigm of Harvey et al. (2019) based on the generalized Freedman inequality.

Performative prediction and decision-dependent learning. Recent works on strategic classiﬁcation (Hardt et al., 2016; Br uckner et al., 2012; Bechavod et al., 2021; Dalvi et al., 2004) and performative prediction (Perdomo et al., 2020; Mendler-D unner et al., 2020) have highlighted the importance of strategic behavior in machine learning. That is, common

Cutler, Drusvyatskiy, Harchaoui

learning systems exhibit a feedback mechanism, wherein the distribution generating the data in iteration t may depend on, or react to, the current decision of an algorithm xt. The recent paper by Perdomo et al. (2020) put forth an elegant framework for thinking about such problems, while Mendler-D unner et al. (2020) develop stochastic algorithms for this setting. The subsequent work of Drusvyatskiy and Xiao (2022) shows that a variety of stochastic algorithms for performative prediction can be understood as biased variants of the same algorithms on a certain static problem in equilibrium. Building on the techniques of Drusvyatskiy and Xiao (2022), we show how all our results for time-dependent problems extend to problems that simultaneously depend on time and on the decision variable. We note that during the ﬁnal stage of completing this paper, the closely related and complementary work by Wood et al. (2022) was posted on ar Xiv.2 The paper by Wood et al. (2022) considers decision-dependent projected stochastic gradient descent under time drift in the distributional framework proposed by Perdomo et al. (2020), establishing distance tracking bounds in expectation and with high probability under sub-Weibull gradient noise. In particular, the light-tail assumption on gradient noise used by Wood et al. (2022) for obtaining high-probability guarantees is more general than the one in our paper. On the other hand, we analyze tracking of both the minimizer and the minimum value of more general stochastically evolving objectives, allow presence of general convex regularizers, and propose a step decay schedule for improved eﬃciency.

1.5 Outline

The outline of the paper is as follows. Section 2 formalizes the problem setting of timedependent stochastic optimization and records the relevant assumptions. Sections 3 5 summarize the main results of the paper. Speciﬁcally, Section 3 focuses on eﬃciency estimates for tracking the minimizer, Section 4 focuses on eﬃciency estimates for tracking the minimum value, and Section 5 develops an extension to the decision-dependent setting via tracking equilibria. Section 6 presents the proofs of the main results in a uniﬁed framework. Illustrative numerical results appear in Section 7. Appendix A describes the averaging technique used for tracking function values, and additional proofs appear in Appendix B.

2. Framework and Assumptions

Throughout Sections 2 4, we consider the sequence of stochastic optimization problems

min x Rd ϕt(x) := ft(x) + rt(x) (5)

indexed by time t N, where Rd denotes a ﬁxed d-dimensional Euclidean space with inner product , and Euclidean norm x = p

x, x , and the following standard regularity assumptions hold:

(i) Each function ft : Rd R is µ-strongly convex and C1-smooth with L-Lipschitz continuous gradient for some common parameters µ, L > 0.

2. More precisely, a short version of our paper (Cutler et al., 2021) was submitted to Neur IPS in May 21, the paper by Wood et al. (2022) appeared on ar Xiv in July 21, and our full paper was posted on ar Xiv in August 21. Our paper (Cutler et al., 2021) was presented at Neur IPS in December 21.

Stochastic Optimization under Distributional Drift

(ii) Each regularizer rt : Rd R { } is proper, closed, and convex.3

The minimizer and minimum value of ϕt will be denoted by x t and ϕ t , respectively. We will be concerned with settings in which ϕt evolves stochastically in time. As motivation, we describe two classical examples of (5) that are worth keeping in mind and that guide our framework: stochastic tracking of a drifting target and online learning under distributional drift.

Example 1 (Stochastic tracking of a drifting target) The problem of stochastic tracking, related to the ﬁltering problem in signal processing, is to track a moving target x t from observations bt = ct(x t ) + ϵt, where ct( ) is a known measurement map and ϵt is a mean-zero noise vector. A typical time-dependent problem formulation takes the form

min x E ϵt ℓt(bt ct(x)) + rt(x),

where the loss ℓt( ) derives from the distribution of ϵt and the regularizer rt( ) encodes available side information about the target x t . Common choices for rt are the 1-norm and the squared 2-norm. The motion of the target x t is typically driven by a random walk or a diﬀusion (Guo and Ljung, 1995; Sayed, 2003).

Example 2 (Online learning under distributional drift) The problem of online learning under distributional drift is to learn while the data distribution changes over time. More formally, a typical problem formulation takes the form

min x E ξ D(vt)ℓ(x, ξ) + r(x),

where D(vt) is a data distribution that depends on an unknown parameter sequence {vt}, which itself may evolve stochastically.

The main goal of a learning algorithm for problem (5) is to generate a sequence of points {xt} that minimize some natural performance metric. The most prevalent performance metrics in the literature are the tracking error and the dynamic regret. We will focus on two types of tracking error: the squared distance xt x t 2 and the suboptimality gap ϕt(ˆxt) ϕt(x t ), where ˆxt denotes a weighted average of iterates up to time t. We make the standing assumption that at every time t, and at every query point x, the learner can select an unbiased estimator e ft(x) of the true gradient ft(x) in order to proceed with a stochastic gradient-like optimization algorithm. With this oracle access, the online proximal stochastic gradient method recorded as Algorithm 1 below selects in each iteration t the stochastic gradient gt = e ft(xt) and takes the step

xt+1 := proxηtrt(xt ηtgt) = argmin u Rd

n rt(u) + 1 2ηt u (xt ηtgt) 2o

using step size ηt > 0. The goal of our work is to obtain eﬃciency estimates for this procedure that hold both in expectation and with high probability.

3. We assume dom ft = Rd for simplicity, but this is not essential. For example, it suﬃces to assume that each function ft : Rd R { } is closed and µ-strongly convex and that there exists an open convex set U Rd such that for all t N, dom rt U dom ft and ft is L-smooth on U.

Cutler, Drusvyatskiy, Harchaoui

Algorithm 1 Online Proximal Stochastic Gradient PSG(x0, {ηt}, T)

Input: initial x0 and step sizes {ηt}T 1 t=0 (0, ) Step t = 0, . . . , T 1: Select gt = e ft(xt)

Set xt+1 = proxηtrt(xt ηtgt)

The guarantees we obtain allow both the iterates xt and the minimizers x t to evolve stochastically. This is convenient for example when tracking a moving target x t whose motion may be governed by a stochastic process such as a random walk or a diﬀusion (Example 1), or when tracking the minimizer of an expected loss over a stochastically evolving data distribution (Example 2). Given {xt} and {gt} as in Algorithm 1, we let

zt := ft(xt) gt

denote the gradient noise at time t and we impose the following assumption modeling stochasticity on a ﬁxed probability space (Ω, F, P) throughout Sections 2 4.

Assumption 1 (Stochastic framework) There exists a probability space (Ω, F, P) with ﬁltration (Ft)t 0 such that F0 = { , Ω} and the following two conditions hold for all t 0:

(i) xt, x t : Ω Rd are Ft-measurable.

(ii) zt : Ω Rd is Ft+1-measurable with E[zt | Ft] = 0.

The ﬁrst item of Assumption 1 formalizes the assertion that xt and x t are fully determined by information up to time t. The second item of Assumption 1 formalizes the assertion that the gradient noise zt is fully determined by information up to time t + 1 and has zero mean conditioned on the information up to time t, i.e., gt is an unbiased estimator of ft(xt); for example, this holds naturally in Example 2 under typical regularity assumptions if gt = ℓ(xt, ξt) with ξt D(vt), where ℓ(xt, ξt) denotes the gradient of ℓ( , ξt) at xt. Eﬃciency estimates for Algorithm 1 must clearly take into account the variation of the problem (5) in time t. One of the standard metrics for measuring this variation is the minimizer drift t := x t x t+1 . Another popular metric is the gradient drift

sup x ft(x) ft+1(x) .

Our eﬃciency estimates for tracking the minimizer will depend on the minimizer drift, while our eﬃciency estimates for tracking the minimum value will depend on the gradient drift. As the following elementary lemma shows, the minimizer drift scaled by µ is dominated by the gradient drift whenever the regularizers do not vary in time.4

4. Lemma 1 provides a bound similar in spirit to the bound µ x i x t 2 4 supx dom r |fi(x) ft(x)| in terms of variation in function value, which is also an elementary consequence of µ-strong convexity (e.g., see Zhao and Zhang, 2021, Section 4.1).

Stochastic Optimization under Distributional Drift

Lemma 1 (Minimizer vs. gradient drift) Suppose that i and t are indices for which the regularizers ri and rt are identical. Then

µ x i x t fi(x t ) ft(x t ) .

Proof Let r denote the common regularizer: r = ri = rt. Then the ﬁrst-order optimality condition 0 ϕt(x t ) = ft(x t ) + r(x t ) implies ft(x t ) r(x t ), so the vector v := fi(x t ) ft(x t ) lies in ϕi(x t ). Hence the µ-strong convexity of ϕi and the inclusion 0 ϕi(x i ) imply µ x i x t 0 v .

3. Tracking the Minimizer

This section presents bounds on the tracking error xt x t 2 that are valid both in expectation and with high probability under light-tail assumptions. Further, we show that a geometrically decaying learning rate schedule may be superior to a constant learning rate in terms of eﬃciency.

3.1 Bounds in Expectation

We begin with bounding the expected value E xt x t 2. Proofs appear in Section 6.1. The starting point for our analysis is the following standard one-step improvement guarantee.

Lemma 2 (One-step improvement) For all x Rd, the iterates {xt} produced by Algorithm 1 with ηt < 1/L satisfy the bound:

2ηt(ϕt(xt+1) ϕt(x)) (1 µηt) xt x 2 xt+1 x 2 + 2ηt zt, xt x + η2 t 1 Lηt zt 2.

For simplicity, we state the main results under the assumption that the second moments E 2 t and E zt 2 are uniformly bounded; more general guarantees that take into account weighted averages of the moments and allow for time-dependent learning rates follow from Lemma 2 as well.

Assumption 2 (Bounded second moments) There exist constants , σ > 0 such that the following two conditions hold for all t 0:

(i) (Drift) The minimizer drift t satisﬁes E 2 t 2.

(ii) (Noise) The gradient noise zt satisﬁes E zt 2 σ2.

The following theorem establishes an expected improvement guarantee for Algorithm 1, and serves as the basis for much of what follows.

Theorem 3 (Expected distance) Suppose that Assumption 2 holds. Then the iterates produced by Algorithm 1 with constant learning rate η 1/2L satisfy the bound:

E xt x t 2 (1 µη)t x0 x 0 2 | {z } optimization

µ |{z} noise

| {z } drift

Cutler, Drusvyatskiy, Harchaoui

Interplay of optimization, noise, and drift. Theorem 3 states that when using a constant learning rate, the error E xt x t 2 decays linearly in time t, until it reaches the noise + drift error ησ2/µ + ( /µη)2. Notice that the noise + drift error cannot be made arbitrarily small. This is perfectly in line with intuition: a learning rate that is too small prevents the algorithm from catching up with x t . We note that the individual error terms due to the optimization and noise are classically known to be tight for PSG; tightness of the drift term is proved by Madden et al. (2021, Theorem 3.2). With Theorem 3 in hand, we deﬁne the asymptotic tracking error of Algorithm 1 corresponding to E xt x t 2, together with the corresponding optimal step size:

E := min η (0,1/2L]

and η := min

( 1 2L, 2 2

Plugging η into the deﬁnition of E, we see that Algorithm 1 exhibits qualitatively diﬀerent behaviors in settings with high or low drift-to-noise ratio /σ. Explicitly,

µ2 2/3 otherwise.

Two regimes of variation are brought to light by the above computation: the high driftto-noise regime /σ p

µ/16L3 and the low drift-to-noise regime /σ < p

µ/16L3. The high drift-to-noise regime is uninteresting from the viewpoint of stochastic optimization because in this case the optimal learning rate η 1/L is as large as in the deterministic setting. In contrast, the low drift-to-noise regime is interesting because it necessitates using a smaller learning rate η ( 2/µσ2)1/3 that exhibits a nontrivial scaling with the problem parameters.

Learning rate vs. rate of variation. A central question is to ﬁnd a learning rate schedule that achieves a tracking error E xt x t 2 that is within a constant factor of E in the shortest possible time. The answer is clear in the high drift-to-noise regime /σ p

µ/16L3. Indeed, in this case, Theorem 3 directly implies that Algorithm 1 with the constant learning rate η = 1/2L will ﬁnd a point xt satisfying E xt x t 2 E in time t (L/µ) log( x0 x 0 2/E). Notice that this eﬃciency estimate is logarithmic in 1/E; intuitively, the reason for the absence of a sublinear component is that the error due to the drift dominates the error due to the variance σ2 in the stochastic gradient. The low drift-to-noise regime /σ < p

µ/16L3 is more subtle. Namely, the simplest strategy is to execute Algorithm 1 with the constant learning rate η = (2 2/µσ2)1/3. Then a direct application of Theorem 3 yields the estimate E xt x t 2 E in time t (σ2/µ2E) log( x0 x 0 2/E). This eﬃciency estimate can be signiﬁcantly improved by gradually decaying the learning rate using a step decay schedule , wherein the algorithm is implemented in epochs with the new learning rate chosen to be the midpoint between the current learning rate and η . Such schedules are well known to improve eﬃciency in the static setting, as was discovered by Ghadimi and Lan (2013), and can be used here. The end result is the following theorem; see Theorem 28 for the formal statement.

Theorem 4 (Time to track in expectation, informal) Suppose that Assumption 2 holds. Then there is a learning rate schedule {ηt} such that Algorithm 1 produces a point xt

Stochastic Optimization under Distributional Drift

E xt x t 2 E in time t L

µ log x0 x 0 2

The eﬃciency estimate in Theorem 4 is strikingly similar to the eﬃciency estimate in the static setting (Ghadimi and Lan, 2013), with E playing the role of the target accuracy ε. An elementary computation shows that in the low drift-to-noise regime, Theorem 4 improves the constant learning rate eﬃciency estimate when E is small, e.g., when E x0 x 0 2/e2. Theorems 3 and 4 provide useful baseline guarantees for the performance of Algorithm 1. Nonetheless, these guarantees are all stated in terms of the expected tracking error E xt x t 2, and are therefore only meaningful if the entire algorithm can be repeated from scratch multiple times. There is no shortage of situations in which a learning algorithm is operating in real time and the time drift is irreversible; in such settings, the algorithm may only be executed once. These situations call for eﬃciency estimates that hold with high probability, rather than only in expectation.

3.2 High-Probability Guarantees

We next present high-probability guarantees for the tracking error xt x t 2. Proofs appear in Section 6.2. We make the following standard light-tail assumptions on the minimizer drift and gradient noise (Harvey et al., 2019; Lan, 2012; Nemirovski et al., 2009).

Assumption 3 (Sub-Gaussian drift and noise) There exist constants , σ > 0 such that the following two conditions hold for all t 0:

(i) (Drift) The drift 2 t is sub-exponential conditioned on Ft with parameter 2:

E exp(λ 2 t ) | Ft exp(λ 2) for all 0 λ 2.

(ii) (Noise) The noise zt is norm sub-Gaussian conditioned on Ft with parameter σ/2:

P zt τ | Ft 2 exp( 2τ 2/σ2) for all τ > 0.

Note that the ﬁrst item of Assumption 3 is equivalent to asserting that the minimizer drift t is sub-Gaussian conditioned on Ft (see Vershynin, 2018, Lemma 2.7.6). Clearly Assumption 3 implies Assumption 2 with the same constants , σ. It is worthwhile to note some common settings in which Assumption 3 holds; the claims in Remark 5 below follow from standard results on sub-Gaussian random variables (Jin et al., 2019; Vershynin, 2018).

Remark 5 (Common settings for Assumption 3) Fix constants , σ > 0. If t is bounded by , then clearly 2 t is sub-exponential (conditioned on Ft) with parameter 2. Similarly, if zt is bounded by σ/2, then zt is norm sub-Gaussian (conditioned on Ft) with parameter σ/2 (by Markov s inequality). Alternatively, if the increment x t x t+1 is mean-zero sub-Gaussian conditioned on Ft with parameter /

d, then x t x t+1 is mean-zero norm sub-Gaussian conditioned on Ft with parameter 2

2 and hence 2 t is sub-exponential conditioned on Ft with parameter c 2 for some absolute constant c > 0. Similarly, if zt is sub-Gaussian conditioned on Ft with parameter σ/4

2d, then zt is norm sub-Gaussian conditioned on Ft with parameter σ/2.

Cutler, Drusvyatskiy, Harchaoui

The following theorem shows that if Assumption 3 holds, then the expected bound on xt x t 2 derived in Theorem 3 holds with high probability.

Theorem 6 (High-probability distance tracking) Let {xt} be the iterates produced by Algorithm 1 with constant learning rate η 1/2L, and suppose that Assumption 3 holds. Then there is an absolute constant c > 0 such that for any speciﬁed t N and δ (0, 1), the following estimate holds with probability at least 1 δ:

xt x t 2 1 µη

2 t x0 x 0 2 + c

The proof of Theorem 6 employs a technique used by Harvey et al. (2019). The idea is to build a careful recursion for the moment generating function of xt x t 2, leading to a onesided sub-exponential tail bound. As a consequence of Theorem 6, we can again implement a step decay schedule to obtain the following eﬃciency estimate with high probability; see Theorem 31 for the formal statement.

Theorem 7 (Time to track with high probability, informal) Suppose that Assumption 3 holds and that we are in the low drift-to-noise regime /σ < p

µ/16L3. Then there is a learning rate schedule {ηt} such that for any speciﬁed δ (0, 1), Algorithm 1 produces a point xt satisfying

xt x t 2 E log e

with probability at least 1 δ in time

µ log x0 x 0 2

4. Tracking the Minimum Value

The results outlined so far have focused on tracking the minimizer x t . In this section, we present results for tracking the minimum value ϕ t . These two goals are fundamentally diﬀerent. Generally speaking, good bounds on the function gap along with strong convexity imply good bounds on the distance to the minimizer; the reverse implication is false. To this end, we require a stronger assumption on the variation of the functions ft in time t: rather than merely controlling the minimizer drift t, we will assume control on the gradient drift

Gi,t := sup x fi(x) ft(x) .

Our strategy is to track the minimum value along the running average ˆxt of the iterates xt produced by Algorithm 1, as deﬁned in Algorithm 2 below. The reason behind using this particular running average is brought to light in Section 6.3, where we apply a standard averaging technique (Appendix A) to a one-step improvement along xt (Lemma 32) to obtain the desired progress along ˆxt (Proposition 33).

4.1 Bounds in Expectation

We begin with bounding the expected value E[ϕt(ˆxt) ϕ t ]. Proofs appear in Section 6.3. Analogous to Assumption 2, we make the following assumption regarding drift and noise.

Stochastic Optimization under Distributional Drift

Algorithm 2 Averaged Online Proximal Stochastic Gradient PSG(x0, µ, {ηt}, T)

Input: initial x0 = ˆx0, strong convexity parameter µ, and step sizes {ηt}T 1 t=0 (0, 1/µ) Step t = 0, . . . , T 1:

Select gt = e ft(xt)

Set xt+1 = proxηtrt(xt ηtgt)

Set ˆxt+1 = 1 µηt 2 µηt

ˆxt + µηt 2 µηt xt+1

Return ˆx T

Assumption 4 (Bounded second moments) The regularizers rt r are identical for all times t and there exist constants , σ > 0 such that the following two conditions hold for all 0 i < t:

(i) (Drift) The gradient drift Gi,t satisﬁes E G2 i,t (µ |i t|)2.

(ii) (Noise) The gradient noise zi satisﬁes E zi 2 σ2 and E zi, x t = 0.

These two assumptions are natural indeed. Taking into account Lemma 1, it is clear that Assumption 4 implies the earlier Assumption 2 with the same constants , σ. The drift assumption intuitively asserts that second moment of Gi,t grows at most quadratically in time |i t|. In particular, returning to Example 2, suppose that the distribution map D( ) is ε-Lipschitz continuous in the Wasserstein-1 distance, the loss ℓ( , ξ) is C1smooth for all ξ, and the gradient ℓ(x, ) is β-Lipschitz continuous for all x. Then the Kantorovich-Rubinste ın duality theorem (Kantorovich and Rubinshte ın, 1958) directly implies E G2 i,t (εβ)2 E vi vt 2. Therefore, as long as the second moment E vi vt 2

scales quadratically in |i t|, the desired drift assumption holds. The assumption on the gradient noise stipulates a uniform bound on the second moment E zi 2 and that the condition E zi, x t = 0 holds. The latter property confers a weak form of uncorrelatedness between the gradient noise zi and the future minimizer x t , and holds automatically if the gradient noise and the minimizers evolve independently of each other, as would typically be the case for instance in Example 2. The following theorem establishes an expected improvement guarantee for Algorithm 2.

Theorem 8 (Expected function gap) Let {ˆxt} be the iterates produced by Algorithm 2 with constant learning rate η 1/2L, and suppose that Assumption 4 holds. Then the following bound holds for all t 0:

E ϕt(ˆxt) ϕ t 1 µη

2 t ϕ0(x0) ϕ 0

| {z } optimization

+ ησ2 |{z} noise + 2

µη2 |{z} drift

The noise + drift error term in Theorem 8 coincides with µ times the error term in Theorem 3, as expected. With Theorem 8 in hand, we are led to deﬁne the following asymptotic tracking error of Algorithm 2 corresponding to E[ϕt(ˆxt) ϕ t ]:

G := µE = min η (0,1/2L]

Cutler, Drusvyatskiy, Harchaoui

The corresponding asymptotically optimal choice of η is again given by

( 1 2L, 2 2

and the dichotomy governed by the drift-to-noise ratio /σ remains:

µ2 2/3 otherwise.

In the high drift-to-noise regime /σ p

µ/16L3, Theorem 8 directly implies that Algorithm 2 with the constant learning rate η = 1/2L ﬁnds a point ˆxt satisfying E[ϕt(ˆxt) ϕ t ] G in time t (L/µ) log((ϕ0(x0) ϕ 0)/G). In the low drift-to-noise regime /σ < p

µ/16L3, another direct application of Theorem 8 shows that Algorithm 2 with the constant learning rate η = (2 2/µσ2)1/3 ﬁnds a point ˆxt satisfying E[ϕt(ˆxt) ϕ t ] G in time t (σ2/µG) log((ϕ0(x0) ϕ 0)/G). As before, this eﬃciency estimate can be signiﬁcantly improved by implementing a step decay schedule. The end result is the following theorem; see Theorem 35 for the formal statement.

Theorem 9 (Time to track in expectation, informal) Suppose that Assumption 4 holds. Then there is a learning rate schedule {ηt} such that Algorithm 2 produces a point ˆxt satisfying

E ϕt(ˆxt) ϕ t G in time t L

µ log ϕ0(x0) ϕ 0 G

In the low drift-to-noise regime, Theorem 9 improves the constant learning rate eﬃciency estimate when G is small, e.g., when G (ϕ0(x0) ϕ 0)/e2.

4.2 High-Probability Guarantees

Next, we obtain high-probability analogues of Theorems 8 and 9. Proofs appear in Section 6.4. Naturally, such results should rely on light-tail assumptions on the gradient drift Gi,t and the norm of the gradient noise zi . We state the guarantees under an assumption of sub-Gaussian drift and noise (Assumption 5 below). In particular, we require that the gradient noise zi is mean-zero conditioned on the σ-algebra

Fi,t := σ(Fi, x t )

for all 0 i < t; the property E[zi | Fi,t] = 0 would follow from independence of the gradient noise zi and the future minimizer x t and is very reasonable in light of Examples 1 and 2.

Assumption 5 (Sub-Gaussian drift and noise) The regularizers rt r are identical for all times t and there exist constants , σ > 0 such that the following two conditions hold for all 0 i < t:

(i) (Drift) The squared gradient drift G2 i,t is sub-exponential with parameter (µ |i t|)2:

E exp λG2 i,t exp λ(µ |i t|)2 for all 0 λ (µ |i t|) 2.

Stochastic Optimization under Distributional Drift

(ii) (Noise) The gradient noise zi is mean-zero norm sub-Gaussian conditioned on Fi,t with parameter σ/2, i.e., E[zi | Fi,t] = 0 and

P zi τ | Fi,t 2 exp( 2τ 2/σ2) for all τ > 0.

Clearly the chain of implications holds:

Assumption 5 = Assumption 4 = Assumption 2.

The following theorem shows that if Assumption 5 holds, then the expected bound on ϕt(ˆxt) ϕ t derived in Theorem 8 holds with high probability.

Theorem 10 (Function gap with high probability) Let {ˆxt} be the iterates produced by Algorithm 2 with constant learning rate η 1/2L, and suppose that Assumption 5 holds. Then there is an absolute constant c > 0 such that for any speciﬁed t N and δ (0, 1), the following estimate holds with probability at least 1 δ:

ϕt(ˆxt) ϕ t c 1 µη

2 t ϕ0(x0) ϕ 0 + ησ2 + 2

The proof of Theorem 10 is based on combining the generalized Freedman inequality of Harvey et al. (2019) with careful control on the drift and noise in improvement guarantees for the proximal stochastic gradient method. The key observation is that although we do not have simple recursive control on the moment generating function of ϕt(ˆxt) ϕ t (as we do with xt x t 2), we can instead control the tracking error ϕt(ˆxt) ϕ t by leveraging control on the martingale Pt 1 i=0 zi, xi x t ζt 1 i, where ζ = 1 µη/(2 µη). This martingale is self-regulating in the sense that its total conditional variance is bounded by the history of the process; the generalized Freedman inequality is precisely suited to bound such martingales with high probability. With Theorem 10 in hand, we may implement a step decay schedule as before to obtain the following eﬃciency estimate; see Theorem 41 for the formal statement.

Theorem 11 (Time to track with high probability, informal) Suppose that Assumption 5 holds and that we are in the low drift-to-noise regime /σ < p

µ/16L3. Fix δ (0, 1). Then there is a learning rate schedule {ηt} such that Algorithm 2 produces a point ˆxt satisfying

ϕt(ˆxt) ϕ t G log e

with probability at least 1 Kδ in time

µ log ϕ0(x0) ϕ 0 G

µG log log e

, where K log2

5. Extension to the Decision-Dependent Setting

In this section, we extend the framework and results of the previous sections to a much wider class of tracking problems. In particular, the material in this section is a strict generalization of all the results in the previous sections and can model the performative prediction framework of Perdomo et al. (2020) in a time-dependent setting.

Cutler, Drusvyatskiy, Harchaoui

Setting the stage, suppose that we have a family of functions {ft,x( )} indexed by time t N and points x Rd. Upon replacing the function ft in the time-dependent problem (5) by the function ft,x depending not only on the time t but also on the decision variable x, we obtain the sequence of decision-dependent stochastic optimization problems

min x Rd ft,x(x) + rt(x) (6)

indexed by t. Tracking the solutions to (6) is typically a challenging task due to the dual dependency of ft,x(x) on the decision x. To obtain a more tractable tracking problem, we may decouple this dependency on x by introducing an auxiliary decision variable u and considering the family of stochastic optimization problems

min u Rd ft,x(u) + rt(u) (7)

indexed by both t and x. Instead of tracking the optimal decisions solving (6), our aim becomes to track the stable decisions

xt argmin u Rd ft, xt(u) + rt(u) (8)

arising from (7). We call a point xt satisfying (8) an equilibrium point of (7) at time t; observe that xt is stable in the sense that it is a ﬁxed point of the map x 7 argminu{ft,x(u) +rt(u)}. Reasonable regularity assumptions on the family {ft,x( )} ensure that (7) admits a unique equilibrium point at each time t (see Assumption 6 and Lemma 12 below). When the functions ft,x are independent of x, the equilibrium points are simply the minimizers of ft + rt the content of the previous sections. Our goal in this section is to track the equilibrium points xt, or equivalently to track the minimizers of the time-dependent stochastic optimization problem min u Rd ft, xt(u) + rt(u). (9)

Formally, (9) is an example of (5), but this viewpoint is not directly useful since xt is unknown. This more general framework allows us to model more dynamic settings. The main example stems from the setting of performative prediction introduced by Perdomo et al. (2020). This will be a running example throughout the section.

Example 3 (Performative prediction) Within the framework of performative prediction, the functions take the form ft,x(u) = Eξ D(t,x)ℓ(u, ξ) for some family of distributions D(t, x) indexed by both the time t and the decision variable x. The motivation for the dependence of the distribution on x is that often deployment of a learning rule parametrized by x causes the population to change their proﬁle to increase the likelihood of a better personal outcome a process called gaming . In other words, the population data is a function of the decision taken by the learner. Moreover, the dependence of the population data on time appears naturally when the population evolves due to exogenous temporal eﬀects (e.g., seasonal, economic). The equilibrium points xt have a clear meaning in this context. Namely, xt is an equilibrium point if the learner has no reason deviate from the learning rule xt based on the response distribution D(t, xt) alone. Whenever we refer back to this example, we will impose the following assumptions that are direct extensions of Perdomo et al. (2020) to the time-dependent setting. Namely, ﬁx a nonempty metric space M equipped with its Borel σ-algebra and let P1(M) denote the

Stochastic Optimization under Distributional Drift

space of Radon probability measures on M with ﬁnite ﬁrst moment, equipped with the Wasserstein-1 distance W1. We make the natural assumption that there exist constants θ, ε 0 such that the distribution map D( , ) satisﬁes the following Lipschitz condition:

W1 D(i, x), D(t, y) θ|i t| + ε x y for all (i, x), (t, y) N Rd.

Moreover, we suppose that the loss function ℓ: Rd M R has the following three properties: ℓ(u, ) L1(π) for all u Rd and π P1(M); ℓ( , ξ) is C1-smooth for all ξ M; and there is a constant β 0 such that the map ξ 7 ℓ(u, ξ) is β-Lipschitz continuous for all u Rd, where ℓ(u, ξ) denotes the gradient of ℓ( , ξ) evaluated at u. These assumptions directly imply the following stability property of the gradients with respect to distributional perturbations (see Drusvyatskiy and Xiao, 2022, Lemma 2.1):

sup u Rd fi,x(u) ft,y(u) θβ|i t| + εβ x y for all (i, x), (t, y) N Rd. (10)

Suppose now that each expected loss ft,x( ) is µ-strongly convex. In this case, Perdomo et al. (2020) identiﬁed the optimal parameter regime εβ < µ wherein the repeated minimization procedure yk+1 = argminu{ft,yk(u) + rt(u)} converges to the unique equilibrium point xt at a linear rate as k (see Perdomo et al., 2020, Theorem 3.5 and Proposition 3.6). The gradient stability property (10) will lead us to the corresponding parameter regime in our generalized setting.

5.1 Decision-Dependent Framework

We begin by recording the assumptions of our framework. Similar to the previous sections, we assume that the following standard regularity conditions hold:

(i) Each function ft,x : Rd R is µ-strongly convex and C1-smooth with L-Lipschitz continuous gradient for some common parameters µ, L > 0.

(ii) Each regularizer rt : Rd R { } is proper, closed, and convex.

For each t N and x, u Rd, we let ft,x(u) denote the gradient of the function ft,x( ) evaluated at u. In order to control the variation of the family {ft,x( )} in the decision variable x, we introduce the parameter

γ := sup t N, u,x,y Rd x =y

ft,x(u) ft,y(u)

and impose throughout Section 5 the following stability property of the gradients with respect to the decision variable.

Assumption 6 (Gradient stability in the decision variable) The following parameter regime holds: γ < µ.

In particular, γ is ﬁnite and the following Lipschitz bound holds:

sup t N, u Rd ft,x(u) ft,y(u) γ x y for all x, y Rd.

Returning to Example 3, it follows from (10) that Assumption 6 holds whenever εβ < µ. As the following lemma shows, the requirement γ < µ guarantees that for each t N, the equilibrium point xt is well deﬁned and unique.

Cutler, Drusvyatskiy, Harchaoui

Lemma 12 (Existence of equilibrium) For each t N, the map St : Rd Rd given by

St(x) = argmin u Rd ft,x(u) + rt(u)

is (γ/µ)-contractive and therefore has a unique ﬁxed point xt to which the repeated minimization procedure yk+1 = St(yk) converges at a linear rate as k .

Proof Note ﬁrst that St is well deﬁned by the strong convexity of each function ϕt,x := ft,x + rt. Next, given x, y Rd, observe that we have the ﬁrst-order optimality conditions 0 ϕt,x(St(x)) and 0 ϕt,y(St(y)); this last inclusion implies ft,y(St(y)) rt(St(y)) and hence ft,x(St(y)) ft,y(St(y)) ϕt,x(St(y)). On the other hand, the µ-strong convexity of ϕt,x implies that for all u, u dom ϕt,x, w ϕt,x(u), and w ϕt,x(u ), we have µ u u w w . Thus, taking u = St(x), w = 0, u = St(y), and w = ft,x(St(y)) ft,y(St(y)) yields

µ St(x) St(y) ft,x(St(y)) ft,y(St(y)) γ x y ,

where the last inequality holds by the deﬁnition of γ. Hence St(x) St(y) (γ/µ) x y , so St is (γ/µ)-contractive since γ < µ. An application of the Banach ﬁxed-point theorem completes the proof.

It is easy to see that the parameter regime γ < µ is optimal in the sense that equilibrium points can fail to exist whenever γ µ, as illustrated in the following example.

Example 4 (Optimality of the regime γ < µ) Consider the time-independent family of functions given by fx(u) = 1

2 u ax b 2 for any ﬁxed constant a 1 and vector b Rd. Then fx is 1-strongly convex and smooth with 1-Lipschitz continuous gradient, and

γ = sup u,x,y Rd x =y

fx(u) fy(u)

x y = a 1 = µ.

Now let X Rd be a nonempty closed convex set and take r = δX to be the convex indicator of X. Then x is an equilibrium point of the decoupled family of problems minu{fx(u)+r(u)} if and only if x argmin u Rd {f x(u) + r(u)} = {proj X (a x + b)}.

Taking a = 1, b = 0, and X = Rd, we see that γ = µ and no equilibrium point exists. On the other hand, if we take X = Rd + to be the nonnegative orthant and b > 0, then for any a 1 and x X we have x < ax + b = proj X (ax + b) and hence no equilibrium point exists for this problem with any value γ [µ, ).

Next, we turn to tracking the equilibria xt furnished by Lemma 12 using a decisiondependent proximal stochastic gradient method. Speciﬁcally, we make the standing assumption that at every time t, and at every query point x, the learner may obtain an unbiased estimator e ft,x(x) of the true gradient ft,x(x). With this oracle access, the decisiondependent proximal stochastic gradient method recorded as Algorithm 3 below selects in

Stochastic Optimization under Distributional Drift

each iteration t the stochastic gradient gt = e ft,xt(xt) and takes the step

xt+1 := proxηtrt(xt ηtgt) = argmin u Rd

n rt(u) + 1 2ηt u (xt ηtgt) 2o

using step size ηt > 0. As before, our goal is to obtain eﬃciency estimates for this procedure that hold both in expectation and with high probability.

Algorithm 3 Decision-Dependent PSG D-PSG(x0, {ηt}, T)

Input: initial x0 and step sizes {ηt}T 1 t=0 (0, ) Step t = 0, . . . , T 1: Select gt = e ft,xt(xt)

Set xt+1 = proxηtrt(xt ηtgt)

The guarantees we obtain allow both the iterates xt and the equilibria xt to evolve stochastically. Given {xt} and {gt} as in Algorithm 3, we let

zt := ft,xt(xt) gt

denote the gradient noise at time t and we impose the following assumption modeling stochasticity on a ﬁxed probability space (Ω, F, P) throughout Section 5.

Assumption 7 (Stochastic framework) There exists a probability space (Ω, F, P) with ﬁltration (Ft)t 0 such that F0 = { , Ω} and the following two conditions hold for all t 0:

(i) xt, xt : Ω Rd are Ft-measurable.

(ii) zt : Ω Rd is Ft+1-measurable with E[zt | Ft] = 0.

The ﬁrst item of Assumption 7 formalizes the assertion that xt and xt are fully determined by information up to time t. The second item of Assumption 7 formalizes the assertion that the gradient noise zt is fully determined by information up to time t + 1 and has zero mean conditioned on the information up to time t, i.e., gt is an unbiased estimator of ft,xt(xt); for example, this holds naturally in Example 3 if we take gt = ℓ(xt, ξt) with ξt D(t, xt). Finally, we ﬁx some notation to be used henceforth. We deﬁne the positive parameter

and we deﬁne the equilibrium drift t and the temporal gradient drift Gi,t to be the random variables t := xt xt+1 and Gi,t := sup u,x Rd fi,x(u) ft,x(u) .

Note that in the setting of Example 3, the estimate (10) implies Gi,t θβ|i t| and hence t θβ/ µ by Lemma 1, provided the regularizers rt r are identical for all times t. We also set ϕt := ft,xt + rt, x t := argmin ϕt, ϕ t := min ϕt and ψt := ft, xt + rt and ψ t := min ψt.

Cutler, Drusvyatskiy, Harchaoui

In particular, the equilibrium point xt is the minimizer of the equilibrium function ψt, and ψ t denotes its minimum value. Observe that when γ = 0, we have ϕt = ψt + ct for some constant of integration ct and hence we recover the setting of Section 2 with x t = xt.

5.2 Tracking the Equilibrium Point

In a nutshell, the results of Section 3 extend directly to tracking the equilibrium points xt, with µ replaced by µ and replaced by (deﬁned in Assumption 8 below). We begin with bounding the expected value E xt xt 2. Due to the fact that Algorithm 3 takes steps on the current functions ϕt but the minimizers we aim to track are those of the equilibrium functions ψt, we will rely at the outset on controlling the function gaps fi,x(u) fi,x(v) ft,y(u) ft,y(v)

(at ﬁrst for i = t, and later for general i and t). We achieve this control in terms of the temporal gradient drift.

Lemma 13 (Function gap variation) For all (i, x), (t, y) N Rd and u, v Rd, we have fi,x(u) fi,x(v) ft,y(u) ft,y(v) Gi,t + γ x y u v .

Proof Fix (i, x), (t, y) N Rd and u, v Rd, and set uτ := v + τ(u v) for all τ [0, 1]. By the fundamental theorem of calculus and Cauchy-Schwarz, we have fi,x(u) fi,x(v) ft,y(u) ft,y(v) = Z 1

fi,x(uτ) ft,y(uτ), u v dτ

Gi,t + γ x y u v .

Switching (i, x) and (t, y) completes the proof.

Using Lemmas 2 and 13, we obtain the following equilibrium one-step improvement.

Lemma 14 (Equilibrium one-step improvement) The iterates {xt} produced by Algorithm 3 with ηt < 1/L satisfy the bound:

2ηt(ψt(xt+1) ψ t ) (1 µηt) xt xt 2 (1 γηt) xt+1 xt 2

+ 2ηt zt, xt xt + η2 t 1 Lηt zt 2.

Proof By Lemma 13, we have ψt(xt+1) ψt( xt) ϕt(xt+1) ϕt( xt)

= ft, xt(xt+1) ft, xt( xt) ft,xt(xt+1) ft,xt( xt)

γ xt xt xt+1 xt .

Hence ψt(xt+1) ψ t ϕt(xt+1) ϕt( xt) + γ xt xt xt+1 xt . Moreover, Young s inequality implies

γ xt xt xt+1 xt γ

2 xt xt 2 + γ

2 xt+1 xt 2.

Multiplying through by 2ηt and applying Lemma 2 completes the proof.

Stochastic Optimization under Distributional Drift

For simplicity, we state the main results under the assumption that the second moments E 2 t and E zt 2 are uniformly bounded; more general guarantees that take into account weighted averages of the moments and allow for time-dependent learning rates follow from Lemma 14 as well.

Assumption 8 (Bounded second moments) There exist constants , σ > 0 such that the following two conditions hold for all t 0:

(i) (Drift) The equilibrium drift t satisﬁes E 2 t 2.

(ii) (Noise) The gradient noise zt satisﬁes E zt 2 σ2.

The following theorem establishes an expected improvement guarantee for Algorithm 3, thereby extending Theorem 3; see Section 6.1 for the precise statement (Corollary 27) and proof.

Theorem 15 (Expected distance) Suppose that Assumption 8 holds. Then the iterates produced by Algorithm 3 with constant learning rate η 1/2L satisfy the bound:

E xt xt 2 (1 µη)t x0 x0 2 | {z } optimization

µ |{z} noise

| {z } drift

With Theorem 15 in hand, we are led to deﬁne the following asymptotic tracking error of Algorithm 3 corresponding to E xt xt 2, together with the corresponding optimal step size:

E := min η (0,1/2L]

and η := min

( 1 2L, 2 2

Plugging η into the deﬁnition of E, we see that Algorithm 3 exhibits qualitatively diﬀerent behaviors in settings corresponding to high or low drift-to-noise ratio /σ:

µ2 2/3 otherwise.

As before, the high drift-to-noise regime /σ p

µ/16L3 is uninteresting from the viewpoint of stochastic optimization and we focus on the low drift-to-noise regime /σ < p

µ/16L3. The following theorem extends Theorem 4; see Theorem 28 for the formal statement and proof.

Theorem 16 (Time to track in expectation, informal) Suppose that Assumption 8 holds. Then there is a learning rate schedule {ηt} such that Algorithm 3 produces a point xt satisfying

E xt xt 2 E in time t L

µ log x0 x0 2

Cutler, Drusvyatskiy, Harchaoui

Next, we present high-probability guarantees for the tracking error xt xt 2 under the following standard light-tail assumption on the equilibrium drift and gradient noise.

Assumption 9 (Sub-Gaussian drift and noise) There exist constants , σ > 0 such that the following two conditions hold for all t 0:

(i) (Drift) The drift 2 t is sub-exponential conditioned on Ft with parameter 2:

E exp(λ 2 t ) | Ft exp(λ 2) for all 0 λ 2.

(ii) (Noise) The noise zt is norm sub-Gaussian conditioned on Ft with parameter σ/2:

P zt τ | Ft 2 exp( 2τ 2/σ2) for all τ > 0.

Note that the ﬁrst item of Assumption 9 is equivalent to asserting that the equilibrium drift t is sub-Gaussian conditioned on Ft, and that this condition holds trivially in the setting of Example 3 with = θβ/ µ provided the regularizers rt r are identical for all times t. Clearly Assumption 9 implies Assumption 8 with the same constants , σ. The following theorem shows that if Assumption 9 holds, then the expected bound on xt xt 2

derived in Theorem 15 holds with high probability.

Theorem 17 (High-probability distance tracking) Let {xt} be the iterates produced by Algorithm 3 with constant learning rate η 1/2L, and suppose that Assumption 9 holds. Then there is an absolute constant c > 0 such that for any speciﬁed t N and δ (0, 1), the following estimate holds with probability at least 1 δ:

xt xt 2 1 µη

2 t x0 x0 2 + c

Theorem 17 is an extension of Theorem 6. As a consequence of Theorem 17, we can again implement a step decay schedule in the low drift-to-noise regime to obtain the following eﬃciency estimate with high probability, thereby extending Theorem 7; see Section 6.2 for the precise statements (Theorems 30 and 31) and proofs.

Theorem 18 (Time to track with high probability, informal) Suppose that Assumption 9 holds and that we are in the low drift-to-noise regime /σ < p

µ/16L3. Then there is a learning rate schedule {ηt} such that for any speciﬁed δ (0, 1), Algorithm 3 produces a point xt satisfying

xt xt 2 E log e

with probability at least 1 δ in time

µ log x0 x0 2

5.3 Tracking the Equilibrium Value

The results outlined so far have focused on tracking the equilibrium point xt, i.e., the minimizer of ψt. In this section, we present results for tracking the equilibrium value ψ t in the parameter regime γ < µ/2. (12)

Stochastic Optimization under Distributional Drift

The regime (12) matches the one used in Theorem 7.3 of Drusvyatskiy and Xiao (2022) to obtain function gap bounds for biased PSG along an average iterate, and we employ a similar averaging technique to obtain our bounds. Imposing the regime (12), we deﬁne the positive parameter

ˆµ := µ 2γ.

Our strategy is to track the equilibrium value ψ t along the running average ˆxt of the iterates xt produced by Algorithm 3, as deﬁned in Algorithm 4 below. In a nutshell, the results of Section 4 extend directly to tracking the equilibrium value ψ t , with µ replaced by ˆµ and replaced by (deﬁned in Assumption 10 below).

Algorithm 4 Averaged Decision-Dependent PSG D-PSG(x0, µ, γ, {ηt}, T) Input: initial x0 = ˆx0, strong convexity parameter µ, gradient drift parameter γ [0, µ/2), and step sizes {ηt}T 1 t=0 (0, 1/ µ), where µ = µ γ; set ˆµ = µ 2γ Step t = 0, . . . , T 1:

Select gt = e ft,xt(xt)

Set xt+1 = proxηtrt(xt ηtgt)

Set ˆxt+1 = 1 ˆµηt 2 µηt

ˆxt + ˆµηt 2 µηt xt+1

Return ˆx T

We begin with bounding the expected value E[ψt(ˆxt) ψ t ]. This requires a weak form of uncorrelatedness between the gradient noise zi and the future equilibrium point xt, which we stipulate in the following analogue of Assumption 4.

Assumption 10 (Bounded second moments) The regularizers rt r are identical for all times t and there exist constants , σ > 0 such that the following two conditions hold for all 0 i < t:

(i) (Drift) The temporal gradient drift Gi,t satisﬁes E G2 i,t (ˆµ |i t|)2.

(ii) (Noise) The gradient noise zi satisﬁes E zi 2 σ2 and E zi, xt = 0.

Taking into account Lemma 1, it is clear that Assumption 10 implies the earlier Assumption 8 with the same constants , σ. Further, the condition on the drift holds trivially in the setting of Example 3 with = θβ/ˆµ provided µ > 2εβ. The following theorem presents an expected improvement guarantee for Algorithm 4, thereby extending Theorem 8; see Corollary 34 for the precise statement and proof.

Theorem 19 (Expected function gap) Let {ˆxt} be the iterates produced by Algorithm 4 with constant learning rate η 1/2L, and suppose that Assumption 10 holds. Then the following bound holds for all t 0:

E ψt(ˆxt) ψ t 1 ˆµη

2 t ψ0(x0) ψ 0

| {z } optimization

+ ησ2 |{z} noise + 2

ˆµη2 |{z} drift

Cutler, Drusvyatskiy, Harchaoui

With Theorem 19 in hand, we are led to deﬁne the following asymptotic tracking error of Algorithm 4 corresponding to E[ψt(ˆxt) ψ t ], together with the corresponding optimal step size:

b G := min η (0,1/2L]

and ˆη := min

( 1 2L, 2 2

A familiar dichotomy governed by the drift-to-noise ratio /σ arises. We again focus on the low drift-to-noise regime /σ < p

ˆµ/16L3. The following theorem extends Theorem 9; see Theorem 35 for the formal statement.

Theorem 20 (Time to track in expectation, informal) Suppose that Assumption 10 holds. Then there is a learning rate schedule {ηt} such that Algorithm 4 produces a point ˆxt satisfying

E[ψt(ˆxt) ψ t ] b G in time t L

ˆµ log ψ0(x0) ψ 0 b G

Next, we obtain high-probability analogues of Theorems 19 and 20. Naturally, such results should rely on light-tail assumptions on the temporal gradient drift Gi,t and the norm of the gradient noise zi . We state the guarantees under an assumption of sub-Gaussian drift and noise (Assumption 11 below). In particular, we require that the gradient noise zi is mean-zero conditioned on the σ-algebra

Fi,t := σ(Fi, xt)

for all 0 i < t; the property E[zi | Fi,t] = 0 would follow from independence of the gradient noise zi and the future equilibrium point xt.

Assumption 11 (Sub-Gaussian drift and noise) The regularizers rt r are identical for all times t and there exist constants , σ > 0 such that the following two conditions hold for all 0 i < t:

(i) (Drift) The drift G2 i,t is sub-exponential with parameter (ˆµ |i t|)2:

E exp λ G2 i,t exp λ(ˆµ |i t|)2 for all 0 λ (ˆµ |i t|) 2.

(ii) (Noise) The noise zi is mean-zero norm sub-Gaussian conditioned on Fi,t with parameter σ/2, i.e., E[zi | Fi,t] = 0 and

P zi τ | Fi,t 2 exp( 2τ 2/σ2) for all τ > 0.

Clearly the chain of implications

Assumption 11 = Assumption 10 = Assumption 8

holds, and the condition on the drift in Assumption 11 holds trivially in the setting of Example 3 with = θβ/ˆµ provided µ > 2εβ. The following theorem shows that if Assumption 11 holds, then the expected bound on ψt(ˆxt) ψ t derived in Theorem 19 holds with high probability, thereby extending Theorem 10.

Stochastic Optimization under Distributional Drift

Theorem 21 (Function gap with high probability) Let {ˆxt} be the iterates produced by Algorithm 4 with constant learning rate η 1/2L, and suppose that Assumption 11 holds. Then there is an absolute constant c > 0 such that for any speciﬁed t N and δ (0, 1), the following estimate holds with probability at least 1 δ:

ψt(ˆxt) ψ t c 1 ˆµη

2 t ψ0(x0) ψ 0 + ησ2 + 2

With Theorem 21 in hand, we may implement a step decay schedule as before to obtain the following eﬃciency estimate, thereby extending Theorem 11; see Section 6.4 for the precise statements (Theorems 39 and 41) and proofs.

Theorem 22 (Time to track with high probability, informal) Suppose that Assumption 11 holds and that we are in the low drift-to-noise regime /σ < p

ˆµ/16L3. Fix δ (0, 1). Then there is a learning rate schedule {ηt} such that Algorithm 4 produces a point ˆxt satisfying

ψt(ˆxt) ψ t b G log e

with probability at least 1 Kδ in time

ˆµ log ψ0(x0) ψ 0 b G

ˆµ b G log log e

, where K log2

6. Proofs of Main Results

Roadmap. In this section, we derive the results of the preceding sections under the uniﬁed framework presented in Section 5.1; we impose the assumptions and notation of Section 5.1 henceforth. Sections 6.1 and 6.2 handle distance tracking in expectation and with high probability, respectively; this corresponds to the results presented in Section 5.2 (entailing those of Sections 3.1 and 3.2). Then Sections 6.3 and 6.4 handle function gap tracking in expectation and with high probability, respectively; this corresponds to the results presented in Section 5.3 (entailing those of Sections 4.1 and 4.2).

6.1 Tracking the Equilibrium Point: Bounds in Expectation

The proof of Theorem 15 follows a familiar pattern in stochastic optimization. We begin by recalling Lemma 2, which gives a standard one-step improvement guarantee for the proximal stochastic gradient method on the ﬁxed problem min ϕt.

Lemma 23 (One-step improvement) For all x Rd, the iterates {xt} produced by Algorithm 3 with ηt < 1/L satisfy the bound:

2ηt(ϕt(xt+1) ϕt(x)) (1 µηt) xt x 2 xt+1 x 2 + 2ηt zt, xt x + η2 t 1 Lηt zt 2.

Proof Since ft := ft,xt is L-smooth, we have

ϕt(xt+1) = ft(xt+1) + rt(xt+1)

ft(xt) + ft(xt), xt+1 xt + L

2 xt+1 xt 2 + rt(xt+1)

= ft(xt) + rt(xt+1) + gt, xt+1 xt + L

2 xt+1 xt 2 + zt, xt+1 xt .

Cutler, Drusvyatskiy, Harchaoui

Next, given any δt > 0, Cauchy-Schwarz and Young s inequality yield

zt, xt+1 xt δt

2 zt 2 + 1 2δt xt+1 xt 2.

Therefore, given any x Rd, we have

ϕt(xt+1) ft(xt) + rt(xt+1) + gt, xt+1 xt + δ 1 t +L 2 xt+1 xt 2 + δt

= ft(xt) + rt(xt+1) + gt, xt+1 xt + 1 2ηt xt+1 xt 2

+ δ 1 t +L η 1 t 2 xt+1 xt 2 + δt

ft(xt) + rt(x) + gt, x xt + 1 2ηt x xt 2 1 2ηt x xt+1 2

+ δ 1 t +L η 1 t 2 xt+1 xt 2 + δt

where the last inequality holds because xt+1 = proxηtrt(xt ηtgt) is the minimizer of the η 1 t -strongly convex function rt + gt, xt + 1 2ηt xt 2. Now we estimate

ft(xt) + rt(x) + gt, x xt = ft(xt) + ft(xt), x xt + rt(x) + zt, xt x

2 x xt 2 + rt(x) + zt, xt x

2 x xt 2 + zt, xt x

using the µ-strong convexity of ft. Thus,

ϕt(xt+1) ϕt(x) µ

2 x xt 2 + zt, xt x + 1 2ηt x xt 2 1 2ηt x xt+1 2

+δ 1 t +L η 1 t 2 xt+1 xt 2 + δt

Finally, taking δt = ηt/(1 Lηt) and rearranging (note that ϕt(xt+1) is ﬁnite) yields

2ηt(ϕt(xt+1) ϕt(x)) (1 µηt) xt x 2 xt+1 x 2 + 2ηt zt, xt x + η2 t 1 Lηt zt 2,

as claimed.

It is critically important that the one-step improvement estimate in Lemma 23 holds with respect to any reference point x. In particular, as we already showed in Section 5.2, taking x = xt and applying Lemma 13 yields Lemma 14:

Lemma 24 (Equilibrium one-step improvement) The iterates {xt} produced by Algorithm 3 with ηt < 1/L satisfy the bound:

2ηt(ψt(xt+1) ψ t ) (1 µηt) xt xt 2 (1 γηt) xt+1 xt 2

+ 2ηt zt, xt xt + η2 t 1 Lηt zt 2.

With Lemma 24 in hand, we obtain the following recursion on xt xt 2.

Lemma 25 (Distance recursion) The iterates {xt} produced by Algorithm 3 with step size ηt < 1/L satisfy the bound:

xt+1 xt+1 2 (1 µηt) xt xt 2 + 2ηt zt, xt xt + η2 t 1 Lηt zt 2 + 1 + 1 µηt

Stochastic Optimization under Distributional Drift

Proof First, note

xt+1 xt+1 2 = xt+1 xt 2 + xt xt+1 2 + 2 xt+1 xt, xt xt+1

(1 + µηt) xt+1 xt 2 + 1 + 1 µηt

by Cauchy-Schwarz and Young s inequality. Further, the µ-strong convexity of ψt implies µ 2 xt+1 xt 2 ψt(xt+1) ψ t , which together with Lemma 24 implies

(1 + µηt) xt+1 xt 2 (1 µηt) xt xt 2 + 2ηt zt, xt xt + η2 t 1 Lηt zt 2.

The result follows.

Applying Lemma 25 recursively furnishes a bound on xt xt 2. When the step size is constant, the next proposition follows immediately.

Proposition 26 (Last-iterate progress) The iterates {xt} produced by Algorithm 3 with constant step size η < 1/L satisfy the bound:

xt xt 2 (1 µη)t x0 x0 2 + 2η

i=0 zi, xi xi (1 µη)t 1 i

i=0 zi 2(1 µη)t 1 i + 1 + 1

i=0 2 i (1 µη)t 1 i.

By taking expectations in Proposition 26, we obtain the following precise version of Theorem 15.

Corollary 27 (Expected distance) Suppose that Assumption 8 holds. Then the iterates {xt} generated by Algorithm 3 with constant learning rate η 1/2L satisfy the bound:

E xt xt 2 (1 µη)t x0 x0 2 + 2 ησ2

With Corollary 27 in hand, we can now prove an expected eﬃciency estimate for the online proximal stochastic gradient method using a step decay schedule, wherein the algorithm is implemented in epochs with the new learning rate chosen to be the midpoint between the current learning rate and the asymptotically optimal learning rate η . The following theorem provides a formal version of Theorem 16 (note that in the high drift-to-noise regime /σ p

µ/16L3, Theorem 16 holds trivially with the constant learning rate η = 1/2L). The argument is close in spirit to the justiﬁcations of the restart schemes used by Ghadimi and Lan (2013).

Theorem 28 (Time to track in expectation) Suppose that Assumption 8 holds and that we are in the low drift-to-noise regime /σ < p

µ/16L3. Set η = (2 2/ µσ2)1/3 and E = ( σ2/ µ2)2/3. Suppose moreover that we have available a positive upper bound on the initial squared distance D x0 x0 2. Consider running Algorithm 3 in k = 0, . . . , K 1 epochs, namely, set X0 = x0 and iterate the process

Xk+1 = D-PSG(Xk, ηk, Tk) for k = 0, . . . , K 1,

Cutler, Drusvyatskiy, Harchaoui

where the number of epochs is

and we set 5

and ηk = ηk 1 + η

2 , Tk = log(4)

Then the time horizon T = T0 + + TK 1 satisﬁes

and the corresponding tracking error satisﬁes E XK XK 2 E, where XK denotes the minimizer of ψT .

Proof For each index k, let tk := T0 + + Tk 1 (with t0 := 0), Xk be the minimizer of the corresponding equilibrium function ψtk, and

Then taking into account ηk η , Corollary 27 directly implies

E Xk+1 Xk+1 2 (1 µηk)Tk E Xk Xk 2 + 2

e µηk Tk E Xk Xk 2 + Ek.

We will verify by induction that the estimate E Xk Xk 2 2 Ek 1 holds for all indices k 1. To see the base case, observe

E X1 X1 2 e µη0T0 X0 X0 2 + E0 2 E0.

Now assume that the claim holds for some index k 1. We then conclude

E Xk+1 Xk+1 2 e µηk Tk E Xk Xk 2 + Ek

4E Xk Xk 2 + Ek

Ek 2 Ek 1 E Xk Xk 2 + Ek 2 Ek,

thereby completing the induction. Hence E XK XK 2 2 EK 1. Next, observe

µ (ηK 1 η ) = 2σ2

E XK XK 2 2(1 + 3

5. We use here the notation a+ = a 0 = max{a, 0} to denote the positive part of a real number a; note that for small D, the logarithms log( µLD/σ2) and log(D/ E) may be negative.

Stochastic Optimization under Distributional Drift

Finally, note

k=1 2k 2L 2K = 8L 2K 2 8 σ2 µ

This completes the proof.

6.2 Tracking the Equilibrium Point: High-Probability Guarantees

The proof of Theorem 17 is based on recursively controlling the moment generating function of xt xt 2. Namely, Lemma 25 in the regime ηt 1/2L directly yields

xt+1 xt+1 2 (1 µηt) xt xt 2 + 2ηt zt, vt xt xt + 2η2 t zt 2 + 2 µηt 2 t , (14)

where we set

xt xt if xt = xt 0 otherwise.

The goal is now to control the moment generating function E eλ xt xt 2 through the recursive inequality (14). The basic probabilistic tool to achieve this in similar settings under bounded noise assumptions was developed by Harvey et al. (2019); the following proposition is a slight generalization of Claim D.1 of Harvey et al. (2019) to the light-tail setting we require.

Proposition 29 (Recursive control on MGF) Consider scalar stochastic processes (Vt), (Dt), and (Xt) on a probability space with ﬁltration (Ht) such that Vt is nonnegative and Ht-measurable and the inequality

Vt+1 αt Vt + Dt p

Vt + Xt + κt

holds for some deterministic constants αt ( , 1] and κt R. Suppose that the moment generating functions of Dt and Xt conditioned on Ht satisfy the following inequalities for some deterministic constants σt, νt > 0:

E[exp(λDt) | Ht] exp(λ2σ2 t /2) for all λ 0 (e.g., Dt is mean-zero sub-Gaussian conditioned on Ht with parameter σt).

E[exp(λXt) | Ht] exp(λνt) for all 0 λ 1/νt (e.g., Xt is nonnegative and subexponential conditioned on Ht with parameter νt).

Then the inequality

E[exp(λVt+1)] exp λ(νt + κt) E exp λ 1 + αt

holds for all 0 λ min 1 αt

2σ2 t , 1 2νt .

Cutler, Drusvyatskiy, Harchaoui

Proof For any index t and any scalar λ 0, the tower rule implies

E[exp(λVt+1)] E exp λ αt Vt + Dt p

Vt + Xt + κt

= exp(λκt) E h exp(λαt Vt) E exp λDt p

Vt exp(λXt) | Ht i .

H older s inequality in turn yields

E exp λDt p

Vt exp(λXt) | Ht q

Vt Dt | Ht E exp(2λXt) | Ht

exp(2λ2Vtσ2 t ) exp(2λνt)

= exp(λ2σ2 t Vt) exp(λνt)

provided 0 λ 1 2νt . Thus, if 0 λ min 1 αt

2σ2 t , 1 2νt , then the following estimate holds:

E exp(λVt+1) exp(λκt) E exp(λαt Vt) exp(λ2σ2 t Vt) exp(λνt)

= exp λ(νt + κt) E exp(λ(αt + λσ2 t )Vt)

exp λ(νt + κt) E exp λ 1 + αt

The proof is complete.

We may now use Proposition 29 to derive the following precise version of Theorem 17.

Theorem 30 (High-probability distance tracking) Let {xt} be the iterates produced by Algorithm 3 with constant learning rate η 1/2L, and suppose that Assumption 9 holds. Then there exists an absolute constant 6 c > 0 such that for any speciﬁed t N and δ (0, 1), the following estimate holds with probability at least 1 δ:

xt xt 2 1 µη

2 t x0 x0 2 +

Proof Note ﬁrst that under Assumption 9, there exists an absolute constant c 1 such that zt 2 is sub-exponential conditioned on Ft with parameter cσ2 and zt is mean-zero sub-Gaussian conditioned on Ft with parameter cσ for all t (see Jin et al., 2019, Lemma 3). Therefore zt, vt is mean-zero sub-Gaussian conditioned on Ft with parameter cσ, while 2 t is sub-exponential conditioned on Ft with parameter 2 by Assumption 9. Thus, in light of inequality (14), we may apply Proposition 29 with Ht = Ft, Vt = xt xt 2, Dt = 2ηt zt, vt , Xt = 2η2 t zt 2 + 2 2 t / µηt, αt = 1 µηt, κt = 0, σt = 2ηtcσ, and νt = 2η2 t cσ2 + 2 2/ µηt, yielding the estimate

E exp λ xt+1 xt+1 2 exp λ 2η2 t cσ2 + 2 2

E exp λ 1 µηt

2 xt xt 2 (15)

0 λ min µ 8ηt(cσ)2 , 1 4η2 t cσ2 + 4 2/ µηt

6. Explicitly, one can take any c 1 such that zt 2 is sub-exponential conditioned on Ft with parameter cσ2 and zt is mean-zero sub-Gaussian conditioned on Ft with parameter cσ for all t.

Stochastic Optimization under Distributional Drift

Taking into account ηt η and iterating the recursion (15), we deduce

E exp λ xt xt 2 exp

2 t x0 x0 2 + λ 2η2cσ2 + 2 2

2 t x0 x0 2 + 4ηcσ2

0 λ min µ 8η(cσ)2 , 1 4η2cσ2 + 4 2/ µη

Moreover, setting

ν := 8η(cσ)2

and taking into account c 1 and µη 1, we have

and 1 ν = µ 8η(cσ)2 + 4 2/ µη2 min µ 8η(cσ)2 , 1 4η2cσ2 + 4 2/ µη

E h exp λ xt xt 2 1 µη

2 t x0 x0 2 i exp(λν) for all 0 λ 1/ν.

Taking λ = 1/ν and applying Markov s inequality completes the proof.

With Theorem 30 in hand, we can now prove a high-probability eﬃciency estimate for Algorithm 3 using a step decay schedule. The following theorem provides a formal version of Theorem 18. The argument follows the same reasoning as in the proof of Theorem 28, with Theorem 30 playing the role of Corollary 27. The proof appears in Appendix B (see Section B.1).

Theorem 31 (Time to track with high probability) Suppose that Assumption 9 holds and that we are in the low drift-to-noise regime /σ < p

µ/16L3. Set η = (2 2/ µσ2)1/3

and E = ( σ2/ µ2)2/3. Suppose moreover that we have available an upper bound on the initial squared distance D x0 x0 2. Consider running Algorithm 3 in k = 0, . . . , K 1 epochs, namely, set X0 = x0 and iterate the process

Xk+1 = D-PSG(Xk, ηk, Tk) for k = 0, . . . , K 1,

where the number of epochs is

and ηk = ηk 1 + η

2 , Tk = 2 log(12)

Cutler, Drusvyatskiy, Harchaoui

Then the time horizon T = T0 + + TK 1 satisﬁes

and for any speciﬁed δ (0, 1), the corresponding tracking error satisﬁes

XK XK 2 E log e

with probability at least 1 δ, where XK denotes the minimizer of ψT .

6.3 Tracking the Equilibrium Value: Bounds in Expectation

We turn now to tracking the equilibrium value. To begin, we require a more ﬂexible version of Lemma 24 which holds in the static regularizer setting rt r.

Lemma 32 (Equilibrium one-step improvement) The iterates {xt} produced by Algorithm 3 with rt r and ηt < 1/L satisfy the following bound for all indices i, t N and arbitrary α > 0:

2ηi ψt(xi+1) ψ t (1 µηi) xi xt 2 1 (γ + α)ηi xi+1 xt 2

+ 2ηi zi, xi xt + η2 i 1 Lηi zi 2 + ηi

Proof Taking into account rt r and applying Lemma 13, we have ψt(xi+1) ψt( xt) ϕi(xi+1) ϕi( xt)

= ft, xt(xi+1) ft, xt( xt) fi,xi(xt+1) fi,xi( xt)

Gi,t + γ xi xt xi+1 xt .

Hence ψt(xi+1) ψ t ϕi(xi+1) ϕi( xt) + Gi,t + γ xi xt xi+1 xt . Moreover, Young s inequality implies Gi,t + γ xi xt xi+1 xt γ

2 xi xt 2 + γ+α

2 xi+1 xt 2 + 1

Multiplying through by 2ηi and applying Lemma 23 completes the proof.

Turning the estimate in Lemma 32 into an eﬃciency guarantee for the average iterate is essentially standard and follows for example from the averaging techniques used by Drusvyatskiy and Xiao (2022), Ghadimi and Lan (2012), and Kulunchakov and Mairal (2020). The resulting progress along the average iterate is summarized in the following proposition, while the description of the key averaging lemma is placed in Appendix A. Henceforth, we impose the regime (12): γ < µ/2.

Stochastic Optimization under Distributional Drift

Proposition 33 (Progress along the average iterate) The iterates {ˆxt} produced by Algorithm 4 with rt r and constant step size η 1/2L satisfy the bound

ψt(ˆxt) ψ t (1 ˆρ)t ψt(x0) ψ t + ˆµ

4 x0 xt 2 + ˆρ

i=0 zi, xi xt (1 ˆρ)t 1 i

i=0 zi 2(1 ˆρ)t 1 i + ˆρ

i=0 G2 i,t(1 ˆρ)t 1 i,

where ˆρ := ˆµη/(2 µη).

Proof Setting α = ˆµ/2 in Lemma 32, we obtain the following recursion for all indices k 0 and t 1: ρ ψk(xt) ψ k (1 c1ρ)Vt 1 (1 + c2ρ)Vt + ωt, where ρ = 2η, c1 = µ/2, c2 = µ/4, Vi = xi xk 2, and ωt = 2η zt 1, xt 1 xk + 2η2 zt 1 2 + (2η/ˆµ) G2 t 1,k. The result follows by applying Lemma 42 with h = ψk ψ k and then taking k = t.

Taking expectations in Proposition 33 yields the following precise version of Theorem 19.

Corollary 34 (Expected function gap) Let {ˆxt} be the iterates produced by Algorithm 4 with constant step size η 1/2L, set ˆρ := ˆµη/(2 µη), and suppose that Assumption 10 holds. Then

E ψt(ˆxt) ψ t (1 ˆρ)t E ψt(x0) ψ t + ˆµ

4 x0 xt 2 + ησ2 + 8 2

for all t 0. Consequently, we have

E ψt(ˆxt) ψ t (1 ˆρ)t ψ0(x0) ψ 0 + ησ2 + 2

for all t 0, and the following asymptotic error bound holds:

lim sup t E ψt(ˆxt) ψ t ησ2 + 8 2

Proof The bound (16) follows by taking expectations in Proposition 33 and noting

i=0 E zi 2(1 ˆρ)t 1 i σ2

i=0 E G2 i,t(1 ˆρ)t 1 i (ˆµ )2(2 ˆρ)

by Assumption 10. Next, applying Lemma 13, Lemma 1, and Young s inequality together with the µ-strong convexity of ψ0 yields

ψt(x0) ψ t + ˆµ

4 x0 xt 2 3 ψ0(x0) ψ 0 + 5 G2 0,t/ µ, (17)

and then taking expectations and invoking Assumption 10 gives

E ψt(x0) ψ t + ˆµ

4 x0 xt 2 3 ψ0(x0) ψ 0 + 5ˆµ 2t2. (18)

Further, the inequality

e ˆµηt/2ˆµt2 16/ˆµη2 ˆµ, η, t > 0 (19)

Cutler, Drusvyatskiy, Harchaoui

combines with inequality (18) to yield

(1 ˆρ)t E ψt(x0) ψ t + ˆµ

4 x0 xt 2 3(1 ˆρ)t ψ0(x0) ψ 0 + 80 2

and the remaining assertions of the corollary follow.

We may now apply Corollary 34 to obtain a formal version of Theorem 20; the proof closely follows that of Theorem 28 and is included in Appendix B (see Section B.2).

Theorem 35 (Time to track in expectation) Suppose that Assumption 10 holds and that we are in the low drift-to-noise regime /σ < p

ˆµ/16L3. Set ˆη = (2 2/ˆµσ2)1/3 and b G = ˆµ( σ2/ˆµ2)2/3. Suppose moreover that we have available a positive upper bound on the initial gap D ψ0(x0) ψ 0. Consider running Algorithm 4 in k = 0, . . . , K 1 epochs, namely, set X0 = x0 and iterate the process

Xk+1 = D-PSG(Xk, µ, γ, ηk, Tk) for k = 0, . . . , K 1,

where the number of epochs is

and ηk = ηk 1 + ˆη

2 , Tk = 2 log(12)

Then the time horizon T = T0 + + TK 1 satisﬁes

and the corresponding tracking error satisﬁes E ψT (XK) ψ T b G.

6.4 Tracking the Equilibrium Value: High-Probability Guarantees

In this section, we derive the high-probability analogues of the results in Section 6.3. In light of Proposition 33, we seek upper bounds on the sums

i=0 zi, xi xt (1 ˆρ)t 1 i,

i=0 zi 2(1 ˆρ)t 1 i,

i=0 G2 i,t(1 ˆρ)t 1 i

that hold with high probability. The last two sums can easily be estimated under boundedness or light-tail assumptions on zi and Gi,t. Controlling the ﬁrst sum is more challenging because the error xi xt may in principle grow large. In order to control this term, we will use a remarkable generalization of Freedman s inequality, recently proved by Harvey et al. (2019) for the purpose of analyzing the stochastic gradient method on static nonsmooth problems (without a regularizer). The main idea is as follows. Fix a horizon t, assume E[zi | Fi,t] = 0 for all 0 i < t (recall that Fi,t := σ(Fi, xt)), and deﬁne the martingale diﬀerence sequence

di := zi, xi xt (1 ˆρ)t 1 i

Stochastic Optimization under Distributional Drift

adapted to the ﬁltration (Fi+1,t)t 1 i=0. Roughly speaking, under mild light-tail assumptions, the total conditional variance of the corresponding martingale Pt 1 i=0 di can be bounded above with high probability by an aﬃne transformation of itself, i.e., by an aﬃne combination of the sequence {di}t 1 i=0. In this way, the martingale is self-regulating. This is the content of the following proposition. The proof follows from Lemma 32 and algebraic manipulation and is placed in Appendix B (see Section B.3).

Proposition 36 (Self-regulation) The iterates {xt} produced by Algorithm 3 with rt r and constant step size η 1/2L satisfy the following bound for all λ (0, µη]:

i=0 xi xt 2(1 λ)2(t 1 i)

i=j+1 (1 λ)t 2 i

zj, xj xt (1 λ)t 1 j

λ(1 λ)t 1 x0 xt 2 + 2η2

j=0 zj 2(1 λ)t 2 j

j=0 G2 j,t(1 λ)t 2 j.

In order to bound the self-regulating martingale Pt 1 i=0 di, we will use the generalized Freedman inequality developed by Harvey et al. (2019), or rather a direct consequence thereof (see Harvey et al., 2019, Lemma C.3).

Theorem 37 (Consequence of generalized Freedman) Let (Di)n i=0 and (Vi)n i=0 be scalar stochastic processes on a probability space with ﬁltration (Hi)n+1 i=0 satisfying

E[exp(λDi) | Hi] exp(λ2Vi/2) for all λ 0.

Suppose that Di is Hi+1-measurable with E|Di| < and E[Di | Hi] = 0, and that Vi is nonnegative and Hi-measurable. Suppose moreover that there are constants α0, . . . , αn 0, δ [0, 1], and β(δ) 0 satisfying

i=0 αi Di + β(δ)

Set α := max{α0, . . . , αn}. Then for all τ > 0, the following bound holds:

δ + exp τ 4α + 8β(δ)/τ

Combining Proposition 36 and Theorem 37 yields the following tail bound for Pt 1 i=0 di.

Proposition 38 (Noise martingale tail bound) Let {xt} be the iterates produced by Algorithm 3 with constant step size η 1/2L, set ˆρ := ˆµη/(2 µη), and suppose that Assumption 11 holds. Then there is an absolute constant c > 0 such that for any speciﬁed t N, δ (0, 1), and τ > 0, the following bound holds:

i=0 zi, xi xt (1 ˆρ)t 1 i τ

δ + exp τ 4α + 8βt log(3e/δ)/τ

Cutler, Drusvyatskiy, Harchaoui

where α := 3η(cσ)2/ˆρ and

βt := (1 ˆρ)t 1 x0 x0 2 + 2t2 2(cσ)2

ˆρ + 2η2(cσ)4

ˆρ2 + 3ˆµ 2η(cσ)2

Proof By Assumption 11, there exists an absolute constant c 1 such that zi 2 is sub-exponential conditioned on Fi,t with parameter cσ2 and zi is mean-zero sub-Gaussian conditioned on Fi,t with parameter cσ for all indices 0 i < t. Then for each 0 i < t, the Fi+1,t-measurable random variable zi, xi xt is mean-zero sub-Gaussian conditioned on Fi,t with parameter cσ xi xt , so

E exp λ zi, xi xt (1 ˆρ)t 1 i | Fi,t exp λ2(cσ)2 xi xt 2(1 ˆρ)2(t 1 i)/2 λ R;

note also that E| zi, xi xt | < by H older s inequality, Assumption 8, and Corollary 27. Now ﬁx t 1 and observe that Proposition 36 yields the total conditional variance bound

i=0 (cσ)2 xi xt 2(1 ˆρ)2(t 1 i)

j=0 αj zj, xj xt (1 ˆρ)t 1 j + Rt,

where 0 αj α for all 0 j t 2 and

Rt := (cσ)2

ˆρ (1 ˆρ)t 1 x0 xt 2 + 2η2(cσ)2

j=0 zj 2(1 ˆρ)t 2 j + η(cσ)2

j=0 G2 j,t(1 ˆρ)t 2 j.

P Rt βt log 3e

1 δ δ (0, 1). (20)

To verify (20), observe ﬁrst that for all n 0, the sum Pn i=0 zi 2(1 ˆρ)n i is sub-exponential with parameter Pn i=0 cσ2(1 ˆρ)n i (cσ)2/ˆρ, so Markov s inequality implies

i=0 zi 2(1 ˆρ)n i (cσ)2

1 δ δ (0, 1). (21)

Further, for all 0 n < t, it follows from Assumption 11 and Lemma 1 that x0 xt 2 is subexponential with parameter 2( x0 x0 2 + 2t2) and Pn i=0 G2 i,t(1 ˆρ)n i is sub-exponential with parameter

i=0 (ˆµ )2(t i)2(1 ˆρ)n i = (ˆµ )2(1 ˆρ)n+1 t n X

i=0 (t i)2(1 ˆρ)t i 1 2(ˆµ )2

ˆρ3(1 ˆρ)t 1 n ,

so Markov s inequality implies

P n x0 xt 2 2( x0 x0 2 + 2t2) log e

o 1 δ δ (0, 1) (22)

i=0 G2 i,t(1 ˆρ)n i 2(ˆµ )2

ˆρ3(1 ˆρ)t 1 n log e

1 δ δ (0, 1). (23)

Stochastic Optimization under Distributional Drift

Thus, (21) (23) and a union bound yield (20). Consequently, Theorem 37 implies that the following bound holds for all δ (0, 1) and τ > 0:

i=0 zi, xi xt (1 ˆρ)t 1 i τ

δ + exp τ 4α + 8βt log(3e/δ)/τ

as claimed.

We may now deduce the following precise version of Theorem 21 using the tail bound furnished by Proposition 38.

Theorem 39 (Function gap with high probability) Let {ˆxt} be the iterates produced by Algorithm 4 with constant step size η 1/2L, set ˆρ := ˆµη/(2 µη), and suppose that Assumption 11 holds. Then there is an absolute constant c > 0 such that for any speciﬁed t N and δ (0, 1), the following estimate holds with probability at least 1 δ:

ψt(ˆxt) ψ t (1 ˆρ)t ψt(x0) ψ t + ˆµ

4 x0 xt 2 + η(cσ)2 + 8 2

ˆµη2 + 9ˆρ p

βt := (1 ˆρ)t 1 x0 x0 2 + 2t2 2(cσ)2

ˆρ + 2η2(cσ)4

ˆρ2 + 3ˆµ 2η(cσ)2

Proof A quick computation shows that given any δ (0, 1), we may take

in Proposition 38 to obtain

i=0 zi, xi xt (1 ˆρ)t 1 i < 5 p

We may now combine (21), (23), and (24) together with Proposition 33 and a union bound to conclude that for all δ (0, 1), the estimate

ψt(ˆxt) ψ t (1 ˆρ)t ψt(x0) ψ t + ˆµ

4 x0 xt 2 + η(cσ)2 + 2ˆµ 2

ˆρ2 + 5ˆρ p

holds with probability at least 1 4δ; noting ˆρ ˆµη/2 completes the proof.

Remark 40 To see that Theorem 39 entails Theorem 21, observe ﬁrst that in the setting of Theorem 39, upon setting C := max{c, 1} and selecting any t N, we have

(1 ˆρ)t x0 x0 2 + 2t2 ˆµησ2 + ησ2 +

while the AM-GM inequality implies

(1 ˆρ)t x0 x0 2 + 2t2 ˆµησ2 (1 ˆρ)t ˆµ x0 x0 2 + ˆµ 2t2 + ησ2,

Cutler, Drusvyatskiy, Harchaoui

inequality (19) implies

(1 ˆρ)t ˆµ x0 x0 2 + ˆµ 2t2 2(1 ˆρ)t ψ0(x0) ψ 0 + 16 2

and Young s inequality implies 2 σ ˆµη ησ2 + 2

8βt (1 ˆρ)t ψ0(x0) ψ 0 + ησ2 + 2

Further, inequalities (17) and (19) together with Assumption 11 imply that the estimate

(1 ˆρ)t ψt(x0) ψ t + ˆµ

4 x0 xt 2 3(1 ˆρ)t ψ0(x0) ψ 0 + 80 2

holds with probability at least 1 δ for all δ (0, 1). On the other hand, Theorem 39 shows that the estimate

ψt(ˆxt) ψ t (1 ˆρ)t ψt(x0) ψ t + ˆµ

4 x0 xt 2 + η(cσ)2 + 8 2

ˆµη2 + 5ˆρ p

holds with probability at least 1 δ for all δ (0, 1). Thus, a union bound reveals that the estimate

ψt(ˆxt) ψ t (1 ˆρ)t ψ0(x0) ψ 0 + ησ2 + 2

holds with probability at least 1 δ for all δ (0, 1).

We may now apply Theorem 21 to obtain a formal version of Theorem 22; the proof is analogous to that of Theorem 31 and appears in Appendix B (see Section B.4).

Theorem 41 (Time to track with high probability) Suppose that Assumption 11 holds and that we are in the low drift-to-noise regime /σ < p

ˆµ/16L3. Set ˆη = (2 2/ˆµσ2)1/3 and b G = ˆµ( σ2/ˆµ2)2/3. Suppose moreover that we have available a positive upper bound on the initial gap D ψ0(x0) ψ 0. Fix δ (0, 1) and consider running Algorithm 4 in k = 0, . . . , K 1 epochs, namely, set X0 = x0 and iterate the process

Xk+1 = D-PSG(Xk, µ, γ, ηk, Tk) for k = 0, . . . , K 1,

where the number of epochs is

and ηk = ηk 1 + ˆη

& 2 log 4c log(e/δ) +

for all k 1, where c > 0 is the absolute constant furnished by the bound (13). Then the time horizon T = T0 + + TK 1 satisﬁes

1 log log e

1 log log e

Stochastic Optimization under Distributional Drift

and the corresponding tracking error satisﬁes

ψT (XK) ψ T b G log e

with probability at least 1 Kδ.

7. Numerical Illustrations

We investigate the empirical behavior of our ﬁnite-time bounds on numerical examples with synthetic data. We consider examples of a) least-squares recovery; b) sparse least-squares recovery; c) ℓ2 2-regularized logistic regression; and investigate the behavior of xt x t 2 and ϕt(ˆxt) ϕ t in each case. The main ﬁndings are that our bounds exhibit: 1) the correct dependence on η, σ, and ; 2) excellent coverage in Monte-Carlo simulations. Code is available online at https://github.com/joshuacutler/Time Drift Experiments.

Least-squares recovery. Fix x0, x 0 Rd and consider a Gaussian random walk {x t } given by x t+1 = x t + vt, where vt is drawn uniformly from the sphere of radius in Rd. Given a ﬁxed rank-d matrix A Rn d with minimum singular value µ and maximum singular value

L, we aim to recover {x t } via the online least-squares problem

min x Rd E y Pt

1 2 Ax y 2,

where Pt = N(Ax t , Σt) with covariance matrix Σt satisfying tr Σt σ2/L. This amounts to the problem (5) with ft(x) = Ey Pt 1 2 Ax y 2 and rt = 0, and the minimizer and gradient drift satisfy

ft(x) ft+1(x) = A A(x t x t+1) L x t x t+1 = L

for all x Rd. We implement Algorithms 1 and 2 using the sample gradient gt = AT (Axt yt) at step t with yt Pt; the gradient noise zt = A (yt Ax t ) N(0, A Σt A) satisﬁes E zt 2 L tr Σt σ2. In our simulations, we set d = 50, n = 100, and Σt = (σ2/n L)In for all t, where In denotes the n n identity matrix. We initialize x0 and x 0 using standard Gaussian entries and generate A via singular value decomposition with Haar-distributed orthogonal matrices. In Figures 1 and 2, we use default parameter values µ = L = 1, σ = 10, = 1, and the corresponding asymptotically optimal step size η = η . Since ft is µ-strongly convex and L-smooth, this puts us in the low drift/noise regime in Figure 1: /σ < p

µ/16L3 = 1/4. To estimate the expected values and conﬁdence intervals of xt x t 2 and ϕt(ˆxt) ϕ t , we run 100 trials with horizon T = 100.

Cutler, Drusvyatskiy, Harchaoui

0 20 40 60 80 100 t

Guaranteed bound Asymptotic bound Empirical average 95% CI

0 20 40 60 80 100 t

E[ϕt(ˆxt) ϕ t]

Figure 1: Semilog plots of guaranteed bounds and empirical tracking errors with respect to iteration t for least-squares recovery. Shaded regions indicate the 95% conﬁdence intervals for xt x t 2 and ϕt(ˆxt) ϕ t ; empirical averages and conﬁdence intervals are computed over 100 trials. Default parameter values: µ = L = 1, σ = 10, = 1, and η = η .

0.0 0.1 0.2 0.3 0.4 0.5

E x T x T 2

0 5 10 15 20

0 5 10 15 20

0.0 0.1 0.2 0.3 0.4 0.5 η

E[ϕT(ˆx T) ϕ T]

0 5 10 15 20 σ

Guaranteed bound Empirical average 95% CI

0 5 10 15 20

Figure 2: Semilog plots of guaranteed bounds and empirical tracking errors at horizon T = 100 with respect to η, σ, and for least-squares recovery. Shaded regions indicate the 95% conﬁdence intervals for x T x T 2 and ϕT (ˆx T ) ϕ T ; empirical averages and conﬁdence intervals are computed over 100 trials. Default parameter values: µ = L = 1, σ = 10, = 1, and η = η .

Stochastic Optimization under Distributional Drift

Sparse least-squares recovery. Next, we consider least-squares recovery constrained to the closed ℓ1-ball in Rd, which we denote by B1. We aim to recover a sequence of sparse vectors in B1 generated as follows. Set s = log d , draw a vector u uniformly from the ℓ1-ball in Rs, ﬁx x 0 = (u, 0) Rd, and select (0,

2]. At step t, with probability p = (4 2 2)/(4 2), we set x t+1 = x t + vt, where vt is selected to have the same support as x t and satisfy vt = /

2 and x t +vt B1; otherwise, with probability 1 p, we obtain x t+1 from x t by swapping precisely one nonzero coordinate with a zero coordinate. The resulting sequence {x t } in B1 satisﬁes E x t x t+1 2 2. Given a ﬁxed rank-d matrix A Rn d with minimum singular value µ and maximum singular value

L, we aim to recover {x t } via the online constrained least-squares problem

min x B1 E y Pt

1 2 Ax y 2,

where Pt = N(Ax t , Σt) with covariance matrix Σt satisfying tr Σt σ2/L. This amounts to the problem (5) with ft(x) = Ey Pt 1 2 Ax y 2 and rt = δB1 (the convex indicator of B1), and the minimizer and gradient drift satisfy

E supx ft(x) ft+1(x) 2 L2 E x t x t+1 2 (L )2.

Fixing x0 drawn uniformly from B1, we implement Algorithms 1 and 2 initialized at x0 using the sample gradient gt = AT (Axt yt) at step t with yt Pt; hence E ft(xt) gt 2 σ2. In our simulations, we set d = 50, n = 100, and Σt = (σ2/n L)In for all t. We generate A via singular value decomposition with Haar-distributed orthogonal matrices. In Figures 3 and 4, we use default parameter values µ = L = 1, σ = 1/2, = 1/20, and the corresponding asymptotically optimal step size η = η . Since ft is µ-strongly convex and L-smooth, this puts us in the low drift/noise regime in Figure 3: /σ < p

µ/16L3 = 1/4. To estimate the expected values and conﬁdence intervals of xt x t 2 and ϕt(ˆxt) ϕ t , we run 100 trials with horizon T = 100.

0 20 40 60 80 100 t

Guaranteed bound Asymptotic bound Empirical average 95% CI

0 20 40 60 80 100 t

E[ϕt(ˆxt) ϕ t]

Figure 3: Semilog plots of guaranteed bounds and empirical tracking errors with respect to iteration t for sparse least-squares recovery. Shaded regions indicate the 95% conﬁdence intervals for xt x t 2 and ϕt(ˆxt) ϕ t ; empirical averages and conﬁdence intervals are computed over 100 trials. Default parameter values: µ = L = 1, σ = 1/2, = 1/20, and η = η .

Cutler, Drusvyatskiy, Harchaoui

0.0 0.1 0.2 0.3 0.4 0.5

E x T x T 2

0 1 2 3 4 5 10 3

0.0 0.1 0.2 0.3 0.4 0.5

0.0 0.1 0.2 0.3 0.4 0.5 η

E[ϕT(ˆx T) ϕ T]

0 1 2 3 4 5 σ

Guaranteed bound Empirical average 95% CI

0.0 0.1 0.2 0.3 0.4 0.5

Figure 4: Semilog plots of guaranteed bounds and empirical tracking errors at horizon T = 100 with respect to η, σ, and for sparse least-squares recovery. Shaded regions indicate the 95% conﬁdence intervals for x T x T 2 and ϕT (ˆx T ) ϕ T ; empirical averages and conﬁdence intervals are computed over 100 trials. Default parameter values: µ = L = 1, σ = 1/2, = 1/20, and η = η .

ℓ2 2-regularized logistic regression. Finally, we consider the time-varying ℓ2 2-regularized logistic regression problem

min x Rd 1 n

i=1 log 1 + exp ai, x Ax, bt

where the matrix A Rn d has ﬁxed rows a1, . . . , an Rd, {bt} is a random sequence of label vectors in {0, 1}n such that bt and bt+1 diﬀer in precisely one coordinate for each t, and µ > 0. This amounts to the problem (5) with ft(x) = 1

n(Pn i=1 log(1 + exp ai, x ) Ax, bt )+ µ

and rt = 0; setting L = 1 4n A 2 op + µ, it follows that ft is µ-strongly convex and Lsmooth. Letting {x t } denote the corresponding sequence of minimizers and setting = 1 µn maxi=1,...,n ai , it follows that the minimizer and gradient drift satisfy

µ x t x t+1 sup x ft(x) ft+1(x) µ .

We implement Algorithms 1 and 2 using the random summand sample gradient

gt = exp ak, xt 1 + exp ak, xt bk t

Stochastic Optimization under Distributional Drift

at step t, where k Unif{1, . . . , n} and bk t denotes the kth coordinate of bt; the gradient noise satisﬁes E ft(xt) gt 2 σ2, where

i,j=1 ai aj

2 max i=1,...,n ai 2 .

In our simulations, we set d = 20 and n = 200, ﬁx standard Gaussian vectors x0 Rd and a1, . . . , an Rd, ﬁx b0 drawn uniformly from {0, 1}n, and generate bt+1 from bt by ﬂipping a single coordinate selected uniformly at random. In Figure 5, we use default parameter values µ = 1 and the corresponding asymptotically optimal step size η = η . In Figure 6, we illustrate the dependence of tracking error on the regularization parameter µ; here, the asymptotically optimal step size η is used (which itself depends on µ). In Figure 7, we use the default parameter value µ = 1. To estimate the expected values and conﬁdence intervals of xt x t 2 and ϕt(ˆxt) ϕ t , we run 100 trials with horizon T = 600. The results conﬁrm our bounds and show that they capture the correct dependence on µ and η. In particular, Figure 7 illustrates that η is close to empirically optimal.

0 100 200 300 400 500 600 t

Guaranteed bound Asymptotic bound Empirical average 95% CI

0 100 200 300 400 500 600 t

E[ϕt(ˆxt) ϕ t]

Figure 5: Semilog plots of guaranteed bounds and empirical tracking errors with respect to iteration t for ℓ2 2-regularized logistic regression. Shaded regions indicate the 95% conﬁdence intervals for xt x t 2 and ϕt(ˆxt) ϕ t ; empirical averages and conﬁdence intervals are computed over 100 trials. Default parameter values: µ = 1 and η = η .

Cutler, Drusvyatskiy, Harchaoui

0 5 10 15 20 µ

E x T x T 2

Guaranteed bound Empirical average 95% CI

0 5 10 15 20 µ

E[ϕT(ˆx T) ϕ T]

Figure 6: Semilog plots of guaranteed bounds and empirical tracking errors at horizon T = 600 with respect to the strong convexity parameter µ for ℓ2 2-regularized logistic regression. Shaded regions indicate the 95% conﬁdence intervals for x T x T 2 and ϕT (ˆx T ) ϕ T ; empirical averages and conﬁdence intervals are computed over 100 trials, using the asymptotically optimal step size η (which itself depends on µ).

0.0 0.1 0.2 0.3 η

E x T x T 2

Guaranteed bound Asymptotically optimal step size Empirical average 95% CI

0.0 0.1 0.2 0.3 η

E[ϕT(ˆx T) ϕ T]

Figure 7: Semilog plots of guaranteed bounds and empirical tracking errors at horizon T = 600 with respect to the step size η for ℓ2 2-regularized logistic regression. Shaded regions indicate the 95% conﬁdence intervals for x T x T 2 and ϕT (ˆx T ) ϕ T ; empirical averages and conﬁdence intervals are computed over 100 trials. Default parameter value: µ = 1. Observe that η is close to empirically optimal.

Stochastic Optimization under Distributional Drift

Acknowledgments

This work was supported by NSF DMS-2023166, DMS-2134012, DMS-1651851, DMS-2133244, CCF-2019844, and CIFAR LMB. Part of this work was done while Z. Harchaoui was visiting the Simons Institute for the Theory of Computing.

Appendix A. Averaging Lemma

We will use a variation of the averaging strategy used by Ghadimi and Lan (2012); our approach here follows Drusvyatskiy and Xiao (2022, Section A) and Kulunchakov and Mairal (2020, Sections A.2 and A.3). To begin, consider a convex function h: Rd R { } and let {xt}t 0 be a sequence of vectors in Rd. Suppose that there are constants c1, c2 R, a sequence of nonnegative weights {ρt}t 1, and scalar sequences {Vt}t 0 and {ωt}t 1 satisfying the recursion ρth(xt) (1 c1ρt)Vt 1 (1 + c2ρt)Vt + ωt (25) for all t 1. The goal is to bound the function value h(ˆxt) evaluated along an average iterate ˆxt. Suppose that the relations c1 + c2 > 0, 1 c1ρt > 0, and 1 + c2ρt > 0 hold for all t 1. Deﬁne the augmented weights and products

ˆρt = (c1 + c2)ρt

1 + c2ρt and ˆΓt =

i=1 (1 ˆρi)

for each t 1, and set ˆΓ0 = 1. A straightforward induction yields the relation

ˆρi ˆΓi = 1

Now set ˆx0 = x0 and recursively deﬁne the average iterates

ˆxt = (1 ˆρt)ˆxt 1 + ˆρtxt

for all t 1. Unrolling this recursion, we may equivalently write

The following lemma provides the key estimate we will need.

Lemma 42 (Averaging) The estimate holds for all t 0:

h(ˆxt) c1 + c2 + Vt ˆΓt

h(x0) c1 + c2 + V0 +

ωi ˆΓi(1 + c2ρi)

Proof Observe that (27) expresses ˆxt as a convex combination of x0, . . . , xt by virtue of (26). Therefore, by the convexity of h, we may apply Jensen s inequality to obtain

ˆρi ˆΓi h(xi)

Cutler, Drusvyatskiy, Harchaoui

On the other hand, for each i 1, we may divide the recursion (25) by ˆΓi(1+c2ρi) to obtain ˆρi (c1 + c2)ˆΓi h(xi) Vi 1

ˆΓi + ωi ˆΓi(1 + c2ρi) ,

which telescopes to yield

ˆρi ˆΓi h(xi) V0 Vt

ωi ˆΓi(1 + c2ρi) .

Multiplying this inequality by ˆΓt and applying (28) yields

h(ˆxt) c1 + c2 ˆΓt

h(x0) c1 + c2 + V0 Vt

ωi ˆΓi(1 + c2ρi)

as claimed.

Appendix B. Additional Proofs

B.1 Proof of Theorem 31

For each index k, let tk := T0 + + Tk 1 (with t0 := 0), Xk be the minimizer of the corresponding function ψtk, and

where c 1 is an absolute constant satisfying the bound (11) in Theorem 17. Taking into account ηk η and our selection of c, Theorem 17 implies that for any speciﬁed index k and δ (0, 1), the following estimate holds with probability at least 1 δ:

Xk+1 Xk+1 2 1 µηk

2 Tk Xk Xk 2 + c

e µηk Tk/2 Xk Xk 2 + Ek log e

We will verify by induction that for each index k 1, the estimate Xk Xk 2 3 Ek 1 log(e/δ) holds with probability at least 1 δ for all δ (0, 1). To see the base case, observe that the estimate

X1 X1 2 e µη0T0/2 X0 X0 2 + E0 log e

holds with probability at least 1 δ for all δ (0, 1). Now assume that the claim holds for some index k 1, and let δ (0, 1); then Xk Xk 2 3 Ek 1 log(2e/δ) with probability

Stochastic Optimization under Distributional Drift

at least 1 δ/2. Thus, since we also have

Xk+1 Xk+1 2 e µηk Tk/2 Xk Xk 2 + Ek log 2e

12 Xk Xk 2 + Ek log 2e

Ek 6 Ek 1 Xk Xk 2 + Ek log 2e

with probability at least 1 δ/2, a union bound reveals

Xk+1 Xk+1 2 3

2 Ek log 2e

with probability at least 1 δ, thereby completing the induction. Hence, upon ﬁxing δ (0, 1), we have XK XK 2 3 EK 1 log(e/δ) with probability at least 1 δ. Next, observe

µ (ηK 1 η ) = 2σ2

with probability at least 1 δ. Finally, note

k=1 2k 2L 2K = 8L 2K 2 8 σ2 µ

This completes the proof.

B.2 Proof of Theorem 35

For each index k, let tk := T0 + + Tk 1 (with t0 := 0) and b Gk := ηkσ2 + 8 2/ˆµˆη2 . Then taking into account ηk ˆη , Corollary 34 and inequality (17) directly imply

E ψtk+1(Xk+1) ψ tk+1 1 ˆµηk

2 Tk E 3 ψtk(Xk) ψ tk + 5ˆµ 2T 2 k + ηkσ2 + 8 2

ˆµη2 k 3e ˆµηk Tk/2E ψtk(Xk) ψ tk + 5e ˆµηk Tk/2ˆµ 2T 2 k + b Gk.

We will verify by induction that the estimate E ψtk(Xk) ψ tk 11 b Gk 1 holds for all indices k 1. To see the base case, observe that inequality (19) facilitates the estimation

E ψt1(X1) ψ t1 3e ˆµη0T0/2 ψ0(x0) ψ 0 + 5e ˆµη0T0/2ˆµ 2T 2 0 + b G0 11 b G0.

Cutler, Drusvyatskiy, Harchaoui

Now assume that the claim holds for some index k 1. We then conclude

E ψtk+1(Xk+1) ψ tk+1 3e ˆµηk Tk/2E ψtk(Xk) ψ tk + 5e ˆµηk Tk/2ˆµ 2T 2 k + b Gk

4E ψtk(Xk) ψ tk + 13 2

ˆµη2 k + b Gk

b Gk 2 b Gk 1 E ψtk(Xk) ψ tk + 13 2

ˆµη2 k + b Gk 11 b Gk,

thereby completing the induction. Hence E ψT (XK) ψ T 11 b GK 1. Next, observe

2/3 = σ2(ηK 1 ˆη ) = σ2 η0 ˆη

E ψT (XK) ψ T 11 1

Finally, note

k=1 2k 2L 2K = 8L 2K 2 8 σ2ˆµ

1/3 = 8σ2 ˆµ 1 σ2

This completes the proof.

B.3 Proof of Proposition 36

Fix t 1. Given i 1 and α > 0, the µ-strong convexity of ψt and Lemma 32 imply

µη xi xt 2 2η ψt(xi) ψ t (1 µη) xi 1 xt 2 1 (γ + α)η xi xt 2

+ 2η zi 1, xi 1 xt + 2η2 zi 1 2 + η

α G2 i 1,t,

hence 1 + ( µ α)η xi xt 2 (1 µη) xi 1 xt 2 + 2η zi 1, xi 1 xt

+ 2η2 zi 1 2 + η

α G2 i 1,t.

Taking α = µ, we obtain

xi xt 2 (1 µη) xi 1 xt 2 + 2η zi 1, xi 1 xt + 2η2 zi 1 2 + η

µ G2 i 1,t.

Thus, given any λ (0, µη] and proceeding by induction, we conclude that the following estimate holds for all i 1:

xi xt 2 (1 λ)i x0 xt 2 + 2η

j=0 zj, xj xt (1 λ)i 1 j

+ 2η2 i 1 X

j=0 zj 2(1 λ)i 1 j + η

j=0 G2 j,t(1 λ)i 1 j.

Stochastic Optimization under Distributional Drift

i=0 xi xt 2(1 λ)2(t 1 i)

x0 xt 2 t 1 X

i=0 (1 λ)2(t 1) i + 2η

j=0 zj, xj xt (1 λ)2t 3 j i

+ 2η2 t 1 X

j=0 zj 2(1 λ)2t 3 j i + η

j=0 G2 j,t(1 λ)2t 3 j i.

Next, we compute

i=0 (1 λ)2(t 1) i = (1 λ)t 1 t 1 X

i=0 (1 λ)t 1 i < 1

and observe that for any scalar sequence {aj}t 2 j=0, we have

j=0 aj(1 λ)2t 3 j i =

i=j+1 (1 λ)t 2 i

aj(1 λ)t 1 j.

Further, if aj 0 for all j = 0, . . . , t 2, then we have

j=0 aj(1 λ)2t 3 j i =

i=j+1 (1 λ)t 1 i

aj(1 λ)t 2 j

j=0 aj(1 λ)t 2 j.

Hence the following estimation holds:

i=0 xi xt 2(1 λ)2(t 1 i)

i=j+1 (1 λ)t 2 i

zj, xj xt (1 λ)t 1 j

λ(1 λ)t 1 x0 xt 2 + 2η2

j=0 zj 2(1 λ)t 2 j

j=0 G2 j,t(1 λ)t 2 j.

This completes the proof.

B.4 Proof of Theorem 41

For each index k, let tk := T0 + + Tk 1 (with t0 := 0) and b Gk := ηkσ2 + 2/ˆµˆη2 . Then taking into account ηk ˆη and our selection of the absolute constant c > 0 via (13), it

Cutler, Drusvyatskiy, Harchaoui

follows that for any speciﬁed index k, the estimate

ψtk+1(Xk+1) ψ tk+1 c 1 ˆµηk

2 Tk ψtk(Xk) ψ tk + ηkσ2 + 2

c e ˆµηk Tk/2 ψtk(Xk) ψ tk + b Gk log e

holds with probability at least 1 δ. We will verify by induction that for each index k 1, the estimate

ψtk(Xk) ψ tk 3c b Gk 1 log e

holds with probability at least 1 kδ. To see the base case, observe that the estimate

ψt1(X1) ψ t1 c e ˆµη0T0/2 ψ0(x0) ψ 0 + b G0 log e

3c b G0 log e

holds with probability at least 1 δ. Now assume that the claim holds for some index k 1. Then because we also have

ψtk+1(Xk+1) ψ tk+1 c e ˆµηk Tk/2 ψtk(Xk) ψ tk + b Gk log e

c 1 4c log(e/δ) ψtk(Xk) ψ tk + b Gk

b Gk 2c b Gk 1 log(e/δ)

ψtk(Xk) ψ tk + b Gk

with probability at least 1 δ, a union bound reveals that the estimate

ψtk+1(Xk+1) ψ tk+1 3c b Gk log e

holds with probability at least 1 (k + 1)δ, thereby completing the induction. In particular, ψT (XK) ψ T 3c b GK 1 log(e/δ) with probability at least 1 Kδ. Next, observe

2/3 = σ2(ηK 1 ˆη ) = σ2 η0 ˆη

ψT (XK) ψ T 3c 1 2 + 3q

with probability at least 1 Kδ. Finally, note

+ + 1 log log e

k=1 2k 2L 2K = 8L 2K 2 8 σ2ˆµ

1/3 = 8σ2 ˆµ 1 σ2

This completes the proof.

Stochastic Optimization under Distributional Drift

Alekh Agarwal, Olivier Chapelle, Miroslav Dud ık, and John Langford. A reliable eﬀective terascale linear learning system. Journal of Machine Learning Research, 15(1):1111 1133, 2014.

Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, volume 24, pages 451 459. Curran Associates, Inc., 2011.

Peter L. Bartlett, Varsha Dani, Thomas P. Hayes, Sham M. Kakade, Alexander Rakhlin, and Ambuj Tewari. High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory. Omnipress, 2008.

Yahav Bechavod, Katrina Ligett, Zhiwei Steven Wu, and Juba Ziani. Gaming helps! Learning from strategic interactions in natural dynamics. In The 24th International Conference on Artiﬁcial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1234 1242. PMLR, 2021.

Albert Benveniste, Michel M etivier, and Pierre Priouret. Adaptive Algorithms and Stochastic Approximations. Springer, 1990.

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227 1244, 2015.

L eon Bottou. Stochastic learning. In Advanced Lectures on Machine Learning: ML Summer Schools 2003, volume 3176 of Lecture Notes in Artiﬁcial Intelligence, pages 146 168. Springer, 2003.

L eon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pages 421 436. Springer, 2012.

L eon Bottou and Olivier Bousquet. The tradeoﬀs of large scale learning. In Advances in Neural Information Processing Systems, volume 20, pages 161 168. Curran Associates, Inc., 2007.

Michael Br uckner, Christian Kanzow, and Tobias Scheﬀer. Static prediction games for adversarial learning problems. Journal of Machine Learning Research, 13(1):2617 2654, 2012.

Joshua Cutler, Dmitriy Drusvyatskiy, and Zaid Harchaoui. Stochastic optimization under time drift: iterate averaging, step decay, and high-probability guarantees. In Advances in Neural Information Processing Systems, volume 34. Curran Associates, Inc., 2021.

Nilesh N. Dalvi, Pedro M. Domingos, Mausam, Sumit K. Sanghai, and Deepak Verma. Adversarial classiﬁcation. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99 108. ACM, 2004.

Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning, pages 1405 1411. JMLR, 2015.

Cutler, Drusvyatskiy, Harchaoui

Dmitriy Drusvyatskiy and Lin Xiao. Stochastic optimization with decision-dependent distributions. Mathematics of Operations Research, 2022.

V aclav Dupaˇc. A dynamic stochastic approximation method. The Annals of Mathematical Statistics, 36(6):1695 1702, 1965.

S. Fujita and T. Fukao. Convergence conditions of dynamic stochastic approximation method for nonlinear stochastic discrete-time dynamic systems. IEEE Transactions on Automatic Control, 17(5):715 717, 1972.

A. A. Gaivoronskii. Nonstationary stochastic programming problems. Cybernetics, 14(4): 575 579, 1978.

Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469 1492, 2012.

Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061 2089, 2013.

Lei Guo and Lennart Ljung. Exponential stability of general tracking algorithms. IEEE Transactions on Automatic Control, 40(8):1376 1387, 1995.

Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classiﬁcation. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, pages 111 122. ACM, 2016.

Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Proceedings of the 32nd Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 1579 1613. PMLR, 2019.

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1): 2489 2512, 2014.

Elad Hazan and C. Seshadhri. Eﬃcient learning algorithms for changing environments. In Proceedings of the 26th International Conference on Machine Learning, pages 393 400. ACM, 2009.

Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artiﬁcial Intelligence and Statistics, volume 38, pages 398 406. PMLR, 2015.

Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. ar Xiv:1902.03736, 2019.

Stochastic Optimization under Distributional Drift

L. V. Kantorovich and G. Sh. Rubinshte ın. On a space of completely additive functions. Vestnik Leningradskogo Universiteta. Matematika, Mekhanika, Astronomiya, 13(2):52 59, 1958.

Andrei Kulunchakov and Julien Mairal. Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise. Journal of Machine Learning Research, 21(155):1 52, 2020.

Harold J. Kushner and Gang George Yin. Stochastic Approximation Algorithms and Applications, volume 35 of Applications of Mathematics. Springer, 1997.

Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365 397, 2012.

Hunter Lang, Lin Xiao, and Pengchuan Zhang. Using statistics to automate stochastic optimization. In Advances in Neural Information Processing Systems, volume 32, pages 9536 9546. Curran Associates, Inc., 2019.

Liam Madden, Stephen Becker, and Emiliano Dall Anese. Bounds for the tracking error of ﬁrst-order online optimization methods. Journal of Optimization Theory and Applications, 189(2):437 457, 2021.

Celestine Mendler-D unner, Juan Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. In Advances in Neural Information Processing Systems, volume 33, pages 4929 4939. Curran Associates, Inc., 2020.

Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In 55th IEEE Conference on Decision and Control, pages 7195 7201. IEEE, 2016.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574 1609, 2009.

Juan Perdomo, Tijana Zrnic, Celestine Mendler-D unner, and Moritz Hardt. Performative prediction. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 7599 7609. PMLR, 2020.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012.

David Ruppert. A new dynamic stochastic approximation procedure. The Annals of Statistics, 7(6):1179 1195, 1979.

Ali H. Sayed. Fundamentals of Adaptive Filtering. John Wiley & Sons, 2003.

Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright. Optimization for Machine Learning. The MIT Press, 2011.

Cutler, Drusvyatskiy, Harchaoui

Nati Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In Advances in Neural Information Processing Systems, volume 24, pages 2645 2653. Curran Associates, Inc., 2011.

Ya. Z. Tsypkin and Z. J. Nikolic. Adaptation and Learning in Automatic Systems. Elsevier Science, 1971.

Ya. Z. Tsypkin and B. T. Polyak. Optimal recurrent algorithms for identiﬁcation of nonstationary plants. Computers and Electrical Engineering, 18(5):365 371, 1992.

Katsuji Uosaki. Some generalizations of dynamic stochastic approximation processes. The Annals of Statistics, 2(5):1042 1048, 1974.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

Craig Wilson, Venugopal V. Veeravalli, and Angelia Nedi c. Adaptive sequential stochastic optimization. IEEE Transactions on Automatic Control, 64(2):496 509, 2019.

Killian Wood, Gianluca Bianchin, and Emiliano Dall Anese. Online projected gradient descent for stochastic optimization with decision-dependent distributions. IEEE Control Systems Letters, 6:1646 1651, 2022.

Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems, volume 31, pages 1330 1340. Curran Associates, Inc., 2018.

Peng Zhao and Lijun Zhang. Improved analysis for dynamic regret of strongly convex and smooth functions. In Proceedings of the 3rd Annual Conference on Learning for Dynamics and Control, volume 144 of Proceedings of Machine Learning Research, pages 48 59. PMLR, 2021.

Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Dynamic regret of convex and smooth functions. In Advances in Neural Information Processing Systems, volume 33, pages 12510 12520. Curran Associates, Inc., 2020.

Martin Zinkevich. Online convex programming and generalized inﬁnitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928 936. AAAI Press, 2003.