Journal of Machine Learning Research 25 (2024) 1-69 Submitted 12/23; Published 12/24

Stochastic-Constrained Stochastic Optimization with Markovian Data

Yeongjong Kim kim.yj@postech.ac.kr
Department of Mathematics, POSTECH, Pohang 37673, South Korea

Dabeen Lee dabeenl@kaist.ac.kr
Department of Industrial and Systems Engineering, KAIST, Daejeon 34141, South Korea

Editor: Prateek Jain

This paper considers stochastic-constrained stochastic optimization where the stochastic constraint is to satisfy that the expectation of a random function is below a certain threshold. In particular, we study the setting where data samples are drawn from a Markov chain and thus are not independent and identically distributed. We generalize the drift-plus-penalty framework, a primal-dual stochastic gradient method developed for the i.i.d. case, to the Markov chain sampling setting. We propose three variants of drift-plus-penalty; two are for the case when the mixing time of the underlying Markov chain is known, while the third is for the case of unknown mixing time. In fact, our algorithms apply to a more general setting of constrained online convex optimization where the sequence of constraint functions follows a Markov chain. The algorithms are adaptive in that the first two work without knowledge of the time horizon while the third uses AdaGrad-style algorithm parameters, which is of independent interest. We demonstrate the effectiveness of our proposed methods through numerical experiments on classification with fairness constraints.

Keywords: stochastic-constrained stochastic optimization, constrained online convex optimization, Markov chain stochastic gradient descent, drift-plus-penalty, classification with fairness constraints

1. Introduction

In this paper, we consider the stochastic-constrained stochastic optimization (SCSO) problem

$$\min_{x \in X} \ \mathbb{E}_{\xi \sim \mu}[f(x, \xi)] \quad \text{s.t.}$$
$$\mathbb{E}_{\xi \sim \mu}[g(x, \xi)] \le 0 \qquad \text{(SCSO)}$$

where the expectation is taken with respect to the random parameter $\xi$ over a stationary distribution $\mu$, $f$ and $g$ are convex loss and constraint functions, and $X$ is a compact domain. This formulation of SCSO has applications in stochastic programming with a risk constraint (Rockafellar and Uryasev, 2000), chance-constrained programming (Nemirovski and Shapiro, 2007), portfolio optimization (Dentcheva and Ruszczyński, 2003), sparse matrix completion (Akhtar et al., 2021), semi-supervised learning (Chapelle et al., 2010), classification with fairness constraints (Zafar et al., 2019; Celis et al., 2019; Donini et al., 2018), Neyman-Pearson classification (Scott and Nowak, 2005; Rigollet and Tong, 2011), ranking with fairness constraints (Celis et al., 2018), recommendation systems with fairness constraints (Yao and Huang, 2017), scheduling in distributed data centers (Yu et al., 2017), safe reinforcement learning (García et al., 2015), and stochastic simple bilevel optimization (Jalilzadeh et al., 2023; Cao et al., 2023).

© 2024 Yeongjong Kim and Dabeen Lee. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1630.html.

Stochastic approximation (SA) algorithms are prevalent solution methods for SCSO. Basically, we run gradient-based algorithms with an oracle providing i.i.d. samples of $f(x, \xi)$, $g(x, \xi)$, $\nabla f(x, \xi)$, and $\nabla g(x, \xi)$ for a given solution $x$. Lan and Zhou (2020) proposed the cooperative stochastic approximation scheme for SCSO, which is a stochastic extension of Polyak's subgradient method developed for constrained optimization (Polyak, 1967). Xiao (2019) developed the penalized stochastic gradient method that takes the square of the constraint function as a penalty term. Lin et al. (2020) developed a level set-based algorithm for SCSO. Akhtar et al.
(2021) studied an augmented Lagrangian-based stochastic primal-dual algorithm, which was inspired by the primal-dual framework for online convex optimization with long-term constraints due to Mahdavi et al. (2012). Furthermore, motivated by recent success in adaptive gradient algorithms, Yan and Xu (2022) considered an adaptive primal-dual stochastic gradient method for SCSO. Zhang et al. (2023) proposed a stochastic variant of the proximal method of multipliers. Zhang et al. (2022) studied another stochastic proximal method of multipliers based on linearization.

SA algorithms for other related problem settings are as follows. Yu et al. (2017) studied online convex optimization with stochastic constraints where the constraint functions are stochastic i.i.d. while the objective functions are arbitrary, for which they proposed the drift-plus-penalty (DPP) algorithm. DPP applies to SCSO given that i.i.d. samples of $g(x, \xi)$, $\nabla g(x, \xi)$, and $\nabla f(x, \xi)$ are available. Wei et al. (2020) developed an extension of DPP, and Lee et al. (2023) provided a projection-free algorithm for the setting. Moreover, SCSO can be formulated as a stochastic saddle-point problem by taking the Lagrangian if certain constraint qualifications hold. Nemirovski et al. (2009); Juditsky et al. (2011) developed stochastic mirror-prox algorithms for general stochastic saddle-point problems. Zhao (2022) devised an accelerated stochastic framework for convex-concave saddle-point problems, and Yazdandoost Hamedani et al. (2023) devised a randomized adaptive primal-dual method.

The aforementioned solution methods for SCSO require i.i.d. data samples from the stationary distribution $\mu$ when running SA algorithms. However, there are several application scenarios in which sampling from the stationary distribution $\mu$ independently and identically is difficult.
For example, federated learning serves as an alternative to traditional machine learning systems that require data centralization, with the purpose of improving data privacy (Zhao et al., 2018). The basic framework is that data is stored on individual local devices while the training is governed by a central server. Another related setting is distributed optimization over sensor networks (Rabbat and Nowak, 2004; Lopes and Sayed, 2007; Johansson et al., 2007) and multi-agent systems (Johansson et al., 2008). For these applications, an enormous amount of data is distributed over distinct nodes of a network, for which data exchange and message passing occur only between immediate neighboring nodes.

One resolution approach for such application scenarios is Markov chain stochastic gradient descent (SGD) (Johansson et al., 2010; Ram et al., 2009a). Basically, Markov chain SGD takes a random walk over a network of local data centers and updates solutions based on data acquired from the data centers visited. Here, as the data is generated by a Markov random walk, there is inherent dependence and bias between data samples. More generally, the framework can be viewed as a variant of the Markov chain Monte Carlo method (Andrieu et al., 2003). That said, Markov chain SGD can be applied to other application domains where data is collected from a Markov process, such as decentralized learning (Yang et al., 2021), robust estimation (Poljak and Tsypkin, 1980; Sarkar and Rakhlin, 2019), and reinforcement learning (Nagaraj et al., 2020; Kowshik et al., 2021).

Following the Markov incremental subgradient methods due to Johansson et al. (2010); Ram et al. (2009a) designed for distributed optimization problems, Markov chain SGD with data sampled from a general Markov process has been studied. Duchi et al. (2012) developed Markov chain SGD with data sampled from an ergodic process. Later, Sun et al.
(2018) studied Markov chain SGD for convex and nonconvex problems when the underlying Markov chain is nonreversible. Doan et al. (2020) proposed an accelerated version of Markov chain SGD for both convex and nonconvex settings. Chen et al. (2018); Nagaraj et al. (2020); Kumar and Sarkar (2023) applied the downsampling technique to reduce dependence among dependent samples. One limitation of these works, however, is that they require knowledge of the mixing time. When the mixing time of the underlying Markov chain is not a priori known, one may estimate the mixing time directly from the data, with which we can run Markov chain SGD. However, the sample complexity of learning the mixing time parameter can be large and depend on the structure of the given Markov chain (Wolfer and Kontorovich, 2019). Instead, Dorfman and Levy (2022) developed Markov chain SGD based on the multi-level Monte Carlo estimation scheme due to Giles (2015); Blanchet and Glynn (2015). Roy et al. (2022) studied nonconvex problems where the transition of the underlying Markov chain is state-dependent, motivated by strategic classification and reinforcement learning. Recent works (Doan, 2023; Even, 2023) established some performance guarantees of Markov chain SGD under minimal assumptions. Applications of SCSO naturally motivate and necessitate algorithmic frameworks that can handle data sets with inherent dependence and bias. Portfolio optimization in finance takes time series data. Machine learning with fairness constraints processes heterogeneous data sets from disjoint sources. Scheduling in distributed data centers will benefit from distributed optimization technologies. Safe and constrained reinforcement learning can be formulated as SCSO for which the training data is obtained from trajectories of the underlying Markov decision process. Despite this immediate need, no prior work exists for solving SCSO with non-i.i.d. data samples. 
Motivated by this, the objective of this paper is to develop stochastic approximation algorithms that run with data sampled from a Markov chain.

1.1 Our Contributions

This paper initiates the study of stochastic approximation algorithms for stochastic-constrained stochastic optimization (SCSO) using non-i.i.d. data samples. Inspired by recent advances in Markov chain SGD, we develop primal-dual stochastic gradient methods using data sampled from a Markov chain, which can be viewed as primal-dual variants of Markov chain SGD. Specifically, we extend the drift-plus-penalty algorithm by Yu et al. (2017) that was originally developed for the i.i.d. setting.

Table 1: Bounds on regret, constraint violation, optimality gap, and feasibility gap under Algorithms 1, 2, and 3 ($\beta \in (0, 1/2]$ is a predetermined algorithm parameter that controls the balance between regret and constraint violation):
- Algorithm 1: regret $\tilde{O}(\tau_{\mathrm{mix}}^{1-\beta} T^{1-\beta})$, constraint violation $\tilde{O}(\tau_{\mathrm{mix}}^{\beta/2} T^{(\beta+1)/2})$, optimality gap $\tilde{O}(\tau_{\mathrm{mix}}^{1-\beta} T^{-\beta} + \tau_{\mathrm{mix}}^{\beta/2} T^{-(1-\beta)/2})$, feasibility gap $\tilde{O}(\tau_{\mathrm{mix}}^{\beta/2} T^{-(1-\beta)/2})$
- Algorithm 1 under Slater's condition: regret $\tilde{O}(\sqrt{\tau_{\mathrm{mix}} T})$, constraint violation $\tilde{O}(\sqrt{\tau_{\mathrm{mix}} T})$, optimality and feasibility gaps $\tilde{O}(\sqrt{\tau_{\mathrm{mix}}/T})$
- Algorithm 2: regret $\tilde{O}(\tau_{\mathrm{mix}}^{\beta} T^{1-\beta})$, constraint violation $\tilde{O}(\tau_{\mathrm{mix}}^{(1-\beta)/2} T^{(\beta+1)/2})$, optimality gap $\tilde{O}(\tau_{\mathrm{mix}}^{\beta} T^{-\beta} + \tau_{\mathrm{mix}}^{(1-\beta)/2} T^{-(1-\beta)/2})$, feasibility gap $\tilde{O}(\tau_{\mathrm{mix}}^{(1-\beta)/2} T^{-(1-\beta)/2})$
- Algorithm 2 under Slater's condition: regret $\tilde{O}(\sqrt{\tau_{\mathrm{mix}} T})$, constraint violation $\tilde{O}(\sqrt{\tau_{\mathrm{mix}} T})$, optimality and feasibility gaps $\tilde{O}(\sqrt{\tau_{\mathrm{mix}}/T})$
- Algorithm 3: regret and constraint violation adaptive, optimality gap $\tilde{O}(\tau_{\mathrm{mix}}^{1-\beta} T^{-\beta})$, feasibility gap $\tilde{O}(\tau_{\mathrm{mix}}^{(2\beta+1)/4} T^{-(1-\beta)/2})$

We adopt the approach of ergodic mirror descent by Duchi et al. (2012) for the case of known mixing time and the framework of Dorfman and Levy (2022) for the setting where the mixing time is unknown. Our key technical contribution is to develop three variants of the drift-plus-penalty algorithm that can take a sequence of dependent constraint functions. These algorithms solve constrained online convex optimization where the objective functions can be arbitrary and the constraint functions are generated from a Markov chain. Two of them are for the case of known mixing time, while the other is for the case when the mixing time is unknown.
We provide regret and constraint violation bounds for the algorithms, which delineate how their performance depends on the mixing time. Based on the regret and constraint violation guarantees, we analyze the optimality gap and feasibility gap for SCSO. The connection between the constrained online convex optimization formulation and SCSO for our Markovian setting is not as immediate as the i.i.d. setting because the expectation in (SCSO) is taken with respect to the stationary distribution of the underlying Markov chain. What follows is a more detailed description of our contributions. Our results are also summarized in Table 1.

In Section 3, we consider the case when the mixing time of the underlying Markov chain is known, for which we develop Algorithm 1, a variant of the drift-plus-penalty algorithm. We first prove that for online convex optimization with Markovian constraints, the regret of Algorithm 1 is $\tilde{O}(\tau_{\mathrm{mix}}^{1-\beta} T^{1-\beta})$ and the constraint violation is $\tilde{O}(\tau_{\mathrm{mix}}^{\beta/2} T^{(\beta+1)/2})$, where $\tau_{\mathrm{mix}}$ is the mixing time, $T$ is the length of the horizon, $\beta$ can be chosen to be any number in $(0, 1/2]$, and $\tilde{O}$ hides a $\mathrm{poly}\log T$ factor. If we further assume that (SCSO) satisfies Slater's constraint qualification, then the regret and constraint violation of Algorithm 1 can both be bounded by $\tilde{O}(\sqrt{\tau_{\mathrm{mix}} T})$. We remark that Algorithm 1 is adaptive in that its parameters are chosen without knowledge of $T$, unlike the vanilla drift-plus-penalty algorithm. These results generalize the work of Yu et al. (2017). Based on the regret and constraint violation analysis for Algorithm 1, we show that the averaging of the sequence of solutions generated by Algorithm 1 guarantees that the optimality gap is bounded by $\tilde{O}(\tau_{\mathrm{mix}}^{1-\beta} T^{-\beta} + \tau_{\mathrm{mix}}^{\beta/2} T^{-(1-\beta)/2})$ while the feasibility gap is bounded by $\tilde{O}(\tau_{\mathrm{mix}}^{\beta/2} T^{-(1-\beta)/2})$.
If we further assume that (SCSO) satisfies Slater's constraint qualification, then the optimality gap and feasibility gap of Algorithm 1 can both be bounded by $\tilde{O}(\sqrt{\tau_{\mathrm{mix}}/T})$. The optimality gap bound nearly matches the lower bound of $\Omega(\sqrt{\tau_{\mathrm{mix}}/T})$ due to Duchi et al. (2012).

In addition to Algorithm 1, we also develop Algorithm 2 for the setting where the mixing time is known. We refer to the algorithm as DPP-DD, which stands for drift-plus-penalty with data drop. Algorithm 2 applies a single DPP update every $\tau_{\mathrm{mix}}$ samples and discards the rest of the data, following the idea of SGD-DD due to Nagaraj et al. (2020). As we take one sample in every batch of $\tau_{\mathrm{mix}}$ samples, the data sequence gives rise to a Markov chain with mixing time 1. That said, Algorithm 2 is essentially applying Algorithm 1 with $\tau_{\mathrm{mix}} = 1$ for $T/\tau_{\mathrm{mix}}$ time steps. Then we analyze the performance of Algorithm 2 based on the analysis of Algorithm 1.

In Section 4, we propose Algorithm 3, another variant of the drift-plus-penalty algorithm, for the setting where the mixing time is unknown. The parameters of Algorithm 3 are set in an adaptive fashion, as in the AdaGrad method (Duchi et al., 2011; Levy, 2017). Then we apply Algorithm 3 to constrained online convex optimization where the objective and constraint functions are given by the multi-level Monte Carlo estimation scheme as in Dorfman and Levy (2022). We provide adaptive regret and constraint violation bounds for Algorithm 3. We note that Algorithm 3 is the first AdaGrad-style adaptive variant of the drift-plus-penalty algorithm. In fact, Algorithm 3 applies to online convex optimization with adversarial constraints (Neely and Yu, 2017) and provides adaptive performance guarantees. We include this result in Appendix C.
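The data-drop step behind Algorithm 2 is easy to illustrate numerically. The following is our own toy sketch (the two-state chain, the block length of 64, and all variable names are ours, not the paper's): keeping one sample per block of roughly the mixing time yields a subsequence whose consecutive elements are nearly independent draws from the stationary distribution.

```python
import numpy as np

# Sketch (ours) of the data-drop idea behind DPP-DD: from a stream of T
# Markovian samples, keep one sample per block and discard the rest, so
# the kept subsequence behaves like a chain that mixes in ~1 step.
def thin_stream(stream, block):
    return stream[::block]

rng = np.random.default_rng(1)
P = np.array([[0.99, 0.01],
              [0.04, 0.96]])        # slowly mixing two-state chain
T, s = 50000, 0
stream = np.empty(T, dtype=int)
for t in range(T):
    stream[t] = s
    s = rng.choice(2, p=P[s])       # Markov transition

# Block length 64 exceeds this chain's mixing time; successive kept
# samples are then nearly independent draws from mu = (0.8, 0.2).
kept = thin_stream(stream, block=64)
```

Here the second-largest eigenvalue of the thinned transition matrix is $0.95^{64} \approx 0.04$, so correlations between consecutive kept samples are essentially gone, at the price of discarding a $1 - 1/\tau_{\mathrm{mix}}$ fraction of the data.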
Combining the estimation accuracy bounds on the multi-level Monte Carlo method by Dorfman and Levy (2022) and our adaptive regret and constraint violation bounds, we prove that the optimality gap under Algorithm 3 is bounded by $\tilde{O}(\tau_{\mathrm{mix}}^{1-\beta} T^{-\beta})$ while the feasibility gap is bounded by $\tilde{O}(\tau_{\mathrm{mix}}^{(2\beta+1)/4} T^{-(1-\beta)/2})$.

In Section 5, we provide numerical results from experiments on a classification problem with fairness constraints. Specifically, we take the logistic regression formulation proposed in Zafar et al. (2019). The numerical results on random problem instances demonstrate the efficacy of our proposed algorithmic frameworks for solving SCSO with Markovian data.

The main component of our analysis is to provide bounds on the terms $\sum_{t=1}^{T} Q_t g_t(x)$, $\mathbb{E}[Q_t]$, and related moments of the virtual queue size under our Markovian data regime. All these terms involve the virtual queue size $Q_t$, and therefore, controlling the virtual queue size is crucial to guarantee a fast convergence rate. We include our proofs of the main theorems in Sections 6 and 7. We include some of the known lemmas and tools for analyzing the drift-plus-penalty method due to Yu et al. (2017) in Appendices A and D. In fact, Appendix A states these results for any adaptive version of the drift-plus-penalty algorithm, which uses time-varying parameters $V_t$ and $\alpha_t$.

1.2 Related Work

Although we have listed and explained important lines of previous work that motivate this paper, we supplement the list by mentioning a few more relevant results in stochastic approximation algorithms and Markov chain stochastic gradient methods.

1.2.1 Stochastic Approximation for Stochastic Optimization

Starting from the seminal paper by Robbins and Monro (1951), stochastic approximation algorithms, also known as stochastic gradient methods, have been a central topic of research in the domains of machine learning, operations research, and optimization (Ermoliev, 1983; Pflug, 1996; Ruszczyński and Syski, 1986; Bottou et al., 2018).
There already exist numerous works on stochastic approximation algorithms for stochastic optimization. In particular, for SCSO or stochastic optimization with expectation constraints, various stochastic approximation methods were proposed and studied by Lan and Zhou (2020); Xiao (2019); Lin et al. (2020); Akhtar et al. (2021); Yan and Xu (2022); Zhang et al. (2023, 2022), as discussed earlier. We note that online convex optimization with stochastic constraints is a superclass of SCSO under the i.i.d. data regime. Mahdavi et al. (2012); Jenatton et al. (2016) developed augmented Lagrangian-based primal-dual algorithms for online convex optimization with deterministic long-term constraints. It turns out that these algorithms and their analysis can be adapted to the setting of stochastic constraints (Akhtar et al., 2021). Later, Yu et al. (2017); Wei et al. (2020) proposed the drift-plus-penalty algorithm that achieves better regret and constraint violation guarantees under Slater s condition. Yuan and Lamperski (2018); Yi et al. (2021); Guo et al. (2022) also provided algorithms for online convex optimization with stochastic constraints. 1.2.2 Markov Chain Stochastic Gradient Descent The asymptotic convergence of Markov chain SGD was studied in Bertsekas and Tsitsiklis (1996); Borkar (2008); Benveniste et al. (1990) based on ordinary differential equation methods. Lopes and Sayed (2007); Johansson et al. (2007, 2010); Ram et al. (2009a) developed incremental subgradient methods for the distributed optimization setting where there is a network of data centers and an algorithm performs a random walk over the network to obtain data samples. These methods are also referred to as token algorithms. More recently, Mao et al. (2020) devised what they called the walkman algorithm, which is a token algorithm based on an augmented Lagrangian method. Sun et al. (2022); Ayache et al. (2023) considered adaptive variants of token algorithms. 
Hendrikx (2023) developed a more general framework for token algorithms that allows multiple tokens for improving communication efficiency and may adopt existing optimization tools such as variance reduction and acceleration.

In addition to random walk-based token algorithms, there exist general Markov chain sampling frameworks for stochastic gradient methods. As mentioned earlier, Duchi et al. (2012); Sun et al. (2018); Chen et al. (2018); Nagaraj et al. (2020); Doan et al. (2020); Dorfman and Levy (2022); Roy et al. (2022); Kumar and Sarkar (2023) developed Markov chain SGD methods for convex and nonconvex stochastic optimization problems. Moreover, Sun et al. (2020) developed a Markov chain sampling-based block coordinate descent method. Sun et al. (2023) proposed a decentralized variant of Markov chain SGD. Wang et al. (2022) considered the stability of Markov chain SGD and deduced its generalization bounds. Doan (2023) derived convergence guarantees on Markov chain SGD without a smoothness assumption, and Even (2023) studied convergence of Markov chain SGD without the bounded gradient assumption.

2. Preliminaries

In this section, we provide problem formulations for stochastic-constrained stochastic and online convex optimization under the Markovian data sampling regime. In addition, Section 2.1 gives the formal definition of the mixing time of a Markov chain. Section 2.4 describes a list of assumptions considered throughout the paper.

2.1 Markov Chain and Mixing Time

Given two probability distributions $P, Q$ over the probability space $(S, \mathcal{F})$, the total variation distance between them is defined as

$$\|P - Q\|_{\mathrm{TV}} := \sup_{A \in \mathcal{F}} |P(A) - Q(A)|.$$

Let $\{\xi_t\}_{t=1}^{\infty}$ be a time-homogeneous ergodic Markov chain with a finite state space $S$. For a distribution $\nu$ over $(S, \mathcal{F})$, we denote by $P^t(\nu, \cdot)$ the conditional probability distribution of $\xi_{t+1}$ given $\xi_1 \sim \nu$.
Since $\{\xi_t\}_{t=1}^{\infty}$ is ergodic, it has a unique stationary distribution $\mu$, i.e., $P^t(\mu, \cdot) = \mu(\cdot)$. In fact, the long-term distribution of the ergodic Markov chain converges to $\mu$ regardless of the initial distribution, which can be demonstrated as follows. It is known that $P^t(\nu, \cdot)$ for any $t$ and $\nu$ satisfies $\|P^t(\nu, \cdot) - \mu\|_{\mathrm{TV}} \le C\alpha^t$ for some $\alpha \in (0, 1)$ and $C > 0$ (Levin and Peres, 2017). Then we define quantities $d_{\mathrm{mix}}(t)$ and $\tau_{\mathrm{mix}}(\epsilon)$ as follows.

$$d_{\mathrm{mix}}(t) := \sup_{\nu} \|P^t(\nu, \cdot) - \mu\|_{\mathrm{TV}}, \qquad \tau_{\mathrm{mix}}(\epsilon) := \inf\{t \in \mathbb{N} : d_{\mathrm{mix}}(t) \le \epsilon\}.$$

Moreover, following the convention, we define $\tau_{\mathrm{mix}}$ as $\tau_{\mathrm{mix}} := \tau_{\mathrm{mix}}(1/4)$, and we refer to $\tau_{\mathrm{mix}}$ as the mixing time of the underlying Markov chain. It is known that $d_{\mathrm{mix}}(l\tau_{\mathrm{mix}}) \le 2^{-l}$ for every $l \in \mathbb{N}$ (Levin and Peres, 2017, Chapter 4), which implies that $\tau_{\mathrm{mix}}(\epsilon) \le \lceil \log_2 \epsilon^{-1} \rceil \tau_{\mathrm{mix}}$. In particular, throughout the paper, we will use the quantity $\tau$ defined as $\tau := \tau_{\mathrm{mix}}(1/T) = O(\tau_{\mathrm{mix}} \log T)$, where $T$ is the length of the time horizon.

We remark that the finite state space assumption is necessary for Algorithm 3 as Lemmas 8 and 38 rely on Lemma 37, which requires the assumption. On the other hand, for Algorithm 1, the stochastic process need not be a Markov chain or time-homogeneous, though it must satisfy the exponentially fast convergence condition.

2.2 Stochastic-Constrained Online Convex Optimization with Markovian Data

Let $X \subseteq \mathbb{R}^d$ be a compact convex domain. Let $\{f_t : X \to \mathbb{R}\}_{t=1}^{\infty}$ be a sequence of arbitrary convex functions. Another sequence of convex functions $\{g_t : X \to \mathbb{R}\}_{t=1}^{\infty}$ is assumed to follow an ergodic Markov chain. More precisely, there exists a time-homogeneous ergodic Markov chain $\{\xi_t\}_{t=1}^{\infty}$ such that $g_t(x) = g(x, \xi_t)$. Let $\bar{g}(x) = \mathbb{E}_{\xi \sim \mu}[g(x, \xi)]$ for $x \in X$, where the expectation is taken with respect to the stationary distribution $\mu$ of the Markov chain. Then the problem is to solve and compute

$$x^{\star} \in \operatorname*{argmin}_{x \in X} \sum_{t=1}^{T} f_t(x) \quad \text{s.t.} \quad \bar{g}(x) \le 0.$$

However, the information about the functions $\{f_t\}_{t=1}^{T}$ and $\{g_t\}_{t=1}^{T}$ is revealed online.
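Before proceeding, the mixing-time quantities of Section 2.1 can be made concrete with a small numerical illustration; the two-state chain below is our own toy example, not taken from the paper. For a finite chain, the supremum over initial distributions $\nu$ is attained at a point mass, so $d_{\mathrm{mix}}(t)$ can be computed by maximizing the total variation distance over deterministic initial states.

```python
import numpy as np

# Toy illustration (ours): compute d_mix(t) and tau_mix = tau_mix(1/4)
# for a small ergodic chain. The sup over initial distributions nu is
# attained at a deterministic initial state, so we maximize over rows.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()

def d_mix(t):
    Pt = np.linalg.matrix_power(P, t)
    # TV distance equals half the L1 distance between distributions.
    return max(0.5 * np.abs(Pt[s] - mu).sum() for s in range(len(mu)))

tau_mix = next(t for t in range(1, 1000) if d_mix(t) <= 0.25)
```

For this chain, $\mu = (2/3, 1/3)$ and $d_{\mathrm{mix}}(t)$ decays geometrically, so `tau_mix` is a small constant; the same routine applies to any finite ergodic transition matrix.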
Basically, at each step $t$, we choose our decision $x_t \in X$ before observing $f_t, g_t$, after which we receive feedback about them. The stochastic setting studied in Yu et al. (2017) is that the constraint functions $g_1, \ldots, g_T$ are i.i.d., which means that $\xi_1, \ldots, \xi_T$ are i.i.d., while we consider the case where the constraint functions $g_1, \ldots, g_T$ follow an ergodic Markov chain and thus are dependent. That said, we may refer to the problem setting as online convex optimization with ergodic constraints.

To measure the performance of a learning algorithm for the online optimization problem, we adopt the regret and cumulative constraint violation definition due to Yu et al. (2017). The regret and cumulative constraint violation of an algorithm that generates solutions $x_1, \ldots, x_T$ over $T$ time steps are given by

$$\mathrm{Regret}(T) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^{\star}), \qquad \mathrm{Violation}(T) = \sum_{t=1}^{T} g_t(x_t).$$

We want these quantities to have a sublinear growth in $T$. Note that the benchmark solution $x^{\star}$ satisfies the expectation constraint $\mathbb{E}_{\xi \sim \mu}[g(x^{\star}, \xi)] \le 0$.

2.3 Stochastic-Constrained Stochastic Optimization with Markovian Data

Next, we consider the setting where the objective functions $\{f_t\}_{t=1}^{\infty}$ as well as the constraint functions $\{g_t\}_{t=1}^{\infty}$ are given by an ergodic Markov chain. Without loss of generality, we may assume that $f_t(x) = f(x, \xi_t)$ and $g_t(x) = g(x, \xi_t)$ for some time-homogeneous ergodic Markov chain $\{\xi_t\}_{t=1}^{\infty}$ with a stationary distribution $\mu$. As before, let $\bar{f}(x) = \mathbb{E}_{\xi \sim \mu}[f(x, \xi)]$ and $\bar{g}(x) = \mathbb{E}_{\xi \sim \mu}[g(x, \xi)]$ for $x \in X$. Then the problem is to solve and compute

$$x^{\#} \in \operatorname*{argmin}_{x \in X} \bar{f}(x) \quad \text{s.t.} \quad \bar{g}(x) \le 0.$$

Here, we do not have direct access to $(\bar{f}, \bar{g})$, but we receive samples $(f_t, g_t)$ which converge to $(\bar{f}, \bar{g})$ in expectation. For the performance measure of a learning algorithm for this problem, we consider the following standard notions of optimality gap and constraint violation. For a solution $x \in X$,

$$\mathrm{Gap}(x) = \bar{f}(x) - \bar{f}(x^{\#}), \qquad \mathrm{Infeasibility}(x) = \bar{g}(x).$$
We want these quantities to approach 0 as $T$ grows. Hereinafter, we refer to $\mathrm{Gap}(x)$ and $\mathrm{Infeasibility}(x)$ as the optimality gap and the feasibility gap, respectively. In the special case where $\{\xi_t\}_{t=1}^{\infty}$ are i.i.d. samples drawn from $\mu$, solving stochastic-constrained online convex optimization would provide a solution to stochastic-constrained stochastic optimization. One can argue that if $\xi_1, \ldots, \xi_T$ are i.i.d., then

$$\mathbb{E}[\mathrm{Gap}(\bar{x}_T)] \le \frac{1}{T}\, \mathbb{E}[\mathrm{Regret}(T)], \qquad \mathbb{E}[\mathrm{Infeasibility}(\bar{x}_T)] \le \frac{1}{T}\, \mathbb{E}[\mathrm{Violation}(T)],$$

where $\bar{x}_T$ denotes the simple average of $x_1, \ldots, x_T$. However, in contrast to the i.i.d. case, the above inequalities do not hold for the case of an arbitrary ergodic Markov chain. This is because the distribution of $\xi_{t+1}$ conditioned on $\xi_t$ is not equal to the stationary distribution $\mu$. On the other hand, we will use the fact that the long-term distribution of an ergodic Markov chain converges to its stationary distribution. Based on this observation, we first bound the expected regret and the expected constraint violation of the online problem, and then we use this to bound the optimality gap and the feasibility gap of the stochastic optimization problem.

2.4 Notations and Assumptions

We work over a norm $\|\cdot\|$ and its dual norm $\|\cdot\|_*$. We choose a convex mirror map $\Phi : \mathcal{C} \to \mathbb{R}$, where $\mathcal{C} \subseteq \mathbb{R}^d$ is a convex domain containing $X$. We use the corresponding Bregman divergence defined as

$$D(x, y) = \Phi(x) - \Phi(y) - \nabla\Phi(y)^{\top}(x - y).$$

Assumption 1 There is a constant $R > 0$ such that $D(x, y) \le R^2$ for any $x, y \in X$, and $\Phi$ is 2-strongly convex with respect to the norm $\|\cdot\|$, i.e., $\|x - y\|^2 \le D(x, y)$ for any $x, y \in X$. Moreover, $f(\cdot, \xi)$ and $g(\cdot, \xi)$ are differentiable for each $\xi \in S$.

We use the notations $F_t, G_t, H_t$ for $t \in [T]$ given by

$$F_t := \|\nabla f_t(x_t)\|_*, \qquad G_t := \|\nabla g_t(x_t)\|_*, \qquad H_t := |g_t(x_t)|,$$

which are parameters used in our adaptive algorithm. Due to the stochasticity of $f_t, g_t$, these are random variables. We assume these quantities are bounded.

Assumption 2 There exist constants $F, G, H > 0$ such that $\|\nabla_x f(x, \xi)\|_* \le F$, $\|\nabla_x g(x, \xi)\|_* \le G$, and $|g(x, \xi)| \le H$ for any $x \in X$ and $\xi \in S$.
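As a concrete instance of the mirror-map setup in Assumption 1 (our illustration, not from the paper): taking $\Phi(x) = \|x\|_2^2$ gives the Bregman divergence $D(x, y) = \|x - y\|_2^2$, so $\Phi$ is 2-strongly convex with respect to $\|\cdot\|_2$ with equality in $\|x - y\|^2 \le D(x, y)$, and $R$ can be taken as the Euclidean diameter of $X$.

```python
import numpy as np

# Illustration (ours): the mirror map Phi(x) = ||x||_2^2 yields
#   D(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y> = ||x - y||_2^2,
# which attains the 2-strong-convexity inequality ||x - y||^2 <= D(x, y)
# of Assumption 1 with equality.
def phi(x):
    return np.dot(x, x)

def grad_phi(y):
    return 2.0 * y

def bregman(x, y):
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
assert np.isclose(bregman(x, y), np.dot(x - y, x - y))
```

With this choice, the primal update of the drift-plus-penalty framework reduces to a projected gradient step; other mirror maps (e.g., negative entropy on the simplex) change $D$ and the induced geometry accordingly.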
Assumptions 1 and 2 are common in online convex optimization, stochastic optimization, and the Markov chain SGD literature. The following assumption is referred to as Slater's condition or Slater's constraint qualification for constrained optimization.

Assumption 3 (Slater's condition) There exists $\hat{x} \in X$ such that $\bar{g}(\hat{x}) \le -\epsilon$ for some $\epsilon > 0$.

For online convex optimization with stochastic i.i.d. constraint functions, Slater's condition leads to improvement in regret and constraint violation (Yu et al., 2017). In Section 3, we will show that even for online convex optimization with ergodic constraint functions, assuming that Slater's condition holds results in an improvement in the regret and constraint violation of Algorithm 1.

3. Known Mixing Time

We first focus on the setting where we have access to the mixing time $\tau_{\mathrm{mix}}$ of the underlying Markov chain $\{\xi_t\}_{t=1}^{\infty}$. Duchi et al. (2012) studied this case for stochastic convex minimization without a stochastic functional constraint, for which they modified the step size of the stochastic gradient descent method based on the mixing time parameter $\tau_{\mathrm{mix}}$. Inspired by this approach, we take and modify the drift-plus-penalty (DPP) algorithm due to Neely and Yu (2017); Yu et al. (2017) developed for stochastic-constrained online convex optimization. Based on DPP, we develop our algorithm by setting the algorithm parameters properly to adapt to the mixing time $\tau_{\mathrm{mix}}$.

3.1 Ergodic Drift-Plus-Penalty

The DPP algorithm has two parameters, $V$ and $\alpha$, where $V$ is the penalty parameter and $\alpha$ determines the step size. Yu et al. (2017) set $V = \sqrt{T}$ and $\alpha = T$. In contrast to the vanilla DPP algorithm, our algorithm uses the parameters

$$V_t = (\tau_{\mathrm{mix}} t)^{\beta}, \qquad \alpha_t = \tau_{\mathrm{mix}} t$$

for iterations $t = 1, \ldots, T$, where $T$ is the length of the horizon and $\beta$ is another algorithm parameter that controls the balance between the regret and the constraint violation. Our algorithm, which we call ergodic drift-plus-penalty (EDPP), is described in Algorithm 1.
Algorithm 1 Ergodic Drift-Plus-Penalty (EDPP)
Initialize: initial iterate $x_1 \in X$, $Q_1 = 0$, and $0 < \beta \le 1/2$.
for $t = 1$ to $T$ do
  Observe $f_t$ and $g_t$.
  Set the penalty parameter $V_t$ and the step size parameter $\alpha_t$ as $V_t = (\tau_{\mathrm{mix}} t)^{\beta}$ and $\alpha_t = \tau_{\mathrm{mix}} t$.
  Primal update: set $x_{t+1}$ as
  $$x_{t+1} = \operatorname*{argmin}_{x \in X} \left\{ \left( V_t \nabla f_t(x_t) + Q_t \nabla g_t(x_t) \right)^{\top} x + \alpha_t D(x, x_t) \right\}$$
  Dual update: set $Q_{t+1}$ as
  $$Q_{t+1} = \left[ Q_t + g_t(x_t) + \nabla g_t(x_t)^{\top}(x_{t+1} - x_t) \right]_{+}$$

Note that the mixing time $\tau_{\mathrm{mix}}$ is now part of the parameters $V_t$ and $\alpha_t$, as in the ergodic mirror descent algorithm by Duchi et al. (2012). Second, $V_t$ and $\alpha_t$ are time-varying, so our algorithm is adaptive (Jenatton et al., 2016) and oblivious to the length of the time horizon $T$. One may wonder why we do not use $T$ instead of $t$, i.e., $V = (\tau_{\mathrm{mix}} T)^{\beta}$ and $\alpha = \tau_{\mathrm{mix}} T$. In fact, our numerical results, which will be presented in Section 5, demonstrate that Algorithm 1 with the adaptive parameters $V_t$ and $\alpha_t$ outperforms the algorithm with the fixed parameters $V = (\tau_{\mathrm{mix}} T)^{\beta}$ and $\alpha = \tau_{\mathrm{mix}} T$. However, the performance analysis of the vanilla DPP algorithm by Yu et al. (2017) does not immediately extend to such adaptive parameters. Hence, we prove that the DPP framework with parameters varying in $t$ still achieves the desired regret and constraint violation guarantees.

Let us also briefly explain how the DPP framework initially developed by Yu et al. (2017), as well as our Algorithm 1, works. We may regard $Q_t$ as the size of a virtual queue at time $t$. Then we consider the associated quadratic Lyapunov term $L_t = Q_t^2/2$ and study the corresponding drift given by $\Delta_t = L_{t+1} - L_t = (Q_{t+1}^2 - Q_t^2)/2$. It is not difficult to see that

$$\Delta_t \le Q_t \left( g_t(x_t) + \nabla g_t(x_t)^{\top}(x_{t+1} - x_t) \right) + \frac{1}{2}(H_t + G_t R)^2$$

holds (Lemma 22). Here, the upper bound on the drift $\Delta_t$ has the term $Q_t \nabla g_t(x_t)^{\top}(x_{t+1} - x_t)$ that depends on the next iterate $x_{t+1}$. Hence, by choosing $x_{t+1}$ that minimizes $Q_t \nabla g_t(x_t)^{\top}(x_{t+1} - x_t)$, we may attempt to control the drift.
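To make the updates concrete, here is a minimal one-dimensional Euclidean instantiation of the EDPP updates on a toy problem of our own (the chain, the constants, and all variable names are ours, not the paper's experiments). With $\Phi(x) = x^2$, the primal update has the closed form $x_{t+1} = \Pi_X\big(x_t - (V_t \nabla f_t(x_t) + Q_t \nabla g_t(x_t))/(2\alpha_t)\big)$.

```python
import numpy as np

# Minimal Euclidean sketch of the EDPP updates (our toy instance):
# minimize E[(x - xi)^2] over X = [-1, 1] subject to E[x - 0.2] <= 0,
# with xi drawn from a two-state ergodic Markov chain.
rng = np.random.default_rng(0)
T, tau_mix, beta = 2000, 2, 0.5
P = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition matrix
states = np.array([-0.5, 1.0])           # sample value per state

x, Q, s = 0.0, 0.0, 0
iterates = []
for t in range(1, T + 1):
    xi = states[s]
    grad_f = 2.0 * (x - xi)              # gradient of f(., xi) at x
    grad_g = 1.0                         # gradient of g(., xi) = x - 0.2
    g_val = x - 0.2
    V_t, alpha_t = (tau_mix * t) ** beta, tau_mix * t
    # Primal update: argmin over X of (V_t grad_f + Q grad_g) x
    # plus alpha_t (x - x_t)^2, i.e., a projected step.
    x_new = float(np.clip(x - (V_t * grad_f + Q * grad_g) / (2.0 * alpha_t),
                          -1.0, 1.0))
    # Dual (virtual queue) update with the linearized constraint.
    Q = max(Q + g_val + grad_g * (x_new - x), 0.0)
    x = x_new
    iterates.append(x)
    s = rng.choice(2, p=P[s])            # Markov transition

x_bar = float(np.mean(iterates))         # averaged solution
```

Under this toy chain the stationary mean of $\xi$ is 0 and the constraint $x \le 0.2$ is inactive at the optimum, so the averaged iterate should settle near 0 while the virtual queue stays small; this is only a sanity-check sketch, not a reproduction of the paper's guarantees.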
In fact, the primal update sets $x_{t+1}$ to be the minimizer of

$$\underbrace{Q_t \nabla g_t(x_t)^{\top}(x - x_t)}_{\text{drift}} + \underbrace{V_t \nabla f_t(x_t)^{\top}(x - x_t) + \alpha_t D(x, x_t)}_{\text{penalty}}$$

over $X$. Consequently, at each iteration, we get to choose a solution that minimizes the drift term $\Delta_t$ and a penalty term for controlling the objective simultaneously.

3.2 Performance Guarantees of Ergodic Drift-Plus-Penalty

First, we analyze the regret and constraint violation of EDPP for the constrained online convex optimization setting under the Markovian sampling regime, formulated in Section 2.2. Recall that $\tau_{\mathrm{mix}} = \tau_{\mathrm{mix}}(1/4)$ and that the parameter $\tau$ is defined as $\tau = \tau_{\mathrm{mix}}(T^{-1})$, which satisfies $\tau \le \lceil \log_2 T \rceil \tau_{\mathrm{mix}}$.

Theorem 1 Suppose that Assumptions 1 and 2 hold. Then for online convex optimization with ergodic constraints, Algorithm 1 achieves

$$\mathbb{E}[\mathrm{Regret}(T)] = O\big(\tau_{\mathrm{mix}}^{-\beta}\, \tau\, T^{1-\beta}\big), \qquad \mathbb{E}[\mathrm{Violation}(T)] = O\big(\tau_{\mathrm{mix}}^{\beta/2}\, T^{(\beta+1)/2} + \sqrt{\tau(\tau + T)}\big),$$

where the expectation is taken with respect to the randomness in running the algorithm.

One of the key components of the analysis for proving Theorem 1 is that we bound the expected size of the virtual queue $Q_t$ at time $t$ as follows.

$$\mathbb{E}[Q_t] = O\big(\tau_{\mathrm{mix}}^{\beta/2}\, T^{(\beta+1)/2} + \sqrt{\tau(\tau + T)}\big)$$

Another key part is to analyze the term $\sum_{t=1}^{T} \mathbb{E}[Q_t g_t(x)]$, where the function $g_t$ is not independent of the virtual queue size $Q_t$ under our Markovian sampling regime. In contrast, if $g_1, \ldots, g_T$ were i.i.d., then $\mathbb{E}[Q_t g_t(x)] = \mathbb{E}[Q_t \bar{g}(x)]$ would hold. Instead, we relate the term with $\sum_{t=1}^{T} \mathbb{E}[Q_t g_{t+\tau-1}(x)]$ and use the intuition that the distribution of the ergodic Markov chain after $\tau$ steps is close to its stationary distribution. We provide a bound on the term in Lemma 14. Similarly, we also need to analyze a related summation, and an upper bound on that sum is given in Lemma 15. Based on these observations, we provide an upper bound on the expected virtual queue size $\mathbb{E}[Q_t]$, which is given in Lemma 16. Then we apply the performance analysis results of the general adaptive drift-plus-penalty method provided in Appendix A.
The complete proof of Theorem 1 is given in Section 6.1. Next we analyze the performance of EDPP on the stochastic-constrained stochastic optimization problem under Markovian data sampling, whose formulation is given by (SCSO).

Theorem 2 Suppose that Assumptions 1 and 2 hold. Then for stochastic-constrained stochastic optimization (SCSO), Algorithm 1 guarantees that
$$\mathbb{E}[\mathrm{Gap}(\bar x_T)] = O\bigg(\tau_{\mathrm{mix}}^{-\beta}\,\tau\,T^{-\beta} + \frac{\tau-1}{\tau_{\mathrm{mix}}^{1-\beta/2}}\,T^{-(1-\beta)/2} + \frac{(\tau-1)^{3/2}}{\tau_{\mathrm{mix}}}\,T^{-1/2} + \frac{\tau}{T}\bigg),$$
$$\mathbb{E}[\mathrm{Infeasibility}(\bar x_T)] = O\bigg(\frac{\tau}{\tau_{\mathrm{mix}}^{1-\beta/2}}\,T^{-(1-\beta)/2} + \frac{(\tau-1)^{3/2}}{\tau_{\mathrm{mix}}}\,T^{-1/2} + \frac{\tau}{T}\bigg),$$
where $\bar x_T = \sum_{t=1}^{T} x_t / T$ and the expectation is taken with respect to the randomness in running the algorithm.

Note that Theorem 1 implies $\frac{1}{T}\mathbb{E}[\mathrm{Regret}(T)] = O(\tau_{\mathrm{mix}}^{-\beta}\tau T^{-\beta})$, but the bound on the optimality gap given in Theorem 2 has additional terms due to the difference between the stationary distribution of the ergodic Markov chain and the distribution of $f_{t+1}$ conditioned on $f_t$. In fact, under Markovian sampling, we have $\sum_{t=1}^{T}\mathbb{E}[f_t(x_t)] \neq \sum_{t=1}^{T}\mathbb{E}[\bar f(x_t)]$. Here, to get around this issue, we also use the intuition that $\mathbb{E}[\bar f(x_t)]$ is close to $\mathbb{E}[f_{t+\tau-1}(x_t)]$, as the distribution of the Markov chain after $\tau$ steps is close to its stationary distribution. The proof of Theorem 2 is given in Section 6.2. Moreover, since $\tau = O(\tau_{\mathrm{mix}})$ up to logarithmic factors, it follows that
$$\mathbb{E}[\mathrm{Regret}(T)] = \widetilde O\big(\tau_{\mathrm{mix}}^{1-\beta} T^{1-\beta}\big), \qquad \mathbb{E}[\mathrm{Violation}(T)] = \widetilde O\big(\tau_{\mathrm{mix}}^{\beta/2} T^{(\beta+1)/2} + \tau_{\mathrm{mix}}^{1/2} T^{1/2}\big),$$
$$\mathbb{E}[\mathrm{Gap}(\bar x_T)] = \widetilde O\big(\tau_{\mathrm{mix}}^{1-\beta} T^{-\beta} + \tau_{\mathrm{mix}}^{\beta/2} T^{-(1-\beta)/2} + \tau_{\mathrm{mix}} T^{-1}\big), \qquad \mathbb{E}[\mathrm{Infeasibility}(\bar x_T)] = \widetilde O\big(\tau_{\mathrm{mix}}^{\beta/2} T^{-(1-\beta)/2} + \tau_{\mathrm{mix}}^{1/2} T^{-1/2} + \tau_{\mathrm{mix}} T^{-1}\big),$$
where $\widetilde O$ hides a $\log T$ factor. In particular, if we set $\beta = 1/3$, then we have $\mathbb{E}[\mathrm{Regret}(T)] = \widetilde O(\tau_{\mathrm{mix}}^{2/3} T^{2/3})$, $\mathbb{E}[\mathrm{Violation}(T)] = \widetilde O(\tau_{\mathrm{mix}}^{1/6} T^{2/3} + \tau_{\mathrm{mix}}^{1/2} T^{1/2})$, $\mathbb{E}[\mathrm{Gap}(\bar x_T)] = \widetilde O(\tau_{\mathrm{mix}}^{2/3} T^{-1/3} + \tau_{\mathrm{mix}} T^{-1})$, and $\mathbb{E}[\mathrm{Infeasibility}(\bar x_T)] = \widetilde O(\tau_{\mathrm{mix}}^{1/6} T^{-1/3} + \tau_{\mathrm{mix}}^{1/2} T^{-1/2} + \tau_{\mathrm{mix}} T^{-1})$. Furthermore, observe that if $\{\xi_t\}_{t=1}^{\infty}$ is a sequence of i.i.d. random variables, then we have $\tau_{\mathrm{mix}} = \tau = 1$.
In this case, by Theorems 1 and 2, Algorithm 1 guarantees that $\mathbb{E}[\mathrm{Regret}(T)] = O(T^{1-\beta})$, $\mathbb{E}[\mathrm{Violation}(T)] = O(T^{(\beta+1)/2})$, $\mathbb{E}[\mathrm{Gap}(\bar x_T)] = O(T^{-\beta})$, and $\mathbb{E}[\mathrm{Infeasibility}(\bar x_T)] = O(T^{-(1-\beta)/2})$ for any $\beta \in (0, 1/2]$, which recovers the result of Jenatton et al. (2016).

If we further assume that Slater's constraint qualification holds, we can argue that we get better control on the size of $\mathbb{E}[Q_t]$. Note that the upper bound on the expected queue size given by Lemma 16 is $\mathbb{E}[Q_t] = O(\tau_{\mathrm{mix}}^{\beta/2} t^{(\beta+1)/2})$, which holds regardless of whether Slater's condition holds or not. On the other hand, we will argue that under Slater's condition, we have
$$\mathbb{E}[Q_t] = O\big(\sqrt{\tau(\tau+t)} + \sqrt{\tau_{\mathrm{mix}} t}\big) = \widetilde O\big(\sqrt{\tau_{\mathrm{mix}} t}\big).$$
This is consistent with Yu et al. (2017), who proved that $\mathbb{E}[Q_t] = O(\sqrt{T})$ for the i.i.d. setting. In fact, our proof for bounding $\mathbb{E}[Q_t]$ is more involved than the argument of Yu et al. (2017) because we use adaptive step sizes for Algorithm 1 and consider non-i.i.d. constraint functions. This leads to improvements, as stated in the following result.

Theorem 3 Suppose that Assumptions 1-3 hold. Then for online convex optimization with ergodic constraints and stochastic-constrained stochastic optimization, Algorithm 1 with $\beta = 1/2$ guarantees
$$\mathbb{E}[\mathrm{Regret}(T)] = O\big(\tau_{\mathrm{mix}}^{-1/2}\,\tau\,T^{1/2}\big), \qquad \mathbb{E}[\mathrm{Violation}(T)] = O\big(\sqrt{\tau(\tau+T)} + \sqrt{\tau_{\mathrm{mix}} T}\big),$$
$$\mathbb{E}[\mathrm{Gap}(\bar x_T)] = O\big(\tau^{3/2}\tau_{\mathrm{mix}}^{-1}\,T^{-1/2}\big), \qquad \mathbb{E}[\mathrm{Infeasibility}(\bar x_T)] = O\big(\tau^{3/2}\tau_{\mathrm{mix}}^{-1}\,T^{-1/2} + \tau^2\tau_{\mathrm{mix}}^{-1/2}\,T^{-3/2}\big),$$
where the expectation is taken with respect to the randomness in running the algorithm.

Our analysis takes into account the time-varying algorithm parameters $V_t$ and $\alpha_t$ as well as the fact that the functions are correlated according to a Markov chain. To handle this, we prove Lemmas 17 and 18, which lead to time-varying bounds on the expected virtual queue size. The complete proof of Theorem 3 is included in Section 6.3.
Since $\tau = O(\tau_{\mathrm{mix}})$ up to logarithmic factors,
$$\mathbb{E}[\mathrm{Regret}(T)] = \widetilde O\big(\sqrt{\tau_{\mathrm{mix}} T}\big), \qquad \mathbb{E}[\mathrm{Violation}(T)] = \widetilde O\big(\sqrt{\tau_{\mathrm{mix}} T} + \tau_{\mathrm{mix}}^{3/2}\big),$$
$$\mathbb{E}[\mathrm{Gap}(\bar x_T)] = \widetilde O\big(\sqrt{\tau_{\mathrm{mix}}/T}\big), \qquad \mathbb{E}[\mathrm{Infeasibility}(\bar x_T)] = \widetilde O\big(\sqrt{\tau_{\mathrm{mix}}/T} + \tau_{\mathrm{mix}}^{3/2}\,T^{-3/2}\big).$$
Note that the optimality gap bound of $\widetilde O(\sqrt{\tau_{\mathrm{mix}}/T})$ nearly matches the lower bound of $\Omega(\sqrt{\tau_{\mathrm{mix}}/T})$ by Duchi et al. (2012) for the case of no functional constraint. While Algorithm 1 achieves a nearly optimal bound on the optimality gap under Slater's condition, the dependence among correlated samples can also be addressed by downsampling, which we consider next.

3.3 Drift-Plus-Penalty with Data Drop

As mentioned in the introduction, the downsampling technique intentionally takes a subset of data and drops the rest, with the purpose of reducing dependence among data samples from a Markov chain. Algorithm 2 applies the downsampling technique to the drift-plus-penalty algorithm.

Algorithm 2 Drift-Plus-Penalty with Data Drop (DPP-DD)
Initialize: initial iterate $x_1 \in X$, $Q_1 = 0$, and $0 < \beta \le 1/2$.
for $k = 1$ to $\lfloor T/\tau_{\mathrm{mix}} \rfloor$ do
    Set $x_{(k-1)\tau_{\mathrm{mix}}+i} = x_{(k-1)\tau_{\mathrm{mix}}+1}$ for $i = 2, \ldots, \tau_{\mathrm{mix}}$.
    Observe $f_{k\tau_{\mathrm{mix}}}$ and $g_{k\tau_{\mathrm{mix}}}$.
    Set the penalty parameter $V_k$ and the step size parameter $\alpha_k$ as $V_k = k^{\beta}$ and $\alpha_k = k$.
    Primal update: set
    $$x_{k\tau_{\mathrm{mix}}+1} = \operatorname*{argmin}_{x \in X}\Big\{\big(V_k \nabla f_{k\tau_{\mathrm{mix}}}(x_{k\tau_{\mathrm{mix}}}) + Q_k \nabla g_{k\tau_{\mathrm{mix}}}(x_{k\tau_{\mathrm{mix}}})\big)^{\top} x + \alpha_k D(x, x_{k\tau_{\mathrm{mix}}})\Big\}.$$
    Dual update: set
    $$Q_{k+1} = \Big[Q_k + g_{k\tau_{\mathrm{mix}}}(x_{k\tau_{\mathrm{mix}}}) + \nabla g_{k\tau_{\mathrm{mix}}}(x_{k\tau_{\mathrm{mix}}})^{\top}(x_{k\tau_{\mathrm{mix}}+1} - x_{k\tau_{\mathrm{mix}}})\Big]_+.$$

The algorithm works as follows. We take an episode with $\tau_{\mathrm{mix}}$ consecutive time steps. Hence, if we run the algorithm for $T$ time steps, it would have $\lfloor T/\tau_{\mathrm{mix}} \rfloor$ episodes of $\tau_{\mathrm{mix}}$ time steps, while the last episode has $T - \lfloor T/\tau_{\mathrm{mix}} \rfloor\,\tau_{\mathrm{mix}}$ time steps. Then we choose a fixed solution for all $\tau_{\mathrm{mix}}$ time steps of an episode. To be more specific, we take the solution prepared for the first time step of an episode and use it for the rest of the episode. At the end of each episode, we observe a data sample that gives rise to an objective function and a cost function. Based on these functions, we prepare the solution for the next episode following the drift-plus-penalty update.
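The episode structure just described can be sketched as follows, again on a toy 1-D problem of my own (fixed data, Euclidean Bregman divergence, box domain) rather than the paper's experiment: the iterate is frozen for each episode of $\tau_{\mathrm{mix}}$ time steps, and one drift-plus-penalty update is made per episode using only the episode's last sample.

```python
import numpy as np

def dpp_data_drop(T=2500, tau_mix=5, beta=0.5, lo=-2.0, hi=2.0):
    """Sketch of Algorithm 2's episode loop on a toy problem (assumed setup):
    minimize E[(x - xi)^2] s.t. E[x - zeta] <= 0 with xi = 1, zeta = 0.5,
    so the constrained optimum is 0.5. Parameters V_k = k^beta, alpha_k = k,
    as if the subsampled chain had mixing time 1."""
    x, Q = 0.0, 0.0
    xs = []
    for k in range(1, T // tau_mix + 1):
        xs.extend([x] * tau_mix)            # hold x for the whole episode
        xi, zeta = 1.0, 0.5                 # only the episode's last sample is used
        V_k, alpha_k = k ** beta, float(k)
        grad_f, grad_g = 2.0 * (x - xi), 1.0
        x_next = np.clip(x - (V_k * grad_f + Q * grad_g) / alpha_k, lo, hi)
        Q = max(0.0, Q + (x - zeta) + grad_g * (x_next - x))
        x = x_next
    return np.mean(xs[-1000:]), Q

tail_avg, Q_final = dpp_data_drop()
print(tail_avg, Q_final)
```

Note that only one of every $\tau_{\mathrm{mix}}$ samples influences the updates; the other observations are dropped, which is exactly the downsampling trade-off discussed above.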
Note that the penalty and step size parameters are chosen as if the mixing time of the underlying Markov chain were 1. This is justified because we take a data sample in every batch of $\tau_{\mathrm{mix}}$ steps, and the resulting subsequence itself gives rise to a Markov chain with mixing time 1. That said, Algorithm 2 can be interpreted as Algorithm 1 applied to a Markov chain with mixing time 1 for $T/\tau_{\mathrm{mix}}$ time steps.

Theorem 4 Suppose that Assumptions 1 and 2 hold. Then for online convex optimization with ergodic constraints, Algorithm 2 achieves
$$\mathbb{E}[\mathrm{Regret}(T)] = O\big(\tau_{\mathrm{mix}}^{\beta}\,T^{1-\beta}\big), \qquad \mathbb{E}[\mathrm{Violation}(T)] = O\big(\tau_{\mathrm{mix}}^{(1-\beta)/2}\,T^{(\beta+1)/2}\big).$$

Proof Running Algorithm 1 for $T/\tau_{\mathrm{mix}}$ time steps with a Markov chain of mixing time 1 incurs an expected regret of $O(\tau_{\mathrm{mix}}^{\beta-1} T^{1-\beta})$ and a constraint violation of $O(\tau_{\mathrm{mix}}^{-(1+\beta)/2} T^{(\beta+1)/2})$. As Algorithm 2 takes a fixed solution for the $\tau_{\mathrm{mix}}$ time steps of each episode, the regret under Algorithm 2 is what is obtained after multiplying the regret of Algorithm 1 by $\tau_{\mathrm{mix}}$. This gives us the desired regret upper bound of $O(\tau_{\mathrm{mix}}^{\beta} T^{1-\beta})$. Likewise, the constraint violation under Algorithm 2 is $O(\tau_{\mathrm{mix}}^{(1-\beta)/2} T^{(\beta+1)/2})$.

Moreover, based on Theorem 2, we may deduce the following bounds on the optimality gap and the feasibility gap under Algorithm 2.

Theorem 5 Suppose that Assumptions 1 and 2 hold. Then for stochastic-constrained stochastic optimization (SCSO), Algorithm 2 guarantees that
$$\mathbb{E}[\mathrm{Gap}(\bar x_T)] = O\big(\tau_{\mathrm{mix}}^{\beta}\,T^{-\beta} + \tau_{\mathrm{mix}}^{(1-\beta)/2}\,T^{-(1-\beta)/2} + \tau_{\mathrm{mix}}\,T^{-1}\big), \qquad \mathbb{E}[\mathrm{Infeasibility}(\bar x_T)] = O\big(\tau_{\mathrm{mix}}^{(1-\beta)/2}\,T^{-(1-\beta)/2} + \tau_{\mathrm{mix}}\,T^{-1}\big),$$
where $\bar x_T = \sum_{t=1}^{T} x_t / T$ and the expectation is taken with respect to the randomness in running the algorithm.

Lastly, we derive the following result under Slater's condition.

Theorem 6 Suppose that Assumptions 1-3 hold.
Then for online convex optimization with ergodic constraints and stochastic-constrained stochastic optimization, Algorithm 2 with $\beta = 1/2$ guarantees
$$\mathbb{E}[\mathrm{Regret}(T)] = O\big(\sqrt{\tau_{\mathrm{mix}} T}\big), \qquad \mathbb{E}[\mathrm{Violation}(T)] = O\big(\sqrt{\tau_{\mathrm{mix}} T} + \tau_{\mathrm{mix}}^{3/2}\big),$$
$$\mathbb{E}[\mathrm{Gap}(\bar x_T)] = O\big(\sqrt{\tau_{\mathrm{mix}}/T}\big), \qquad \mathbb{E}[\mathrm{Infeasibility}(\bar x_T)] = O\big(\sqrt{\tau_{\mathrm{mix}}/T} + \tau_{\mathrm{mix}}^{3/2}\,T^{-3/2}\big),$$
where the expectation is taken with respect to the randomness in running the algorithm.

4. Unknown Mixing Time

Next, we study the setting where the mixing time $\tau_{\mathrm{mix}}$ is not observable. Even if we do not know the mixing time of the underlying Markov chain, we would still like a learning algorithm with performance guarantees of similar orders. To achieve this goal, we develop yet another variant of the drift-plus-penalty algorithm, which incorporates the multi-level Monte Carlo (MLMC) gradient estimation scheme (Giles, 2015; Blanchet and Glynn, 2015; Dorfman and Levy, 2022). Dorfman and Levy (2022) first introduced the approach of combining stochastic gradient descent with the MLMC gradient estimation framework for stochastic optimization with no stochastic functional constraint. For the case of known mixing time, we may update the step size based on the mixing time $\tau_{\mathrm{mix}}$ to achieve an optimal dependence on the parameter. When $\tau_{\mathrm{mix}}$ is not known, Dorfman and Levy (2022) used AdaGrad-based adaptive step sizes (Duchi et al., 2011; Levy, 2017; Ward et al., 2019). We take this idea to develop an AdaGrad variant of the drift-plus-penalty algorithm for SCSO, described in Algorithm 3, that incorporates the MLMC gradient estimation framework. The AdaGrad version of DPP itself is of independent interest.

4.1 Multi-Level Monte Carlo Sampling

The idea behind the multi-level Monte Carlo estimation scheme is to obtain many consecutive samples from an ergodic Markov chain and take their average. At the same time, we may control the expected number of consecutive samples required for each time step by $O(\log T)$.
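Under the construction made precise below (draw $J_t \sim \mathrm{Geom}(1/2)$, set $N_t = 2^{J_t}$ truncated to 1 when $2^{J_t} > T^2$, and form the estimator $f^1_t + N_t(f^{N_t}_t - f^{N_t/2}_t)$ when $N_t > 1$), the scheme and its $O(\log T)$ expected sample count can be sketched numerically. The demo below feeds i.i.d. Gaussian draws in place of an actual Markov chain, which is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlmc_mean(draw, T):
    """One MLMC estimate of E[f] from `draw(n)` returning n consecutive samples.
    N = 2^J with J ~ Geom(1/2), truncated to 1 when 2^J > T^2; the estimate is
    f^1 + N * (f^N - f^{N/2}) when N > 1, matching the doubling construction."""
    J = rng.geometric(0.5)          # J >= 1, P(J = j) = 2^{-j}
    N = 2 ** J if 2 ** J <= T * T else 1
    samples = draw(N)
    est = samples[0]
    if N > 1:
        est += N * (samples[:N].mean() - samples[: N // 2].mean())
    return est, N

T = 100
draw = lambda n: rng.normal(loc=2.0, scale=1.0, size=n)  # i.i.d. stand-in for the chain
results = [mlmc_mean(draw, T) for _ in range(20000)]
ests = np.array([r[0] for r in results])
counts = np.array([r[1] for r in results])
print(ests.mean(), counts.mean())
```

Averaged over many runs, the estimate is close to the true mean 2.0, and the average number of samples per call stays near $2\log_2 T + 1$ even though a single call can consume up to $T^2$ samples.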
More precisely, for each time step $t$, we draw $N_t$ samples $\xi_t^{(1)}, \ldots, \xi_t^{(N_t)}$, where $N_t$ itself is a random variable given by
$$N_t = \begin{cases} \widetilde N_t, & \text{if } \widetilde N_t \le T^2, \\ 1, & \text{otherwise}, \end{cases}$$
with $\widetilde N_t = 2^{J_t}$ and $J_t \sim \mathrm{Geom}(1/2)$. Note that in our case, the condition is $\widetilde N_t \le T^2$, so the bound on $N_t$ is $T^2$, while it was set to $T$ in Dorfman and Levy (2022). With this sampling strategy, we define $\mathcal{F}_t$ as the $\sigma$-field
$$\mathcal{F}_t = \sigma\bigg(\{N_1, \ldots, N_t\} \cup \bigcup_{s=1}^{t}\big\{\xi_s^{(1)}, \ldots, \xi_s^{(N_s)}\big\}\bigg).$$
Let $\mathbb{E}_t[\cdot]$ denote the conditional expectation with respect to $\mathcal{F}_t$, i.e., $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_t]$. Next, for $t \ge 1$ and $N \ge 1$, we define
$$f^N_t(x) := \frac{1}{N}\sum_{i=1}^{N} f\big(x, \xi_t^{(i)}\big), \qquad g^N_t(x) := \frac{1}{N}\sum_{i=1}^{N} g\big(x, \xi_t^{(i)}\big).$$
Based on this, we define the MLMC estimators of $f$ and $g$ as
$$(f_t, g_t) = (f^1_t, g^1_t) + \begin{cases} N_t\Big(\big(f^{N_t}_t, g^{N_t}_t\big) - \big(f^{N_t/2}_t, g^{N_t/2}_t\big)\Big), & \text{if } N_t > 1, \\ 0, & \text{otherwise}. \end{cases}$$
Basically, the functions $f_t$ and $g_t$ are obtained after applying the MLMC estimation scheme to the underlying ergodic Markov chain. One thing to note, however, is that $f_t$ and $g_t$ are not necessarily convex anymore (Dorfman and Levy, 2022). To remedy this issue, what we can argue instead is that $\mathbb{E}_{t-1}[f_t]$ and $\mathbb{E}_{t-1}[g_t]$ are convex. Based on (Dorfman and Levy, 2022, Lemma 3.1), we deduce the following lemma.

Lemma 7 Let $j_{\max} := \max\{j \in \mathbb{N} : 2^j \le T^2\} = \lfloor 2\log_2 T \rfloor$. Then for each $t$,
$$\mathbb{E}_{t-1}[f_t] = \mathbb{E}_{t-1}\big[f^{2^{j_{\max}}}_t\big], \quad \mathbb{E}_{t-1}[\nabla f_t] = \mathbb{E}_{t-1}\big[\nabla f^{2^{j_{\max}}}_t\big], \quad \mathbb{E}_{t-1}[g_t] = \mathbb{E}_{t-1}\big[g^{2^{j_{\max}}}_t\big], \quad \mathbb{E}_{t-1}[\nabla g_t] = \mathbb{E}_{t-1}\big[\nabla g^{2^{j_{\max}}}_t\big].$$
Moreover, we have
$$\mathbb{E}\big[\|\nabla f_t(x_t)\|^2\big] = O(F^2\tau_{\mathrm{mix}}), \qquad \mathbb{E}\big[\|\nabla g_t(x_t)\|^2\big] = O(G^2\tau_{\mathrm{mix}}), \qquad \mathbb{E}\big[|g_t(x_t)|^2\big] = O(H^2\tau_{\mathrm{mix}}).$$
Lastly, the expected number of samples for time step $t$ satisfies $\mathbb{E}[N_t] \le 2\log_2 T + 1$.

The reason for setting the upper bound $T^2$ on $N_t$ instead of $T$ is to achieve high estimation accuracy for $f_t$, $g_t$, $\nabla f_t$, and $\nabla g_t$, which leads to the desired performance guarantees of Algorithm 3. To be specific, we use the following estimation bounds based on (Dorfman and Levy, 2022, Lemma A.6).
Lemma 8 There exists $C(T) > 0$ with $C(T) = O\big((\log T \cdot \log(\tau_{\mathrm{mix}} T^2))^{1/2}\big)$ such that
$$\Big|\mathbb{E}_{t-1}\big[f^{2^{j_{\max}}}_t(x)\big] - \bar f(x)\Big|^2 \le \frac{C(T)^2\,\tau_{\mathrm{mix}}}{T^2}, \qquad \mathbb{E}_{t-1}\Big[\big|f^{2^{j_{\max}}}_t(x) - \bar f(x)\big|^2\Big] \le \frac{C(T)^2\,\tau_{\mathrm{mix}}}{T^2},$$
$$\Big|\mathbb{E}_{t-1}\big[g^{2^{j_{\max}}}_t(x)\big] - \bar g(x)\Big|^2 \le \frac{C(T)^2\,\tau_{\mathrm{mix}}}{T^2}, \qquad \mathbb{E}_{t-1}\Big[\big|g^{2^{j_{\max}}}_t(x) - \bar g(x)\big|^2\Big] \le \frac{C(T)^2\,\tau_{\mathrm{mix}}}{T^2}$$
hold for any $x \in X$ that is measurable with respect to $\mathcal{F}_{t-1}$ and any $t \in [T]$. The complete proof of this lemma is given in Appendix E.

4.2 Adaptive Drift-Plus-Penalty

The second component of our algorithm for the unknown mixing time setting is the AdaGrad variant of the drift-plus-penalty algorithm. To develop AdaGrad-style step sizes, let us define the following sequence of parameters. For some positive constant $\delta > 0$, we set
$$a_0 = S_0 = \delta, \qquad a_t := \frac{F_t^2}{4} + R^2 G_t^2 + H_t^2, \qquad S_t := S_{t-1} + a_t$$
for $t \ge 1$. Here, we may choose any positive number for $\delta$. Recall that $F_t = \|\nabla f_t(x_t)\|$, $G_t = \|\nabla g_t(x_t)\|$, and $H_t = |g_t(x_t)|$. Based on these parameters, we set the algorithm parameters $V_t$ and $\alpha_t$ as
$$V_t = \frac{S_{t-1}^{\beta}}{R}, \qquad \alpha_t = S_{t-1} \tag{1}$$
for some $0 < \beta \le 1/2$. Here, the penalty parameter $V_t$ and the step size parameter $\alpha_t$ can be chosen without knowledge of the global upper bounds $F$, $G$, $H$ on $F_t$, $G_t$, $H_t$.

Now we are ready to describe our algorithm, which we call the MLMC adaptive drift-plus-penalty (MDPP) algorithm. MDPP is a combination of the AdaGrad-style adaptive drift-plus-penalty algorithm with the MLMC estimator presented in the previous subsection.

Algorithm 3 MLMC Adaptive Drift-Plus-Penalty (MDPP)
Initialize: initial iterate $x_1 \in X$, $Q_1 = 0$, and parameters $0 < \beta \le 1/2$, $\delta > 0$.
for $t = 1$ to $T$ do
    Observe $f_t$ and $g_t$ via the MLMC method.
    Set the penalty parameter $V_t$ and the step size parameter $\alpha_t$ as in (1).
    Primal update: set $x_{t+1} = \operatorname*{argmin}_{x\in X}\big\{(V_t \nabla f_t(x_t) + Q_t \nabla g_t(x_t))^{\top} x + \alpha_t D(x, x_t)\big\}$.
    Dual update: set $Q_{t+1} = \big[Q_t + g_t(x_t) + \nabla g_t(x_t)^{\top}(x_{t+1} - x_t)\big]_+$.

More importantly, the algorithm is designed to solve constrained online convex optimization where the MLMC estimators $f_1, \ldots, f_T$ are the objective loss functions and the MLMC estimators $g_1, \ldots, g_T$ are the constraint functions. The following lemma provides an adaptive regret guarantee and an adaptive constraint violation bound for Algorithm 3 for the associated online convex optimization problem.

Lemma 9 Suppose that Assumptions 1 and 2 hold. Then for the constrained online convex optimization problem where the MLMC estimators $f_1, \ldots, f_T$ are the objective loss functions and the MLMC estimators $g_1, \ldots, g_T$ are the constraint functions, Algorithm 3 achieves the following. For any $x \in X$ with $\bar g(x) \le 0$, we have
$$\mathbb{E}\bigg[\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x)\bigg] = O\Big(\mathbb{E}[S_T]^{1-\beta} + \tau_{\mathrm{mix}}^{1/2}\,\mathbb{E}[S_T]^{1/2-\beta} + \tau_{\mathrm{mix}}^{1/2}\,T^{1/4}\,\mathbb{E}[S_T]^{1/4-\beta/2} + \tau_{\mathrm{mix}}^{1/2}\,T^{1-\beta}\Big),$$
$$\mathbb{E}\bigg[\sum_{t=1}^{T} g_t(x_t)\bigg] = O\bigg(\mathbb{E}[S_T]^{1/2} + T^{1/4}\,\mathbb{E}[S_T]^{\beta/2+1/4} + \tau_{\mathrm{mix}}^{1/2} + \frac{\mathbb{E}[S_T] + T^{1/2}\,\mathbb{E}[S_T]^{\beta+1/2}}{\tau_{\mathrm{mix}}^{\beta/2+1/4}\,T^{\beta/2+1/2}} + \frac{\tau_{\mathrm{mix}}^{1/2} + \mathbb{E}[S_T]^{1/2} + \mathbb{E}[S_T]^{\beta/2+1/4}}{\tau_{\mathrm{mix}}^{\beta/2+1/4}\,T^{\beta/2+1/2}} + (\log S_T)^2\,\tau_{\mathrm{mix}}^{\beta/2-1/4}\,T^{\beta/2+1/2}\bigg).$$

Recall that the parameters $V_t$ and $\alpha_t$ depend on the parameter $\delta$. Here, we may pick any positive number for $\delta$, but its choice does affect the performance of Algorithm 3. Although the bounds given in Lemma 9 do not exhibit an explicit dependence on $\delta$, our proof of Lemma 9 in Section 7.2 reveals that increasing $\delta$ increases the objective gap $\mathbb{E}[\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x)]$ and decreases the constraint violation $\mathbb{E}[\sum_{t=1}^{T} g_t(x_t)]$. Likewise, decreasing $\delta$ decreases the objective gap and increases the constraint violation.

Lemma 7 implies that $\mathbb{E}[S_T] = O(\tau_{\mathrm{mix}} T)$. Plugging this bound on $\mathbb{E}[S_T]$ into the adaptive performance guarantees given in Lemma 9, we deduce the following result.

Proposition 10 Suppose that Assumptions 1 and 2 hold. Then for the constrained online convex optimization problem where the MLMC estimators $f_1, \ldots, f_T$ are the objective loss functions and the MLMC estimators $g_1, \ldots, g_T$ are the constraint functions, Algorithm 3 achieves the following. For any $x \in X$ with $\bar g(x) \le 0$, we have
$$\mathbb{E}\bigg[\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x)\bigg] = O\big(\tau_{\mathrm{mix}}^{1-\beta}\,T^{1-\beta}\big), \qquad \mathbb{E}\bigg[\sum_{t=1}^{T} g_t(x_t)\bigg] = O\big(\tau_{\mathrm{mix}}^{\beta/2+1/4}\,T^{\beta/2+1/2} + \tau_{\mathrm{mix}}^{3/4-\beta/2}\,T^{1/2-\beta/2}\big).$$
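A tiny sketch of the parameter recursion of Section 4.2 follows. It is written under my reading of the schedule, namely $a_t = F_t^2/4 + R^2 G_t^2 + H_t^2$, $S_t = S_{t-1} + a_t$ with $a_0 = S_0 = \delta$, $V_t = S_{t-1}^{\beta}/R$, and $\alpha_t = S_{t-1}$; the exact form of $\alpha_t$ is hard to recover from this rendering of the display, so treat it as an assumption. The point is that only observed quantities feed the schedule, so no global bounds $F$, $G$, $H$ are needed.

```python
import math

def adagrad_params(grad_f_norms, grad_g_norms, g_abs, R=1.0, beta=0.5, delta=1.0):
    """AdaGrad-style schedule from observed F_t = ||grad f_t(x_t)||,
    G_t = ||grad g_t(x_t)||, H_t = |g_t(x_t)|. Sketch under the reading
    V_t = S_{t-1}^beta / R, alpha_t = S_{t-1} (an assumption)."""
    S = delta                               # S_0 = a_0 = delta
    schedule = []
    for F_t, G_t, H_t in zip(grad_f_norms, grad_g_norms, g_abs):
        V_t = S ** beta / R                 # uses S_{t-1}
        alpha_t = S                         # uses S_{t-1}
        schedule.append((V_t, alpha_t))
        S += F_t ** 2 / 4 + R ** 2 * G_t ** 2 + H_t ** 2   # S_t = S_{t-1} + a_t
    return S, schedule

# Constant observations F_t = 2, G_t = 1, H_t = 0 give a_t = 2 each step.
S, sched = adagrad_params([2.0, 2.0, 2.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
print(S, sched)
```

With $\delta = 1$ and $R = 1$, this yields $S_3 = 7$ and $(V_t, \alpha_t) = (1, 1), (\sqrt{3}, 3), (\sqrt{5}, 5)$, so both parameters grow with the accumulated observed magnitudes, mirroring the EDPP choice $V_t = (\tau_{\mathrm{mix}} t)^{\beta}$, $\alpha_t = \tau_{\mathrm{mix}} t$ without knowing $\tau_{\mathrm{mix}}$.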
We remark that the performance bounds given in Lemma 9 and Proposition 10 are not directly comparable to the regret and constraint violation bounds for stochastic-constrained stochastic optimization. When each pair of loss and constraint functions for stochastic-constrained stochastic optimization corresponds to a single data point, the associated regret and constraint violation measures are given by
$$\mathrm{Regret}(T) = \sum_{t=1}^{T}\sum_{j=1}^{N_t} f^{(j)}_t(x_t) - \sum_{t=1}^{T}\sum_{j=1}^{N_t} f^{(j)}_t(x^\star), \qquad \mathrm{Violation}(T) = \sum_{t=1}^{T}\sum_{j=1}^{N_t} g^{(j)}_t(x_t).$$
Here, $f^{(1)}_t, \ldots, f^{(N_t)}_t$ are the $N_t$ sampled functions from which we derive the MLMC estimator $f_t$ for $t \in [T]$. In contrast, Lemma 9 and Proposition 10 analyze the performance of Algorithm 3 on the sequence of MLMC estimators. Nevertheless, it would be an interesting question to understand the performance of Algorithm 3 for the latter online convex optimization setting, where each function pair corresponds to a single data point.

The proof of Lemma 9 is given in Section 7. One of the main components of the analysis is to provide adaptive bounds on the terms $\mathbb{E}[Q_t]$ and $\mathbb{E}[Q_t/V_t]$. This is possible thanks to our careful choice of the parameters $V_t$ and $\alpha_t$. It turns out that the AdaGrad-style analysis for drift-plus-penalty is more sophisticated than that for unconstrained stochastic gradient descent.

Finally, we state the following theorem providing upper bounds on the optimality gap and the feasibility gap under Algorithm 3 for SCSO. The argument is to use the results of Proposition 10 and the estimation error bounds due to Lemma 8. In contrast to the setting of Section 3, for which we had to rely on the mixing property of ergodic Markov chains directly, the MLMC estimators are already close to $\bar f$ and $\bar g$.

Theorem 11 Suppose that Assumptions 1 and 2 hold.
Then for stochastic-constrained stochastic optimization (SCSO), Algorithm 3 guarantees that
$$\mathbb{E}[\mathrm{Gap}(\bar x_T)] = O\big(\tau_{\mathrm{mix}}^{1-\beta}\,T^{-\beta}\big), \qquad \mathbb{E}[\mathrm{Infeasibility}(\bar x_T)] = O\big(\tau_{\mathrm{mix}}^{(2\beta+1)/4}\,T^{-(1-\beta)/2} + \tau_{\mathrm{mix}}^{(3-2\beta)/4}\,T^{-(\beta+1)/2}\big).$$

state \ label | -1 | 1
1 | purple | yellow
2 | blue | orange
3 | green | red

Table 2: Colors for (State, Label) Pairs

5. Numerical Experiments

We examine the performance of the ergodic drift-plus-penalty algorithm (Algorithm 1) for the known mixing time case and the MLMC adaptive drift-plus-penalty algorithm (Algorithm 3) for the unknown mixing time case on a linear classification problem with fairness constraints using synthetic data. We follow the experimental setup of Zafar et al. (2019), adopting their procedure for creating data points and sensitive features (with $\varphi = \pi/2$) and imposing fairness constraints. To make our Markov chain more meaningful and sufficiently complex, we experiment with a 3-state chain instead of the 2-state chain of Dorfman and Levy (2022). The 3-state Markov chain is given by the transition matrix
$$P = \begin{pmatrix} 1-2p & p & p \\ p & 1-2p & p \\ p & p & 1-2p \end{pmatrix},$$
which has stationary distribution $(1/3, 1/3, 1/3)$.

Theorem 12 (Levin and Peres, 2017, Theorems 12.4 and 12.5) For an ergodic and reversible Markov chain with $n$ states, whose transition matrix is $P$, let $1 = \lambda_1 > \lambda_2 \ge \cdots \ge \lambda_n$ be the eigenvalues of $P$ and let $\mu_{\min}$ be the minimum entry of the stationary distribution. Then the mixing time of the chain satisfies
$$\frac{|\lambda_2|}{1 - \max\{|\lambda_2|, |\lambda_n|\}}\,\log 2 \;\le\; \tau_{\mathrm{mix}} \;\le\; \frac{1}{1 - \max\{|\lambda_2|, |\lambda_n|\}}\,\log\frac{4}{\mu_{\min}}.$$

Since the eigenvalues of our transition matrix are $1$ and $1 - 3p$ (with multiplicity two), it follows that the mixing time of the 3-state Markov chain satisfies
$$\frac{1-3p}{3p}\,\log 2 \;\le\; \tau_{\mathrm{mix}} \;\le\; \frac{\log 12}{3p}.$$
In our experiment, we used $1/(3p)$ as an approximation of the mixing time. For each set of data points, we generate two clusters, each of which has 1,000 data points in $\mathbb{R}^2$ sampled from a multivariate normal distribution. All data points from a cluster have the same label in $\{-1, 1\}$.
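The eigenvalue-based mixing-time bounds for the 3-state chain above can be checked numerically. The sketch below builds $P$ for $p = 0.001$, confirms the uniform stationary distribution and the eigenvalues $\{1, 1-3p, 1-3p\}$, and verifies that the approximation $1/(3p)$ falls between the two bounds of Theorem 12 (λ* denotes the second-largest eigenvalue modulus).

```python
import numpy as np

p = 0.001
P = np.array([[1 - 2 * p, p, p],
              [p, 1 - 2 * p, p],
              [p, p, 1 - 2 * p]])

# Stationary distribution is uniform, since P is symmetric (doubly stochastic).
pi = np.full(3, 1 / 3)
assert np.allclose(pi @ P, pi)

# P is symmetric, so its eigenvalues are real: 1 and 1 - 3p (twice).
eigs = np.sort(np.linalg.eigvalsh(P))        # ascending order
lam_star = max(abs(eigs[0]), abs(eigs[1]))   # second-largest modulus = 1 - 3p

lower = lam_star / (1 - lam_star) * np.log(2)        # eigenvalue lower bound
upper = 1 / (1 - lam_star) * np.log(4 / pi.min())    # Levin-Peres upper bound
print(lower, 1 / (3 * p), upper)
```

For $p = 0.001$ this prints roughly $230 \le 1/(3p) \approx 333 \le 828$, which is why $1/(3p)$ is a reasonable stand-in for $\tau_{\mathrm{mix}}$ in the experiments.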
We denote the index set corresponding to each state $j \in \{1, 2, 3\}$ by $D_j$ and the whole index set by $D = D_1 \cup D_2 \cup D_3$. The data clusters are drawn in Figure 1. The color corresponding to each state and label is summarized in Table 2. We then generate a binary sensitive feature $z_i \in \{0, 1\}$ (e.g., gender) randomly for each data point $x_i \in \mathbb{R}^d$. We want the binary sensitive feature of the data points to have low covariance with the outputs of our classifier. More details about how we create sensitive features are included in the supplement.

Figure 1: Data Points

We use logistic regression classifiers with the loss functions
$$f_j(w, b) = \frac{1}{|D_j|}\sum_{i \in D_j}\log\big(1 + e^{-y_i(w^\top x_i + b)}\big)$$
and the constraint functions
$$g_j(w, b) = \frac{1}{|D_j|}\sum_{i \in D_j}(z_i - \bar z)(w^\top x_i + b) - c, \qquad h_j(w, b) = -\frac{1}{|D_j|}\sum_{i \in D_j}(z_i - \bar z)(w^\top x_i + b) - c$$
for each state $j$, where $\bar z = \sum_{i \in D} z_i / |D|$ and $c > 0$. Then the stochastic-constrained stochastic optimization problem is to minimize the usual logistic regression loss function under fairness constraints, which was proposed by Zafar et al. (2019), as follows:
$$\min_{(w,b) \in X}\; \frac{1}{|D|}\sum_{i \in D}\log\big(1 + e^{-y_i(w^\top x_i + b)}\big) \quad \text{s.t.} \quad -c \;\le\; \frac{1}{|D|}\sum_{i \in D}(z_i - \bar z)(w^\top x_i + b) \;\le\; c.$$
Solving this problem with our framework can be viewed as the distributed optimization scheme of Ram et al. (2009b) and Johansson et al. (2010). Basically, there are agents 1, 2, and 3 sharing $\bar z$, and agent $i$ holds data $D_i$. After we update our parameters $w_t$ and $b_t$, they are sent to agent $i_t$, and the agent sends back the values and gradients of $f_{i_t}$, $g_{i_t}$, and $h_{i_t}$ at $(w_t, b_t)$. Here, we may impose that the sequence of selected agents gives rise to an ergodic Markov chain.

The experimental setup involves two constraints. Although we consider the single-constraint setting in the paper for simplicity, our results can be easily extended to the case with multiple constraint functions $g^1, \ldots, g^n$. For each $g^i$, we obtain sampled functions $g_{t,i}$ for $t \in [T]$.
Then for each $t$, we update $x_t$ and $\{Q_{t,i}\}_{i=1}^{n}$ as
$$x_{t+1} = \operatorname*{argmin}_{x \in X}\bigg\{\Big(V_t \nabla f_t(x_t) + \sum_{i=1}^{n} Q_{t,i}\,\nabla g_{t,i}(x_t)\Big)^{\top} x + \alpha_t D(x, x_t)\bigg\},$$
$$Q_{t+1,i} = \Big[Q_{t,i} + g_{t,i}(x_t) + \nabla g_{t,i}(x_t)^{\top}(x_{t+1} - x_t)\Big]_+, \qquad i = 1, \ldots, n.$$
For Algorithm 1, we use the parameters $V_t = (\tau_{\mathrm{mix}} t)^{\beta}$ and $\alpha_t = \tau_{\mathrm{mix}} t$ as before. For Algorithm 3, we define the MLMC estimator $g_{t,i}$ using $\{g^{(j)}_{t,i}\}_{j=1}^{N_t}$ for each $i \in [n]$ and define
$$a_0 = S_0 = \delta, \qquad a_t = \frac{F_t^2}{4} + \sum_{i=1}^{n} R^2 G_{t,i}^2 + \sum_{i=1}^{n} H_{t,i}^2, \qquad S_t = S_{t-1} + a_t,$$
where $G_{t,i} = \|\nabla g_{t,i}(x_t)\|$ and $H_{t,i} = |g_{t,i}(x_t)|$. The parameters $V_t$ and $\alpha_t$ are defined as in (1).

We compare our DPP-based algorithms with existing algorithms developed for stochastic-constrained stochastic optimization with i.i.d. data. The list of algorithms that we tested is as follows.
PD: primal-dual method by Mahdavi et al. (2012).
PD2: primal-dual method by Jenatton et al. (2016).
DPP: drift-plus-penalty algorithm by Yu et al. (2017).
EDPP-t: ergodic drift-plus-penalty (Algorithm 1).
EDPP-T: modification of Algorithm 1 with the non-adaptive parameters $V = (\tau_{\mathrm{mix}} T)^{\beta}$ and $\alpha = \tau_{\mathrm{mix}} T$.
MDPP: MLMC adaptive drift-plus-penalty (Algorithm 3).

For MDPP, we observed that the MLMC estimator usually has a high variance in practice, making experimental results unstable. Hence, we truncated MLMC sampling so that the number of samples per iteration is at most $2^4$. We chose $\delta = F^2/4 + 2R^2G^2 + 2H^2$, for which we computed constants $F, G, H > 0$ such that $F_t \le F$, $G_{t,i} \le G$, and $H_{t,i} \le H$. The results on the regret and the cumulative constraint violations are shown in Figures 4 and 5, respectively. We set the parameters to $p = 0.001$ and $c = 0.5$, with which we ran the listed algorithms with the same initial parameters and sequence of states. We first ran MDPP for 25,000 iterations, which created 101,034 samples. The results on the optimality gap are summarized in Figure 2, and the results on the infeasibility are presented in Figure 3 and Table 3.
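As a concrete reference for the quantities tracked in the figures, the logistic loss and the decision-boundary covariance constraint of the formulation above can be evaluated as follows. The two-point dataset below consists of my own toy numbers, not the experiment's data.

```python
import numpy as np

def logistic_loss(w, b, X, y):
    """(1/|D|) * sum_i log(1 + exp(-y_i (w^T x_i + b)))."""
    margins = y * (X @ w + b)
    return np.mean(np.log1p(np.exp(-margins)))

def fairness_constraint(w, b, X, z, c):
    """Returns (g, h): g = (1/|D|) sum_i (z_i - zbar)(w^T x_i + b) - c, and
    h is the mirrored constraint with the covariance term negated."""
    zbar = z.mean()
    cov = np.mean((z - zbar) * (X @ w + b))
    return cov - c, -cov - c

# Tiny hand-made data: two points with opposite features and opposite groups.
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
z = np.array([1.0, 0.0])
w, b, c = np.array([1.0]), 0.0, 0.5

loss = logistic_loss(w, b, X, y)
g_val, h_val = fairness_constraint(w, b, X, z, c)
print(loss, g_val, h_val)
```

Here the covariance between group membership and the classifier output is exactly $0.5$, so with $c = 0.5$ the constraint $g \le 0$ is tight ($g = 0$) while the mirrored constraint is slack ($h = -1$); both margins equal 1, giving loss $\log(1 + e^{-1})$.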
As shown in the figures, our algorithms (EDPP-T, EDPP-t, MDPP) outperform the other algorithms in terms of the optimality gap. DPP also shows a good optimality gap, but it ends with a positive constraint 1 infeasibility. In contrast, EDPP-T, EDPP-t, and MDPP all end with a negative constraint infeasibility.

Figure 2: Optimality Gap (Left), Enlarged Figure around 5,000-30,000 Samples (Right)

Figure 3: Constraint 1 Infeasibility (Left), Constraint 2 Infeasibility (Right)

Algorithm | Final value of constraint 1 infeasibility
PD | 0.0533
PD2 | 0.4997
DPP-T | 0.0622
DPP-t | 0.0800
EDPP-T | 0.2042
EDPP-t | 0.0904
MDPP | 0.1536

Table 3: Final Values of Constraint 1 Infeasibility

Figure 4: Regret

Figure 5: Constraint 1 Cumulative Violation (Left), Constraint 2 Cumulative Violation (Right)

Note that after 20,000 samples, EDPP-T achieves the smallest optimality gap, followed by EDPP-t and MDPP. However, Figure 3 shows that EDPP-T incurs a significantly higher infeasibility for constraint 1, given by
$$\frac{1}{|D|}\sum_{i \in D}(z_i - \bar z)(w^\top x_i + b) \le c,$$
than the other algorithms. In contrast, EDPP-t outperforms DPP-T and DPP-t in terms of both the optimality gap and the infeasibility measure. It is also interesting to see that DPP-T and DPP-t behave similarly, while DPP-t performs better than DPP-T. Figure 4 shows the regret values under the various algorithms for the online convex optimization setting where each pair $(f_t, g_t)$ of functions corresponds to one data sample. For that reason, we excluded MDPP, as it requires multiple samples in one round. In addition, we excluded PD2, which exhibits much higher regret values than the other algorithms, to better present the performance of the remaining algorithms. Figure 5 shows that EDPP-T incurs a positive cumulative constraint 1 violation, while the other algorithms result in a negative cumulative constraint violation.
We may check from Figures 4 and 5 that EDPP-t performs the best for online convex optimization with ergodic constraints.

6. Analysis of Ergodic Drift-Plus-Penalty for the Known Mixing Time Case

This section presents the proofs of Theorems 1 to 3 given in Section 3. Section 6.1 contains the proof of Theorem 1, which provides regret and constraint violation bounds for the formulation of online convex optimization with ergodic constraints. Then Section 6.2 presents the proof of Theorem 2, which gives bounds on the optimality gap and the feasibility gap for SCSO. In Section 6.3, we prove Theorem 3 for the case where Slater's condition is satisfied. As before, throughout this section, we denote by $\mathcal{F}_t$ the $\sigma$-field generated by the information accumulated up to time step $t$, that is, $\mathcal{F}_t = \sigma(\{\xi_1, \ldots, \xi_t\})$. Moreover, $\mathbb{E}_t[\cdot]$ in this section refers to the conditional expectation with respect to $\mathcal{F}_t$, i.e., $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_t]$ (in Sections 4 and 7, $\mathcal{F}_t$ denotes the $\sigma$-field generated up to time step $t$ under the MLMC estimation scheme). Furthermore, $P^{[t]}_s$ for $s > t$ denotes the probability measure of $\xi_s$ conditional on $\mathcal{F}_t$.

6.1 Ergodic Drift-Plus-Penalty for Online Convex Optimization with Ergodic Constraints

The three important components of our analysis are Lemma 14, which bounds the term $\mathbb{E}\big[\sum_{t=1}^{T} Q_t g_t(x)\big]$; Lemma 15, which provides an upper bound on the term $\mathbb{E}\big[\sum_{t=1}^{T}(Q_t/V_t)g_t(x)\big]$; and Lemma 16, which derives an upper bound on the expected queue size $\mathbb{E}[Q_t]$. Then, plugging the deduced bounds into the lemmas in Appendix A analyzing the general template of adaptive drift-plus-penalty, we prove Theorem 1. By the update rule and the convexity of $g_t$, we deduce the following straightforward bound on $|Q_{t+1} - Q_t|$.

Lemma 13 For $t \ge 1$, we have $-(H + GR) \le Q_{t+1} - Q_t \le H$.

Proof Note that we have
$$Q_{t+1} = \big[Q_t + g_t(x_t) + \nabla g_t(x_t)^{\top}(x_{t+1} - x_t)\big]_+ \le \big[Q_t + g_t(x_{t+1})\big]_+ \le Q_t + H,$$
where the first inequality uses the convexity of $g_t$. For the lower bound,
$$Q_{t+1} \ge Q_t + g_t(x_t) + \nabla g_t(x_t)^{\top}(x_{t+1} - x_t) \ge Q_t - |g_t(x_t)| - \|\nabla g_t(x_t)\|\,\|x_{t+1} - x_t\| \ge Q_t - H - GR,$$
as required.
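Lemma 13 can be sanity-checked numerically. The sketch below runs the dual update with randomly generated linear functions $g_t$ whose values are bounded by $H$ on a domain of diameter $R$ and whose slopes are bounded by $G$ (for linear $g_t$, the linearized value $g_t(x_t) + g_t'(x_t)(x_{t+1} - x_t)$ equals $g_t(x_{t+1})$ exactly), and checks every queue increment against the bounds; the constants are my own toy choices.

```python
import numpy as np

rng = np.random.default_rng(1)
H, G, R = 2.0, 2.0, 1.0          # |g_t| <= H on X, |g_t'| <= G, diam(X) = R
lo, hi = -R / 2, R / 2

Q = 0.0
ok = True
for _ in range(10000):
    a = rng.uniform(-G, G)                              # slope of a linear g_t
    b = rng.uniform(-(H - abs(a) * R / 2), H - abs(a) * R / 2)  # keeps |g_t| <= H on X
    x_t, x_next = rng.uniform(lo, hi, size=2)
    val = (a * x_t + b) + a * (x_next - x_t)            # linearized constraint value
    Q_next = max(0.0, Q + val)                          # dual update
    inc = Q_next - Q
    # Lemma 13: -(H + G R) <= Q_{t+1} - Q_t <= H
    ok &= (-(H + G * R) - 1e-9 <= inc <= H + 1e-9)
    Q = Q_next
print(ok)
```

Every increment lands inside $[-(H+GR), H]$, as the lemma asserts; the linear case only exercises the narrower range $[-H, H]$, while general convex $g_t$ can push the linearization down to $-H - GR$.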
As $Q_1 = 0$, it follows that $Q_t \le (t-1)H$. Recall that we denote $\tau = \tau_{\mathrm{mix}}(T^{-1})$. The next lemma provides a bound on the term $\sum_{t=1}^{T}\mathbb{E}[Q_t g_t(x)]$. As mentioned in Section 3.1, we need to take into account that $g_1, \ldots, g_T$ are not i.i.d. but are generated by an ergodic Markov chain.

Lemma 14 For any $x \in X$ such that $\bar g(x) \le 0$,
$$\sum_{t=1}^{T}\mathbb{E}[Q_t g_t(x)] \le 2H(H+GR)(\tau-1)T + \frac{2H}{T}\sum_{t=1}^{T}\mathbb{E}[Q_t].$$

Proof If $T < \tau$, then
$$\sum_{t=1}^{T}\mathbb{E}[Q_t g_t(x)] \le \sum_{t=1}^{T}(t-1)H^2 = \frac{T(T-1)H^2}{2} < H^2(\tau-1)T,$$
and the statement follows. If $T \ge \tau$, then
$$\sum_{t=1}^{T}\mathbb{E}[Q_t g_t(x)] \le \sum_{t=1}^{\tau-1}\mathbb{E}[Q_t g_t(x)] + \sum_{t=1}^{T-\tau+1}\mathbb{E}[Q_{t+\tau-1}\,g_{t+\tau-1}(x)]$$
$$\le \sum_{t=1}^{\tau-1}(t-1)H^2 + \sum_{t=1}^{T-\tau+1}\mathbb{E}[(Q_{t+\tau-1}-Q_t)\,g_{t+\tau-1}(x)] + \sum_{t=1}^{T-\tau+1}\mathbb{E}[Q_t\,g_{t+\tau-1}(x)]$$
$$\le \frac{(\tau-1)(\tau-2)H^2}{2} + \sum_{t=1}^{T-\tau+1}(\tau-1)H(H+GR) + \sum_{t=1}^{T-\tau+1}\mathbb{E}\big[Q_t\,\mathbb{E}_{t-1}[g_{t+\tau-1}(x) - \bar g(x)]\big]$$
$$\le \frac{(\tau-1)(2T-\tau)H(H+GR)}{2} + \sum_{t=1}^{T-\tau+1}\mathbb{E}\big[Q_t\,\mathbb{E}_{t-1}[g_{t+\tau-1}(x) - \bar g(x)]\big],$$
where the second inequality holds because $(Q_{t+\tau-1}-Q_t)g_{t+\tau-1}(x) \le (\tau-1)(H+GR)H$ due to Lemma 13, $\bar g(x) \le 0$, and $Q_t$ is $\mathcal{F}_{t-1}$-measurable. Here, to bound $\mathbb{E}_{t-1}[g_{t+\tau-1}(x) - \bar g(x)]$, we consider
$$\mathbb{E}_{t-1}[g_{t+\tau-1}(x) - \bar g(x)] = \mathbb{E}_{t-1}[g(x,\xi_{t+\tau-1}) - g(x,\xi)] \le \int_A H\,\big(dP^{[t-1]}_{t+\tau-1} - d\mu\big) + \int_{A^c} H\,\big(d\mu - dP^{[t-1]}_{t+\tau-1}\big) \le 2H\,\big\|P^{[t-1]}_{t+\tau-1} - \mu\big\|_{\mathrm{TV}},$$
where $P^{[t-1]}_{t+\tau-1}$ is the probability measure of $\xi_{t+\tau-1}$ conditional on $\mathcal{F}_{t-1}$, $\xi \sim \mu$, and $A$ is a suitably chosen measurable set. By the definition of $\tau = \tau_{\mathrm{mix}}(T^{-1})$, it follows that $\|P^{[t-1]}_{t+\tau-1} - \mu\|_{\mathrm{TV}} \le 1/T$, which implies $\mathbb{E}_{t-1}[g_{t+\tau-1}(x) - \bar g(x)] \le 2H/T$. From this, we deduce the desired statement.

Next we provide an upper bound on the term $\sum_{t=1}^{T}\mathbb{E}[(Q_t/V_t)g_t(x)]$, as we mentioned in Section 3.1.

Lemma 15 For any $x \in X$ such that $\bar g(x) \le 0$,
$$\sum_{t=1}^{T}\mathbb{E}\Big[\frac{Q_t}{V_t}\,g_t(x)\Big] \le \frac{2H(H+GR)(\tau+1)}{\tau_{\mathrm{mix}}^{\beta}(1-\beta)}\,(T+1)^{1-\beta}.$$

Proof Consider the case $T < \tau$ first. Note that
$$\sum_{t=1}^{T}\mathbb{E}\Big[\frac{Q_t}{V_t}\,g_t(x)\Big] \le \frac{H^2}{\tau_{\mathrm{mix}}^{\beta}}\sum_{t=1}^{T} t^{1-\beta} \le \frac{H^2}{\tau_{\mathrm{mix}}^{\beta}(2-\beta)}\,(T+1)^{2-\beta} \le \frac{2H^2(\tau+1)}{\tau_{\mathrm{mix}}^{\beta}(1-\beta)}\,(T+1)^{1-\beta},$$
where the second inequality is due to Corollary 32 and the third inequality is from $T < \tau$. Next we consider the case $T \ge \tau$. Note that
$$\sum_{t=1}^{T}\mathbb{E}\Big[\frac{Q_t}{V_t}\,g_t(x)\Big] \le \sum_{t=1}^{\tau-1}\mathbb{E}\Big[\frac{Q_t}{V_t}\,g_t(x)\Big] + \sum_{t=1}^{T-\tau+1}\mathbb{E}\Big[\frac{Q_{t+\tau-1}}{V_{t+\tau-1}}\,g_{t+\tau-1}(x)\Big].$$
Here the first term on the right-hand side can be bounded as
$$\sum_{t=1}^{\tau-1}\mathbb{E}\Big[\frac{Q_t}{V_t}\,g_t(x)\Big] \le \frac{H^2}{\tau_{\mathrm{mix}}^{\beta}}\sum_{t=1}^{\tau-1} t^{1-\beta} \le \frac{H^2}{\tau_{\mathrm{mix}}^{\beta}(2-\beta)}\,\tau^{2-\beta} \le \frac{H^2\tau}{\tau_{\mathrm{mix}}^{\beta}(1-\beta)}\,(T+1)^{1-\beta}$$
as above.
Moreover, the second term can be bounded as follows:
$$\sum_{t=1}^{T-\tau+1}\mathbb{E}\Big[\frac{Q_{t+\tau-1}}{V_{t+\tau-1}}\,g_{t+\tau-1}(x)\Big] \le \sum_{t=1}^{T-\tau+1}\mathbb{E}\Big[\frac{Q_{t+\tau-1}-Q_t}{V_{t+\tau-1}}\,g_{t+\tau-1}(x)\Big] + \sum_{t=1}^{T-\tau+1}\mathbb{E}\Big[\frac{Q_t}{V_{t+\tau-1}}\,g_{t+\tau-1}(x)\Big]$$
$$\le (\tau-1)H(H+GR)\sum_{t=1}^{T-\tau+1}\frac{1}{\tau_{\mathrm{mix}}^{\beta}(t+\tau-1)^{\beta}} + \sum_{t=1}^{T-\tau+1}\frac{1}{V_{t+\tau-1}}\,\mathbb{E}\big[Q_t\,\mathbb{E}_{t-1}[g_{t+\tau-1}(x) - \bar g(x)]\big]$$
$$\le \frac{(\tau-1)H(H+GR)}{\tau_{\mathrm{mix}}^{\beta}(1-\beta)}\big(T^{1-\beta} - (\tau-1)^{1-\beta}\big) + \frac{2H}{T}\sum_{t=1}^{T-\tau+1}\mathbb{E}\Big[\frac{Q_t}{V_{t+\tau-1}}\Big]$$
$$\le \frac{(\tau-1)H(H+GR)}{\tau_{\mathrm{mix}}^{\beta}(1-\beta)}\,(T+1)^{1-\beta} + \frac{2H^2}{\tau_{\mathrm{mix}}^{\beta}\,T(2-\beta)}\,(T+1)^{2-\beta} \le \frac{H(H+GR)(\tau+1)}{\tau_{\mathrm{mix}}^{\beta}(1-\beta)}\,(T+1)^{1-\beta},$$
where the second inequality is from Lemma 13 and $\bar g(x) \le 0$, the third inequality holds due to Corollary 32 and the choice of $\tau$, the fourth inequality again uses Corollary 32 together with $Q_t/V_{t+\tau-1} \le (t-1)H/(\tau_{\mathrm{mix}} t)^{\beta}$, and the last inequality holds because $T(2-\beta) > (T+1)(1-\beta)$. Combining the two bounds establishes the lemma.

Based on Lemmas 14 and 15, we can now prove the first part of Theorem 1, which upper bounds the regret of Algorithm 1 for online convex optimization with ergodic constraints.

Proof [The first part of Theorem 1] Lemma 24 implies that
$$\mathbb{E}[\mathrm{Regret}(T)] \le \frac{\alpha_T R^2}{V_T} + \sum_{t=1}^{T}\frac{F^2 V_t}{4\alpha_t} + \sum_{t=1}^{T}\frac{(H+GR)^2}{2 V_t} + \sum_{t=1}^{T}\mathbb{E}\Big[\frac{Q_t}{V_t}\,g_t(x^\star)\Big].$$
Next, using Lemma 15 with $x^\star$ satisfying $\bar g(x^\star) \le 0$ and plugging in our choice of $\alpha_t$ and $V_t$, we deduce
$$\mathbb{E}[\mathrm{Regret}(T)] \le R^2(\tau_{\mathrm{mix}} T)^{1-\beta} + \frac{F^2\tau_{\mathrm{mix}}^{\beta-1}}{4\beta}\,T^{\beta} + \frac{(H+GR)^2\,\tau_{\mathrm{mix}}^{-\beta}}{2(1-\beta)}\,T^{1-\beta} + \frac{2H(H+GR)(\tau+1)}{\tau_{\mathrm{mix}}^{\beta}(1-\beta)}\,(T+1)^{1-\beta} = O\big(\tau_{\mathrm{mix}}^{-\beta}\,\tau\,T^{1-\beta}\big),$$
as required.

Next, we prove the second part of Theorem 1, which gives an upper bound on the constraint violation under Algorithm 1. The following lemma provides a time-varying bound on the expected virtual queue size.

Lemma 16 For $t \in [T+1]$, $\mathbb{E}[Q_t]$ is bounded above by
$$\sqrt{2I(t-1)} + \frac{\beta+3}{\beta+1}\sqrt{2FR\,\tau_{\mathrm{mix}}^{\beta}\,t^{\beta+1}} + R\sqrt{2\tau_{\mathrm{mix}}(t-1)} + 3H\sqrt{2(\tau-1)(t-1)} + 4H,$$
where $I = H^2 + G^2R^2 + F^2/4$.

Proof To prove the lemma, we argue by induction. Note that the statement of the lemma trivially holds when $t = 1$ because $Q_1 = 0$. Suppose the statement holds for $t \le s$ for some $s \ge 1$. What remains is to provide an upper bound on $\mathbb{E}[Q_{s+1}]$. Note that
$$\mathbb{E}\big[Q_t\big(g_t(x_t) + \nabla g_t(x_t)^{\top}(x - x_t)\big)\big] = \mathbb{E}\big[Q_t\,\mathbb{E}_{t-1}\big[g_t(x_t) + \nabla g_t(x_t)^{\top}(x - x_t)\big]\big] \le \mathbb{E}\big[Q_t\,\mathbb{E}_{t-1}[g_t(x)]\big] = \mathbb{E}[Q_t g_t(x)],$$
where the inequality uses the convexity of $g_t$.
Then Lemma 27 together with Jensen's inequality implies
$$\mathbb{E}[Q_{s+1}] \le \sqrt{2s(H^2 + R^2G^2) + 2RF\sum_{t=1}^{s} V_t + 2R^2\alpha_s + 2\sum_{t=1}^{s}\mathbb{E}[Q_t g_t(x)]}.$$
Moreover, $\sum_{t=1}^{s}\mathbb{E}[Q_t g_t(x)]$ can be upper bounded based on Lemma 14. Then it follows that
$$\mathbb{E}[Q_{s+1}] \le \sqrt{2Is + \frac{2FR\,\tau_{\mathrm{mix}}^{\beta}\,(s+1)^{\beta+1}}{1+\beta} + 2R^2\tau_{\mathrm{mix}} s + 2H(H+GR)(\tau-1)s + \frac{4H}{s}\sum_{t=1}^{s}\mathbb{E}[Q_t]}$$
$$\le \sqrt{2Is} + \sqrt{2FR\,\tau_{\mathrm{mix}}^{\beta}\,(s+1)^{\beta+1}} + \sqrt{2R^2\tau_{\mathrm{mix}} s} + \sqrt{2H^2(\tau-1)s} + \sqrt{\frac{4H}{s}\sum_{t=1}^{s}\mathbb{E}[Q_t]}$$
for $s \le T$, where we used $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$ term by term. By the induction hypothesis, $\mathbb{E}[Q_t]$ for any $1 \le t \le s$ is upper bounded by
$$\sqrt{2I(t-1)} + \frac{\beta+3}{\beta+1}\sqrt{2FR\,\tau_{\mathrm{mix}}^{\beta}\,t^{\beta+1}} + R\sqrt{2\tau_{\mathrm{mix}}(t-1)} + 3H\sqrt{2(\tau-1)(t-1)} + 4H$$
$$\le \sqrt{2Is} + \frac{\beta+3}{\beta+1}\sqrt{2FR\,\tau_{\mathrm{mix}}^{\beta}\,(s+1)^{\beta+1}} + R\sqrt{2\tau_{\mathrm{mix}} s} + 3H\sqrt{2(\tau-1)s} + 4H.$$
This leads to the desired upper bound on $\mathbb{E}[Q_{s+1}]$.

We are now ready to prove the second part of Theorem 1, which establishes the constraint violation bound of Algorithm 1.

Proof [The second part of Theorem 1] Combining Lemmas 25 and 16, we deduce that
$$\mathbb{E}[\mathrm{Violation}(T)] \le \mathbb{E}[Q_{T+1}] + FG\sum_{t=1}^{T}\frac{1}{\alpha_t} = O\big(\tau_{\mathrm{mix}}^{\beta/2}\,T^{(\beta+1)/2} + \sqrt{(\tau-1)T}\big),$$
as required.

6.2 Ergodic Drift-Plus-Penalty for Stochastic-Constrained Stochastic Optimization

In this section, we provide the formal proof of Theorem 2. To better organize and present the result, we divide the analysis into two parts, one bounding the optimality gap and the other bounding the feasibility gap. The main idea is to use the performance bounds given in Theorem 1 and relate them to the optimality gap and the feasibility gap. The relation between them is not as direct as in the i.i.d. setting, but we can exploit the mixing property of ergodic Markov chains.

Proof [The first part of Theorem 2] For $T \ge \tau = \tau_{\mathrm{mix}}(T^{-1})$, we have
$$\sum_{t=1}^{T}\big(\bar f(x_t) - \bar f(x)\big) = \sum_{t=1}^{T-\tau+1}\big(\bar f(x_t) - \bar f(x) - f_{t+\tau-1}(x_t) + f_{t+\tau-1}(x)\big) + \sum_{t=1}^{T-\tau+1}\big(f_{t+\tau-1}(x_t) - f_{t+\tau-1}(x_{t+\tau-1})\big) + \sum_{t=\tau}^{T}\big(f_t(x_t) - f_t(x)\big) + \sum_{t=T-\tau+2}^{T}\big(\bar f(x_t) - \bar f(x)\big).$$
We consider the four parts of the right-hand side separately.
Here, the first part satisfies Et 1 f(xt) f(x) ft+τ 1(xt) + ft+τ 1(x) = Z (f(xt, ξ) f(x, ξ)) dµ(ξ) d Pt+τ 1 [t 1] (ξ) FR Z |dµ(ξ) d Pt+τ 1 [t 1] (ξ)| 2FR µ Pt+τ 1 [t 1] TV Hence, it follows that Σ_{t=1}^{T−τ+1} E[ f(xt) f(x) ft+τ 1(xt) + ft+τ 1(x)] ≤ 2FR. To bound the second part, we consider the following. E[ft+τ 1(xt) ft+τ 1(xt+τ 1)] = s=t E[ft+τ 1(xs) ft+τ 1(xs+1)] s=t E[F xs xs+1 ] F 2αs E[Vs F + Qs G] = F 2τ β 1 mix 2 s=t sβ 1 + FG F 2τ β 1 mix 2β (t + τ 2)β (t 1)β + FG where the last inequality holds due to Corollary 32. Here, we consider the following to provide an upper bound on the second term on the right-most side. 2I(t + τ 2) p 2FRτ β mix β + 1 (t + τ 2)β+1 q 2τmix(t + τ 2) p 2(τ 1)(t + τ 2) p 2(τ 1)(t 1) + 4H log t + τ 2 Here, we handle the above terms in the following manner. Note that (t + τ 2)β+1 q = (T 1) β+1 2 + + (T τ + 1) β+1 2 (τ 2) β+1 (τ 1)T β/2+1/2 holds and that (t + τ 2) p (t 1) = (T 1) 1 2 + + (T τ + 1) 1 2 (τ 2) 1 2 1 (τ 1)T 1/2. These are the dominant terms, and hence t=1 E[ft+τ 1(xt) ft+τ 1(xt+τ 1)] = O τ β/2 1 mix (τ 1)T β/2+1/2 + τ 1 mix(τ 1)3/2T 1/2 . Moreover, by Theorem 1 and the triangle inequality, the third part is bounded as t=τ (ft(xt) ft(x)) = O τ β mixτT 1 β + (τ 1) . The fourth part can be bounded as follows. f(xt) f(x) # Combining the bounds on the four parts, we can conclude that for any x ∈ X such that g(x) ≤ 0, f(xt) f(x) # = O τ β mixτT 1 β + τ β/2 1 mix (τ 1)T β/2+1/2 + τ 1 mix(τ 1)3/2T 1/2 + τ , which implies E f( x T ) f(x) 1 f(xt) f(x) # τ β mix T β + τ 1 τ β/2 mix T (1 β)/2 + (τ 1)3/2 τmix T 1/2 + τ as required. Next, we prove the feasibility gap bound of Algorithm 1, which is given as the second part of Theorem 2.
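For reference, both parts of Theorem 2 conclude by converting a cumulative bound into a guarantee for the averaged iterate; this final step is simply Jensen's inequality applied to the convex functions f and g. Written out for the feasibility gap (the notation x̄_T for the averaged iterate follows the proofs above):

```latex
\bar{x}_T \;=\; \frac{1}{T}\sum_{t=1}^{T} x_t
\quad\Longrightarrow\quad
\mathbb{E}\bigl[g(\bar{x}_T)\bigr] \;\le\; \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\bigl[g(x_t)\bigr],
```

and the same step with f in place of g yields the optimality gap bound from the regret bound.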
Proof [Proof of the second part of Theorem 2] For T τ = τmix(T 1), we have the following, t=1 g(xt) = t=1 ( g(xt) gt+τ 1(xt)) + t=1 (gt+τ 1(xt) gt+τ 1(xt+τ 1)) t=τ gt(xt) + t=T τ+2 g(xt) where the right-hand side consists of four terms. We separately upper bound the four parts of the right-hand side. As before, the first part satisfies Et 1[ g(xt) gt+τ 1(xt)] = Z g(xt, ξ) dµ(ξ) d Pt+τ 1 [t 1] (ξ) H Z |dµ(ξ) d Pt+τ 1 [t 1] (ξ)| 2H µ Pt+τ 1 [t 1] TV This implies that Σ_{t=1}^{T−τ+1} E[ g(xt) gt+τ 1(xt)] ≤ 2H. Next, the second part satisfies E[gt+τ 1(xt) gt+τ 1(xt+τ 1)] = s=t E[gt+τ 1(xs) gt+τ 1(xs+1)] s=t E[G xs xs+1 ] As in the proof of the first part of Theorem 2, we deduce that t=1 E[gt+τ 1(xt) gt+τ 1(xt+τ 1)] O τ β/2 1 mix (τ 1)T β/2+1/2 + τ 1 mix(τ 1)3/2T 1/2 . Moreover, it follows from Theorem 1 and the triangle inequality that the third part can be bounded as t=τ+2 gt(xt) = O τ β/2 mix T β/2+1/2 + p (τ 1)T + τ 1 . Lastly, the following provides an upper bound on the fourth part. t=T τ+2 g(xt) Combining the bounds on the four parts, we have that O τ β/2 mix T β/2+1/2 + τ β/2 1 mix (τ 1)T β/2+1/2 + τ 1 mix(τ 1)3/2T 1/2 + τ , implying in turn that E [ g( x T )] 1 τ β/2 1 mix τ T (1 β)/2 + (τ 1)3/2 τmix T 1/2 + τ as required. 6.3 Ergodic Drift-Plus-Penalty under Slater's Condition In this section, we prove Theorem 3, which analyzes the performance of Algorithm 1 under Slater's constraint qualification stated in Assumption 3. We will see that Slater's condition does lead to an improvement; essentially, we deduce a tighter bound on E[Qt]. To argue this, we follow the proof strategy of Yu et al. (2017). We first present a lemma, which is analogous to (Yu et al., 2017, Lemma 5). The difference is that we use a time-varying parameter θ(t), which allows time-varying algorithm parameters Vt and αt. In fact, its proof is similar to that of (Yu et al., 2017, Lemma 5), so we defer it to the appendix (Appendix D).
Lemma 17 Let {Zt, t ≥ 0} be a discrete-time stochastic process adapted to a filtration {Wt, t ≥ 0} with Z0 = 0 and W0 = {∅, S}. Suppose there exists a positive integer t0, real numbers 0 < ζ ≤ δmax, and a non-decreasing function θ(t) > 0 such that |Zt+1 − Zt| ≤ δmax and E[Zt+t0 − Zt | Wt] ≤ t0δmax if Zt < θ(t), while E[Zt+t0 − Zt | Wt] ≤ −t0ζ if Zt ≥ θ(t), for all positive integers t. Then, E[Zt] ≤ θ(t) + t0δmax + (4t0δmax²/ζ) log(8δmax²/ζ²) for all positive integers t. Here, the important part is that although θ(t) is time-varying, the parameter t0 is some fixed value that does not depend on t. This is why the condition in Lemma 17 can be applied recursively. Lemma 17 implies that if the stochastic process given by {Qt, t ≥ 0} where Q0 = 0 satisfies the drift condition as in Lemma 17 with appropriate parameters δmax, t0, and ζ, then the expected queue size E[Qt] can be bounded. Moreover, by properly setting θ(t) as well as the parameters, we may control the size of E[Qt]. The next lemma shows that the stochastic process given by {Qt, t ≥ 0} indeed satisfies the desired drift condition. While proving the lemma, we need to consider the term Et−1[Qigi(x̂)] for i ≥ t where x̂ is the solution satisfying Slater's constraint qualification, i.e., g(x̂) ≤ −ϵ. Here, we need to relate the bound g(x̂) ≤ −ϵ to the term Et−1[Qigi(x̂)], from which we deduce the drift condition. Although gi(x̂) is not necessarily negative, we again use the property of ergodic Markov chains that the distribution of gi(x̂) conditional on Ft−1 gets close to the stationary distribution for sufficiently large i. Lemma 18 Let T ≥ 4H/ϵ, t0 > 2τ = 2τmix(T 1), and θt0(t) = 2 ϵ(t0 2τ) ϵ(t0 τ)2(H + GR) + 4τ(t0 + t 1)(H + GR)2 + 2 ϵ(t0 2τ) 2t0(H2 + G2R2) + 2RFτ β mixt0(t + t0 1)β . Then Algorithm 1 under Assumption 3 satisfies |Qt+1 − Qt| ≤ H + GR, and Et−1[Qt+t0 − Qt] ≤ (H + GR)t0 if Qt < θt0(t), while Et−1[Qt+t0 − Qt] ≤ −ϵt0/4 if Qt ≥ θt0(t). Proof |Qt+1 − Qt| ≤ H + GR follows directly from Lemma 13.
For the second part, note that by Jensen's inequality, we have Et 1[Qt+t0]2 Et 1[Q2 t+t0]. Moreover, observe that θt0(t) 2ϵ(t0 τ)2(H + GR) ϵ(t0 2τ) (t0 τ)(H + GR) ϵt0 where the last inequality holds because t0 > 2τ and H > ϵ. This means that Qt ≥ ϵt0/4 if Qt ≥ θt0(t). Therefore, it is sufficient to show that if Qt ≥ θt0(t), Et 1[Q2 t+t0] Qt ϵt0 By Lemma 26 and the convexity of gt, we have for any i ≥ 1, i H2 + G2R2 + Qigi(ˆx) + Vi RF + αi (D(ˆx, xi) D(ˆx, xi+1)) . Thus, we have i=t Et 1[ i] t0(H2 + G2R2) + i=t Et 1[Qigi(ˆx)] + RFτ β mix + αt D(ˆx, xt) + i=t (αi+1 αi)D(ˆx, xi+1) αt+t0 1D(ˆx, xt+t0) t0(H2 + G2R2) + RFτ β mixt0(t + t0 1)β + αt+t0 1R2 + i=t Et 1[Qigi(ˆx)]. We decompose the last term as follows. i=t Et 1[Qigi(ˆx)] i=t Et 1[Qigi(ˆx)] + i=t Et 1[(Qi+τ 1 Qi)gi+τ 1(ˆx)] i=t Et 1[Qi(gi+τ 1(ˆx) g(ˆx))] + i=t Et 1[Qi g(ˆx)] i=t 1 i + (t0 τ + 1)(τ 1)H(H + GR) + 2H T ϵ t+t0 τ X i=t Et 1[Qi] where the inequality is due to Et 1[ g(xt) gt+τ 1(xt)] H Z |dµ(ξ) d Pt+τ 1 [t 1] (ξ)| 2H µ Pt+τ 1 [t 1] TV 2H Then, since T ≥ 4H/ϵ and thus 2H/T ≤ ϵ/2, it follows that i=t Et 1[Qigi(ˆx)] (t0 + t 2)(τ 1)H(H + GR) ϵ i=t Et 1[Qi] (t0 + t 2)(τ 1)H(H + GR) ϵ i=0 (Qt i(H + GR)) (t0 + t 2)(τ 1)H(H + GR) ϵ 2 (t0 τ)Qt (t0 τ)2(H + GR) . As a result, we obtain Et 1[Q2 t+t0] = Q2 t + 2 i=t Et 1[ i] Q2 t ϵ(t0 τ)Qt + ϵ(t0 τ)2(H + GR) + 2(t0 + t 2)(τ 1)H(H + GR) + 2t0(H2 + G2R2) + 2RFτ β mixt0(t + t0 1)β + 2τmix(t + t0 1)R2 Q2 t ϵ(t0 τ)Qt + ϵ(t0 τ)2(H + GR) + 4τ(t0 + t 1)(H + GR)2 + 2t0(H2 + G2R2) + 2RFτ β mixt0(t + t0 1)β If Qt satisfies ϵ 2(t0 2τ)Qt ϵ(t0 τ)2(H + GR) + 4τ(t0 + t 1)(H + GR)2 + 2t0(H2 + G2R2) + 2RFτ β mixt0(t + t0 1)β, which is equivalent to Qt ≥ θt0(t), then Et 1[Q2 t+t0] Q2 t ϵ 2t0Qt Qt ϵt0 as required. Having prepared Lemmas 17 and 18, we can derive refined bounds on the expected virtual queue size. Based on this, we are ready to prove Theorem 3.
Proof [Proof of Theorem 3] As the result is trivial when T < 4H/ϵ, where H and ϵ are constants, we may assume without loss of generality that T ≥ 4H/ϵ. Moreover, we will use Lemma 17 for multiple values of t0 greater than 2τ. Setting δmax = H + GR, ζ = ϵ/4, and θ = θt0 where θt0 is defined as in Lemma 18, Lemma 17 gives us that E[Qt+1] ≤ θt0(t) + t0(H + GR) + t0 · (16(H + GR)²/ϵ) log(128(H + GR)²/ϵ²). In particular, as this inequality holds for any t0 > 2τ, the inequality with t0 = 2τ + ⌈√(τmix t)⌉ holds. Recall that in Lemma 18, we set θt0(t) = 2 ϵ(t0 2τ) ϵ(t0 τ)2(H + GR) + 4τ(t0 + t 1)(H + GR)2 + 2 ϵ(t0 2τ) 2t0(H2 + G2R2) + 2RFτ β mixt0(t + t0 1)β . Moreover, ϵ(t0 2τ) 2 θt0(t) = ϵ τmixt 2 θt0(t) ϵ τmixt This implies the following. ϵ(τ + τmixt )2(H + GR) + 4τ(2τ + τmixt + t 1)(H + GR)2 + 2(2τ + τmixt )(H2 + G2R2) + 2RFτ β mix(2τ + τmixt )(t + 2τ + τmixt 1)β 2ϵ(H + GR)(τ 2 + 4τmixt) + 4τ(2τ + 2 τmixt + t)(H + GR)2 + 4(τ + τmixt)(H2 + G2R2) + 4RFτ β mix(τ + τmixt)(t + 2τ + 2 τmixt)β 2ϵ(H + GR)(τ 2 + 4τmixt) | {z } (a) + 4τ(2τ + τmix + 2t)(H + GR)2 | {z } (b) + 4(τ + τmixt)(H2 + G2R2) | {z } (c) + 4RFτ β mix(τ + τmixt)(2t + 2τ + τmix)β | {z } (d) where the second inequality follows from (a + b)² ≤ 2(a² + b²) and ⌈x⌉ ≤ 2x, which holds for x ≥ 1, and the third inequality follows from the AM-GM inequality. We see that term (c) grows more slowly than term (d), which is of order O τ β mix(τ + t)β(τ + τmixt) = O τ β mix(τ β + tβ)(τ + τmixt) . Here, we used the known fact that (a + b)β ≤ aβ + bβ for a, b ≥ 0 and 0 < β < 1. Likewise, term (a) grows more slowly than term (b), which is of order O (τ(τ + t)) . Then it follows that θt0(t) = O τ β mix(τ β + tβ) 1 + τ τmixt + τ(τ + t) τmixt τ β mixτ β | {z } (a ) + τ β mixtβ | {z } (b ) + τ 1+βτ β mix τmixt | {z } (c ) + ττ β mixtβ τmixt | {z } (d ) + τ 2 τmixt | {z } (e ) + τt τmixt | {z } (f ) Here, term (c ) grows more slowly than term (e ), and term (f ) dominates term (b ).
Also, terms (a ) and (d ) are dominated by τ = O τ r τ = O (e ) + (f ) , which is due to the AM-GM inequality. Therefore, θt0(t) = O τ(τ + t) τmixt Furthermore, since t0 = 2τ + ⌈√(τmix t)⌉ = O τ(τ + t) τmixt which holds because τ(τ + t) ≥ 2τ√(τt), we have E[Qt] O τ(τ + t) τmixt which essentially gives rise to the desired refined bound on the expected virtual queue size. Having provided the bound on the expected queue size, we now proceed to provide bounds on the performance guarantees of Algorithm 1. First of all, note that the regret bound given by Theorem 1 still holds with β = 1/2, as we use the same algorithm parameters. Then we deduce that E [Regret(T)] = O For constraint violation, we deduce from Lemma 25 and (2) that E [Violation(T)] E[QT+1] + FG τ(τ + T + 1) p τmix(T + 1) + τ β 1 mix t=1 tβ 1 + 1 τmix = O τ(τ + T) τmix T Next, we consider the optimality gap. In the proof of Theorem 2, we argued that E[ft+τ 1(xt) ft+τ 1(xt+τ 1)] F 2τ β 1 mix 2β (t + τ 2)β (t 1)β + FG Here, using (2) again, the second term on the right-hand side can be bounded as follows. λ((t 1) 1/2) λ((t + τ 2) 1/2) | {z } (a ) τ 3/2 mix ( ( x, if x > 0 3/2, if x = 0 . Therefore, it follows that t=1 E[ft+τ 1(xt) ft+τ 1(xt+τ 1)] = O τ β 1 mix 2β (T 1)β + + (T τ + 1)β 2 + 1 1/2 + + (τ 2) 1/2 if τ ≥ 2. Thus, we deduce t=1 E[ft+τ 1(xt) ft+τ 1(xt+τ 1)] = O τ β 1 mix (τ 1)T β + τ 2 τ 1τ 3/2 mix + τ(τ 1)τ 3/2 mix τ 3/2 mix + τ 2 We combine this result with the other parts of the proof of the first part of Theorem 2 to get f(xt) f(x) # τ 3/2 mix + τ 2 τ 3/2 mix + τ β mixτT 1 β + τ τ 3/2 mix + τ 2 for any x ∈ X such that g(x) ≤ 0 since β = 1/2. Similarly, we get t=1 E[gt+τ 1(xt) gt+τ 1(xt+τ 1)] = O τ 3/2 mix + τ 2 As before, we combine this result with the other parts of the proof of the second part of Theorem 2.
Then we get τ 3/2 mix + τ 2 τ 3/2 mix + τ(τ + T) τmix T + τ τ 3/2 mix + τ 2 τ 3/2 mix + τ 2 τmix T Finally, we obtain that E[Gap( x T )] = O τ 3/2 mix T + τ 2 E[Infeasibility( x T )] = τ 3/2 mix T + τ 2 T + τ 2 τmix T 3/2 as required. 7. Analysis of MLMC Adaptive Drift-Plus-Penalty for the Unknown Mixing Time Case This section presents a complete proof of Theorem 11. Section 7.1 provides bounds on several terms that involve the virtual queue size. In Section 7.2, we state the proof of Lemma 9, and lastly, we prove Theorem 11. 7.1 Controlling the Expected Virtual Queue Size In this section, we provide upper bounds on the terms E[Qt/Vt] and E[Qt] under the MLMC Adaptive Drift-Plus-Penalty algorithm (Algorithm 3). Lemma 19 gives a refined bound on the term E[gt(x)]. Using this, Lemma 20 derives an adaptive upper bound on the term E[Qt/Vt], and Lemma 21 deduces an adaptive upper bound on the term E[Qt]. Note that if x ∈ X satisfies g(x) ≤ 0, then gt(x) gt(x) g(x). Then we deduce Et 1[gt(x)] Et 1[gt(x) g(x)] = Et 1 h g2jmax t (x) g(x) i Et 1 h |g2jmax t (x) g(x)| i , where the equality is due to Lemma 7. Applying Lemma 8 together with Jensen's inequality on the right-hand side, we obtain Et 1[gt(x)] C(T)τ 1/2 mix T 1 = O τ 1/2 mix T 1 . (4) In fact, we may derive a tighter bound on the term Et 1[gt(x)] as follows. Lemma 19 For any x ∈ X such that g(x) ≤ 0, Et 1[gt(x)] D(T)τ 1/4 mix T 1/2, where D(T) = p C(T)H(4 log2 T + 3) Proof Note that gt(x) |gt(x)| (2Nt + 1)H by the definition of gt. This implies that Et 1[gt(x)] Et 1[(2Nt + 1)H] = (4 log2 T + 3)H, where the equality follows from Lemma 7 and the fact that Nt is independent of Ft 1. Moreover, as we argued above, we have Et 1[gt(x)] Et 1 h |g2jmax t (x) g(x)| i q Et 1 |g2jmax t (x) g(x)|2 C(T)τ 1/2 mix T 1, where the last inequality follows from Jensen's inequality and the equality follows from Lemma 8.
By taking the geometric mean of these two upper bounds, we get that Et 1[gt(x)] p C(T)H(4 log2 T + 3)τ 1/4 mix T 1/2, as required. Next, Lemma 27 implies that Qs+1 is at most v u u t2 t=1 (R2G2 t + H2 t ) 2R2αs | {z } (c) t=1 Qt(gt(xt) + gt(xt) (x xt)). Here, term (c) is equal to 2Ss−1, which is at most 2Ss, and term (b) is at most 2Ss. For term (a), we have t=1 Sβ t 1Ft 2Sβ s t=1 Ft 2Sβ s t=1 F 2 t 2 s Sβ+1/2 s , where the second inequality holds by the power mean inequality. Thus, Q2 s+1 2 s Sβ+1/2 s + 4Ss + 2 t=1 Qt(gt(xt) + gt(xt) (x xt)), (5) which in turn implies that 2s1/4Sβ/2+1/4 s + t=1 Qt (gt(xt) + gt(xt) (x xt)) (6) Moreover, (6) implies that 2RS1/2 β s + 2Rs1/4S1/4 β/2 s + v u u t 2 Vs+1 Vt (gt(xt) + gt(xt) (x xt)) as Vt is non-decreasing. Taking the expectation, applying Jensen's inequality, and using the fact that Vs+1 sδ, it follows that 2RE[Ss]1/2 β + 2Rs1/4E[Ss]1/4 β/2 Vt Et 1[gt(xt) + gt(xt) (x xt)] . Here, the last term on the right-hand side can be further upper bounded as follows. v u u t 2R Vt Et 1[gt(xt) + gt(xt) (x xt)] Vt Et 1[gt(x)] where we used the fact that Et 1[gt(xt) + gt(xt) (x xt)] = Et 1 h gjmax(xt) + gjmax(xt) (x xt) i Et 1 gjmax(x) = Et 1 [gt(x)] holds due to Lemma 7. Furthermore, v u u t 2R Vt Et 1[gt(x)] (sδ)β D(T)τ 1/4 mix T 1/2 s X (sδ)β D(T)τ 1/4 mix T 1/2 + 1 δβ τ 1/4 mix T 1/2 β + 1 where the second inequality follows from the AM-GM inequality and the last equality follows from s ≤ T. Thus, we have 2RE[Ss]1/2 β + 2Rs1/4E[Ss]1/4 β/2 + 2RD(T) δβ τ 1/4 mix T 1/2 β + 1 2RE[ST ]1/2 β + 2Rs1/4E[ST ]1/4 β/2 + 2RD(T) δβ τ 1/4 mix T 1/2 β + 1 Now we can bound E[Qt/Vt] as follows. Lemma 20 For t ∈ [T + 1], 2RE[ST ]1/2 β + 2 2RT 1/4E[ST ]1/4 β/2 + 4RD(T) δβ τ 1/4 mix T 1/2 β. Proof For t = 1, Q1 = 0 and the inequality of the lemma holds. Assume that it holds for t = 1, . . . , s with s ≤ T. Substituting the inequalities for t = 1, . . .
, s into (8), we derive the inequality for t = s + 1, as required. Next, we deduce an upper bound on E[Qt]. First, we observe that (6) and Jensen's inequality imply the following. For s ∈ [T], 2E[Ss]1/2 + 2s1/4E[Ss]β/2+1/4 + v u u t2C(T) τmix 2E[Ss]1/2 + 2s1/4E[Ss]β/2+1/4 + s C(T) τmix 2E[Ss]1/2 + 2s1/4E[Ss]β/2+1/4 + C(T)τ 1/2 mix + 1 where the first inequality is implied by combining (6) and (7) and applying Lemma 19. Next, we provide an adaptive upper bound on E[Qt]. Lemma 21 For any t ∈ [T + 1], 2E[St 1]1/2 + 5 2 3 (t 1)1/4E[St 1]β/2+1/4 + 2C(T)τ 1/2 mix. Proof We argue by induction. The inequality trivially holds when t = 1 as Q1 = 0. Suppose that the inequality holds for t = 1, 2, . . . , s. Then 2E[Ss]1/2 + 2s1/4E[Ss]β/2+1/4 + C(T)τ 1/2 mix 2E[St 1]1/2 + 10 3 (t 1)1/4E[St 1]β/2+1/4 + 2C(T)τ 1/2 mix 2E[Ss]1/2 + 2s1/4E[Ss]β/2+1/4 + C(T)τ 1/2 mix 2s E[Ss]1/2 + 4 2 3 s5/4E[Ss]β/2+1/4 + 2s C(T)τ 1/2 mix 2E[Ss]1/2 + 5 2 3 s1/4E[Ss]β/2+1/4 + 2C(T)τ 1/2 mix, where the first inequality is from (9) and the second inequality is by Corollary 32. 7.2 MLMC Adaptive Drift-Plus-Penalty for Stochastic-Constrained Stochastic Optimization Based on the bounds on Et 1[gt(x)], E[Qt/Vt], and E[Qt], in this section, we prove Lemma 9 and Theorem 11. To better present the proof of Lemma 9, we divide the analysis into two parts, one for the regret and the other for the constraint violation. Then, using Lemma 9, we prove Theorem 11. Proof [Proof of the first part of Lemma 9] We first bound the regret term. By Lemma 24, it is less than or equal to Vt F 2 t 4αt (Ht + Gt R)2 Here, the first term is equal to RS1 β T 1, which is less than or equal to RS1 β T , and the second term satisfies Vt F 2 t 4αt t=1 Sβ 1 t 1 F 2 t t=1 Sβ 1 t 1 at = O E h Sβ T i where the inequality holds by Corollary 34. The third term satisfies (Ht + Gt R)2 R2G2 t + H2 t Sβ t 1 = O E h S1 β T i .
By Jensen's inequality, we have E[S^{1−β}_T] ≤ E[ST]^{1−β} and E[S^β_T] ≤ E[ST]^β. The fourth term can be bounded as follows. Note that Vt Et 1[gt(x)] . We have two upper bounds on Et 1 [gt(x)], one given in (4) and the other due to Lemma 19. Let m(T) be the minimum of the two upper bounds, i.e., m(T) = min{ C(T)τ 1/2 mix T 1, D(T)τ 1/4 mix T 1/2 }. Then it follows that Vt Et 1[gt(x)] 2RE[ST ]1/2 β + 2 2RT 1/4E[ST ]1/4 β/2 + 4RD(T) δβ τ 1/4 mix T 1/2 β C(T)τ 1/2 mix T 1 4 2RE[ST ]1/2 β + 2 2RT 1/4E[ST ]1/4 β/2 + 4RD(T)2 δβ τ 1/2 mix T β = O τ 1/2 mix E[ST ]1/2 β + τ 1/2 mix T 1/4E[ST ]1/4 β/2 + τ 1/2 mix T 1 β where the first inequality follows from (4) and Lemmas 19 and 20. Combining the bounds on the four terms, we have proved the desired bound on E[Σ_{t=1}^{T} ft(xt) − Σ_{t=1}^{T} ft(x)]. Proof [Proof of the second part of Lemma 9] By Lemma 25, we have t=1 gt(xt) QT+1 + Gt 2αt (Vt Ft + Qt Gt). Moreover, by Lemma 21, we have E[QT+1] O E[ST ]1/2 + T 1/4E[ST ]β/2+1/4 + τ 1/2 mix . Next it follows from Corollary 34 that Sβ 1 t 1 at = O E h Sβ T i = O E [ST ]β . Furthermore, G2 t 2αt Qt max t [T] Qt R2G2 t 2St 1 2 log ST I δ max t [T] Qt where the second inequality is implied by Lemma 33. Here, the last term can be upper bounded using the AM-GM inequality as follows. I + δ 2 log ST I δ max t [T] Qt maxt [T] Q2 t 2(Iτmix)β/2+1/4T β/2+1/2 + 1 2 (Iτmix)β/2+1/4T β/2+1/2. Let s = argmaxt∈[T] Qt. Then by (5) and (7), E Q2 s 4E[Ss 1] + 2 s E[Ss 1]β+1/2 + 2 t=1 E h Qt Et 1 h gt(xt) + gt(xt) (x xt) ii 4E[ST ] + 2 TE[ST ]β+1/2 + 2 t=1 E[Qt Et 1[gt(x)]].
t=1 E[Qt Et 1[gt(x)]] C(T)τ 1/2 mix T 1 s 1 X C(T)τ 1/2 mix T 1 T X 2E[ST ]1/2 + 5 2 3 (t 1)1/4E[ST ]β/2+1/4 + 2C(T)τ 1/2 mix C(T)τ 1/2 mix T 1 2TE[ST ]1/2 + 4 2 3 T 5/4E[ST ]β/2+1/4 + 2C(T)τ 1/2 mix T = O τ 1/2 mix E[ST ]1/2 + τ 1/2 mix T 1/4E[ST ]β/2+1/4 + τmix , where the first inequality follows from (4), the second inequality follows from Lemma 21, and the third inequality follows from Corollary 32. Thus, we get E[ST ]1/2 + T 1/4E[ST ]β/2+1/4 + τ 1/2 mix + E[ST ] + T 1/2E[ST ]β+1/2 τ β/2+1/4 mix T β/2+1/2 +τ 1/2 mix + E[ST ]1/2 + E[ST ]β/2+1/4 τ β/2+1/4 mix T β/2+1/2 + (log ST )2τ β/2 1/4 mix T β/2+1/2 ! as required. Proof [Proof of Theorem 11] As f is convex, f( x T ) f(x#) 1 f(xt) f(x#) . We can decompose the right-hand side as follows. f(xt) f(x#) T = f(xt) ft(xt) T + ft(xt) ft(x#) T + ft(x#) f(x#) Here, Jensen's inequality and Lemma 8 imply that 1 T E | f(xt) ft(xt)| 1 E | f(xt) ft(xt)|2 = O(τ 1/2 mix T 2), 1 T E h | f(x#) ft(x#)| i 1 E | f(x#) ft(x#)|2 = O(τ 1/2 mix T 2). Moreover, we have derived Proposition 10 which provides an upper bound on the expectation of Σ_{t=1}^{T} ft(xt) − Σ_{t=1}^{T} ft(x#). Consequently, E h f( x T ) f(x#) i O (τmix)1 βT β . For the second part, Jensen's inequality and Lemma 8 imply that 1 T E [| g(xt) gt(xt)|] 1 E [| g(xt) gt(xt)|2] = O(τ 1/2 mix T 2) As a result, E[ g( x T )] 1 t=1 E[ g(xt)] t=1 E[ g(xt) gt(xt)] + 1 t=1 E[gt(xt)] = O τ 1/2 mix T 2 + τ β/2+1/4 mix T β/2 1/2 + τ 3/4 β/2 mix T β/2 1/2 = O τ β/2+1/4 mix T β/2 1/2 + τ 3/4 β/2 mix T β/2 1/2 , where the first inequality holds because g is convex and the second equality follows from Proposition 10. Acknowledgments We thank the anonymous referee and the editor for their careful review of the paper. This work was supported by the National Research Foundation of Korea (NRF) grants (No. RS-2024-00350703 and No.
RS-2023-00219980) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No. RS-2024-00437268) funded by the Korea government (MSIT). Algorithm 4 Adaptive Drift-Plus-Penalty Initialize: Initial iterate x1 ∈ X, Q1 = 0. for t = 1 to T do Observe ft and gt. Set penalty parameter Vt and step size parameter αt such that 0 ≤ Vt−1 ≤ Vt, 0 ≤ αt−1 ≤ αt, and 0 ≤ αt−1/Vt−1 ≤ αt/Vt. Primal update: Set xt+1 as xt+1 = argmin_{x∈X} { (Vt∇ft(xt) + Qt∇gt(xt))⊤x + αt D(x, xt) } Dual update: Set Qt+1 as Qt+1 = [Qt + gt(xt) + ∇gt(xt)⊤(xt+1 − xt)]+ Appendix A. Analysis of Adaptive Drift-Plus-Penalty Algorithm 4 is a general template for adaptive variants of the drift-plus-penalty algorithm whose parameters Vt and αt satisfy that {Vt}T t=1, {αt}T t=1, and {αt/Vt}T t=1 are non-decreasing sequences of non-negative numbers. In this section, we analyze the general template of DPP given by Algorithm 4, based on which we deduce performance guarantees on Algorithms 1 and 3. Recall that Δt = (Q²t+1 − Q²t)/2 is the Lyapunov drift term. The following lemma provides a bound on the drift term. Lemma 22 For t ≥ 1, Δt ≤ Qt(gt(xt) + ∇gt(xt)⊤(xt+1 − xt)) + (1/2)(Ht + Gt R)². Proof As Qt+1 = max{Qt + gt(xt) + ∇gt(xt)⊤(xt+1 − xt), 0}, we have that Q²t+1 ≤ (Qt + gt(xt) + ∇gt(xt)⊤(xt+1 − xt))². Hence, it follows that Δt = Q²t+1/2 − Q²t/2 ≤ Qt(gt(xt) + ∇gt(xt)⊤(xt+1 − xt)) + (1/2)(gt(xt) + ∇gt(xt)⊤(xt+1 − xt))² ≤ Qt(gt(xt) + ∇gt(xt)⊤(xt+1 − xt)) + (1/2)(Ht + Gt R)² where the last inequality is from Assumption 2. Since our drift-plus-penalty algorithm is a mirror-descent version, we need the following lemma, which is obtained by substituting y = xt, x = xt+1, and f(x) = (Vt∇ft(xt) + Qt∇gt(xt))⊤x into (Wei et al., 2020, Lemma 2.1). Lemma 23 (Wei et al., 2020, Equation (22)) For any x ∈ X and t ≥ 1, (Vt∇ft(xt) + Qt∇gt(xt))⊤(xt+1 − xt) + αt D(xt+1, xt) ≤ (Vt∇ft(xt) + Qt∇gt(xt))⊤(x − xt) + αt D(x, xt) − αt D(x, xt+1).
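As a concrete illustration of the Algorithm 4 template above, the following is a minimal numerical sketch. It assumes the Euclidean Bregman divergence D(x, y) = ‖x − y‖² (so the primal update reduces to a projected step), and runs on a 1-D toy problem of our own choosing: minimize (x − 2)² subject to x − 1 ≤ 0 over X = [−3, 3], whose optimum is x* = 1. The schedules Vt = √t and αt = t are illustrative choices satisfying the monotonicity requirements (Vt, αt, and αt/Vt non-decreasing), not the tuned parameters analyzed in the paper.

```python
# Sketch of the Algorithm 4 template (Adaptive Drift-Plus-Penalty) with
# D(x, y) = ||x - y||^2 on a 1-D toy problem; problem instance and the
# schedules V_t = sqrt(t), alpha_t = t are illustrative assumptions.
import math

def project(x, lo=-3.0, hi=3.0):
    """Euclidean projection onto the box X = [lo, hi]."""
    return min(max(x, lo), hi)

def adaptive_dpp(T=2000):
    f_grad = lambda x: 2.0 * (x - 2.0)  # gradient of f(x) = (x - 2)^2
    g = lambda x: x - 1.0               # constraint function, want g(x) <= 0
    g_grad = lambda x: 1.0              # gradient of g
    x, Q = 0.0, 0.0                     # x_1 in X, virtual queue Q_1 = 0
    iterates = []
    for t in range(1, T + 1):
        V_t, alpha_t = math.sqrt(t), float(t)
        # Primal update: argmin over X of
        #   (V_t f'(x_t) + Q_t g'(x_t)) x + alpha_t (x - x_t)^2,
        # which is a projected gradient-like step for this choice of D.
        c = V_t * f_grad(x) + Q * g_grad(x)
        x_next = project(x - c / (2.0 * alpha_t))
        # Dual update: Q_{t+1} = [Q_t + g_t(x_t) + g_t'(x_t)(x_{t+1} - x_t)]_+
        Q = max(Q + g(x) + g_grad(x) * (x_next - x), 0.0)
        iterates.append(x)
        x = x_next
    x_bar = sum(iterates) / len(iterates)
    return x_bar, Q

x_bar, Q_final = adaptive_dpp()
```

On this instance, the averaged iterate approaches the constrained optimum x* = 1 while the virtual queue grows only sublinearly in T, mirroring the regret/violation trade-off that the analysis in this appendix quantifies.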
Recall that ft and gt for the known mixing time case in Section 3 correspond to one sample and are assumed to be convex. In contrast, ft and gt for the unknown mixing time setting in Section 4 come from the MLMC estimation scheme with multiple data samples and thus are not necessarily convex. Nevertheless, we use the fact that Et−1[ft] and Et−1[gt] are convex. Based on this, we deduce the following lemma. Lemma 24 Suppose that Et−1[ft] = Et−1[f̂t] and Et−1[gt] = Et−1[ĝt] where f̂t and ĝt are convex functions and that Et−1[∇ft] = Et−1[∇f̂t] and Et−1[∇gt] = Et−1[∇ĝt]. Then for any x ∈ X, Vt F 2 t 4αt (Ht + Gt R)2 Proof Dividing both sides of the inequality given in Lemma 23 by Vt and adding ft(xt) + Qtgt(xt)/Vt to both sides, we get ft(xt) + Qt Vt gt(xt) + ft(xt) + Qt Vt gt(xt) (xt+1 xt) + αt Vt D(xt+1, xt) ft(xt) + Qt Vt gt(xt) + ft(xt) + Qt Vt gt(xt) (x xt) + αt Vt D(x, xt) αt Vt D(x, xt+1). Here, the left-hand side is bounded below by ft(xt) + ft(xt) (xt+1 xt) + αt Vt D(xt+1, xt) + t 2Vt (Ht + Gt R)2 by Lemma 22. Moreover, we have ft(xt) (xt+1 xt) + αt Vt D(xt+1, xt) Ft xt+1 xt + αt Vt xt+1 xt 2 = Vt F 2 t 4αt + αt xt+1 xt Vt Ft Vt F 2 t 4αt where the first inequality holds by 2-strong convexity of Φ with respect to the norm ∥·∥. Hence, we deduce that ft(xt) Vt F 2 t 4αt + t 2Vt (Ht + Gt R)2 ft(xt) + Qt Vt gt(xt) + ft(xt) + Qt Vt gt(xt) (x xt) + αt Vt D(x, xt) αt Vt D(x, xt+1). For the right-hand side of this inequality, consider αt Vt (D(x, xt) D(x, xt+1)) = α1 V1 D(x, xt) + t=2 D(x, xt) αt VT D(x, xt+1) where the inequality holds since {αt/Vt}T t=1 is a non-decreasing sequence. Furthermore, 1 Vt (Q2 t+1 Q2 t ) = Q2 1 2V1 + Q2 T+1 2VT + 1 Q2 1 2V1 = 0, where the inequality holds since the sequence {Vt}T t=1 is non-negative and non-decreasing.
Combining these inequalities, we deduce that ft(xt) + ft(xt) (x xt) gt(xt) + gt(xt) (x xt) + αT Vt F 2 t 4αt + 1 (Ht + Gt R)2 Recall that Et−1[ft] = Et−1[f̂t] and Et−1[gt] = Et−1[ĝt] where f̂t and ĝt are convex and that Et−1[∇ft] = Et−1[∇f̂t] and Et−1[∇gt] = Et−1[∇ĝt]. Then it follows that Et−1[ft(xt)] + Et−1[∇ft(xt)]⊤(x − xt) = Et−1[f̂t(xt)] + Et−1[∇f̂t(xt)]⊤(x − xt) ≤ Et−1[f̂t(x)] = Et−1[ft(x)] where the inequality holds because f̂t is convex and xt is Ft−1-measurable. Likewise, we deduce that Et−1[gt(xt)] + Et−1[∇gt(xt)]⊤(x − xt) ≤ Et−1[gt(x)]. Taking the expectations of both sides of (10), we obtain the inequality of this lemma, as required. Next, we state a lemma that will be useful to provide an upper bound on the constraint violation. Lemma 25 Algorithm 4 achieves xt+1 xt 1 2αt (Vt Ft + Qt Gt), t=1 gt(xt) QT+1 + Gt 2αt (Vt Ft + Qt Gt). Proof As Qt+1 Qt + gt(xt) + gt(xt) (xt+1 xt), it follows that gt(xt) Qt+1 Qt gt(xt) (xt+1 xt) Qt+1 Qt + Gt xt+1 xt . On the other hand, if we set x = xt in the inequality of Lemma 23, we get αt D(xt+1, xt) + αt D(xt, xt+1) Vt ft(xt) + Qt gt(xt) (xt xt+1). Here, the left-hand side is greater than or equal to 2αt xt+1 xt 2 while the right-hand side is less than or equal to (Vt Ft + Qt Gt) xt+1 xt . Therefore, it follows that xt+1 xt 1 2αt (Vt Ft + Qt Gt), (11) which implies Σ_{t=1}^{T} gt(xt) QT+1 + Gt 2αt (Vt Ft + Qt Gt), as required. To bound the constraint violation, we still need to bound the virtual queue size QT+1. We also need the following lemma. Lemma 26 For any x ∈ X, Δt ≤ H²t + R²G²t + Qt(gt(xt) + ∇gt(xt)⊤(x − xt)) + Vt RFt + αt(D(x, xt) − D(x, xt+1)). Proof By Lemma 22, Δt ≤ Qt(gt(xt) + ∇gt(xt)⊤(xt+1 − xt)) + (1/2)(Ht + RGt)² ≤ Qt(gt(xt) + ∇gt(xt)⊤(xt+1 − xt)) + H²t + R²G²t where the last inequality comes from the fact that (A + B)² ≤ 2(A² + B²). Here, using Lemma 23, the right-hand side can be further bounded above as follows.
Qt(gt(xt) + gt(xt) (xt+1 xt)) + H2 t + R2G2 t Qt gt(xt) + gt(xt) (x xt) + Vt ft(xt) (x xt) Vt ft(xt) (xt+1 xt) αt D(xt+1, xt) + αt D(x, xt) D(x, xt+1) + H2 t + R2G2 t . Moreover, it follows from the Cauchy-Schwarz inequality that Vt ft(xt) (x xt) Vt ft(xt) (xt+1 xt) = Vt ft(xt) (x xt+1) Vt RFt by Cauchy-Schwarz inequality. Then we have proved the lemma, as desired. Based on Lemma 26, we may provide the following bound on the virtual queue size. Kim and Lee Lemma 27 For any x X, s [T], H2 t + R2G2 t + Vt RFt + 2 t=1 Qt (gt(xt) + gt(xt) (x xt)) + 2R2αs. Proof From Lemma 26, we get H2 t + R2G2 t + Qt gt(xt) + gt(xt) (x xt) + Vt RFt + α1D(x, x1) + t=2 D(x, xt)(αt αt 1) αs D(x, xt+1). Since αt is non-decreasing and D(x, xt) R2, it follows that α1D(x, x1) + t=2 D(x, xt)(αt αt 1) αs D(x, xt+1) R2αs. This implies the desired bound on Qs+1. Appendix B. Sum of Sequences In this section, we consider some series of numbers and provide bounds on their partial sums to make our paper self-contained. Given a sequence {xt} t=1 of numbers, we use notation Xs := Ps t=1 xt to denote its partial sums. Lemma 28 If f : R+ R+ is continuous and non-increasing, then t=1 f(Xt)xt x1f(X1) + Z XT X1 f(x)dx Z XT for any nonnegative x1, . . . , x T . Proof By considering the area between f(x) and the x-axis over the interval [Xt 1, Xt] of length xt, we have f(Xt)xt R Xt Xt 1 f(x)dx since f is non-increasing. Then it follows that t=1 f(Xt)xt = x1f(X1) + t=2 f(Xt)xt x1f(X1) + Z XT X1 f(x)dx Z XT as required. As a consequence of Lemma 28, we deduce the following list of bounds on series. Stochastic-Constrained Stochastic Optimization with Markovian Data Corollary 29 (Auer et al., 2002) Corollary 30 xt Xt 1 + log X1(XT ). Next we consider the following. Lemma 31 If f : R+ R+ is continuous and non-decreasing, then t=1 f(Xt)xt+1 Z XT +1 for any nonnegative x1, . . . , x T . 
Proof By considering the area between f(x) and the x-axis over the interval [Xt, Xt+1] of length xt+1, we have f(Xt)xt+1 R Xt+1 Xt f(x)dx since f is non-decreasing. Then it follows that t=1 f(Xt)xt+1 = x2f(X1) + t=2 f(Xt)xt+1 x2f(X1) + Z XT +1 X2 f(x)dx Z XT +1 as required. Lemma 31 implies the following. Corollary 32 For q > 0, Σ_{t=1}^{T} t^q ≤ (1/(q + 1))((T + 1)^{q+1} − 1). Moreover, when each xt is bounded by some fixed constant, we can deduce the following result. For ease of notation, we start a sequence with x0 and, with an abuse of notation, define the partial sum Xs = Σ_{t=0}^{s} xt with x0 = X0 = δ. Lemma 33 If 0 ≤ xt ≤ C for t = 0, . . . , T for some fixed constant C and f : R++ R+ is continuous and non-increasing, then t=1 f(Xt 1)xt Cf(δ) + Z max{δ,XT C} Proof If we consider the area between f(x) and the x-axis over the interval [Xt 1, Xt], we have f(Xt 1)xt R Xt Xt 1 f(x)dx in which the inequality direction is the opposite of what we want. However, since xt ≤ C, we can use the idea of translation by C in the x-axis direction in the following way. Let f̄(x) = f(δ) for x ∈ (−∞, δ] and f̄(x) = f(x) for x ∈ (δ, ∞). Then the graph of the translation f̄(x − C) lies above the rectangles of height f(Xt−1) over the intervals [Xt−1, Xt]. Thus, Σ_{t=1}^{T} f(Xt−1)xt ≤ ∫_{X0}^{XT} f̄(x − C)dx, which is at most (XT − δ)f(δ) ≤ Cf(δ) if XT ≤ δ + C, and at most Cf(δ) + ∫_{δ}^{XT−C} f(x)dx if XT > δ + C, which implies the desired statement of this lemma. As a corollary of Lemma 33 with f(x) = x^{−γ}, we deduce the following inequality. Corollary 34 If xt ≤ C and 0 < γ, γ ≠ 1, then Σ_{t=1}^{T} X^{−γ}_{t−1} xt ≤ Cδ^{−γ} + (1/(1 − γ))(max{δ, XT − C}^{1−γ} − δ^{1−γ}). Appendix C. Online Convex Optimization with Adversarial Losses and Constraints In this section, we analyze the performance of Algorithm 3, an AdaGrad-style variant of the drift-plus-penalty algorithm for online convex optimization with adversarial loss and constraint functions. To deal with adversarial constraint functions, we set the parameters Vt and αt differently as follows.
at := F²t/4 + R²G²t + H²t, St := δ + Σ_{i=1}^{t} ai, and then set parameters as Vt = S^β_t/R, αt = St/R² for some 0 < β ≤ 1/2. One distinction from (4) is the presence of the additional parameter δ, and another difference is that Vt and αt are defined with St, not St−1. We now assume that the convex constraint functions g1, . . . , gT as well as the convex loss functions f1, . . . , fT are chosen adversarially. Following Neely and Yu (2017), we set the benchmark x∗ as an optimal solution to min_{x∈X} Σ_{t=1}^{T} ft(x) subject to gt(x) ≤ 0 for t = 1, . . . , T. Then the goal is to obtain upper bounds on Regret(T) = Σ_{t=1}^{T} ft(xt) − Σ_{t=1}^{T} ft(x∗) and Violation(T) = Σ_{t=1}^{T} gt(xt) in sublinear orders of T by properly choosing our inputs xt. For this, we need the following theorem. Theorem 35 Algorithm 3 with Vt and αt set as in (12) guarantees that Regret(T) = O S1 β T , Violation(T) = O S1/2 T + T 1/4Sβ/2+1/4 T . Proof Applying Lemma 24 with f̂t = ft, ĝt = gt, and x = x∗, we obtain Regret(T) αT Vt F 2 t 4αt + 1 (Ht + Gt R)2 The first term of the right-hand side is RS1 β T = O(S1 β T ), and the second term satisfies Vt F 2 t 4αt = R t=1 Sβ 1 t F 2 t R t=1 Sβ 1 t at R β Sβ T = O Sβ T , where the last inequality follows from Lemma 28. The third term satisfies (Ht + Gt R)2 R2G2 t + H2 t Sβ t R at Sβ t R 1 β S1 β T = O S1 β T . Combining these inequalities, we get Regret(T) = O S1 β T . Next, we prove the second part of the theorem. By Lemma 25, we have t=1 gt(xt) QT+1 + Gt 2αt (Vt Ft + Qt Gt). If we apply Lemma 27 with x = x∗, we obtain H2 t + R2G2 t + Vt RFt + 2R2αT t=1 (R2G2 t + H2 t ) 2R2αT | {z } (c) where the first inequality follows from the convexity of gt and gt(x∗) ≤ 0. Here, term (c) is equal to 2ST , and term (b) is less than or equal to 2ST . For term (a), we have t=1 Sβ t Ft 2Sβ T t=1 Ft 2Sβ T t=1 F 2 t 4 TSβ+1/2 T , where the second inequality holds by the power mean inequality.
Thus,
$$Q_{T+1} \le \sqrt{(a) + (b) + (c)} \le 2 S_T^{1/2} + \sqrt{2}\, T^{1/4} S_T^{\beta/2+1/4} = O\left(S_T^{1/2} + T^{1/4} S_T^{\beta/2+1/4}\right). \qquad (13)$$
We also have that
$$\sum_{t=1}^{T} \frac{G_t V_t F_t}{2\alpha_t} = \sum_{t=1}^{T} S_t^{\beta-1}\, \frac{R F_t G_t}{2} \le \sum_{t=1}^{T} S_t^{\beta-1}\, \frac{F_t^2/4 + R^2 G_t^2}{2} \le \frac{1}{2}\sum_{t=1}^{T} S_t^{\beta-1} a_t \le \frac{S_T^\beta}{2\beta},$$
where the last inequality follows from Lemma 28. Lastly,
$$\sum_{t=1}^{T} \frac{Q_t G_t^2}{2\alpha_t} \le \sum_{t=1}^{T} \frac{G_t^2}{\alpha_t}\, S_t^{1/2} + \sum_{t=1}^{T} \frac{G_t^2}{\alpha_t}\, T^{1/4} S_t^{\beta/2+1/4} \le \sum_{t=1}^{T} \frac{a_t}{S_t^{1/2}} + T^{1/4} \sum_{t=1}^{T} \frac{a_t}{S_t^{3/4-\beta/2}} \le 2 S_T^{1/2} + \frac{T^{1/4}}{\beta/2 + 1/4}\, S_T^{\beta/2+1/4} = O\left(S_T^{1/2} + T^{1/4} S_T^{\beta/2+1/4}\right),$$
where the first inequality follows from (13) applied to $Q_t$ together with $G_t^2/\alpha_t = R^2 G_t^2/S_t \le a_t/S_t$, and the last inequality follows from Lemma 28. Combining the results (and using $\beta \le 1/2$, so that $S_T^\beta \le S_T^{1/2}$), we get $\mathrm{Violation}(T) = O\big(S_T^{1/2} + T^{1/4} S_T^{\beta/2+1/4}\big)$, as required.

Appendix D. Proof of the Time-Varying Drift Lemma

In this section, we prove Lemma 17 for the case of a time-varying parameter $\theta(t)$. We closely follow the proof of (Yu et al., 2017, Lemma 5).

Lemma 36 Let $r = \zeta/(4t_0\delta_{\max}^2)$ and $\rho = 1 - \zeta^2/(8\delta_{\max}^2) = 1 - r t_0 \zeta/2$. Then
$$\mathbb{E}\left[e^{r Z(t)}\right] \le \frac{e^{r\theta(t)}}{1-\rho}\, e^{r t_0 \delta_{\max}} \quad \text{for all } t \ge 0.$$

Proof Since $0 < \zeta < \delta_{\max}$, we have $0 < \rho < 1 < e^{r\delta_{\max}}$. Define $\eta(t) = Z(t + t_0) - Z(t)$. Note that $|\eta(t)| \le t_0\delta_{\max}$ for all $t \ge 0$, which implies that $|r\eta(t)| \le \zeta/(4\delta_{\max}) \le 1$. Then,
$$e^{r Z(t+t_0)} = e^{r Z(t)} e^{r\eta(t)} \le e^{r Z(t)}\left(1 + r\eta(t) + 2 r^2 t_0^2 \delta_{\max}^2\right) = e^{r Z(t)}\left(1 + r\eta(t) + \tfrac{1}{2} r t_0 \zeta\right), \qquad (14)$$
where the inequality follows from the fact that $e^x \le 1 + x + 2x^2$ for $|x| \le 1$, $|r\eta(t)| \le 1$, and $|\eta(t)| \le t_0\delta_{\max}$, while the equality follows by substituting $r = \zeta/(4t_0\delta_{\max}^2)$.

Next, we consider the cases $Z(t) \ge \theta(t)$ and $Z(t) < \theta(t)$ separately. First, consider the case $Z(t) \ge \theta(t)$. Taking the conditional expectation of each side of (14) gives us the following.
$$\mathbb{E}\left[e^{r Z(t+t_0)} \mid Z(t)\right] \le \mathbb{E}\left[e^{r Z(t)}\left(1 + r\eta(t) + \tfrac{1}{2} r t_0 \zeta\right) \,\Big|\, Z(t)\right] \le e^{r Z(t)}\left(1 - r t_0 \zeta + \tfrac{1}{2} r t_0 \zeta\right) = \rho\, e^{r Z(t)},$$
where the second inequality follows from the fact that $\mathbb{E}[Z(t + t_0) - Z(t) \mid \mathcal{F}(t)] \le -t_0\zeta$ when $Z(t) \ge \theta(t)$, while the equality follows from the fact that $\rho = 1 - r t_0 \zeta/2$.
Likewise, for the case $Z(t) < \theta(t)$, we deduce that
$$\mathbb{E}\left[e^{r Z(t+t_0)} \mid Z(t)\right] = e^{r Z(t)}\, \mathbb{E}\left[e^{r\eta(t)} \mid Z(t)\right] \le e^{r t_0 \delta_{\max}}\, e^{r Z(t)},$$
where the inequality follows from the fact that $\eta(t) \le t_0\delta_{\max}$. Putting the two cases together, we deduce that
$$\begin{aligned}
\mathbb{E}\left[e^{r Z(t+t_0)}\right] &= \mathbb{P}(Z(t) \ge \theta(t))\, \mathbb{E}\left[e^{r Z(t+t_0)} \mid Z(t) \ge \theta(t)\right] + \mathbb{P}(Z(t) < \theta(t))\, \mathbb{E}\left[e^{r Z(t+t_0)} \mid Z(t) < \theta(t)\right] \\
&\le \rho\, \mathbb{E}\left[e^{r Z(t)} \mid Z(t) \ge \theta(t)\right] \mathbb{P}(Z(t) \ge \theta(t)) + e^{r t_0 \delta_{\max}}\, \mathbb{E}\left[e^{r Z(t)} \mid Z(t) < \theta(t)\right] \mathbb{P}(Z(t) < \theta(t)) \\
&= \rho\, \mathbb{E}\left[e^{r Z(t)}\right] + \left(e^{r t_0 \delta_{\max}} - \rho\right) \mathbb{E}\left[e^{r Z(t)} \mid Z(t) < \theta(t)\right] \mathbb{P}(Z(t) < \theta(t)) \\
&\le \rho\, \mathbb{E}\left[e^{r Z(t)}\right] + \left(e^{r t_0 \delta_{\max}} - \rho\right) e^{r\theta(t)} \\
&\le \rho\, \mathbb{E}\left[e^{r Z(t)}\right] + e^{r t_0 \delta_{\max}} e^{r\theta(t)},
\end{aligned}$$
where the first inequality follows from the analysis of the two separate cases and the second inequality follows from the fact that $e^{r t_0 \delta_{\max}} > \rho$.

Then we argue by induction to prove the statement of this lemma. We first consider the base case $t \in \{0, 1, \ldots, t_0\}$. Since $Z(t) \le t\delta_{\max}$ for all $t \ge 0$, it follows that
$$\mathbb{E}\left[e^{r Z(t)}\right] \le e^{r t \delta_{\max}} \le e^{r t_0 \delta_{\max}} \le \frac{e^{r\theta(t)}}{1-\rho}\, e^{r t_0 \delta_{\max}}$$
for all $t \in \{0, 1, \ldots, t_0\}$, where the last inequality follows because $e^{r\theta(t)}/(1-\rho) \ge 1$. Next, we assume that the inequality holds for all $t \in \{0, 1, \ldots, s\}$ with some $s \ge t_0$ and consider iteration $t = s + 1$. Note that
$$\mathbb{E}\left[e^{r Z(s+1)}\right] \le \rho\, \mathbb{E}\left[e^{r Z(s+1-t_0)}\right] + e^{r t_0 \delta_{\max}} e^{r\theta(s+1-t_0)} \le \frac{\rho}{1-\rho}\, e^{r t_0 \delta_{\max}} e^{r\theta(s+1-t_0)} + e^{r t_0 \delta_{\max}} e^{r\theta(s+1-t_0)} = \frac{e^{r t_0 \delta_{\max}}}{1-\rho}\, e^{r\theta(s+1-t_0)} \le \frac{e^{r t_0 \delta_{\max}}}{1-\rho}\, e^{r\theta(s+1)},$$
where the second inequality comes from the induction hypothesis by noting that $0 \le s + 1 - t_0 \le s$, while the last inequality follows from the fact that $\theta(t)$ is non-decreasing.

Based on this lemma, we prove Lemma 17.

Proof [Proof of Lemma 17] Note that $e^{rx}$ is convex in $x$ when $r > 0$. By Jensen's inequality,
$$e^{r\, \mathbb{E}[Z(t)]} \le \mathbb{E}\left[e^{r Z(t)}\right] \le \frac{e^{r\theta(t)}}{1-\rho}\, e^{r t_0 \delta_{\max}},$$
where the second inequality is implied by Lemma 36. Taking the logarithm of both sides and dividing by $r$ yields that
$$\mathbb{E}[Z(t)] \le \theta(t) + t_0\delta_{\max} + \frac{1}{r}\log\frac{1}{1-\rho} = \theta(t) + t_0\delta_{\max} + \frac{4 t_0 \delta_{\max}^2}{\zeta}\log\frac{8\delta_{\max}^2}{\zeta^2},$$
where the equality holds by recalling that $r = \zeta/(4 t_0 \delta_{\max}^2)$ and $\rho = 1 - \zeta^2/(8\delta_{\max}^2)$.

Appendix E. Properties of the MLMC Estimator

In this section, we prove Lemmas 7 and 8.
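Before turning to the proofs, the two elementary computations behind the MLMC estimator can be made concrete in code. The following minimal Python sketch is our own illustration (the names `batch_means`, `mlmc_expected_value`, and `expected_num_samples` are not from the paper): it checks the telescoping identity that makes the truncated-geometric-level estimator match the finest-level sample mean in expectation, and the bound $\mathbb{E}[N_t] \le 1 + j_{\max}$ on the expected number of samples drawn.

```python
def mlmc_expected_value(batch_means, j_max):
    """Expectation of the MLMC estimator over the random level J_t.

    batch_means[j] plays the role of the sample mean over the first 2^j
    Markov-chain samples.  With P(J_t = j) = 2^{-j}, the expectation of
    batch_means[0] + 2^{J_t} (batch_means[J_t] - batch_means[J_t - 1])
    telescopes to batch_means[j_max].
    """
    est = batch_means[0]  # base level: mean of a single sample
    for j in range(1, j_max + 1):
        # weight P(J_t = j) = 2^{-j} cancels the 2^j debiasing factor
        est += (2.0 ** -j) * (2 ** j) * (batch_means[j] - batch_means[j - 1])
    return est


def expected_num_samples(j_max):
    """E[N_t] = 1 + sum_j P(J_t = j) (2^j - 1), which is at most 1 + j_max."""
    return 1 + sum((2.0 ** -j) * (2 ** j - 1) for j in range(1, j_max + 1))


# Any sequence of batch means telescopes to the finest level, while the
# expected sample count grows only linearly in j_max.
means = [1.0, 1.5, 1.25, 1.3]
assert abs(mlmc_expected_value(means, 3) - means[3]) < 1e-12
assert expected_num_samples(3) <= 1 + 3
```

The design point is that the $2^j$ multiplier exactly offsets the geometric sampling probability, so debiasing costs only a constant factor in expected samples rather than the $2^{j_{\max}}$ samples a full batch would require.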
Lemma 37 (Dorfman and Levy, 2022, Lemma A.6) Let $h : X \times S \to \mathbb{R}^k$ for some $k \ge 1$. Suppose that there exists some constant $L > 0$ such that $\|h(x,\xi)\| \le L$ for every $(x,\xi) \in X \times S$, where the norm satisfies $\|\cdot\| \le \eta\|\cdot\|_2$ for some $\eta > 0$. We denote by
$$\bar h(x) := \mathbb{E}_{\xi\sim\mu}[h(x,\xi)], \qquad h_t^N(x) := \frac{1}{N}\sum_{i=1}^{N} h\big(x, \xi_t^{(i)}\big).$$
Suppose that $x$ is $\mathcal{F}_{t-1}$-measurable and $N \le A$ for some $A \in \mathbb{N}$. If $2\tau_{\mathrm{mix}}\lceil 2\log A\rceil \le N$, then
$$\mathbb{E}_{t-1}\big\|h_t^N(x) - \bar h(x)\big\| \le 12 L \eta \sqrt{\frac{\tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}\, \log\big(\tau_{\mathrm{mix}}\lceil 2\log A\rceil\, N\big)} + \frac{6 L \eta\, \tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N},$$
$$\mathbb{E}_{t-1}\big\|h_t^N(x) - \bar h(x)\big\|^2 \le \frac{576 L^2 \eta^2 \tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}\Big(1 + \log\big(\tau_{\mathrm{mix}}\lceil 2\log A\rceil\, N\big)\Big) + \frac{72 L^2 \eta^2 \tau_{\mathrm{mix}}^2 \lceil 2\log A\rceil^2}{N^2}.$$

We point out that (Dorfman and Levy, 2022, Lemma A.6) was originally stated for the $\ell_2$ norm, but the statement holds for any norm over a finite-dimensional vector space satisfying $\|\cdot\| \le \eta\|\cdot\|_2$ for some fixed constant $\eta > 0$, where $O(\cdot)$ hides the dependence on $\eta$. In the following lemma, we simplify the upper bounds of Lemma 37.

Lemma 38 Let $h$, $L$, $\eta$, $\bar h$, and $h_t^N$ be as in Lemma 37. Suppose that $x$ is $\mathcal{F}_{t-1}$-measurable and $N \le A$ for some $A \in \mathbb{N}$. Then
$$\mathbb{E}_{t-1}\big\|h_t^N(x) - \bar h(x)\big\| \le 12 L \max(1,\eta) \sqrt{\frac{\tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}\, \log\big(\tau_{\mathrm{mix}}\lceil 2\log A\rceil\, N\big)},$$
$$\mathbb{E}_{t-1}\big\|h_t^N(x) - \bar h(x)\big\|^2 \le \frac{576 L^2 \max(1,\eta)^2 \tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}\Big(2 + \log\big(\tau_{\mathrm{mix}}\lceil 2\log A\rceil\, N\big)\Big).$$

Proof First, we consider the case where $2\tau_{\mathrm{mix}}\lceil 2\log A\rceil > N$. Then by the triangle inequality,
$$\mathbb{E}_{t-1}\big\|h_t^N(x) - \bar h(x)\big\| \le 2L < 2L\sqrt{\frac{2\tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}} \le 12 L \max(1,\eta) \sqrt{\frac{\tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}\, \log\big(\tau_{\mathrm{mix}}\lceil 2\log A\rceil\, N\big)},$$
$$\mathbb{E}_{t-1}\big\|h_t^N(x) - \bar h(x)\big\|^2 \le 4L^2 < 4L^2 \cdot \frac{2\tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N} \le \frac{576 L^2 \max(1,\eta)^2 \tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}\Big(2 + \log\big(\tau_{\mathrm{mix}}\lceil 2\log A\rceil\, N\big)\Big).$$

Next, we consider the case where $2\tau_{\mathrm{mix}}\lceil 2\log A\rceil \le N$.
By Lemma 37, the two bounds of that lemma apply. Since $\tau_{\mathrm{mix}}\lceil 2\log A\rceil/N \le 1/2$, we deduce that
$$\frac{6 L \eta\, \tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N} \le 6 L \eta \sqrt{\frac{\tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}}, \qquad \frac{72 L^2 \eta^2 \tau_{\mathrm{mix}}^2 \lceil 2\log A\rceil^2}{N^2} \le \frac{36 L^2 \eta^2 \tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N} \le \frac{576 L^2 \eta^2 \tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}.$$
As a result, we obtain
$$\mathbb{E}_{t-1}\big\|h_t^N(x) - \bar h(x)\big\| \le 12 L \max(1,\eta) \sqrt{\frac{\tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}\, \log\big(\tau_{\mathrm{mix}}\lceil 2\log A\rceil\, N\big)},$$
$$\mathbb{E}_{t-1}\big\|h_t^N(x) - \bar h(x)\big\|^2 \le \frac{576 L^2 \eta^2 \tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}\Big(2 + \log\big(\tau_{\mathrm{mix}}\lceil 2\log A\rceil\, N\big)\Big) \le \frac{576 L^2 \max(1,\eta)^2 \tau_{\mathrm{mix}}\lceil 2\log A\rceil}{N}\Big(2 + \log\big(\tau_{\mathrm{mix}}\lceil 2\log A\rceil\, N\big)\Big),$$
as required.

Proof [Proof of Lemma 7] We first argue that $\mathbb{E}_{t-1}[f_t(x)] = \mathbb{E}_{t-1}\big[f_t^{2^{j_{\max}}}(x)\big]$ for any $x$ and $t$. Note that
$$\mathbb{E}_{t-1}[f_t] = \mathbb{E}_{t-1}\left[f_t^1\right] + \sum_{j=1}^{j_{\max}} \mathbb{P}(J_t = j)\, 2^j\, \mathbb{E}_{t-1}\left[f_t^{2^j} - f_t^{2^{j-1}}\right] = \mathbb{E}_{t-1}\left[f_t^{2^{j_{\max}}}\right]$$
as $\mathbb{P}(J_t = j) = 1/2^j$, so the sum telescopes. Similarly, we can show that
$$\mathbb{E}_{t-1}[g_t] = \mathbb{E}_{t-1}\left[g_t^{2^{j_{\max}}}\right], \qquad \mathbb{E}_{t-1}[\nabla f_t] = \mathbb{E}_{t-1}\left[\nabla f_t^{2^{j_{\max}}}\right], \qquad \mathbb{E}_{t-1}[\nabla g_t] = \mathbb{E}_{t-1}\left[\nabla g_t^{2^{j_{\max}}}\right]$$
holds for any $x$ and $t$.

For the second part, we have
$$\mathbb{E}\left[|g_t|^2\right] \le 2\,\mathbb{E}\left[\big|g_t - g_t^1\big|^2\right] + 2H^2$$
since $|x + y|^2 \le (|x| + |y|)^2 \le 2|x|^2 + 2|y|^2$ and $|g_t^1| \le H$. Note that
$$\mathbb{E}\left[\big|g_t - g_t^1\big|^2\right] = \sum_{j=1}^{j_{\max}} \mathbb{P}(J_t = j)\, 2^{2j}\, \mathbb{E}\left[\big|g_t^{2^j} - g_t^{2^{j-1}}\big|^2\right] = \sum_{j=1}^{j_{\max}} 2^j\, \mathbb{E}\left[\big|g_t^{2^j} - g_t^{2^{j-1}}\big|^2\right]$$
because $\mathbb{P}(J_t = j) = 1/2^j$. Here we can bound the right-most side based on the following.
$$\mathbb{E}\left[\big|g_t^{2^j} - g_t^{2^{j-1}}\big|^2\right] \le 2\,\mathbb{E}\left[\big|g_t^{2^j} - \bar g\big|^2\right] + 2\,\mathbb{E}\left[\big|g_t^{2^{j-1}} - \bar g\big|^2\right] = O\left(\frac{\tau_{\mathrm{mix}}}{2^j}\right),$$
where the last equality follows from Lemma 38. Then it follows that
$$\mathbb{E}\left[\big|g_t - g_t^1\big|^2\right] \le O(j_{\max}\tau_{\mathrm{mix}}) = O(\tau_{\mathrm{mix}}\log T),$$
where the last equality holds because $j_{\max} = O(\log T)$. For the last part,
$$\mathbb{E}[N_t] = 1 + \sum_{j=1}^{j_{\max}} \mathbb{P}(J_t = j)\left(2^j - 1\right) \le 1 + j_{\max} \le 1 + 2\log_2 T,$$
as required.

Proof [Proof of Lemma 8] By Assumption 2, we can apply Lemma 38 to $g_t$, $\nabla g_t$, and $\nabla f_t$. Assumptions 1 and 2 imply that $|f_t(x_t)| \le J$ for some $J > 0$, so we can apply Lemma 38 to $f_t$ as well. Let $L = \max(F, G, H, J)$ and $A = T^2$, and take $\eta > 0$ with $\|\cdot\| \le \eta\|\cdot\|_2$. By Lemma 38, the statement follows with
$$C(T) = 576 L^2 \max(1,\eta)^2\, \lceil 4\log T\rceil \left(2 + \log\big(\tau_{\mathrm{mix}}\lceil 4\log T\rceil\, 2^{j_{\max}}\big)\right).$$
Since $2^{j_{\max}} = O(T^2)$, the order of $C(T)$ follows.

References

Z. Akhtar, A. S. Bedi, and K. Rajawat. Conservative stochastic optimization with expectation constraints. IEEE Transactions on Signal Processing, 69:3190–3205, 2021. doi: 10.1109/TSP.2021.3082467.

C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1):5–43, 2003. doi: 10.1023/A:1020281327116. URL https://doi.org/10.1023/A:1020281327116.

P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002. ISSN 0022-0000. doi: https://doi.org/10.1006/jcss.2001.1795. URL https://www.sciencedirect.com/science/article/pii/S0022000001917957.

G. Ayache, V. Dassari, and S. E. Rouayheb. Walk for learning: A random walk approach for federated learning from heterogeneous data. IEEE Journal on Selected Areas in Communications, 41(4):929–940, April 2023. ISSN 0733-8716. doi: 10.1109/JSAC.2023.3244250. URL https://doi.org/10.1109/JSAC.2023.3244250.

A. Benveniste, P. Priouret, and M. Métivier. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin, Heidelberg, 1990. ISBN 0387528946.

D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1st edition, 1996. ISBN 1886529108.

J. H. Blanchet and P. W. Glynn. Unbiased Monte Carlo for optimization and functions of expectations via multi-level randomization. In 2015 Winter Simulation Conference (WSC), pages 3656–3667, 2015. doi: 10.1109/WSC.2015.7408524.

V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Texts and Readings in Mathematics; 48. Hindustan Book Agency, Gurgaon, 1st edition, 2008. ISBN 93-86279-38-X.

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018. doi: 10.1137/16M1080173. URL https://doi.org/10.1137/16M1080173.

J. Cao, R. Jiang, N. Abolfazli, E. Y. Hamedani, and A.
Mokhtari. Projection-free methods for stochastic simple bilevel optimization with convex lower-level problem, 2023.

L. E. Celis, D. Straszak, and N. K. Vishnoi. Ranking with fairness constraints. In I. Chatzigiannakis, C. Kaklamanis, D. Marx, and D. Sannella, editors, 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018), volume 107 of Leibniz International Proceedings in Informatics (LIPIcs), pages 28:1–28:15, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. ISBN 978-3-95977-076-7. doi: 10.4230/LIPIcs.ICALP.2018.28. URL http://drops.dagstuhl.de/opus/volltexte/2018/9032.

L. E. Celis, L. Huang, V. Keswani, and N. K. Vishnoi. Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, pages 319–328, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361255. doi: 10.1145/3287560.3287586. URL https://doi.org/10.1145/3287560.3287586.

O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010. ISBN 0262514125.

M. Chen, L. Yang, M. Wang, and T. Zhao. Dimensionality reduction for stationary time series via stochastic nonconvex optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/d25414405eb37dae1c14b18d6a2cac34-Paper.pdf.

D. Dentcheva and A. Ruszczyński. Optimization with stochastic dominance constraints. SIAM Journal on Optimization, 14(2):548–566, 2003. doi: 10.1137/S1052623402420528. URL https://doi.org/10.1137/S1052623402420528.

T. T. Doan. Finite-time analysis of Markov gradient descent. IEEE Transactions on Automatic Control, 68(4):2140–2153, 2023.
doi: 10.1109/TAC.2022.3172593.

T. T. Doan, L. M. Nguyen, N. H. Pham, and J. Romberg. Convergence rates of accelerated Markov gradient descent with applications in reinforcement learning, 2020.

M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 2796–2806, Red Hook, NY, USA, 2018. Curran Associates Inc.

R. Dorfman and K. Y. Levy. Adapting to mixing time in stochastic optimization with Markovian data. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5429–5446. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/dorfman22a.html.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011. URL http://jmlr.org/papers/v12/duchi11a.html.

J. C. Duchi, A. Agarwal, M. Johansson, and M. I. Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578, 2012. doi: 10.1137/110836043. URL https://doi.org/10.1137/110836043.

Y. Ermoliev. Stochastic quasigradient methods and their application to system optimization. Stochastics, 9(1-2):1–36, 1983. doi: 10.1080/17442508308833246.

M. Even. Stochastic gradient descent under Markovian sampling schemes. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 9412–9439. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/even23a.html.

J. García and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(42):1437–1480, 2015.
URL http://jmlr.org/papers/v16/garcia15a.html.

M. B. Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259–328, 2015. doi: 10.1017/S096249291500001X.

H. Guo, X. Liu, H. Wei, and L. Ying. Online convex optimization with hard constraints: Towards the best of two worlds and beyond. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=rwdpFgfVpvN.

H. Hendrikx. A principled framework for the design and analysis of token algorithms. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 470–489. PMLR, 25–27 Apr 2023. URL https://proceedings.mlr.press/v206/hendrikx23a.html.

A. Jalilzadeh, F. Yousefian, and M. Ebrahimi. Stochastic approximation for estimating the price of stability in stochastic Nash games, 2023.

R. Jenatton, J. Huang, and C. Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 402–411, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/jenatton16.html.

B. Johansson, M. Rabi, and M. Johansson. A simple peer-to-peer algorithm for distributed optimization in sensor networks. In 2007 46th IEEE Conference on Decision and Control, pages 4705–4710, 2007. doi: 10.1109/CDC.2007.4434888.

B. Johansson, A. Speranzon, M. Johansson, and K. H. Johansson. On decentralized negotiation of optimal consensus. Automatica, 44(4):1175–1179, 2008. ISSN 0005-1098. doi: https://doi.org/10.1016/j.automatica.2007.09.003. URL https://www.sciencedirect.com/science/article/pii/S0005109807003962.

B. Johansson, M. Rabi, and M. Johansson.
A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2010. doi: 10.1137/08073038X. URL https://doi.org/10.1137/08073038X.

A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011. doi: 10.1287/10-SSY011. URL https://doi.org/10.1287/10-SSY011.

S. Kowshik, D. Nagaraj, P. Jain, and P. Netrapalli. Streaming linear system identification with reverse experience replay. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 30140–30152. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/fd2c5e4680d9a01dba3aada5ece22270-Paper.pdf.

S. Kumar and P. Sarkar. Streaming PCA for Markovian data. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=FQGRkwmRzm.

G. Lan and Z. Zhou. Algorithms for stochastic optimization with function or expectation constraints. Computational Optimization and Applications, 76(2):461–498, 2020. doi: 10.1007/s10589-020-00179-x. URL https://doi.org/10.1007/s10589-020-00179-x.

D. Lee, N. Ho-Nguyen, and D. Lee. Projection-free online convex optimization with stochastic constraints, 2023.

D. Levin and Y. Peres. Markov Chains and Mixing Times. MBK. American Mathematical Society, 2017. ISBN 9781470429621. URL https://books.google.com/books?id=f208DwAAQBAJ.

K. Levy. Online to offline conversions, universality and adaptive minibatch sizes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ce5140df15d046a66883807d18d0264b-Paper.pdf.

Q. Lin, S. Nadarajah, N. Soheili, and T. Yang. A data efficient and feasible level set method for stochastic convex optimization with expectation constraints. Journal of Machine Learning Research, 21(143):1–45, 2020. URL http://jmlr.org/papers/v21/19-1022.html.

C. G. Lopes and A. H. Sayed. Incremental adaptive strategies over distributed networks. IEEE Transactions on Signal Processing, 55(8):4064–4077, 2007. doi: 10.1109/TSP.2007.896034.

M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: Online convex optimization with long term constraints. Journal of Machine Learning Research, 13(81):2503–2528, 2012. URL http://jmlr.org/papers/v13/mahdavi12a.html.

X. Mao, K. Yuan, Y. Hu, Y. Gu, A. H. Sayed, and W. Yin. Walkman: A communication-efficient random-walk algorithm for decentralized optimization. IEEE Transactions on Signal Processing, 68:2513–2528, 2020. doi: 10.1109/TSP.2020.2983167.

D. Nagaraj, X. Wu, G. Bresler, P. Jain, and P. Netrapalli. Least squares regression with Markovian data: Fundamental limits and algorithms. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 16666–16676. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/c22abfa379f38b5b0411bc11fa9bf92f-Paper.pdf.

M. J. Neely and H. Yu. Online convex optimization with time-varying constraints, 2017. URL https://arxiv.org/abs/1702.04783.

A. Nemirovski and A. Shapiro. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969–996, 2007. doi: 10.1137/050622328. URL https://doi.org/10.1137/050622328.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. doi: 10.1137/070704277.
URL https://doi.org/10.1137/070704277.

G. Pflug. The Interface Between Simulation and Optimization. Springer, 1996.

B. Poljak and J. Tsypkin. Robust identification. Automatica, 16(1):53–63, 1980. ISSN 0005-1098. doi: https://doi.org/10.1016/0005-1098(80)90086-2. URL https://www.sciencedirect.com/science/article/pii/0005109880900862.

B. Polyak. A general method of solving extremum problems. Doklady Akademii Nauk SSSR, 174(1):33–36, 1967.

M. Rabbat and R. Nowak. Distributed optimization in sensor networks. In Third International Symposium on Information Processing in Sensor Networks (IPSN 2004), pages 20–27, 2004. doi: 10.1145/984622.984626.

S. S. Ram, A. Nedić, and V. V. Veeravalli. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691–717, 2009a. doi: 10.1137/080726380. URL https://doi.org/10.1137/080726380.

S. S. Ram, A. Nedić, and V. V. Veeravalli. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691–717, 2009b. doi: 10.1137/080726380. URL https://doi.org/10.1137/080726380.

P. Rigollet and X. Tong. Neyman-Pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 12(86):2831–2855, 2011. URL http://jmlr.org/papers/v12/rigollet11a.html.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951. doi: 10.1214/aoms/1177729586. URL https://doi.org/10.1214/aoms/1177729586.

R. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–41, 2000.

A. Roy, K. Balasubramanian, and S. Ghadimi. Constrained stochastic nonconvex optimization with state-dependent Markov data. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 23256–23270. Curran Associates, Inc., 2022.
URL https://proceedings.neurips.cc/paper_files/paper/2022/file/93b11b5128ced940120f41ce9b216f39-Paper-Conference.pdf.

A. Ruszczyński and W. Syski. A method of aggregate stochastic subgradients with on-line stepsize rules for convex stochastic programming problems, pages 113–131. Springer Berlin Heidelberg, Berlin, Heidelberg, 1986. ISBN 978-3-642-00927-3. doi: 10.1007/BFb0121128. URL https://doi.org/10.1007/BFb0121128.

T. Sarkar and A. Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5610–5618. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/sarkar19a.html.

C. Scott and R. Nowak. A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11):3806–3819, 2005. doi: 10.1109/TIT.2005.856955.

T. Sun, Y. Sun, and W. Yin. On Markov chain gradient descent. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/1371bccec2447b5aa6d96d2a540fb401-Paper.pdf.

T. Sun, Y. Sun, Y. Xu, and W. Yin. Markov chain block coordinate descent. Computational Optimization and Applications, 75(1):35–61, 2020.

T. Sun, D. Li, and B. Wang. Adaptive random walk gradient descent for decentralized optimization. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 20790–20809. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/sun22b.html.

T. Sun, D. Li, and B. Wang.
On the decentralized stochastic gradient descent with Markov chain sampling. IEEE Transactions on Signal Processing, 71:2895–2909, 2023. doi: 10.1109/TSP.2023.3297053.

P. Wang, Y. Lei, Y. Ying, and D.-X. Zhou. Stability and generalization for Markov chain stochastic gradient methods. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 37735–37748. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/f61538f83b0f19f9306d9d801c15f41c-Paper-Conference.pdf.

R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6677–6686. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/ward19a.html.

X. Wei, H. Yu, and M. J. Neely. Online primal-dual mirror descent under stochastic constraints. Proc. ACM Meas. Anal. Comput. Syst., 4(2), June 2020. doi: 10.1145/3392157. URL https://doi.org/10.1145/3392157.

G. Wolfer and A. Kontorovich. Estimating the mixing time of ergodic Markov chains. In A. Beygelzimer and D. Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 3120–3159. PMLR, 25–28 Jun 2019. URL https://proceedings.mlr.press/v99/wolfer19a.html.

X. Xiao. Penalized stochastic gradient methods for stochastic convex optimization with expectation constraints. Technical report, September 2019. URL https://optimization-online.org/?p=15973.

Y. Yan and Y. Xu. Adaptive primal-dual stochastic gradient method for expectation-constrained convex stochastic programs. Mathematical Programming Computation, 14(2):319–363, 2022. doi: 10.1007/s12532-021-00214-w. URL https://doi.org/10.1007/s12532-021-00214-w.

Z. Yang, Y. Lei, P. Wang, T.
Yang, and Y. Ying. Simple stochastic and online gradient descent algorithms for pairwise learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 20160–20171. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/a87d27f712df362cd22c7a8ef823e987-Paper.pdf.

S. Yao and B. Huang. Beyond parity: Fairness objectives for collaborative filtering. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/e6384711491713d29bc63fc5eeb5ba4f-Paper.pdf.

E. Yazdandoost Hamedani, A. Jalilzadeh, and N. S. Aybat. Randomized primal-dual methods with adaptive step sizes. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 11185–11212. PMLR, 25–27 Apr 2023. URL https://proceedings.mlr.press/v206/yazdandoost-hamedani23a.html.

X. Yi, X. Li, T. Yang, L. Xie, T. Chai, and K. Johansson. Regret and cumulative constraint violation analysis for online convex optimization with long term constraints. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11998–12008. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yi21b.html.

H. Yu, M. Neely, and X. Wei. Online convex optimization with stochastic constraints. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
URL https://proceedings.neurips.cc/paper/2017/file/da0d1111d2dc5d489242e60ebcbaf988-Paper.pdf.

J. Yuan and A. Lamperski. Online convex optimization for cumulative constraints. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/9cb9ed4f35cf7c2f295cc2bc6f732a84-Paper.pdf.

M. B. Zafar, I. Valera, M. Gomez-Rodriguez, and K. P. Gummadi. Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 20(75):1–42, 2019. URL http://jmlr.org/papers/v20/18-262.html.

L. Zhang, Y. Zhang, J. Wu, and X. Xiao. Solving stochastic optimization with expectation constraints efficiently by a stochastic augmented Lagrangian-type algorithm. INFORMS Journal on Computing, 34(6):2989–3006, 2022. doi: 10.1287/ijoc.2022.1228. URL https://doi.org/10.1287/ijoc.2022.1228.

L. Zhang, Y. Zhang, X. Xiao, and J. Wu. Stochastic approximation proximal method of multipliers for convex stochastic programming. Mathematics of Operations Research, 48(1):177–193, 2023. doi: 10.1287/moor.2022.1257. URL https://doi.org/10.1287/moor.2022.1257.

R. Zhao. Accelerated stochastic algorithms for convex-concave saddle-point problems. Mathematics of Operations Research, 47(2):1443–1473, 2022. doi: 10.1287/moor.2021.1175. URL https://doi.org/10.1287/moor.2021.1175.

Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra. Federated learning with non-IID data, 2018. URL https://arxiv.org/abs/1806.00582.