Journal of Machine Learning Research 26 (2025) 1-55. Submitted 4/24; Revised 7/25; Published 10/25.

Online Quantile Regression

Yinan Shen (YINANSHE@USC.EDU)
Department of Mathematics, University of Southern California, Los Angeles, CA 90089, USA

Dong Xia (MADXIA@UST.HK)
Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong SAR, China

Wen-Xin Zhou (WENXINZ@UIC.EDU)
Department of Information and Decision Sciences, University of Illinois Chicago, Chicago, IL 60607, USA

Editor: Genevera Allen

Abstract

This paper addresses the challenge of integrating sequentially arriving data into the quantile regression framework, where the number of features may increase with the number of observations, the time horizon is unknown, and memory resources are limited. Unlike least squares and robust regression methods, quantile regression models different segments of the conditional distribution, thereby capturing heterogeneous relationships between predictors and responses and providing a more comprehensive view of the underlying stochastic structure. We employ stochastic sub-gradient descent to minimize the empirical check loss and analyze its statistical properties and regret behavior. Our analysis reveals a subtle interplay between updating iterates based on individual observations and on batches of observations, highlighting distinct regularity characteristics in each setting. The proposed method guarantees long-term optimal estimation performance regardless of the chosen update strategy. Our contributions extend the existing literature by establishing exponential-type concentration inequalities and by achieving optimal regret and error rates that exhibit only short-term sensitivity to initialization. A key insight from our study lies in the refined statistical analysis showing that properly chosen stepsize schemes substantially mitigate the influence of initial errors on subsequent estimation and regret.
This result underscores the robustness of stochastic sub-gradient descent in managing initial uncertainties and affirms its effectiveness in sequential learning settings with unknown horizons and data-dependent sample sizes. Furthermore, when the initial estimation error is well-controlled, our analysis reveals a trade-off between short-term error reduction and long-term optimality. For completeness, we also discuss the squared loss case and outline appropriate update schemes, whose analysis requires additional care. Extensive simulation studies corroborate our theoretical findings.

Keywords: online learning, quantile regression, regret, stochastic sub-gradient descent

© 2025 Yinan Shen, Dong Xia and Wen-Xin Zhou. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0589.html.

1. Introduction

Online learning aims to efficiently incorporate sequentially arriving data and make timely predictions. In contrast to offline learning, where all data are collected and stored prior to analysis, online learning processes data in a sequential manner without requiring access to the entire sample at once, thereby alleviating both storage and computational burdens. As a result, online methods are particularly well-suited for large-scale datasets (Le Cun et al., 1989; Rajalakshmi et al., 2019; Finn et al., 2019), streaming asset pricing data (Soleymani and Paquet, 2020), and the increasingly prominent domain of reinforcement learning (Gao et al., 2019; Han et al., 2022; Ren and Zhou, 2023). Broader treatments and further applications of online learning can be found in Bottou (1999), Cesa-Bianchi and Lugosi (2006), Hoffman et al. (2010), Hazan (2016), and Orabona (2019).
In classical offline linear regression, estimation and inference are based on a pre-collected sample of independent observations $\{(X_i, Y_i)\}_{i=1}^n$ satisfying $Y_i = X_i^\top \beta^* + \xi_i$, where $\beta^* \in \mathbb{R}^d$ is the unknown parameter of interest, and $\xi_i$ denotes the random noise variable, satisfying $\mathbb{E}(\xi_i \mid X_i) = 0$. It is well known that the ordinary least squares estimator $\hat\beta$ achieves the minimax rate of convergence, i.e., $\mathbb{E}\|\hat\beta - \beta^*\|_2^2 = O(\sigma^2 d/n)$, where $\sigma^2$ is such that $\mathbb{E}(\xi_i^2 \mid X_i) \le \sigma^2$ almost surely. In contrast, the entire sample is inaccessible in the online setting, and the total number of observations may even be unknown. Specifically, at time $t$, only $n_t$ pairs of observations can be used:

$$Y_i^{(t)} = \langle X_i^{(t)}, \beta^* \rangle + \xi_i^{(t)}, \quad i = 1, \dots, n_t. \tag{1}$$

While the data are received sequentially, online learning refrains from repeatedly using the data for model refitting. Instead, it sequentially updates the estimator. In contrast, offline methods leverage the entire sample, and based on the available data, the offline method (He and Shao, 2000; Koenker, 2005) leads to a statistically optimal estimator with mean squared error of order $O(\sigma^2 d / \sum_{t=0}^{T} n_t)$, where $T$ is called the horizon. This prompts the question of whether online methods can achieve error rates comparable to their offline counterparts. Must we compromise accuracy for more efficient computation and reduced storage in online learning? Furthermore, with the accumulation of more information over time, the theoretical offline error rates decline. Is it feasible to sustain this downward trend in optimal error rates over time? Particularly when the horizon (maximum time $T$) is unknown, isolated error rates at specific times lack persuasiveness. In online learning, a learner must generate timely predictions as data arrive sequentially, either individually or in batches.
The concept of regret, first introduced by Savage (1951) and later popularized in the online learning literature (e.g., Goldenshluger and Zeevi, 2013; Han et al., 2020; Ren and Zhou, 2023), quantifies cumulative prediction error over time and serves as a key measure of learning performance. The regret at horizon $T$ is defined as

$$\mathrm{Regret}_T := \mathbb{E} \sum_{t=0}^{T} \big[ f_t(\beta_t) - f_t(\beta^*) \big], \tag{2}$$

where $f_t(\cdot)$ and $\beta_t$ denote the loss function and estimator at time $t$, respectively. The expectation is taken over the data from $t = 0$ to $T$. Regret serves as a metric to evaluate the efficacy of a sequential estimation scheme. Optimal regret occurs when error rates exhibit a scaling behavior of $1/t$ as time $t$ becomes sufficiently large (Hazan, 2016). In contrast, Cai et al. (2023) assumes a known horizon and focuses more on the ultimate accuracy of estimation. However, this setting bears resemblance to the sample splitting technique (Xia and Yuan, 2021; Zou et al., 2020). Online estimation under the squared loss with the regret criterion (2) has been extensively studied over the past two decades; see, for example, Bottou (1999), Zinkevich (2003), Hazan et al. (2007b), and Langford et al. (2009), among others. Stochastic gradient descent stands out as a natural and elegant methodology for processing sequentially arriving data, integrating streaming information, and minimizing the regret in (2). Bottou (1999) lays the foundation for the general theoretical framework of online learning, establishing the asymptotic convergence of gradient descent toward stationary points. Under suitable conditions, the seminal work of Zinkevich (2003) proves the iterative convergence of gradient descent with a stepsize of order $O(t^{-1/2})$. Subsequently, Hazan et al. (2007a) demonstrates that using a stepsize of order $O(t^{-1})$ yields an optimal regret bound of order $O(\log T)$.
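To make the regret metric in (2) concrete, the following toy simulation (our own illustration, not code from the paper) tracks the cumulative excess loss of online gradient descent under the squared loss with an $O(1/t)$ stepsize, in the spirit of the strongly convex setting of Hazan et al. (2007a); all constants and names are our choices.

```python
import numpy as np

# Regret as in (2): cumulative excess loss of the iterate over the oracle,
# for f_t(b) = (Y_t - <X_t, b>)^2 / 2 and stepsize eta_t = 1/t.
rng = np.random.default_rng(0)
d, T = 5, 2000
beta_star = np.ones(d)                       # hypothetical true parameter
beta = np.zeros(d)                           # arbitrary initialization
regret = 0.0
for t in range(1, T + 1):
    X = rng.standard_normal(d)
    Y = X @ beta_star + 0.1 * rng.standard_normal()
    loss = lambda b: 0.5 * (Y - X @ b) ** 2
    regret += loss(beta) - loss(beta_star)   # per-round excess loss
    beta -= (X @ beta - Y) * X / t           # gradient step with eta_t = 1/t
print(regret)                                # grows like O(log T), not linearly in T
```

Under this stepsize the per-round excess loss scales like $1/t$ for large $t$, so the cumulative regret is dominated by the early rounds plus a logarithmic tail.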
The pioneering work of Duchi and Singer (2009) further examined the performance of iterative sub-gradient descent applied to non-differentiable loss functions, offering insights into convergence dynamics and establishing regret bounds under various stepsize schemes. We refer to Hazan et al. (2007b), Langford et al. (2009), Streeter and McMahan (2010), McMahan and Streeter (2010), Duchi et al. (2011), Bach and Moulines (2013), Orabona (2019), and the references therein for additional gradient descent-based algorithms in online learning. In the offline learning context, stochastic (sub-)gradient descent serves as an effective approach to mitigate computational burdens (Zhang, 2004). This method updates the current iterate using only a single, randomly selected observation pair, rather than the entire sample. Rakhlin et al. (2011) introduces a stepsize scheme of order $O(1/t)$, yielding a sub-linear convergence rate comparable to that of its online counterpart under a strongly convex objective. In offline stochastic optimization, the total sample size is typically known, allowing such information to be incorporated into the stepsize design and ensuring that the statistical optimality remains stable over time. Examples of constant stepsize schemes that explicitly account for the learning horizon can be found in Duchi and Singer (2009), Cai et al. (2023), and Puchkin et al. (2024). Other studies, such as Delyon and Juditsky (1993), Roux et al. (2012), and Johnson and Zhang (2013), focus on accelerating the sub-linear convergence rate of stochastic descent methods. While both offline stochastic gradient descent and online learning alleviate computational pressures, they differ fundamentally in nature. In online learning, the time horizon and the total number of observations are unknown, and may even be infinite, necessitating procedures that achieve dynamic and statistically optimal error rates over time.
Moreover, an online learner often must generate predictions for newly arriving covariate-only inputs, with predictive accuracy evaluated through the notion of regret. The statistical analysis of the squared loss remains a delicate yet crucial area of study. Although online learning and (stochastic) gradient descent methods have been extensively investigated from an optimization perspective for both squared and non-differentiable loss functions, most of the existing literature assumes that the empirical loss possesses certain regularity properties, such as strong convexity and/or smoothness. Moreover, the feasible domain is often restricted to a bounded convex set (Hazan et al., 2007a; Duchi and Singer, 2009; Langford et al., 2009; Rakhlin et al., 2011). Comprehensive overviews of online learning algorithms from an optimization standpoint can be found in Hazan (2016) and Orabona (2019). However, empirical strong convexity may fail to hold even under the squared loss when the available storage size is limited. Addressing this issue calls for refined statistical analysis that explicitly accounts for the interplay between data dependence, sample size, and memory constraints; see, for example, Chen et al. (2019), Han et al. (2024), and the references therein. Fan et al. (2018) further demonstrates that an expected error rate of order O(log(T)/T) is attainable for sparse online regression under a stepsize scheme of order O(1/t), yielding an optimal O(log T) regret. The preceding statistical online frameworks exhibit notable limitations when applied to quantile regression, often yielding suboptimal error rates and weakened guarantees on success probability, particularly in the presence of heavy-tailed noise. Existing studies on renewable or online quantile regression primarily analyze asymptotic performance within batch learning frameworks, where the batch size diverges over time.
A common feature of these methods is their reliance on storing historical summary statistics, which incurs a memory cost of order $O(d^2)$ (Jiang and Yu, 2022; Sun et al., 2023; Chen and Yuan, 2024). In contrast, the present work complements prior studies by developing online quantile regression methods that eliminate the need to store either historical data or summary statistics. For comparison, Section 3.1 provides a parallel discussion of online least squares methods and their corresponding stepsize schemes. To estimate conditional quantile models sequentially, we adopt a simple yet effective sub-gradient descent approach. Through refined statistical analysis, we design a stepsize sequence that either exhibits geometric decay, remains constant, or decays at the rate of $1/t$, depending on the proximity of the current iterate to the oracle solution. This design departs from the scheme proposed by Duchi and Singer (2009); a detailed comparison between the two approaches is provided in Section 3.1, highlighting the advantages of the proposed statistically informed stepsize rule. Under heavy-tailed noise, we further establish exponential-type tail bounds for the proposed online estimators, improving upon the polynomial-type guarantees in Han et al. (2022). This enhancement in tail behavior underscores the robustness and reliability of our estimators when applied to heavy-tailed data environments. In this work, we refer to online learning as a setting in which data arrive sequentially over time. The complete dataset and the total number of observations are never simultaneously available, regardless of how many observations can be processed at each step. Under the linear model (1), we consider three representative scenarios: (i) the newly arriving dataset $\mathcal{D}_t := \{(X_i^{(t)}, Y_i^{(t)})\}_{i=1}^{n_t}$ contains only a single observation pair, requiring $O(d)$ storage; (ii) $\mathcal{D}_t$ contains at least $O(d)$ samples; and (iii) the server can store an unbounded number of samples.
The first setting has been extensively examined in the literature (Duchi and Singer, 2009; Hazan et al., 2007a; Hazan, 2016; Cai et al., 2023; Langford et al., 2009; Bach and Moulines, 2013; Mhammedi et al., 2019; Han et al., 2022), yet it continues to demand careful statistical treatment, as discussed earlier. Do et al. (2009) further studies the case where data arrive in batches at each iteration. From an optimization standpoint, however, there is little distinction between single-sample and batch arrivals, since regularity conditions such as smoothness or strong convexity are typically assumed. From a statistical perspective, by contrast, we will demonstrate that these two settings exhibit markedly different behaviors. Specifically, this paper seeks to address the following fundamental questions: Is it possible to achieve a continuously decreasing error rate as new data arrive under limited storage? Can statistical optimality be maintained along the learning trajectory? How much information is lost relative to the offline benchmark? In online learning, how do failure probabilities accumulate across iterations, and how do they determine the maximal horizon for guaranteed convergence? What are the implications for per-iteration success probability, especially under heavy-tailed noise and single-sample updates? Finally, what role does initialization play? Does the initial error necessarily have a long-term influence on subsequent errors and regret, or can arbitrary initialization be tolerated? We summarize our main contributions addressing these questions as follows.

1. This paper develops a statistical analysis framework for online quantile regression, addressing the challenges arising from the empirical loss function's lack of strong convexity and smoothness. To achieve long-term statistical optimality under heavy-tailed noise, we design a stepsize scheme informed by the statistical regularity properties of the quantile loss.
Unlike prior studies, the derived error rates explicitly scale with the noise level, capturing the intrinsic stochastic complexity of the problem. Furthermore, the proposed online approach deliberately trades a marginal loss in statistical efficiency for substantial gains in computational scalability and memory efficiency relative to offline methods. As demonstrated in Section 3.1, this slight reduction in accuracy becomes asymptotically negligible as the time horizon grows.

2. Our algorithm attains statistically optimal rates even under heavy-tailed noise and admits a high-probability guarantee, with the failure probability decaying exponentially fast in the dimension. This property ensures the validity of our theoretical results even when the unknown horizon grows exponentially with dimension. For instance, when only a single datum arrives at each iteration, the failure probability associated with our established error rate is $O(\exp(-c_0 d))$, where $c_0 > 0$ is a universal constant. Consequently, the horizon can be as large as $T = O(\exp(c_0 d))$ without compromising the theoretical guarantees. In sharp contrast, the results in Han et al. (2022) and Cai et al. (2023) hold with probability $1 - d^{-O(1)}$ under sub-Gaussian noise, implying that their admissible horizon is limited to $T = O(d^{O(1)})$.

3. Our analysis demonstrates that the influence of the initial error on both estimation accuracy and regret is merely transient. Consequently, the derived statistical error bounds remain valid for arbitrary initializations, without requiring the initial estimator to lie within a compact region. This robustness is achieved through refined statistical analyses and the design of a novel stepsize scheme. Stochastic (sub-)gradient descent methods for both quantile and squared losses benefit substantially from stepsize schedules informed by underlying statistical properties.
For instance, in the setting where $\xi_i^{(t)} \sim N(0, \sigma^2)$ and $n_t = 1$ in (1), we establish the regret bound $\sum_{t=0}^{T} \mathbb{E}\|\beta_t - \beta^*\|_2^2 \le O(\|\beta_0 - \beta^*\|_2^2 + \sigma^2 \log T)$.

The remainder of this paper is organized as follows. Section 2 introduces the quantile (check) loss function and the online sub-gradient descent algorithm. It further examines the convergence dynamics and regret bounds, with detailed discussions of the three distinct settings considered. Section 3.1 presents extensive numerical experiments, including evaluations of the proposed stepsize scheme, comparisons of accuracy with offline regression methods, and analyses of convergence behavior. A brief discussion of online least squares from a statistical perspective is also provided. The proof of Theorem 1 appears in Section 4, while the proofs of the remaining theorems are deferred to the supplementary material.

2. Online quantile regression via adaptive sub-gradient descent

Throughout this section, we consider the conditional quantile model (1); that is, the conditional $\tau$-th quantile of $\xi_i^{(t)}$ given $X_i^{(t)}$ is zero for some predetermined $\tau \in (0, 1)$. Quantile regression (QR) plays a crucial role in unraveling pathways of dependence between the outcome and a collection of features, which remain elusive through conditional mean regression analysis, such as least squares estimation. Since its introduction by Koenker and Bassett Jr (1978), QR has undergone extensive study, encompassing theoretical understanding, methodological development for various models and data types, and practical applications in a wide range of fields. The main focus of the existing literature centers on the formulation of methodologies and theories for QR utilizing static data, where a complete dataset is available. This is referred to as the offline setting in the online learning literature. The most fundamental and well-studied method for estimating $\beta^*$ involves empirical risk minimization, or statistical M-estimation.
The associated loss function $\rho_{Q,\tau}(x) := \tau x \cdot I(x \ge 0) + (\tau - 1) x \cdot I(x < 0)$ is known as the check loss or quantile loss. We refer to Koenker (2005) and Koenker et al. for a comprehensive exploration of offline quantile regression, covering statistical theory, computational methods, as well as diverse extensions and applications. In contrast to the least squares method, the non-smooth nature of the quantile loss introduces additional challenges to QR, particularly in the era of big data. Established algorithms employed for this purpose include the simplex algorithm (Koenker and d'Orey, 1987), interior point methods with preprocessing (Portnoy and Koenker, 1997), the alternating direction method of multipliers, and proximal methods (Parikh and Boyd, 2014). More recently, Fernandes et al. (2021) and He et al. (2023) have shown that convolution-smoothed quantile regression attains desired asymptotic and non-asymptotic properties, provided that the smoothing parameter (bandwidth) is appropriately selected as a function of the sample size and dimensionality. When addressing low-rank matrix regression or completion problems under either absolute or quantile loss, recent studies, including Cambier and Absil (2016), Li et al. (2020), Charisopoulos et al. (2021), Tong et al. (2021), Ma and Fattahi (2023), and Shen et al. (2023), have conducted extensive investigations of sub-gradient descent methods from both computational and statistical perspectives. The offline methods mentioned above crucially hinge on specific regularity properties inherent in the empirical loss function. However, in the online setting, characterized by sequential data arrival and an unknown total number of observations, the assurance of offline regularity properties becomes untenable with only a limited number of available samples. Consequently, neither the offline results nor the corresponding proof techniques are applicable in this online context.
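For concreteness, the check loss $\rho_{Q,\tau}$ defined above is straightforward to implement; the following NumPy sketch is our own illustration, not code from the paper.

```python
import numpy as np

def check_loss(x, tau):
    """Check (quantile) loss: rho_{Q,tau}(x) = tau*x*I(x >= 0) + (tau - 1)*x*I(x < 0)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, tau * x, (tau - 1.0) * x)

# Pinball shape: nonnegative, zero at the origin, asymmetric slopes tau and tau - 1.
print(check_loss(2.0, 0.25))   # 0.25 * 2 = 0.5
print(check_loss(-2.0, 0.25))  # (0.25 - 1) * (-2) = 1.5
```

Minimizing $\mathbb{E}\,\rho_{Q,\tau}(Y - \theta)$ over $\theta$ recovers the $\tau$-th quantile of $Y$, which is what makes this loss the quantile analogue of the squared loss in mean regression.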
As an illustration, in an offline setting based on model (1), Shen et al. (2023) demonstrated the existence of certain parameters $\mu_1, \mu_2 > 0$ such that, with high probability,

$$\frac{1}{n} \sum_{i=1}^{n} \big( |Y_i - \langle X_i, \beta \rangle| - |Y_i - \langle X_i, \beta^* \rangle| \big) \ge \max\big\{ \mu_1 \|\beta - \beta^*\|_2 - \gamma,\ \mu_2 \|\beta - \beta^*\|_2^2 \big\}$$

holds for all $\beta$ under the sample size requirement $n \ge C d$, where $\gamma = \mathbb{E}|\xi|$ is the first absolute moment of the noise variable. However, in the context of online learning, where the stored data are limited, such as when $n = 1$, the above inequality does not hold in general. The foundational contributions to online learning by Duchi and Singer (2009), Duchi et al. (2011), and Johnson and Zhang (2013) employ sub-gradient descent to address scenarios characterized by the lack of differentiability of the loss function at specific points. These studies establish the properties of iterates under various stepsize schemes, with a primary emphasis on the excess risk, leaving the analysis of the estimation error $\|\beta_t - \beta^*\|_2$ unaddressed. In the absence of strong convexity and/or certain smoothness, upper bounds on the excess risk cannot be straightforwardly extended to bounds on the estimation error. From the statistical viewpoint, Jiang and Yu (2022) and Wang et al. (2022) consider online quantile regression within the batch learning framework, in which the data arrive in batches, and establish asymptotic normality as the batch size goes to infinity. It is worth noticing that the requirement of diverging batch sizes renders these methods impractical in scenarios where only one observation or a few observations arrive at a time. This paper is dedicated to the exploration of online quantile regression with unknown horizon and sample sizes (batch or total). We aim to provide a non-asymptotic analysis of online sub-gradient descent. By using a customized stepsize scheme with explicit dependencies on dimensionality, sample size, and noise scale, we establish optimal rates of convergence for online QR estimators.
Additionally, we seek to demonstrate that near-optimal regret performance can be achieved. Let $S_t$ be the set of observations acquired at time $t$, which will be used to update the current iterate $\beta_t$. Consequently, the loss function at time $t$ is given by

$$f_t(\beta) = \sum_{(X_i^{(t)}, Y_i^{(t)}) \in S_t} \rho_{Q,\tau}\big( Y_i^{(t)} - \langle X_i^{(t)}, \beta \rangle \big).$$

We will explore three settings: (i) $S_t = \mathcal{D}_t$, containing only one data vector, referred to as the online learning setting; (ii) $S_t = \mathcal{D}_t$, comprising a set of observations of size at least $O(d)$, termed the online batch learning setting; and (iii) $S_t$ containing all the data vectors accumulated up to time $t$, expressed as $S_t = \cup_{l=0}^{t} \mathcal{D}_l$, recognized as the sequential learning with infinite storage setting.

Online sub-gradient descent. Initiated at some $\beta_0$, the online QR estimates are iteratively defined via sub-gradient descent as $\beta_{t+1} = \beta_t - \eta_t g_t$, $t = 0, 1, \dots$, where $g_t \in \partial f_t(\beta_t)$ is a sub-gradient of $f_t$ at $\beta_t$, and $\{\eta_t\}_{t=0,1,\dots}$ constitutes a sequence of stepsize parameters (learning rates). The loss function varies across the different settings, exhibiting distinct regularity properties. Consequently, tailored stepsize schemes are imperative in the three settings to attain the desired convergence properties. We will show that these customized stepsize schemes yield statistically optimal estimators and achieve near-optimal regret performances. Importantly, these schemes effectively adapt to varying dimensions and unknown horizons. Prior to presenting the theoretical guarantees and convergence dynamics, we outline the essential assumptions concerning the model.

Assumption 1 (Random covariate) The covariate vector $X \in \mathbb{R}^d$ follows the Gaussian distribution $N(0, \Sigma)$, where $\Sigma$ is symmetric and positive definite. There exist absolute constants $C_l, C_u > 0$ such that $C_l \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le C_u$.

Assumption 2 (Noise distribution) Given $X \in \mathbb{R}^d$, the noise variable $\xi$ has conditional density function $h_{\xi|X}(\cdot)$ and conditional distribution function $H_{\xi|X}(\cdot)$, respectively.
The conditional $\tau$-th quantile of $\xi$ given $X$ is zero, i.e., $H_{\xi|X}(0) = \tau$, and $\mathbb{E}|\xi| < +\infty$. Let $\gamma = \mathbb{E}|\xi|$. There exist constants $b_0, b_1 > 0$ such that (see footnote 1):

(a) $\inf_{|x| \le 8 (C_u/C_l)^{1/2} \gamma}\, h_{\xi|X}(x) \ge b_0^{-1}$;
(b) $\sup_{x \in \mathbb{R}}\, h_{\xi|X}(x) \le b_1^{-1}$.

Both local lower and global upper bounds on the (conditional) density $h_{\xi|X}$ are commonly imposed in the QR literature, particularly in non-asymptotic settings (e.g., Belloni and Chernozhukov (2011) and He et al. (2023)).

Footnote 1: As remarked in Shen et al. (2023), the noise assumptions can be slightly relaxed to $|H_{\xi|X}(x) - H_{\xi|X}(0)| \ge |x|/b_0$ for $|x| \le 8(C_u/C_l)^{1/2}\gamma$ and $|H_{\xi|X}(x) - H_{\xi|X}(0)| \le |x|/b_1$.

2.1 Online Learning

The server receives a new data point $(X_t, Y_t)$ that follows the linear model (1). The online sub-gradient descent algorithm calculates the corresponding sub-gradient $g_t \in \partial f_t(\beta_t)$ and subsequently updates the current estimate $\beta_t$ in the direction of the negative sub-gradient, all without storing the new observation. Here, the loss function at time $t$ is given by $f_t(\beta) = \rho_{Q,\tau}(Y_t - X_t^\top \beta)$. More specifically, the algorithm proceeds as follows. We begin with an arbitrary initialization $\beta_0$. At each time step $t$, given the current estimate $\beta_t$ and a newly arrived observation $(X_t, Y_t)$, the update is performed as

$$\beta_{t+1} = \beta_t + \eta_t \big( \tau - I\{ Y_t \le \langle X_t, \beta_t \rangle \} \big) X_t,$$

where $\eta_t > 0$ denotes the stepsize. The choice of $\eta_t$ will be discussed in Theorem 1. The expected excess risk $\mathbb{E}[f_t(\beta) - f_t(\beta^*)]$, as a function of $\beta$, demonstrates a two-phase regularity property, visually depicted in Figure 1. As an illustrative example, under the assumptions $X \sim N(0, I_d)$ and $\xi \sim N(0, \sigma^2)$, Shen et al. (2023) derived the population excess risk as

$$\mathbb{E}\big[ |X^\top(\beta - \beta^*) + \xi| - |\xi| \big] = \sqrt{\tfrac{2}{\pi}} \Big( \sqrt{\|\beta - \beta^*\|_2^2 + \sigma^2} - \sigma \Big).$$

When $\beta$ is distant from the population risk minimizer $\beta^*$, the expected excess risk exhibits a first-order lower bound, decaying linearly in $\|\beta - \beta^*\|_2$. However, as $\beta$ approaches proximity to $\beta^*$, a quadratic lower bound emerges concerning $\|\beta - \beta^*\|_2$.
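The closed-form population excess risk displayed above can be checked numerically. The following Monte Carlo sketch is our own code, with illustrative choices of $d$, $\sigma$, and $\|\beta - \beta^*\|_2$; it compares a sample average against the formula.

```python
import numpy as np

# Monte Carlo check of
#   E[|X^T(beta - beta*) + xi| - |xi|] = sqrt(2/pi) * (sqrt(||beta - beta*||^2 + sigma^2) - sigma)
# under X ~ N(0, I_d) and xi ~ N(0, sigma^2).  Constants below are illustrative.
rng = np.random.default_rng(0)
d, sigma, n = 8, 0.5, 500_000
delta = rng.standard_normal(d)
delta *= 1.3 / np.linalg.norm(delta)      # set ||beta - beta*||_2 = 1.3

X = rng.standard_normal((n, d))
xi = sigma * rng.standard_normal(n)
empirical = np.mean(np.abs(X @ delta + xi) - np.abs(xi))
closed_form = np.sqrt(2 / np.pi) * (np.sqrt(1.3**2 + sigma**2) - sigma)
print(empirical, closed_form)             # the two values should nearly agree
```

The identity follows because $X^\top(\beta - \beta^*) + \xi \sim N(0, \|\beta - \beta^*\|_2^2 + \sigma^2)$ and $\mathbb{E}|N(0, s^2)| = s\sqrt{2/\pi}$.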
For general noise with bounded density, a similar expansion is achievable and a two-phase property holds accordingly. Importantly, it is noteworthy that, in contrast to offline works such as Tong et al. (2021) and Shen et al. (2023), this regularity property pertains to the expectation, given that the empirical loss is based on only a single observation. The empirical loss lacks a guaranteed high-probability concentration property; for instance, the variance of $f_t(\beta)$ may significantly overshadow its expectation. As a result, the commonly observed monotone-decreasing trend in the estimation error, such as $\|\beta_{t+1} - \beta^*\|_2 \le \|\beta_t - \beta^*\|_2$ in offline learning algorithms, does not occur. In fact, a more nuanced analysis is essential to understand the convergence dynamics of online sub-gradient algorithms for quantile regression. Interestingly, the following theorem illustrates that, despite the online sub-gradient descent algorithm's inability to ensure error contraction in the sense that $\|\beta_{t+1} - \beta^*\|_2 < \|\beta_t - \beta^*\|_2$, the error rates $\|\beta_t - \beta^*\|_2$ can be bounded by a monotone-decreasing sequence with high probability. The convergence of the online sub-gradient descent algorithm unfolds in two distinct phases, as suggested by the two-phase regularity properties depicted in Figure 1.

Figure 1: Expected excess risk of the objective function. Y-axis: lower bound of $\mathbb{E}[f_t(\beta) - f_t(\beta^*)]$; X-axis: the value of $\|\beta - \beta^*\|_2$. It shows that the lower bound varies from a linear to a quadratic dependence on $\|\beta - \beta^*\|_2$ as $\beta$ gets closer to $\beta^*$.

Theorem 1 Under Assumptions 1 and 2(a), there exist universal positive constants $c_0, c_1, c_2, c_3, C_0$ such that, for an arbitrary initialization $\beta_0 \in \mathbb{R}^d$, the sequence $\{\beta_t\}_{t \ge 1}$ generated by the online sub-gradient descent algorithm follows the dynamics outlined below:

1. In phase one (see footnote 2), i.e., when $\|\beta_t - \beta^*\|_2 \ge 8 C_l^{-1/2} \gamma$, by selecting the stepsizes $\eta_0 = C D_0 / (\sqrt{C_l}\, d)$ and $\eta_t = \big(1 - c_5 \frac{C_l}{C_u} \frac{1}{d \bar\tau^2}\big) \eta_{t-1}$ with any $D_0 \ge \|\beta_0 - \beta^*\|_2$, $C > 1$, and $c_5 < 0.1$, it holds with probability at least $1 - \exp(-c_0 d)$ that $\|\beta_{t+1} - \beta^*\|_2^2 \le C_0^{t+1} \|\beta_0 - \beta^*\|_2^2$ (see footnote 3); the conclusion of phase one occurs after $t_1 = O(d \log(\|\beta_0 - \beta^*\|_2/\gamma))$ iterations, where $\bar\tau = \max\{\tau, 1-\tau\}$.

2. In phase two, i.e., when $\|\beta_t - \beta^*\|_2 < 8 C_l^{-1/2} \gamma$, by choosing the stepsize $\eta_t = (b_0/C_l)\,\big(C_a/(t - t_1 + C_b d)\big)$ with certain constants $C_a > 12$ and $C_b > C_a/(12d)$, it holds with probability exceeding $1 - \exp(-c_0 d)$ that

$$\|\beta_{t+1} - \beta^*\|_2^2 \le C\, \frac{C_u\, d}{t + 1 - t_1 + C_b d}\, b_0^2,$$

where the constant $C$ depends on $C_a$, $C_b$.

Footnote 2: With this geometrically decaying stepsize, once $\|\beta_t - \beta^*\|_2 = O(\gamma)$ is achieved, the estimation error will remain at this scale both theoretically and practically.
Footnote 3: We remark that $C_0$ depends on the choices of $C$ and $c_5$; we leave the details to the proof.

In accordance with Theorem 1, the algorithm exhibits linear convergence during the first phase, concluding after $t_1$ iterations. This phase requires a total of $t_1$ observations and employs an approximately geometrically decaying stepsize scheme. In the second phase, the algorithm converges sub-linearly, employing an $O(1/(t+d))$ decaying stepsize scheme. The failure probability in both phases decays exponentially in $d$. To our knowledge, our result stands as the first instance in the online learning literature to provide an exponential-type high-probability bound. Previous works employing the squared loss generally guarantee a polynomial-type tail probability, even under sub-Gaussian noise (Cai et al., 2023; Han et al., 2022; Bastani and Bayati, 2020). This enhancement holds significant implications for practical applications. It suggests that our algorithm remains applicable even when the ultimate horizon $T$ is as large as $e^{O(d)}$. Theorem 1 asserts that, provided $d \lesssim T \le e^{c_0 d}$, the final estimator $\beta_T$, with high probability, attains an error rate that is minimax optimal under offline settings. Online methods can adapt to the unknown horizon and offer substantial reductions in computation costs, as well as savings in memory and storage.
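To make the two-phase procedure concrete, the following self-contained sketch (our own illustration) runs online sub-gradient descent for the median ($\tau = 0.5$) with a geometric phase-one decay followed by an $O(1/(t - t_1 + C_b d))$ phase-two decay. All numerical constants here (eta0, the decay factor, Ca, Cb, b0/Cl, and the phase-switch time t1) are illustrative choices of ours, not the theoretically prescribed values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, tau = 20, 50_000, 0.5
beta_star = rng.standard_normal(d)
beta = np.zeros(d)                            # arbitrary initialization

t1 = 10 * d                                   # illustrative phase-one length
eta0, rho = 0.5, 1.0 - 1.0 / (2 * d)          # geometric decay factor 1 - Theta(1/d)
Ca, Cb, b0_over_Cl = 15.0, 2.0, 1.0           # illustrative phase-two constants

for t in range(T):
    X = rng.standard_normal(d)
    Y = X @ beta_star + rng.standard_t(df=2)  # heavy-tailed noise with median zero
    if t < t1:                                # phase one: geometric decay
        eta = eta0 * rho ** t
    else:                                     # phase two: O(1/(t - t1 + Cb*d)) decay
        eta = b0_over_Cl * Ca / (t - t1 + Cb * d)
    # sub-gradient step: beta <- beta + eta * (tau - 1{Y <= <X, beta>}) * X
    beta = beta + eta * (tau - float(Y <= X @ beta)) * X

print(np.linalg.norm(beta - beta_star))       # small relative to ||beta_0 - beta*||_2
```

Note that the sub-gradient is bounded by $\bar\tau \|X_t\|_2$ regardless of how heavy-tailed the noise is, which is one intuition for why the procedure tolerates Student-$t$ noise here.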
The final error rate achieved after the second-phase iterations is proportional to the noise level, independent of the initialization error. In existing works, the initialization error often exerts a lasting impact on the final error rate, even when using the squared loss. For instance, the online algorithm of Fan et al. (2018) obtains an expected error rate $\mathbb{E}\|\beta_T - \beta^*\|_2^2 \lesssim d\, \|\beta_0 - \beta^*\|_2^2 \log(T)/T$. Moreover, the error rate established by Cai et al. (2023) is $\|\beta_T - \beta^*\|_2^2 \lesssim d\, \|\beta_0 - \beta^*\|_2^2 / T$, which holds with high probability as long as $T \le \mathrm{poly}(d)$. Theorem 1 reveals that the initialization error has only a short-term effect in the online sub-gradient descent algorithm, dissipating once the second-phase iterations begin. This two-phase convergence phenomenon is elucidated through a meticulous analysis of the algorithm's dynamics. It is worth noting that the two-phase convergence and the transient impact of the initialization error are not exclusive to quantile regression; they also manifest in least squares, as per our analysis framework. A comprehensive discussion of these properties for online least squares will be presented in Section 3.1. Numerical simulations in that section demonstrate that a two-phase stepsize scheme yields significantly improved accuracy.

Corollary 2 Assume that the same conditions as in Theorem 1 hold, and let the horizon satisfy $T \ge C\, d \max\{1, \log(\sqrt{C_l}\, \|\beta_0 - \beta^*\|_2 / \gamma)\}$ with a constant $C$ depending on $(C_l, C_u, C_a, C_b)$. Then, with probability at least $1 - T \exp(-c_0 d)$, the online sub-gradient descent algorithm produces a final estimate with an error rate $\|\beta_T - \beta^*\|_2^2 \le C_3 \frac{d}{T}\, b_0^2$, where $C_3 > 0$ is a constant.

Corollary 2 demonstrates that the online sub-gradient descent algorithm attains the minimax optimal error rate, provided the horizon is not too small. This makes it a favorable choice over the offline approach, considering the benefits of reduced computation and storage costs. However, it is advisable to opt for offline approaches in cases where the horizon is small.
A comprehensive examination of their numerical performance is conducted in Section 3.1.

Remark 3 (Trade-off between short-term accuracy and long-term optimality) Suppose the initialization is already situated in the second-phase region, i.e., $\|\beta_0 - \beta^*\|_2 < 8 C_l^{-1/2}\gamma$. It is then advisable to initiate the phase-two iterations directly ($t_1 = 0$), where the choice of the parameter $C_a$ plays a pivotal role in determining short-term accuracy and long-term optimality. Selecting $C_a > 12$, according to Theorem 1, the updated estimates achieve an error rate of $\|\beta_t - \beta^*\|_2^2 \le C\,\frac{C_u}{C_l^2}\,\frac{d}{t + C_b d}\,b_0^2$ with high probability. This upper bound may exceed the initialization error when $t$ is small, indicating potential accuracy fluctuations in the early stage. Nevertheless, the algorithm eventually converges and yields a statistically optimal estimator. On the contrary, opting for a sufficiently small parameter $C_a < \min\{12, (C_l/\sqrt{C_u})\,\|\beta_0 - \beta^*\|_2/b_0\}$ results in a more stable convergence in the early stage. However, it sacrifices long-term accuracy, as the updated estimates satisfy the error bound $\|\beta_t - \beta^*\|_2^2 \le C\,\big(\frac{C_b d}{t + C_b d}\big)^{C_a/12}\,\|\beta_0 - \beta^*\|_2^2$, achieving a sub-optimal error rate in the end. This phenomenon is illustrated in Figure 2. Specifically, in Figure 2a, when the initialization satisfies $\|\beta_0 - \beta^*\|_2 = 0.2\,\mathbb{E}|\xi|$, a small $C_a$ ensures short-term accuracy but compromises long-term optimality.

Figure 2: Trade-off between short-term accuracy and long-term optimality. The initialization $\beta_0 = 0$ is already in proximity to $\beta^*$, enabling the online sub-gradient descent algorithm to bypass the first phase and start the second-phase iterations immediately. A small stepsize can ensure short-term accuracy but may compromise long-term optimality, whereas a large stepsize can guarantee long-term optimality but at the expense of short-term accuracy. Here, the dimension is set to $d = 100$, and the noise follows a Student $t_\nu$ distribution with $\nu = 1.1$.
On the other hand, a large $C_a$ guarantees long-term optimality at the expense of short-term accuracy. Figure 2b presents even more pronounced differences, where the initialization satisfies $\|\beta_0 - \beta^*\|_2 = 0.02\,\mathbb{E}|\xi|$.

Theorem 4 Under the same conditions and stepsize scheme as in Theorem 1, the online sub-gradient descent algorithm achieves the regret upper bound $\mathrm{Regret}_T \le C_1 (C_u/C_l)\sqrt{C_u}\,d\,\|\beta_0 - \beta^*\|_2 + C_2 C_a \frac{C_u^2}{C_l^2}\,\frac{b_0^2}{b_1}\,d \log(1 + T)$ for any $T$, with probability exceeding $1 - t_1\exp(-c d)$.

The regret upper bound exhibits logarithmic growth with respect to the horizon $T$, aligning with the well-established optimal regret bounds in online optimization (Hazan et al., 2007a; Orabona, 2019). The other term in the regret upper bound depends solely on the initialization error and remains unaffected by the horizon $T$. This indicates that the initialization error does not exert a persistent influence on the regret. In contrast, regret upper bounds established by Hazan et al. (2007a), Orabona (2019), Cai et al. (2023), and Han et al. (2022) all involve a multiplicative term of the form $\|\beta_0 - \beta^*\|_2^2 \log T$, indicating that poor initialization may lead to substantially larger regret.

2.2 Batch Learning

Suppose the server receives a batch of data points $\mathcal{D}_t = \{(X_i^{(t)}, Y_i^{(t)})\}_{i=1}^{n_t}$ at each time $t$, consisting of $n_t$ i.i.d. observations that follow the linear model (1). The loss function at time $t$ is given by $f_t(\beta) := \frac{1}{n_t}\sum_{i=1}^{n_t} \rho_{Q,\tau}\big(Y_i^{(t)} - X_i^{(t)\top}\beta\big)$, (3) which represents the empirical risk based on $n_t$ observations. The special case considered in Section 2.1, where only a single data point is observed, corresponds to $n_t = 1$. As $n_t$ increases, the objective function (3) becomes more amenable to analysis due to the concentration phenomenon. As before, we allow for an arbitrary initialization $\beta_0$.
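For concreteness, the check loss and the empirical risk in (3) can be written out in a few lines. This is an illustrative sketch; the function names are ours, not the paper's.

```python
import numpy as np

def check_loss(u, tau):
    """Check (pinball) loss: rho_{Q,tau}(u) = u * (tau - 1{u < 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def empirical_risk(X, y, beta, tau):
    """Batch empirical check risk: (1/n) * sum_i rho_{Q,tau}(y_i - <x_i, beta>)."""
    return float(check_loss(y - X @ beta, tau).mean())
```

The loss is piecewise linear: it charges $\tau$ per unit of positive residual and $1 - \tau$ per unit of negative residual, which is why minimizing the risk targets the conditional $\tau$-quantile.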
At each time $t$, the current estimate is $\beta_t$, and upon receiving the new batch $\{(X_i^{(t)}, Y_i^{(t)})\}_{i=1}^{n_t}$, the update is given by $\beta_{t+1} = \beta_t + \eta_t\,\frac{1}{n_t}\sum_{i=1}^{n_t}\big(\tau - \mathbb{I}\{Y_i^{(t)} \le \langle X_i^{(t)}, \beta_t\rangle\}\big)\,X_i^{(t)}$, where $\eta_t$ denotes the stepsize. The choice of $\eta_t$ will be discussed in Theorem 5. In contrast to our focus, Jiang and Yu (2022) and Wang et al. (2022) examine asymptotic batch learning for quantile regression, specifically in the regime where the batch size $n_t$ tends to infinity. Separately, Do et al. (2009) investigates batch learning from an optimization-theoretic perspective. More recently, the advantages of batch learning have been extensively explored in the literature on bandit algorithms (Ren and Zhou, 2023; Gao et al., 2019; Han et al., 2020). Despite these advances, the computational and statistical aspects of batched quantile regression remain largely underexplored. Relative to the one-sample setting considered in Section 2.1, access to larger batches provides technical advantages in analyzing the non-smooth objective function, especially when the current iterate is far from the ground truth, and gives rise to distinct theoretical properties.

The excess risk and sub-gradient of the objective function play central roles in characterizing the convergence dynamics of (sub-)gradient descent algorithms. While the expected excess risk properties illustrated in Figure 1 also hold for the objective function (3), the sub-gradient of the objective function $f_t(\beta)$, denoted by $g_t \in \partial f_t(\beta)$, behaves distinctly from that in Section 2.1. The three phases in the behavior of the expected sub-gradient norm are illustrated in Figure 3. Essentially, when $\beta$ is sufficiently far from the ground truth, the expected squared norm of the sub-gradient, $\mathbb{E}\|g\|_2^2$, is bounded above by a constant that is independent of the distance $\|\beta - \beta^*\|_2$. When $\beta$ is closer, but not too close, to $\beta^*$, the expected norm is upper bounded by $O(\|\beta - \beta^*\|_2)$, and this bound remains independent of the batch size.
Finally, when β is very close to the ground truth, the expected sub-gradient norm depends only on the batch size, dimensionality, and noise characteristics. Hence, the online sub-gradient descent algorithm applied to (3) exhibits a three-phase convergence behavior, provided that the stepsize is carefully chosen to align with the aforementioned statistical properties. For notational convenience, define τ = max{τ, 1 τ}. Theorem 5 Suppose Assumptions 1 and 2 hold. Let β0 be an arbitrary initialization satisfying β0 β 2 D0 for some D0 > 0. There exist universal constants {ci}5 i=0, C0, C1, C2 > 0 such that if the batch size nt n C0(Cu/Cl) τ 2d for all t, then the sequence {βt}t 1 generated by the online sub-gradient descent algorithm exhibits the following dynamics: ONLINE QUANTILE REGRESSION Figure 3: Expected length of the sub-gradient for the empirical quantile loss. Y -axis: upper bound of E[ g 2 2|β], where g 1 n Pn i=1 ρQ,τ(Yi X i β); X-axis: β β 2. It reveals three phases of properties associated with the sub-gradient, depending on the closeness between β and β . (1). during phase one, where βt β 2 8C 1/2 l γ, by choosing a stepsize η0 = C Cl/Cu D0 and ηt = (1 c Cl/Cu)ηt 1 with any D0 βt β 2, C > 1, c < 0.1, it holds with probability exceeding 1 exp( c0d) exp p nt/ log(nt) that βt+1 β 2 1 c Cl the conclusion of phase one is reached after t1 = O(log( β0 β 2/γ)) iterations; (2). in phase two, characterized by C1C1/2 u C 1 l τ p d/n b0 βt β 2 < 8C 1/2 l γ, by selecting a constant stepsize ηt = η (Cl/C2 u)(b2 1/b0) [c1, c2], it holds with probability at least 1 3 exp( c0d) exp( p nt/ log nt) that βt+1 β 2 2 1 0.0005b2 1 b2 0 phase two requires O log((n/d)(γ/b0)) iterations to conclude, providing an estimate βt2 that satisfies βt2 β 2 C1/2 u C 1 l τ p d/n b0 under the specified conditions; (3). 
in phase three, characterized by βt β 2 C1C1/2 u C 1 l τ p d/n b0, by choosing a stepsize ηt := Ca Cl b0 t+Cb t2 with arbitrary constants Ca 12 and Cb > Ca/12, it holds with probability exceeding 1 c2 exp( c1d) exp( p nt/ log nt) that βt+1 β 2 2 C C2 ub2 0 C2 l b2 1 t + Cb + 1 t2 SHEN, XIA AND ZHOU where the constant C depends solely on Ca, Cb, and t2 denotes the conclusion time of the phase two convergence. According to Theorem 5, the online sub-gradient descent algorithm exhibits fast linear convergence during phases one and two. The stepsize schemes used in these phases mirror those observed in the two-phase convergence of offline quantile regression (Shen et al., 2023). Specifically, the first phase requires t1 iterations and uses a total of O(d log( β0 β 2/γ)) data vectors. This aligns with the data scale required in phase one of the online learning setting, as outlined in Theorem 1. In the third phase, a stepsize of O(1/t) is essential for attaining long-term statistical optimality. The final error rate exhibits a continual decrease as additional data becomes available, demonstrating the efficacy of the proposed stepsize scheme in seamlessly integrating sequentially arriving data with an unknown horizon. Furthermore, at each iteration, the failure probability diminishes exponentially with respect to the dimension. Corollary 6 Under the conditions of Theorem 5, and if the horizon T C0(log( β0 β 2/b0)+ log(n/d)), then, with a probability exceeding 1 c1T exp( c0d) T exp( p n/ log n), the online sub-gradient descent algorithm produces a final estimate with an error rate given by βT β 2 2 C C2 ub2 0 C2 l b2 1 where the constant C depends only on Ca and Cb, and τ = max{τ, 1 τ}. If all batch sizes satisfy nt n, then the total number of data points used is O(Tn). In this setting, the error rate established in Corollary 6 matches the minimax-optimal rate in the offline setting (He et al., 2023). 
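The batched update analyzed in this section can be sketched as a single step on the batch empirical check loss. The helper below is illustrative (names are ours), using the indicator convention $\mathbb{I}\{Y_i \le \langle X_i, \beta\rangle\}$; at ties it simply returns one valid sub-gradient.

```python
import numpy as np

def batch_qr_step(beta, X, y, tau, eta):
    """One sub-gradient step on the batch empirical check loss:
    beta <- beta + eta * (1/n) * sum_i (tau - 1{y_i <= <x_i, beta>}) x_i.
    X: (n, d) batch of covariates, y: (n,) responses.
    """
    g = ((y <= X @ beta).astype(float) - tau) @ X / len(y)  # averaged sub-gradient
    return beta - eta * g
```

Repeating this step over arriving batches, with the stepsize schedule adapted by phase as in Theorem 5, reproduces the three-phase dynamics in our sketch; with a constant stepsize the iterate contracts until it reaches a noise floor.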
Notably, this rate is robust to initialization errors, similar to the rate obtained for online learning in Corollary 2. When the initialization is sufficiently close to $\beta^*$, a trade-off emerges between short-term accuracy and long-term optimality, as discussed in Remark 3.

Theorem 7 Under the same conditions and stepsize scheme as in Theorem 5, the online sub-gradient descent algorithm attains, for any $T \ge 1$, with probability exceeding $1 - c_1 t_1\exp(-c_0 d) - t_1\exp(-\sqrt{n/\log n})$, the regret upper bound $\mathrm{Regret}_T \le C_1\,\frac{C_u^3}{C_l^3}\,\frac{b_0^2}{b_1^2}\,\max\{\sqrt{C_u}\,\|\beta_0 - \beta^*\|_2,\ \gamma^2/b_0\} + C_2\,\frac{C_u^3}{C_l^3}\,\frac{d}{n}\,b_0\,\log(T - t_1 + C_b)$.

The regret upper bound consists of two terms: one that depends on the initialization error but is independent of $T$, and another that grows logarithmically with $T$ yet remains independent of the initialization. When the total number of data points consumed is the same, denoted by $N$, and assuming $N \ge C_1\max\{n\log(\|\beta_0 - \beta^*\|_2/b_0),\ n\log(n/d)\}$, both online and batch learning achieve an optimal estimation error rate of $O(b_0^2\,d/N)$. However, batch learning incurs significantly lower regret by leveraging a larger number of data points in each iteration, thereby accelerating the learning process. Specifically, as shown in Theorem 4, online learning achieves a regret upper bound of $O(d\,\|\beta_0 - \beta^*\|_2 + d\log(N/d))$, whereas batch learning attains a considerably smaller bound of $O(\|\beta_0 - \beta^*\|_2 + (d/n)\log(N/n))$. It is worth noting, however, that batch learning requires substantially greater storage capacity.

2.3 Sequential Learning with Infinite Storage

Consider a scenario in which the server hosting $\beta_{t+1}$ has unlimited storage capacity, retaining all historical data and updating the estimate only upon the arrival of new samples. Let $\beta_0$ be an arbitrary initialization. At each time $t$, the server incorporates all observations received up to and including time $t$ when computing the update. Let $\{(Y_i^{(t)}, X_i^{(t)})\}_{i=1}^{n_t}$ denote the $n_t$ observations that arrive at time $t$.
The loss function at time t is then given by ft(β) := 1 Pt l=0 nl i=1 ρQ,τ(Y (l) i X(l) i β), (4) which corresponds to the empirical risk over all accumulated data. The online sub-gradient descent algorithm updates the iterate via βt+1 = βt ηt gt, where ηt is the stepsize and gt ft(βt) is a sub-gradient. Specifically, we use the update rule βt+1 = βt ηt 1 Pt l=0 nl i=1 (τ I{Y (l) i X(l) i , βt }) X(l) i . The stepsize scheme ηt will be discussed in Theorem 8. Although this algorithm requires iterating over the entire dataset accumulated so far at each update, it remains computationally more efficient than a full offline optimization approach (Shen et al., 2023). The objective function in (4) retains its inherent two-phase structure, as illustrated in Figure 1. Notably, although this formulation incurs substantial storage and computational costs, these demands facilitate a faster convergence rate of the algorithm, as established in the following theorem. Theorem 8 Suppose Assumptions 1 and 2 hold, and let β0 be an arbitrary initialization satisfying β0 β 2 D0 for some D0 > 0. There exist absolute positive constants c0, c1, c2, C1, C2 such that, if the initial storage n0 C1d, then the sequence {βt}t 1 generated by the online sub-gradient descent algorithm has the following dynamics: (1). During phase one, when βt β 2 8γ, by selecting a stepsize ηt := Cl 8Cu 1 1 100 Cl Cu D0, it holds with probability at least 1 exp( c0d) exp( q Pt l=0 nl/ log Pt l=0 nl) that βt+1 β 2 1 1 100 Cl Cu Phase one concludes after t1 = O log( β0 β 2/γ) iterations. (2). In phase two, when βt β 2 < 8γ, by selecting a stepsize ηt := η Cl C2u b2 1 b0 , it holds with probability at least 1 exp( c0d) exp( q Pt l=0 nl/ log Pt l=0 nl) that βt+1 β 2 2 1 c1 b2 1 b2 0 C2 l C2u βt β 2 2 + c2 d Cu Pt l=0 nl b2 1. SHEN, XIA AND ZHOU As shown in Theorem 8, the convergence behavior of the algorithm proceeds in two distinct phases. 
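In the infinite-storage setting of Theorem 8, each update sweeps all data received so far. A minimal sketch of this full-history recursion, assuming a constant stepsize as in the second phase (names and constants are illustrative, not the paper's):

```python
import numpy as np

def full_history_qr(batches, beta0, tau=0.5, eta=0.1):
    """Sub-gradient descent on the running empirical risk (4):
    one step per arriving batch, computed over all samples stored so far."""
    Xs, ys = [], []
    beta = np.array(beta0, dtype=float)
    for X_t, y_t in batches:
        Xs.append(X_t)
        ys.append(y_t)
        X, y = np.vstack(Xs), np.concatenate(ys)  # all accumulated data
        g = ((y <= X @ beta).astype(float) - tau) @ X / len(y)
        beta = beta - eta * g
    return beta
```

The per-update cost grows with $t$, which is the price paid for the faster (linear) convergence and shrinking noise floor described in Theorem 8.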
The initial phase requires a geometrically decaying sequence of stepsizes, while the second phase adopts a constant stepsize. Both phases are proven to exhibit linear convergence. For simplicity, assume that nl = n holds for all l 1. During the second phase (i.e., for t > t1), the error rate satisfies βt+1 β 2 2 1 c1 b2 1 b2 0 C2 l C2u t+1 t1 βt1 β 2 2 + C2 d n0 + tn Cu C2 l b2 0 b0 which leads to the following corollary. Corollary 9 Assume the same conditions as in Theorem 8, and suppose that nl = n for all l 1. If the total time horizon satisfies T C β0 β 2/γ, for some constant C that may depend on (Cl, Cu, n0, n), with probability over 1 PT t=0 exp( p (n0 + tn)/ log(n0 + tn)) T exp( c0d), it holds that βT β 2 C b0 C2 l d n0 + Tnb2 0. The corollary above demonstrates that, despite the server updating only once upon the arrival of each new batch of samples, the procedure consistently achieves statistical optimality, even when the total time horizon is unknown. 3. Numerical Experiments In this section, we present experiments to empirically validate our theoretical findings. We first generate synthetic data for simulation studies, followed by an application of our proposed methods to a real-data example. 3.1 Simulations In this section, we numerically examine the performance of the proposed online QR algorithms. We first demonstrate the effectiveness of our stepsize scheme, and then compare the final accuracy of different online estimators with that of the offline estimate. We further illustrate the advantage of online quantile regression over online least squares regression. It is worth noting that existing online quantile algorithms (Jiang and Yu, 2022; Sun et al., 2023; Chen and Yuan, 2024) depend on storage of O(d2) summary statistics. Additionally, they require data to arrive in batches with a diverging batch size. Their requirements limit their applicability for empirical performance comparisons. We use the relative error βt β 2/ β 2 as the main metric. 
For simplicity, the initialization $\beta_0$ is set to 0 throughout our numerical studies.

STEPSIZE SCHEME. We begin by demonstrating the effectiveness of our stepsize scheme, as outlined in Theorem 1, which is theoretically guided by statistical regularities. The selection of the parameters in the proposed stepsize is quite flexible, both theoretically (as outlined in Theorem 1) and in practice. For instance, concerning the parameter $C_a$ required by the stepsize schedule in the second phase of iterations, we observe that our algorithm performs well as long as it is not excessively small.

Figure 4: Relative error versus time/iterations in online (one-sample) learning ($n_t \equiv 1$). The dimension $d = 100$, the unknown horizon $T = 10^5$, and the quantile level $\tau = 1/2$. The convergence of online sub-gradient descent is examined under three stepsize schemes: "Statistical" stands for our stepsize scheme guided by Theorem 1, "Constant" stands for the constant scheme $\eta_t \equiv \text{const}$ (Zhang, 2004), and "$O(1/t)$" denotes the decaying scheme $\eta_t = O(1/t)$ (Duchi and Singer, 2009). Left (a): moderate SNR; right (b): strong SNR.

In all the experiments, the stepsize for the first phase is scheduled as $\eta_t = (1 - 0.5/d)^t \eta_0$, where $\eta_0$ is the initial stepsize. The stepsize for the second phase involves the parameters $C_a$ and $C_b$, which we set to $C_a = 20$ and $C_b = 30$. We fix the dimension at $d = 100$ and the unknown horizon at $T = 10^5$, and sample $t_\nu$-distributed noise with $\nu = 1.1$ degrees of freedom. The performance of the proposed stepsize scheme is compared with those of two existing alternatives: the decaying scheme $\eta_t = O(1/t)$ and a constant scheme $\eta_t \equiv \text{const}$. For detailed discussions of these stepsize schemes, we refer to Duchi and Singer (2009) and Zhang (2004). The convergence of online sub-gradient descent under these three stepsize schemes is presented in Figure 4.
For both the moderate and strong SNR cases, our proposed stepsize scheme ensures linear convergence of the online sub-gradient descent algorithm in the first phase. The linear convergence stops once the algorithm reaches a sufficiently accurate estimate, i.e., when the error rate is dominated by the noise scale $\mathbb{E}|\xi|$. Our proposed scheme then resets the stepsize and the algorithm enters the second phase of iterations. The error rate continues to decrease in the second phase, which is sub-linear and exhibits an $O(1/t)$ convergence rate. This two-phase convergence phenomenon is consistent with the theoretical findings. In contrast, the constant stepsize scheme can also achieve an error rate matching the one our algorithm attains at the end of the first-phase iterations. However, it cannot further improve the estimate, or it converges too slowly, resulting in a statistically sub-optimal estimate. It is also observed that employing a relatively large constant stepsize can facilitate faster convergence in the initial stage, consistent with the theoretical results of Cai et al. (2023). The performance under the $O(1/t)$ stepsize scheme is considerably inferior to those under the other two stepsize schemes.

STATISTICAL ACCURACY COMPARISONS. We now evaluate the statistical accuracy of the final estimator output by the online sub-gradient descent algorithm. The error rate achieved by offline quantile regression is used as the benchmark. The dimension $d$ and noise distribution are set the same as those in the previous set of simulations.

Figure 5: Error rates of offline and online regression using the quantile loss $\rho_{Q,\tau}(\cdot)$, at (a) $\tau = 0.25$ and (b) $\tau = 0.75$. "Online One Sample" refers to the online learning algorithm studied in Section 2.1. The dimension $d = 100$, the total sample size $n = 20{,}000$, the batch size $n_t \equiv 100$, and the noise has a $t_{1.1}$ distribution. Box-plots are drawn based on 50 independent simulations.
Both online one-sample learning and batch learning are studied in this experiment. The total sample size is set at $n = 20{,}000$. Figure 5 displays a box plot of error rates based on 50 replications. For each simulation, the online batch learning and one-sample learning algorithms take approximately 2 and 10 seconds, respectively, on a MacBook Pro (2020). Offline learning, implemented using the quantreg package (Portnoy and Koenker, 1997), takes more than 2 minutes. Figure 5 shows that offline learning achieves the best statistical accuracy when the total sample size is relatively small, while online one-sample learning and batch learning achieve comparable statistical accuracy. These findings are consistent with our theory. Simulation results for the larger sample size $n = 50{,}000$ are displayed in Figure 6, in which case offline learning achieves only slightly better accuracy than its online counterparts. However, the online algorithms enjoy much higher computational efficiency: on the same machine, the online one-sample, batch, and offline learning methods take approximately 30, 5, and 520 seconds, respectively.

CONVERGENCE DYNAMICS COMPARISONS. While online QR is motivated primarily by heavy-tailed noise, it is still of interest to examine its performance under Gaussian noise. To that end, we compare the performance of our proposed online QR algorithms with that of the classical online least squares algorithm (Zhang, 2004). We will show that the proposed online QR algorithms are not only robust in the presence of heavy-tailed noise or responses, but also as efficient as the classical online least squares algorithm when the noise is Gaussian. While online learning with the squared loss has been extensively studied (Orabona, 2019; Hazan, 2016), stepsize schemes guided by statistical properties remain relatively under-explored.
Here we briefly explain an appropriate stepsize scheme for achieving long-term statistical optimality with the online least squares algorithm, focusing on sub-Gaussian noise and an unknown horizon. Taking the squared loss at time $t$ to be $f_t(\beta) := (Y_t - X_t^\top\beta)^2$, the expected squared length of the gradient $g_t$ satisfies $\mathbb{E}\big[\|g_t\|_2^2 \mid \beta_t\big] \le 4 d\, C_u\big(\mathbb{E}\xi^2 + C_u\,\|\beta_t - \beta^*\|_2^2\big)$.

Figure 6: Error rates of offline and online regression using the quantile loss $\rho_{Q,\tau}(\cdot)$, at (a) $\tau = 0.25$ and (b) $\tau = 0.75$. "Online One Sample" refers to the online learning algorithm studied in Section 2.1. The dimension $d = 100$, the total sample size $n = 50{,}000$, the batch size $n_t \equiv 200$, and the noise has a $t_{1.1}$ distribution. Box-plots are drawn based on 50 independent simulations.

The gradient length is determined primarily by $\|\beta_t - \beta^*\|_2^2$ when this quantity is large. Conversely, when $\beta_t$ is sufficiently close to the oracle, the gradient length is determined by $\mathbb{E}\xi^2$. This implies that an appropriate stepsize scheme for the online least squares algorithm should also consist of two phases: a constant stepsize $\eta_t \equiv \eta = O(d^{-1})$ in the first phase, and a decaying stepsize schedule $\eta_t = O\big((t + d)^{-1}\big)$ in the second phase. The detailed proof of the convergence dynamics under this stepsize scheme is almost identical to that of Theorem 1 and is thus omitted. We compare the convergence dynamics of the online one-sample QR algorithm and the online least squares algorithm (equipped with the aforementioned two-phase stepsize scheme). The dimension is $d = 100$, the quantile level is $\tau = 0.5$, and the horizon is unknown. The results under Gaussian noise and $t_{1.1}$ noise are presented in Figure 7. The convergence dynamics of the two algorithms are comparable under Gaussian noise, both showing linear convergence in the first phase and an $O(1/t)$ decaying rate afterwards, and they achieve almost the same statistical accuracy in the end. Under $t_{1.1}$-distributed noise, the online least squares algorithm does not converge.
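The two-phase stepsize scheme for online least squares described above can be sketched as follows. This is an illustrative sketch with placeholder constants, not the exact schedule used in our experiments.

```python
import numpy as np

def online_ls(stream, beta0, t1=200):
    """Online least squares with a two-phase stepsize:
    constant eta = O(1/d) for t < t1, then eta_t = O(1/(t + d)) afterwards."""
    beta = np.array(beta0, dtype=float)
    d = beta.size
    for t, (x, y) in enumerate(stream):
        eta = 0.25 / d if t < t1 else 1.0 / (t - t1 + 4 * d)
        beta += eta * (y - x @ beta) * x  # minus the gradient of (y - <x, beta>)^2 / 2
    return beta
```

Unlike the bounded quantile sub-gradient, the least-squares gradient scales with the raw residual, which is why heavy-tailed noise such as $t_{1.1}$ destabilizes this recursion while leaving the quantile version unaffected.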
In contrast, the proposed online QR algorithm ensures stable convergence and achieves error rates comparable to those under Gaussian noise.

PARAMETER SENSITIVITY. Here we empirically examine the flexibility of the tuning parameters discussed in Theorem 1. The simulation results demonstrating parameter sensitivity are presented in Figure 8. Following the same settings as before, we set $d = 100$, $C_l = C_u = 1$, and $\|\beta^*\|_2/\gamma = 20$. Figure 8a illustrates the impact of varying the geometric decay rate $c \in \{0.5, \ldots, 0.05\}$ for the first-phase stepsize, defined as $\eta_t = (1 - c/d)^t \eta_0$, while the second-phase stepsize is fixed as $\eta_t = 15/(t - t_1 + 20d)$. Interestingly, when $t$ is sufficiently large, different choices of the decay rate $c$ result in similar convergence behavior. This suggests that the decay rate $c$ in the first-phase stepsize is indeed flexible in practice. Figures 8b-8d further explore the convergence dynamics under various values of $C_a$ and $C_b$. Specifically, Figure 8c shows the sensitivity to $C_b$ with $C_a = 15$ fixed, while Figure 8d examines the influence of $C_a$ with $C_b = 100$ held constant. These results indicate that as long as $C_a$ and $C_b$ are not too small, the estimation error remains on the same scale, and variations in $C_b$ have little effect on performance. However, if both parameters are too small (e.g., $C_a = 1$ and $C_b = 0$), the theoretical $O(1/t)$ convergence rate is no longer guaranteed. These empirical findings support the theoretical flexibility of $C_a$ and $C_b$ as established in Theorem 1.

Figure 7: Relative error versus time/iterations for online QR and least squares algorithms, under four settings: $\mathbb{E}|\xi| = 20$ with Gaussian noise, $\mathbb{E}|\xi| = 200$ with Gaussian noise, $\mathbb{E}|\xi| = 20$ with $t_{1.1}$ noise, and $\mathbb{E}|\xi| = 200$ with $t_{1.1}$ noise. The dimension is $d = 100$, the (unknown) time horizon is $T = 10^5$, and the quantile level is $\tau = 1/2$. The proposed online QR algorithm is robust to heavy-tailed noise and achieves comparable performance to online least squares under Gaussian noise.

Figure 8: Relative error versus time/iteration for online QR. Panels: (a) convergence under varying geometrically decaying stepsizes for $t \le t_1$, followed by a common fixed stepsize for $t > t_1$; (b) convergence under varying values of $C_a$ and $C_b$ for $t > t_1$, with a common stepsize applied for $t \le t_1$; (c) convergence under a fixed $C_a$ for $t > t_1$, and a shared stepsize scheme for $t \le t_1$; (d) convergence under a fixed value of $C_b$ for $t > t_1$, and a shared stepsize scheme for $t \le t_1$. The dimension is $d = 100$, with a signal-to-noise ratio of $\|\beta^*\|_2/\mathbb{E}|\xi| = 20$. The noise is generated from a $t$-distribution with 1.1 degrees of freedom. The time horizon is set to $T = 10^5$.

REGRET DYNAMICS. In this paragraph, we examine the regret dynamics illustrated in Figure 9, under the setting where $d = 200$, $T = 10^5$, and the noise follows a $t$-distribution with 1.1 degrees of freedom ($t_{1.1}$). The left panel (Figure 9a) corresponds to the stepsize schemes $\eta_t = (1 - 0.05/d)^t \eta_0$ and $C_a/(t - t_1 + C_b d)$. The pink star-dotted line represents a benchmark function of the form $b + \log(1 + t/a)$. As shown in Figure 9a, the curves exhibit highly parallel growth patterns.

Figure 9: Regret dynamics versus time/iteration for online QR: (a) regret dynamics with an oracle $\log(t)$ function; (b) regret dynamics under different stepsizes. The dimension $d = 100$, signal-to-noise ratio $\|\beta^*\|_2/\mathbb{E}|\xi| = 20$, unknown horizon $T = 10^5$, $t_{1.1}$ noise.
Figure 9b displays the regret values over time $t$ under different stepsize schemes, all showing a logarithmic trend of the form $b + \log(1 + t/a)$, albeit with different parameterizations.

3.2 Real Data Example

In this section, we analyze a real-world dataset on news popularity provided by Moniz and Torgo (2018). Our focus is on news articles categorized under Economy that were published on Facebook. The objective is to predict the final popularity of each news item using three sets of predictors: the sentiment score of the headline text, the sentiment score of the body text, and the popularity levels measured at successive time intervals post-publication (0-6h, ..., 18-24h). The dataset contains a total of $N = 29{,}928$ samples and $d = 7$ predictors. We randomly split the dataset into a training set with $n_1 = 20{,}000$ observations and a test set with $n_2 = 9{,}928$ observations. We apply the proposed online QR and batch learning algorithms to the training set. As benchmarks, we include the offline QR estimator, implemented using quantreg (Portnoy and Koenker, 1997), and the ordinary least squares (OLS) estimator. It is worth noting that this dataset was also examined by Chen and Zhou (2020), although their analysis was based on a different subset of the data, as some records from the original source are no longer available. We report the test errors $n_2^{-1}\sum_{(Y_i, X_i)\in \text{test data}} |Y_i - X_i^\top\hat\beta|$ in Table 1. The outcome variable $Y$ ranges from approximately 1 to $10^4$, with an average absolute value on the test set exceeding 50. Despite this wide range, the prediction errors reported in Table 1 are substantially smaller, indicating strong predictive performance. Among the three QR methods (online, batch, and offline), the online and batch QR estimators yield slightly higher prediction errors than offline QR. All three quantile-based approaches, however, significantly outperform the OLS in terms of test error.
Moreover, consistent with the findings of Chen and Zhou (2020), we observe that the estimated coefficients for the sentiment scores of both the headline and body text are close ONLINE QUANTILE REGRESSION Method Online QR Online Batch QR Offline QR OLS Test Error 8.5824 8.5084 8.1895 9.8752 Table 1: Prediction performance on the news popularity dataset. to zero. This suggests that, in this setting, sentiment features provide limited predictive value for news popularity. 4. Proof of Theorem 1 This section presents the proof of Theorem 1. Proofs of the remaining results are deferred to the supplementary material. The update procedure βt+1 = βt ηtgt yields βt+1 β 2 2 = βt β 2 2 2ηt βt β , gt + η2 t gt 2 2, (5) where the sub-gradient gt is given by gt = Xt{ τ 1(Yt>X t βt) + (1 τ) 1(Yt 1 remains fixed throughout the proof and 0 < c5 < 1 is some sufficiently small constant to be specified later. Initially, the inequality β0 β 2 C β0 β 2 holds trivially. For notational convenience, define Dl := (1 c5 Cl Cu 1 τ 2d)l β0 β 2 for l = 0, 1, . . . . Assuming the inductive hypothesis βl β 2 C Dl holds for all l = 0, 1, . . . , t, our goal is to establish that βt+1 β 2 C Dt+1. By convexity and the property of sub-gradient, βt β , gt ft(βt) ft(β ). Substituting this into equation (5) yields βt+1 β 2 2 βt β 2 2 2ηt{(ft(βt) ft(β )} + 2η2 t Cu τ 2d = βt β 2 2 2ηt E{ft(βt) ft(β )|βt} + 2η2 t Cu τ 2d 2ηt ft(βt) ft(β ) E{ft(βt) ft(β )|βt} . SHEN, XIA AND ZHOU For the expected difference E{ft(βt) ft(β )|βt}, Lemma 13 implies E{ft(βt) ft(β )|βt} Cl 2π βt β 2 γ. During phase one, where βt β 2 8C 1/2 l γ, it follows from the above lower bound that βt+1 β 2 2 βt β 2 2 1 Cl βt β 2 + 2η2 t Cu τ 2d 2ηt ft(βt) ft(β ) E{ft(βt) ft(β )|βt} . We further rewrite the above inequality as βt+1 β 2 2 1 Cl 3 ηt βt β 2 βt β 2 2 + 2η2 t τ 2Cud 2ηt ft(βt) ft(β ) E{ft(βt) ft(β )|βt} . 
By the induction hypothesis, βt β C Dt, and hence βt+1 β 2 2 1 Cl βt β 2 2 + 2η2 t τ 2Cud 2ηt ft(βt) ft(β ) E{ft(βt) ft(β )|βt} Substituting the stepsize ηt = c Cl Cu Dt τ 2d into the inequality above, we obtain βt+1 β 2 2 1 c 3C Cl Cu βt β 2 2 + 2c2 Cl 1 τ 2d D2 t Dt τ 2d ft(βt) ft(β ) E{ft(βt) ft(β )|βt} . By applying this iterative bound repeatedly starting from t = 0, we obtain 1 c 3C Cl Cu t+1 β0 β 2 2 + 2c2 Cl 1 c 3C Cl Cu 1 c 3C Cl Cu t l Dl fl(βl) fl(β ) E{ft(βt) ft(β )|βt} . We begin by bounding the second term on the right-hand side of the above inequality as 1 c 3C Cl Cu 1 c 3C Cl Cu t l 1 c5 Cl Cu ONLINE QUANTILE REGRESSION 1 c 3C Cl Cu t 1 c 3C c5 Cu 1 c 3C Cl Cu where the second line follows from the inequality 1 2a (1 a)2 and the condition c5 c/(6C ). This further implies 1 c 3C Cl Cu t l D2 l 12c C 1 c 3C Cl Cu To bound the last stochastic term, under the inductive hypothesis t l=0{ βl β 2 2 C 2D2 l }, we have fl(βl) fl(β ) E fl(βl) fl(β ) βl Ψ2 C1C p Cu τDl, l = 0, 1, . . . , t, where Ψ2 denotes the Orlicz norm associated with the Orlicz function Ψ(x) = exp(x2) 1. Then, by applying a variant of Azuma s inequality (Shamir, 2011), it follows that with probability at least 1 exp( c1d), t l Dl fl(βl) fl(β ) E{fl(βl) fl(β )|βl} v u u t Cud 1 c 3C Cl Cu v u u t Cud 1 c 3C Cl Cu 2t 2l 1 c5 Cl Cu Cud 1 c 3C Cl Cu Cl τ 2d D2 0 τ 2d 1 c5 Cl Cu where the last two inequalities follow from the bound 1 2a (1 a)2 for 0 < a < 1, together with the condition c5 c/(6C ). Putting together the pieces, we conclude that for any C satisfying c C c/(6c5), it holds 1 c 3C Cl Cu t+1 β0 β 2 2 + 12c C 1 c 3C Cl Cu t D2 0 + c C 3/2D2 t+1 3D2 t+1 + 12c C 1 c5 Cl Cu 2t D2 0 exp Cu 1 τ 2d (1 c5 Cl Cu 1 τ 2d)2 + c C 3/2D2 t+1 SHEN, XIA AND ZHOU This completes the proof of convergence in phase one. SECOND PHASE ANALYSIS. The verification of the second phase is also carried out by induction. 
The proof for the step t1 + 1 is omitted due to its simplicity, as it follows analogously from the subsequent argument. We assume t l=t1{ βl β 2 2 C d l t1+Cbd Cu C2 l b2 0} and proceed to analyze the behavior at iteration t+1. Within this second phase, Lemma 17 establishes a quadratic lower bound on the expected excess loss, that is, E ft(βt) ft(β ) βt Cl 12b0 βt β 2 2. Subsequently, the right-hand side of equation (5) can be upper bounded by βt+1 β 2 2 βt β 2 2 1 6b0 ηt Cl βt β 2 2 + 2η2 t Cu τ 2d 2ηt ft(βt) ft(β ) E ft(βt) ft(β ) βt . Incorporating the chosen stepsize ηt = Ca t t1+Cbd b0 Cl preceding equation, we obtain βt+1 β 2 2 1 Ca 12(t t1 + Cbd) βt β 2 2 + 2b2 0 τ 2 Cu C2 a (t t1 + Cbd)2 d 2 Ca t t1 + Cbd b0 Cl ft(βt) ft(β ) E ft(βt) ft(β ) βt . Summing the above inequality over t from t1 onward, we arrive at 1 Ca 12(l t1 + Cbd) + 2b2 0 τ 2 Cu 1 Ca 12(l + 1 t1 + Cbd) 1 Ca 12(t t1 + Cbd) C2 a (l t1 + Cbd)2 1 Ca 12(l + 1 t1 + Cbd) 1 Ca 12(t t1 + Cbd) Ca l t1 + Cbd fl(βl) fl(β ) E fl(βl) fl(β ) βl . (6) A sharp bound on the above equation constitutes a key step in the proof. We proceed by analyzing each of the three terms on the right-hand side separately. Note that when Ca > 12, we have 1 Ca 12(l t1 + Cbd) l t1 + Cbd l + 1 t1 + Cbd. Then, the first term on the right-hand side of (6) satisfies 1 Ca 12(l t1 + Cbd) βt1 β 2 2 Cbd t + 1 t1 + Cbd βt1 β 2 2. ONLINE QUANTILE REGRESSION Consider the product sequence, which can be bounded as 1 Ca 12(l + 1 t1 + Cbd) 1 Ca 12(t t1 + Cbd) k=l+1 log 1 Ca 12(k t1 + Cbd) Ca 12(k t1 + Cbd) 1 x + Cbd dx 12 log t + 1 t1 + Cbd l + 1 t1 + Cbd = l + 1 t1 + Cbd t + 1 t1 + Cbd where the first inequality uses the bound log(1+x) x, and the third line leverages the relationship between integrals and discrete sums. 
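The product-versus-integral comparison used here can be checked numerically. The constants below ($C_a = 24$, $C_b d = 10$, $t_1 = 0$, and the indices $l, t$) are illustrative placeholders, not values from the theorem:

```python
# Check that prod_{k=l+1}^{t} (1 - Ca / (12 (k - t1 + Cbd))) is at most
# ((l + 1 - t1 + Cbd) / (t + 1 - t1 + Cbd))^(Ca / 12), which follows from
# log(1 + x) <= x and the sum-integral comparison for the decreasing
# function x |-> 1 / (x + Cbd).
Ca, Cbd, t1 = 24.0, 10.0, 0
l, t = 5, 200

prod = 1.0
for k in range(l + 1, t + 1):
    prod *= 1.0 - Ca / (12.0 * (k - t1 + Cbd))

bound = ((l + 1 - t1 + Cbd) / (t + 1 - t1 + Cbd)) ** (Ca / 12.0)
```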
Consequently, the second term in equation (6) can be bounded as 1 Ca 12(l + 1 t1 + Cbd) 1 Ca 12(t t1 + Cbd) C2 a (l t1 + Cbd)2 l + 1 t1 + Cbd t + 1 t1 + Cbd 12 C2 a (l t1 + Cbd)2 12 C2 a (t + 1 t1 + Cbd) Ca l=t1 (l t1 + Cbd) Ca With Ca > 12, we have l=t1 (l t1 + Cbd) Ca 12 2 Z t t1+1 0 (x + Cbd) Ca 12 2 dx = 1 Ca 12 1(t t1 + 1 + Cbd) Ca Combining these results, we obtain the following upper bound for the second term in equation (6): 1 Ca 12(l + 1 t1 + Cbd) 1 Ca 12(t t1 + Cbd) C2 a (l t1 + Cbd)2 12 C2 a t + 1 t1 + Cbd. Assuming t l=t1{ βl β 2 2 C d l t1+Cbd Cu C2 l b2 0}, it follows that fl(βl) fl(β ) E fl(βl) fl(β ) βl Ψ2 C τ C d l t1 + Cbd C2u C2 l b0. By Theorem 2 in Shamir (2011) and equation (7), with probability exceeding 1 c exp( cd), the third term in equation (6) can be bounded by 1 Ca 12(l + 1 t1 + Cbd) 1 Ca 12(t t1 + Cbd) Ca l t1 + Cbd SHEN, XIA AND ZHOU fl(βl) fl(β ) E fl(βl) fl(β ) βl τCa Cu C2 l l + 1 t1 + Cbd t + 1 t1 + Cbd 6 1 (l t1 + Cbd)2 C d2 l t1 + Cbdb2 0 = τCa Cu C2 l b2 0 (t + 1 t1 + Cbd) Ca l=t1 (l t1 + Cbd) Ca τCa Cu C2 l b2 0 C d t + 1 t1 + Cbd 1 + Cbd Finally, combining the upper bounds of the three terms, we obtain βt+1 β 2 2 Cbd t + 1 t1 + Cbd βt1 β 2 2 + 2 1 + Cbd 12 C2 a t + 1 t1 + Cbd + Ca Cu C2 l b2 0 C d t + 1 t1 + Cbd 1 + Cbd C2 l b2 0 d t + 1 t1 + Cbd, where {(1 + Cbd)/(Cbd} Ca 12 < 2Ca/12 is independent of d, and C is a sufficiently large constant that may depend on Ca, Cb. This completes the proof of Theorem 1. 5. Discussions This paper focuses on online quantile regression in low-dimensional settings. The analytical framework developed herein can also be extended to the study of stochastic sub-gradient methods for low-rank regression under quantile loss. Let M Rd1 d2 denote the true low-rank matrix with rank(M ) = r. At each time step t, we observe data (Yt, Xt), where Yt = Xt, M + ξt, and Xt Rd1 d2 is the sensing matrix. The corresponding loss function at time t is defined as ft(M) = ρQ,τ(Yt Xt, M ). 
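A single stochastic sub-gradient step for this low-rank model can be sketched as follows. As a simplification, the sketch uses plain rank-$r$ truncation ($\mathrm{SVD}_r$ applied directly to the updated matrix) rather than a tangent-space projection; all names and constants are illustrative.

```python
import numpy as np

def svd_r(M, r):
    # Best rank-r approximation of M (hard thresholding of singular values).
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def lowrank_qr_step(M, X, y, tau, eta, r):
    # One sub-gradient step for f_t(M) = rho_tau(y - <X, M>), followed by
    # rank-r truncation so the iterate stays low-rank.
    resid = y - np.sum(X * M)
    G = -tau * X if resid > 0 else (1.0 - tau) * X  # sub-gradient in M
    return svd_r(M - eta * G, r)

rng = np.random.default_rng(1)
d1, d2, r = 6, 5, 2
M0 = np.zeros((d1, d2))
X = rng.normal(size=(d1, d2))
M1 = lowrank_qr_step(M0, X, y=1.0, tau=0.5, eta=0.1, r=r)
```

The truncation keeps per-iterate storage at $O((d_1 + d_2) r)$ instead of $O(d_1 d_2)$ when the factors are stored, which is the point of maintaining low rank online.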
A central challenge in this setting is to maintain the low-rank structure of the iterates over time. To this end, the online Riemannian optimization framework updates the estimate according to $M_{t+1} = \mathrm{SVD}_r\big(M_t - \eta_t\, P_{T_t}(G_t)\big)$, where $G_t \in \partial f_t(M_t)$ is the sub-gradient and $P_{T_t}(\cdot)$ denotes the projection onto the tangent space at $M_t$. Detailed expressions for $P_{T_t}(G_t)$ and further discussion of Riemannian optimization techniques can be found in Vandereycken (2013) and Mishra et al. (2014). Each iteration retains only the leading $r$ singular components, raising an important conceptual question: to what extent does this truncation preserve essential statistical information? A comprehensive theoretical investigation of online low-rank regression under quantile loss is beyond the scope of this paper and is left for future research.

Acknowledgments

The authors would like to thank Professor Genevera Allen and the two anonymous reviewers for their insightful comments and constructive suggestions that helped improve this work. Yinan Shen is the corresponding author and acknowledges support from the WiSE Supplemental Faculty Support program at the University of Southern California. Dong Xia's research was partially supported by the Hong Kong Research Grants Council under Grant GRF 16302323. Wen-Xin Zhou's research was supported by the National Science Foundation under Grant DMS-2401268 and by the Australian Research Council Discovery Project Grant DP230100147.

References

Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems, 26, 2013.

Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.

Alexandre Belloni and Victor Chernozhukov. ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82–130, 2011.

Léon Bottou.
Online learning and stochastic approximations. In David Saad, editor, On-line Learning in Neural Networks, pages 9–42. Cambridge University Press, 1999.

Jian-Feng Cai, Jingyang Li, and Dong Xia. Online tensor learning: Computational and statistical trade-offs, adaptivity and optimal regret. arXiv preprint arXiv:2306.03372, 2023.

Léopold Cambier and P.-A. Absil. Robust low-rank matrix completion by Riemannian optimization. SIAM Journal on Scientific Computing, 38(5):S440–S460, 2016.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Vasileios Charisopoulos, Yudong Chen, Damek Davis, Mateo Díaz, Lijun Ding, and Dmitriy Drusvyatskiy. Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. Foundations of Computational Mathematics, 21(6):1505–1593, 2021.

Lanjue Chen and Yong Zhou. Quantile regression in big data: A divide and conquer based strategy. Computational Statistics & Data Analysis, 144:106892, 2020.

Xi Chen, Weidong Liu, and Yichen Zhang. Quantile regression under memory constraint. The Annals of Statistics, 47(6):3244–3273, 2019.

Xuerong Chen and Senlin Yuan. Renewable quantile regression with heterogeneous streaming datasets. Journal of Computational and Graphical Statistics, 33(4):1185–1201, 2024.

Bernard Delyon and Anatoli Juditsky. Accelerated stochastic approximation. SIAM Journal on Optimization, 3(4):868–881, 1993.

Chuong B. Do, Quoc V. Le, and Chuan-Sheng Foo. Proximal regularization for online and batch learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 257–264, 2009.

John Duchi and Yoram Singer. Efficient online and batch learning using forward-backward splitting. Journal of Machine Learning Research, 10(99):2899–2934, 2009.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.
Journal of Machine Learning Research, 12(7):2121–2159, 2011.

Jianqing Fan, Wenyan Gong, Chris Junchi Li, and Qiang Sun. Statistical sparse online regression: A diffusion approximation perspective. In International Conference on Artificial Intelligence and Statistics, pages 1017–1026. PMLR, 2018.

Marcelo Fernandes, Emmanuel Guerre, and Eduardo Horta. Smoothing quantile regressions. Journal of Business and Economic Statistics, 39(1):338–357, 2021.

Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In International Conference on Machine Learning, pages 1920–1930. PMLR, 2019.

Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019.

Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.

Qiyu Han, Will Wei Sun, and Yichen Zhang. Online statistical inference for matrix contextual bandit. arXiv preprint arXiv:2212.11385, 2022.

Ruijian Han, Lan Luo, Yuanyuan Lin, and Jian Huang. Online inference with debiased stochastic gradient descent. Biometrika, 111(1):93–108, 2024.

Yanjun Han, Zhengqing Zhou, Zhengyuan Zhou, Jose Blanchet, Peter W. Glynn, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321, 2020.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007a.

Elad Hazan, Alexander Rakhlin, and Peter Bartlett. Adaptive online gradient descent. Advances in Neural Information Processing Systems, 20, 2007b.

Xuming He and Qi-Man Shao. On parameters of increasing dimensions. Journal of Multivariate Analysis, 73(1):120–135, 2000.

Xuming He, Xiaoou Pan, Kean Ming Tan, and Wen-Xin Zhou.
Smoothed quantile regression with large-scale inference. Journal of Econometrics, 232(2):367–388, 2023.

Matthew Hoffman, Francis Bach, and David Blei. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23, 2010.

Rong Jiang and Keming Yu. Renewable quantile regression for streaming data sets. Neurocomputing, 508:208–224, 2022.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.

Roger Koenker. Quantile Regression. Cambridge University Press, 2005.

Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica, 46(1):33–50, 1978.

Roger Koenker, Victor Chernozhukov, Xuming He, and Liming Peng. Handbook of Quantile Regression. CRC Press.

Roger W. Koenker and Vasco d'Orey. Algorithm AS 229: Computing regression quantiles. Applied Statistics, pages 383–393, 1987.

John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10(3):777–801, 2009.

Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

Xiao Li, Zhihui Zhu, Anthony Man-Cho So, and René Vidal. Nonconvex robust low-rank matrix recovery. SIAM Journal on Optimization, 30(1):660–686, 2020.

Jianhao Ma and Salar Fattahi. Global convergence of sub-gradient method for robust matrix recovery: Small initialization, noisy measurements, and over-parameterization. Journal of Machine Learning Research, 24(96):1–84, 2023.

H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010.

Zakaria Mhammedi, Wouter M. Koolen, and Tim van Erven. Lipschitz adaptivity with multiple learning rates in online learning.
In Conference on Learning Theory, pages 2490–2511. PMLR, 2019.

Bamdev Mishra, Gilles Meyer, Silvère Bonnabel, and Rodolphe Sepulchre. Fixed-rank matrix factorizations and Riemannian low-rank optimization. Computational Statistics, 29:591–621, 2014.

Nuno Moniz and Luís Torgo. Multi-source social feedback of online news feeds. arXiv preprint arXiv:1801.07055, 2018.

Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

Stephen Portnoy and Roger Koenker. The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Statistical Science, 12(4):279–300, 1997.

Nikita Puchkin, Eduard Gorbunov, Nickolay Kutuzov, and Alexander Gasnikov. Breaking the heavy-tailed noise barrier in stochastic optimization problems. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238, pages 856–864. PMLR, 2024.

M. Rajalakshmi, P. Saranya, and P. Shanmugavadivu. Pattern recognition-recognition of handwritten document using convolutional neural networks. In 2019 IEEE International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS), pages 1–7. IEEE, 2019.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.

Zhimei Ren and Zhengyuan Zhou. Dynamic batch learning in high-dimensional sparse linear contextual bandits. Management Science, 70(2), 2023.

Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in Neural Information Processing Systems, 25, 2012.

Leonard J. Savage. The theory of statistical decision. Journal of the American Statistical Association, 46(253):55–67, 1951.
Ohad Shamir. A variant of Azuma's inequality for martingales with subgaussian tails. arXiv preprint arXiv:1110.2392, 2011.

Yinan Shen, Jingyang Li, Jian-Feng Cai, and Dong Xia. Computationally efficient and statistically optimal robust high-dimensional linear regression. arXiv preprint arXiv:2305.06199, 2023.

Farzan Soleymani and Eric Paquet. Financial portfolio optimization with online deep reinforcement learning and restricted stacked autoencoder DeepBreath. Expert Systems with Applications, 156:113456, 2020.

Matthew Streeter and H. Brendan McMahan. Less regret via online conditioning. arXiv preprint arXiv:1002.4862, 2010.

Xiaofei Sun, Hongwei Wang, Chao Cai, Mei Yao, and Kangning Wang. Online renewable smooth quantile regression. Computational Statistics & Data Analysis, 185:107781, 2023.

Tian Tong, Cong Ma, and Yuejie Chi. Low-rank matrix recovery with scaled subgradient methods: Fast and robust convergence without the condition number. IEEE Transactions on Signal Processing, 69:2396–2409, 2021.

Bart Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.

Kangning Wang, Hongwei Wang, and Shaomin Li. Renewable quantile regression for streaming datasets. Knowledge-Based Systems, 235:107675, 2022.

Dong Xia and Ming Yuan. Statistical inferences of linear forms for noisy matrix completion. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(1):58–77, 2021.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning, page 116, 2004.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent.
In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.

Changliang Zou, Guanghui Wang, and Runze Li. Consistent selection of the number of changepoints via sample-splitting. The Annals of Statistics, 48(1):413, 2020.

Supplementary Material to Online Quantile Regression

Appendix A. Proof of Main Results

This section presents proofs of the main results for each setting.

A.1 Proof of Online Learning

In this subsection, we first prove Remark 3 and then prove the regret bound, which follows from the expected estimation error rates.

A.1.1 PROOF OF REMARK 3: ONLINE LEARNING WITH GOOD INITIALIZATION

We prove the convergence dynamics when $\|\beta_0 - \beta^*\|_2 < 8\sqrt{C_l^{-1}}\,\gamma$. According to the proof of Theorem 1, we have 1 Ca 12(l + Cbd) + 2b2 0 τ 2 Cu 1 Ca 12(l + 1 + Cbd) 1 Ca 12(t + Cbd) C2 a (l + Cbd)2 1 Ca 12(l + 1 + Cbd) 1 Ca 12(t + Cbd) fl(βl) fl(β ) E fl(βl) fl(β ) βl , and the following upper bound on the product series also holds: 1 Ca 12(l + Cbd) 1 Ca 12(t + Cbd) l + 1 + Cbd t + 1 + Cbd

Case One: large Ca. If Ca > 12, the analysis is exactly that of Theorem 1 and the upper bound is βt+1 β 2 2 C Cu d t + 1 t1 + Cbd b2 0.

Case Two: small Ca. If Ca < 12, the first term of equation (8) is bounded by 1 Ca 12(l + Cbd) β0 β 2 2 1 + Cbd In addition, the second term of equation (8) has the following upper bound: b2 0 τ 2 Cu 1 Ca 12(l + 1 + Cbd) 1 Ca 12(t + Cbd) C2 a (l + Cbd)2 b2 0 τ 2 Cu l + 1 + Cbd t + 1 + Cbd 12 C2 a (l + Cbd)2 b2 0 τ 2 Cu C2 l d C2 a (t + 1 + Cbd) Ca (l + 1 + Cbd)2 Ca 4b2 0 τ 2 Cu 12 C2 a (t + 1 + Cbd) Ca where the last line uses Cb 1 and (l + 1 + Cbd)2 Ca (x + Cbd)2 Ca 12 dx 1 1 Ca/12 1 It is worth noting that, under the event t l=0{ βl β 2 2 C ( d l+Cbd) Ca 12 β0 β 2 2}, we have fl(βl) fl(β ) E fl(βl) fl(β ) βl ψ2 τ Cu C d l + Cbd 12 β0 β 2 2.
Thus according to Theorem 2 in Shamir (2011), the upper bound of the last term for equation (8) is obtained with probability exceeding 1 exp( cd), 1 Ca 12(l + 1 + Cbd) 1 Ca 12(t + Cbd) fl(βl) fl(β ) E fl(βl) fl(β ) βl v u u t C Cud l + 1 + Cbd t + 1 + Cbd 6 1 (l + Cbd)2 12 β0 β 2 2 (t + 1 + Cbd) Ca 12 Cab0 β0 β 2. Thus altogether, by having Ca < min{12, (Cl/ Cu) β0 β 2/b0}, we obtain the following bound for equation (8) with some sufficient large constant C , where C may depend on the constants Ca, Cb, βt+1 β 2 2 C d t + 1 + Cbd 12 β0 β 2 2. A.1.2 PROOF OF THEOREM 4 Before proving the regret bound, we first prove the following lemma, which establishes the expected error rate dynamics. It is worth noting that it is not a consequence of the empirical dynamics Theorem 1. Lemma 10 Assume the same conditions and stepsizes as Theorem 1. Then in the first phase we have P t t1 E βt β 2 C max{τ 2, (1 τ)2}d Cu Cl D0 + Ct1 exp( cd)D0 and in the second phase we have E βt+1 β 2 2 C Cu C2 l d t+1 t1+Cbdb2 0. SHEN, XIA AND ZHOU Proof Note that the expectation of E βt+1 β 2 2 is E βt+1 β 2 2 = E E βt+1 β 2 2 βt = E E βt β 2 2 2ηt βt β , gt + η2 t gt 2 2 βt . First, we would consider the inner conditional expectation. According to the sub-gradient definition, and the independence between ηt and (Xt, Yt), we have E βt β 2 2 2ηt βt β , gt + η2 t gt 2 2 βt βt β 2 2 2ηt E ft(βt) ft(β ) βt + η2 t E gt 2 2 βt . FIRST PHASE ANALYSIS. We inherit the notations in Section 4. Denote the event Et := { βt β 2 CDt} and Theorem 1 shows P( t=0,...,t1Ec t ) t1 exp( c0d). Then we have E βt+1 β 2 = E βt+1 β 2 I{Et+1} + E βt+1 β 2 I{Ec t+1} CDt+1 + P Ec t+1 E βt+1 β 2 2 1 Hence we only need to provide a rough upper bound for E βt β 2. Note that E[ft(βt) ft(β ) βt] 0 and therefore, we have E βt+1 β 2 2 E βt β 2 + η2 t max{τ, 1 τ}2Cud E βt β 2 + C max{τ, 1 τ}2 Cl Cu which implies E βt+1 β 2 2 β0 β 2 2 + C max{τ, 1 τ}2 Cl Cu l=0 D2 l CD2 0. 
Inserting the inequality above into equation (9), we obtain E βt+1 β 2 CDt+1 + P Ec t+1 D0. Thus we have t=0 E βt+1 β 2 t=0 Dt + CP( t=0,...,t1Ec t )D0 C max{τ 2, (1 τ)2}d Cu Cl D0 + Ct1 exp( cd)D0. SECOND PHASE ANALYSIS. Lemma 17 shows that in the second phase, the lower bound of the expected excess risk is a second-order one, E[ft(βt) ft(β )|βt] 1 12b0 Cl βt β 2 2. Thus we have E βt+1 β 2 2 βt βt β 2 2 1 6b0 ηt Cl βt β 2 2 + η2 t Cu max{τ 2, (1 τ)2}d. ONLINE QUANTILE REGRESSION By specifying the stepsize ηt = Ca t t1+Cbd b0 Cl , we obtain E βt+1 β 2 2 βt 1 Ca 6(t t1 + Cbd) βt β 2 2 + Cu C2 a (t t1 + Cbd)2 db2 0. Taking the expectation over βt on both sides of the equation and substituting the upper bound at time t, E[ βt β 2 2] C Cu C2 l Cad t t1+Cbdb2 0, into the above equation, we obtain E βt+1 β 2 2 C 1 Ca 6(t t1 + Cbd) Cad t t1 + Cbdb2 0 + Cu C2 a (t t1 + Cbd)2 db2 0 = C 1 Ca/2 6(t t1 + Cbd) Cad t t1 + Cbdb2 0. Note that 1 Ca/2 6(t t1+Cbd) 1 1 t t1+Cbd t t1+Cbd t+1 t1+Cbd, and hence we finally obtain E[ βt+1 β 2 2] C Cu C2 l Cad t+1 t1+Cbdb2 0, which completes the proof for the second phase. We are now ready to establish the regret bound stated in Theorem 4. By the expected excess risk bound in Lemma 17, we have β = arg maxβ E PT t=0 ft(βt) ft(β). In the first phase, it has t=0 E [ft(βt) ft(β )] t=0 E E ft(βt) ft(β ) βt max{τ, 1 τ} p t=0 E βt β 2 C max{τ 3, (1 τ)3} p Cl D0 + max{τ, 1 τ} p Cut1 exp( cd), where we use the Lipschitz continuity of ft( ), and the final inequality follows from Lemma 10. For the second phase (t > t1), we have E [ft(βt) ft(β )] = E E ft(βt) ft(β ) βt Cu b1 E βt β 2 2 d t t1 + Cbdb2 0, where the first inequality follows from Lemma 17, and the second from Lemma 10. The proof is thus complete. A.2 Proof of Batch Learning In this subsection, we develop the batch learning framework in a sequential manner. 
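The batch variant replaces the single-observation sub-gradient with an average over the current batch of $n_t$ observations. A minimal sketch follows; the geometric stepsize and batch size below are illustrative placeholders, not the phase-dependent schedules of Theorem 5.

```python
import numpy as np

def batch_check_subgrad(Xb, yb, beta, tau):
    # Averaged sub-gradient of the batch check loss
    # (1/n) sum_i rho_tau(y_i - x_i'beta) w.r.t. beta.
    r = yb - Xb @ beta
    w = np.where(r > 0, -tau, np.where(r < 0, 1.0 - tau, 0.0))
    return (Xb * w[:, None]).mean(axis=0)

def online_batch_qr(batches, d, tau, eta0=1.0, decay=0.99):
    # Sequential batch updates; each batch is touched once and discarded.
    beta = np.zeros(d)
    eta = eta0
    for Xb, yb in batches:
        beta = beta - eta * batch_check_subgrad(Xb, yb, beta, tau)
        eta *= decay
    return beta

rng = np.random.default_rng(2)
d, n, T, tau = 4, 64, 400, 0.5
beta_star = np.full(d, 2.0)
batches = []
for _ in range(T):
    Xb = rng.normal(size=(n, d))
    yb = Xb @ beta_star + rng.laplace(size=n)  # symmetric noise, median 0
    batches.append((Xb, yb))
beta_hat = online_batch_qr(batches, d, tau)
```

Averaging over a batch shrinks the stochastic fluctuation of the sub-gradient by roughly a factor of $\sqrt{n_t}$, which is what makes the batch regime's regularity behavior differ from the one-observation updates.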
We begin with the proof of Theorem 5, proceed to analyze convergence under well-controlled initialization, and conclude with the proof of Theorem 7. SHEN, XIA AND ZHOU A.2.1 PROOF OF THEOREM 5 The update rule βt+1 = βt ηtgt ensures that βt+1 β 2 2 = βt β 2 2 2ηt βt β , gt + η2 t gt 2 2. (10) The behavior of the inner product term βt β , gt and the gradient norm gt 2 2 varies significantly depending on the proximity of the current iterate βt to the true parameter β . Therefore, we analyze these quantities by distinguishing between regimes where βt is either close to or far from β . FIRST PHASE ANALYSIS. We shall prove the convergence dynamics by induction. Initially, it holds obviously for β0. We continue to assume for the iteration of t, it has βt β 2 (1 1 100 Cl Cu )t D0 and we are going to prove the convergence dynamics at t + 1. For convenience, we denote Dt := (1 1 100 Cl Cu )t D0, Dt+1 := (1 1 100 Cl Cu )t+1 D0. We proceed to assume the event sup β1,β2 Rd |ft(β1) ft(β2) E [ft(β1) ft(β2)]| β1 β2 1 2 C1 max{τ, 1 τ} holds. Specifically, Proposition 14 establishes that P(Et) 1 exp( C2d) exp( p nt/ log nt). Under the event Et, the intermediate term in (10) can be lower bounded as βt β , gt ft(βt) ft(β ) E ft(βt) ft(β ) βt C1 max{τ, 1 τ} where the first inequality follows from the sub-gradient definition, and the second holds on the event Et. Additionally, the convexity of the quantile loss, combined with Lemma 12, yields E ft(βt) ft(β ) βt 1 i=1 E h ρQ,τ( X(t) i , β βt ) ρQ,τ( ξi) ρQ,τ(ξi) βt i Cl βt β 2 γ. Thus, if the batch size satisfies nt C(Cu/Cl) max{τ 2, (1 τ)2}d, and the iterate lies in the outer phase region βt β 2 8 q C 1 l γ, it holds that βt β , gt 1 Moreover, Lemma 16 shows that under the event Et, gt 2 Cu. Combining these, the update equation (10) can be upper bounded as βt+1 β 2 2 βt β 2 2 1 Cl βt β 2 + η2 t Cu. ONLINE QUANTILE REGRESSION We now treat the right-hand side of the above inequality as a quadratic function in βt β 2. 
Using the inductive assumption βt β 2 Dt, we obtain βt+1 β 2 2 D2 t 1 Cl Dt + η2 t Cu. The stepsize ηt = (1 1 100 Cl Cu )tη0 Cl Cu Dt [1/8, 5/24] implies that βt+1 β 2 2 1 1 which leads to the inductive conclusion βt+1 β 2 (1 1 100 Cl Cu )t+1D0 = Dt+1. SECOND PHASE ANALYSIS. In the second phase, we shall still assume event Et holds, which is defined and discussed in the first phase analyses. Note that according to Lemma 17, in this case, where βt β 2 8 q C 1 l γ, the expectation of excess risk has a quadratic lower bound, E ft(βt) ft(β ) βt 1 12b0 Cl βt β 2 2. The second phase region has the constraints βt β 2 C max{τ, 1 τ} q (Cu/C2 l )d/nb0. Then under event Et, we have ft(βt) ft(β ) 1 12b0 Cl βt β 2 2 C1 max{τ, 1 τ} p Cud/nt βt β 2 1 24b0 Cl βt β 2 2. In addition, Lemma 16 proves that under the event Et, the following holds in the second phase region: b2 1 C2 u β β 2 2. Then together with the sub-gradient definition gt, βt β ft(βt) ft(β ), equation (10) can be upper bounded as βt+1 β 2 2 βt β 2 2 ηt 1 12b0 Cl βt β 2 2 + η2 t 4 b2 1 C2 u β β 2 2. By having stepsize ηt Cl C2u b2 1 b0 [c1, c2], we obtain βt+1 β 2 2 1 0.0005b2 1 b2 0 THIRD PHASE ANALYSIS. We shall prove the convergence dynamics by induction. The proof at t2 is trivial, which can be obtained similarly to the following analyses, and hence it is skipped. Then we are going to prove βt+1 β 2 when the desired inequality holds for t2, . . . , t. It is worth SHEN, XIA AND ZHOU noting that Lemma 16 and Proposition 14 show: in the region of βt β 2 C1 max{τ, 1 (Cu/C2 l )d/n b0, with probability exceeding 1 exp( C1d) exp( p nt/ log nt), it has gt 2 2 C2 3 τ 2Cu d n C2 ub2 0 C2 l b2 1 , where we denote τ := max{τ, 1 τ}, for convenience. According to the loss function expectation calculations in Lemma 17, we have E ft(βt) ft(β ) βt 1 24b0 βt β 2 2. 
Then together with the sub-gradient definition, we have βt+1 β 2 2 βt β 2 2 ηt 1 12b0 Cl βt β 2 2 + C2 3η2 t τ 2Cu d n C2 ub2 0 C2 l b2 1 + 2ηt ft(βt) ft(β ) E ft(βt) ft(β ) βt . Insert the stepsize ηt = Ca t t2+Cb b0 Cl into the above equation and then it arrives at βt+1 β 2 2 1 Ca 12(t t2 + Cb) βt β 2 2 + C2 3 τ 2 C2 a (t t2 + Cb)2 Cu C2 l d n C2 ub2 0 C2 l b2 1 b2 0 Ca t t2 + Cb ft(βt) ft(β ) E ft(βt) ft(β ) βt . Applying the above bound repeatedly from t2, we have 1 Ca 12(l t2 + Cb) + C2 3 τ 2 Cu C2 ub2 0 C2 l b2 1 1 Ca 12(l + 1 t2 + Cb) 1 Ca 12(t t2 + Cb) C2 a (l t2 + Cb)2 1 Ca 12(l + 1 t2 + Cb) 1 Ca 12(t t2 + Cb) Ca l t2 + Cb fl(βl) fl(β ) E fl(βl) fl(β ) βl . (11) A sharp analysis of equation (11) is key to the proof. Notice that with Ca 12, we have 1 Ca 12(l t2+Cb) 1 1 l t2+Cb l t2+Cb l+1 t2+Cb . In this way, the first term on the right hand side could be bounded with 1 Ca 12(l t2 + Cb) βt2 β 2 2 Cb t + 1 t2 + Cb βt2 β 2 2. ONLINE QUANTILE REGRESSION The product sequence satisfies, for each l = t2, . . . , t, that 1 Ca 12(l + 1 t2 + Cb) 1 Ca 12(t t2 + Cb) k=l+1 log 1 Ca 12(k t2 + Cb) Ca 12(k t2 + Cb) 1 x + Cb dx = exp Ca 12 log t + 1 t2 + Cb l + 1 t2 + Cb = l + 1 t2 + Cb t + 1 t2 + Cb The second term of equation (11) thus has the following upper bound: 1 Ca 12(l + 1 t2 + Cb) 1 Ca 12(t t2 + Cb) C2 a (l t2 + Cb)2 l + 1 t2 + Cb t + 1 t2 + Cb 12 C2 a (l t2 + Cb)2 12 C2 a (t + 1 t2 + Cb) Ca l=t2 (l t2 + Cb) Ca With Ca > 12, we have l=t2 (l t2 + Cb) Ca 12 2 Z t t2+1 0 (x + Cb) Ca 12 2 dx = 1 Ca 12 1(t t2 + 1 + Cb) Ca Hence, we have the upper bound of the second term 1 Ca 12(l + 1 t2 + Cb) 1 Ca 12(t t2 + Cb) C2 a (l t2 + Cb)2 12 1 t + 1 + Cb t2 . Then, it suffices to bound the last term of equation (11). 
It is worth noting that under the event of t l=t2 n βl β 2 2 C Cu C2 ub2 0 C2 l b2 1 1 l t2+Cb d nb2 0 o , we have fl(βl) fl(β ) E fl(βl) fl(β ) βl Ψ2 C2ub2 0 C2 l b2 1 1 l t2 + Cb Furthermore, invoking Azuma s sub-Gaussian inequality (e.g., Theorem 2 in Shamir (2011)) in conjunction with equation (12), it holds with probability exceeding 1 exp( cd) that 1 Ca 12(l + 1 t2 + Cb) 1 Ca 12(t t2 + Cb) Ca l t2 + Cb SHEN, XIA AND ZHOU fl(βl) fl(β ) E fl(βl) fl(β ) βl 1 Ca 12(k t2 + Cb) 2 C2 a (l t2 + Cb)2 fl(βl) fl(β ) E fl(βl) fl(β ) βl 2 Ψ2 d nb2 0 CCa (t + 1 t2 + Cb) Ca l=t2 (l + 1 t2 + Cb) Ca d nb2 0 CCa t + 1 t2 + Cb . Therefore, combining the pieces, equation (11) can be bounded from above as βt+1 β 2 2 C τ 2 Cu d n C2 ub2 0 C2 l b2 1 1 t + 1 t2 + Cb b2 0, where C > C2 a is some sufficiently large constant. A.2.2 ANALYSIS OF BATCH LEARNING WITH WELL-CONTROLLED INITIAL ERRORS We will elucidate the convergence dynamics under the condition of well-controlled initial error rates, specifically when β0 β 2 2 C2 1 max{τ 2, (1 τ)2}(Cu/C2 l )(d/n) b2 0. Employing a similar analysis as applied in the third phase of Theorem 5, we obtain βt+1 β 2 2 βt β 2 2 ηt 1 12b0 Cl βt β 2 2 + C2 3η2 t τ 2Cu d n C2 ub2 0 C2 l b2 1 2ηt ft(βt) ft(β ) E ft(βt) ft(β ) βt . Inserting the stepsize ηt = Ca t+Cb b0 Cl into the above equation results in βt+1 β 2 2 1 Ca 12(t + Cb) βt β 2 2 + C2 3 τ 2 Cu C2 a (t + Cb)2 d nb2 0 2 Ca t + Cb ft(βt) ft(β ) E ft(βt) ft(β ) βt . Moreover, the preceding equation implies 1 Ca 12(l + Cb) + C2 3 τ 2 C2 ub2 0 C2 l b2 1 1 Ca 12(l + 1 + Cb) 1 Ca 12(t + Cb) C2 a (l + Cb)2 d nb2 0 1 Ca 12(l + 1 + Cb) 1 Ca 12(t + Cb) ONLINE QUANTILE REGRESSION fl(βl) fl(β ) E fl(βl) ft(β ) βl . In what follows, we are going to upper bound each term of the right hand side equation. Case One: large Ca. With Ca > 12, the first term on the right-hand side can be bounded as 1 Ca 12(l + Cb) β0 β 2 2 Cb t + 1 + Cb β0 β 2 2. 
The product sequence is bounded by 1 Ca 12(l + 1 + Cb) 1 Ca 12(t + Cb) 1 x + Cb dx = l + 1 + Cb Subsequently, the second term can be bounded as 1 Ca 12(l + 1 + Cb) 1 Ca 12(t + Cb) C2 a (l + Cb)2 12 C2 a (l + Cb)2 Cb + 1 12 C2 a t + Cb + 1. Next, consider the last term. It is noteworthy that under t l=0{ βl β 2 2 C τ 2 Cu C2 l 1 l+Cb d nb2 0}, we have fl(βl) fl(β ) E fl(βl) ft(β ) βl 2 Ψ2 C τ 4 d n2 C2 u C2 l 1 l + Cb b2 0. Therefore, in accordance with Theorem 2 in Shamir (2011), with probability at least 1 exp( cd), the last term satisfies 1 Ca 12(l + 1 + Cb) 1 Ca 12(t + Cb) fl(βl) fl(β ) E fl(βl) ft(β ) βl Putting together the pieces, we conclude that βt+1 β 2 2 C τ 2 Cu C2 ub2 0 C2 l b2 1 1 t + 1 + Cb

Case Two: small Ca. In this setting, with Ca < 12, the second term satisfies 1 Ca 12(l + 1 + Cb) 1 Ca 12(t + Cb) C2 a (l + Cb)2 12 C2 a (l + Cb)2 C2 a (t + Cb + 1) Ca Under the event t l=0{ βl β 2 2 c (l+Cb) Ca 12 β0 β 2 2}, we have fl(βl) fl(β ) E fl(βl) ft(β ) βl 2 Ψ2 C τ Cu (l + Cb) Ca 12 β0 β 2 2. Thus, with probability exceeding 1 exp( cd), it holds that 1 Ca 12(l + 1 + Cb) 1 Ca 12(t + Cb) fl(βl) fl(β ) E fl(βl) ft(β ) βl Cl b0 β0 β 2 (l + Cb)2+ Ca Cl b0 β0 β 2 (t + 1 + Cb) Ca To sum up, we establish the upper bound βt+1 β 2 2 1 + Cb t + 1 + Cb 12 β0 β 2 2 + C2 3 τ 2 Cu C2 ub2 0 C2 l b2 1 d nb2 0 C2 a (t + Cb + 1) Ca Cl b0 β0 β 2 Ca (t + 1 + Cb) Ca Therefore, by choosing sufficiently small Ca < (Cl/Cu)(b1/b0)(Cl/ Cu) p n/d β0 β 2/ τb0 and Ca < Cb 1, we obtain βt+1 β 2 2 C (t + 1 + Cb) Ca 12 β0 β 2 2.

A.2.3 PROOF OF THEOREM 7

Prior to proving Theorem 7, we first establish the following lemma.

Lemma 11 Assume the same conditions and stepsize schemes as in Theorem 5. In the second phase, E βt+1 β 2 2 (1 c1(Cl/Cu)/(b2 1/b2 0))t+1 βt1 β 2 2; and in phase three, E βt+1 β 2 2 C Cu C2 l 1 t+1 t2+Cb d nb2 0.
Proof Note that the expectation can be expressed as E βt+1 β 2 2 = E E βt+1 β 2 2 βt = E E βt β 2 2 2ηt βt β , gt + η2 t gt 2 2 βt We first consider the inside conditional expectation. According to the sub-gradient definition and considering that ηt is independent of (Xt, Yt), we have E βt β 2 2 2ηt βt β , gt + η2 t gt 2 2 βt βt β 2 2 2ηt E ft(βt) ft(β ) βt + η2 t E gt 2 2 βt (13) The sub-gradient value at 0 won t affect the expectation calculations, so for convenience, we let ρQ,τ(x)|x=0 = 0. Then the sub-gradient at βt could be written as i=1 X(t) i τ1Y (t) i < X(t) i ,βt + (1 τ)1Y (t) i > X(t) i ,βt Its length is bounded with n2 t gt 2 2 = i=1 X(t) i 2 2 τ 21Y (t) i < X(t) i ,βt + (1 τ)21Y (t) i > X(t) i ,βt i =j X(t) i , X(t) j τ1Y (t) i < X(t) i ,βt + (1 τ)1Y (t) i > X(t) i ,βt τ1Y (t) j < X(t) j ,βt + (1 τ)1Y (t) j > X(t) j ,βt Taking the βt conditional expectation on each side of the above equation leads to n2 t E gt 2 2 βt max{τ 2, (1 τ)2}Cuntd + nt(nt 1) 2 E h X(t) i , X(t) j Hξ( X(t) i , βt β ) Hξ(0) Hξ( X(t) j , βt β ) Hξ(0) βt i . Notice that X(t) i follows Assumption 1 and then the transformed vector Z(t) i := Σ 1 2 X(t) i is isotropic. Thus we have E := E h X(t) i , X(t) j Hξ( X(t) i , βt β ) Hξ(0) Hξ( X(t) j , βt β ) Hξ(0) βt i = E h Σ 1 2 Z(t) i , Σ 1 2 Z(t) j Hξ( Z(t) i , Σ 1 2 (βt β ) ) Hξ(0) SHEN, XIA AND ZHOU Hξ( Z(t) j , Σ 1 2 (βt β ) ) Hξ(0) βt i . There exists a set of unit length orthogonal vectors {e1 := Σ 1 2 (βt β ) Σ 1 2 (βt β ) 2 , e2, . . . , ed} such that any two of them satisfy e i ej = 0 with i = j. Then the vector Zi can be decomposed into d independent random vectors Z(t) i = Z(t) i , Σ 1 2 (βt β ) Σ 1 2 (βt β ) 2 2 Σ 1 2 (βt β ) + Z(t) i , e2 e2 + + Z(t) i , ed ed. And Z(t) j could be written in a parallel way. It is worth noting that the term Hξ( Z(t) i , Σ 1 2 (βt β ) ) Hξ(0) is independent of the last d 1 components of the decomposition and E Z(t) i , el = 0 holds with l = 2, . . . , d. 
Thus we have
\[
\mathcal E\ \le\ \frac{C_u}{\|\Sigma^{1/2}(\beta_t-\beta^\star)\|_2^2}\,\mathbb E\Big[\langle Z_i^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle\,\langle Z_j^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle\,\big(H_\xi(\langle Z_i^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle)-H_\xi(0)\big)\big(H_\xi(\langle Z_j^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle)-H_\xi(0)\big)\,\Big|\,\beta_t\Big].
\]
There are two methods to upper bound the above quantity, which are presented as follows.
1. Bound $|H_\xi(\langle Z_i^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle)-H_\xi(0)|$ by the constant $1$. Then we have
\[
\mathcal E\ \le\ \frac{C_u}{\|\Sigma^{1/2}(\beta_t-\beta^\star)\|_2^2}\,\mathbb E\Big[\big|\langle Z_i^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle\big|\,\big|\langle Z_j^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle\big|\,\Big|\,\beta_t\Big]\ \le\ C_u.
\]
In this way, we have $\mathbb E\big[\|g_t\|_2^2\,\big|\,\beta_t\big]\le\max\{\tau^2,(1-\tau)^2\}\,C_u\,\frac{d}{n_t}+C_u$.
2. Bound $H_\xi(\langle Z_i^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle)-H_\xi(0)$ using the conditions in Assumption 2, which imply
\[
\big|H_\xi(\langle Z_i^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle)-H_\xi(0)\big|\ \le\ \frac{1}{b_1}\,\big|\langle Z_i^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle\big|.
\]
Then we have
\[
\mathcal E\ \le\ \frac{C_u}{b_1^2\,\|\Sigma^{1/2}(\beta_t-\beta^\star)\|_2^2}\,\mathbb E\Big[\big|\langle Z_i^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle\big|^2\,\big|\langle Z_j^{(t)},\Sigma^{1/2}(\beta_t-\beta^\star)\rangle\big|^2\,\Big|\,\beta_t\Big]\ \le\ \frac{C_u^2}{b_1^2}\,\|\beta_t-\beta^\star\|_2^2.
\]
This method leads to $\mathbb E\big[\|g_t\|_2^2\,\big|\,\beta_t\big]\le\max\{\tau^2,(1-\tau)^2\}\,C_u\,\frac{d}{n_t}+\frac{C_u^2}{b_1^2}\,\|\beta_t-\beta^\star\|_2^2$.
Thus, altogether, we have
\[
\mathbb E\big[\|g_t\|_2^2\,\big|\,\beta_t\big]\ \le\ \max\{\tau^2,(1-\tau)^2\}\,C_u\,\frac{d}{n_t}+\min\Big\{C_u,\ \frac{C_u^2}{b_1^2}\,\|\beta_t-\beta^\star\|_2^2\Big\}. \quad (14)
\]
Then we are ready to analyze the convergence dynamics in each phase.
SECOND PHASE ANALYSIS. In this region, equation (14) yields $\mathbb E\big[\|g_t\|_2^2\,\big|\,\beta_t\big]\le\frac{2C_u^2}{b_1^2}\|\beta_t-\beta^\star\|_2^2$ and Lemma 17 provides $\mathbb E\big[f_t(\beta_t)-f_t(\beta^\star)\,\big|\,\beta_t\big]\ge\frac{C_l}{12b_0}\|\beta_t-\beta^\star\|_2^2$. Hence, equation (13) is upper bounded by
\[
\mathbb E\big[\|\beta_{t+1}-\beta^\star\|_2^2\,\big|\,\beta_t\big]\ \le\ \|\beta_t-\beta^\star\|_2^2-\eta_t\,\frac{C_l}{12b_0}\,\|\beta_t-\beta^\star\|_2^2+\eta_t^2\,\frac{2C_u^2}{b_1^2}\,\|\beta_t-\beta^\star\|_2^2.
\]
Taking expectation over $\beta_t$, we obtain
\[
\mathbb E\,\|\beta_{t+1}-\beta^\star\|_2^2\ \le\ \Big(1-c_1\,\frac{C_l^2}{C_u^2}\,\frac{b_1^2}{b_0^2}\Big)\,\mathbb E\,\|\beta_t-\beta^\star\|_2^2.
\]
THIRD PHASE ANALYSIS. In this region, (14) guarantees $\mathbb E\big[\|g_t\|_2^2\,\big|\,\beta_t\big]\le 2\,(C_u^2/C_l^2)(b_0^2/b_1^2)\max\{\tau^2,(1-\tau)^2\}\,C_u\,\frac{d}{n_t}$ and Lemma 17 proves $\mathbb E\big[f_t(\beta_t)-f_t(\beta^\star)\,\big|\,\beta_t\big]\ge\frac{C_l}{12b_0}\|\beta_t-\beta^\star\|_2^2$.
Hence, equation (13) is upper bounded by
\[
\mathbb E\big[\|\beta_{t+1}-\beta^\star\|_2^2\,\big|\,\beta_t\big]\ \le\ \|\beta_t-\beta^\star\|_2^2-\eta_t\,\frac{C_l}{12b_0}\,\|\beta_t-\beta^\star\|_2^2+\eta_t^2\,\frac{2C_u^2}{C_l^2}\,\frac{b_0^2}{b_1^2}\,\max\{\tau^2,(1-\tau)^2\}\,C_u\,\frac{d}{n}
\]
\[
\le\ \Big(1-\frac{C_a}{12(t-t_2+C_b)}\Big)\|\beta_t-\beta^\star\|_2^2+\frac{C_a^2}{(t-t_2+C_b)^2}\,\frac{C_u^2 b_0^2}{C_l^2 b_1^2}\,\max\{\tau^2,(1-\tau)^2\}\,C_u\,\frac{d}{nb_0^2}.
\]
Taking expectation over $\beta_t$, we have
\[
\mathbb E\,\|\beta_{t+1}-\beta^\star\|_2^2\ \le\ \Big(1-\frac{C_a}{12(t-t_2+C_b)}\Big)\,\mathbb E\,\|\beta_t-\beta^\star\|_2^2+\frac{C_a^2}{(t-t_2+C_b)^2}\,\max\{\tau^2,(1-\tau)^2\}\,\frac{C_u^2 b_0^2}{C_l^2 b_1^2}\,C_u\,\frac{d}{nb_0^2}
\]
\[
\le\ C\,\Big(1-\frac{C_a}{24(t-t_2+C_b)}\Big)\,\frac{C_u^2 b_0^2}{C_l^2 b_1^2}\,\frac{1}{t-t_2+C_b}\,\frac{d}{nb_0^2}\ \le\ C\,\frac{C_u^2 b_0^2}{C_l^2 b_1^2}\,\frac{1}{t+1-t_2+C_b}\,\frac{d}{nb_0^2},
\]
where the second inequality applies the induction hypothesis at step $t$.
Then we are ready to prove Theorem 7. According to the expected excess risk bound in Lemma 17, we have $\beta^\star=\arg\max_{\beta}\,\mathbb E\sum_{t=0}^{T}\big(f_t(\beta_t)-f_t(\beta)\big)$. In the first phase, it holds that
\[
\mathbb E\,[f_t(\beta_t)-f_t(\beta^\star)]\ \le\ \max\{\tau,1-\tau\}\sqrt{C_u}\,\big(\mathbb E\,\|\beta_t-\beta^\star\|_2^2\big)^{1/2}\ \le\ \max\{\tau,1-\tau\}\sqrt{C_u}\,\Big(1-\frac{c\,(C_l/C_u)}{\max\{\tau^2,(1-\tau)^2\}}\Big)^{t}\,\|\beta_0-\beta^\star\|_2,
\]
where the first inequality is obtained in the same way as in Theorem 4 and the last follows from Lemma 10. In the second and the third phases, according to Lemma 17, we have
\[
\mathbb E\,[f_t(\beta_t)-f_t(\beta^\star)]=\mathbb E\,\mathbb E\big[f_t(\beta_t)-f_t(\beta^\star)\,\big|\,\beta_t\big]\ \le\ \frac{C_u}{b_1}\,\mathbb E\,\|\beta_t-\beta^\star\|_2^2.
\]
As a result, applying Lemma 11, the regret can ultimately be bounded by
\[
\mathrm{Regret}_T\ =\ \mathbb E\sum_{t=0}^{T}\big(f_t(\beta_t)-f_t(\beta^\star)\big)\ \le\ \sum_{t=0}^{t_1-1}\big(1-c\,(C_l/C_u)\big)^{t}\sqrt{C_u}\,\|\beta_0-\beta^\star\|_2+\frac{C_u}{b_1}\sum_{t=t_1}^{T}\Big[\Big(1-c\,\frac{b_1^2}{b_0^2}\,\frac{C_l^2}{C_u^2}\Big)^{t-t_1}\|\beta_{t_1}-\beta^\star\|_2^2+C\,\frac{C_u^3}{C_l^3}\,\frac{1}{t-t_1+C_b}\,\frac{d}{nb_0}\Big]
\]
\[
\le\ C\,\frac{C_u^3}{C_l^3}\,\frac{b_0^2}{b_1^2}\,\max\big\{\sqrt{C_u}\,\|\beta_0-\beta^\star\|_2,\ \gamma^2/b_1\big\}+C\,\frac{C_u^3}{C_l^3}\,\frac{d}{nb_0}\,\log\big(T+1-t_1+C_b\big).
\]
The proof is then complete.
A.3 Proof of Sequential Learning with Infinite Storage
As in the previous settings, the update at time $t$ can be characterized as follows:
\[
\|\beta_{t+1}-\beta^\star\|_2^2=\|\beta_t-\beta^\star\|_2^2-2\eta_t\langle\beta_t-\beta^\star,g_t\rangle+\eta_t^2\|g_t\|_2^2. \quad (15)
\]
We begin by establishing the regularity properties in this setting, which will be instrumental in proving convergence. According to Lemma 10 in Shen et al.
(2023) and Lemma 16, under the event
\[
\mathcal E_t:=\Big\{\sup_{\beta_1,\beta_2\in\mathbb R^d}\frac{|f_t(\beta_1)-f_t(\beta_2)-\mathbb E\,[f_t(\beta_1)-f_t(\beta_2)]|}{\|\beta_1-\beta_2\|_2}\ \le\ C\max\{\tau,1-\tau\}\sqrt{\frac{C_u d}{\sum_{l=0}^{t}n_l}}\Big\},
\]
the following properties hold for any $\beta\in\mathbb R^d$ and corresponding sub-gradient $g\in\partial f_t(\beta)$:
\[
f_t(\beta)-f_t(\beta^\star)\ \ge\ \frac{1}{2}\sqrt{\frac{C_l}{2\pi}}\,\|\beta-\beta^\star\|_2-\gamma,\qquad \|g\|_2\ \le\ C\sqrt{C_u},
\]
and simultaneously,
\[
f_t(\beta)-f_t(\beta^\star)\ \ge\ \frac{C_l}{12b_0}\,\|\beta-\beta^\star\|_2^2-C_1\sqrt{\frac{C_u d}{\sum_{l=0}^{t}n_l}}\,\|\beta-\beta^\star\|_2,\qquad \|g\|_2\ \le\ \frac{4C_u}{b_1}\,\|\beta-\beta^\star\|_2+C_2\sqrt{\frac{C_u d}{\sum_{l=0}^{t}n_l}},
\]
where $C>0$ is a universal constant that is independent of both the ambient dimension and the sample size. Specifically, Proposition 14 shows that
\[
\mathbb P(\mathcal E_t)\ \ge\ 1-\exp(-C_3 d)-\exp\Big(-\sqrt{\textstyle\sum_{l=0}^{t}n_l}\Big/\log\Big(\textstyle\sum_{l=0}^{t}n_l\Big)\Big).
\]
A.3.1 FIRST PHASE
We shall also prove the convergence dynamics via induction. The desired inequality at $\beta_0$ is obvious. Then assume we already have the dynamics at $t$, namely
\[
\|\beta_t-\beta^\star\|_2\ \le\ \Big(1-\frac{1}{100}\,\frac{C_l}{C_u}\Big)^{t}D_0,
\]
and we are going to prove the corresponding bound at $t+1$. For convenience, denote
\[
D_t:=\Big(1-\frac{1}{100}\,\frac{C_l}{C_u}\Big)^{t}D_0,\qquad D_{t+1}:=\Big(1-\frac{1}{100}\,\frac{C_l}{C_u}\Big)D_t.
\]
The definition of the sub-gradient and the event $\mathcal E_t$ together imply that the intermediate term of (15) can be lower bounded by
\[
\langle\beta_t-\beta^\star,g_t\rangle\ \ge\ f_t(\beta_t)-f_t(\beta^\star)\ \ge\ \mathbb E\big[f_t(\beta_t)-f_t(\beta^\star)\,\big|\,\beta_t\big]-C\max\{\tau,1-\tau\}\sqrt{\frac{C_u d}{\sum_{l=0}^{t}n_l}}\,\|\beta_t-\beta^\star\|_2\ \ge\ \frac{1}{4}\sqrt{C_l}\,\|\beta_t-\beta^\star\|_2,
\]
where the last line uses expectation calculations, which can be found in the batch learning setting, together with $n_0\ge C\max\{\tau^2,(1-\tau)^2\}(C_u/C_l)d$. Moreover, Lemma 16 proves that under $\mathcal E_t$, $\|g_t\|_2\le C\sqrt{C_u}$. Then (15) can be upper bounded by
\[
\|\beta_{t+1}-\beta^\star\|_2^2\ \le\ \|\beta_t-\beta^\star\|_2^2-\frac{1}{2}\eta_t\sqrt{C_l}\,\|\beta_t-\beta^\star\|_2+\eta_t^2\,C_u.
\]
Then, by the induction hypothesis $\|\beta_t-\beta^\star\|_2\le D_t$ and the stepsize definition $\eta_t\,C_u/(\sqrt{C_l}\,D_t)\in[1/8,\,5/24]$, and by regarding the right-hand side as a quadratic function of $\|\beta_t-\beta^\star\|_2$, we have
\[
\|\beta_{t+1}-\beta^\star\|_2^2\ \le\ D_t^2-\frac{1}{2}\eta_t\sqrt{C_l}\,D_t+\eta_t^2\,C_u\ \le\ \Big(1-\frac{1}{100}\,\frac{C_l}{C_u}\Big)^2 D_t^2,
\]
which yields $\|\beta_{t+1}-\beta^\star\|_2\le\big(1-\frac{1}{100}\frac{C_l}{C_u}\big)D_t=D_{t+1}$.
A.3.2 SECOND PHASE
In the second phase, the expectation of the loss function can be lower bounded by
\[
\mathbb E\big[f_t(\beta_t)-f_t(\beta^\star)\,\big|\,\beta_t\big]\ \ge\ \frac{C_l}{12b_0}\,\|\beta_t-\beta^\star\|_2^2.
\]
Furthermore, the definition of the sub-gradient and the event $\mathcal E_t$ imply that
\[
f_t(\beta_t)-f_t(\beta^\star)\ \ge\ \frac{C_l}{12b_0}\,\|\beta_t-\beta^\star\|_2^2-C_1\max\{\tau,1-\tau\}\sqrt{\frac{C_u d}{\sum_{l=0}^{t}n_l}}\,\|\beta_t-\beta^\star\|_2\ \ge\ \frac{C_l}{24b_0}\,\|\beta_t-\beta^\star\|_2^2-C_1 b_0\max\{\tau^2,(1-\tau)^2\}\,\frac{C_u}{C_l}\,\frac{d}{\sum_{l=0}^{t}n_l},
\]
where the Cauchy--Schwarz inequality is used in the last line in the form
\[
\frac{C_l}{24b_0}\,\|\beta_t-\beta^\star\|_2^2+C b_0\max\{\tau^2,(1-\tau)^2\}\,\frac{C_u}{C_l}\,\frac{d}{\sum_{l=0}^{t}n_l}\ \ge\ C_1\max\{\tau,1-\tau\}\sqrt{\frac{C_u d}{\sum_{l=0}^{t}n_l}}\,\|\beta_t-\beta^\star\|_2.
\]
Lemma 16 shows that
\[
\|g_t\|_2^2\ \le\ \frac{4C_u^2}{b_1^2}\,\|\beta_t-\beta^\star\|_2^2+C_2\max\{\tau^2,(1-\tau)^2\}\,C_u\,\frac{d}{\sum_{l=0}^{t}n_l}.
\]
Then equation (15) can be upper bounded by
\[
\|\beta_{t+1}-\beta^\star\|_2^2\ \le\ \|\beta_t-\beta^\star\|_2^2-\frac{\eta_t C_l}{12b_0}\,\|\beta_t-\beta^\star\|_2^2+\eta_t^2\,\frac{4C_u^2}{b_1^2}\,\|\beta_t-\beta^\star\|_2^2+2\eta_t C_1 b_0\max\{\tau^2,(1-\tau)^2\}\,\frac{C_u}{C_l}\,\frac{d}{\sum_{l=0}^{t}n_l}+C_2\eta_t^2\max\{\tau^2,(1-\tau)^2\}\,C_u\,\frac{d}{\sum_{l=0}^{t}n_l}.
\]
Plugging the stepsize $\eta_t\in[c_1,c_2]\,\frac{C_l}{C_u^2}\,\frac{b_1^2}{b_0}$ into the above equation, we have
\[
\|\beta_{t+1}-\beta^\star\|_2^2\ \le\ \Big(1-c\,\frac{b_1^2}{b_0^2}\,\frac{C_l^2}{C_u^2}\Big)\,\|\beta_t-\beta^\star\|_2^2+C\max\{\tau^2,(1-\tau)^2\}\,\frac{d\,b_1^2}{C_u\sum_{l=0}^{t}n_l},
\]
where $c,C$ are some universal constants. In all, we have
\[
\|\beta_{t+1}-\beta^\star\|_2^2\ \le\ \Big(1-c\,\frac{b_1^2}{b_0^2}\,\frac{C_l^2}{C_u^2}\Big)^{t+1-t_1}\|\beta_{t_1}-\beta^\star\|_2^2+C\sum_{s=t_1}^{t}\Big(1-c\,\frac{b_1^2}{b_0^2}\,\frac{C_l^2}{C_u^2}\Big)^{t-s}\frac{d\,b_1^2}{C_u\sum_{l=0}^{s}n_l}.
\]
If the number of arriving samples remains $n$, namely $n_l=n$ for all $l\ge 1$, then, equipped with Lemma 19, the estimation error rate is upper bounded by
\[
\|\beta_{t+1}-\beta^\star\|_2^2\ \le\ \Big(1-2c\,\frac{b_1^2}{b_0^2}\,\frac{C_l^2}{C_u^2}\Big)^{t+1-t_1}\|\beta_{t_1}-\beta^\star\|_2^2+C\,\frac{d\,b_1^2}{C_u\,(n_0+tn)}\cdot\frac{1}{\big|\log\big(1-c\,\frac{b_1^2}{b_0^2}\,\frac{C_l^2}{C_u^2}\big)\big|},
\]
where $C$ is some constant irrelevant to $t$. It is worth noting that $|\log(1-x)|\ge x$ holds for $0<x<1$. Thus the last term of the above equation can be further simplified, giving
\[
\|\beta_{t+1}-\beta^\star\|_2^2\ \le\ \Big(1-2c\,\frac{b_1^2}{b_0^2}\,\frac{C_l^2}{C_u^2}\Big)^{t+1-t_1}\|\beta_{t_1}-\beta^\star\|_2^2+C_1\max\{\tau^2,(1-\tau)^2\}\,\frac{C_u\,b_0^2}{C_l^2}\,\frac{d}{n_0+tn},
\]
with some constant $C_1$.
Appendix B. Technical Lemmas
The following lemma is derivable through integral calculations and can be found in the appendix of Shen et al. (2023).
Lemma 12 Suppose the predictors $\{X_i\}_{i=1}^{n}$ satisfy Assumption 1. Then for any fixed vector $\beta\in\mathbb R^d$, we have
\[
\sqrt{\frac{C_l}{2\pi}}\,\|\beta\|_2\ \le\ \mathbb E\,\rho_{Q,\tau}\big(\langle\beta,X_i\rangle\big)\ \le\ \sqrt{\frac{2C_u}{\pi}}\,\|\beta\|_2.
\]
Lemma 13 Suppose the predictors and noise term satisfy Assumption 1 and Assumption 2(a), respectively, and the loss function is based on one observation, $f(\beta)=\rho_{Q,\tau}(Y-X^\top\beta)$.
Then for any fixed $\beta$, we have
\[
\mathbb E\,\{f(\beta)-f(\beta^\star)\}\ \ge\ \sqrt{\frac{C_l}{2\pi}}\,\|\beta-\beta^\star\|_2-\gamma,
\]
and
\[
\mathbb E\,\{f(\beta)-f(\beta^\star)\}\ \ge\ \frac{C_l}{12b_0}\,\|\beta-\beta^\star\|_2^2\qquad\text{for all }\beta\text{ with }\sigma:=\big((\beta-\beta^\star)^\top\Sigma(\beta-\beta^\star)\big)^{1/2}\le\gamma.
\]
Proof Regarding the expectation, we consider the expectation conditional on $X$ first and then the expectation with respect to $\xi$:
\[
\mathbb E\,\{f(\beta)-f(\beta^\star)\}=\mathbb E\,\big\{\mathbb E\big[\rho_{Q,\tau}\big(\xi+\langle X,\beta^\star-\beta\rangle\big)-\rho_{Q,\tau}(\xi)\,\big|\,X\big]\big\}.
\]
Specifically, it holds that
\[
\mathbb E\big[\rho_{Q,\tau}\big(\xi+\langle X,\beta^\star-\beta\rangle\big)-\rho_{Q,\tau}(\xi)\,\big|\,X\big]=\int_0^{\langle X,\beta-\beta^\star\rangle}\big(H_\xi(t)-H_\xi(0)\big)\,dt.
\]
Denote $\sigma:=\big((\beta-\beta^\star)^\top\Sigma(\beta-\beta^\star)\big)^{1/2}$ and denote $g(\cdot)$ the density of $\langle X,\beta-\beta^\star\rangle$. First, consider the case $\sigma\le\gamma$:
\[
\mathbb E\,\{f(\beta)-f(\beta^\star)\}=\int_{0}^{+\infty}g(z)\int_0^{z}\big(H_\xi(t)-H_\xi(0)\big)\,dt\,dz+\int_{-\infty}^{0}g(z)\int_{z}^{0}\big(H_\xi(0)-H_\xi(t)\big)\,dt\,dz\ \ge\ \frac{C_l}{12b_0}\,\|\beta-\beta^\star\|_2^2,
\]
where the last inequality uses Assumption 2(a) together with $\sigma^2\ge C_l\,\|\beta-\beta^\star\|_2^2$. Then consider another way to find the lower bound; following the triangle inequality $\rho_{Q,\tau}(x_1+x_2)\le\rho_{Q,\tau}(x_1)+\rho_{Q,\tau}(x_2)$ and the expectation calculations in Lemma 12,
\[
\mathbb E\{f(\beta)-f(\beta^\star)\}\ \ge\ \mathbb E\,\big\{\rho_{Q,\tau}\big(\langle X,\beta^\star-\beta\rangle\big)\big\}-\mathbb E\,\rho_{Q,\tau}(\xi)-\mathbb E\,\rho_{Q,\tau}(-\xi)\ \ge\ \sqrt{\frac{C_l}{2\pi}}\,\|\beta-\beta^\star\|_2-\gamma,
\]
which completes the proof.
Proposition 14 (Proposition 3 in Shen et al. (2023)) Suppose the loss function is given by
\[
f(\beta)=\frac{1}{n}\sum_{i=1}^{n}\rho_{Q,\tau}\big(Y_i-\langle X_i,\beta\rangle\big),
\]
where $\{X_i\}_{i=1}^{n}$ follow Assumption 1 and $\{(X_i,Y_i)\}_{i=1}^{n}$ are independent observations. Then there exist constants $C_1,C_2$ such that, with probability exceeding $1-\exp(-C_1 d)-\exp\big(-\sqrt{n}/\log n\big)$, the following holds for all $\beta_1,\beta_2\in\mathbb R^d$:
\[
|f(\beta_1)-f(\beta_2)-\mathbb E\,(f(\beta_1)-f(\beta_2))|\ \le\ C_2\max\{\tau,1-\tau\}\sqrt{\frac{C_u d}{n}}\,\|\beta_1-\beta_2\|_2.
\]
Proposition 15 (Concentration with Independence) Suppose the setting is the same as in Proposition 14. Then for any fixed $\beta_1,\beta_2$, with probability exceeding $1-2\exp\big(-C\,\frac{ns^2}{C_u\|\beta_1-\beta_2\|_2^2}\big)$, where $C$ is some constant, it holds that
\[
|f(\beta_1)-f(\beta_2)-\mathbb E\,[f(\beta_1)-f(\beta_2)]|\ \le\ s.
\]
Similarly, with probability exceeding $1-2\exp\big(-c_1\,\frac{ns^2}{C_u\max\{\tau^2,(1-\tau)^2\}\|\beta_1-\beta^\star\|_2^2}\big)$,
\[
\Big|\frac{1}{n}\sum_{i=1}^{n}\langle\beta_1-\beta^\star,X_i\rangle\,g_i-\mathbb E\,\langle\beta_1-\beta^\star,X_i\rangle\,g_i\Big|\ \le\ s,
\]
where $g_i:=-\tau\,\mathbb 1\big(\xi_i>\langle X_i,\beta_1-\beta^\star\rangle\big)+(1-\tau)\,\mathbb 1\big(\xi_i<\langle X_i,\beta_1-\beta^\star\rangle\big)+\delta\,\mathbb 1\big(\xi_i=\langle X_i,\beta_1-\beta^\star\rangle\big)$ is the sub-gradient with any $\delta\in[-\tau,1-\tau]$.
Proof Note that
\[
\big\|\,|Y_i-\langle X_i,\beta_1\rangle|-|Y_i-\langle X_i,\beta_2\rangle|\,\big\|_{\psi_2}\ \le\ \big\|\langle X_i,\beta_1-\beta_2\rangle\big\|_{\psi_2}\ \le\ C\sqrt{C_u}\,\|\beta_1-\beta_2\|_2.
\]
Then, by the sum of independent sub-Gaussians, we obtain the desired inequality. The second consequence can be obtained in a similar way:
\[
\big\|\langle\beta_1-\beta^\star,X_i\rangle\,g_i\big\|_{\psi_2}\ \le\ \max\{\tau,1-\tau\}\,\big\|\langle\beta_1-\beta^\star,X_i\rangle\big\|_{\psi_2}\ \le\ C\max\{\tau,1-\tau\}\sqrt{C_u}\,\|\beta_1-\beta^\star\|_2.
\]
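The conditional-expectation identity at the heart of Lemma 13's proof, $\mathbb E\,[\rho_{Q,\tau}(\xi-v)-\rho_{Q,\tau}(\xi)]=\int_0^{v}(H_\xi(t)-H_\xi(0))\,dt$ (Knight's identity in expectation), can be verified numerically. A sketch assuming standard normal noise and $\tau=1/2$, so that $H_\xi(0)=\tau$ (the grid sizes and the value $v=0.7$ are arbitrary choices):

```python
import math
import numpy as np

def rho(u, tau):
    """Check loss rho_{Q,tau}(u) = u (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def Phi(t):
    """Standard normal CDF via erf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def trapezoid(y, x):
    """Plain trapezoidal rule, to avoid depending on a specific NumPy version."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

tau, v = 0.5, 0.7                      # tau = 1/2 makes 0 the tau-quantile of N(0,1)
z = np.linspace(-8.0, 8.0, 200001)     # fine grid for the noise distribution
pdf = np.exp(-z ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

# Left-hand side: E[rho(xi - v) - rho(xi)] for xi ~ N(0,1)
lhs = trapezoid((rho(z - v, tau) - rho(z, tau)) * pdf, z)

# Right-hand side: integral of H(t) - H(0) over [0, v], with H = Phi and H(0) = tau
t = np.linspace(0.0, v, 20001)
rhs = trapezoid(np.array([Phi(s) - tau for s in t]), t)
```

The two quadratures agree to within the grid resolution, which is a useful check when adapting the proof's calculations to other noise laws.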
The following lemma provides two types of upper bounds for $\|g\|_2$; in the convergence analyses we choose whichever is sharper.
Lemma 16 (Upper Bound of Sub-gradient) Suppose the predictors and noise term satisfy Assumption 1 and Assumption 2, respectively, and the loss function is based on $n$ independent observations, $f(\beta)=\frac{1}{n}\sum_{i=1}^{n}\rho_{Q,\tau}(Y_i-X_i^\top\beta)$ with $n\ge Cd$. Suppose the event
\[
\mathcal E:=\Big\{\sup_{\beta_1,\beta_2\in\mathbb R^d}\frac{|f(\beta_1)-f(\beta_2)-\mathbb E\,[f(\beta_1)-f(\beta_2)]|}{\|\beta_1-\beta_2\|_2}\ \le\ C_1\max\{\tau,1-\tau\}\sqrt{\frac{C_u d}{n}}\Big\}
\]
holds. Then for all $\beta$ and any sub-gradient $g\in\partial f(\beta)$, we have
\[
\|g\|_2\ \le\ \min\Big\{C\sqrt{C_u},\ \frac{4C_u}{b_1}\,\|\beta-\beta^\star\|_2+C_1\max\{\tau,1-\tau\}\sqrt{\frac{C_u d}{n}}\Big\}.
\]
Proof Under the event $\mathcal E$ and with the sub-gradient definition, for any $\Delta\beta\in\mathbb R^d$ we have
\[
\langle\Delta\beta,g\rangle\ \le\ f(\beta+\Delta\beta)-f(\beta)\ \le\ \mathbb E f(\beta+\Delta\beta)-\mathbb E f(\beta)+C_1\max\{\tau,1-\tau\}\sqrt{\frac{C_u d}{n}}\,\|\Delta\beta\|_2. \quad (18)
\]
We shall provide the two types of upper bounds respectively.
1. Note that the quantile loss $\rho_{Q,\tau}(\cdot)$ satisfies $\rho_{Q,\tau}(x_1+x_2)\le\rho_{Q,\tau}(x_1)+\rho_{Q,\tau}(x_2)$, and then for any fixed $\beta,\Delta\beta$ we have
\[
\mathbb E f(\beta+\Delta\beta)-\mathbb E f(\beta)=\frac{1}{n}\sum_{i=1}^{n}\mathbb E\,\big[\rho_{Q,\tau}\big(\xi_i-\langle X_i,\beta+\Delta\beta-\beta^\star\rangle\big)-\rho_{Q,\tau}\big(\xi_i-\langle X_i,\beta-\beta^\star\rangle\big)\big]\ \le\ \frac{1}{n}\sum_{i=1}^{n}\mathbb E\,\rho_{Q,\tau}\big(-\langle X_i,\Delta\beta\rangle\big).
\]
Also notice that $\langle X_i,\Delta\beta\rangle$ is a mean-zero Gaussian variable, and then Assumption 1 implies $\mathbb E\,\rho_{Q,\tau}(-\langle X_i,\Delta\beta\rangle)\le\sqrt{2C_u/\pi}\,\|\Delta\beta\|_2$. Thus, together with $n\ge C_3\max\{\tau^2,(1-\tau)^2\}d$, we obtain $\mathbb E f(\beta+\Delta\beta)-\mathbb E f(\beta)\le C\sqrt{C_u}\,\|\Delta\beta\|_2$. Then (18) becomes
\[
\langle\Delta\beta,g\rangle\ \le\ C\sqrt{C_u}\,\|\Delta\beta\|_2,
\]
which holds for all $\Delta\beta$. Inserting $\Delta\beta=g$ into the above inequality, we acquire $\|g\|_2\le C\sqrt{C_u}$.
2. According to the first two equations in the proof of Lemma 11 in Shen et al. (2023), it holds that
\[
\mathbb E f(\beta+\Delta\beta)-\mathbb E f(\beta)\ \le\ \frac{C_u}{2b_1}\,\|\Delta\beta\|_2^2+\frac{C_u}{b_1}\,\|\Delta\beta\|_2\,\|\beta-\beta^\star\|_2.
\]
Inserting it into (18), and taking $\Delta\beta=\frac{b_1}{4C_u}\,g$, we obtain
\[
\|g\|_2\ \le\ \frac{4C_u}{b_1}\,\|\beta-\beta^\star\|_2+C_1\max\{\tau,1-\tau\}\sqrt{\frac{C_u d}{n}}.
\]
Lemma 17 (Lemma 6 in Shen et al. (2023)) Suppose the predictors and noise term satisfy Assumption 1 and Assumption 2(b), respectively. Let $f(\beta)=(1/n)\sum_{i=1}^{n}\rho_{Q,\tau}(Y_i-X_i^\top\beta)$ for $n\ge 1$. Then, for any fixed $\beta$,
\[
\mathbb E\,[f(\beta)-f(\beta^\star)]\ \le\ \frac{C_u}{b_1}\,\|\beta-\beta^\star\|_2^2.
\]
Lemma 18 Under Assumptions 1 and 2(b), for any $\beta\in\mathbb R^d$ we have
\[
\mathbb E\,\langle\beta-\beta^\star,X\rangle\,\rho'_{Q,\tau}\big(\xi-\langle X,\beta-\beta^\star\rangle\big)\ \le\ \frac{C_u}{b_1}\,\|\beta-\beta^\star\|_2^2,
\]
where $\rho'_{Q,\tau}(u)=-\tau\,\mathbb 1(u>0)+(1-\tau)\,\mathbb 1(u<0)+\delta\,\mathbb 1(u=0)$ is the sub-gradient with $\delta\in[-\tau,1-\tau]$.
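The first step in the proof of Lemma 18 rests on the identity $\mathbb E\,\rho'_{Q,\tau}(\xi-z)=H_\xi(z)-H_\xi(0)$, which can be checked on a toy noise law. A sketch assuming $\xi\sim\mathrm{Uniform}(-\tau,1-\tau)$ (a hypothetical choice: its $\tau$-quantile is $0$ and its density is $1$, so $b_1=1$ and $H_\xi(z)-H_\xi(0)=z$ on the support):

```python
import numpy as np

def subgrad_scalar(u, tau):
    """rho'_{Q,tau}(u) = -tau 1{u > 0} + (1 - tau) 1{u < 0}."""
    return -tau * (u > 0) + (1.0 - tau) * (u < 0)

tau, z = 0.3, 0.25
# Deterministic midpoint grid standing in for xi ~ Uniform(-tau, 1 - tau)
N = 100000
xi = -tau + (np.arange(N) + 0.5) / N
mean_h = subgrad_scalar(xi - z, tau).mean()
# For this noise law, E[rho'(xi - z)] should equal H(z) - H(0) = z
```

Here `mean_h` recovers $z$ exactly up to grid resolution, and multiplying by $z$ and averaging over a Gaussian $z$ reproduces the $\mathbb E\,z^2\le C_u\|\beta-\beta^\star\|_2^2/b_1$ bound of the lemma.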
Proof First consider the conditional expectation:
\[
\mathbb E\big[\langle\beta-\beta^\star,X\rangle\,\rho'_{Q,\tau}\big(\xi-\langle X,\beta-\beta^\star\rangle\big)\,\big|\,X\big]=\langle\beta-\beta^\star,X\rangle\,\mathbb E\big[\rho'_{Q,\tau}\big(\xi-\langle X,\beta-\beta^\star\rangle\big)\,\big|\,X\big].
\]
Then the following identity can be obtained via direct calculation:
\[
\mathbb E\big[\rho'_{Q,\tau}\big(\xi-\langle X,\beta-\beta^\star\rangle\big)\,\big|\,X\big]=\int_0^{\langle X,\beta-\beta^\star\rangle}h_\xi(x)\,dx.
\]
Denote $z:=\langle X,\beta-\beta^\star\rangle$, which follows a mean-zero Gaussian distribution with variance $(\beta-\beta^\star)^\top\Sigma(\beta-\beta^\star)\le C_u\|\beta-\beta^\star\|_2^2$. Let $f_z(\cdot)$ be its density. Thus we obtain
\[
\mathbb E\,\langle\beta-\beta^\star,X\rangle\,\rho'_{Q,\tau}\big(\xi-\langle X,\beta-\beta^\star\rangle\big)=\int_{-\infty}^{+\infty}y\,f_z(y)\int_0^{y}h_\xi(x)\,dx\,dy\ \le\ \frac{1}{b_1}\,\mathbb E\,z^2\ \le\ \frac{C_u}{b_1}\,\|\beta-\beta^\star\|_2^2,
\]
where the last inequality uses the upper bound on the density characterized in Assumption 2.
Lemma 19 Suppose $0<a<1$ is some fixed constant. Define the sequence $a_n:=\sum_{k=1}^{n}\frac{a^{n-k}}{k+m}$, where $m>0$ is some integer. Then the following holds for each $n$:
\[
(m+n)\,a_n\ \le\ \frac{9}{1-a}\cdot\frac{1}{|\log a|}.
\]
Proof First consider the function $h_a(x):=x\,a^x$ with the domain $x>0$. Its first-order derivative is
\[
h_a'(x)=a^x+x\,a^x\log a=\big(1-|\log a|\,x\big)a^x.
\]
It follows that $h_a(x)$ achieves its maximum at $x=1/|\log a|$ and is monotonically decreasing on $(1/|\log a|,+\infty)$, with
\[
\max_{x>0}h_a(x)=\frac{1}{|\log a|}\exp\Big(\frac{\log a}{|\log a|}\Big)\ \le\ \frac{3}{|\log a|}.
\]
In this way we have
\[
n\,a^{[n/2]}\ \le\ \frac{6}{|\log a|},\qquad\text{for all }n\in\mathbb Z^+. \quad (19)
\]
Then consider the sequence $(m+n)a_n$:
\[
(m+n)\,a_n=\frac{m+n}{m+1}\,a^{n-1}+\frac{m+n}{m+2}\,a^{n-2}+\cdots+\frac{m+n}{m+n}=\underbrace{\sum_{k=[n/2]+1}^{n}\frac{m+n}{m+k}\,a^{n-k}}_{A_1}+\underbrace{\sum_{k=1}^{[n/2]}\frac{m+n}{m+k}\,a^{n-k}}_{A_2}.
\]
As for the term $A_1$, when $k\ge[n/2]$ it holds that $\frac{m+n}{m+k}\le 2$, and then we obtain
\[
A_1\ \le\ 2\big(1+a+\cdots+a^{[n/2]}\big)\ \le\ \frac{2}{1-a}.
\]
And $A_2$ can be bounded by
\[
A_2\ \le\ (1+n)\big(a^{n-[n/2]}+\cdots+a^{n-1}\big)\ \le\ \frac{(1+n)\,a^{[n/2]}}{1-a}\ \le\ \frac{7}{1-a}\cdot\frac{1}{|\log a|},
\]
where equation (19) is used. Thus, altogether, we have
\[
(m+n)\,a_n\ \le\ \frac{9}{1-a}\cdot\frac{1}{|\log a|}.
\]
The following lemma provides the concentration of Gaussian vectors, an immediate result of the Bernstein inequality (Theorem 2.8.1 of Vershynin (2018)).
Lemma 20 Suppose the random vector $X\in\mathbb R^d$ satisfies Assumption 1. Then with probability exceeding $1-\exp(-cd)$, its $\ell_2$-norm is bounded by $\|X\|_2^2\le 2C_u d$.
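Lemma 19's bound is easy to probe numerically; a minimal sketch over a few hypothetical $(a, m, n)$ triples (chosen for illustration only):

```python
import math

def a_seq(a, m, n):
    """a_n = sum_{k=1}^n a^(n-k) / (k + m), the sequence from Lemma 19."""
    return sum(a ** (n - k) / (k + m) for k in range(1, n + 1))

def lemma19_bound(a):
    """The claimed upper bound on (m + n) a_n: 9 / ((1 - a) |log a|)."""
    return 9.0 / ((1.0 - a) * abs(math.log(a)))
```

In practice $(m+n)a_n$ stabilizes near $1/(1-a)$ for large $n$, well below the stated bound; the extra $1/|\log a|$ factor matters when $a$ is close to $1$, which is exactly the regime in which the lemma is applied in the second-phase analysis.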