# stochastic_online_optimization_using_kalman_recursion__1bc660ea.pdf Journal of Machine Learning Research 22 (2021) 1-55 Submitted 6/20; Revised 8/21; Published 9/21 Stochastic Online Optimization using Kalman Recursion Joseph de Vilmarest joseph.de vilmarest@sorbonne-universite.fr Laboratoire de Probabilit es, Statistique et Mod elisation, Sorbonne Universit e, CNRS 4 place Jussieu, 75005 Paris, France and Electricit e de France R&D Olivier Wintenberger olivier.wintenberger@sorbonne-universite.fr Laboratoire de Probabilit es, Statistique et Mod elisation, Sorbonne Universit e, CNRS 4 place Jussieu, 75005 Paris, France Editor: Tong Zhang We study the Extended Kalman Filter in constant dynamics, offering a bayesian perspective of stochastic optimization. For generalized linear models, we obtain high probability bounds on the cumulative excess risk in an unconstrained setting, under the assumption that the algorithm reaches a local phase. In order to avoid any projection step we propose a twophase analysis. First, for linear and logistic regressions, we prove that the algorithm enters a local phase where the estimate stays in a small region around the optimum. We provide explicit bounds with high probability on this convergence time, slightly modifying the Extended Kalman Filter in the logistic setting. Second, for generalized linear regressions, we provide a martingale analysis of the excess risk in the local phase, improving existing ones in bounded stochastic optimization. The algorithm appears as a parameter-free online procedure that optimally solves some unconstrained optimization problems. Keywords: extended kalman filter, online learning, stochastic optimization 1. Introduction The optimization of convex functions is a long-standing problem with many applications. In supervised machine learning it frequently arises in the form of the prediction of an observation yt R given explanatory variables Xt Rd. The aim is to minimize a cost depending on the prediction and the observation. We focus in this article on linear predictors, hence the loss function is of the form ℓ(yt, θ Xt). Two important settings have emerged in order to analyse learning algorithms. In the online setting (Xt, yt) may be set by an adversary. The assumption required is boundedness and the goal is to upper estimate the regret (cumulative excess loss compared to the optimum). In the stochastic setting with independent identically distributed (i.i.d.) (Xt, yt), we define the risk L(θ) = E[ℓ(y, θ X)]. We focus on the cumulative excess risk Pn t=1 L(ˆθt) L(θ ) where θ minimizes the risk. We obtain bounds holding with high probability simultaneously for any horizon, that is, we control the whole trajectory of the risk. Furthermore, our bounds on the cumulative risk all lead to similar ones on the excess risk at any step for the averaged version of the algorithm. Due to its low computational cost the Stochastic Gradient Descent (SGD) of Robbins and Monro (1951) has been widely used, along with its equivalent in the online setting, c 2021 Joseph de Vilmarest and Olivier Wintenberger. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-618.html. de Vilmarest and Wintenberger the online gradient descent (Zinkevich, 2003) and a simple variant where the iterates are averaged (Ruppert, 1988; Polyak and Juditsky, 1992). More recently Bach and Moulines (2013) provided a sharp bound in expectation on the excess risk for a two-step procedure that has been extended to the average of SGD with a constant step size (Bach, 2014). Second-order methods based on stochastic versions of Newton-Raphson algorithm have been developed in order to converge faster in iterations, although with a bigger computational cost per iteration (Hazan et al., 2007). In order to obtain a parameter-free second-order algorithm we apply a bayesian perspective, seeing the loss as a negative log-likelihood and approximating the maximum-likelihood estimator at each step. We get a state-space model interpretation of the optimization problem: in a well-specified setting the space equation is yt pθt( | Xt) exp( ℓ( , θ t Xt)) with θt Rd and the state equation defines the dynamics of the state θt. The stochastic convex optimization setting corresponds to a degenerate constant state-space model θt = θt 1 called static. As usual in State-Space models, the optimization is realized with the Kalman recursion (Kalman and Bucy, 1961) for the quadratic loss and the Extended Kalman Filter (EKF) (Fahrmeir, 1992) in a more general case. A correspondence has recently been made by Ollivier (2018) between the static EKF and the online natural gradient (Amari, 1998). This motivates a risk analysis in order to enrich the link between Kalman filtering and the optimization community. We may see the static EKF as the online approximation of bayesian model averaging, and similarly to its analysis derived by Kakade and Ng (2005) our analysis is robust to misspecification, that is we don t assume the data to be generated by the probabilistic model. The static EKF is very close to the Online Newton Step (ONS) introduced by Hazan et al. (2007) as both are second-order online algorithms and our results are of the same flavor as those obtained on the ONS (Mahdavi et al., 2015). However the ONS requires the knowledge of the region in which the optimization is realized. It is involved in the choice of the gradient step size and a projection step is done to ensure that the search stays in the chosen region. On the other hand the static EKF yields two advantages at the cost of being less generic. First, there is no costly projection step and each recursive update runs in O(d2) operations, where d is the dimension of the features Xt. Therefore, our comparison of the static EKF with the ONS provides a lead to the open question of Koren (2013). Indeed, the problem of the ONS pointed out by Koren (2013) is to control the cost of the projection step and the question is whether it is possible to perform better than the ONS in the stochastic exp-concave setting. We don t answer the open question in the general setting. However, we suggest a general way to get rid of the projection by dividing the analysis between a convergence proof of the algorithm to the optimum and a second phase where the estimate stays in a small region around the optimum where no projection is required. Second, the algorithm is (nearly) parameter-free. We believe that bayesian statistics is the reasonable approach in order to obtain parameter-free online algorithms in the unconstrained setting. Parameter-free is not exactly correct as there are initialization parameters, which we see as a smoothed version of the hard constraint imposed by bounded algorithm, but they have no impact on the leading terms of our bounds. Static Kalman filter coincides with the ridge forecaster and similarly the static EKF may be seen as the online approximation of a regularized empirical risk minimizer. Stochastic Online Optimization using Kalman Recursion 1.1 Related Work Theoretical guarantees for online and stochastic algorithms are multi-criteria and of various natures. The comparison of upper-bounds or computational complexity highly depends on the values of d the dimension of the explanatory vectors and n the time horizon, leading to different views on whether the dependence to d or n is the most important. The nature of the guarantee obviously depends on the objective pursued. In the advesarial setting, the learner suffers a loss ℓt(ˆθt) depending on its estimate ˆθt at each time step t. It is natural to minimize the cumulative loss, or equivalently the regret t=1 ℓt(ˆθt) t=1 ℓt(θ ) , where θ reaches the minimum value of the cumulative loss and thus highly depends on (ℓt)1 t n. Under an assumption of bounded gradients, Zinkevich (2003) proved that a firstorder online gradient descent yields a regret bound in O( n). The Online Newton Step (ONS) is a second-order online gradient descent that has been designed to obtain a regret bound in O(ln n) (Hazan et al., 2007) under the assumption that the losses are exp-concave. The improved guarantee comes at a cost of O(d2) operations per step instead of O(d), along with a projection at each step whose cost depends on the data. In the stochastic setting where the losses (ℓt) are assumed i.i.d., the aim is to minimize the risk L(θ) = E[ℓ(θ)]. A natural candidate is the Empirical Risk Minimizer (ERM), whose asymptotics are well understood (see for example Murata and Amari (1999)). Assuming the existence of θ minimizing the risk and that the Hessian matrix H = 2 θ2 L(θ ) is positive definite, the ERM ˆθn satisfies E[L(ˆθn)] L(θ ) = tr(G H 1) 2n + o(1/n) , G = E h θℓ(y, θ X) i . Although in the well-specified setting the identity tr(G H 1) = d holds, in the misspecified case there is no general estimate for tr(G H 1). Recently a non-asymptotic bound L(ˆθn) L(θ ) = O(tr(G H 1) ln δ 1/n) holding with probability 1 δ has been shown by Ostrovskii and Bach (2021) on the ERM. However the ERM is defined only implicitly and may have important computational cost, hence recursive algorithms based on gradient descent have been studied under different sets of assumptions to bound tr(G H 1). The most simple is Stochastic Gradient Descent (SGD), where each step is in the opposite direction of the gradient. This algorithm has been widely used with various step sizes. Sharp results have been obtained by Bach (2014) for a constant gradient step size C/ n with fixed horizon n. Under the assumption that gradients are bounded by R we have tr(G H 1) R2/µ where µ is the minimal eigenvalue of H . The fast rate E[L(θn)] L(θ ) = O(R2/(µn)) is obtained by Bach (2014) for the averaged estimate θn of SGD. In the same article the author also derives a bound with high probability but with a slower rate: it degrades into L(θn) L(θ ) = O(log δ 1/ n) with probability 1 δ. Finally, in the quadratic setting a fast rate L(θn) L(θ ) = O(1/(nδα)) is achieved with probability 1 δ for a defined α > 0 (Bach and Moulines, 2013). To obtain fast rates with high probability beyond the quadratic setting, it seems necessary to use second-order information as in the ONS (Mahdavi et al., 2015). Under the de Vilmarest and Wintenberger ERM Averaged SGD ONS This article Regret O(ln n) Excess risk in expectation O( 1 n) Excess risk w.h.p. O(ln δ 1 n ) O(ln δ 1 n ) Cumulative excess risk w.h.p. O(ln n + ln δ 1) O(ln n + ln δ 1 + S(δ)) Cost per iteration Implicit O(d) O(d2) + Tproj O(d2) Table 1: Summary of existing results along with the static EKF for which we prove the bound for the cumulative excess risk. We focus on the dependence on n, and δ for the bounds holding with probability 1 δ (w.h.p.). S(δ) is the cumulative excess risk of the convergence phase. assumption that the loss is α-exp-concave, tr(G H 1) d/α and for the averaged version of the ONS the rate L(θn) L(θ ) = O(d(ln n + ln δ 1)/(αn)) with probability 1 δ is obtained. From our perspective, the result is stronger than what is claimed by Mahdavi et al. (2015): the bound obtained is t=1 L(ˆθn) L(θ ) = O(ln n + ln δ 1) , (1) holding simultaneously for any n with probability 1 δ. Note that although averaging this bound with Jensen s inequality leads to a sub-optimal bound on the excess risk of the last averaged estimate, it is conversely not possible to obtain Equation (1) from L(ˆθn) L(θ ) = O(ln δ 1/n) , holding with probability 1 δ. 1.2 Contributions Our first contribution is a local analysis of the static EKF under assumptions defined in Section 2, and provided that consecutive steps stay in a small ball around the optimum θ . We derive local bounds on the cumulative risk with high probability from a martingale analysis. Our analysis of Section 3 is similar to the one of Mahdavi et al. (2015) and we slightly refine their constants as a by-product. We then show that the convergence property crucial in our analysis is reachable. To that end we focus on linear regression and logistic regression as these two well-known problems are challenging in the unconstrained setting. In linear regression, the gradient of the loss is not bounded globally. In logistic regression, the loss is strictly convex, but neither strongly convex nor exp-concave in the unconstrained setting. In Section 4, we develop a global bound in the logistic setting on a slight modification of the algorithm introduced by Bercu et al. (2020). We prove that this modified algorithm converges and stays into a local region around θ after a finite number of steps. Moreover we show that it coincides with the static EKF and thus our local analysis applies. In Section 5, we apply our local analysis to the quadratic setting. We rely on Hsu et al. (2012) to obtain the convergence of the algorithm after exhibiting the correspondence between Kalman filter in constant dynamics and the ridge forecaster, and we therefore obtain similarly a global bound. Stochastic Online Optimization using Kalman Recursion Finally, we demonstrate numerically the competitiveness of the static EKF for logistic regression in Section 6. 2. Definitions and Assumptions We consider loss functions that may be written as the negative log-likelihood of a generalized linear model (Mc Cullagh and Nelder, 1989). Formally, the loss is defined as ℓ(y, θ X) = log pθ(y | X) where θ Rd, (X, y) X Y for some X Rd and Y R and pθ is of the form pθ(y | X) = h(y) exp y θ X b(θ X) where a is a constant and h and b are one-dimensional functions on which a few assumptions are required (Assumption 3). This includes logistic and quadratic regressions, see Sections 4 and 5. We display the static EKF in Algorithm 1 in this setting (see Appendix A for a derivation relying on Durbin and Koopman, 2012). In the quadratic setting, noting that the EKF estimate ˆθt does not depend on the (unknown) variance σ2, we consider the quadratic loss ℓ(y, ˆy) = (y ˆy)2/2 by convention. Algorithm 1: Static Extended Kalman Filter for Generalized Linear Model 1. Initialization: P1 is any positive definite matrix, ˆθ1 is any initial parameter in Rd. 2. Iteration: at each time step t = 1, 2, . . . (a) Update Pt+1 = Pt Pt Xt X t Pt 1+X t Pt Xtαt αt with αt = b (ˆθ t Xt) a . (b) Update ˆθt+1 = ˆθt + Pt+1 (yt b (ˆθ t Xt))Xt a . Due to matrix-vector and vector-vector multiplications, Algorithm 1 has a running-time complexity of O(d2) at each iteration and thus O(nd2) for n iterations. Note that although we need the loss function to be derived from a likelihood of the form (2), we do not need the data to be generated under this process. We need two standard hypotheses on the data. The first one is the i.i.d. assumption and bounded random design (all along the paper . is the Euclidean norm): Assumption 1 The observations (Xt, yt)t are i.i.d. copies of the pair (X, y) X Y and there exists DX such that Xt DX almost surely (a.s.). Working under Assumption 1, we define the risk function L(θ) = E ℓ(y, θ X) . Note that in Section 3 we don t need E[XX ] invertible, but we will make such an assumption in Sections 4 and 5 to prove the convergence of the algorithm in the logistic and quadratic settings, respectively. In order to work on a well-defined optimization problem we assume there exists a minimum: Assumption 2 There exists θ Rd such that L(θ ) = inf θ Rd L(θ). de Vilmarest and Wintenberger We treat two different settings requiring different assumptions, summarized in Assumption 3 and 4 respectively. First, motivated by logistic regression we define: Assumption 3 There exists (κε)ε>0, (hε)ε>0 and ρε ε 0 1 such that for any ε > 0 and any θ, θ0 Rd satisfying max( θ θ , θ0 θ ) ε, we have ℓ (y, θ X)2 κεℓ (y, θ X) a.s. ℓ (y, θ X) hε a.s. ℓ (y, θ X) ρεℓ (y, θ 0 X) a.s. Here ℓ and ℓ are the first and second derivatives of ℓwith respect to the second variable. Assumption 3 requires local exp-concavity (around θ ) along with some regularity on ℓ (ℓ continuous and ℓ (y, θ X) γ > 0 a.s. is sufficient). That setting implies Y bounded, because ℓ depends on y whereas ℓ doesn t. In logistic regression, Y = { 1, +1} and Assumption 3 is satisfied for κε = e DX( θ +ε), hε = 1 4, ρε = e εDX. Second, we consider the quadratic loss, corresponding to a gaussian model. In order to include the well-specified setting and to bound G = E[(y θ X)2XX ], we assume y sub-gaussian conditionally to X and not too far away from the model: Assumption 4 The distribution of (X, y) X Y satisfies There exists σ2 > 0 such that for any s R, E es(y E[y|X]) | X e σ2s2 There exists Dapp 0 such that |E[y | X] θ X| Dapp a.s. Both conditions of Assumption 4 hold with Y = R and Dapp = 0 for the well-specified gaussian linear model with random bounded design. The second condition of Assumption 4 is satisfied for Dapp > 0 in misspecified sub-gaussian linear model with a.s. bounded approximation error. 3. The Algorithm Around the Optimum In this section, we analyse the cumulative risk under a strong convergence assumption. Precisely we define τ(ε) = min{k N | t > k, ˆθt θ ε} , where (ˆθt)t are the estimates of the static EKF, and with the convention min = + . We assume a bound on τ(ε) holding with high probability: Assumption 5 For any δ, ε > 0, there exists T(ε, δ) N such that P τ(ε) T(ε, δ) 1 δ . Assumption 5 states that with high probability there exists a convergence time after which the algorithm stays trapped in a local region around the optimum. Sections 4 and 5 are devoted to define explicitly such a convergence time for a modified EKF in the logistic setting and for the true EKF in the quadratic setting. Stochastic Online Optimization using Kalman Recursion We present our result in the bounded and sub-gaussian settings. The results and their proofs are very similar, but two crucial steps are different. First, Assumption 3 yields a bound on the gradient holding almost surely. We relax the boundedness condition for the quadratic loss with a sub-gaussian hypothesis, requiring a specific analysis. Second, our analysis is based on a second-order expansion. The quadratic loss is equal to its secondorder Taylor expansion but we need Assumption 5 along with the third point of Assumption 3 otherwise. The following theorem is our result in the bounded setting. Theorem 1 Starting the static EKF from any ˆθ1 Rd, P1 0, if Assumptions 1, 2, 3, 5 are satisfied and if ρε > 0.95, for any δ > 0, it holds for any n 1 simultaneously t=T(ε,δ)+1 L(ˆθt) L(θ ) 5 2dκε ln 1 + nhελmax(P1)D2 X d + 5λmax P 1 T(ε,δ)+1 ε2 + 30 2κε + hεε2D2 X ln δ 1 , with probability at least 1 3δ. The constant 0.95 may be chosen arbitrarily close to 0.5 with growing constants in the bound on the cumulative risk. There is a hidden trade-offin ε: on the one hand, the smaller ε the better our upper-bound, but on the other hand T(ε, δ) increases when ε decreases, and thus our bound applies after a bigger convergence time. For the quadratic loss, we obtain the following result under the sub-gaussian hypothesis. Theorem 2 In the quadratic setting, starting the static EKF from any ˆθ1 Rd, P1 0, if Assumptions 1, 2, 4 and 5 are satisfied, for any δ > 0 and any ε > 0, it holds for any n 1 simultaneously t=T(ε,δ)+1 L(ˆθt) L(θ ) 15 2 d 8σ2 + D2 app + ε2D2 X ln 1 + nλmax(P1)D2 X d + 5λmax P 1 T(ε,δ)+1 ε2 + 115 σ2 4 + λmax(P1)D2 X 4 + D2 app + 2ε2D2 X with probability at least 1 5δ. We observe a similar trade-offin ε as in Theorem 1. Up to numerical constants, the tight constant d(σ2+D2 app) (see for instance Hsu et al., 2012) is achieved by choosing ε arbitrarily small, at the cost of a loose control of the T(ε, δ) first steps. Both results follow from a regret analysis close to the one on the ONS (see Section 3.1), and on a control on the martingales stated below: Lemma 3 Let k 0 and ( Nt)t>k be any martingale difference adapted to the filtration (Ft)t k such that for any t > k, E[ N2 t | Ft 1] < . For any δ, λ > 0, we have the simultaneous property 2 (( Nt)2 + E[( Nt)2 | Ft 1]) ln δ 1 with probability at least 1 δ. de Vilmarest and Wintenberger Algorithm 2: Recursive updates of the ONS and the static EKF Online Newton Step Static Extended Kalman Filter P 1 t+1 = P 1 t + ℓ (yt, w t Xt)2Xt X t , P 1 t+1 = P 1 t + ℓ (yt, ˆθ t Xt)Xt X t , t = ℓ (yt, w t Xt)Xt , t = ℓ (yt, ˆθ t Xt)Xt , , ˆθt+1 = ˆθt Pt+1 t , K is the projection on K for the norm . P 1 t+1. This result proved in Appendix B.1 is a corollary of a martingale inequality from Bercu and Touati (2008) and a stopping time construction (Freedman, 1975). We detail the key ideas of the proofs in the rest of the Section, and we defer to Appendix B the proof of the intermediate results along with the detailed proof of Theorems 1 and 2. Specifically, we display in Section 3.1 the parallel with the ONS, where we compare with the existing result on the cumulative risk, and a similar analysis yields an adversarial bound on a second-order expansion of the loss. In Section 3.2 we compare the excess risk with its second-order expansion thanks to Assumption 5, and we use a martingale analysis to obtain a bound on the cumulative excess risk. 3.1 Comparison with Online Newton Step and Adversarial Analysis We display the parallel between the ONS and the static EKF in Algorithm 2 through their recursive updates. We observe that the square of the first derivative of ℓis replaced with the second derivative. Thus t P 1 t in the static EKF is an estimate of the Hessian H which is the optimal preconditioning matrix as shown in Corollary 3 of Murata and Amari (1999). Then the recursion on the parameter (wt and ˆθt) has two differences: there is a gradient step size 1/γ in the ONS absent in the static EKF, and after the gradient step the ONS applies a projection. Lemma 3 yields the following refinement on the bound of Mahdavi et al. (2015) obtained on the cumulative excess risk of the ONS: Corollary 4 Assume the search region K has diameter D and the gradients are bounded by R. Let (wt)t be the ONS estimates starting from w1 K, P1 = λI and using a step-size γ = 1 2 min( 1 4RD, α) with α the exp-concavity constant of ℓon K. Then for any δ > 0, it holds for any n 1 simultaneously t=1 L(wt) L(θ ) 3 2γ d ln 1 + n R2 with probability at least 1 2δ. For the sake of consistency, we display Corollary 4 as a bound on the cumulative excess risk, whereas Theorem 3 of Mahdavi et al. (2015) is a bound on the excess risk of the averaged Stochastic Online Optimization using Kalman Recursion ONS. The latter follows directly from an application of Jensen s inequality. The proof of Corollary 4 consists in replacing Theorem 4 of Mahdavi et al. (2015) with Lemma 3. We obtain similar constants in Theorem 1 and in Corollary 4, as κε is the inverse of the expconcavity constant α. The use of second-order methods with well-tuned preconditioning is crucial in order to replace the leading constant R2/µ obtained for first-order methods by d/α (µ is the minimum eigenvalue of the hessian H ). Our results on the static EKF are less general than the ones obtained on the ONS as a control of the convergence time τ(ε) T(ε, δ) is required with high probability. On the other hand the results obtained on the ONS require the knowledge of the exp-concavity constant α whereas the static EKF is parameter-free. That is why we argue that the static EKF provides an optimal way to tune the step size and the preconditioning matrix. Indeed, as ε is a parameter of the EKF analysis but not of the algorithm, we can improve the leading constant κε on a local region arbitrarily small around θ , at a cost of a loose control of the T(ε, δ) first steps. In the ONS the choice of a diameter D > θ makes the gradient step-size sub-optimal and impacts the leading constant. Once the parallel between the ONS and the static EKF has been displayed (Algorithm 2), it is natural to adopt an approach similar to the one in Hazan et al. (2007). The cornerstone of our local analysis is the derivation of an adversarial bound on the second-order Taylor expansion of ℓ, from the recursive update formulae. Lemma 5 For any sequence (Xt, yt)t, starting from any ˆθ1 Rd, P1 0, it holds for any θ Rd and n N that ℓ (yt, ˆθ t Xt)Xt (ˆθt θ ) 1 2(ˆθt θ ) ℓ (yt, ˆθ t Xt)Xt X t (ˆθt θ ) t=1 X t Pt+1Xtℓ (yt, ˆθ t Xt)2 + ˆθ1 θ 2 We cannot compare the excess loss with the second-order Taylor expansion in general, and it is natural to use a step size parameter. In Hazan et al. (2007), the regret analysis of the ONS is based on a very similar bound on ℓ (yt, w t Xt)Xt (wt θ ) γ 2(wt θ ) ℓ (yt, w t Xt)2Xt X t (wt θ ) , where γ is a step size parameter. Then the regret bound follows from the exp-concavity property, bounding the excess loss ℓ(yt, w t Xt) ℓ(yt, θ Xt) with the previous quantity for a specific γ. The dependence of γ on the exp-concavity constant and the bound on the gradients require that the algorithm stays in a bounded region around the optimum θ , and a projection on this region is used, potentially at each step. We follow a very different approach, to stay parameter-free, unconstrained and to avoid any additional cost in the leading constant. In the stochastic setting, we observe that we can upper-bound the excess risk with a second-order expansion, up to a multiplicative factor. 3.2 From Adversarial to Stochastic: the Cumulative Risk In order to compare the excess risk with a second-order expansion, we compare the firstorder term with the second-order one. de Vilmarest and Wintenberger Proposition 6 If Assumptions 1, 2 and 3 are satisfied, for any θ Rd, it holds θ (θ θ ) ρ θ θ (θ θ ) 2L This result leads immediately to the following proposition, using the first-order convexity property of L. Proposition 7 If Assumptions 1, 2 and 3 are satisfied, for any θ Rd, 0 < c < ρ θ θ , it holds L(θ) L(θ ) ρ θ θ ρ θ θ c θ (θ θ ) c(θ θ ) 2L Lemma 5 motivates the use of c > 1 2, thus we need at least ρ θ θ > 1 2. In the quadratic setting, it holds as an equality with ρ = 1 because the second derivative of the quadratic loss is constant. In the bounded setting we need to control the second derivative in a small range, and we can achieve that only locally, therefore we separate the condition ρ θ θ > 1 2 between the third point of Assumption 3 and Assumption 5. Then we are left to obtain a bound on the cumulative risk from Lemma 5. In order to compare the derivatives of the risk and the losses, we need to control the martingale difference adapted to the natural filtration (Ft) and defined as (ˆθt θ ), where t = ℓ (yt, ˆθ t Xt)Xt . We thus apply Lemma 3 to this martingale difference. Lemma 8 Starting the static EKF from any ˆθ1 Rd, P1 0, if Assumptions 1 and 2 are satisfied, for any k 0 and δ, λ > 0, it holds simultaneously Mt λ(ˆθt θ ) t t + 3 2E t t | Ft 1 (ˆθt θ ) ln δ 1 with probability at least 1 δ. The proof of Theorems 1 and 2 deferred to Appendix B builds on the above results. Summing Lemma 5 and 8, we obtain for any δ, λ > 0 the simultaneous bound ˆθt (ˆθt θ ) (ˆθt θ ) 1 2 (2) t + λ t t + 3 2λE t t | Ft 1 (ˆθt θ ) t=T(ε,δ)+1 X t Pt+1Xtℓ (yt, ˆθ t Xt)2 + ˆθ1 θ 2 λmin(PT(ε,δ)+1) + ln δ 1 with probability at least 1 δ, where we define (2) t = ℓ (yt, ˆθ t Xt)Xt X t for any t. In the last equation, we control (see Appendix B.4 and B.5) the quadratic term in ˆθt θ on the left hand-side in terms of (ˆθt θ ) 2L θ2 |ˆθt(ˆθt θ ) in order to lower-bound the left expression proportionally to the cumulative excess risk using Proposition 7 for well chosen λ. Stochastic Online Optimization using Kalman Recursion Algorithm 3: Truncated Extended Kalman Filter for Logistic Regression 1. Initialization: P1 is any positive definite matrix, ˆθ1 is any initial parameter in Rd. 2. Iteration: at each time step t = 1, 2, . . . (a) Update Pt+1 = Pt Pt Xt X t Pt 1+X t Pt Xtαt αt, with αt = max 1 tβ , 1 (1+e ˆθ t Xt)(1+e ˆθ t Xt) (b) Update ˆθt+1 = ˆθt + Pt+1 yt Xt 1+eyt ˆθ t Xt . 4. Logistic Setting Logistic regression is a widely used statistical model in classification. The prediction of a binary random variable y Y = { 1, 1} consists in modelling L(y | X) with pθ(y | X) = 1 1 + e yθ X = exp yθ X (2 ln(1 + eθ X) θ X) In the GLM notations, it yields a = 2 and b(θ X) = 2 ln(1 + eθ X) θ X. 4.1 Results for the Truncated Algorithm In order to prove the convergence of the algorithm needed in the local phase, we follow a trick introduced by Bercu et al. (2020) consisting in changing slightly the update on Pt. Indeed, when the authors tried to prove the asymptotic convergence of the static EKF (which they named stochastic Newton step) using Robbins-Siegmund Theorem, they needed the convergence of P t λmax(Pt)2. This seems very likely to hold as we have intuitively Pt 1/t. However, in order to obtain λmax(Pt) = O(1/t), one needs to lower-bound αt, that is, to upper-bound |ˆθ t Xt|, and that is impossible in the global logistic setting. Therefore, the idea is to force a lower-bound on αt in its definition. We thus define, for some 0 < β < 1/2, (1 + eˆθ t Xt)(1 + e ˆθ t Xt) This modification yields Algorithm 3, where we keep the notations ˆθt, Pt, τ(ε) with some abuse in the rest of this section. We impose a decreasing threshold on αt (β > 0) and we prove that the recursion coincides with Algorithm 1 after some steps. The sensitivity of the algorithm to β is discussed at the end of Section 4.2. Also, note that the threshold could be c/tβ, c > 0, as in Bercu et al. (2020). We consider 1/tβ for clarity. We control the convergence time τ(ε) of this version of the EKF: Proposition 9 Starting Algorithm 3 from ˆθ1 = 0 and any P1 0, if Assumptions 1 and 2 are satisfied and E[XX ] is invertible, for any ε, δ > 0, it holds τ(ε) T(ε, δ) along with t > T(ε, δ), αt = 1 (1 + eˆθ t Xt)(1 + e ˆθ t Xt) , de Vilmarest and Wintenberger with probability at least 1 δ, where T(ε, δ) N is defined in Corollary 13. Besides the convergence of the truncated EKF, the proposition states that the truncated recursions coincide with the static EKF ones after the first T(ε, δ) steps. Thus we can apply our analysis of Section 3. We state the global result for ε = 1/(20DX): Theorem 10 Under the assumptions of Proposition 9, for any δ > 0, it holds for any n 1 simultaneously t=1 L(ˆθt) L(θ ) 3de DX θ ln 1 + nλmax(P1)D2 X 4d + λmax(P 1 1 ) 75D2 X + 64e DX θ ln δ 1 + T 1 20DX , δ 1 300 + DX ˆθ1 θ + T 1 20DX , δ 2 λmax(P1)D2 X 2 , with probability at least 1 4δ, where T(1/(20DX), δ) is defined in Corollary 13. 4.2 Explicit Definition of T(ε, δ) in Proposition 9 It is proved that ˆθn θ 2 = O (ln n/n) almost surely (Bercu et al., 2020, Theorem 4.2). We don t obtain a non-asymptotic version of this rate of convergence, but the aim of this paragraph is to prove Proposition 9 for an explicit value of T(ε, δ) for any δ, ε > 0. The objective of the truncation introduced in the algorithm is to improve the control on Pt. We state that fact formally with a concentration result relying on Tropp (2012). We define Λmin the smallest eigenvalue of E[XX ]. Proposition 11 Under the assumptions of Proposition 9, for any δ > 0, it holds simultaneously t > 20D4 X Λ2 min ln 625d D8 X Λ4 minδ 1/(1 β) , λmax(Pt) 4 Λmint1 β , with probability at least 1 δ. This proposition justifies the choice β < 1/2 in the introduction of the truncated algorithm to satisfy the condition P t λmax(Pt)2 < + with high probability. Motivated by Proposition 11, we define, for C > 0, the event λmax(Pt) C t1 β To obtain a control on Pt holding for any t, we use the relation λmax(Pt) λmax(P1) holding almost surely. We thus define Cδ = max 4 Λmin , λmax(P1) 20D4 X Λ2 min ln 625d D8 X Λ4 minδ and we obtain P (ACδ) 1 δ. We obtain the following theorem under that condition. Stochastic Online Optimization using Kalman Recursion Theorem 12 Under the assumptions of Proposition 9, we have for any δ, ε > 0 and t exp 28D8 XC2 δ (1+e DX ( θ +ε))3 Λ3 min(1 2β)3/2ε2 P( ˆθt θ > ε | ACδ) ( t + 1) exp Λ6 min(1 2β)ε4 216D12 X C2 δ (1 + e DX( θ +ε))6 ln(t)2 + t exp Λ2 min(1 2β)ε4 211D4 XC2 δ (1 + e DX( θ +ε))2 ( The beginning of our convergence proof starts similarly as the analysis of Bercu et al. (2020): we obtain a recursive inequality ensuring that (L(ˆθt))t is decreasing in expectation. However, in order to obtain a non-asymptotic result we cannot apply Robbins-Siegmund Theorem. Instead we use the fact that the variations of the algorithm ˆθt are slow thanks to the control on Pt. Thus, if the algorithm was far from the optimum, the last estimates were far too which contradicts the decrease in expectation of the risk. Consequently, we look at the last k t such that ˆθk θ < ε/2, if it exists. We decompose the probability of being outside the local region in two scenarii, yielding the two terms in Theorem 12. If k < t, the recursive decrease in expectation makes it unlikely that the estimate stays far from the optimum for a long period. If k > t, the control on Pt allows a control on the probability that the algorithm moves fast, in t k steps, away from the optimum. The following corollary explicitly defines a guarantee for the convergence time. Corollary 13 Proposition 9 holds with for any ε, δ > 0 T(ε, δ) = max 2(1 + e DX( θ +ε)) 1/β , exp 3 215D12 X C2 δ/2(1 + e DX( θ +ε))6 Λ6 min(1 2β)3/2ε4 This definition of T(ε, δ) allows a discussion on the dependence of the bound Theorem 10 to the different parameters. Note that the choice ε = 1/(20DX) in Theorem 10 is artificially made for simplifying constants since the bound actually holds for any ε > 0 simultaneously. The truncation has introduced an extra parameter 0 < β < 1/2 that does not impact the leading term in Theorem 10. However, it impacts the first step control in an intricate way. On the one hand, when β is close to 0, the algorithm is slow to coincide with the true EKF as T(ε, δ) = e O(1)/β. On the other hand, the larger β, the larger our control on λmax(Pt) and thus we get T(ε, δ) = e O(1)/(1 2β)3/2. Practical considerations show that the truncation is artificial and can even deteriorate the performence of the EKF, see Section 6. Thus Bercu et al. (2020) suggest to choose β = 0.49. The dependence on δ is even more complex. The third constraint on T(ε, δ) is O(δ 1) which should not be sharp. To improve this lousy dependence in the bound, one needs a better control of Pt. It would follow from a specific analysis of the O(ln δ 1) first recursions in order to initialize the control on Pt. However the objective of Corollary 13 was to prove Proposition 9 and not to get an optimal value of T(ε, δ). A refinement of our convergence analysis following from a tighter control on Pt of the EKF than the one provided by Tropp (2012) is a very important and challenging open question. de Vilmarest and Wintenberger 5. Quadratic Setting We obtain a global result for the quadratic loss where Algorithm 1 becomes the standard Kalman filter (recall that we take σ2 = 1, that is ℓ(y, ˆy) = (y ˆy)2/2 and a = 1, b (ˆθ t Xt) = ˆθ t Xt, αt = 1). The parallel with the ridge forecaster was evoked by Diderrich (1985), and it is crucial that the static Kalman filter is the ridge regression estimator for a decaying regularization parameter. It highlights that the static EKF may be seen as an approximation of the regularized ERM: Proposition 14 In the quadratic setting, for any sequence (Xt, yt), starting from any ˆθ1 Rd and P1 0, the static EKF satisfies the optimisation problem ˆθt = arg min θ Rd s=1 (ys θ Xs)2 + 1 2(θ ˆθ1) P 1 1 (θ ˆθ1) Note that the static Kalman filter provides automatically a right choice of the ridge regularization parameter. This equivalence yields a logarithmic regret bound for the Kalman filter (Theorem 11.7 of Cesa-Bianchi and Lugosi, 2006). It follows from Lemma 5 as the quadratic loss coincides with its second-order Taylor expansion. The leading term of the bound is d ln n maxt(yt ˆθ t Xt)2, thus yt ˆθ t Xt needs to be bounded. As the static Kalman filter estimator is exactly the ridge forecaster, we can also use the regularized empirical risk minimization properties to control T(ε, δ). In particular, we apply the ridge analysis of Hsu et al. (2012), and we check Assumption 5: Proposition 15 Starting from any ˆθ1 Rd and P1 0, if Assumptions 1, 2 and 4 hold and if E[XX ] is invertible then Assumption 5 holds for T(ε, δ) defined explicitly in Appendix D, Corollary 26. Up to universal constants, defining Λmin as the smallest eigenvalue of E[XX ], we get p1 + D2 X Λmin (1 + D2 app) ln δ 1 + σ2d + DX Λmin (Dapp + DX θ ) + ˆθ1 θ p1 + σ2 ln δ 1 !! with h(x) = x ln x. We obtain a much less dramatic dependence in ε than in the logistic setting. However we could not avoid a Λ 1 min factor in the definition of T(ε, δ). It is not surprising since the convergence phase relies deeply on the behavior of Pt. As for the logistic setting, we split the cumulative risk into two sums. The sum of the first terms is roughly bounded by a worst case analysis, and the sum of the last terms is estimated thanks to our local analysis (Theorem 2). However, as the loss and its gradient are not bounded we cannot obtain a similar almost sure upper-bound on the convergence phase. The sub-gaussian assumption provides a high probability bound instead. Stochastic Online Optimization using Kalman Recursion Theorem 16 Under the assumptions of Proposition 15, for any ε, δ > 0, it holds simultaneously t=1 L(ˆθt) L(θ ) 15 2 d 8σ2 + D2 app + ε2D2 X ln 1 + nλmax(P1)D2 X d + 5λmax(P 1 1 )ε2 + 115 σ2(4 + λmax(P1)D2 X 4 ) + D2 app + 2ε2D2 X + D2 X 5ε2 + 2( ˆθ1 θ 2 + 3λmax(P1)DXσ ln δ 1)2 T(ε, δ) + 2λmax(P1)2D4 X(3σ + Dapp)2 3 T(ε, δ)3, n 1 , with probability at least 1 6δ. Note that the dependence of the cumulative excess risk of the convergence phase in terms of δ is O(log(δ 1)3). 6. Experiments We experiment the static EKF for logistic regression. Precisely, we compare the following sequential algorithms that we all initialize at 0: The static EKF and the truncated version (Algorithm 3). We take the default value P1 = Id along with the value β = 0.49 suggested by Bercu et al. (2020). Note that a threshold 10 10/t0.49 as recommended by Bercu et al. (2020) would coincide with the static EKF. The ONS and the averaged version. The convex region of search is a ball centered in 0 and of radius Dθ = 1.1 θ , a setting where we have good knowledge of θ . We consider two choices of the exp-concavity constant on which the ONS crucially relies to define the gradient step size. First, we use the only available bound e DθDX. Second, in the settings where the step size is so small that the ONS doesn t move, we use the exp-concavity constant κ0 at θ . This yields a bigger step size, though the exp-concavity is not satisfied on the region of search. Two Averaged Stochastic Gradient Descent as described by Bach (2014). First we test the choice of the gradient step size γ = 1/(2D2 X N) denoted by ASGD and a second version with γ = θ /(DX N) denoted by ASGD oracle. Note that these algorithms are with fixed horizon, thus at each step t, we have to re-run the whole procedure. 6.1 Synthetic Data We first consider well-specified data generated by the process of Bercu et al. (2020). The explanatory variables X = (1, Z ) are of dimension d = 11 where Z is a random vector composed of 10 independent components uniformly generated in [0, 1], thus DX = d. With this distribution for X we define three synthetic settings that we evaluate: de Vilmarest and Wintenberger Figure 1: Density of the Bernoulli parameter on 107 samples: on the left and on the middle density of (1 + e θ X) 1 for the two well-specified settings (left, the ordinate is in log scale), and on the right density of (1+e θ j Xt) 1 with j {1, 2} uniformly at random for the misspecified setting. On the right we observe the two modes E[(1 + e θ 1 Xt) 1] 0.28 and E[(1 + e θ 2 Xt) 1] 0.79. Well-specified 1: we define θ = ( 9, 0, 3, 9, 4, 9, 15, 0, 7, 1, 0) , and at each iteration t, the variable yt { 1, 1} is a Bernoulli variable of parameter (1+e θ Xt) 1. Well-specified 2: in the first well-specified setting the Bernoulli parameter is mostly distributed around 0 and 1 (see Figure 1), thus we try a less discriminated setting with θ = 1 10( 9, 0, 3, 9, 4, 9, 15, 0, 7, 1, 0) . Misspecified: In order to demonstrate the robustness of the EKF we test the algorithms in a misspecified setting switching randomly between two well-specified logistic processes. We define θ1 = 1 10( 9, 0, 3, 9, 4, 9, 15, 0, 7, 1, 0) and θ2 where we have only changed the first coefficient from 9/10 to 15/10. Then yt is a Bernoulli random variable whose parameter is either (1 + e θ 1 Xt) 1 or (1 + e θ 2 Xt) 1 uniformly at random. We checked that Assumption 2 is still satisfied. We evaluate the different algorithms with the mean squared error E[ ˆθt θ 2] that we approximate by its empirical version on 100 samples. We display the results in Figure 2. 6.2 Real Data Sets To illustrate better the robustness to misspecification, we run the same procedures on real data sets: Forest cover-type (Blackard and Dean, 1999): the feature vector is of dimension d = 54, and as it is a multi-class task (7 classes) we focus on classifying 2 versus all others. There are n = 581012 instances and we randomly split in two halves for training and testing. Adult income (Kohavi, 1996): the objective is to predict whether a person s annual income is smaller or bigger than 50K. There are 14 explanatory variables, and we Stochastic Online Optimization using Kalman Recursion Figure 2: Mean squared error in log-log scale for the three synthetic settings. For the first well-specified setting (left) the ONS is applied using the exp-concavity constant κ0 1.7 10 15 instead of e Dθ d to accelerate the algorithm, and both the ONS and its averaged version still don t move. In the other two (middle and right) we use e Dθ d for the ONS. We observe that the EKF and the truncated version coincide in the two last settings. obtain d = 98 once categorical variables are transformed into binary variables. We use the canonical split between training (32561 instances) and testing (16281 instances). For each data set, we standardize X such that each feature ranges from 0 to 1. At each step we sample within the training set (with replacement). We evaluate through an empirical version of E[L(ˆθn)] L(θ ) estimated on 100 samples and where L is estimated on the test set, see Figure 3. 6.3 Summary Our experiments show the superiority of the EKF for logistic regression compared to the ONS or to averaged SGD in all the settings we tested. We display in Table 2 a few indicators of the data sets. In particular, it is interesting that the static EKF works well even in a setting where the Hessian matrix H is singular. It appears clear that low exp-concavity constants are responsible of the poor performances of the ONS. One may tune the gradient step size at the cost of losing the expconcavity property and thus the regret guarantee of (Hazan et al., 2007) or its analogous for the cumulative risk (Mahdavi et al., 2015). Averaging is crucial for the ONS, whereas it is useless for the static EKF. Indeed we chose not to plot the averaged version of the EKF for clarity, but the EKF performs better than its averaged version. It is important to note that in the first synthetic setting the truncation deteriorates the performance of the EKF, as well as in the adult income data set to a lesser extent, whereas the results are the same in the other settings. Bercu et al. (2020) argue that the truncation is artificially introduced for the convergence property, thus they use the threshold 10 10/t0.49 instead of 1/t0.49 and the truncated version almost coincides with the true EKF. We confirm here that the truncation may be damaging if the threshold is set too high and de Vilmarest and Wintenberger Figure 3: Excess test risk for forest cover type (left) and adult income (right). As the ONS doesn t move when applied with the exp-concavity constant e DθDX we use instead the exp-concavity constant at θ : κ0 1.4 10 3 for forest cover type and κ0 5.5 10 6 for adult income. The EKF and the truncated version almost coincide for both data sets. Setting d λmax(H )/µ tr(G H 1) R2/µ de DθDX dκ0 Synthetic well-specified 1 11 6.9 102 1.7 102 1.0 105 9.2 1037 6.4 1015 Synthetic well-specified 2 11 1.5 102 7.1 101 2.5 103 5.4 104 3.3 102 Synthetic misspecified 3 11 1.5 102 7.1 101 2.0 103 3.6 103 7.4 101 Forest cover type 54 9.2 1032 3.8 104 Adult income 98 2.5 107 7.2 105 5.3 108 1.8 1062 1.8 107 Table 2: For the different experimental settings we display the dimension d and the condition number of the Hessian at θ (λmax(H ) and µ are the maximal and minimal eigenvalues of H ). We present the value of tr(G H 1) which is bounded either by R2/µ, or by de DθDX because e DθDX bounds the exp-concavity constant on the centered ball of radius Dθ. We add to the table dκ0 de DθDX where κ0 is the inverse of the exp-concavity constant of the loss at θ . Stochastic Online Optimization using Kalman Recursion we recommend to use the EKF in practice, or equivalently the truncated version with the low threshold suggested by Bercu et al. (2020). 7. Conclusion We studied an efficient way to tackle some unconstrained optimization problems, in which we get rid of the projection step of bounded algorithm such as the ONS. We presented a bayesian approach where we transformed the loss into a negative log-likelihood. We used the Kalman recursion to provide a parameter free approximation of the maximum-likelihood estimator. We demonstrated the optimality of the local phase for locally exp-concave losses which can be expressed as GLM log-likelihoods. We proved the finiteness of the convergence phase in logistic and quadratic regressions. We illustrated our theoretical results with numerical experiments for logistic regression. It would be interesting to generalize our results to a larger class of optimization problems. Finally, this article aimed at strengthening the bridge between Kalman recursion and the optimization community. Therefore we made the i.i.d. assumption, standard in the stochastic optimization literature and we focus on the static EKF. It may lead the way to a risk analysis of the general EKF in non i.i.d. state-space models. de Vilmarest and Wintenberger The Appendix follows the structure of the article: Appendix A presents the EKF for generalized linear models. Appendix B contains the proofs of Section 3. Precisely, Lemma 3 is proved in Section B.1, the intermediate results of Sections 3.1 and 3.2 are proved in Sections B.2 and B.3, then Theorem 1 is proved in Section B.4 and Theorem 2 in Section B.5. Appendix C contains the proofs of Section 4. We derive the global bound (Theorem 10) in Section C.1, then we obtain the concentration result on Pt in Section C.2, and finally we prove the convergence of the truncated algorithm in Section C.3. Appendix D contains the proofs of Section 5. We prove Theorem 16 in Section D.1 and then in Section D.2 we prove the convergence of the algorithm, and we define an explicit value of T(ε, δ) satisfying Assumption 5. Appendix A. Derivation of the Static EKF for Generalized Linear Models As in Section 10.2 of Durbin and Koopman (2012) we consider the following state-space model: yt = Zt(θt) + εt , θt+1 = Tt(θt) + ηt . where εt and ηt are independent with mean zero and variances ht(θt), Qt(θt). The statespace version of equation (2) is p(yt | Xt) = h(yt) exp ytθ t Xt b(θ t Xt) a The preceding equation matches the space equation form with Zt(θt) = b (θ t Xt) and ht(θt) = ab (θ t Xt). Thus we can write the EKF as follows (see Equation 10.4 of Durbin and Koopman, 2012): denoting by Tt the derivative of Tt, vt = yt b (ˆθ t Xt) , Ft = X t Pt Xtb (ˆθ t Xt)2 + ab (ˆθ t Xt) , ˆθt|t = ˆθt + Pt Xtb (ˆθ t Xt)F 1 t vt , Pt|t = Pt Pt Xt F 1 t X t Ptb (ˆθ t Xt)2 , ˆθt+1 = Tt(ˆθt|t) , Pt+1 = Tt Pt|t Tt + Qt(ˆθt|t) . We focus on the static setting where the state equation becomes θt+1 = θt, thus we have ˆθt+1 = ˆθt|t and Pt+1 = Pt|t. We rewrite the update on Pt as follows: Pt+1 = Pt Pt Xt X t Ptb (ˆθ t Xt)/a X t Pt Xtb (ˆθ t Xt)/a + 1 . Moreover we have Pt+1Xt = Pt Xt F 1 t ab (ˆθ t Xt) thus we can rewrite the update on ˆθt as follows: ˆθt+1 = ˆθt + Pt+1Xt(yt b (ˆθ t Xt) a . This yields Algorithm 1. Stochastic Online Optimization using Kalman Recursion Appendix B. Proofs of Section 3 B.1 Proof of Lemma 3 We prove the following Lemma inspired by the stopping time technique of Freedman (1975) from which we derive Lemma 3. We give a general form useful in several proofs. Lemma 17 Let (Fn) be a filtration, and we consider a sequence of events (An) that is adapted to (Fn). Let (Vn) be a sequence of random variables adapted to (Fn) satisfying V0 = 1, Vn 0 almost surely for any n, and E[Vn | Fn 1, An 1] Vn 1, n 1. Then for any δ > 0, it holds n=1 Vn > δ 1 [ An important particular case is when (Vn) is a super-martingale adapted to the filtration (Fn) satisfying V0 = 1 and Vn 0 almost surely: then we have simultaneously Vn δ 1 for n 1 with probability larger than 1 δ. Proof We define Vn > δ 1 An 1 . As (Ek) is increasing, we have, for any k 1, n=1 P En En 1 An 1 En 1 + n=1 P Vn > δ 1 En 1 An 1 . First, we have An 1 En 1 P Second, we apply Markov s inequality: n=1 P Vn > δ 1 En 1 An 1 δ 1 1En En 1 An 1 n=1 E h Vn(1En 1 An 1 1En) i E h Vn1En 1 An 1 i E Vn1En . de Vilmarest and Wintenberger The second line is obtained since En En 1 An 1 . According to the tower property and the super-martingale assumption, E h Vn1En 1 An 1 i = E h E[Vn | Fn 1, An 1]1En 1 An 1 E h E[Vn | Fn 1, An 1]1En 1 E h Vn 11En 1 Therefore, a telescopic argument along with V0 = 1 and Vk1Ek 0 yields n=1 P Vn > δ 1 En 1 An 1 δ . Finally, for any k 1, we obtain and the desired result follows by letting k . Proof of Lemma 3. Let λ > 0. For any n 1, we define 2 (( Nt)2 + E[( Nt)2 | Ft 1]) ! Lemma B.1 of Bercu and Touati (2008) states that (Vn) is a super-martingale adapted to the filtration (Fk+n). Moreover V0 = 1 and for any n, it holds Vn 0 almost surely. Therefore we can apply Lemma 17. B.2 Proof of Lemma 5 Proof of Lemma 5. We start from the update formula ˆθt+1 = ˆθt + Pt+1 (yt b (ˆθ t Xt))Xt a yielding (ˆθt+1 θ ) P 1 t+1(ˆθt+1 θ ) = (ˆθt θ ) P 1 t+1(ˆθt θ ) + 2(yt b (ˆθ t Xt))X t a (ˆθt θ ) + X t Pt+1Xt yt b (ˆθ t Xt) a Stochastic Online Optimization using Kalman Recursion With a summation argument, re-arranging terms, we obtain: (b (ˆθ t Xt) yt)X t a (ˆθt θ ) 1 2(ˆθt θ ) (P 1 t+1 P 1 t )(ˆθt θ ) t=1 X t Pt+1Xt yt b (ˆθ t Xt) a (ˆθt θ ) P 1 t (ˆθt θ ) (ˆθt+1 θ ) P 1 t+1(ˆθt+1 θ ) . We bound the telescopic sum: as P 1 n+1 0, we have (ˆθt θ ) P 1 t (ˆθt θ ) (ˆθt+1 θ ) P 1 t+1(ˆθt+1 θ ) (ˆθ1 θ ) P 1 1 (ˆθ1 θ ) ˆθ1 θ 2 The result follows from the identities (b (ˆθ t Xt) yt)Xt a = ℓ (yt, ˆθ t Xt)Xt , P 1 t+1 P 1 t = ℓ (yt, ˆθ t Xt)Xt X t . B.3 Proofs of Section 3.2 Proof of Proposition 6. The first-order condition satisfied by θ is E (y b (θ X))X yielding E [y X] = E[b (θ X)X]. Therefore E (b (θ X) y)X a(θ θ ) E h X(b (θ X) b (θ X)) i . Considering the function f : λ (θ θ ) E Xb (θ X + λ(θ θ ) X)) , we know there exists λ [0, 1] such that f (λ) = f(1) f(0). This translates into θ (θ θ ) = 1 a(θ θ ) E h Xb θ X + λ(θ θ) X (θ θ ) X i . Then we use Assumption 3: b θ X + λ(θ θ) X b (θ X) = ℓ yt, θ X + λ(θ θ) X ℓ (yt, θ X) ρ θ θ , de Vilmarest and Wintenberger θ (θ θ ) ρ θ θ (θ θ ) E h ℓ (y, θ X)XX i (θ θ ) = ρ θ θ (θ θ ) 2L Proof of Proposition 7. We first recall that L(θ) L(θ ) L θ (θ θ ), then Proposition θ (θ θ ) c(θ θ ) 2L θ(θ θ ) (1 c ρ θ θ ) L and the result follows. Proof of Lemma 8. We first develop ( Mt)2: ( Mt)2 = (E [ t | Ft 1] t) (ˆθt θ ) 2 = (ˆθt θ ) E[ t | Ft 1]E[ t | Ft 1] + t t t E[ t | Ft 1] E[ t | Ft 1] t (ˆθt θ ) 2(ˆθt θ ) E[ t | Ft 1]E[ t | Ft 1] + t t (ˆθt θ ) 2(ˆθt θ ) E[ t t | Ft 1] + t t (ˆθt θ ) . The third line holds because if U, V Rd, it holds UV V U UU + V V . The last one comes from E h ( t E[ t | Ft 1])( t E[ t | Ft 1]) | Ft 1 i 0. Also, we have the relation E[( Mt)2 | Ft 1] (ˆθt θ ) E[ t t | Ft 1](ˆθt θ ) . ( Mt)2 + E[( Mt)2 | Ft 1] (ˆθt θ ) 3E[ t t | Ft 1] + 2 t t (ˆθt θ ) , and the result follows from Lemma 3. We derive the following Lemma in order to control the right-hand side of Lemma 5, in both settings. Lemma 18 Assume the second point of Assumption 3 holds. For any k, n 1, if ˆθt θ 2 ε for any k < t k + n then we have t=k+1 Tr Pt+1(P 1 t+1 P 1 t ) d ln 1 + nhελmax(Pk+1)D2 X d Stochastic Online Optimization using Kalman Recursion Proof We apply Lemma 11.11 of Cesa-Bianchi and Lugosi (2006): t=k+1 Tr Pt+1(P 1 t+1 P 1 t ) = 1 det(P 1 t ) det(P 1 t+1) det(P 1 t+1) det(P 1 t ) det(P 1 k+n+1) det(P 1 k+1) t=k+1 ℓ (yt, ˆθ t Xt)(P 1/2 k+1Xt)(P 1/2 k+1Xt) ! i=1 ln(1 + λi) , where λ1, ..., λd are the eigenvalues of k+n P t=k+1 ℓ (yt, ˆθ t Xt)(P 1/2 k+1Xt)(P 1/2 k+1Xt) . Therefore we t=k+1 Tr Pt+1(P 1 t+1 P 1 t ) d ln dnhελmax(Pk+1)D2 X B.4 Bounded Setting (Assumption 3) Proof of Theorem 1. Let δ > 0. On the one hand, we sum Lemma 5 and 8. We obtain, for any λ > 0, E[ t | Ft 1] (ˆθt θ ) 1 λ(ˆθt θ ) t t + 3 2E t t | Ft 1 (ˆθt θ ) t=T(ε,δ)+1 X t Pt+1Xtℓ (yt, ˆθ t Xt)2 + ˆθ1 θ 2 λmin(PT(ε,δ)+1) + ln δ 1 λ , n 1 , (3) with probability at least 1 δ, where we define Qt = (ˆθt θ ) ℓ (yt, ˆθ t Xt)Xt X t (ˆθt θ ) for any t. de Vilmarest and Wintenberger On the other hand, thanks to Assumption 3, we can apply Proposition 7 with c = 0.75 to obtain, for any t 1, = L(ˆθt) L(θ ) ρε ρε 0.75 ˆθt (ˆθt θ ) 0.75(ˆθt θ ) 2L ˆθt (ˆθt θ ) , = L(ˆθt) L(θ ) 5 E[ t | Ft 1] (ˆθt θ ) 0.75E[Qt | Ft 1] , (4) because ρε > 0.95. In order to bridge the gap between Equations (3) and (4), we need to control the quadratic terms of Equation (3) with E[Qt | Ft 1]. First, for any t, if ˆθt θ ε, we have Qt [0, hεε2D2 X], and we apply Lemma A.3 of Cesa-Bianchi and Lugosi (2006) to the random variable 1 hεε2D2 X Qt [0, 1]: for any s > 0, E exp s hεε2D2 X Qt es 1 hεε2D2 X E [Qt | Ft 1] | Ft 1, ˆθt θ ε 1 . We fix s = 0.1 and we define 0.1 hεε2D2 X Qt (e0.1 1)E 1 hεε2D2 X Qt | Ft 1 The sequence (Vn) is adapted to (FT(ε,δ)+n), almost surely we have V0 = 1 and Vn 0. Finally, E h Vn | FT(ε,δ)+n 1, ˆθT(ε,δ)+n θ ε i Vn 1 , and ( ˆθT(ε,δ)+n θ ε) belongs to FT(ε,δ)+n 1. We apply Lemma 17: n=1 Vn > δ 1 [ n=1 ( ˆθT(ε,δ)+n θ > ε) ! n=1 ( ˆθT(ε,δ)+n θ > ε) We define Aε k = T n=k+1 ( ˆθn θ ε) for any k. The last inequality is equivalent to t=T(ε,δ)+1 Qt > 10(e0.1 1) t=T(ε,δ)+1 E [Qt | Ft 1] + 10hεε2D2 X ln δ 1 Aε T(ε,δ) We then bound the two quadratic terms coming from Lemma 8: using Assumption 3 we have the implications ˆθt θ ε = (ˆθt θ ) t t (ˆθt θ ) κεQt , ˆθt θ ε = (ˆθt θ ) E t t | Ft 1 (ˆθt θ ) κεE [Qt | Ft 1] . Stochastic Online Optimization using Kalman Recursion Therefore, we get from (5) 2Qt + λ(ˆθt θ ) t t (ˆθt θ ) 2λ(ˆθt θ ) E t t | Ft 1 (ˆθt θ ) > 10(e0.1 1)(1 2 + λκε) + 3 2λκε T(ε,δ)+n X t=T(ε,δ)+1 E [Qt | Ft 1] 2 + λκε)hεε2D2 X ln δ 1 Aε T(ε,δ) We set λ = 0.75 5(e0.1 1) (10(e0.1 1)+ 3 2 )κε , so that 10(e0.1 1)(1 2 + λκε) + 3 2λκε = 0.75 , 1 2 + λκε = 1 2 + 0.75 5(e0.1 1) 10(e0.1 1) + 3 2 0.59 0.6 , and consequently E[ t | Ft 1] (ˆθt θ ) 0.75E[Qt | Ft 1] > 6hεε2D2 X ln δ 1 E[ t | Ft 1] (ˆθt θ ) 1 λ(ˆθt θ ) t t + 3 2E t t | Ft 1 (ˆθt θ ) ! We plug Equation (4) in the last inequality: t=T(ε,δ)+1 (L(ˆθt) L(θ )) > 30hεε2D2 X ln δ 1 E[ t | Ft 1] (ˆθt θ ) 1 λ(ˆθt θ ) t t + 3 2E t t | Ft 1 (ˆθt θ ) ! de Vilmarest and Wintenberger We then use Equation (3) with 1 λ = (10(e0.1 1)+ 3 2 )κε 0.75 5(e0.1 1) 11.4κε 12κε. It yields t=T(ε,δ)+1 (L(ˆθt) L(θ )) > 5 t=T(ε,δ)+1 X t Pt+1Xtℓ (yt, ˆθ t Xt)2 + 5 ˆθ1 θ 2 λmin(PT(ε,δ)+1) + 30(2κε + hεε2D2 X) ln δ 1 ! Thanks to Assumption 3, we have X t Pt+1Xtℓ (yt, ˆθ t Xt)2 κε Tr Pt+1(P 1 t+1 P 1 t ) , t > T(ε, δ) , therefore we apply Lemma 18: for any n, it holds t=T(ε,δ)+1 X t Pt+1Xtℓ (yt, ˆθ t Xt)2 dκε ln 1 + nhελmax(PT(ε,δ)+1)D2 X d As PT(ε,δ)+1 P1, we obtain t=T(ε,δ)+1 (L(ˆθt) L(θ )) > 5 2dκε ln 1 + nhελmax(P1)D2 X d + 5 ˆθ1 θ 2 λmin(PT(ε,δ)+1) + 30(2κε + hεε2D2 X) ln δ 1 ! To conclude, we use Assumption 5. B.5 Quadratic Setting (Assumption 4) We recall two definitions introduced in the previous subsection: n=k+1 ( ˆθn θ ε), k 1 , Qt = (ˆθt θ ) Xt X t (ˆθt θ ), t 1 . The sub-gaussian hypothesis requires a different treatment of several steps in the proof. In the following proofs, we use a consequence of the first points of Assumption 4. We apply Lemma 1.4 of Rigollet and H utter (2015): for any X Rd, E[(y E[y | X])2i | X] 2i(2σ2)iΓ(i) = 2(2σ2)ii!, i N . (6) First, we control the quadratic terms in t = (yt ˆθ t Xt)Xt in the following lemma. Stochastic Online Optimization using Kalman Recursion Lemma 19 1. For any k N and δ > 0, we have t=k+1 (ˆθt θ ) t t (ˆθt θ ) > 3 8σ2 + D2 app + ε2D2 X k+n X t=k+1 Qt + 12ε2D2 Xσ2 ln δ 1 ! 2. For any t, it holds almost surely (ˆθt θ ) E[ t t | Ft 1](ˆθt θ ) 3 σ2 + D2 app + ˆθt θ 2D2 X E[Qt | Ft 1] . 1. We recall that for any a, b, c, we have (a + b + c)2 3(a2 + b2 + c2). Thus (ˆθt θ ) t t (ˆθt θ ) = Qt(yt ˆθ t Xt)2 3Qt (yt E[yt | Xt])2 + (E[yt | Xt] θ Xt)2 + ((θ ˆθt) Xt)2 3Qt (yt E[yt | Xt])2 + D2 app + ˆθt θ 2D2 X . (7) To obtain the last inequality, we use the second point of Assumption 4 to bound the middle term. Then we use Taylor series for the exponential, and we apply Equation (6). For any t and any µ satisfying 0 < µ 1 4Qtσ2 , we have E exp µQt(yt E[yt | Xt])2 | Ft 1, Xt = 1 + X µi Qi t E[(yt E yt | Xt])2i | Xt µi Qi ti!(2σ2)i 1 + 8µQtσ2, 2µQtσ2 1 2 exp 8µQtσ2 . Therefore, for any t, E exp 1 4ε2D2 Xσ2 Qt (yt E[yt | Xt])2 8σ2 | Ft 1, Xt, ˆθt θ ε 1 . We define the random variable 1 4ε2D2 Xσ2 t=k+1 Qt (yt E[yt | Xt])2 8σ2 ! de Vilmarest and Wintenberger (Vn)n is adapted to the filtration (σ(X1, y1, ..., Xk+n, yk+n, Xk+n+1)n, moreover V0 = 1 and Vn 0 almost surely, and E h Vn | X1, y1, ..., Xk+n 1, yk+n 1, Xk+n, ˆθk+n θ ε i Vn 1 . Therefore we apply Lemma 17: for any δ > 0, n=1 (Vn > δ 1) Aε k which is equivalent to t=k+1 Qt(yt E[yt | Xt])2 > 8σ2 k+n X t=k+1 Qt + 4ε2D2 Xσ2 ln δ 1 ! Substituting in Equation (7), we obtain the desired result. 2. We apply the same decomposition as for Equation (7): for any t, (ˆθt θ ) E[ t t | Ft 1](ˆθt θ ) 3(ˆθt θ ) E Xt X t (yt E[yt | Xt])2 + D2 app + θ ˆθt 2D2 X | Ft 1 Assumption 4 implies that for any Xt, E[(yt E[yt | Xt])2 | Xt] σ2. Thus, the tower property yields (ˆθt θ ) E[ t t | Ft 1](ˆθt θ ) 3 σ2 + D2 app + ˆθt θ 2D2 X (ˆθt θ ) E[Xt X t | Ft 1](ˆθt θ ) . Second, we bound the right-hand side of Lemma 5, that is the objective of the following lemma. Lemma 20 Let k N. For any δ > 0, we have t=k+1 X t Pt+1Xt(yt ˆθ t Xt)2 > 12λmax(P1)D2 Xσ2 ln δ 1 + 3 8σ2 + D2 app + ε2D2 X d ln 1 + nλmax(Pk+1)D2 X d Proof We apply a similar analysis as in the proof of Lemma 19 in order to use the subgaussian assumption, and then we apply the telescopic argument as in the bounded setting. Stochastic Online Optimization using Kalman Recursion We decompose yt ˆθ t Xt: X t Pt+1Xt(yt ˆθ t Xt)2 3X t Pt+1Xt (yt E[yt | Xt])2 + (E[yt | Xt] b (θ Xt))2 + ((θ ˆθt) Xt)2 3X t Pt+1Xt (yt E[yt | Xt])2 + D2 app + ˆθt θ 2D2 X . (8) To control (yt E[yt | Xt])2X t Pt+1Xt, we use its positivity along with Equation (6). Precisely, for any t and any µ > 0 satisfying 0 < µ 1 4X t Pt+1Xtσ2 , we have E h exp µ(yt E[yt | Xt])2X t Pt+1Xt | Ft 1, Xt i µi(X t Pt+1Xt)i E (yt E[yt | Xt])2i | Xt µi(X t Pt+1Xt)ii!(2σ2)i 2µX t Pt+1Xtσ2 i 1 + 8µX t Pt+1Xtσ2, 0 < 2µX t Pt+1Xtσ2 1 exp 8µX t Pt+1Xtσ2 . We apply the previous bound with a uniform µ = 1 4λmax(P1)D2 Xσ2 . As λmax(Pt+1) λmax(P1) for any t, we get µ 1 4X t Pt+1Xtσ2 . Thus, we define 1 4λmax(P1)D2 Xσ2 (yt E[yt | Xt])2 8σ2 X t Pt+1Xt (Vn) is a super-martingale adapted to the filtration (σ(X1, y1, ..., Xk+n 1, yk+n 1, Xk+n))n satisfying almost surely V0 = 1, Vn 0, thus we apply Lemma 17: n=1 (Vn > δ 1) or equivalently t=k+1 X t Pt+1Xt(yt E[yt | Xt])2 > 8σ2 k+n X t=k+1 X t Pt+1Xt + 4λmax(P1)D2 Xσ2 ln δ 1 !! de Vilmarest and Wintenberger Combining it with Equation (8), we get t=k+1 X t Pt+1Xt(yt ˆθ t Xt)2 > 3 8σ2 + D2 app + ε2D2 X k+n X t=k+1 X t Pt+1Xt + 12λmax(P1)D2 Xσ2 ln δ 1 ! Then we apply Lemma 18: the second point of Assumption 3 holds with hε = 1, thus t=k+1 Tr Pt+1(P 1 t+1 P 1 t ) d ln 1 + nλmax(Pk+1)D2 X d We conclude with X t Pt+1Xt = Tr(Pt+1(P 1 t+1 P 1 t )). We sum up our findings and we prove the result for the quadratic loss. The structure of the proof is the same as the one of Theorem 1. Proof of Theorem 2. On the one hand, we sum Lemma 5 and Lemma 8: for any λ, δ > 0 E[ t | Ft 1] (ˆθt θ ) 1 2Qt λ(ˆθt θ ) t t +3 2E t t | Ft 1 (ˆθt θ ) t=T(ε,δ)+1 X t Pt+1Xt(yt ˆθ t Xt)2 + ˆθT(ε,δ)+1 θ 2 λmin(PT(ε,δ)+1) + ln δ 1 λ , n 1 , (9) with probability at least 1 δ. On the other hand, we have t=T(ε,δ)+1 (L(ˆθt) L(θ )) 1 1 0.8 E[ t | Ft 1] (ˆθt θ ) 0.8E[Qt | Ft 1] . We aim to relate Equations (9) and (10) as in the proof of Theorem 1. To that end, we apply Lemma 19: 2Qt + λ(ˆθt θ ) t t + 3 2E t t | Ft 1 (ˆθt θ ) 2 + 3λ(8σ2 + D2 app + ε2D2 X) T(ε,δ)+n X t=T(ε,δ)+1 Qt 2λ σ2 + D2 app + ε2D2 X T(ε,δ)+n X t=T(ε,δ)+1 E [Qt | Ft 1] + 12λε2D2 Xσ2 ln δ 1 ! Stochastic Online Optimization using Kalman Recursion As in the proof of Theorem 1 we apply Lemma A.3 of (Cesa-Bianchi and Lugosi, 2006) and Lemma 17: for any δ > 0, t=T(ε,δ)+1 Qt > 10(e0.1 1) t=T(ε,δ)+1 E[Qt | Ft 1] + 10ε2D2 X ln δ 1 We combine the last two inequalities: 2Qt + λ(ˆθt θ ) t t + 3 2E t t | Ft 1 (ˆθt θ ) > 10(e0.1 1) 1 2 + 3λ(8σ2 + D2 app + ε2D2 X) + 9 2λ(σ2 + D2 app + ε2D2 X) t=T(ε,δ)+1 E [Qt | Ft 1] + 10ε2D2 X 1 2 + 3λ(8σ2 + D2 app + ε2D2 X) + 12λε2D2 Xσ2 ln δ 1 ! λ = 0.8 5(e0.1 1) 30(e0.1 1)(8σ2 + D2 app + ε2D2 X) + 9 2(σ2 + D2 app + ε2D2 X) 1 in order to obtain 10(e0.1 1) 1 2 + 3λ(8σ2 + D2 app + ε2D2 X) + 9 2λ(σ2 + D2 app + ε2D2 X) = 0.8 , 1 109σ2 + 28D2app + 28ε2D2 X < λ < 1 108σ2 + 27D2app + 27ε2D2 X , 2 + 3λ(8σ2 + D2 app + ε2D2 X) + 12λD2 Xε2σ2 8ε2D2 X 1 λ 28(4σ2 + D2 app + ε2D2 X) . de Vilmarest and Wintenberger Combining Equations (9), (10) and (11), we obtain t=T(ε,δ)+1 (L(ˆθt) L(θ )) t=T(ε,δ)+1 X t Pt+1Xt(yt ˆθ t Xt)2 + ε2 λmin(PT(ε,δ)+1) + 28(4σ2 + D2 approx + ε2D2 X) ln δ 1 + 8ε2D2 X ln δ 1 ! Finally, we apply Lemma 20 with PT(ε,δ)+1 P1 and we use Assumption 5: it holds simultaneously t=T(ε,δ)+1 L(ˆθt) L(θ ) 5 3 2 8σ2 + D2 app + ε2D2 X d ln 1 + nλmax(P1)D2 X d + λmax P 1 T(ε,δ)+1 ε2 + 28(4σ2 + D2 approx + ε2D2 X) ln δ 1 + 8ε2D2 X ln δ 1 + 6λmax(P1)D2 Xσ2 ln δ 1 , n 1 , with probability at least 1 5δ. To conclude, we write 28(4σ2 + D2 approx + ε2D2 X) + 8ε2D2 X + 6λmax(P1)D2 Xσ2 28 σ2(4 + λmax(P1)D2 X 4 ) + D2 app + 2ε2D2 X Appendix C. Proofs of Section 4 C.1 Proof of Theorem 10 Proof of Theorem 10. We check Assumption 3 with κε = e DX( θ +ε), hε = 1 4 and ρε = e εDX > 0.95. We can thus apply Theorem 1 with λmax(P 1 T(ε,δ)+1) λmax(P 1 1 ) + 1 2 < 3e DX θ , 30 2κε + ε2D2 X 4 < 64e DX θ , 5ε2D2 X 1/75 . We then control the first terms. To that end, we use a rough bound at any time t 1: L(ˆθt) L(θ ) E y X 1 + eyˆθ t X | ˆθt DX( ˆθ1 θ + (t 1)λmax(P1)DX) , Stochastic Online Optimization using Kalman Recursion because for any s 1, we have Ps P1 and therefore ˆθs+1 ˆθs λmax(P1)DX. Summing from 1 to T( 1 20DX , δ) yields the result. C.2 Concentration of Pt We prove a concentration result based on Tropp (2012), which will be used on the inverse of Pt. Lemma 21 If Assumption 1 is satisfied, then for any 0 β < 1 and t 41/(1 β), it holds d exp t1 β Λ2 min 10D4 X Proof We wish to center the matrices Xs X s by subtracting their (common) expected value. We use that if A and B are symmetric, λmin(A B) λmin(A) λmin(B). Indeed, denoting by v any eigenvector of A associated with its smallest eigenvalue, λmin(A B) = min x x (A B)x = λmin(A) v Bv λmin(A) min x x Bx = λmin(A) λmin(B) . s=1 E Xs X s sβ s=1 E Xs X s sβ Λmin t1 β 1 de Vilmarest and Wintenberger Therefore, we obtain < Λmin(t1 β 2) Xs X s sβ E Xs X s sβ < Λmin(t1 β 2) 2(1 β) Λmin t1 β 1 E Xs X s sβ We check the assumptions of Theorem 1.4 of Tropp (2012): Obviously E h Xs X s sβ i Xs X s sβ is centered, λmax E h Xs X s sβ i Xs X s sβ λmax E h Xs X s sβ i D2 X almost surely. As 0 E E h Xs X s sβ i Xs X s sβ 2 E Xs X s sβ 2 D4 X s2β I D4 X sβ I, we get " E Xs X s sβ I D4 X t1 β Therefore we can apply Theorem 1.4 of Tropp (2012): E Xs X s sβ Λ2 mint2(1 β)/(8(1 β)2) D4 Xt1 β/(1 β) + D2 XΛmint1 β/(6(1 β)) = d exp t1 β Λ2 min 8D4 X 1/(1 β) + Λmin/(6D2 X(1 β)) t1 β Λ2 min 8D4 X 1 β + Λmin(1 β) Using Λmin/D2 X 1 and β 0, we obtain 8(1 β + Λmin(1 β) 6D2 X ) 8(1 + 1/6) = 28/3 10, therefore < Λmin(t1 β 2) d exp t1 β Λ2 min 10D4 X The result follows from 1 2t1 β 2 > 0 for t 41/(1 β). We can now do a union bound to obtain Proposition 11. Stochastic Online Optimization using Kalman Recursion Proof of Proposition 11. We first move our problem to the setting of Lemma 21: λmax(Pt) = λmin s=1 Xs X s αs because αs 1/sβ. Therefore, for t 8 41/(1 β), P λmax(Pt) > 4 Λmint1 β > 4 Λmint1 β d exp t1 β Λ2 min 10D4 X where we applied Lemma 21 to obtain the last line. We take a union bound to obtain, for any k 7, P t > k, λmax(Pt) > 4 Λmint1 β t>k d exp t1 β Λ2 min 10D4 X t>k exp t1 β Λ2 min 10D4 X m 1 exp m Λ2 min 10D4 X t>k 1 t1 β =m t>k 1 t =m: for any m t1 β = m = m1/(1 β) t < (m + 1)1/(1 β) , then using ex 1 + 2x for any 0 x 1, we have (m + 1)1/(1 β) = m1/(1 β)(1 + 1/m)1/(1 β) = m1/(1 β) exp(ln(1 + 1/m)/(1 β)) m1/(1 β) exp(1/(m(1 β))) m1/(1 β)(1 + 2/(m(1 β))) , as long as m 2 1/(1 β). Therefore (m + 1)1/(1 β) m1/(1 β) + 1 2m1/(1 β) 1/(1 β) + 1 4m + 1 4(m + 1) , de Vilmarest and Wintenberger and that is true for m = 1 too. Hence P t > k, λmax(Pt) > 4 Λmint1 β m k1 β (m + 1) exp m Λ2 min 10D4 X = 4d exp Λ2 min 10D4 X 1 exp Λ2 min 10D4 X k1 β + 1 + exp Λ2 min 10D4 X 1 exp Λ2 min 10D4 X 4d exp Λ2 min 10D4 X 1 exp Λ2 min 10D4 X 1 exp Λ2 min 10D4 X exp Λ2 min 10D4 X where the second line is obtained deriving both sides of P m k1 β rm+1 = r k1 β +1 respect to r. Also, as 1 e x xe x for any x R, we get P t > k, λmax(Pt) > 4 Λmint1 β 4d10D4 X Λ2 min exp 2 Λ2 min 10D4 X (k1 β + 10D4 X Λ2 min exp Λ2 min 10D4 X ) exp Λ2 min 10D4 X Furthermore, as xe x e 1 for any x 0, we get for any k 7: k1 β + 10D4 X Λ2 min exp Λ2 min 10D4 X exp k1 β Λ2 min 20D4 X Λ2 min exp 10D4 X Λ2 min exp Λ2 min 10D4 X Λ2 min 20D4 X = 20D4 Xe 1 Λ2 min exp 1 2 exp Λ2 min 10D4 X Combining the last two inequalities, we obtain P t > k, λmax(Pt) > 4 Λmint1 β d800D8 Xe 1 Λ4 min exp 2 Λ2 min 10D4 X + 1 2 exp Λ2 min 10D4 X exp k1 β Λ2 min 20D4 X d625D8 X Λ4 min exp k1 β Λ2 min 20D4 X and the result follows. The last line comes from Λmin D2 X and consequently 800e 1 exp 2 Λ2 min 10D4 X + 1 2 exp Λ2 min 10D4 X 800e 1+0.2+0.5e0.1 624.7 625 . The condition k 7 is not necessary because 20D4 X Λ2 min ln 625d D8 X Λ4 minδ 1/(1 β) 20 ln(625δ 1) , Stochastic Online Optimization using Kalman Recursion and either δ 1 and the result is trivial, either δ < 1 and 20 ln(625δ 1) 128. C.3 Convergence of the Truncated Algorithm In order to prove Theorem 12, we state and prove an intermediate lemma. Lemma 22 Let θ Rd. 1. For any η > 0, we have L(θ) L(θ ) > η = L where Dη = Λmin η 2DX(1+e DX ( θ + 8η/D2 X )) . 2. For any ε > 0, we have θ θ > ε = L(θ) L(θ ) > Λmin 4(1 + e DX( θ +ε))ε2 . Proof Both points derive from a second-order identity, turned in an upper-bound in the one case and in a lower-bound in the other. Using L θ (θ ) = 0, there exists 0 λ 1 such that L(θ) = L(θ ) + 1 2(θ θ ) E 1 (1 + e(λθ+(1 λ)θ ) X)(1 + e (λθ+(1 λ)θ ) X)XX (θ θ ) . 1. We first have L(θ) L(θ ) D2 X 8 θ θ 2 . Assume L(θ) L(θ ) > η. Then θ θ q 8η/D2 X. Also, using the Taylor expansion of θ around some θ0 Rd, we get L(θ ) L(θ0)+ L θ0 (θ θ0)+ 1 4(1 + e DX( θ + θ0 θ ))(θ0 θ ) E h XX i (θ0 θ ) , and that yields θ0 (θ0 θ ) L(θ0) L(θ ) + Λmin 4(1 + e DX( θ + θ0 θ )) θ0 θ 2 . Therefore, as L(θ0) L(θ ) 0, L Λmin 4(1 + e DX( θ + θ0 θ )) θ0 θtrue . de Vilmarest and Wintenberger Finally, as L is convex of minimum θ , L 4(1 + e DX( θ + 2DX(1 + e DX( θ + 2. On the other hand we have L(θ) L(θ ) + Λmin 4(1 + e DX( θ + θ θ )) θ θ 2 . Thus, as L is convex of minimum θ , if θ θ > ε it holds L(θ) L(θ ) > min θ0 θ =ε L(θ0) L(θ ) Λmin 4(1 + e DX( θ +ε))ε2 . Proof of Theorem 12. We prove the convergence of (L(ˆθt))t to L(θ ) and then the convergence of (ˆθt)t to θ follows. The convergence of (L(ˆθt))t comes from the first point of Lemma 22. The link between the two convergences is stated in the second point. To study the evolution of L(ˆθt) we first apply a second-order Taylor expansion: for any t 1 there exists 0 αt 1 such that L(ˆθt+1) = L(ˆθt) + L ˆθt (ˆθt+1 ˆθt) + 1 2(ˆθt+1 ˆθt) 2L ˆθt+αt(ˆθt+1 ˆθt)(ˆθt+1 ˆθt) . (12) 4E[XX ], therefore, using the update formula on ˆθ, the second-order term is bounded with (ˆθt+1 ˆθt) 2L ˆθt+αt(ˆθt+1 ˆθt)(ˆθt+1 ˆθt) 1 (1 + eytˆθ t Xt)2 X t P t+1 E[XX ] 4D4 Xλmax(Pt+1)2 1 4D4 Xλmax(Pt)2 . The first-order term is controlled using the definition of the algorithm: ˆθt+1 ˆθt = Pt Pt Xt X t Pt 1 + X t Pt Xtαt αt yt Xt 1 + eytˆθ t Xt , and as αt 1, αt Pt Xt X t Pt 1 + X t Pt Xtαt yt Xt 1 + eytˆθ t Xt D3 Xλmax(Pt)2 . Stochastic Online Optimization using Kalman Recursion θ DX. Substituting our findings in Equation (12), we obtain L(ˆθt+1) L(ˆθt) + L ˆθt Pt yt Xt 1 + eytˆθ t Xt + 2D4 Xλmax(Pt)2 . (13) ˆθt Pt yt Xt 1 + eytˆθ t Xt E L ˆθt Pt yt Xt 1 + eytˆθ t Xt | X1, y1, ..., Xt 1, yt 1 ˆθt Pt yt Xt 1 + eytˆθ t Xt + L Hence we have ˆθt Pt yt Xt 1 + eytˆθ t Xt Mt λmin(Pt) L 2 Mt 1 t D2 X because Ps I s D2 X . Combining it with Equation (13) and summing consecutive terms, we obtain, for any k < t, L(ˆθt) L(ˆθk) Ms 1 s D2 X 2 + 2D4 Xλmax(Ps)2 ! We recall that there exists Cδ such that P(ACδ) 1 δ where λmax(Pt) Cδ On the previous inequality, we see that the right-hand side is the sum of a martingale and a term which is negative for s large enough, under the event ACδ. We are then interested in P((L(ˆθt) L(θ ) > η) | ACδ) for some η > 0. For 0 k t, we define Bk,t the event ( k < s < t, L(ˆθs) L(θ ) > η/2). Then we use the law of total probability: P(L(ˆθt) L(θ ) > η | ACδ) (15) P (L(ˆθt) L(θ ) > η) B0,t | ACδ k=1 P (L(ˆθt) L(θ ) > η) L(ˆθk) L(θ ) η 2 Bk,t | ACδ (16) P (L(ˆθt) L(θ ) > η) B0,t | ACδ k=1 P L(ˆθt) L(ˆθk) > η 2 Bk,t | ACδ . Lemma 22 yields L(ˆθs) L(θ ) > η de Vilmarest and Wintenberger We combine the last equation, along with Equation (14) and the definition of ACδ to get, for any 1 k < t, P (L(ˆθt) L(ˆθk) > η/2) Bk,t | ACδ P s=k Ms > f(k, t) Bk,t | ACδ s=k Ms > f(k, t) | ACδ where f(k, t) = η 2 + D2 η D2 X 1 s 2D4 XC2 δ 1 s2(1 β) for any 1 k < t. Similarly, we get P (L(ˆθt) L(θ ) > η) B0,t | AC P s=1 Ms > f0(t) | AC with f0(t) = η (L(ˆθ1) L(θ )) + D2 η D2 X 1 s 2D4 XC2 δ 1 s2(1 β) for any t 1. We have E[Ms | X1, y1, ..., Xs 1, ys 1] = 0, and almost surely |Ms| 2D2 Xλmax(Ps). We can therefore apply Azuma-Hoeffding inequality: for t, k such that f(k, t) > 0, s=k Ms > f(k, t) | ACδ f(k, t)2 (1 2β) max 1/2, (k 1)1 2β because + P 1 s2(1 β) 1 (1 2β) max(1/2,(k 1)1 2β). Similarly, for t such that f0(t) > 0, s=1 Ms > f0(t) | ACδ exp f0(t)2 1 2β We need to control f(k, t), f0(t). We see that for t large enough, when k is small compared to t, f(k, t) is driven by D2 η D2 X ln(t) and when k t, f(k, t) is driven by η/2. The following Lemma formally states these approximations as lower-bounds. We prove it right after the end of this proof. Lemma 23 For t max 16D6 X C2 δ D2η(1 2β) , 1 + 8D4 XC2 δ η(1 2β) 1 1 2β 2! f(k, t) D2 η 4D2 X ln(t), 1 k < Similarly, for t e L(ˆθ1) L(θ )+ 4D4 X C2 δ 1 2β f0(t) D2 η 2D2 X ln(t) . Stochastic Online Optimization using Kalman Recursion Then, defining C1 = D4 η(1 2β) 256D8 XC2 δ and C2 = η2(1 2β) 128D4 XC2 δ , we finally get for t large enough: P (L(ˆθt) L(θ ) > η) B0,t | ACδ exp 4C1 ln(t)2 , P (L(ˆθt) L(θ ) > η) (L(ˆθk) L(θ ) η 2) Bk,t | ACδ ( exp C1 ln(t)2 if 1 k < t , exp C2(k 1)1 2β if Substituting in Equation (16) yields: P(L(ˆθt) L(θ ) > η | AC) exp 4C1 ln(t)2 + k=1 exp C1 ln(t)2 + t exp C2(k 1)1 2β t + 1) exp C1 ln(t)2 + t exp C2( Finally, Point 2 of Lemma 22 allows to obtain the result: defining η = Λminε2 4(1+e DX ( θ +ε)), we obtain P( ˆθt θ > ε | ACδ) P(L(ˆθt) L(θ ) > η | ACδ) t + 1) exp C1 ln(t)2 + t exp C2( In order to obtain the constants involved in the Theorem, we write Dη = Λmin q 4(1+e DX ( θ +ε)) 2DX(1 + exp DX( θ + r D2 X(1+e DX ( θ +ε))) ) Λmin 1 + e DX( θ +ε) 3/2 ε 4DX , C1 Λ6 min(1 2β)ε4 216D12 X C2 δ (1 + e DX( θ +ε))6 , C2 Λ2 min(1 2β)ε4 211D4 XC2 δ (1 + e DX( θ +ε))2 , and the conditions of Lemma 23 become 28D8 XC2 δ (1 + e DX( θ +ε))3 Λ3 min(1 2β)ε2 32D4 XC2 δ (1 + e DX( θ +ε)) (1 2β)Λminε2 32D4 X(1 + e DX( θ +ε))3 L(ˆθ1) L(θ ) + 4D4 XC2 δ 1 2β de Vilmarest and Wintenberger We would like to obtain a single condition on t, thus we write 32D4 XC2 δ (1 + e DX( θ +ε)) (1 2β)Λminε2 32D4 XC2 δ (1 + e DX( θ +ε)) (1 2β)Λminε2 1 + 32D4 XC2 δ (1 + e DX( θ +ε)) (1 2β)Λminε2 32D4 XC2 δ (1 + e DX( θ +ε)) (1 2β)Λminε2 28D8 XC2 δ (1 + e DX( θ +ε))3 Λ3 min(1 2β)3/2ε2 The third line is obtained with the inequality ln(1 + x) x for any x > 0. Obviously, as 0 < 1 2β < 1, the first threshold on t is bounded by: 28D8 XC2 δ (1 + e DX( θ +ε))3 Λ3 min(1 2β)ε2 28D8 XC2 δ (1 + e DX( θ +ε))3 Λ3 min(1 2β)3/2ε2 To handle the third one, we use D2 XCδ 4D2 X Λmin 4 and as ˆθ1 = 0 we obtain L(ˆθ1) L(θ ) ln 2 4D4 XC2 δ 1 2β , hence 32D4 X(1 + e DX( θ +ε))3 L(ˆθ1) L(θ ) + 4D4 XC2 δ 1 2β 28D8 XC2 δ (1 + e DX( θ +ε))3 Λ3 min(1 2β)3/2ε2 Proof of Lemma 23. We recall that for any k 1, 1 s ln t ln k , 1 s2(1 β) 1 1 2β 1 max(1/2, (k 1)1 2β) . 2 + D2 η D2 X (ln t ln k) 2D4 XC2 δ 1 2β 1 max(1/2, (k 1)1 2β) , f0(t) η (L(ˆθ1) L(θ ) + D2 η D2 X ln t 4D4 XC2 δ 1 2β . Stochastic Online Optimization using Kalman Recursion For any 1 k < t, it holds ln k 1 2 ln t, and we have f(k, t) D2 η 2D2 X ln(t) 4D4 XC2 δ 1 2β . 16D6 X C2 δ D2η(1 2β) yields f(k, t) D2 η 4D2 X ln(t). For t 2 and any k 2 2D4 XC2 δ (1 2β)(k 1)1 2β η 2 2D4 XC2 δ (1 2β)( Then if t 1 + 8D4 XC2 δ η(1 2β) 1 1 2β 2 , we get f(k, t) η Last point comes from f0(t) D2 η D2 X ln t (L(ˆθ1) L(θ ) 4D4 XC2 δ 1 2β . Proof of Corollary 13. We apply Theorem 12: for any t exp 28D8 XC2 δ/2(1+e DX ( θ +ε))3 Λ3 min(1 2β)3/2ε2 P( ˆθt θ > ε | ACδ/2) ( t + 1) exp C1 ln(t)2 + t exp C2( C1 = Λ6 min(1 2β)ε4 216D12 X C2 δ/2(1 + e DX( θ +ε))6 , C2 = Λ2 min(1 2β)ε4 211D4 XC2 δ/2(1 + e DX( θ +ε))2 . We use a union bound: for any T exp 28D8 XC2 δ/2(1+e DX ( θ +ε))3 Λ3 min(1 2β)3/2ε2 t=T+1 ( ˆθt θ > ε) | ACδ/2 t + 1) exp C1 ln(t)2 t>T t exp C2( 3 2C1 , we have t + 1) exp C1 ln(t)2 X de Vilmarest and Wintenberger For t 4, 1 1/ t 1/2, then for t 12 C2(1 2β) 4/(1 2β) , t 1)1 2β exp 3 ln(t) C2 2 t(1 2β)/2 exp 12 1 2β ln 12 C2(1 2β) 12 C2(1 2β) because for any x > 0, we have ln x x/2. Thus for T 12 C2(1 2β) 4/(1 2β) t>T t exp C2( t 1)1 2β 1/T . Finally, for T satisfying the previous conditions as well as T 6δ 1, we obtain t=T+1 ( ˆθt θ > ε) | ACδ/2 We now compare the constants involved. As long as εDX 1, we have 28D8 XC2 δ/2(1 + e DX( θ +ε))3 Λ3 min(1 2β)3/2ε2 3 215D12 X C2 δ/2(1 + e DX( θ +ε))6 Λ6 min(1 2β)3/2ε4 Furthermore, as 1 2β 1, we have 3 215D12 X C2 δ/2(1 + e DX( θ +ε))6 Λ6 min(1 2β)ε4 3 215D12 X C2 δ/2(1 + e DX( θ +ε))6 Λ6 min(1 2β)3/2ε4 Stochastic Online Optimization using Kalman Recursion 12 C2(1 2β) 4/(1 2β) = exp 4 1 2β ln 12 C2(1 2β) 4 1 2β ln 12 211D4 XC2 δ/2(1 + e DX( θ +ε))2 Λ2 min(1 2β)2ε4 8 1 2β ln 12 211D4 XC2 δ/2(1 + e DX( θ +ε))2 Λ2 min(1 2β)ε4 3 213D4 XC2 δ/2(1 + e DX( θ +ε))2 Λ2 min(1 2β)ε4 629D2 XCδ/2(1 + e DX( θ +ε)) Λmin(1 2β)3/2ε2 3 215D12 X C2 δ/2(1 + e DX( θ +ε))6 Λ6 min(1 2β)3/2ε4 Appendix D. Proofs of Section 5 Proof of Proposition 14. The first order condition of the optimum yields arg min θ Rd s=1 (ys θ Xs)2 + 1 2(θ ˆθ1) P 1 1 (θ ˆθ1) = ˆθ1 + Pt s=1 (ys ˆθ 1 Xs)Xs . Therefore we prove recursively that ˆθt ˆθ1 = Pt Pt 1 s=1(ys ˆθ 1 Xs)Xs. It is clearly true at t = 1. Assuming it is true for some t 1, we use the update formula ˆθt+1 ˆθ1 = (I Pt+1Xt X t )(ˆθt ˆθ1) + Pt+1yt Xt Pt+1Xt X t ˆθ1 = (I Pt+1Xt X t )Pt s=1 (ys ˆθ 1 Xs)Xs + Pt+1(yt ˆθ 1 Xt)Xt . We conclude with the following identity: (I Pt+1Xt X t )Pt = Pt Pt Xt X t Pt + Pt Xt X t Pt Xt X t Pt X t Pt Xt + 1 = Pt Pt Xt X t Pt X t Pt Xt + 1 = Pt+1 . de Vilmarest and Wintenberger D.1 Proof of Theorem 16 We first prove a result controlling the first estimates of the algorithm. Lemma 24 Provided that Assumptions 1, 2 and 4 are satisfied, starting from any ˆθ1 Rd and P1 0, for any δ > 0, it holds simultaneously ˆθt θ ˆθ1 θ + λmax(P1)DX (3σ + Dapprox)(t 1) + 3σ ln δ 1 , t 1, with probability at least 1 δ. Proof From Proposition 14, we obtain, for any t 1, ˆθt ˆθ1 = Pt Pt 1 s=1(ys ˆθ 1 Xs)Xs. Consequently, s=1 (ys ˆθ 1 Xs)Xs Pt s=1 (ys θ Xs)Xs + Pt P 1 1 (ˆθ1 θ ) , and using Pt P 1 1 I, we obtain ˆθt θ ˆθ1 θ + λmax(Pt)DX s=1 |ys θ Xs| ˆθ1 θ + λmax(P1)DX s=1 (|ys E[ys | Xs]| + Dapp) . (17) We apply Lemma 1.4 of Rigollet and H utter (2015) in the second line of the following: for any µ such that 0 < µ < 1 2 E [exp(µ|yt E[yt | Xt]|)] = 1 + X µi E[|yt E[yt | Xt]|i] µi(2σ2)i/2iΓ(i/2) 2µσ i , because Γ(i/2) Γ(i) = (i 1)! 2µσ, because 0 < Therefore exp 1 2 s=1 (|ys E[ys | Xs]| 2 t is a super-martingale to which we can apply Lemma 17. We obtain, for any δ > 0, s=1 |yt E[yt | Xt]| 2 2(t 1)σ + 2 2σ ln δ 1, t 1, Stochastic Online Optimization using Kalman Recursion with probability at least 1 δ. The result follows from Equation (17) and 2 Proof of Theorem 16. We first apply Theorem 2: with probability at least 1 5δ, it holds simultaneously for all n T(ε, δ) t=T(ε,δ)+1 L(ˆθt) L(θ ) 15 2 d 8σ2 + D2 app + ε2D2 X ln 1 + (n T(ε, δ))λmax(P1)D2 X d + 5λmax P 1 T(ε,δ)+1 ε2 + 115 σ2(4 + λmax(P1)D2 X 4 ) + D2 app + 2ε2D2 X Moreover, λmax P 1 T(ε,δ)+1 λmax(P 1 1 ) + T(ε, δ)D2 X. Then we derive a bound on the first T(ε, δ) terms. For any t 1, we have L(ˆθt) L(θ ) D2 X ˆθt θ 2, thus, using (a + b)2 2(a2 + b2) and applying Lemma 24 we obtain the simultaneous property L(ˆθt) L(θ ) 2D2 X( ˆθ1 θ + 3λmax(P1)DXσ ln δ 1)2 + 2λmax(P1)2D4 X(3σ + Dapp)2(t 1)2, t 1, with probability at least 1 δ. A summation argument yields, for any δ > 0, t=1 L(ˆθt) L(θ ) 2D2 X( ˆθ1 θ + 3λmax(P1)DXσ ln δ 1)2T(ε, δ) + λmax(P1)2D4 X(3σ + Dapp)2 (T(ε, δ) 1)T(ε, δ)(2T(ε, δ) 1) with probability at least 1 δ. D.2 Definition of T(ε, δ) We now focus on the definition of T(ε, δ). We first transcript the result of Hsu et al. (2012) to our notations in the following lemma. Lemma 25 Provided that Assumptions 1, 2 and 4 are satisfied, starting from any ˆθ1 Rd and P1 = p1I, p1 > 0, we have, for any 0 < δ < e 2.6 and t 6 D2 X Λmin (ln d + ln δ 1), ˆθt+1 θ 2 Σ 3 2p1 + D2 X Λmin D2 app 4(1 + 8 ln δ 1) 0.072 + 3σ2(d/0.035 + ln δ 1) + 12 0.072t2 D2 X Λmin (1 + + DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1 2 (ln δ 1)2 ! with probability at least 1 4δ. de Vilmarest and Wintenberger Proof We first observe that arg min w Rd 1 s=1 (ys w Xs)2 + λ w ˆβ1 2 = arg min w Rd 1 s=1 (ys ˆβ 1 Xs w Xs)2 + λ w 2 , therefore we apply the ridge analysis of Hsu et al. (2012) to (Xs, ys ˆβ 1 Xs). We note that (ys ˆβ 1 Xs) has the same variance proxy and the same approximation error, therefore it only amounts to translate the optimal w, that is denoted by β. For any λ > 0, we observe that d2,λ d1,λ d , ρλ DX pd1,λΛmin , bλ ρλ(Dapp + DX β ˆβ1 ) . Therefore we can apply Theorem 16 of Hsu et al. (2012): for 0 < δ < e 2.6 and t 6 DX Λmin (ln(d)+ln δ 1), it holds that ˆβt+1,λ β 2 Σ = 3( βλ β 2 Σ+εbs+εvr) with probability 1 4δ, with εbs 4 0.072 D2 X Λmin E[(E[y | X] β X)2] + (1 + D2 X Λmin ) βλ β 2 Σ t (1 + + ( DX Λmin (Dapp + DX β ˆβ1 ) + βλ β Σ)2 t2 (ln δ 1)2 , t DX Λmin (1 + 8 ln δ 1) + 1 D4 X Λ2 mind + 1 εvr σ2d(1 + δf) 0.072t + 2σ2p d(1 + δf) ln δ 1 0.073/2t + 2σ2 ln δ 1 Moreover E[(E[y | X] β X)2] D2 app and Λmin D2 X, hence, using βλ β Σ λ β ˆβ1 we transfer the result in our KF notations, that is, ˆθt = ˆβt,p 1 1 /2(t 1), ˆβ1 = ˆθ1, β = θ . We obtain, for any 0 < δ < e 2.6 and t 6 DX Λmin (ln(d) + ln δ 1), εbs 4 0.072 D2 X Λmin D2 app + D2 X Λmin ˆθ1 θ 2 + ( DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1t )2 t2 (ln δ 1)2 , t DX Λmin (1 + 8 ln δ 1) + 1 D4 X Λ2 mind + 1 εvr σ2d(1 + δf) 0.072t + 2σ2p d(1 + δf) ln δ 1 0.073/2t + 2σ2 ln δ 1 ˆθt+1 θ 2 Σ 3 2p1t + εbs + εvr Stochastic Online Optimization using Kalman Recursion with probability at least 1 4δ. For t D2 X Λmin ln δ 1, as ln δ 1 1, we get 6 ln δ 1 (1 + 8 ln δ 1) + 1 1 d + 1 1 + 2 9 1.9 2 . 2 for any a, b > 0, we have 3d 0.07 + 2 0.07 + 2 ln δ 1 ! 0.07 + 3 ln δ 1 3σ2(d/0.035 + ln δ 1) It yields the result. Lemma 25 allows the definition of an explicit value for T(ε, δ), as displayed in the following Corollary. Corollary 26 Assumption 5 is satisfied for T(ε, δ) = max(T1(δ), T2(ε, δ), T3(ε, δ)) where we define T1(δ) = max 12 D2 X Λmin (ln d + ln δ 1), 48D2 X Λmin ln 24D2 X Λmin T2(ε, δ) = 24ε 1 2p1 + D2 X Λmin D2 app 4(1 + 8 ln δ 1) 0.072 + 3σ2(d/0.035 + ln δ 1) 2p1 + D2 X Λmin D2 app 4(1 + 8 ln δ 1) 0.072 + 3σ2(d/0.035 + ln δ 1) D2 X Λmin (1 + + DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1 2 (ln δ 1)2 !1/2 2p1 (1 + D2 X Λmin )(1 + + DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1 2 (ln δ 1)2 ! We recall that for any η 1, we have ln t t η for t 2η 1 ln(η 1), and we use it in the following proof. Proof of Corollary 26. We define δt = δ/t2 for any t 1. In order to apply Lemma 25 with a union bound, we need t 6 D2 X Λmin (ln d + ln δ 1 t ). If t 12 D2 X Λmin (ln d + ln δ 1) and de Vilmarest and Wintenberger t 48D2 X Λmin ln 24D2 X Λmin , we obtain 6 D2 X Λmin (ln d + ln δ 1) + 12D2 X Λmin ln t, as ln t = 6 D2 X Λmin (ln d + ln δ 1 t ) . Therefore, we define T1(δ) = max 12 D2 X Λmin (ln d + ln δ 1), 48D2 X Λmin ln 24D2 X Λmin , and we apply Lemma 25. We get the simultaneous property ˆθt+1 θ 2 Σ 3 2p1 + D2 X Λmin D2 app 4(1 + q 8 ln δ 1 t ) 0.072 + 3σ2(d/0.035 + ln δ 1 t ) 0.07 + 12 0.072t2 D2 X Λmin (1 + q 8 ln δ 1 t ) + DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1 2 (ln δ 1 t )2 ! for all t T1(δ), with probability at least 1 4δ P t T1(δ) t 2 1 δ because T1(δ) > 4. Thus, as ln t 1 for t T1(δ) and ˆθt+1 θ 2 Σ Λmin ˆθt+1 θ 2, we obtain ˆθt+1 θ 6 ln t 2p1 + D2 X Λmin D2 app 4(1 + 8 ln δ 1) 0.072 + 3σ2(d/0.035 + ln δ 1) + 48(ln t)2 0.072Λmint2 D2 X Λmin (1 + + DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1 2 (ln δ 1)2 ! for all t T1(δ, with probability at least 1 δ. Finally, both terms of the last inequality are bounded by ε/2. From Corollary 26, we obtain the asymptotic rate by comparing T2(δ) and T3(δ). We write T2(δ) = 2A2(δ) ln A2(δ), T3(δ) = 2A3(δ) ln A3(δ) with p1 + D2 X Λmin D2 app ln δ 1 + σ2(d + ln δ 1) v u u t ε 1 ln δ 1 + DX(Dapp + DX θ ) Λmin + ˆθ1 θ p1 2 (ln δ 1)2 ! Stochastic Online Optimization using Kalman Recursion where the symbol means less than up to universal constants. As ab a + b, we obtain + DX Λmin (Dapp + DX θ ) + ˆθ1 θ p1 p1 + D2 X Λmin + DX Λmin (Dapp + DX θ ) + ˆθ1 θ p1 Thus, as long as ε 1 Λmin 1, we get A2(δ), A3(δ) ε 1 p1 + D2 X Λmin (1 + D2 app) ln δ 1 + σ2d + DX Λmin (Dapp + DX θ ) + ˆθ1 θ p1 + σ2 ln δ 1 ! Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10 (2):251 276, 1998. Francis Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(1):595 627, 2014. Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate o(1/n). In Advances in Neural Information Processing Systems, pages 773 781, 2013. Bernard Bercu and Abderrahmen Touati. Exponential inequalities for self-normalized martingales with applications. The Annals of Applied Probability, 18(5):1848 1869, 2008. Bernard Bercu, Antoine Godichon, and Bruno Portier. An efficient stochastic newton algorithm for parameter estimation in logistic regressions. SIAM Journal on Control and Optimization, 58(1):348 367, 2020. Jock A. Blackard and Denis J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131 151, 1999. Nicolo Cesa-Bianchi and G abor Lugosi. Prediction, Learning, and Games. Cambridge university press, 2006. de Vilmarest and Wintenberger George T. Diderrich. The Kalman filter from the perspective of Goldberger Theil estimators. The American Statistician, 39(3):193 198, 1985. James Durbin and Siem J. Koopman. Time Series Analysis by State Space Methods. Oxford university press, 2012. Ludwig Fahrmeir. Posterior mode estimation by extended Kalman filtering for multivariate dynamic generalized linear models. Journal of the American Statistical Association, 87 (418):501 509, 1992. David A. Freedman. On tail probabilities for martingales. the Annals of Probability, pages 100 118, 1975. Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169 192, 2007. Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on Learning Theory, pages 9 1, 2012. Sham M. Kakade and Andrew Y. Ng. Online bounds for bayesian algorithms. In Advances in Neural Information Processing Systems, pages 641 648, 2005. Rudolph E. Kalman and Richard S. Bucy. New results in linear filtering and prediction theory. Journal of Basic Engineering, 83(1):95 108, 1961. Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In The International Conference on Knowledge Discovery and Data Mining, volume 96, pages 202 207, 1996. Tomer Koren. Open problem: Fast stochastic exp-concave optimization. In Conference on Learning Theory, pages 1073 1075, 2013. Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Conference on Learning Theory, pages 1305 1320, 2015. Peter Mc Cullagh and John A. Nelder. Generalized Linear Models. London Chapman and Hall, 2nd ed edition, 1989. Noboru Murata and Shun-ichi Amari. Statistical analysis of learning dynamics. Signal Processing, 74(1):3 28, 1999. Yann Ollivier. Online natural gradient as a Kalman filter. Electronic Journal of Statistics, 12(2):2930 2961, 2018. Dmitrii M. Ostrovskii and Francis Bach. Finite-sample analysis of m-estimators using selfconcordance. Electronic Journal of Statistics, 15(1):326 391, 2021. Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838 855, 1992. Stochastic Online Optimization using Kalman Recursion Phillippe Rigollet and Jan-Christian H utter. High dimensional statistics. Lecture notes for course 18S997, 2015. Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400 407, 1951. David Ruppert. Efficient estimations from a slowly convergent robbins-monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988. Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389 434, 2012. Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928 936, 2003.