# stochastic_online_optimization_using_kalman_recursion__1bc660ea.pdf

Journal of Machine Learning Research 22 (2021) 1-55 Submitted 6/20; Revised 8/21; Published 9/21

Stochastic Online Optimization using Kalman Recursion

Joseph de Vilmarest joseph.de vilmarest@sorbonne-universite.fr Laboratoire de Probabilit es, Statistique et Mod elisation, Sorbonne Universit e, CNRS 4 place Jussieu, 75005 Paris, France and Electricit e de France R&D

Olivier Wintenberger olivier.wintenberger@sorbonne-universite.fr Laboratoire de Probabilit es, Statistique et Mod elisation, Sorbonne Universit e, CNRS 4 place Jussieu, 75005 Paris, France

Editor: Tong Zhang

We study the Extended Kalman Filter in constant dynamics, oﬀering a bayesian perspective of stochastic optimization. For generalized linear models, we obtain high probability bounds on the cumulative excess risk in an unconstrained setting, under the assumption that the algorithm reaches a local phase. In order to avoid any projection step we propose a twophase analysis. First, for linear and logistic regressions, we prove that the algorithm enters a local phase where the estimate stays in a small region around the optimum. We provide explicit bounds with high probability on this convergence time, slightly modifying the Extended Kalman Filter in the logistic setting. Second, for generalized linear regressions, we provide a martingale analysis of the excess risk in the local phase, improving existing ones in bounded stochastic optimization. The algorithm appears as a parameter-free online procedure that optimally solves some unconstrained optimization problems. Keywords: extended kalman ﬁlter, online learning, stochastic optimization

1. Introduction

The optimization of convex functions is a long-standing problem with many applications. In supervised machine learning it frequently arises in the form of the prediction of an observation yt R given explanatory variables Xt Rd. The aim is to minimize a cost depending on the prediction and the observation. We focus in this article on linear predictors, hence the loss function is of the form ℓ(yt, θ Xt). Two important settings have emerged in order to analyse learning algorithms. In the online setting (Xt, yt) may be set by an adversary. The assumption required is boundedness and the goal is to upper estimate the regret (cumulative excess loss compared to the optimum). In the stochastic setting with independent identically distributed (i.i.d.) (Xt, yt), we deﬁne the risk L(θ) = E[ℓ(y, θ X)]. We focus on the cumulative excess risk Pn t=1 L(ˆθt) L(θ ) where θ minimizes the risk. We obtain bounds holding with high probability simultaneously for any horizon, that is, we control the whole trajectory of the risk. Furthermore, our bounds on the cumulative risk all lead to similar ones on the excess risk at any step for the averaged version of the algorithm. Due to its low computational cost the Stochastic Gradient Descent (SGD) of Robbins and Monro (1951) has been widely used, along with its equivalent in the online setting,

c 2021 Joseph de Vilmarest and Olivier Wintenberger.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-618.html.

de Vilmarest and Wintenberger

the online gradient descent (Zinkevich, 2003) and a simple variant where the iterates are averaged (Ruppert, 1988; Polyak and Juditsky, 1992). More recently Bach and Moulines (2013) provided a sharp bound in expectation on the excess risk for a two-step procedure that has been extended to the average of SGD with a constant step size (Bach, 2014). Second-order methods based on stochastic versions of Newton-Raphson algorithm have been developed in order to converge faster in iterations, although with a bigger computational cost per iteration (Hazan et al., 2007). In order to obtain a parameter-free second-order algorithm we apply a bayesian perspective, seeing the loss as a negative log-likelihood and approximating the maximum-likelihood estimator at each step. We get a state-space model interpretation of the optimization problem: in a well-speciﬁed setting the space equation is yt pθt( | Xt) exp( ℓ( , θ t Xt)) with θt Rd and the state equation deﬁnes the dynamics of the state θt. The stochastic convex optimization setting corresponds to a degenerate constant state-space model θt = θt 1 called static. As usual in State-Space models, the optimization is realized with the Kalman recursion (Kalman and Bucy, 1961) for the quadratic loss and the Extended Kalman Filter (EKF) (Fahrmeir, 1992) in a more general case. A correspondence has recently been made by Ollivier (2018) between the static EKF and the online natural gradient (Amari, 1998). This motivates a risk analysis in order to enrich the link between Kalman ﬁltering and the optimization community. We may see the static EKF as the online approximation of bayesian model averaging, and similarly to its analysis derived by Kakade and Ng (2005) our analysis is robust to misspeciﬁcation, that is we don t assume the data to be generated by the probabilistic model. The static EKF is very close to the Online Newton Step (ONS) introduced by Hazan et al. (2007) as both are second-order online algorithms and our results are of the same ﬂavor as those obtained on the ONS (Mahdavi et al., 2015). However the ONS requires the knowledge of the region in which the optimization is realized. It is involved in the choice of the gradient step size and a projection step is done to ensure that the search stays in the chosen region. On the other hand the static EKF yields two advantages at the cost of being less generic. First, there is no costly projection step and each recursive update runs in O(d2) operations, where d is the dimension of the features Xt. Therefore, our comparison of the static EKF with the ONS provides a lead to the open question of Koren (2013). Indeed, the problem of the ONS pointed out by Koren (2013) is to control the cost of the projection step and the question is whether it is possible to perform better than the ONS in the stochastic exp-concave setting. We don t answer the open question in the general setting. However, we suggest a general way to get rid of the projection by dividing the analysis between a convergence proof of the algorithm to the optimum and a second phase where the estimate stays in a small region around the optimum where no projection is required. Second, the algorithm is (nearly) parameter-free. We believe that bayesian statistics is the reasonable approach in order to obtain parameter-free online algorithms in the unconstrained setting. Parameter-free is not exactly correct as there are initialization parameters, which we see as a smoothed version of the hard constraint imposed by bounded algorithm, but they have no impact on the leading terms of our bounds. Static Kalman ﬁlter coincides with the ridge forecaster and similarly the static EKF may be seen as the online approximation of a regularized empirical risk minimizer.

Stochastic Online Optimization using Kalman Recursion

1.1 Related Work

Theoretical guarantees for online and stochastic algorithms are multi-criteria and of various natures. The comparison of upper-bounds or computational complexity highly depends on the values of d the dimension of the explanatory vectors and n the time horizon, leading to diﬀerent views on whether the dependence to d or n is the most important. The nature of the guarantee obviously depends on the objective pursued. In the advesarial setting, the learner suﬀers a loss ℓt(ˆθt) depending on its estimate ˆθt at each time step t. It is natural to minimize the cumulative loss, or equivalently the regret

t=1 ℓt(ˆθt)

t=1 ℓt(θ ) ,

where θ reaches the minimum value of the cumulative loss and thus highly depends on (ℓt)1 t n. Under an assumption of bounded gradients, Zinkevich (2003) proved that a ﬁrstorder online gradient descent yields a regret bound in O( n). The Online Newton Step (ONS) is a second-order online gradient descent that has been designed to obtain a regret bound in O(ln n) (Hazan et al., 2007) under the assumption that the losses are exp-concave. The improved guarantee comes at a cost of O(d2) operations per step instead of O(d), along with a projection at each step whose cost depends on the data. In the stochastic setting where the losses (ℓt) are assumed i.i.d., the aim is to minimize the risk L(θ) = E[ℓ(θ)]. A natural candidate is the Empirical Risk Minimizer (ERM), whose asymptotics are well understood (see for example Murata and Amari (1999)). Assuming the existence of θ minimizing the risk and that the Hessian matrix H = 2 θ2 L(θ ) is positive deﬁnite, the ERM ˆθn satisﬁes

E[L(ˆθn)] L(θ ) = tr(G H 1)

2n + o(1/n) , G = E h

θℓ(y, θ X) i .

Although in the well-speciﬁed setting the identity tr(G H 1) = d holds, in the misspeciﬁed case there is no general estimate for tr(G H 1). Recently a non-asymptotic bound L(ˆθn) L(θ ) = O(tr(G H 1) ln δ 1/n) holding with probability 1 δ has been shown by Ostrovskii and Bach (2021) on the ERM. However the ERM is deﬁned only implicitly and may have important computational cost, hence recursive algorithms based on gradient descent have been studied under diﬀerent sets of assumptions to bound tr(G H 1). The most simple is Stochastic Gradient Descent (SGD), where each step is in the opposite direction of the gradient. This algorithm has been widely used with various step sizes. Sharp results have been obtained by Bach (2014) for a constant gradient step size C/ n with ﬁxed horizon n. Under the assumption that gradients are bounded by R we have tr(G H 1) R2/µ where µ is the minimal eigenvalue of H . The fast rate E[L(θn)] L(θ ) = O(R2/(µn)) is obtained by Bach (2014) for the averaged estimate θn of SGD. In the same article the author also derives a bound with high probability but with a slower rate: it degrades into L(θn) L(θ ) = O(log δ 1/ n) with probability 1 δ. Finally, in the quadratic setting a fast rate L(θn) L(θ ) = O(1/(nδα)) is achieved with probability 1 δ for a deﬁned α > 0 (Bach and Moulines, 2013). To obtain fast rates with high probability beyond the quadratic setting, it seems necessary to use second-order information as in the ONS (Mahdavi et al., 2015). Under the

de Vilmarest and Wintenberger

ERM Averaged SGD ONS This article Regret O(ln n) Excess risk in expectation O( 1

n) Excess risk w.h.p. O(ln δ 1

n ) O(ln δ 1 n ) Cumulative excess risk w.h.p. O(ln n + ln δ 1) O(ln n + ln δ 1 + S(δ)) Cost per iteration Implicit O(d) O(d2) + Tproj O(d2)

Table 1: Summary of existing results along with the static EKF for which we prove the bound for the cumulative excess risk. We focus on the dependence on n, and δ for the bounds holding with probability 1 δ (w.h.p.). S(δ) is the cumulative excess risk of the convergence phase.

assumption that the loss is α-exp-concave, tr(G H 1) d/α and for the averaged version of the ONS the rate L(θn) L(θ ) = O(d(ln n + ln δ 1)/(αn)) with probability 1 δ is obtained. From our perspective, the result is stronger than what is claimed by Mahdavi et al. (2015): the bound obtained is

t=1 L(ˆθn) L(θ ) = O(ln n + ln δ 1) , (1)

holding simultaneously for any n with probability 1 δ. Note that although averaging this bound with Jensen s inequality leads to a sub-optimal bound on the excess risk of the last averaged estimate, it is conversely not possible to obtain Equation (1) from

L(ˆθn) L(θ ) = O(ln δ 1/n) ,

holding with probability 1 δ.

1.2 Contributions

Our ﬁrst contribution is a local analysis of the static EKF under assumptions deﬁned in Section 2, and provided that consecutive steps stay in a small ball around the optimum θ . We derive local bounds on the cumulative risk with high probability from a martingale analysis. Our analysis of Section 3 is similar to the one of Mahdavi et al. (2015) and we slightly reﬁne their constants as a by-product. We then show that the convergence property crucial in our analysis is reachable. To that end we focus on linear regression and logistic regression as these two well-known problems are challenging in the unconstrained setting. In linear regression, the gradient of the loss is not bounded globally. In logistic regression, the loss is strictly convex, but neither strongly convex nor exp-concave in the unconstrained setting. In Section 4, we develop a global bound in the logistic setting on a slight modiﬁcation of the algorithm introduced by Bercu et al. (2020). We prove that this modiﬁed algorithm converges and stays into a local region around θ after a ﬁnite number of steps. Moreover we show that it coincides with the static EKF and thus our local analysis applies. In Section 5, we apply our local analysis to the quadratic setting. We rely on Hsu et al. (2012) to obtain the convergence of the algorithm after exhibiting the correspondence between Kalman ﬁlter in constant dynamics and the ridge forecaster, and we therefore obtain similarly a global bound.

Stochastic Online Optimization using Kalman Recursion

Finally, we demonstrate numerically the competitiveness of the static EKF for logistic regression in Section 6.

2. Deﬁnitions and Assumptions

We consider loss functions that may be written as the negative log-likelihood of a generalized linear model (Mc Cullagh and Nelder, 1989). Formally, the loss is deﬁned as ℓ(y, θ X) = log pθ(y | X) where θ Rd, (X, y) X Y for some X Rd and Y R and pθ is of the form

pθ(y | X) = h(y) exp y θ X b(θ X)

where a is a constant and h and b are one-dimensional functions on which a few assumptions are required (Assumption 3). This includes logistic and quadratic regressions, see Sections 4 and 5. We display the static EKF in Algorithm 1 in this setting (see Appendix A for a derivation relying on Durbin and Koopman, 2012). In the quadratic setting, noting that the EKF estimate ˆθt does not depend on the (unknown) variance σ2, we consider the quadratic loss ℓ(y, ˆy) = (y ˆy)2/2 by convention.

Algorithm 1: Static Extended Kalman Filter for Generalized Linear Model

1. Initialization: P1 is any positive deﬁnite matrix, ˆθ1 is any initial parameter in Rd.

2. Iteration: at each time step t = 1, 2, . . .

(a) Update Pt+1 = Pt Pt Xt X t Pt 1+X t Pt Xtαt αt with αt = b (ˆθ t Xt) a .

(b) Update ˆθt+1 = ˆθt + Pt+1 (yt b (ˆθ t Xt))Xt a .

Due to matrix-vector and vector-vector multiplications, Algorithm 1 has a running-time complexity of O(d2) at each iteration and thus O(nd2) for n iterations. Note that although we need the loss function to be derived from a likelihood of the form (2), we do not need the data to be generated under this process. We need two standard hypotheses on the data. The ﬁrst one is the i.i.d. assumption and bounded random design (all along the paper . is the Euclidean norm):

Assumption 1 The observations (Xt, yt)t are i.i.d. copies of the pair (X, y) X Y and there exists DX such that Xt DX almost surely (a.s.).

Working under Assumption 1, we deﬁne the risk function L(θ) = E ℓ(y, θ X) . Note that in Section 3 we don t need E[XX ] invertible, but we will make such an assumption in Sections 4 and 5 to prove the convergence of the algorithm in the logistic and quadratic settings, respectively. In order to work on a well-deﬁned optimization problem we assume there exists a minimum:

Assumption 2 There exists θ Rd such that L(θ ) = inf θ Rd L(θ).

de Vilmarest and Wintenberger

We treat two diﬀerent settings requiring diﬀerent assumptions, summarized in Assumption 3 and 4 respectively. First, motivated by logistic regression we deﬁne:

Assumption 3 There exists (κε)ε>0, (hε)ε>0 and ρε ε 0 1 such that for any ε > 0 and

any θ, θ0 Rd satisfying max( θ θ , θ0 θ ) ε, we have

ℓ (y, θ X)2 κεℓ (y, θ X) a.s.

ℓ (y, θ X) hε a.s.

ℓ (y, θ X) ρεℓ (y, θ 0 X) a.s.

Here ℓ and ℓ are the ﬁrst and second derivatives of ℓwith respect to the second variable. Assumption 3 requires local exp-concavity (around θ ) along with some regularity on ℓ

(ℓ continuous and ℓ (y, θ X) γ > 0 a.s. is suﬃcient). That setting implies Y bounded, because ℓ depends on y whereas ℓ doesn t. In logistic regression, Y = { 1, +1} and Assumption 3 is satisﬁed for κε = e DX( θ +ε), hε = 1

4, ρε = e εDX. Second, we consider the quadratic loss, corresponding to a gaussian model. In order to include the well-speciﬁed setting and to bound G = E[(y θ X)2XX ], we assume y sub-gaussian conditionally to X and not too far away from the model:

Assumption 4 The distribution of (X, y) X Y satisﬁes

There exists σ2 > 0 such that for any s R, E es(y E[y|X]) | X e σ2s2

There exists Dapp 0 such that |E[y | X] θ X| Dapp a.s.

Both conditions of Assumption 4 hold with Y = R and Dapp = 0 for the well-speciﬁed gaussian linear model with random bounded design. The second condition of Assumption 4 is satisﬁed for Dapp > 0 in misspeciﬁed sub-gaussian linear model with a.s. bounded approximation error.

3. The Algorithm Around the Optimum

In this section, we analyse the cumulative risk under a strong convergence assumption. Precisely we deﬁne

τ(ε) = min{k N | t > k, ˆθt θ ε} ,

where (ˆθt)t are the estimates of the static EKF, and with the convention min = + . We assume a bound on τ(ε) holding with high probability:

Assumption 5 For any δ, ε > 0, there exists T(ε, δ) N such that

P τ(ε) T(ε, δ) 1 δ .

Assumption 5 states that with high probability there exists a convergence time after which the algorithm stays trapped in a local region around the optimum. Sections 4 and 5 are devoted to deﬁne explicitly such a convergence time for a modiﬁed EKF in the logistic setting and for the true EKF in the quadratic setting.

Stochastic Online Optimization using Kalman Recursion

We present our result in the bounded and sub-gaussian settings. The results and their proofs are very similar, but two crucial steps are diﬀerent. First, Assumption 3 yields a bound on the gradient holding almost surely. We relax the boundedness condition for the quadratic loss with a sub-gaussian hypothesis, requiring a speciﬁc analysis. Second, our analysis is based on a second-order expansion. The quadratic loss is equal to its secondorder Taylor expansion but we need Assumption 5 along with the third point of Assumption 3 otherwise. The following theorem is our result in the bounded setting.

Theorem 1 Starting the static EKF from any ˆθ1 Rd, P1 0, if Assumptions 1, 2, 3, 5 are satisﬁed and if ρε > 0.95, for any δ > 0, it holds for any n 1 simultaneously

t=T(ε,δ)+1 L(ˆθt) L(θ ) 5

2dκε ln 1 + nhελmax(P1)D2 X d

+ 5λmax P 1 T(ε,δ)+1 ε2

+ 30 2κε + hεε2D2 X ln δ 1 ,

with probability at least 1 3δ.

The constant 0.95 may be chosen arbitrarily close to 0.5 with growing constants in the bound on the cumulative risk. There is a hidden trade-oﬀin ε: on the one hand, the smaller ε the better our upper-bound, but on the other hand T(ε, δ) increases when ε decreases, and thus our bound applies after a bigger convergence time. For the quadratic loss, we obtain the following result under the sub-gaussian hypothesis.

Theorem 2 In the quadratic setting, starting the static EKF from any ˆθ1 Rd, P1 0, if Assumptions 1, 2, 4 and 5 are satisﬁed, for any δ > 0 and any ε > 0, it holds for any n 1 simultaneously

t=T(ε,δ)+1 L(ˆθt) L(θ ) 15

2 d 8σ2 + D2 app + ε2D2 X ln 1 + nλmax(P1)D2 X d

+ 5λmax P 1 T(ε,δ)+1 ε2 + 115 σ2 4 + λmax(P1)D2 X 4

+ D2 app + 2ε2D2 X

with probability at least 1 5δ.

We observe a similar trade-oﬀin ε as in Theorem 1. Up to numerical constants, the tight constant d(σ2+D2 app) (see for instance Hsu et al., 2012) is achieved by choosing ε arbitrarily small, at the cost of a loose control of the T(ε, δ) ﬁrst steps. Both results follow from a regret analysis close to the one on the ONS (see Section 3.1), and on a control on the martingales stated below:

Lemma 3 Let k 0 and ( Nt)t>k be any martingale diﬀerence adapted to the ﬁltration (Ft)t k such that for any t > k, E[ N2 t | Ft 1] < . For any δ, λ > 0, we have the simultaneous property

2 (( Nt)2 + E[( Nt)2 | Ft 1]) ln δ 1

with probability at least 1 δ.

de Vilmarest and Wintenberger

Algorithm 2: Recursive updates of the ONS and the static EKF

Online Newton Step Static Extended Kalman Filter

P 1 t+1 = P 1 t + ℓ (yt, w t Xt)2Xt X t , P 1 t+1 = P 1 t + ℓ (yt, ˆθ t Xt)Xt X t ,

t = ℓ (yt, w t Xt)Xt , t = ℓ (yt, ˆθ t Xt)Xt ,

, ˆθt+1 = ˆθt Pt+1 t ,

K is the projection on K for the norm . P 1 t+1.

This result proved in Appendix B.1 is a corollary of a martingale inequality from Bercu and Touati (2008) and a stopping time construction (Freedman, 1975). We detail the key ideas of the proofs in the rest of the Section, and we defer to Appendix B the proof of the intermediate results along with the detailed proof of Theorems 1 and 2. Speciﬁcally, we display in Section 3.1 the parallel with the ONS, where we compare with the existing result on the cumulative risk, and a similar analysis yields an adversarial bound on a second-order expansion of the loss. In Section 3.2 we compare the excess risk with its second-order expansion thanks to Assumption 5, and we use a martingale analysis to obtain a bound on the cumulative excess risk.

3.1 Comparison with Online Newton Step and Adversarial Analysis

We display the parallel between the ONS and the static EKF in Algorithm 2 through their recursive updates. We observe that the square of the ﬁrst derivative of ℓis replaced with the second derivative. Thus t P 1 t in the static EKF is an estimate of the Hessian H which is the optimal preconditioning matrix as shown in Corollary 3 of Murata and Amari (1999). Then the recursion on the parameter (wt and ˆθt) has two diﬀerences: there is a gradient step size 1/γ in the ONS absent in the static EKF, and after the gradient step the ONS applies a projection. Lemma 3 yields the following reﬁnement on the bound of Mahdavi et al. (2015) obtained on the cumulative excess risk of the ONS:

Corollary 4 Assume the search region K has diameter D and the gradients are bounded by R. Let (wt)t be the ONS estimates starting from w1 K, P1 = λI and using a step-size γ = 1

2 min( 1 4RD, α) with α the exp-concavity constant of ℓon K. Then for any δ > 0, it holds for any n 1 simultaneously

t=1 L(wt) L(θ ) 3

2γ d ln 1 + n R2

with probability at least 1 2δ.

For the sake of consistency, we display Corollary 4 as a bound on the cumulative excess risk, whereas Theorem 3 of Mahdavi et al. (2015) is a bound on the excess risk of the averaged

Stochastic Online Optimization using Kalman Recursion

ONS. The latter follows directly from an application of Jensen s inequality. The proof of Corollary 4 consists in replacing Theorem 4 of Mahdavi et al. (2015) with Lemma 3. We obtain similar constants in Theorem 1 and in Corollary 4, as κε is the inverse of the expconcavity constant α. The use of second-order methods with well-tuned preconditioning is crucial in order to replace the leading constant R2/µ obtained for ﬁrst-order methods by d/α (µ is the minimum eigenvalue of the hessian H ). Our results on the static EKF are less general than the ones obtained on the ONS as a control of the convergence time τ(ε) T(ε, δ) is required with high probability. On the other hand the results obtained on the ONS require the knowledge of the exp-concavity constant α whereas the static EKF is parameter-free. That is why we argue that the static EKF provides an optimal way to tune the step size and the preconditioning matrix. Indeed, as ε is a parameter of the EKF analysis but not of the algorithm, we can improve the leading constant κε on a local region arbitrarily small around θ , at a cost of a loose control of the T(ε, δ) ﬁrst steps. In the ONS the choice of a diameter D > θ makes the gradient step-size sub-optimal and impacts the leading constant. Once the parallel between the ONS and the static EKF has been displayed (Algorithm 2), it is natural to adopt an approach similar to the one in Hazan et al. (2007). The cornerstone of our local analysis is the derivation of an adversarial bound on the second-order Taylor expansion of ℓ, from the recursive update formulae.

Lemma 5 For any sequence (Xt, yt)t, starting from any ˆθ1 Rd, P1 0, it holds for any θ Rd and n N that

ℓ (yt, ˆθ t Xt)Xt (ˆθt θ ) 1

2(ˆθt θ ) ℓ (yt, ˆθ t Xt)Xt X t (ˆθt θ )

t=1 X t Pt+1Xtℓ (yt, ˆθ t Xt)2 + ˆθ1 θ 2

We cannot compare the excess loss with the second-order Taylor expansion in general, and it is natural to use a step size parameter. In Hazan et al. (2007), the regret analysis of the ONS is based on a very similar bound on ℓ (yt, w t Xt)Xt (wt θ ) γ

2(wt θ ) ℓ (yt, w t Xt)2Xt X t (wt θ ) ,

where γ is a step size parameter. Then the regret bound follows from the exp-concavity property, bounding the excess loss ℓ(yt, w t Xt) ℓ(yt, θ Xt) with the previous quantity for a speciﬁc γ. The dependence of γ on the exp-concavity constant and the bound on the gradients require that the algorithm stays in a bounded region around the optimum θ , and a projection on this region is used, potentially at each step. We follow a very diﬀerent approach, to stay parameter-free, unconstrained and to avoid any additional cost in the leading constant. In the stochastic setting, we observe that we can upper-bound the excess risk with a second-order expansion, up to a multiplicative factor.

3.2 From Adversarial to Stochastic: the Cumulative Risk

In order to compare the excess risk with a second-order expansion, we compare the ﬁrstorder term with the second-order one.

de Vilmarest and Wintenberger

Proposition 6 If Assumptions 1, 2 and 3 are satisﬁed, for any θ Rd, it holds

θ (θ θ ) ρ θ θ (θ θ ) 2L

This result leads immediately to the following proposition, using the ﬁrst-order convexity property of L.

Proposition 7 If Assumptions 1, 2 and 3 are satisﬁed, for any θ Rd, 0 < c < ρ θ θ , it holds

L(θ) L(θ ) ρ θ θ ρ θ θ c

θ (θ θ ) c(θ θ ) 2L

Lemma 5 motivates the use of c > 1

2, thus we need at least ρ θ θ > 1

2. In the quadratic setting, it holds as an equality with ρ = 1 because the second derivative of the quadratic loss is constant. In the bounded setting we need to control the second derivative in a small range, and we can achieve that only locally, therefore we separate the condition ρ θ θ > 1

2 between the third point of Assumption 3 and Assumption 5. Then we are left to obtain a bound on the cumulative risk from Lemma 5. In order to compare the derivatives of the risk and the losses, we need to control the martingale diﬀerence adapted to the natural ﬁltration (Ft) and deﬁned as

(ˆθt θ ), where t = ℓ (yt, ˆθ t Xt)Xt .

We thus apply Lemma 3 to this martingale diﬀerence.

Lemma 8 Starting the static EKF from any ˆθ1 Rd, P1 0, if Assumptions 1 and 2 are satisﬁed, for any k 0 and δ, λ > 0, it holds simultaneously

Mt λ(ˆθt θ ) t t + 3

2E t t | Ft 1 (ˆθt θ ) ln δ 1

with probability at least 1 δ.

The proof of Theorems 1 and 2 deferred to Appendix B builds on the above results. Summing Lemma 5 and 8, we obtain for any δ, λ > 0 the simultaneous bound

ˆθt (ˆθt θ ) (ˆθt θ ) 1

2 (2) t + λ t t + 3

2λE t t | Ft 1 (ˆθt θ )

t=T(ε,δ)+1 X t Pt+1Xtℓ (yt, ˆθ t Xt)2 + ˆθ1 θ 2

λmin(PT(ε,δ)+1) + ln δ 1

with probability at least 1 δ, where we deﬁne (2) t = ℓ (yt, ˆθ t Xt)Xt X t for any t. In the last equation, we control (see Appendix B.4 and B.5) the quadratic term in ˆθt θ on the left hand-side in terms of (ˆθt θ ) 2L

θ2 |ˆθt(ˆθt θ ) in order to lower-bound the left expression proportionally to the cumulative excess risk using Proposition 7 for well chosen λ.

Stochastic Online Optimization using Kalman Recursion

Algorithm 3: Truncated Extended Kalman Filter for Logistic Regression

1. Initialization: P1 is any positive deﬁnite matrix, ˆθ1 is any initial parameter in Rd.

2. Iteration: at each time step t = 1, 2, . . .

(a) Update Pt+1 = Pt Pt Xt X t Pt 1+X t Pt Xtαt αt, with αt = max 1 tβ , 1

(1+e ˆθ t Xt)(1+e ˆθ t Xt)

(b) Update ˆθt+1 = ˆθt + Pt+1 yt Xt 1+eyt ˆθ t Xt .

4. Logistic Setting

Logistic regression is a widely used statistical model in classiﬁcation. The prediction of a binary random variable y Y = { 1, 1} consists in modelling L(y | X) with

pθ(y | X) = 1 1 + e yθ X = exp

yθ X (2 ln(1 + eθ X) θ X)

In the GLM notations, it yields a = 2 and b(θ X) = 2 ln(1 + eθ X) θ X.

4.1 Results for the Truncated Algorithm

In order to prove the convergence of the algorithm needed in the local phase, we follow a trick introduced by Bercu et al. (2020) consisting in changing slightly the update on Pt. Indeed, when the authors tried to prove the asymptotic convergence of the static EKF (which they named stochastic Newton step) using Robbins-Siegmund Theorem, they needed the convergence of P t λmax(Pt)2. This seems very likely to hold as we have intuitively Pt 1/t. However, in order to obtain λmax(Pt) = O(1/t), one needs to lower-bound αt, that is, to upper-bound |ˆθ t Xt|, and that is impossible in the global logistic setting. Therefore, the idea is to force a lower-bound on αt in its deﬁnition. We thus deﬁne, for some 0 < β < 1/2,

(1 + eˆθ t Xt)(1 + e ˆθ t Xt)

This modiﬁcation yields Algorithm 3, where we keep the notations ˆθt, Pt, τ(ε) with some abuse in the rest of this section. We impose a decreasing threshold on αt (β > 0) and we prove that the recursion coincides with Algorithm 1 after some steps. The sensitivity of the algorithm to β is discussed at the end of Section 4.2. Also, note that the threshold could be c/tβ, c > 0, as in Bercu et al. (2020). We consider 1/tβ for clarity. We control the convergence time τ(ε) of this version of the EKF:

Proposition 9 Starting Algorithm 3 from ˆθ1 = 0 and any P1 0, if Assumptions 1 and 2 are satisﬁed and E[XX ] is invertible, for any ε, δ > 0, it holds τ(ε) T(ε, δ) along with

t > T(ε, δ), αt = 1

(1 + eˆθ t Xt)(1 + e ˆθ t Xt) ,

de Vilmarest and Wintenberger

with probability at least 1 δ, where T(ε, δ) N is deﬁned in Corollary 13.

Besides the convergence of the truncated EKF, the proposition states that the truncated recursions coincide with the static EKF ones after the ﬁrst T(ε, δ) steps. Thus we can apply our analysis of Section 3. We state the global result for ε = 1/(20DX):

Theorem 10 Under the assumptions of Proposition 9, for any δ > 0, it holds for any n 1 simultaneously

t=1 L(ˆθt) L(θ ) 3de DX θ ln 1 + nλmax(P1)D2 X 4d

+ λmax(P 1 1 ) 75D2 X + 64e DX θ ln δ 1

+ T 1 20DX , δ 1

300 + DX ˆθ1 θ + T 1 20DX , δ 2 λmax(P1)D2 X 2 ,

with probability at least 1 4δ, where T(1/(20DX), δ) is deﬁned in Corollary 13.

4.2 Explicit Deﬁnition of T(ε, δ) in Proposition 9

It is proved that ˆθn θ 2 = O (ln n/n) almost surely (Bercu et al., 2020, Theorem 4.2). We don t obtain a non-asymptotic version of this rate of convergence, but the aim of this paragraph is to prove Proposition 9 for an explicit value of T(ε, δ) for any δ, ε > 0. The objective of the truncation introduced in the algorithm is to improve the control on Pt. We state that fact formally with a concentration result relying on Tropp (2012). We deﬁne Λmin the smallest eigenvalue of E[XX ].

Proposition 11 Under the assumptions of Proposition 9, for any δ > 0, it holds simultaneously

t > 20D4 X Λ2 min ln 625d D8 X Λ4 minδ

1/(1 β) , λmax(Pt) 4 Λmint1 β ,

with probability at least 1 δ.

This proposition justiﬁes the choice β < 1/2 in the introduction of the truncated algorithm to satisfy the condition P t λmax(Pt)2 < + with high probability. Motivated by Proposition 11, we deﬁne, for C > 0, the event

λmax(Pt) C t1 β

To obtain a control on Pt holding for any t, we use the relation λmax(Pt) λmax(P1) holding almost surely. We thus deﬁne

Cδ = max 4 Λmin , λmax(P1) 20D4 X Λ2 min ln 625d D8 X Λ4 minδ

and we obtain P (ACδ) 1 δ. We obtain the following theorem under that condition.

Stochastic Online Optimization using Kalman Recursion

Theorem 12 Under the assumptions of Proposition 9, we have for any δ, ε > 0 and t

exp 28D8 XC2 δ (1+e DX ( θ +ε))3

Λ3 min(1 2β)3/2ε2

P( ˆθt θ > ε | ACδ) (

t + 1) exp Λ6 min(1 2β)ε4

216D12 X C2 δ (1 + e DX( θ +ε))6 ln(t)2

+ t exp Λ2 min(1 2β)ε4

211D4 XC2 δ (1 + e DX( θ +ε))2 (

The beginning of our convergence proof starts similarly as the analysis of Bercu et al. (2020): we obtain a recursive inequality ensuring that (L(ˆθt))t is decreasing in expectation. However, in order to obtain a non-asymptotic result we cannot apply Robbins-Siegmund Theorem. Instead we use the fact that the variations of the algorithm ˆθt are slow thanks to the control on Pt. Thus, if the algorithm was far from the optimum, the last estimates were far too which contradicts the decrease in expectation of the risk. Consequently, we look at the last k t such that ˆθk θ < ε/2, if it exists. We decompose the probability of being outside the local region in two scenarii, yielding the two terms in Theorem 12. If k <

t, the recursive decrease in expectation makes it unlikely that the estimate stays far from the optimum for a long period. If k >

t, the control on Pt allows a control on the probability that the algorithm moves fast, in t k steps, away from the optimum. The following corollary explicitly deﬁnes a guarantee for the convergence time.

Corollary 13 Proposition 9 holds with for any ε, δ > 0

T(ε, δ) = max

2(1 + e DX( θ +ε)) 1/β , exp 3 215D12 X C2 δ/2(1 + e DX( θ +ε))6

Λ6 min(1 2β)3/2ε4

This deﬁnition of T(ε, δ) allows a discussion on the dependence of the bound Theorem 10 to the diﬀerent parameters. Note that the choice ε = 1/(20DX) in Theorem 10 is artiﬁcially made for simplifying constants since the bound actually holds for any ε > 0 simultaneously. The truncation has introduced an extra parameter 0 < β < 1/2 that does not impact the leading term in Theorem 10. However, it impacts the ﬁrst step control in an intricate way. On the one hand, when β is close to 0, the algorithm is slow to coincide with the true EKF as T(ε, δ) = e O(1)/β. On the other hand, the larger β, the larger our control on λmax(Pt) and thus we get T(ε, δ) = e O(1)/(1 2β)3/2. Practical considerations show that the truncation is artiﬁcial and can even deteriorate the performence of the EKF, see Section 6. Thus Bercu et al. (2020) suggest to choose β = 0.49. The dependence on δ is even more complex. The third constraint on T(ε, δ) is O(δ 1) which should not be sharp. To improve this lousy dependence in the bound, one needs a better control of Pt. It would follow from a speciﬁc analysis of the O(ln δ 1) ﬁrst recursions in order to initialize the control on Pt. However the objective of Corollary 13 was to prove Proposition 9 and not to get an optimal value of T(ε, δ). A reﬁnement of our convergence analysis following from a tighter control on Pt of the EKF than the one provided by Tropp (2012) is a very important and challenging open question.

de Vilmarest and Wintenberger

5. Quadratic Setting

We obtain a global result for the quadratic loss where Algorithm 1 becomes the standard Kalman ﬁlter (recall that we take σ2 = 1, that is ℓ(y, ˆy) = (y ˆy)2/2 and a = 1, b (ˆθ t Xt) = ˆθ t Xt, αt = 1). The parallel with the ridge forecaster was evoked by Diderrich (1985), and it is crucial that the static Kalman ﬁlter is the ridge regression estimator for a decaying regularization parameter. It highlights that the static EKF may be seen as an approximation of the regularized ERM:

Proposition 14 In the quadratic setting, for any sequence (Xt, yt), starting from any ˆθ1 Rd and P1 0, the static EKF satisﬁes the optimisation problem

ˆθt = arg min θ Rd

s=1 (ys θ Xs)2 + 1

2(θ ˆθ1) P 1 1 (θ ˆθ1)

Note that the static Kalman ﬁlter provides automatically a right choice of the ridge regularization parameter. This equivalence yields a logarithmic regret bound for the Kalman ﬁlter (Theorem 11.7 of Cesa-Bianchi and Lugosi, 2006). It follows from Lemma 5 as the quadratic loss coincides with its second-order Taylor expansion. The leading term of the bound is d ln n maxt(yt ˆθ t Xt)2, thus yt ˆθ t Xt needs to be bounded. As the static Kalman ﬁlter estimator is exactly the ridge forecaster, we can also use the regularized empirical risk minimization properties to control T(ε, δ). In particular, we apply the ridge analysis of Hsu et al. (2012), and we check Assumption 5:

Proposition 15 Starting from any ˆθ1 Rd and P1 0, if Assumptions 1, 2 and 4 hold and if E[XX ] is invertible then Assumption 5 holds for T(ε, δ) deﬁned explicitly in Appendix D, Corollary 26.

Up to universal constants, deﬁning Λmin as the smallest eigenvalue of E[XX ], we get

p1 + D2 X Λmin (1 + D2 app)

ln δ 1 + σ2d

+ DX Λmin (Dapp + DX θ ) + ˆθ1 θ p1 + σ2 ln δ 1 !!

with h(x) = x ln x. We obtain a much less dramatic dependence in ε than in the logistic setting. However we could not avoid a Λ 1 min factor in the deﬁnition of T(ε, δ). It is not surprising since the convergence phase relies deeply on the behavior of Pt. As for the logistic setting, we split the cumulative risk into two sums. The sum of the ﬁrst terms is roughly bounded by a worst case analysis, and the sum of the last terms is estimated thanks to our local analysis (Theorem 2). However, as the loss and its gradient are not bounded we cannot obtain a similar almost sure upper-bound on the convergence phase. The sub-gaussian assumption provides a high probability bound instead.

Stochastic Online Optimization using Kalman Recursion

Theorem 16 Under the assumptions of Proposition 15, for any ε, δ > 0, it holds simultaneously

t=1 L(ˆθt) L(θ ) 15

2 d 8σ2 + D2 app + ε2D2 X ln 1 + nλmax(P1)D2 X d

+ 5λmax(P 1 1 )ε2

+ 115 σ2(4 + λmax(P1)D2 X 4 ) + D2 app + 2ε2D2 X

+ D2 X 5ε2 + 2( ˆθ1 θ 2 + 3λmax(P1)DXσ ln δ 1)2 T(ε, δ)

+ 2λmax(P1)2D4 X(3σ + Dapp)2

3 T(ε, δ)3, n 1 ,

with probability at least 1 6δ.

Note that the dependence of the cumulative excess risk of the convergence phase in terms of δ is O(log(δ 1)3).

6. Experiments

We experiment the static EKF for logistic regression. Precisely, we compare the following sequential algorithms that we all initialize at 0:

The static EKF and the truncated version (Algorithm 3). We take the default value P1 = Id along with the value β = 0.49 suggested by Bercu et al. (2020). Note that a threshold 10 10/t0.49 as recommended by Bercu et al. (2020) would coincide with the static EKF.

The ONS and the averaged version. The convex region of search is a ball centered in 0 and of radius Dθ = 1.1 θ , a setting where we have good knowledge of θ . We consider two choices of the exp-concavity constant on which the ONS crucially relies to deﬁne the gradient step size. First, we use the only available bound e DθDX. Second, in the settings where the step size is so small that the ONS doesn t move, we use the exp-concavity constant κ0 at θ . This yields a bigger step size, though the exp-concavity is not satisﬁed on the region of search.

Two Averaged Stochastic Gradient Descent as described by Bach (2014). First we test the choice of the gradient step size γ = 1/(2D2 X

N) denoted by ASGD and a second version with γ = θ /(DX

N) denoted by ASGD oracle. Note that these algorithms are with ﬁxed horizon, thus at each step t, we have to re-run the whole procedure.

6.1 Synthetic Data

We ﬁrst consider well-speciﬁed data generated by the process of Bercu et al. (2020). The explanatory variables X = (1, Z ) are of dimension d = 11 where Z is a random vector composed of 10 independent components uniformly generated in [0, 1], thus DX =

d. With this distribution for X we deﬁne three synthetic settings that we evaluate:

de Vilmarest and Wintenberger

Figure 1: Density of the Bernoulli parameter on 107 samples: on the left and on the middle density of (1 + e θ X) 1 for the two well-speciﬁed settings (left, the ordinate is in log scale), and on the right density of (1+e θ j Xt) 1 with j {1, 2} uniformly at random for the misspeciﬁed setting. On the right we observe the two modes E[(1 + e θ 1 Xt) 1] 0.28 and E[(1 + e θ 2 Xt) 1] 0.79.

Well-speciﬁed 1: we deﬁne θ = ( 9, 0, 3, 9, 4, 9, 15, 0, 7, 1, 0) , and at each iteration t, the variable yt { 1, 1} is a Bernoulli variable of parameter (1+e θ Xt) 1.

Well-speciﬁed 2: in the ﬁrst well-speciﬁed setting the Bernoulli parameter is mostly distributed around 0 and 1 (see Figure 1), thus we try a less discriminated setting with θ = 1

10( 9, 0, 3, 9, 4, 9, 15, 0, 7, 1, 0) .

Misspeciﬁed: In order to demonstrate the robustness of the EKF we test the algorithms in a misspeciﬁed setting switching randomly between two well-speciﬁed logistic processes. We deﬁne θ1 = 1

10( 9, 0, 3, 9, 4, 9, 15, 0, 7, 1, 0) and θ2 where we have only changed the ﬁrst coeﬃcient from 9/10 to 15/10. Then yt is a Bernoulli random variable whose parameter is either (1 + e θ 1 Xt) 1 or (1 + e θ 2 Xt) 1 uniformly at random. We checked that Assumption 2 is still satisﬁed.

We evaluate the diﬀerent algorithms with the mean squared error E[ ˆθt θ 2] that we approximate by its empirical version on 100 samples. We display the results in Figure 2.

6.2 Real Data Sets

To illustrate better the robustness to misspeciﬁcation, we run the same procedures on real data sets:

Forest cover-type (Blackard and Dean, 1999): the feature vector is of dimension d = 54, and as it is a multi-class task (7 classes) we focus on classifying 2 versus all others. There are n = 581012 instances and we randomly split in two halves for training and testing.

Adult income (Kohavi, 1996): the objective is to predict whether a person s annual income is smaller or bigger than 50K. There are 14 explanatory variables, and we

Stochastic Online Optimization using Kalman Recursion

Figure 2: Mean squared error in log-log scale for the three synthetic settings. For the ﬁrst well-speciﬁed setting (left) the ONS is applied using the exp-concavity constant κ0 1.7 10 15 instead of e Dθ

d to accelerate the algorithm, and both the ONS and its averaged version still don t move. In the other two (middle and right) we use e Dθ

d for the ONS. We observe that the EKF and the truncated version coincide in the two last settings.

obtain d = 98 once categorical variables are transformed into binary variables. We use the canonical split between training (32561 instances) and testing (16281 instances).

For each data set, we standardize X such that each feature ranges from 0 to 1. At each step we sample within the training set (with replacement). We evaluate through an empirical version of E[L(ˆθn)] L(θ ) estimated on 100 samples and where L is estimated on the test set, see Figure 3.

6.3 Summary

Our experiments show the superiority of the EKF for logistic regression compared to the ONS or to averaged SGD in all the settings we tested. We display in Table 2 a few indicators of the data sets. In particular, it is interesting that the static EKF works well even in a setting where the Hessian matrix H is singular. It appears clear that low exp-concavity constants are responsible of the poor performances of the ONS. One may tune the gradient step size at the cost of losing the expconcavity property and thus the regret guarantee of (Hazan et al., 2007) or its analogous for the cumulative risk (Mahdavi et al., 2015). Averaging is crucial for the ONS, whereas it is useless for the static EKF. Indeed we chose not to plot the averaged version of the EKF for clarity, but the EKF performs better than its averaged version. It is important to note that in the ﬁrst synthetic setting the truncation deteriorates the performance of the EKF, as well as in the adult income data set to a lesser extent, whereas the results are the same in the other settings. Bercu et al. (2020) argue that the truncation is artiﬁcially introduced for the convergence property, thus they use the threshold 10 10/t0.49 instead of 1/t0.49 and the truncated version almost coincides with the true EKF. We conﬁrm here that the truncation may be damaging if the threshold is set too high and

de Vilmarest and Wintenberger

Figure 3: Excess test risk for forest cover type (left) and adult income (right). As the ONS doesn t move when applied with the exp-concavity constant e DθDX we use instead the exp-concavity constant at θ : κ0 1.4 10 3 for forest cover type and κ0 5.5 10 6 for adult income. The EKF and the truncated version almost coincide for both data sets.

Setting d λmax(H )/µ tr(G H 1) R2/µ de DθDX dκ0 Synthetic well-speciﬁed 1 11 6.9 102 1.7 102 1.0 105 9.2 1037 6.4 1015

Synthetic well-speciﬁed 2 11 1.5 102 7.1 101 2.5 103 5.4 104 3.3 102

Synthetic misspeciﬁed 3 11 1.5 102 7.1 101 2.0 103 3.6 103 7.4 101

Forest cover type 54 9.2 1032 3.8 104

Adult income 98 2.5 107 7.2 105 5.3 108 1.8 1062 1.8 107

Table 2: For the diﬀerent experimental settings we display the dimension d and the condition number of the Hessian at θ (λmax(H ) and µ are the maximal and minimal eigenvalues of H ). We present the value of tr(G H 1) which is bounded either by R2/µ, or by de DθDX because e DθDX bounds the exp-concavity constant on the centered ball of radius Dθ. We add to the table dκ0 de DθDX where κ0 is the inverse of the exp-concavity constant of the loss at θ .

Stochastic Online Optimization using Kalman Recursion

we recommend to use the EKF in practice, or equivalently the truncated version with the low threshold suggested by Bercu et al. (2020).

7. Conclusion

We studied an eﬃcient way to tackle some unconstrained optimization problems, in which we get rid of the projection step of bounded algorithm such as the ONS. We presented a bayesian approach where we transformed the loss into a negative log-likelihood. We used the Kalman recursion to provide a parameter free approximation of the maximum-likelihood estimator. We demonstrated the optimality of the local phase for locally exp-concave losses which can be expressed as GLM log-likelihoods. We proved the ﬁniteness of the convergence phase in logistic and quadratic regressions. We illustrated our theoretical results with numerical experiments for logistic regression. It would be interesting to generalize our results to a larger class of optimization problems. Finally, this article aimed at strengthening the bridge between Kalman recursion and the optimization community. Therefore we made the i.i.d. assumption, standard in the stochastic optimization literature and we focus on the static EKF. It may lead the way to a risk analysis of the general EKF in non i.i.d. state-space models.

de Vilmarest and Wintenberger

The Appendix follows the structure of the article:

Appendix A presents the EKF for generalized linear models.

Appendix B contains the proofs of Section 3. Precisely, Lemma 3 is proved in Section B.1, the intermediate results of Sections 3.1 and 3.2 are proved in Sections B.2 and B.3, then Theorem 1 is proved in Section B.4 and Theorem 2 in Section B.5.

Appendix C contains the proofs of Section 4. We derive the global bound (Theorem 10) in Section C.1, then we obtain the concentration result on Pt in Section C.2, and ﬁnally we prove the convergence of the truncated algorithm in Section C.3.

Appendix D contains the proofs of Section 5. We prove Theorem 16 in Section D.1 and then in Section D.2 we prove the convergence of the algorithm, and we deﬁne an explicit value of T(ε, δ) satisfying Assumption 5.

Appendix A. Derivation of the Static EKF for Generalized Linear Models

As in Section 10.2 of Durbin and Koopman (2012) we consider the following state-space model:

yt = Zt(θt) + εt ,

θt+1 = Tt(θt) + ηt .

where εt and ηt are independent with mean zero and variances ht(θt), Qt(θt). The statespace version of equation (2) is

p(yt | Xt) = h(yt) exp ytθ t Xt b(θ t Xt) a

The preceding equation matches the space equation form with Zt(θt) = b (θ t Xt) and ht(θt) = ab (θ t Xt). Thus we can write the EKF as follows (see Equation 10.4 of Durbin and Koopman, 2012): denoting by Tt the derivative of Tt,

vt = yt b (ˆθ t Xt) , Ft = X t Pt Xtb (ˆθ t Xt)2 + ab (ˆθ t Xt) , ˆθt|t = ˆθt + Pt Xtb (ˆθ t Xt)F 1 t vt , Pt|t = Pt Pt Xt F 1 t X t Ptb (ˆθ t Xt)2 ,

ˆθt+1 = Tt(ˆθt|t) , Pt+1 = Tt Pt|t Tt + Qt(ˆθt|t) .

We focus on the static setting where the state equation becomes θt+1 = θt, thus we have ˆθt+1 = ˆθt|t and Pt+1 = Pt|t. We rewrite the update on Pt as follows:

Pt+1 = Pt Pt Xt X t Ptb (ˆθ t Xt)/a

X t Pt Xtb (ˆθ t Xt)/a + 1 .

Moreover we have Pt+1Xt = Pt Xt F 1 t ab (ˆθ t Xt) thus we can rewrite the update on ˆθt as follows:

ˆθt+1 = ˆθt + Pt+1Xt(yt b (ˆθ t Xt) a .

This yields Algorithm 1.

Stochastic Online Optimization using Kalman Recursion

Appendix B. Proofs of Section 3

B.1 Proof of Lemma 3

We prove the following Lemma inspired by the stopping time technique of Freedman (1975) from which we derive Lemma 3. We give a general form useful in several proofs.

Lemma 17 Let (Fn) be a ﬁltration, and we consider a sequence of events (An) that is adapted to (Fn). Let (Vn) be a sequence of random variables adapted to (Fn) satisfying V0 = 1, Vn 0 almost surely for any n, and

E[Vn | Fn 1, An 1] Vn 1, n 1.

Then for any δ > 0, it holds

n=1 Vn > δ 1 [

An important particular case is when (Vn) is a super-martingale adapted to the ﬁltration (Fn) satisfying V0 = 1 and Vn 0 almost surely: then we have simultaneously Vn δ 1

for n 1 with probability larger than 1 δ. Proof We deﬁne

Vn > δ 1 An 1 .

As (Ek) is increasing, we have, for any k 1,

n=1 P En En 1

An 1 En 1 +

n=1 P Vn > δ 1 En 1 An 1 .

First, we have

An 1 En 1 P

Second, we apply Markov s inequality:

n=1 P Vn > δ 1 En 1 An 1

δ 1 1En En 1 An 1

n=1 E h Vn(1En 1 An 1 1En) i

E h Vn1En 1 An 1

i E Vn1En .

de Vilmarest and Wintenberger

The second line is obtained since En

En 1 An 1 . According to the tower property and the super-martingale assumption,

E h Vn1En 1 An 1

i = E h E[Vn | Fn 1, An 1]1En 1 An 1

E h E[Vn | Fn 1, An 1]1En 1

E h Vn 11En 1

Therefore, a telescopic argument along with V0 = 1 and Vk1Ek 0 yields

n=1 P Vn > δ 1 En 1 An 1 δ .

Finally, for any k 1, we obtain

and the desired result follows by letting k .

Proof of Lemma 3. Let λ > 0. For any n 1, we deﬁne

2 (( Nt)2 + E[( Nt)2 | Ft 1]) !

Lemma B.1 of Bercu and Touati (2008) states that (Vn) is a super-martingale adapted to the ﬁltration (Fk+n). Moreover V0 = 1 and for any n, it holds Vn 0 almost surely. Therefore we can apply Lemma 17.

B.2 Proof of Lemma 5

Proof of Lemma 5. We start from the update formula ˆθt+1 = ˆθt + Pt+1 (yt b (ˆθ t Xt))Xt a yielding

(ˆθt+1 θ ) P 1 t+1(ˆθt+1 θ ) = (ˆθt θ ) P 1 t+1(ˆθt θ ) + 2(yt b (ˆθ t Xt))X t a (ˆθt θ )

+ X t Pt+1Xt

yt b (ˆθ t Xt) a

Stochastic Online Optimization using Kalman Recursion

With a summation argument, re-arranging terms, we obtain:

(b (ˆθ t Xt) yt)X t a (ˆθt θ ) 1

2(ˆθt θ ) (P 1 t+1 P 1 t )(ˆθt θ )

t=1 X t Pt+1Xt

yt b (ˆθ t Xt) a

(ˆθt θ ) P 1 t (ˆθt θ ) (ˆθt+1 θ ) P 1 t+1(ˆθt+1 θ ) .

We bound the telescopic sum: as P 1 n+1 0, we have

(ˆθt θ ) P 1 t (ˆθt θ ) (ˆθt+1 θ ) P 1 t+1(ˆθt+1 θ )

(ˆθ1 θ ) P 1 1 (ˆθ1 θ ) ˆθ1 θ 2

The result follows from the identities

(b (ˆθ t Xt) yt)Xt

a = ℓ (yt, ˆθ t Xt)Xt , P 1 t+1 P 1 t = ℓ (yt, ˆθ t Xt)Xt X t .

B.3 Proofs of Section 3.2

Proof of Proposition 6. The ﬁrst-order condition satisﬁed by θ is

E (y b (θ X))X

yielding E [y X] = E[b (θ X)X]. Therefore

E (b (θ X) y)X

a(θ θ ) E h X(b (θ X) b (θ X)) i .

Considering the function f : λ (θ θ ) E Xb (θ X + λ(θ θ ) X)) , we know there exists λ [0, 1] such that f (λ) = f(1) f(0). This translates into

θ (θ θ ) = 1

a(θ θ ) E h Xb θ X + λ(θ θ) X (θ θ ) X i .

Then we use Assumption 3:

b θ X + λ(θ θ) X

b (θ X) = ℓ yt, θ X + λ(θ θ) X

ℓ (yt, θ X) ρ θ θ ,

de Vilmarest and Wintenberger

θ (θ θ ) ρ θ θ (θ θ ) E h ℓ (y, θ X)XX i (θ θ )

= ρ θ θ (θ θ ) 2L

Proof of Proposition 7. We ﬁrst recall that L(θ) L(θ ) L

θ (θ θ ), then Proposition

θ (θ θ ) c(θ θ ) 2L

θ(θ θ ) (1 c ρ θ θ ) L

and the result follows.

Proof of Lemma 8. We ﬁrst develop ( Mt)2:

( Mt)2 = (E [ t | Ft 1] t) (ˆθt θ ) 2

= (ˆθt θ ) E[ t | Ft 1]E[ t | Ft 1] + t t

t E[ t | Ft 1] E[ t | Ft 1] t (ˆθt θ )

2(ˆθt θ ) E[ t | Ft 1]E[ t | Ft 1] + t t (ˆθt θ )

2(ˆθt θ ) E[ t t | Ft 1] + t t (ˆθt θ ) .

The third line holds because if U, V Rd, it holds UV V U UU + V V . The last one comes from E h ( t E[ t | Ft 1])( t E[ t | Ft 1]) | Ft 1 i 0. Also, we have the relation

E[( Mt)2 | Ft 1] (ˆθt θ ) E[ t t | Ft 1](ˆθt θ ) .

( Mt)2 + E[( Mt)2 | Ft 1] (ˆθt θ ) 3E[ t t | Ft 1] + 2 t t (ˆθt θ ) ,

and the result follows from Lemma 3.

We derive the following Lemma in order to control the right-hand side of Lemma 5, in both settings.

Lemma 18 Assume the second point of Assumption 3 holds. For any k, n 1, if ˆθt θ 2 ε for any k < t k + n then we have

t=k+1 Tr Pt+1(P 1 t+1 P 1 t ) d ln 1 + nhελmax(Pk+1)D2 X d

Stochastic Online Optimization using Kalman Recursion

Proof We apply Lemma 11.11 of Cesa-Bianchi and Lugosi (2006):

t=k+1 Tr Pt+1(P 1 t+1 P 1 t ) =

1 det(P 1 t ) det(P 1 t+1)

det(P 1 t+1)

det(P 1 t )

det(P 1 k+n+1)

det(P 1 k+1)

t=k+1 ℓ (yt, ˆθ t Xt)(P 1/2 k+1Xt)(P 1/2 k+1Xt) !

i=1 ln(1 + λi) ,

where λ1, ..., λd are the eigenvalues of k+n P

t=k+1 ℓ (yt, ˆθ t Xt)(P 1/2 k+1Xt)(P 1/2 k+1Xt) . Therefore we

t=k+1 Tr Pt+1(P 1 t+1 P 1 t ) d ln

dnhελmax(Pk+1)D2 X

B.4 Bounded Setting (Assumption 3)

Proof of Theorem 1. Let δ > 0. On the one hand, we sum Lemma 5 and 8. We obtain, for any λ > 0,

E[ t | Ft 1] (ˆθt θ ) 1

λ(ˆθt θ ) t t + 3

2E t t | Ft 1 (ˆθt θ )

t=T(ε,δ)+1 X t Pt+1Xtℓ (yt, ˆθ t Xt)2 + ˆθ1 θ 2

λmin(PT(ε,δ)+1) + ln δ 1

λ , n 1 , (3)

with probability at least 1 δ, where we deﬁne Qt = (ˆθt θ ) ℓ (yt, ˆθ t Xt)Xt X t (ˆθt θ ) for any t.

de Vilmarest and Wintenberger

On the other hand, thanks to Assumption 3, we can apply Proposition 7 with c = 0.75 to obtain, for any t 1,

= L(ˆθt) L(θ ) ρε ρε 0.75

ˆθt (ˆθt θ ) 0.75(ˆθt θ ) 2L

ˆθt (ˆθt θ ) ,

= L(ˆθt) L(θ ) 5 E[ t | Ft 1] (ˆθt θ ) 0.75E[Qt | Ft 1] , (4)

because ρε > 0.95. In order to bridge the gap between Equations (3) and (4), we need to control the quadratic terms of Equation (3) with E[Qt | Ft 1]. First, for any t, if ˆθt θ ε, we have Qt [0, hεε2D2 X], and we apply Lemma A.3 of Cesa-Bianchi and Lugosi (2006) to the random variable 1 hεε2D2 X Qt [0, 1]: for any s > 0,

E exp s hεε2D2 X Qt es 1

hεε2D2 X E [Qt | Ft 1] | Ft 1, ˆθt θ ε 1 .

We ﬁx s = 0.1 and we deﬁne

0.1 hεε2D2 X Qt (e0.1 1)E 1 hεε2D2 X Qt | Ft 1

The sequence (Vn) is adapted to (FT(ε,δ)+n), almost surely we have V0 = 1 and Vn 0. Finally,

E h Vn | FT(ε,δ)+n 1, ˆθT(ε,δ)+n θ ε i Vn 1 ,

and ( ˆθT(ε,δ)+n θ ε) belongs to FT(ε,δ)+n 1. We apply Lemma 17:

n=1 Vn > δ 1 [

n=1 ( ˆθT(ε,δ)+n θ > ε) !

n=1 ( ˆθT(ε,δ)+n θ > ε)

We deﬁne Aε k = T

n=k+1 ( ˆθn θ ε) for any k. The last inequality is equivalent to

t=T(ε,δ)+1 Qt > 10(e0.1 1)

t=T(ε,δ)+1 E [Qt | Ft 1] + 10hεε2D2 X ln δ 1 Aε T(ε,δ)

We then bound the two quadratic terms coming from Lemma 8: using Assumption 3 we have the implications

ˆθt θ ε = (ˆθt θ ) t t (ˆθt θ ) κεQt ,

ˆθt θ ε = (ˆθt θ ) E t t | Ft 1 (ˆθt θ ) κεE [Qt | Ft 1] .

Stochastic Online Optimization using Kalman Recursion

Therefore, we get from (5)

2Qt + λ(ˆθt θ ) t t (ˆθt θ )

2λ(ˆθt θ ) E t t | Ft 1 (ˆθt θ ) >

10(e0.1 1)(1

2 + λκε) + 3

2λκε T(ε,δ)+n X

t=T(ε,δ)+1 E [Qt | Ft 1]

2 + λκε)hεε2D2 X ln δ 1 Aε T(ε,δ)

We set λ = 0.75 5(e0.1 1) (10(e0.1 1)+ 3

2 )κε , so that

10(e0.1 1)(1

2 + λκε) + 3

2λκε = 0.75 ,

1 2 + λκε = 1

2 + 0.75 5(e0.1 1)

10(e0.1 1) + 3

2 0.59 0.6 ,

and consequently

E[ t | Ft 1] (ˆθt θ ) 0.75E[Qt | Ft 1] > 6hεε2D2 X ln δ 1

E[ t | Ft 1] (ˆθt θ ) 1

λ(ˆθt θ ) t t + 3

2E t t | Ft 1 (ˆθt θ ) !

We plug Equation (4) in the last inequality:

t=T(ε,δ)+1 (L(ˆθt) L(θ )) > 30hεε2D2 X ln δ 1

E[ t | Ft 1] (ˆθt θ ) 1

λ(ˆθt θ ) t t + 3

2E t t | Ft 1 (ˆθt θ ) !

de Vilmarest and Wintenberger

We then use Equation (3) with 1

λ = (10(e0.1 1)+ 3

2 )κε 0.75 5(e0.1 1) 11.4κε 12κε. It yields

t=T(ε,δ)+1 (L(ˆθt) L(θ )) > 5

t=T(ε,δ)+1 X t Pt+1Xtℓ (yt, ˆθ t Xt)2

+ 5 ˆθ1 θ 2

λmin(PT(ε,δ)+1) + 30(2κε + hεε2D2 X) ln δ 1 !

Thanks to Assumption 3, we have

X t Pt+1Xtℓ (yt, ˆθ t Xt)2 κε Tr Pt+1(P 1 t+1 P 1 t ) , t > T(ε, δ) ,

therefore we apply Lemma 18: for any n, it holds

t=T(ε,δ)+1 X t Pt+1Xtℓ (yt, ˆθ t Xt)2 dκε ln

1 + nhελmax(PT(ε,δ)+1)D2 X d

As PT(ε,δ)+1 P1, we obtain

t=T(ε,δ)+1 (L(ˆθt) L(θ )) > 5

2dκε ln 1 + nhελmax(P1)D2 X d

+ 5 ˆθ1 θ 2

λmin(PT(ε,δ)+1) + 30(2κε + hεε2D2 X) ln δ 1 !

To conclude, we use Assumption 5.

B.5 Quadratic Setting (Assumption 4)

We recall two deﬁnitions introduced in the previous subsection:

n=k+1 ( ˆθn θ ε), k 1 ,

Qt = (ˆθt θ ) Xt X t (ˆθt θ ), t 1 .

The sub-gaussian hypothesis requires a diﬀerent treatment of several steps in the proof. In the following proofs, we use a consequence of the ﬁrst points of Assumption 4. We apply Lemma 1.4 of Rigollet and H utter (2015): for any X Rd,

E[(y E[y | X])2i | X] 2i(2σ2)iΓ(i) = 2(2σ2)ii!, i N . (6)

First, we control the quadratic terms in t = (yt ˆθ t Xt)Xt in the following lemma.

Stochastic Online Optimization using Kalman Recursion

Lemma 19 1. For any k N and δ > 0, we have

t=k+1 (ˆθt θ ) t t (ˆθt θ )

> 3 8σ2 + D2 app + ε2D2 X k+n X

t=k+1 Qt + 12ε2D2 Xσ2 ln δ 1 !

2. For any t, it holds almost surely

(ˆθt θ ) E[ t t | Ft 1](ˆθt θ ) 3 σ2 + D2 app + ˆθt θ 2D2 X E[Qt | Ft 1] .

1. We recall that for any a, b, c, we have (a + b + c)2 3(a2 + b2 + c2). Thus

(ˆθt θ ) t t (ˆθt θ ) = Qt(yt ˆθ t Xt)2

3Qt (yt E[yt | Xt])2 + (E[yt | Xt] θ Xt)2 + ((θ ˆθt) Xt)2

3Qt (yt E[yt | Xt])2 + D2 app + ˆθt θ 2D2 X . (7)

To obtain the last inequality, we use the second point of Assumption 4 to bound the middle term. Then we use Taylor series for the exponential, and we apply Equation (6). For any t and any µ satisfying 0 < µ 1 4Qtσ2 , we have

E exp µQt(yt E[yt | Xt])2 | Ft 1, Xt = 1 + X

µi Qi t E[(yt E yt | Xt])2i | Xt

µi Qi ti!(2σ2)i

1 + 8µQtσ2, 2µQtσ2 1

2 exp 8µQtσ2 .

Therefore, for any t,

E exp 1 4ε2D2 Xσ2 Qt (yt E[yt | Xt])2 8σ2 | Ft 1, Xt, ˆθt θ ε 1 .

We deﬁne the random variable

1 4ε2D2 Xσ2

t=k+1 Qt (yt E[yt | Xt])2 8σ2 !

de Vilmarest and Wintenberger

(Vn)n is adapted to the ﬁltration (σ(X1, y1, ..., Xk+n, yk+n, Xk+n+1)n, moreover V0 = 1 and Vn 0 almost surely, and

E h Vn | X1, y1, ..., Xk+n 1, yk+n 1, Xk+n, ˆθk+n θ ε i Vn 1 .

Therefore we apply Lemma 17: for any δ > 0,

n=1 (Vn > δ 1) Aε k

which is equivalent to

t=k+1 Qt(yt E[yt | Xt])2 > 8σ2 k+n X

t=k+1 Qt + 4ε2D2 Xσ2 ln δ 1 !

Substituting in Equation (7), we obtain the desired result.

2. We apply the same decomposition as for Equation (7): for any t,

(ˆθt θ ) E[ t t | Ft 1](ˆθt θ )

3(ˆθt θ ) E Xt X t (yt E[yt | Xt])2 + D2 app + θ ˆθt 2D2 X | Ft 1

Assumption 4 implies that for any Xt, E[(yt E[yt | Xt])2 | Xt] σ2. Thus, the tower property yields

(ˆθt θ ) E[ t t | Ft 1](ˆθt θ )

3 σ2 + D2 app + ˆθt θ 2D2 X (ˆθt θ ) E[Xt X t | Ft 1](ˆθt θ ) .

Second, we bound the right-hand side of Lemma 5, that is the objective of the following lemma.

Lemma 20 Let k N. For any δ > 0, we have

t=k+1 X t Pt+1Xt(yt ˆθ t Xt)2 > 12λmax(P1)D2 Xσ2 ln δ 1

+ 3 8σ2 + D2 app + ε2D2 X d ln 1 + nλmax(Pk+1)D2 X d

Proof We apply a similar analysis as in the proof of Lemma 19 in order to use the subgaussian assumption, and then we apply the telescopic argument as in the bounded setting.

Stochastic Online Optimization using Kalman Recursion

We decompose yt ˆθ t Xt:

X t Pt+1Xt(yt ˆθ t Xt)2

3X t Pt+1Xt (yt E[yt | Xt])2 + (E[yt | Xt] b (θ Xt))2 + ((θ ˆθt) Xt)2

3X t Pt+1Xt (yt E[yt | Xt])2 + D2 app + ˆθt θ 2D2 X . (8)

To control (yt E[yt | Xt])2X t Pt+1Xt, we use its positivity along with Equation (6). Precisely, for any t and any µ > 0 satisfying 0 < µ 1 4X t Pt+1Xtσ2 , we have

E h exp µ(yt E[yt | Xt])2X t Pt+1Xt | Ft 1, Xt i

µi(X t Pt+1Xt)i E (yt E[yt | Xt])2i | Xt

µi(X t Pt+1Xt)ii!(2σ2)i

2µX t Pt+1Xtσ2 i

1 + 8µX t Pt+1Xtσ2, 0 < 2µX t Pt+1Xtσ2 1

exp 8µX t Pt+1Xtσ2 .

We apply the previous bound with a uniform µ = 1 4λmax(P1)D2 Xσ2 . As λmax(Pt+1) λmax(P1)

for any t, we get µ 1 4X t Pt+1Xtσ2 . Thus, we deﬁne

1 4λmax(P1)D2 Xσ2

(yt E[yt | Xt])2 8σ2 X t Pt+1Xt

(Vn) is a super-martingale adapted to the ﬁltration (σ(X1, y1, ..., Xk+n 1, yk+n 1, Xk+n))n satisfying almost surely V0 = 1, Vn 0, thus we apply Lemma 17:

n=1 (Vn > δ 1)

or equivalently

t=k+1 X t Pt+1Xt(yt E[yt | Xt])2

> 8σ2 k+n X

t=k+1 X t Pt+1Xt + 4λmax(P1)D2 Xσ2 ln δ 1 !!

de Vilmarest and Wintenberger

Combining it with Equation (8), we get

t=k+1 X t Pt+1Xt(yt ˆθ t Xt)2 > 3 8σ2 + D2 app + ε2D2 X k+n X

t=k+1 X t Pt+1Xt

+ 12λmax(P1)D2 Xσ2 ln δ 1 !

Then we apply Lemma 18: the second point of Assumption 3 holds with hε = 1, thus

t=k+1 Tr Pt+1(P 1 t+1 P 1 t ) d ln 1 + nλmax(Pk+1)D2 X d

We conclude with X t Pt+1Xt = Tr(Pt+1(P 1 t+1 P 1 t )).

We sum up our ﬁndings and we prove the result for the quadratic loss. The structure of the proof is the same as the one of Theorem 1. Proof of Theorem 2. On the one hand, we sum Lemma 5 and Lemma 8: for any λ, δ > 0

E[ t | Ft 1] (ˆθt θ ) 1

2Qt λ(ˆθt θ ) t t +3

2E t t | Ft 1 (ˆθt θ )

t=T(ε,δ)+1 X t Pt+1Xt(yt ˆθ t Xt)2 + ˆθT(ε,δ)+1 θ 2

λmin(PT(ε,δ)+1) + ln δ 1

λ , n 1 , (9)

with probability at least 1 δ. On the other hand, we have

t=T(ε,δ)+1 (L(ˆθt) L(θ )) 1 1 0.8

E[ t | Ft 1] (ˆθt θ ) 0.8E[Qt | Ft 1] .

We aim to relate Equations (9) and (10) as in the proof of Theorem 1. To that end, we apply Lemma 19:

2Qt + λ(ˆθt θ ) t t + 3

2E t t | Ft 1 (ˆθt θ )

2 + 3λ(8σ2 + D2 app + ε2D2 X) T(ε,δ)+n X

t=T(ε,δ)+1 Qt

2λ σ2 + D2 app + ε2D2 X T(ε,δ)+n X

t=T(ε,δ)+1 E [Qt | Ft 1] + 12λε2D2 Xσ2 ln δ 1 !

Stochastic Online Optimization using Kalman Recursion

As in the proof of Theorem 1 we apply Lemma A.3 of (Cesa-Bianchi and Lugosi, 2006) and Lemma 17: for any δ > 0,

t=T(ε,δ)+1 Qt > 10(e0.1 1)

t=T(ε,δ)+1 E[Qt | Ft 1] + 10ε2D2 X ln δ 1

We combine the last two inequalities:

2Qt + λ(ˆθt θ ) t t + 3

2E t t | Ft 1 (ˆθt θ )

> 10(e0.1 1) 1

2 + 3λ(8σ2 + D2 app + ε2D2 X) + 9

2λ(σ2 + D2 app + ε2D2 X)

t=T(ε,δ)+1 E [Qt | Ft 1]

+ 10ε2D2 X 1

2 + 3λ(8σ2 + D2 app + ε2D2 X) + 12λε2D2 Xσ2 ln δ 1 !

λ = 0.8 5(e0.1 1) 30(e0.1 1)(8σ2 + D2 app + ε2D2 X) + 9

2(σ2 + D2 app + ε2D2 X) 1

in order to obtain

10(e0.1 1) 1

2 + 3λ(8σ2 + D2 app + ε2D2 X) + 9

2λ(σ2 + D2 app + ε2D2 X) = 0.8 ,

1 109σ2 + 28D2app + 28ε2D2 X < λ < 1 108σ2 + 27D2app + 27ε2D2 X ,

2 + 3λ(8σ2 + D2 app + ε2D2 X) + 12λD2 Xε2σ2 8ε2D2 X 1 λ 28(4σ2 + D2 app + ε2D2 X) .

de Vilmarest and Wintenberger

Combining Equations (9), (10) and (11), we obtain

t=T(ε,δ)+1 (L(ˆθt) L(θ ))

t=T(ε,δ)+1 X t Pt+1Xt(yt ˆθ t Xt)2 + ε2

λmin(PT(ε,δ)+1)

+ 28(4σ2 + D2 approx + ε2D2 X) ln δ 1 + 8ε2D2 X ln δ 1 !

Finally, we apply Lemma 20 with PT(ε,δ)+1 P1 and we use Assumption 5: it holds simultaneously

t=T(ε,δ)+1 L(ˆθt) L(θ ) 5 3

2 8σ2 + D2 app + ε2D2 X d ln 1 + nλmax(P1)D2 X d

+ λmax P 1 T(ε,δ)+1 ε2 + 28(4σ2 + D2 approx + ε2D2 X) ln δ 1

+ 8ε2D2 X ln δ 1 + 6λmax(P1)D2 Xσ2 ln δ 1 , n 1 ,

with probability at least 1 5δ. To conclude, we write

28(4σ2 + D2 approx + ε2D2 X) + 8ε2D2 X + 6λmax(P1)D2 Xσ2

28 σ2(4 + λmax(P1)D2 X 4 ) + D2 app + 2ε2D2 X

Appendix C. Proofs of Section 4

C.1 Proof of Theorem 10

Proof of Theorem 10. We check Assumption 3 with κε = e DX( θ +ε), hε = 1 4 and ρε = e εDX > 0.95. We can thus apply Theorem 1 with

λmax(P 1 T(ε,δ)+1) λmax(P 1 1 ) + 1

2 < 3e DX θ , 30 2κε + ε2D2 X 4

< 64e DX θ , 5ε2D2 X 1/75 .

We then control the ﬁrst terms. To that end, we use a rough bound at any time t 1:

L(ˆθt) L(θ ) E y X

1 + eyˆθ t X | ˆθt

DX( ˆθ1 θ + (t 1)λmax(P1)DX) ,

Stochastic Online Optimization using Kalman Recursion

because for any s 1, we have Ps P1 and therefore ˆθs+1 ˆθs λmax(P1)DX. Summing from 1 to T( 1 20DX , δ) yields the result.

C.2 Concentration of Pt

We prove a concentration result based on Tropp (2012), which will be used on the inverse of Pt.

Lemma 21 If Assumption 1 is satisﬁed, then for any 0 β < 1 and t 41/(1 β), it holds

d exp t1 β Λ2 min 10D4 X

Proof We wish to center the matrices Xs X s by subtracting their (common) expected value. We use that if A and B are symmetric, λmin(A B) λmin(A) λmin(B). Indeed, denoting by v any eigenvector of A associated with its smallest eigenvalue,

λmin(A B) = min x x (A B)x

= λmin(A) v Bv

λmin(A) min x x Bx

= λmin(A) λmin(B) .

s=1 E Xs X s sβ

s=1 E Xs X s sβ

Λmin t1 β 1

de Vilmarest and Wintenberger

Therefore, we obtain

< Λmin(t1 β 2)

Xs X s sβ E Xs X s sβ

< Λmin(t1 β 2)

2(1 β) Λmin t1 β 1

E Xs X s sβ

We check the assumptions of Theorem 1.4 of Tropp (2012):

Obviously E h Xs X s sβ i Xs X s sβ is centered,

λmax E h Xs X s sβ i Xs X s sβ λmax E h Xs X s sβ i D2 X almost surely.

As 0 E E h Xs X s sβ i Xs X s sβ 2 E Xs X s sβ 2 D4 X s2β I D4 X sβ I, we get

" E Xs X s sβ

I D4 X t1 β

Therefore we can apply Theorem 1.4 of Tropp (2012):

E Xs X s sβ

Λ2 mint2(1 β)/(8(1 β)2) D4 Xt1 β/(1 β) + D2 XΛmint1 β/(6(1 β))

= d exp t1 β Λ2 min 8D4 X

1/(1 β) + Λmin/(6D2 X(1 β))

t1 β Λ2 min 8D4 X

1 β + Λmin(1 β)

Using Λmin/D2 X 1 and β 0, we obtain 8(1 β + Λmin(1 β)

6D2 X ) 8(1 + 1/6) = 28/3 10, therefore

< Λmin(t1 β 2)

d exp t1 β Λ2 min 10D4 X

The result follows from 1

2t1 β 2 > 0 for t 41/(1 β).

We can now do a union bound to obtain Proposition 11.

Stochastic Online Optimization using Kalman Recursion

Proof of Proposition 11. We ﬁrst move our problem to the setting of Lemma 21:

λmax(Pt) = λmin

s=1 Xs X s αs

because αs 1/sβ. Therefore, for t 8 41/(1 β),

P λmax(Pt) > 4 Λmint1 β

> 4 Λmint1 β

d exp t1 β Λ2 min 10D4 X

where we applied Lemma 21 to obtain the last line. We take a union bound to obtain, for any k 7,

P t > k, λmax(Pt) > 4 Λmint1 β

t>k d exp t1 β Λ2 min 10D4 X

t>k exp t1 β Λ2 min 10D4 X

m 1 exp m Λ2 min 10D4 X

t>k 1 t1 β =m

t>k 1 t =m: for any m

t1 β = m = m1/(1 β) t < (m + 1)1/(1 β) ,

then using ex 1 + 2x for any 0 x 1, we have

(m + 1)1/(1 β) = m1/(1 β)(1 + 1/m)1/(1 β)

= m1/(1 β) exp(ln(1 + 1/m)/(1 β))

m1/(1 β) exp(1/(m(1 β)))

m1/(1 β)(1 + 2/(m(1 β))) ,

as long as m 2 1/(1 β). Therefore

(m + 1)1/(1 β) m1/(1 β) + 1 2m1/(1 β) 1/(1 β) + 1 4m + 1 4(m + 1) ,

de Vilmarest and Wintenberger

and that is true for m = 1 too. Hence

P t > k, λmax(Pt) > 4 Λmint1 β

m k1 β (m + 1) exp m Λ2 min 10D4 X

= 4d exp Λ2 min 10D4 X

1 exp Λ2 min 10D4 X

k1 β + 1 + exp Λ2 min 10D4 X

1 exp Λ2 min 10D4 X

4d exp Λ2 min 10D4 X

1 exp Λ2 min 10D4 X

1 exp Λ2 min 10D4 X

exp Λ2 min 10D4 X

where the second line is obtained deriving both sides of P

m k1 β rm+1 = r k1 β +1

respect to r. Also, as 1 e x xe x for any x R, we get

P t > k, λmax(Pt) > 4 Λmint1 β

4d10D4 X Λ2 min exp 2 Λ2 min 10D4 X

(k1 β + 10D4 X Λ2 min exp Λ2 min 10D4 X

) exp Λ2 min 10D4 X

Furthermore, as xe x e 1 for any x 0, we get for any k 7: k1 β + 10D4 X Λ2 min exp Λ2 min 10D4 X

exp k1 β Λ2 min 20D4 X

Λ2 min exp 10D4 X Λ2 min exp Λ2 min 10D4 X

Λ2 min 20D4 X

= 20D4 Xe 1

Λ2 min exp 1

2 exp Λ2 min 10D4 X

Combining the last two inequalities, we obtain

P t > k, λmax(Pt) > 4 Λmint1 β

d800D8 Xe 1

Λ4 min exp 2 Λ2 min 10D4 X + 1

2 exp Λ2 min 10D4 X

exp k1 β Λ2 min 20D4 X

d625D8 X Λ4 min exp k1 β Λ2 min 20D4 X

and the result follows. The last line comes from Λmin D2 X and consequently

800e 1 exp 2 Λ2 min 10D4 X + 1

2 exp Λ2 min 10D4 X

800e 1+0.2+0.5e0.1 624.7 625 .

The condition k 7 is not necessary because 20D4 X Λ2 min ln 625d D8 X Λ4 minδ

1/(1 β) 20 ln(625δ 1) ,

Stochastic Online Optimization using Kalman Recursion

and either δ 1 and the result is trivial, either δ < 1 and 20 ln(625δ 1) 128.

C.3 Convergence of the Truncated Algorithm

In order to prove Theorem 12, we state and prove an intermediate lemma.

Lemma 22 Let θ Rd.

1. For any η > 0, we have

L(θ) L(θ ) > η = L

where Dη = Λmin η

2DX(1+e DX ( θ +

8η/D2 X )) .

2. For any ε > 0, we have

θ θ > ε = L(θ) L(θ ) > Λmin 4(1 + e DX( θ +ε))ε2 .

Proof Both points derive from a second-order identity, turned in an upper-bound in the one case and in a lower-bound in the other. Using L

θ (θ ) = 0, there exists 0 λ 1 such that

L(θ) = L(θ ) + 1

2(θ θ ) E 1 (1 + e(λθ+(1 λ)θ ) X)(1 + e (λθ+(1 λ)θ ) X)XX (θ θ ) .

1. We ﬁrst have

L(θ) L(θ ) D2 X 8 θ θ 2 .

Assume L(θ) L(θ ) > η. Then θ θ q

8η/D2 X. Also, using the Taylor

expansion of θ around some θ0 Rd, we get

L(θ ) L(θ0)+ L

θ0 (θ θ0)+ 1 4(1 + e DX( θ + θ0 θ ))(θ0 θ ) E h XX i (θ0 θ ) ,

and that yields

θ0 (θ0 θ ) L(θ0) L(θ ) + Λmin 4(1 + e DX( θ + θ0 θ )) θ0 θ 2 .

Therefore, as L(θ0) L(θ ) 0, L

Λmin 4(1 + e DX( θ + θ0 θ )) θ0 θtrue .

de Vilmarest and Wintenberger

Finally, as L is convex of minimum θ , L

4(1 + e DX( θ +

2DX(1 + e DX( θ +

2. On the other hand we have

L(θ) L(θ ) + Λmin 4(1 + e DX( θ + θ θ )) θ θ 2 .

Thus, as L is convex of minimum θ , if θ θ > ε it holds

L(θ) L(θ ) > min θ0 θ =ε L(θ0) L(θ ) Λmin 4(1 + e DX( θ +ε))ε2 .

Proof of Theorem 12. We prove the convergence of (L(ˆθt))t to L(θ ) and then the convergence of (ˆθt)t to θ follows. The convergence of (L(ˆθt))t comes from the ﬁrst point of Lemma 22. The link between the two convergences is stated in the second point. To study the evolution of L(ˆθt) we ﬁrst apply a second-order Taylor expansion: for any t 1 there exists 0 αt 1 such that

L(ˆθt+1) = L(ˆθt) + L

ˆθt (ˆθt+1 ˆθt) + 1

2(ˆθt+1 ˆθt) 2L

ˆθt+αt(ˆθt+1 ˆθt)(ˆθt+1 ˆθt) . (12)

4E[XX ], therefore, using the update formula on ˆθ, the second-order term is bounded with

(ˆθt+1 ˆθt) 2L

ˆθt+αt(ˆθt+1 ˆθt)(ˆθt+1 ˆθt) 1

(1 + eytˆθ t Xt)2 X t P t+1 E[XX ]

4D4 Xλmax(Pt+1)2 1

4D4 Xλmax(Pt)2 .

The ﬁrst-order term is controlled using the deﬁnition of the algorithm:

ˆθt+1 ˆθt = Pt Pt Xt X t Pt 1 + X t Pt Xtαt αt

yt Xt 1 + eytˆθ t Xt ,

and as αt 1, αt Pt Xt X t Pt 1 + X t Pt Xtαt

yt Xt 1 + eytˆθ t Xt

D3 Xλmax(Pt)2 .

Stochastic Online Optimization using Kalman Recursion

θ DX. Substituting our ﬁndings in Equation (12), we obtain

L(ˆθt+1) L(ˆθt) + L

ˆθt Pt yt Xt 1 + eytˆθ t Xt + 2D4 Xλmax(Pt)2 . (13)

ˆθt Pt yt Xt 1 + eytˆθ t Xt E L

ˆθt Pt yt Xt 1 + eytˆθ t Xt | X1, y1, ..., Xt 1, yt 1

ˆθt Pt yt Xt 1 + eytˆθ t Xt + L

Hence we have

ˆθt Pt yt Xt 1 + eytˆθ t Xt Mt λmin(Pt) L

2 Mt 1 t D2 X

because Ps I s D2 X . Combining it with Equation (13) and summing consecutive terms, we obtain, for any k < t,

L(ˆθt) L(ˆθk)

Ms 1 s D2 X

2 + 2D4 Xλmax(Ps)2 !

We recall that there exists Cδ such that P(ACδ) 1 δ where

λmax(Pt) Cδ

On the previous inequality, we see that the right-hand side is the sum of a martingale and a term which is negative for s large enough, under the event ACδ. We are then interested in P((L(ˆθt) L(θ ) > η) | ACδ) for some η > 0. For 0 k t, we deﬁne Bk,t the event ( k < s < t, L(ˆθs) L(θ ) > η/2). Then we use the law of total probability:

P(L(ˆθt) L(θ ) > η | ACδ) (15)

P (L(ˆθt) L(θ ) > η) B0,t | ACδ

k=1 P (L(ˆθt) L(θ ) > η) L(ˆθk) L(θ ) η

2 Bk,t | ACδ (16)

P (L(ˆθt) L(θ ) > η) B0,t | ACδ

k=1 P L(ˆθt) L(ˆθk) > η

2 Bk,t | ACδ .

Lemma 22 yields

L(ˆθs) L(θ ) > η

de Vilmarest and Wintenberger

We combine the last equation, along with Equation (14) and the deﬁnition of ACδ to get, for any 1 k < t,

P (L(ˆθt) L(ˆθk) > η/2) Bk,t | ACδ P

s=k Ms > f(k, t) Bk,t | ACδ

s=k Ms > f(k, t) | ACδ

where f(k, t) = η

2 + D2 η D2 X

1 s 2D4 XC2 δ

1 s2(1 β) for any 1 k < t.

Similarly, we get

P (L(ˆθt) L(θ ) > η) B0,t | AC P

s=1 Ms > f0(t) | AC

with f0(t) = η (L(ˆθ1) L(θ )) + D2 η D2 X

1 s 2D4 XC2 δ

1 s2(1 β) for any t 1.

We have E[Ms | X1, y1, ..., Xs 1, ys 1] = 0, and almost surely |Ms| 2D2 Xλmax(Ps). We can therefore apply Azuma-Hoeﬀding inequality: for t, k such that f(k, t) > 0,

s=k Ms > f(k, t) | ACδ

f(k, t)2 (1 2β) max 1/2, (k 1)1 2β

because + P

1 s2(1 β) 1 (1 2β) max(1/2,(k 1)1 2β). Similarly, for t such that f0(t) > 0,

s=1 Ms > f0(t) | ACδ

exp f0(t)2 1 2β

We need to control f(k, t), f0(t). We see that for t large enough, when k is small

compared to t, f(k, t) is driven by D2 η D2 X ln(t) and when k t, f(k, t) is driven by η/2. The following Lemma formally states these approximations as lower-bounds. We prove it right after the end of this proof.

Lemma 23 For t max

16D6 X C2 δ D2η(1 2β) , 1 + 8D4 XC2 δ η(1 2β) 1 1 2β 2!

f(k, t) D2 η 4D2 X ln(t), 1 k <

Similarly, for t e

L(ˆθ1) L(θ )+ 4D4 X C2 δ 1 2β

f0(t) D2 η 2D2 X ln(t) .

Stochastic Online Optimization using Kalman Recursion

Then, deﬁning C1 = D4 η(1 2β) 256D8 XC2 δ and C2 = η2(1 2β)

128D4 XC2 δ , we ﬁnally get for t large enough:

P (L(ˆθt) L(θ ) > η) B0,t | ACδ exp 4C1 ln(t)2 ,

P (L(ˆθt) L(θ ) > η) (L(ˆθk) L(θ ) η

2) Bk,t | ACδ

( exp C1 ln(t)2 if 1 k <

t , exp C2(k 1)1 2β if

Substituting in Equation (16) yields:

P(L(ˆθt) L(θ ) > η | AC)

exp 4C1 ln(t)2 +

k=1 exp C1 ln(t)2 +

t exp C2(k 1)1 2β

t + 1) exp C1 ln(t)2 + t exp C2(

Finally, Point 2 of Lemma 22 allows to obtain the result: deﬁning η = Λminε2

4(1+e DX ( θ +ε)), we obtain

P( ˆθt θ > ε | ACδ) P(L(ˆθt) L(θ ) > η | ACδ)

t + 1) exp C1 ln(t)2 + t exp C2(

In order to obtain the constants involved in the Theorem, we write

Dη = Λmin q

4(1+e DX ( θ +ε))

2DX(1 + exp DX( θ + r

D2 X(1+e DX ( θ +ε))) ) Λmin 1 + e DX( θ +ε)

3/2 ε 4DX ,

C1 Λ6 min(1 2β)ε4

216D12 X C2 δ (1 + e DX( θ +ε))6 ,

C2 Λ2 min(1 2β)ε4

211D4 XC2 δ (1 + e DX( θ +ε))2 ,

and the conditions of Lemma 23 become

28D8 XC2 δ (1 + e DX( θ +ε))3

Λ3 min(1 2β)ε2

32D4 XC2 δ (1 + e DX( θ +ε)) (1 2β)Λminε2

32D4 X(1 + e DX( θ +ε))3

L(ˆθ1) L(θ ) + 4D4 XC2 δ 1 2β

de Vilmarest and Wintenberger

We would like to obtain a single condition on t, thus we write

32D4 XC2 δ (1 + e DX( θ +ε)) (1 2β)Λminε2

32D4 XC2 δ (1 + e DX( θ +ε)) (1 2β)Λminε2

1 + 32D4 XC2 δ (1 + e DX( θ +ε)) (1 2β)Λminε2

32D4 XC2 δ (1 + e DX( θ +ε)) (1 2β)Λminε2

28D8 XC2 δ (1 + e DX( θ +ε))3

Λ3 min(1 2β)3/2ε2

The third line is obtained with the inequality ln(1 + x) x for any x > 0. Obviously, as 0 < 1 2β < 1, the ﬁrst threshold on t is bounded by:

28D8 XC2 δ (1 + e DX( θ +ε))3

Λ3 min(1 2β)ε2

28D8 XC2 δ (1 + e DX( θ +ε))3

Λ3 min(1 2β)3/2ε2

To handle the third one, we use D2 XCδ 4D2 X Λmin 4 and as ˆθ1 = 0 we obtain L(ˆθ1) L(θ )

ln 2 4D4 XC2 δ 1 2β , hence

32D4 X(1 + e DX( θ +ε))3

L(ˆθ1) L(θ ) + 4D4 XC2 δ 1 2β

28D8 XC2 δ (1 + e DX( θ +ε))3

Λ3 min(1 2β)3/2ε2

Proof of Lemma 23. We recall that for any k 1,

1 s ln t ln k ,

1 s2(1 β) 1 1 2β 1 max(1/2, (k 1)1 2β) .

2 + D2 η D2 X (ln t ln k) 2D4 XC2 δ 1 2β 1 max(1/2, (k 1)1 2β) ,

f0(t) η (L(ˆθ1) L(θ ) + D2 η D2 X ln t 4D4 XC2 δ 1 2β .

Stochastic Online Optimization using Kalman Recursion

For any 1 k <

t, it holds ln k 1

2 ln t, and we have

f(k, t) D2 η 2D2 X ln(t) 4D4 XC2 δ 1 2β .

16D6 X C2 δ D2η(1 2β) yields f(k, t) D2 η 4D2 X ln(t).

For t 2 and any k

2 2D4 XC2 δ (1 2β)(k 1)1 2β η

2 2D4 XC2 δ (1 2β)(

Then if t 1 + 8D4 XC2 δ η(1 2β) 1 1 2β 2 , we get f(k, t) η

Last point comes from f0(t) D2 η D2 X ln t (L(ˆθ1) L(θ ) 4D4 XC2 δ 1 2β .

Proof of Corollary 13. We apply Theorem 12: for any t exp 28D8 XC2 δ/2(1+e DX ( θ +ε))3

Λ3 min(1 2β)3/2ε2

P( ˆθt θ > ε | ACδ/2) (

t + 1) exp C1 ln(t)2 + t exp C2(

C1 = Λ6 min(1 2β)ε4

216D12 X C2 δ/2(1 + e DX( θ +ε))6 , C2 = Λ2 min(1 2β)ε4

211D4 XC2 δ/2(1 + e DX( θ +ε))2 .

We use a union bound: for any T exp 28D8 XC2 δ/2(1+e DX ( θ +ε))3

Λ3 min(1 2β)3/2ε2

t=T+1 ( ˆθt θ > ε) | ACδ/2

t + 1) exp C1 ln(t)2

t>T t exp C2(

3 2C1 , we have

t + 1) exp C1 ln(t)2 X

de Vilmarest and Wintenberger

For t 4, 1 1/

t 1/2, then for t 12 C2(1 2β) 4/(1 2β) ,

t 1)1 2β exp 3 ln(t) C2

2 t(1 2β)/2

exp 12 1 2β ln 12 C2(1 2β)

12 C2(1 2β)

because for any x > 0, we have ln x x/2.

Thus for T 12 C2(1 2β) 4/(1 2β)

t>T t exp C2(

t 1)1 2β 1/T .

Finally, for T satisfying the previous conditions as well as T 6δ 1, we obtain

t=T+1 ( ˆθt θ > ε) | ACδ/2

We now compare the constants involved. As long as εDX 1, we have

28D8 XC2 δ/2(1 + e DX( θ +ε))3

Λ3 min(1 2β)3/2ε2

3 215D12 X C2 δ/2(1 + e DX( θ +ε))6

Λ6 min(1 2β)3/2ε4

Furthermore, as 1 2β 1, we have

3 215D12 X C2 δ/2(1 + e DX( θ +ε))6

Λ6 min(1 2β)ε4

3 215D12 X C2 δ/2(1 + e DX( θ +ε))6

Λ6 min(1 2β)3/2ε4

Stochastic Online Optimization using Kalman Recursion

12 C2(1 2β)

4/(1 2β) = exp 4 1 2β ln 12 C2(1 2β)

4 1 2β ln 12 211D4 XC2 δ/2(1 + e DX( θ +ε))2

Λ2 min(1 2β)2ε4

8 1 2β ln 12 211D4 XC2 δ/2(1 + e DX( θ +ε))2

Λ2 min(1 2β)ε4

3 213D4 XC2 δ/2(1 + e DX( θ +ε))2

Λ2 min(1 2β)ε4

629D2 XCδ/2(1 + e DX( θ +ε))

Λmin(1 2β)3/2ε2

3 215D12 X C2 δ/2(1 + e DX( θ +ε))6

Λ6 min(1 2β)3/2ε4

Appendix D. Proofs of Section 5

Proof of Proposition 14. The ﬁrst order condition of the optimum yields

arg min θ Rd

s=1 (ys θ Xs)2 + 1

2(θ ˆθ1) P 1 1 (θ ˆθ1) = ˆθ1 + Pt

s=1 (ys ˆθ 1 Xs)Xs .

Therefore we prove recursively that ˆθt ˆθ1 = Pt Pt 1 s=1(ys ˆθ 1 Xs)Xs. It is clearly true at t = 1. Assuming it is true for some t 1, we use the update formula

ˆθt+1 ˆθ1 = (I Pt+1Xt X t )(ˆθt ˆθ1) + Pt+1yt Xt Pt+1Xt X t ˆθ1

= (I Pt+1Xt X t )Pt

s=1 (ys ˆθ 1 Xs)Xs + Pt+1(yt ˆθ 1 Xt)Xt .

We conclude with the following identity:

(I Pt+1Xt X t )Pt = Pt Pt Xt X t Pt + Pt Xt X t Pt Xt X t Pt X t Pt Xt + 1 = Pt Pt Xt X t Pt X t Pt Xt + 1 = Pt+1 .

de Vilmarest and Wintenberger

D.1 Proof of Theorem 16

We ﬁrst prove a result controlling the ﬁrst estimates of the algorithm.

Lemma 24 Provided that Assumptions 1, 2 and 4 are satisﬁed, starting from any ˆθ1 Rd

and P1 0, for any δ > 0, it holds simultaneously

ˆθt θ ˆθ1 θ + λmax(P1)DX (3σ + Dapprox)(t 1) + 3σ ln δ 1 , t 1,

with probability at least 1 δ.

Proof From Proposition 14, we obtain, for any t 1, ˆθt ˆθ1 = Pt Pt 1 s=1(ys ˆθ 1 Xs)Xs. Consequently,

s=1 (ys ˆθ 1 Xs)Xs Pt

s=1 (ys θ Xs)Xs + Pt P 1 1 (ˆθ1 θ ) ,

and using Pt P 1 1 I, we obtain

ˆθt θ ˆθ1 θ + λmax(Pt)DX

s=1 |ys θ Xs|

ˆθ1 θ + λmax(P1)DX

s=1 (|ys E[ys | Xs]| + Dapp) . (17)

We apply Lemma 1.4 of Rigollet and H utter (2015) in the second line of the following: for any µ such that 0 < µ < 1 2

E [exp(µ|yt E[yt | Xt]|)] = 1 + X

µi E[|yt E[yt | Xt]|i]

µi(2σ2)i/2iΓ(i/2)

2µσ i , because Γ(i/2) Γ(i) = (i 1)!

2µσ, because 0 <

Therefore exp 1 2

s=1 (|ys E[ys | Xs]| 2

t is a super-martingale to which we

can apply Lemma 17. We obtain, for any δ > 0,

s=1 |yt E[yt | Xt]| 2

2(t 1)σ + 2

2σ ln δ 1, t 1,

Stochastic Online Optimization using Kalman Recursion

with probability at least 1 δ. The result follows from Equation (17) and 2

Proof of Theorem 16. We ﬁrst apply Theorem 2: with probability at least 1 5δ, it holds simultaneously for all n T(ε, δ)

t=T(ε,δ)+1 L(ˆθt) L(θ ) 15

2 d 8σ2 + D2 app + ε2D2 X ln 1 + (n T(ε, δ))λmax(P1)D2 X d

+ 5λmax P 1 T(ε,δ)+1 ε2

+ 115 σ2(4 + λmax(P1)D2 X 4 ) + D2 app + 2ε2D2 X

Moreover, λmax P 1 T(ε,δ)+1 λmax(P 1 1 ) + T(ε, δ)D2 X.

Then we derive a bound on the ﬁrst T(ε, δ) terms. For any t 1, we have L(ˆθt) L(θ ) D2 X ˆθt θ 2, thus, using (a + b)2 2(a2 + b2) and applying Lemma 24 we obtain the simultaneous property

L(ˆθt) L(θ ) 2D2 X( ˆθ1 θ + 3λmax(P1)DXσ ln δ 1)2

+ 2λmax(P1)2D4 X(3σ + Dapp)2(t 1)2, t 1,

with probability at least 1 δ. A summation argument yields, for any δ > 0,

t=1 L(ˆθt) L(θ ) 2D2 X( ˆθ1 θ + 3λmax(P1)DXσ ln δ 1)2T(ε, δ)

+ λmax(P1)2D4 X(3σ + Dapp)2 (T(ε, δ) 1)T(ε, δ)(2T(ε, δ) 1)

with probability at least 1 δ.

D.2 Deﬁnition of T(ε, δ)

We now focus on the deﬁnition of T(ε, δ). We ﬁrst transcript the result of Hsu et al. (2012) to our notations in the following lemma.

Lemma 25 Provided that Assumptions 1, 2 and 4 are satisﬁed, starting from any ˆθ1 Rd

and P1 = p1I, p1 > 0, we have, for any 0 < δ < e 2.6 and t 6 D2 X Λmin (ln d + ln δ 1),

ˆθt+1 θ 2 Σ 3

2p1 + D2 X Λmin D2 app 4(1 +

8 ln δ 1) 0.072 + 3σ2(d/0.035 + ln δ 1)

+ 12 0.072t2

D2 X Λmin (1 +

+ DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1

2 (ln δ 1)2 !

with probability at least 1 4δ.

de Vilmarest and Wintenberger

Proof We ﬁrst observe that

arg min w Rd 1

s=1 (ys w Xs)2 + λ w ˆβ1 2 = arg min w Rd 1

s=1 (ys ˆβ 1 Xs w Xs)2 + λ w 2 ,

therefore we apply the ridge analysis of Hsu et al. (2012) to (Xs, ys ˆβ 1 Xs). We note that (ys ˆβ 1 Xs) has the same variance proxy and the same approximation error, therefore it only amounts to translate the optimal w, that is denoted by β. For any λ > 0, we observe that

d2,λ d1,λ d , ρλ DX pd1,λΛmin , bλ ρλ(Dapp + DX β ˆβ1 ) .

Therefore we can apply Theorem 16 of Hsu et al. (2012): for 0 < δ < e 2.6 and t 6 DX Λmin (ln(d)+ln δ 1), it holds that ˆβt+1,λ β 2 Σ = 3( βλ β 2 Σ+εbs+εvr) with probability 1 4δ, with

εbs 4 0.072

D2 X Λmin E[(E[y | X] β X)2] + (1 + D2 X Λmin ) βλ β 2 Σ t (1 +

+ ( DX Λmin (Dapp + DX β ˆβ1 ) + βλ β Σ)2

t2 (ln δ 1)2 ,

t DX Λmin (1 +

8 ln δ 1) + 1

D4 X Λ2 mind + 1

εvr σ2d(1 + δf)

0.072t + 2σ2p

d(1 + δf) ln δ 1

0.073/2t + 2σ2 ln δ 1

Moreover E[(E[y | X] β X)2] D2 app and Λmin D2 X, hence, using βλ β Σ λ β ˆβ1 we transfer the result in our KF notations, that is, ˆθt = ˆβt,p 1 1 /2(t 1), ˆβ1 = ˆθ1, β = θ .

We obtain, for any 0 < δ < e 2.6 and t 6 DX Λmin (ln(d) + ln δ 1),

εbs 4 0.072

D2 X Λmin D2 app + D2 X Λmin ˆθ1 θ 2

+ ( DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1t )2

t2 (ln δ 1)2 ,

t DX Λmin (1 +

8 ln δ 1) + 1

D4 X Λ2 mind + 1

εvr σ2d(1 + δf)

0.072t + 2σ2p

d(1 + δf) ln δ 1

0.073/2t + 2σ2 ln δ 1

ˆθt+1 θ 2 Σ 3

2p1t + εbs + εvr

Stochastic Online Optimization using Kalman Recursion

with probability at least 1 4δ. For t D2 X Λmin ln δ 1, as ln δ 1 1, we get

6 ln δ 1 (1 +

8 ln δ 1) + 1

1 d + 1 1 +

2 9 1.9 2 .

2 for any a, b > 0, we have

3d 0.07 + 2

0.07 + 2 ln δ 1 !

0.07 + 3 ln δ 1

3σ2(d/0.035 + ln δ 1)

It yields the result.

Lemma 25 allows the deﬁnition of an explicit value for T(ε, δ), as displayed in the following Corollary.

Corollary 26 Assumption 5 is satisﬁed for T(ε, δ) = max(T1(δ), T2(ε, δ), T3(ε, δ)) where we deﬁne

T1(δ) = max 12 D2 X Λmin (ln d + ln δ 1), 48D2 X Λmin ln 24D2 X Λmin

T2(ε, δ) = 24ε 1

2p1 + D2 X Λmin D2 app 4(1 +

8 ln δ 1) 0.072 + 3σ2(d/0.035 + ln δ 1)

2p1 + D2 X Λmin D2 app 4(1 +

8 ln δ 1) 0.072 + 3σ2(d/0.035 + ln δ 1)

D2 X Λmin (1 +

+ DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1

2 (ln δ 1)2 !1/2

2p1 (1 + D2 X Λmin )(1 +

+ DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1

2 (ln δ 1)2 !

We recall that for any η 1, we have ln t

t η for t 2η 1 ln(η 1), and we use it in the following proof. Proof of Corollary 26. We deﬁne δt = δ/t2 for any t 1. In order to apply Lemma

25 with a union bound, we need t 6 D2 X Λmin (ln d + ln δ 1 t ). If t 12 D2 X Λmin (ln d + ln δ 1) and

de Vilmarest and Wintenberger

t 48D2 X Λmin ln 24D2 X Λmin , we obtain

6 D2 X Λmin (ln d + ln δ 1) + 12D2 X Λmin ln t, as ln t

= 6 D2 X Λmin (ln d + ln δ 1 t ) .

Therefore, we deﬁne T1(δ) = max 12 D2 X Λmin (ln d + ln δ 1), 48D2 X Λmin ln 24D2 X Λmin

, and we apply Lemma 25. We get the simultaneous property

ˆθt+1 θ 2 Σ 3

2p1 + D2 X Λmin D2 app 4(1 + q

8 ln δ 1 t )

0.072 + 3σ2(d/0.035 + ln δ 1 t ) 0.07

+ 12 0.072t2

D2 X Λmin (1 + q

8 ln δ 1 t )

+ DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1

2 (ln δ 1 t )2 !

for all t T1(δ), with probability at least 1 4δ P

t T1(δ) t 2 1 δ because T1(δ) > 4.

Thus, as ln t 1 for t T1(δ) and ˆθt+1 θ 2 Σ Λmin ˆθt+1 θ 2, we obtain

ˆθt+1 θ 6 ln t

2p1 + D2 X Λmin D2 app 4(1 +

8 ln δ 1) 0.072 + 3σ2(d/0.035 + ln δ 1)

+ 48(ln t)2

0.072Λmint2

D2 X Λmin (1 +

+ DX Λmin (Dapp + DX θ ) + ˆθ1 θ 2p1

2 (ln δ 1)2 !

for all t T1(δ, with probability at least 1 δ. Finally, both terms of the last inequality are bounded by ε/2.

From Corollary 26, we obtain the asymptotic rate by comparing T2(δ) and T3(δ). We write T2(δ) = 2A2(δ) ln A2(δ), T3(δ) = 2A3(δ) ln A3(δ) with

p1 + D2 X Λmin D2 app

ln δ 1 + σ2(d + ln δ 1)

v u u t ε 1

ln δ 1 + DX(Dapp + DX θ ) Λmin + ˆθ1 θ p1

2 (ln δ 1)2 !

Stochastic Online Optimization using Kalman Recursion

where the symbol means less than up to universal constants. As

ab a + b, we obtain

+ DX Λmin (Dapp + DX θ ) + ˆθ1 θ p1

p1 + D2 X Λmin

+ DX Λmin (Dapp + DX θ ) + ˆθ1 θ p1

Thus, as long as ε 1 Λmin 1, we get

A2(δ), A3(δ) ε 1

p1 + D2 X Λmin (1 + D2 app)

ln δ 1 + σ2d

+ DX Λmin (Dapp + DX θ ) + ˆθ1 θ p1 + σ2 ln δ 1 !

Shun-Ichi Amari. Natural gradient works eﬃciently in learning. Neural Computation, 10 (2):251 276, 1998.

Francis Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(1):595 627, 2014.

Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate o(1/n). In Advances in Neural Information Processing Systems, pages 773 781, 2013.

Bernard Bercu and Abderrahmen Touati. Exponential inequalities for self-normalized martingales with applications. The Annals of Applied Probability, 18(5):1848 1869, 2008.

Bernard Bercu, Antoine Godichon, and Bruno Portier. An eﬃcient stochastic newton algorithm for parameter estimation in logistic regressions. SIAM Journal on Control and Optimization, 58(1):348 367, 2020.

Jock A. Blackard and Denis J. Dean. Comparative accuracies of artiﬁcial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131 151, 1999.

Nicolo Cesa-Bianchi and G abor Lugosi. Prediction, Learning, and Games. Cambridge university press, 2006.

de Vilmarest and Wintenberger

George T. Diderrich. The Kalman ﬁlter from the perspective of Goldberger Theil estimators. The American Statistician, 39(3):193 198, 1985.

James Durbin and Siem J. Koopman. Time Series Analysis by State Space Methods. Oxford university press, 2012.

Ludwig Fahrmeir. Posterior mode estimation by extended Kalman ﬁltering for multivariate dynamic generalized linear models. Journal of the American Statistical Association, 87 (418):501 509, 1992.

David A. Freedman. On tail probabilities for martingales. the Annals of Probability, pages 100 118, 1975.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169 192, 2007.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on Learning Theory, pages 9 1, 2012.

Sham M. Kakade and Andrew Y. Ng. Online bounds for bayesian algorithms. In Advances in Neural Information Processing Systems, pages 641 648, 2005.

Rudolph E. Kalman and Richard S. Bucy. New results in linear ﬁltering and prediction theory. Journal of Basic Engineering, 83(1):95 108, 1961.

Ron Kohavi. Scaling up the accuracy of naive-bayes classiﬁers: A decision-tree hybrid. In The International Conference on Knowledge Discovery and Data Mining, volume 96, pages 202 207, 1996.

Tomer Koren. Open problem: Fast stochastic exp-concave optimization. In Conference on Learning Theory, pages 1073 1075, 2013.

Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Conference on Learning Theory, pages 1305 1320, 2015.

Peter Mc Cullagh and John A. Nelder. Generalized Linear Models. London Chapman and Hall, 2nd ed edition, 1989.

Noboru Murata and Shun-ichi Amari. Statistical analysis of learning dynamics. Signal Processing, 74(1):3 28, 1999.

Yann Ollivier. Online natural gradient as a Kalman ﬁlter. Electronic Journal of Statistics, 12(2):2930 2961, 2018.

Dmitrii M. Ostrovskii and Francis Bach. Finite-sample analysis of m-estimators using selfconcordance. Electronic Journal of Statistics, 15(1):326 391, 2021.

Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838 855, 1992.

Stochastic Online Optimization using Kalman Recursion

Phillippe Rigollet and Jan-Christian H utter. High dimensional statistics. Lecture notes for course 18S997, 2015.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400 407, 1951.

David Ruppert. Eﬃcient estimations from a slowly convergent robbins-monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389 434, 2012.

Martin Zinkevich. Online convex programming and generalized inﬁnitesimal gradient ascent. In International Conference on Machine Learning, pages 928 936, 2003.