# Horizon-Independent Minimax Linear Regression

Alan Malek
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
77 Massachusetts Avenue, Cambridge, MA 02139-4307, USA
amalek@mit.edu

Peter L. Bartlett
Department of EECS and Statistics
University of California
Berkeley, CA 94720-1776, USA
bartlett@cs.berkeley.edu

Abstract

We consider online linear regression: at each round, an adversary reveals a covariate vector, the learner predicts a real value, the adversary reveals a label, and the learner suffers the squared prediction error. The aim is to minimize the difference between the cumulative loss and that of the linear predictor that is best in hindsight. Previous work demonstrated that the minimax optimal strategy is easy to compute recursively from the end of the game; this requires the entire sequence of covariate vectors in advance. We show that, once provided with a measure of the scale of the problem, we can invert the recursion and play the minimax strategy without knowing the future covariates. Further, we show that this forward recursion remains optimal even against adaptively chosen labels and covariates, provided that the adversary adheres to a set of constraints that prevent misrepresentation of the scale of the problem. This strategy is horizon-independent: the regret and the minimax strategy depend on the size of the constraint set and not on the time horizon, and hence it incurs no more regret than the optimal strategy that knows in advance the number of rounds of the game. We also provide an interpretation of the minimax algorithm as a follow-the-regularized-leader strategy with a data-dependent regularizer, and we obtain an explicit expression for the minimax regret.

1 Introduction

Linear regression is a fundamental prediction problem in machine learning and statistics. In this paper, we study a sequential version: on round $t$, the adversary chooses and reveals a covariate vector $x_t \in \mathbb{R}^d$, the learner makes a real-valued prediction $\hat y_t$, the adversary chooses and reveals the true outcome $y_t \in \mathbb{R}$, and finally the learner is penalized by the square loss, $(\hat y_t - y_t)^2$. Since it is hopeless to guarantee a small loss (the adversary can always cause constant loss per round), we instead aim to predict almost as well as the best fixed linear predictor in hindsight. Letting $x_s^t$ and $y_s^t$ denote $x_s, \dots, x_t$ and $y_s, \dots, y_t$, respectively, define the regret of a strategy that predicts $\hat y_1^T$ as

$$R_T\big(\hat y_1^T, x_1^T, y_1^T\big) := \sum_{t=1}^{T} (\hat y_t - y_t)^2 - \min_{\theta \in \mathbb{R}^d} \sum_{t=1}^{T} \big(\theta^\top x_t - y_t\big)^2.$$

A strategy $s : \bigcup_{t \ge 1} (\mathbb{R}^d \times \mathbb{R})^{t-1} \times \mathbb{R}^d \to \mathbb{R}$ is a map from observations to predictions, and we define $R_T(s, x_1^T, y_1^T) := R_T(\hat y_1^T, x_1^T, y_1^T)$ where $\hat y_t = s(x_1, y_1, \dots, x_{t-1}, y_{t-1}, x_t)$. Our goal is to find a strategy that guarantees low regret for all data sequences. In particular, this paper is concerned with the minimax strategy $s^*$, the strategy that minimizes the worst-case regret over all covariate and outcome sequences in some constraint set; i.e., $s^*$ satisfies

$$\max_{x_1^T, y_1^T} R_T\big(s^*, x_1^T, y_1^T\big) = \min_{s} \max_{x_1^T, y_1^T} R_T\big(s, x_1^T, y_1^T\big).$$

In general, computing minimax strategies is computationally intractable because the optimal prediction $\hat y_t$ depends on the complete history $(x_1, y_1, \dots, x_{t-1}, y_{t-1}, x_t)$, and the dependence might be a rather arbitrary function of this enormous space of histories.
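To make the regret definition concrete, here is a minimal numpy sketch (ours, not from the paper) that evaluates $R_T$ for a given prediction sequence; `regret` is a hypothetical helper name, and the hindsight comparator is computed with a least-squares solve:

```python
import numpy as np

def regret(y_hat, X, y):
    """R_T: learner's cumulative square loss minus the loss of the best
    fixed linear predictor in hindsight (computed by least squares)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # hindsight comparator
    learner_loss = np.sum((y_hat - y) ** 2)
    comparator_loss = np.sum((X @ theta - y) ** 2)
    return learner_loss - comparator_loss
```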
So it is surprising that, in the case of fixed-design linear regression (where the strategy knows the covariate sequence in advance), the minimax strategy can be computed efficiently [Bartlett et al., 2015].

This paper builds on the results of Bartlett et al. [2015], which studied fixed-design online linear regression, where the game length $T$ and covariates $x_1^T := x_1, \dots, x_T$ are known to the learner a priori. Under constraints on the adversarial labels $y_1^T$, the value function and minimax strategy were calculable in closed form using backwards induction. The resulting minimax strategy,

$$\hat y_{t+1} = x_{t+1}^\top P_{t+1} \sum_{s=1}^{t} y_s x_s, \tag{MMS}$$

is a simple linear predictor with coefficient matrices defined by the recursion

$$P_t = P_{t+1} + P_{t+1}\, x_{t+1} x_{t+1}^\top P_{t+1}. \tag{1}$$

Each $P_t$ (and hence each $\hat y_t$) is a function of the whole sequence $x_1^T$, and thus an extension to online design seems difficult.

Given: covariate constraints $\mathcal{X}$ and label constraints $\mathcal{Y}(\{x_t\})$.
For $t = 1, 2, \dots$:
- Adversary chooses $x_t$ such that $x_1^t \in \mathcal{X}$
- Learner predicts $\hat y_t$
- Adversary may end the game
- Adversary reveals $y_t$ such that $y_1^t \in \mathcal{Y}(x_1^T)$
- Learner incurs loss $(\hat y_t - y_t)^2$
The game ends if no $x_{t+1}$ exists such that $x_1^{t+1} \in \mathcal{X}$.

Figure 1: Adversarial Covariates Protocol.

Our contributions. This paper extends the fixed-design setting to adversarial design, where neither the covariates nor the length of the game are fixed a priori. We use $\{x_t\}$ and $\{y_t\}$ to denote arbitrary-length sequences of covariates and labels, respectively. We allow the adversary to play any covariate sequence in some constraint set $\mathcal{X}$ and labels in some set $\mathcal{Y}(\{x_t\})$ (which may depend on the covariates). In particular, we identify a family $(\mathcal{X}, \mathcal{Y})$ parameterized by a positive-definite matrix $\Sigma$, representing the size of future covariates, and a scalar $\gamma_0$, representing the size of the future labels, and present a strategy that is minimax optimal against all adversarial sequences in this family. The algorithm need only know $\Sigma$, and the guarantee is horizon-independent in the sense that the family does not constrain the length of the covariate sequence and includes covariate sequences of arbitrary length for any $(\Sigma, \gamma_0)$ pair. The protocol of the general, horizon-independent setting is outlined in Figure 1. We derive the minimax strategy and show that it is optimal in the following sense.

Definition 1. A strategy $s^*$ is horizon-independent minimax optimal for some class $\mathcal{X}$ of covariate sequences and some class $\mathcal{Y}(\{x_t\})$ of label sequences, possibly depending on $\{x_t\} \in \mathcal{X}$, if

$$\sup_{T} \left( \sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}(x_1^T)} R_T\big(s^*, x_1^T, y_1^T\big) - \min_s \sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}(x_1^T)} R_T\big(s, x_1^T, y_1^T\big) \right) \le 0.$$

We require $s^*$ to have regret no larger than that of even a strategy that knows $T$. In other words, we establish a more natural measure of game length than the number of rounds. The covariate constraints on $\{x_t\}$ ensure that the adversary respects the scale constraint $\Sigma$, so that the learner is not led to under-regularize or over-regularize. The minimax strategy is efficient and is simultaneously minimax optimal against all covariate sequences corresponding to $\Sigma$. We motivate our constraint set by showing that every condition is necessary, and we also cast the minimax strategy as a follow-the-regularized-leader strategy with a data-dependent regularizer. Finally, we provide a general regret upper bound.
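The protocol of Figure 1 can be rendered as a simple driver loop. The sketch below is only schematic and is ours, not the paper's: `adversary` and `learner` are hypothetical objects, and the membership checks for $\mathcal{X}$ and $\mathcal{Y}$ are assumed to be handled by the adversary.

```python
def play(adversary, learner):
    """Schematic of the adversarial covariates protocol (Figure 1)."""
    total_loss = 0.0
    t = 0
    while True:
        t += 1
        x_t = adversary.choose_covariate(t)        # must keep x_1^t in X
        if x_t is None:                            # adversary ends the game
            break
        y_hat = learner.predict(x_t)
        y_t = adversary.choose_label(t, y_hat)     # must keep y_1^t in Y
        total_loss += (y_hat - y_t) ** 2
        learner.update(x_t, y_t)
    return total_loss
```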
Outline. We begin with a review of how backwards induction is used to derive the fixed-design minimax algorithm (MMS) in Section 2. By inverting the recursion, we show in Section 3 how to calculate (MMS) given only $P_0$, and thus we have the minimax strategy for any covariate sequence that perfectly agrees with the given $P_0$. Section 4 greatly expands the scope of our algorithm by deriving weaker conditions on the adversary and proves that, under these conditions, the same minimax strategy is horizon-independent minimax optimal. We argue that these conditions are necessary. We then interpret the minimax strategy as follow-the-regularized-leader with a specific, data-dependent regularizer in Section 5.

Related Work. While linear regression has a long history in statistics and optimization, its online sibling is much more recent, starting with the work of Foster [1991], which considered binary labels and $\ell_1$-constrained parameters $\theta$. He proved an $O(d \log(dT))$ regret bound for an $\ell_2$-regularized follow-the-leader strategy. Cesa-Bianchi et al. [1996] considered $\ell_2$-constrained parameters and gave $O(\sqrt{T})$ regret bounds for a gradient descent algorithm with $\ell_2$ regularization. Kivinen and Warmuth [1997] showed that an Exponentiated Gradient algorithm with relative entropy regularization gives the same regret without the need for a constraint on the parameters. Vovk [1998] applied the Aggregating Algorithm [Vovk, 1990] to continuously many experts and arrived at a scale-free algorithm by using the inverse second moment matrix of past and current covariates. Forster [1999] and Azoury and Warmuth [2001] showed that this algorithm is last-step minimax and achieves an $O(\log T)$ scale-dependent regret bound. (See also the work of Moroshko and Crammer [2014] on last-step minimax.) Takimoto and Warmuth [2000] obtained the minimax strategy for prediction in Euclidean space with squared loss. This was extended to more general losses in [Koolen et al., 2014] and to tracking problems in [Koolen et al., 2015]. Finally, Bartlett et al. [2015] obtained the minimax strategy for fixed-design linear regression. We present this strategy in the next section, because we build on these results. In these papers, the minimax analysis provides a natural, data-dependent regularization, in contrast to the follow-the-leader methods described above. We make this comparison explicit in Section 5 by calculating the implied regularization.

2 Fixed Design Linear Regression

We begin by summarizing the main results of Bartlett et al. [2015]. Recall that in the fixed-design setting, the game length $T$ and covariates $x_1^T$ are fixed and known to both players. Define the summary statistics $s_t := \sum_{s=1}^{t} y_s x_s$, $\sigma_t^2 := \sum_{s=1}^{t} y_s^2$, and $\Pi_t := \sum_{s=1}^{t} x_s x_s^\top$. The minimax strategy can be computed by solving the offline problem

$$\min_{\theta} \sum_{t=1}^{T} \big(x_t^\top \theta - y_t\big)^2 = \sum_{t=1}^{T} y_t^2 - s_T^\top \Pi_T^\dagger s_T,$$

where $M^\dagger$ is the pseudo-inverse of a matrix $M$. The optimal actions $\hat y_t$ and $y_t$ are computed as a function of the state $s_{t-1}$ and covariates $x_1^T$ by solving the backward induction

$$V\big(s_t, \sigma_t^2, t, x_1^T\big) := \min_{\hat y_{t+1}} \max_{y_{t+1}} \Big\{ (\hat y_{t+1} - y_{t+1})^2 + V\big(s_t + y_{t+1} x_{t+1},\ \sigma_t^2 + y_{t+1}^2,\ t+1,\ x_1^T\big) \Big\}$$

with base case $V\big(s_T, \sigma_T^2, T, x_1^T\big) := -\min_{\theta \in \mathbb{R}^d} \sum_{t=1}^{T} \big(\theta^\top x_t - y_t\big)^2 = s_T^\top \Pi_T^\dagger s_T - \sigma_T^2$. The arguments of $V$ include $x_1^T$ to emphasize the fixed-design setting. Performing the backwards induction generates plays $\hat y_1^T$ and $y_1^T$ that witness the value of the game,

$$\min_{\hat y_1} \max_{y_1} \cdots \min_{\hat y_T} \max_{y_T} \left\{ \sum_{t=1}^{T} (\hat y_t - y_t)^2 - \min_{w \in \mathbb{R}^d} \sum_{t=1}^{T} \big(w^\top x_t - y_t\big)^2 \right\},$$

which is the minimum guaranteeable regret against all data sequences. The resulting minimax strategy is precisely the linear predictor

$$\hat y_{t+1} = x_{t+1}^\top P_{t+1} s_t, \tag{MMS}$$

with coefficient matrices defined by the recursion (1).
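For concreteness, the fixed-design strategy can be written in a few lines of numpy. This is our own sketch of the construction above (function names are ours): it runs the backward recursion (1) from $P_T = \Pi_T^\dagger$ and then plays $\hat y_{t+1} = x_{t+1}^\top P_{t+1} s_t$.

```python
import numpy as np

def backward_coefficients(X):
    """P_0, ..., P_T via recursion (1), given all covariates in advance.
    X has shape (T, d); base case P_T = pinv(Pi_T) with Pi_T = X^T X."""
    T, d = X.shape
    P = [None] * (T + 1)
    P[T] = np.linalg.pinv(X.T @ X)
    for t in range(T - 1, -1, -1):                 # P_t from P_{t+1} and x_{t+1}
        x = X[t]                                   # row t is x_{t+1} (0-indexed)
        P[t] = P[t + 1] + P[t + 1] @ np.outer(x, x) @ P[t + 1]
    return P

def mms_fixed_design(X, y):
    """Fixed-design (MMS): y_hat_{t+1} = x_{t+1}^T P_{t+1} s_t."""
    T, d = X.shape
    P = backward_coefficients(X)
    s = np.zeros(d)
    y_hat = np.zeros(T)
    for t in range(T):
        y_hat[t] = X[t] @ P[t + 1] @ s             # s currently holds s_t
        s += y[t] * X[t]                           # s_{t+1} = s_t + y_{t+1} x_{t+1}
    return y_hat
```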
Note that $P_t$ is a function of every covariate in $x_1^T$. The minimax strategy is similar to follow-the-leader, which would predict with $\Pi_t^\dagger$ in place of $P_t$; however, $P_t$ is a shrunken version of $\Pi_t^\dagger$ that takes future covariances into account. The main result of Bartlett et al. [2015] is the minimax optimality of (MMS) for the following classes. For some fixed sequence of positive label budgets $B_1, \dots, B_T > 0$, define:

1. Label constraints on $y_t$: $\mathcal{L}(B_1^T) := \big\{ y_1^T : |y_t| \le B_t \ \forall t = 1, \dots, T \big\}$.
2. Box constraints on $x_t$: $\mathcal{B}(B_1^T) := \big\{ x_1^T : B_t\, x_t^\top P_t x_t \ge \sum_{s=1}^{t-1} \big|x_t^\top P_t x_s\big|\, B_s \ \text{for all } 2 \le t \le T \big\}$.
3. Ellipsoidal constraints: $\mathcal{E}(x_1^T, R) := \big\{ y_1^T : \sum_{t=1}^{T} y_t^2\, x_t^\top P_t x_t \le R \big\}$.

Theorem 1 [Bartlett et al., 2015, Theorems 2 and 10]. For each $x_1^T$, the corresponding strategy (MMS) is minimax optimal with respect to $\mathcal{L}(B_1^T)$ whenever $x_1^T \in \mathcal{B}(B_1^T)$, and with respect to $\mathcal{E}(x_1^T, R)$, for any $B_t > 0$ sequence and any $R > 0$, in the following sense:

(1) If $x_1^T \in \mathcal{B}(B_1^T)$, then
$$\sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big((\text{MMS}), x_1^T, y_1^T\big) = \min_s \sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s, x_1^T, y_1^T\big) = \sum_{t=1}^{T} B_t^2\, x_t^\top P_t x_t;$$

(2) $$\sup_{y_1^T \in \mathcal{E}(x_1^T, R)} R_T\big((\text{MMS}), x_1^T, y_1^T\big) = \min_s \sup_{y_1^T \in \mathcal{E}(x_1^T, R)} R_T\big(s, x_1^T, y_1^T\big) = R.$$

3 The Forward Algorithm

The previous section described the fixed-design minimax strategy and established sufficient conditions for its optimality. Unfortunately, $P_t$ is recursively defined as a function of the entire $x_1^T$ sequence. In this section, we show that it is possible to remove the fixed-design and known-game-length requirements if we limit the adversary to play sequences that satisfy the adversarial covariate conditions. Letting $\mathbb{X} = \bigcup_{T > 0} \mathbb{R}^{d \times T}$ denote the set of covariate sequences of finite length, define

$$\mathcal{A}(\Sigma) := \big\{ x_1^T \in \mathbb{X} : \text{for } P_0, \dots, P_T \text{ defined by (1)},\ P_0 \preceq \Sigma \big\} \quad \text{and} \quad \bar{\mathcal{A}}(\Sigma) := \big\{ x_1^T \in \mathbb{X} : \text{for } P_0, \dots, P_T \text{ defined by (1)},\ P_0 = \Sigma \big\}; \tag{2}$$

that is, $x_1^T \in \mathcal{A}(\Sigma)$ if the $P_t$ computed by applying (1) to the sequence $x_1^T$ result in $P_0 \preceq \Sigma$. The key insight of this section is that it is possible to invert the $P_t$ recursion: we can compute $P_t$ from $P_{t-1}$ and $x_t$. Hence, if we are given $P_0$, then we can compute every $P_t$ online. For some initial condition $\Sigma$, define the forward recursion with base case $P_0 = \Sigma$ and induction step

$$P_t := P_{t-1} - \frac{a_t}{b_t^2}\, P_{t-1} x_t x_t^\top P_{t-1}, \quad \text{where} \quad b_t^2 := x_t^\top P_{t-1} x_t, \qquad a_t := \frac{\sqrt{4 b_t^2 + 1} - 1}{\sqrt{4 b_t^2 + 1} + 1}. \tag{3}$$

The prediction matrix $P_t$ is a function of $\Sigma$ and $x_1^t$ only. For the rest of the paper, we will define (MMS) with respect to the forward recursion, i.e. $\hat y_t := x_t^\top P_t s_{t-1}$, where $P_t$ is defined by recursion (3). The calculation of $\hat y_t$ only requires knowledge of $\Sigma$, $x_1^t$, and $y_1^{t-1}$, all of which are available to the learner when choosing $\hat y_t$. The algorithm needs $O(d^2)$ memory, and the computational cost of each round is $O(d^2)$.
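One step of the forward recursion (3) is a rank-one downdate of $P_{t-1}$. Here is a minimal sketch (ours), with a guard for the degenerate case $b_t^2 = 0$, which we handle by leaving $P$ unchanged — an assumption on our part, not something the paper specifies:

```python
import numpy as np

def forward_step(P_prev, x):
    """P_t from P_{t-1} and x_t via the forward recursion (3)."""
    b2 = float(x @ P_prev @ x)                     # b_t^2 = x_t^T P_{t-1} x_t
    if b2 <= 0.0:                                  # degenerate direction (assumed no-op)
        return P_prev
    r = np.sqrt(4.0 * b2 + 1.0)
    a = (r - 1.0) / (r + 1.0)                      # a_t as defined in (3)
    Px = P_prev @ x
    return P_prev - (a / b2) * np.outer(Px, Px)
```

Maintaining $s_{t-1}$ alongside $P_t$ then yields the online prediction $\hat y_t = x_t^\top P_t s_{t-1}$ in $O(d^2)$ time per round, matching the cost quoted above.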
It is essential that the two recursions are equivalent, which is guaranteed by the following lemma.

Lemma 1. Let $\Sigma \succeq 0$ be a positive semidefinite matrix. For any covariate sequence $x_1^T \in \bar{\mathcal{A}}(\Sigma)$, the $P_t$ matrices defined by the backwards recursion (1) applied to $x_1^T$ are identical to the $P_t$ matrices defined by the forward recursion (3) with base case $P_0 = \Sigma$ and updates given by $x_1^T$.

Proof. Let $P'_t$ be defined by the forward recursion starting from $P'_0 = \Sigma$ and let $P_t$ be defined by the backwards recursion (1). Our goal is to show that $P_t = P'_t$ for all $t$. The base case $P_0 = P'_0$ is a simple consequence of [Bartlett et al., 2015, Lemma 11], which uses repeated applications of Sherman–Morrison to show that

$$P_t^\dagger = \Pi_t + \sum_{s=t+1}^{T} \frac{x_s^\top P_s x_s}{1 + x_s^\top P_s x_s}\, x_s x_s^\top. \tag{4}$$

Now, assuming the induction hypothesis $P'_{t-1} = P_{t-1}$, we can evaluate

$$P'_t = P_{t-1} - \frac{a_t}{b_t^2}\, P_{t-1} x_t x_t^\top P_{t-1} = P_t + P_t x_t x_t^\top P_t - \frac{a_t}{b_t^2}\big(P_t + P_t x_t x_t^\top P_t\big) x_t x_t^\top \big(P_t + P_t x_t x_t^\top P_t\big) = P_t + P_t x_t \left( 1 - \frac{a_t}{b_t^2}\Big(1 + 2\, x_t^\top P_t x_t + \big(x_t^\top P_t x_t\big)^2\Big) \right) x_t^\top P_t. \tag{5}$$

By definition, we have $b_t^2 = x_t^\top P_{t-1} x_t = x_t^\top P_t x_t + \big(x_t^\top P_t x_t\big)^2$, which we can invert to find that

$$x_t^\top P_t x_t = \frac{\sqrt{4 b_t^2 + 1} - 1}{2}.$$

Plugging this in, the term in parentheses in (5) is

$$1 - \frac{a_t}{b_t^2}\Big(1 + 2\, x_t^\top P_t x_t + \big(x_t^\top P_t x_t\big)^2\Big) = 1 - \frac{a_t}{b_t^2}\left(\frac{\sqrt{4 b_t^2 + 1} + 1}{2}\right)^2 = 1 - \frac{\big(\sqrt{4 b_t^2 + 1} - 1\big)\big(\sqrt{4 b_t^2 + 1} + 1\big)}{4\, b_t^2} = 0,$$

implying that $P'_t = P_t$, as desired.

Our first result is that this algorithm is actually minimax optimal if we constrain the adversary to play in $\bar{\mathcal{A}}(\Sigma)$. Another interpretation is that $\Sigma$ encodes all the scale information the learner needs to respond optimally. That is, (MMS) performs as well as the best strategy that sees the covariate sequence in advance. In particular, knowledge of $\Sigma$, not of $T$, is what the learner needs.

Theorem 2. For all positive semidefinite $\Sigma$, label bounds $B_1, B_2, \dots \ge b > 0$, and any constant $R > 0$, the minimax strategy (MMS) using the forward recursion (3) starting from $P_0 = \Sigma$ is horizon-independent minimax optimal, i.e.

$$\sup_T \sup_{x_1^T \in \mathcal{X}} \left( \sup_{y_1^T \in \mathcal{Y}(x_1^T)} R_T\big(s^*, x_1^T, y_1^T\big) - \min_s \sup_{y_1^T \in \mathcal{Y}(x_1^T)} R_T\big(s, x_1^T, y_1^T\big) \right) \le 0$$

for $\big(\mathcal{X}, \mathcal{Y}(x_1^T)\big)$ equal to either $\big(\bar{\mathcal{A}}(\Sigma),\ \mathcal{E}(x_1^T, R)\big)$ or $\big(\mathcal{B}(B_1^T) \cap \bar{\mathcal{A}}(\Sigma),\ \mathcal{L}(B_1^T)\big)$.

Proof of Theorem 2. Since $x_1^T \in \bar{\mathcal{A}}(\Sigma)$, Lemma 1 implies that the $P_t$ matrices from the forwards and backwards recursions are equivalent, and therefore (MMS) corresponds to the minimax strategy for the fixed-design game with $P_0 = \Sigma$. We can then apply Theorem 1, part (1), which yields

$$\sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s^*, x_1^T, y_1^T\big) - \min_s \sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s, x_1^T, y_1^T\big) = 0.$$

Since this holds for all $x_1^T$, we actually get the stronger result

$$\sup_T \sup_{x_1^T \in \mathcal{B}(B_1^T) \cap \bar{\mathcal{A}}(\Sigma)} \left( \sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s^*, x_1^T, y_1^T\big) - \min_s \sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s, x_1^T, y_1^T\big) \right) \le 0.$$

Identical reasoning extends part (2) of Theorem 1 to the adversarial covariate context.

The adversarial covariate conditions are defined for entire $x_1^T$ sequences, but there is an online characterization, derived from the following lemma.

Lemma 2. Consider any $t \ge 0$, $x_1, \dots, x_t$, and symmetric matrix $P \succeq 0$. We have that $P^\dagger \succeq \Pi_t$ if and only if, for any $T \ge t + \operatorname{rank}\big(P^\dagger - \Pi_t\big)$, there is a continuation of the covariate sequence, $x_{t+1}, \dots, x_T$, such that setting $P_t = P$ and defining $P_{t+1}, \dots, P_T$ by the forward recursion (3) gives $P_T = \Pi_T^\dagger$.

A stronger version with proof is presented in the Appendix as Theorem 6; it explicitly derives conditions on $x_{t+1}$ that ensure $P^\dagger \succeq \Pi_t$. In words, a sequence of covariates $x_1^t$ is the prefix of some $x_1^T \in \bar{\mathcal{A}}(\Sigma)$ if $P_s^\dagger \succeq \Pi_s$ for all $s \le t$, where $P_s$ corresponds to the forward recursion (3) defined by initial condition $P_0 = \Sigma$ and covariates $x_1^t$. Hence, it is equivalent to constrain the adversary to play $x_t$ satisfying this condition at every round, and we do not require the adversary to fix the covariate sequence in advance; that is, it is equivalent to define

$$\mathcal{A}(\Sigma) = \big\{ x_1^T \in \mathbb{X} : P_0 = \Sigma \ \text{and} \ P_t^\dagger \succeq \Pi_t \ \forall t \ge 1 \big\}, \quad \text{and} \tag{6}$$

$$\bar{\mathcal{A}}(\Sigma) = \big\{ x_1^T \in \mathbb{X} : P_0 = \Sigma,\ P_t^\dagger \succeq \Pi_t \ \forall t \ge 1,\ \text{and} \ P_T = \Pi_T^\dagger \big\}. \tag{7}$$
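The online membership test in (6) is straightforward to monitor numerically. Below is a hedged rendering (ours, not the paper's): it reads the condition as the PSD ordering $P_t^\dagger \succeq \Pi_t$ and absorbs floating-point error with a tolerance.

```python
import numpy as np

def prefix_admissible(P_t, Pi_t, tol=1e-9):
    """Check P_t^dagger >= Pi_t in the PSD order: by Lemma 2, the prefix
    x_1^t can then still be continued to exhaust the covariance budget."""
    diff = np.linalg.pinv(P_t) - Pi_t
    return bool(np.linalg.eigvalsh(diff).min() >= -tol)
```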
4 Expanding the Minimax Conditions

The strategy (MMS) is minimax optimal for any covariate sequence $x_1^T \in \bar{\mathcal{A}}(\Sigma)$, i.e., when the adversary plays covariates that meet the $\Sigma$ constraint with equality, which is quite restrictive. In this section, we identify a much broader set of constraints on the adversary's actions under which (MMS) remains the best learner response. These conditions allow for adversarial design; the data may be chosen in response to the learner's actions.

A natural relaxation is to remove the equality constraint; this results in a set of constraints on the adversary where the labels $\{y_t\}$ are in $\mathcal{L}(\{B_t\}) := \{\{y_t\} : |y_t| \le B_t \ \forall t \ge 1\}$, and the covariates $\{x_t\}$ are in $\mathcal{A}(\Sigma) \cap \mathcal{B}(\Sigma)$, where

$$\mathcal{B}(\Sigma) = \Big\{ \{x_t\} : B_t\, x_t^\top P_t x_t \ge \sum_{s=1}^{t-1} \big|x_t^\top P_t x_s\big|\, B_s \ \ \forall t \ge 1 \Big\}.$$

The $\mathcal{B}(\Sigma)$ condition is necessary for an efficient algorithm [Bartlett et al., 2015], and without the $\mathcal{A}(\Sigma)$ condition, the adversary could choose $x_t$ to be a scaled version of $s_{t-1}$ and $y_t = \theta_{t-1}^\top x_t$, where $\theta_{t-1}$ is the best least-squares predictor for $x_1^{t-1}$ and $y_1^{t-1}$. The comparator suffers no additional loss, the algorithm suffers some loss, and we can scale $x_t$ such that the $\mathcal{B}(\Sigma)$ conditions are satisfied. To summarize, without the $\mathcal{A}$ constraint, the adversary can cause arbitrary regret. However, the $\mathcal{A}$ and $\mathcal{B}$ constraints alone are not sufficient to guarantee a solvable game:

Lemma 3. Fix any $\Sigma$ and any $\{B_t\}$ with $B_t \ge b > 0$ for all $t$. Then, for any $M > 0$, there exist $x_1^T \in \mathcal{A}(\Sigma) \cap \mathcal{B}(\Sigma)$ and $y_1^T \in \mathcal{L}(B_1^T)$ such that the minimax regret is larger than $M$.

A covariate budget alone is not sufficient for a minimax algorithm; it is not even clear how to define minimax strategies when the regrets are unbounded. Hence, we introduce continuation constraints (the name will become clear soon). Let $\gamma_0 > 0$ be some initial label budget and define

$$\gamma_t = \gamma_{t-1} - B_t^2\, x_t^\top P_t x_t,$$

with $P_t$ defined by the forward recursion (3). Let $\mathbb{B}(B_1^t) := \{\xi \in \mathbb{R}^t : |\xi_i| \le B_i,\ i = 1, \dots, t\}$ be the hypercube with sides of length $B_1, \dots, B_t$, and let $X_t$ be the matrix with columns $x_1, \dots, x_t$. For a given covariate budget $\Sigma$ and label budget $\gamma_0$, define the continuation condition

$$\mathcal{C}(\Sigma, \gamma_0) := \Big\{ x_1^T : \gamma_t \ge \xi^\top X_t^\top \big(\Pi_t^\dagger - P_t\big) X_t\, \xi \ \ \forall \xi \in \mathbb{B}(B_1^t) \ \text{and} \ t = 1, \dots, T \Big\}, \tag{8}$$

which is equivalent to requiring that $s_t^\top \big(\Pi_t^\dagger - P_t\big) s_t \le \gamma_t$ for all possible $s_t$. The rest of this section proves the main result of this paper: if the adversary plays in $\mathcal{ABC}(\Sigma, \gamma_0) := \mathcal{A}(\Sigma) \cap \mathcal{B}(\Sigma) \cap \mathcal{C}(\Sigma, \gamma_0)$, then (MMS) is minimax optimal.

Theorem 3. Consider the two-player game defined in Figure 1. For any $\{B_t\}$ with $B_t > 0$, $\Sigma \succ 0$, and $\gamma_0 \ge 0$, the learner strategy (MMS) has minimax regret $\gamma_0$ and is horizon-independent minimax optimal for $x_1^T \in \mathcal{X} = \mathcal{ABC}(\Sigma, \gamma_0)$ and $y_1^T \in \mathcal{Y} = \mathcal{L}(\{B_t\})$. That is,

$$\sup_T \left( \sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}} R_T\big((\text{MMS}), x_1^T, y_1^T\big) - \min_s \sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}} R_T\big(s, x_1^T, y_1^T\big) \right) \le 0.$$

We will prove Theorem 3 by first considering adversarial strategies under $\mathcal{A}(\Sigma)$ with a fixed game length. We show that, somewhat counterintuitively, the adversary may cause more regret by not using the entire $\Sigma$ budget. Then, we show that the $\mathcal{C}$ condition eliminates these troublesome cases and the adversary exhausts the budget; therefore, the adversary plays $x_1^T \in \bar{\mathcal{A}}(\Sigma)$, which implies that (MMS) is minimax optimal by the results of the previous section. Finally, we note that all the previous arguments apply uniformly across $T$, and since (MMS) is ignorant of $T$, it must be horizon-independent minimax optimal. The $\Sigma$ constraint, not the game length, seems to be the correct notion of game size.
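The label budget is a single scalar tracked alongside the forward recursion. Below is a minimal sketch (ours) of the $\gamma_t$ update and a spot-check of the continuation condition (8) for one realized $s_t$; note that (8) itself quantifies over the whole hypercube $\mathbb{B}(B_1^t)$, not a single point.

```python
import numpy as np

def update_gamma(gamma_prev, B_t, x_t, P_t):
    """Label-budget update: gamma_t = gamma_{t-1} - B_t^2 * x_t^T P_t x_t."""
    return gamma_prev - B_t ** 2 * float(x_t @ P_t @ x_t)

def continuation_ok(gamma_t, s_t, P_t, Pi_t, tol=1e-9):
    """One-point check of (8): s_t^T (Pi_t^dagger - P_t) s_t <= gamma_t."""
    q = float(s_t @ (np.linalg.pinv(Pi_t) - P_t) @ s_t)
    return q <= gamma_t + tol
```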
4.1 Limiting $T$

Consider a fixed $T > 0$ and define

$$\mathcal{A}_T(\Sigma) := \big\{ x_1^T \in \mathbb{R}^{d \times T} : P_0 = \Sigma \ \text{and} \ P_t^\dagger \succeq \Pi_t \ \forall\, 1 \le t \le T \big\},$$

the restriction of $\mathcal{A}(\Sigma)$ to sequences of length $T$. The goal of this section is to show (i) that it is possible for the adversary to cause more regret by not using up the covariance budget, i.e. $P_T \neq \Pi_T^\dagger$, and (ii) that the $\mathcal{C}$ conditions are sufficient to prevent this. We cannot calculate the minimax solution of the $\mathcal{A}_T(\Sigma)$ game directly: Section G in the appendix explicitly evaluates the first backwards induction step; it is quite complicated and has no closed-form solution, which suggests that efficient backwards induction is unlikely. Instead, we will study the related fixed-design early-stopping game. For some fixed $x_1^T$, the game protocol is: at round $t$, the learner predicts $\hat y_t$, then the adversary chooses $e_t \in \{0, 1\}$ and $y_t \in \mathcal{L}(B_1^T)$. If $e_t = 1$, the learner incurs loss $(\hat y_t - y_t)^2$ and the game continues; if $e_t = 0$, the game ends.

Intuitively, the adversary may be able to cause more regret because the learner is regularizing for a covariance budget corresponding to $x_1^T$, and therefore ending the game early causes the learner to over-regularize. We will derive $\mathcal{C}$ as a condition under which the adversary always continues to $T$. In turn, this implies that the adversary will use up the $\Sigma$ budget in the $\mathcal{A}_T$ game: any $x_1^T$ with remaining $\Sigma$ budget has a continuation $x_1^{T+k} \in \bar{\mathcal{A}}(\Sigma)$ by Lemma 2, and the $\mathcal{C}$ condition implies that the adversary will continue until $T + k$ and use up the budget. We will make this argument formal.

We begin by defining an incremental version of regret. Define

$$\Delta_t := \min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t} \big(\theta^\top x_s - y_s\big)^2 - \min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t-1} \big(\theta^\top x_s - y_s\big)^2,$$

the additional loss suffered by the comparator from playing $t$ rounds instead of $t - 1$ rounds. We have $\Delta_t \ge 0$ and $L_T^* = \sum_{t=1}^{T} \Delta_t$, where $L_T^*$ denotes the comparator's cumulative loss. The regret of the game with early stopping can be written as

$$R_T = \sum_{t=1}^{T} \left( \prod_{s=1}^{t} e_s \right) \Big( (y_t - \hat y_t)^2 - \Delta_t \Big).$$

One might notice that $\Delta_t = 0$ for the choice $y_t = \theta_{t-1}^\top x_t$, where $\theta_{t-1}$ is the ordinary least-squares solution on the data through time $t - 1$, so under this choice the regret always increases. However, this choice of $y_t$ may violate the label constraints — for example, when $B_t = 1$ and $\|x_t\|$ is increasing. Additionally, we want a constraint under which the adversary wants to play all remaining rounds, not just the next one, and hence the constraint on $y_t$ will depend on the future covariates.

The value-to-go definition also needs to be adapted to the incremental setting. To this end, we define the instantaneous value-to-go $W(s_t, \sigma_t^2, t, x_1^T)$ by $W(s_T, \sigma_T^2, T, x_1^T) = 0$ and

$$W\big(s_{t-1}, \sigma_{t-1}^2, t-1, x_1^T\big) = \max_{e_t \in \{0,1\}} e_t \Big( \min_{\hat y_t} \max_{y_t} \big\{ (\hat y_t - y_t)^2 - \Delta_t + W\big(s_t, \sigma_t^2, t, x_1^T\big) \big\} \Big),$$

where the statistics are updated as $s_t = s_{t-1} + y_t x_t$ and $\sigma_t^2 = \sigma_{t-1}^2 + y_t^2$. It is easy to check that $W_0$ is the minimax regret for this game and that it equals the regret of the fixed-design game when the adversary plays every round.
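The increment $\Delta_t$ can be evaluated directly by differencing two least-squares fits; here is a numpy sketch (ours, with 1-indexed rounds) that is useful for sanity-checking Lemma 4 below:

```python
import numpy as np

def marginal_comparator_loss(X, y, t):
    """Delta_t: extra hindsight-comparator loss from playing round t."""
    def best_loss(k):
        if k == 0:
            return 0.0
        theta, *_ = np.linalg.lstsq(X[:k], y[:k], rcond=None)
        return float(np.sum((X[:k] @ theta - y[:k]) ** 2))
    return best_loss(t) - best_loss(t - 1)
```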
4.2 Calculating the Instantaneous Value-to-go

This section derives $\mathcal{C}$ as the condition under which $e_t = 1$ for all $t$ and evaluates $W_t$. Throughout, $\mathcal{R}(M)$ denotes the row space of a matrix $M$. The proofs in this section are heavy on calculation and have been collected in Appendix B. We begin by explicitly calculating $\Delta_t$.

Lemma 4. The marginal loss for the comparator of playing another round with covariate $x_t = \bar x_t + x_t^\perp$, where $\bar x_t \in \mathcal{R}(\Pi_{t-1})$ and $x_t^\perp$ is the component orthogonal to it, is

$$\Delta_t = y_t^2 \big(1 - x_t^\top \Pi_t^\dagger x_t\big) - 2\, y_t\, s_{t-1}^\top \Pi_t^\dagger x_t + \frac{\big(s_{t-1}^\top \Pi_t^\dagger x_t\big)^2\, \bar x_t^\top \Pi_{t-1}^\dagger \bar x_t}{x_t^\top \Pi_t^\dagger x_t}.$$

Theorem 4. Consider the fixed-design game with early stopping, with covariates $x_1^T$. Define the $P_t$ by the backwards recursion (1) and define $\gamma_t = \sum_{s=t+1}^{T} B_s^2\, x_s^\top P_s x_s$. Suppose that, for all $t$, $\gamma_t \ge s_t^\top \big(\Pi_t^\dagger - P_t\big) s_t$. Then the instantaneous value-to-go is $W\big(s_t, \sigma_t^2, t, x_1^T\big) = s_t^\top \big(P_t - \Pi_t^\dagger\big) s_t + \gamma_t$, the adversary causes more regret by continuing the game, and the optimal learner strategy is (MMS).

Proof outline. The proof is by induction, where the base case is easily established with $\gamma_T = 0$ and $P_T = \Pi_T^\dagger$. Now, assuming that $W\big(s_t, \sigma_t^2, t, x_1^T\big) = s_t^\top \big(P_t - \Pi_t^\dagger\big) s_t + \gamma_t$, we wish to calculate the $t - 1$ case by evaluating

$$W\big(s_{t-1}, \sigma_{t-1}^2, t-1, x_1^T\big) = \max_{e_t \in \{0,1\}} e_t \Big( \min_{\hat y_t} \max_{y_t} \big\{ (\hat y_t - y_t)^2 - \Delta_t + W\big(s_t, \sigma_t^2, t, x_1^T\big) \big\} \Big).$$

We use our expression for $\Delta_t$, perform elementary calculations to evaluate the saddle point, and show that the above evaluates to

$$\max\left\{ \big(s_{t-1}^\top P_t x_t\big)^2 + B_t^2\, x_t^\top P_t x_t - \frac{\big(s_{t-1}^\top \Pi_t^\dagger x_t\big)^2\, \bar x_t^\top \Pi_{t-1}^\dagger \bar x_t}{x_t^\top \Pi_t^\dagger x_t} + s_{t-1}^\top \big(P_t - \Pi_t^\dagger\big) s_{t-1} + \gamma_t,\ 0 \right\},$$

which can be shown to always take the first value so long as $\gamma_{t-1} \ge s_{t-1}^\top \big(\Pi_{t-1}^\dagger - P_{t-1}\big) s_{t-1}$. In this case, the induction hypothesis is verified with the $P_t$ update described in the theorem.

This implies that the instantaneous value-to-go is always positive and that an optimal adversary will always continue. As a consequence, the covariate sequence satisfies $x_1^T \in \bar{\mathcal{A}}(P_0)$, which confirms that (MMS) using the forward recursion is minimax optimal via Theorem 2.

All the ingredients are in place to prove our main result. For convenience, define $\overline{\mathcal{ABC}}(\Sigma, \gamma_0) := \big\{x_1^T \in \mathcal{ABC}(\Sigma, \gamma_0) : P_T = \Pi_T^\dagger,\ \gamma_T = 0\big\}$, the set of sequences that deplete the $\Sigma$ and $\gamma_0$ budgets. Roughly, we will argue that, under $\mathcal{C}(\Sigma, \gamma_0)$, the adversary causes the most regret by playing $x_1^T \in \bar{\mathcal{A}}(\Sigma)$, which implies that $x_1^T \in \overline{\mathcal{ABC}}(\Sigma, \gamma_0)$ and the regret is $\gamma_0$. The first step in the analysis is to check that the constraint set is non-trivial.

Lemma 5. Consider the game defined by $\Sigma \succ 0$, $\gamma_0 > 0$, and a $\{B_t\}$ sequence. If there exists some $T$ such that $\sum_{t=1}^{T} \frac{B_t^2}{t + \log(T+1)} \ge \gamma_0$, then there exists a covariate sequence $x_1^T \in \overline{\mathcal{ABC}}(\Sigma, \gamma_0)$. In particular, any $B_t$ that are bounded below satisfy this condition.

In reasoning about optimal strategies, Theorem 4 allows us to easily establish conditions under which the learner is playing suboptimally and could be causing more regret. However, Theorem 4 applies to a fixed-design game that is allowed to stop early, and we wish to reason about the adversarial covariate case. The next lemma makes the crucial connection.

Lemma 6. Suppose $x_1^t \in \mathcal{ABC}(\Sigma, \gamma_0)$ but $\gamma_t > 0$. Then there exists an extension $x_{t+1}, \dots, x_T$ such that $x_1^T \in \overline{\mathcal{ABC}}(\Sigma, \gamma_0)$ and the instantaneous value-to-go equals $W\big(s_t, \sigma_t^2, t, x_1^T\big) = s_t^\top \big(P_t - \Pi_t^\dagger\big) s_t + \gamma_t$.

The proof is a simple consequence of checking that the extension of Lemma 2 is compatible with condition $\mathcal{C}$. We can now prove the minimax optimality of (MMS) for the $\mathcal{ABC}$ game.

Proof of Theorem 3. We will show something stronger: the optimal adversary strategy for the game in Figure 1 plays an $x_1^T$ sequence in $\overline{\mathcal{ABC}}$ and causes exactly $\gamma_0$ regret against (MMS). First, assume that the game stops before round $T + 1$, so that $x_1, \dots, x_T$ have been played. There are four possible scenarios depending on whether the $\Sigma$ and $\gamma_0$ budgets are exhausted.

Case: both budgets exhausted. In this case, $x_1^T \in \overline{\mathcal{ABC}}(\Sigma, \gamma_0)$ and optimality holds by the results of Section 3.

Case: neither budget exhausted. We apply Lemma 2 to conclude that there exists a covariate sequence $x_{T+1}^{T+k}$ that uses up the $\Sigma$ budget. The $\mathcal{C}(\Sigma, \gamma_0)$ constraint guarantees that the adversary can cause more regret by playing these rounds. Hence, an adversary that exhausts neither budget is suboptimal.

Case: only the $\gamma_0$ budget exhausted. Since $P_t - \Pi_t^\dagger \preceq 0$ on the span of the observed covariates, we cannot exhaust the $\gamma_0$ budget before the $\Sigma$ budget and still satisfy the $\mathcal{C}$ constraint.

Case: only the $\Sigma$ budget exhausted. If the $\Sigma$ budget is exhausted, then $x_1^T \in \bar{\mathcal{A}}(\Sigma)$ and hence the minimax regret is $\sum_{t=1}^{T} B_t^2\, x_t^\top P_t x_t$ by Theorem 2. Since $\gamma_T = \gamma_0 - \sum_{t=1}^{T} B_t^2\, x_t^\top P_t x_t$, the adversary's strategy is suboptimal if $\gamma_T > 0$, since it is possible to cause $\gamma_0$ regret.
These arguments cover all four cases: the adversary can cause at most $\gamma_0$ regret, and any adversary strategy that causes $\gamma_0$ regret must exhaust both the $\Sigma$ and $\gamma_0$ budgets, i.e. play $x_1^T \in \overline{\mathcal{ABC}}(\Sigma, \gamma_0)$, which implies that (MMS) is optimal. In other words, for $x_1^T \in \mathcal{X} = \mathcal{ABC}(\Sigma, \gamma_0)$ and $y_1^T \in \mathcal{Y} = \mathcal{L}(\{B_t\})$, we have

$$\sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}} R_T\big((\text{MMS}), x_1^T, y_1^T\big) - \min_s \sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}} R_T\big(s, x_1^T, y_1^T\big) = 0$$

for all $T > 0$, which implies the result.

The Necessity of a $\gamma_0$ Bound. Requiring a $\gamma_0$ bound may seem artificial at first, especially since it translates directly into a bound on the regret. However, it is a reasonable constraint to impose, for several reasons. First, recall that Lemma 3 shows that the regret of the game with only the $\mathcal{A}(\Sigma) \cap \mathcal{B}(\Sigma)$ constraints is infinite. Second, the restriction on the adversary is mild: if $x_1^T \in \mathcal{ABC}(\Sigma, \gamma_0)$, then $x_1^T \in \mathcal{ABC}(\Sigma, \gamma')$ for any $\gamma' \ge \gamma_0$, and so the budget can be adjusted online. Finally, we emphasize that the learner does not need to know $\gamma_0$ to play (MMS).

5 Follow the Regularized Leader

The minimax strategy (MMS) can be interpreted as playing follow-the-regularized-leader with a certain data-dependent regularizer.

Lemma 7. The minimax strategy (MMS) is exactly follow-the-regularized-leader, predicting $\hat y_t = \theta^\top x_t$ at round $t$, where the regularization matrices $R_t$ are given by $R_0 := P_0^{-1}$ and

$$R_t := R_{t-1} + \frac{1}{1 + x_t^\top P_t x_t}\, x_t x_t^\top - x_{t-1} x_{t-1}^\top, \tag{9}$$

and $\theta$ is the solution to $\min_\theta \sum_{s=1}^{t-1} \big(\theta^\top x_s - y_s\big)^2 + \theta^\top R_t\, \theta$.

It is also possible to derive an $R_t$ recursion without referring to $P_t$; see Lemma 11. For comparison, the last-step minimax algorithm [Azoury and Warmuth, 2001] plays $\hat y_t = x_t^\top \big( \sum_{s=1}^{t} x_s x_s^\top \big)^{-1} s_{t-1}$, so we can also view the minimax algorithm as last-step minimax with an additional regularization of $\sum_{s=t+1}^{T} \frac{x_s^\top P_s x_s}{1 + x_s^\top P_s x_s}\, x_s x_s^\top$.
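In code, the FTRL view amounts to maintaining $R_t$ by (9) and solving a ridge-type system each round. A sketch under our own conventions — it assumes $P_0 \succ 0$, so that $R_0 = P_0^{-1}$ exists and the regularized system is nonsingular:

```python
import numpy as np

def update_R(R_prev, x_prev, x_t, P_t):
    """Regularizer update (9); pass x_prev=None at t = 1."""
    R = R_prev + np.outer(x_t, x_t) / (1.0 + float(x_t @ P_t @ x_t))
    if x_prev is not None:
        R -= np.outer(x_prev, x_prev)
    return R

def ftrl_predict(X_past, y_past, x_t, R_t):
    """Lemma 7: theta minimizes sum_{s<t} (theta^T x_s - y_s)^2 + theta^T R_t theta;
    predict y_hat_t = theta^T x_t via the normal equations."""
    A = R_t.copy()
    b = np.zeros(x_t.shape[0])
    for x_s, y_s in zip(X_past, y_past):
        A += np.outer(x_s, x_s)
        b += y_s * x_s
    theta = np.linalg.solve(A, b)   # nonsingular by the P_0 > 0 assumption
    return float(theta @ x_t)
```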
We have shown that for the adversarial covariates protocol with $\mathcal{X} = \mathcal{ABC}(\Sigma, \gamma_0)$, (MMS) is the minimax optimal strategy and incurs $\gamma_0$ regret. Our last result helps quantify this regret by proving an $O(\log T)$ regret bound for the games analyzed in Section 3.

Theorem 5. For any fixed $T$ and $B_1^T$, the minimax regret of the box-constrained game has the bound

$$\sup_{x_1^T \in \bar{\mathcal{A}}(\Sigma)}\ \sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s^*, x_1^T, y_1^T\big) \le d\, \big\|B_1^T\big\|_\infty^2\, \|\Sigma\|_2 \left( 1 + 2 \ln\!\left( 1 + \frac{\|\Sigma\|_2^2}{2} \cdot \frac{\big\|B_1^T\big\|_2^2}{\big\|B_1^T\big\|_\infty^2} \right) \right).$$

6 Conclusion

We have presented the minimax optimal strategy for online linear regression where the covariate and label sequences are chosen adversarially and the measure of game length is a covariance budget instead of the number of rounds. Because the strategy has access to a more informative measure of game size, $\Sigma$, it can compete with strategies that know the number of rounds. The minimax strategy is efficient and only needs to update $P_t$ and $s_t$.

One could interpret the results of our paper as finding a more natural way to measure the length of the game, one that admits a tractable minimax strategy. What other game protocols can be reparameterized to admit efficient minimax strategies? As a general method, one could start with minimax algorithms for constrained cases and then search for parameterizations that preserve optimality. We have also provided an intuitive view of the algorithm as follow-the-regularized-leader with a specific data-dependent regularizer. This interpretation can be used to bound the excess regret when the budget $\Sigma$ is misspecified, perhaps allowing for adaptation to $\Sigma$.

Acknowledgements

We gratefully acknowledge the support of the NSF through grant IIS-1619362.

References

Katy S. Azoury and Manfred K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

Peter L. Bartlett, Wouter M. Koolen, Alan Malek, Manfred K. Warmuth, and Eiji Takimoto. Minimax fixed-design linear regression. In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of The 28th Annual Conference on Learning Theory (COLT), pages 226–239, 2015.

Nicolò Cesa-Bianchi, Philip M. Long, and Manfred K. Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604–619, 1996.

Jürgen Forster. On relative loss bounds in generalized linear regression. In Fundamentals of Computation Theory, pages 269–280. Springer, 1999.

Dean P. Foster. Prediction in the worst case. Annals of Statistics, 19(2):1084–1090, 1991.

David A. Harville. Matrix Algebra From a Statistician's Perspective, volume 1. Springer, 1997.

Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

Wouter M. Koolen, Alan Malek, and Peter L. Bartlett. Efficient minimax strategies for square loss games. In Advances in Neural Information Processing Systems, pages 3230–3238, 2014.

Wouter M. Koolen, Alan Malek, Peter L. Bartlett, and Yasin Abbasi. Minimax time series prediction. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2548–2556. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5730-minimax-time-series-prediction.pdf.

Edward Moroshko and Koby Crammer. Weighted last-step min-max algorithm with improved sub-logarithmic regret. Theoretical Computer Science, 558:107–124, 2014.

Eiji Takimoto and Manfred K. Warmuth. The minimax strategy for Gaussian density estimation. In Proceedings of the 13th Annual Conference on Computational Learning Theory (COLT), pages 100–106, 2000.

Volodimir G. Vovk. Aggregating strategies. In Proceedings of the Third Workshop on Computational Learning Theory, pages 371–383. Morgan Kaufmann, 1990.

Volodya Vovk. Competitive on-line linear regression. In Advances in Neural Information Processing Systems, pages 364–370, 1998.