# Horizon-Independent Minimax Linear Regression

Alan Malek
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
77 Massachusetts Avenue, Cambridge, MA 02139-4307, USA
amalek@mit.edu

Peter L. Bartlett
Department of EECS and Statistics
University of California
Berkeley, CA 94720-1776, USA
bartlett@cs.berkeley.edu

Abstract

We consider online linear regression: at each round, an adversary reveals a covariate vector, the learner predicts a real value, the adversary reveals a label, and the learner suffers the squared prediction error. The aim is to minimize the difference between the cumulative loss and that of the linear predictor that is best in hindsight. Previous work demonstrated that the minimax optimal strategy is easy to compute recursively from the end of the game; this requires the entire sequence of covariate vectors in advance. We show that, once provided with a measure of the scale of the problem, we can invert the recursion and play the minimax strategy without knowing the future covariates. Further, we show that this forward recursion remains optimal even against adaptively chosen labels and covariates, provided that the adversary adheres to a set of constraints that prevent misrepresentation of the scale of the problem. This strategy is horizon-independent: the regret and the minimax strategy depend on the size of the constraint set and not on the time horizon, and hence it incurs no more regret than the optimal strategy that knows in advance the number of rounds of the game. We also provide an interpretation of the minimax algorithm as a follow-the-regularized-leader strategy with a data-dependent regularizer, and we obtain an explicit expression for the minimax regret.

1 Introduction

Linear regression is a fundamental prediction problem in machine learning and statistics. In this paper, we study a sequential version: on round $t$, the adversary chooses and reveals a covariate vector $x_t \in \mathbb{R}^d$, the learner makes a real-valued prediction $\hat y_t$, the adversary chooses and reveals the true outcome $y_t \in \mathbb{R}$, and finally the learner is penalized by the square loss, $(\hat y_t - y_t)^2$. Since it is hopeless to guarantee a small loss (the adversary can always cause constant loss per round), we instead aim to predict almost as well as the best fixed linear predictor in hindsight. Letting $x_s^t$ and $y_s^t$ denote $x_s, \dots, x_t$ and $y_s, \dots, y_t$, respectively, define the regret of a strategy that predicts $\hat y_1^T$ as

$$R_T\big(\hat y_1^T, x_1^T, y_1^T\big) := \sum_{t=1}^{T} (\hat y_t - y_t)^2 - \min_{\theta \in \mathbb{R}^d} \sum_{t=1}^{T} \big(\theta^\top x_t - y_t\big)^2.$$

A strategy $s : \bigcup_{t \ge 1} (\mathbb{R}^d \times \mathbb{R})^{t-1} \times \mathbb{R}^d \to \mathbb{R}$ is a map from observations to predictions, and we define $R_T(s, x_1^T, y_1^T) := R_T(\hat y_1^T, x_1^T, y_1^T)$ where $\hat y_t = s(x_1, y_1, \dots, x_{t-1}, y_{t-1}, x_t)$. Our goal is to find a strategy that guarantees low regret for all data sequences. In particular, this paper is concerned with the minimax strategy $s^*$, the strategy that minimizes the worst-case regret over all covariate and outcome sequences in some constraint set; i.e., $s^*$ satisfies

$$\max_{x_1^T, y_1^T} R_T\big(s^*, x_1^T, y_1^T\big) = \min_{s} \max_{x_1^T, y_1^T} R_T\big(s, x_1^T, y_1^T\big).$$

In general, computing minimax strategies is computationally intractable because the optimal prediction $\hat y_t$ depends on the complete history $(x_1, y_1, \dots, x_{t-1}, y_{t-1}, x_t)$, and the dependence might be a rather arbitrary function of this enormous space of histories.
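To make the regret definition concrete, here is a minimal numpy sketch (ours, not from the paper) that evaluates $R_T$ for a given prediction sequence; `regret` is a hypothetical helper name, and the hindsight comparator is computed with a least-squares solve:

```python
import numpy as np

def regret(y_hat, X, y):
    """R_T: learner's cumulative square loss minus the loss of the best
    fixed linear predictor in hindsight (computed by least squares)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # hindsight comparator
    learner_loss = np.sum((y_hat - y) ** 2)
    comparator_loss = np.sum((X @ theta - y) ** 2)
    return learner_loss - comparator_loss
```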
So it is surprising that, in the case of fixed-design linear regression (where the strategy knows the covariate sequence in advance), the minimax strategy can be computed efficiently [Bartlett et al., 2015].

This paper builds on the results of Bartlett et al. [2015], which studied fixed-design online linear regression, where the game length $T$ and covariates $x_1^T := x_1, \dots, x_T$ are known to the learner a priori. Under constraints on the adversarial labels $y_1^T$, the value function and minimax strategy were calculable in closed form using backwards induction. The resulting minimax strategy,

$$\hat y_{t+1} = x_{t+1}^\top P_{t+1} \sum_{s=1}^{t} y_s x_s, \tag{MMS}$$

is a simple linear predictor with coefficient matrices defined by the recursion

$$P_t = P_{t+1} + P_{t+1}\, x_{t+1} x_{t+1}^\top P_{t+1}. \tag{1}$$

Each $P_t$ (and hence each $\hat y_t$) is a function of the whole sequence $x_1^T$, and thus an extension to online design seems difficult.

Given: covariate constraints $\mathcal{X}$ and label constraints $\mathcal{Y}(\{x_t\})$.
For $t = 1, 2, \dots$:
- Adversary chooses $x_t$ such that $x_1^t \in \mathcal{X}$
- Learner predicts $\hat y_t$
- Adversary may end the game
- Adversary reveals $y_t$ such that $y_1^t \in \mathcal{Y}(x_1^T)$
- Learner incurs loss $(\hat y_t - y_t)^2$
The game ends if no $x_{t+1}$ exists such that $x_1^{t+1} \in \mathcal{X}$.

Figure 1: Adversarial Covariates Protocol.

Our contributions. This paper extends the fixed-design setting to adversarial design, where neither the covariates nor the length of the game are fixed a priori. We use $\{x_t\}$ and $\{y_t\}$ to denote arbitrary-length sequences of covariates and labels, respectively. We allow the adversary to play any covariate sequence in some constraint set $\mathcal{X}$ and labels in some set $\mathcal{Y}(\{x_t\})$ (which may depend on the covariates). In particular, we identify a family $(\mathcal{X}, \mathcal{Y})$ parameterized by a positive-definite matrix $\Sigma$, representing the size of future covariates, and a scalar $\gamma_0$, representing the size of the future labels, and present a strategy that is minimax optimal against all adversarial sequences in this family. The algorithm need only know $\Sigma$, and the guarantee is horizon-independent in the sense that the family does not constrain the length of the covariate sequence and includes covariate sequences of arbitrary length for any $(\Sigma, \gamma_0)$ pair. The protocol of the general, horizon-independent setting is outlined in Figure 1. We derive the minimax strategy and show that it is optimal in the following sense.

Definition 1. A strategy $s^*$ is horizon-independent minimax optimal for some class $\mathcal{X}$ of covariate sequences and some class $\mathcal{Y}(\{x_t\})$ of label sequences, possibly depending on $\{x_t\} \in \mathcal{X}$, if

$$\sup_{T} \left( \sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}(x_1^T)} R_T\big(s^*, x_1^T, y_1^T\big) - \min_s \sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}(x_1^T)} R_T\big(s, x_1^T, y_1^T\big) \right) \le 0.$$

We require $s^*$ to have regret no larger than that of even a strategy that knows $T$. In other words, we establish a more natural measure of game length than the number of rounds. The covariate constraints on $\{x_t\}$ ensure that the adversary respects the scale constraint $\Sigma$, so that the learner is not led to under-regularize or over-regularize. The minimax strategy is efficient and is simultaneously minimax optimal against all covariate sequences corresponding to $\Sigma$. We motivate our constraint set by showing that every condition is necessary, and we also cast the minimax strategy as a follow-the-regularized-leader strategy with a data-dependent regularizer. Finally, we provide a general regret upper bound.
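The protocol of Figure 1 can be rendered as a simple driver loop. The sketch below is only schematic and is ours, not the paper's: `adversary` and `learner` are hypothetical objects, and the membership checks for $\mathcal{X}$ and $\mathcal{Y}$ are assumed to be handled by the adversary.

```python
def play(adversary, learner):
    """Schematic of the adversarial covariates protocol (Figure 1)."""
    total_loss = 0.0
    t = 0
    while True:
        t += 1
        x_t = adversary.choose_covariate(t)        # must keep x_1^t in X
        if x_t is None:                            # adversary ends the game
            break
        y_hat = learner.predict(x_t)
        y_t = adversary.choose_label(t, y_hat)     # must keep y_1^t in Y
        total_loss += (y_hat - y_t) ** 2
        learner.update(x_t, y_t)
    return total_loss
```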
Outline. We begin with a review of how backwards induction is used to derive the fixed-design minimax algorithm (MMS) in Section 2. By inverting the recursion, we show in Section 3 how to calculate (MMS) given only $P_0$, and thus we have the minimax strategy for any covariate sequence that perfectly agrees with the given $P_0$. Section 4 greatly expands the scope of our algorithm by deriving weaker conditions on the adversary and proves that, under these conditions, the same minimax strategy is horizon-independent minimax optimal. We argue that these conditions are necessary. We then interpret the minimax strategy as follow-the-regularized-leader with a specific, data-dependent regularizer in Section 5.

Related Work. While linear regression has a long history in statistics and optimization, its online sibling is much more recent, starting with the work of Foster [1991], which considered binary labels and $\ell_1$-constrained parameters $\theta$. He proved an $O(d \log(dT))$ regret bound for an $\ell_2$-regularized follow-the-leader strategy. Cesa-Bianchi et al. [1996] considered $\ell_2$-constrained parameters and gave $O(\sqrt{T})$ regret bounds for a gradient descent algorithm with $\ell_2$ regularization. Kivinen and Warmuth [1997] showed that an Exponentiated Gradient algorithm with relative entropy regularization gives the same regret without the need for a constraint on the parameters. Vovk [1998] applied the Aggregating Algorithm [Vovk, 1990] to continuously many experts and arrived at a scale-free algorithm by using the inverse second moment matrix of past and current covariates. Forster [1999] and Azoury and Warmuth [2001] showed that this algorithm is last-step minimax and achieves an $O(\log T)$ scale-dependent regret bound. (See also the work of Moroshko and Crammer [2014] on last-step minimax.) Takimoto and Warmuth [2000] obtained the minimax strategy for prediction in Euclidean space with squared loss. This was extended to more general losses in [Koolen et al., 2014] and to tracking problems in [Koolen et al., 2015]. Finally, Bartlett et al. [2015] obtained the minimax strategy for fixed-design linear regression. We present this strategy in the next section, because we build on these results. In these papers, the minimax analysis provides a natural, data-dependent regularization, in contrast to the follow-the-leader methods described above. We make this comparison explicit in Section 5 by calculating the implied regularization.

2 Fixed Design Linear Regression

We begin by summarizing the main results of Bartlett et al. [2015]. Recall that in the fixed-design setting, the game length $T$ and covariates $x_1^T$ are fixed and known to both players. Define the summary statistics $s_t := \sum_{s=1}^{t} y_s x_s$, $\sigma_t^2 := \sum_{s=1}^{t} y_s^2$, and $\Pi_t := \sum_{s=1}^{t} x_s x_s^\top$. The minimax strategy can be computed by solving the offline problem

$$\min_{\theta} \sum_{t=1}^{T} \big(x_t^\top \theta - y_t\big)^2 = \sum_{t=1}^{T} y_t^2 - s_T^\top \Pi_T^\dagger s_T,$$

where $M^\dagger$ is the pseudo-inverse of a matrix $M$. The optimal actions $\hat y_t$ and $y_t$ are computed as a function of the state $s_{t-1}$ and covariates $x_1^T$ by solving the backward induction

$$V\big(s_t, \sigma_t^2, t, x_1^T\big) := \min_{\hat y_{t+1}} \max_{y_{t+1}} \Big\{ (\hat y_{t+1} - y_{t+1})^2 + V\big(s_t + y_{t+1} x_{t+1},\ \sigma_t^2 + y_{t+1}^2,\ t+1,\ x_1^T\big) \Big\}$$

with base case $V\big(s_T, \sigma_T^2, T, x_1^T\big) := -\min_{\theta \in \mathbb{R}^d} \sum_{t=1}^{T} \big(\theta^\top x_t - y_t\big)^2 = s_T^\top \Pi_T^\dagger s_T - \sigma_T^2$. The arguments of $V$ include $x_1^T$ to emphasize the fixed-design setting. Performing the backwards induction generates plays $\hat y_1^T$ and $y_1^T$ that witness the value of the game,

$$\min_{\hat y_1} \max_{y_1} \cdots \min_{\hat y_T} \max_{y_T} \left\{ \sum_{t=1}^{T} (\hat y_t - y_t)^2 - \min_{w \in \mathbb{R}^d} \sum_{t=1}^{T} \big(w^\top x_t - y_t\big)^2 \right\},$$

which is the minimum guaranteeable regret against all data sequences. The resulting minimax strategy is precisely the linear predictor

$$\hat y_{t+1} = x_{t+1}^\top P_{t+1} s_t, \tag{MMS}$$

with coefficient matrices defined by the recursion (1).
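For concreteness, the fixed-design strategy can be written in a few lines of numpy. This is our own sketch of the construction above (function names are ours): it runs the backward recursion (1) from $P_T = \Pi_T^\dagger$ and then plays $\hat y_{t+1} = x_{t+1}^\top P_{t+1} s_t$.

```python
import numpy as np

def backward_coefficients(X):
    """P_0, ..., P_T via recursion (1), given all covariates in advance.
    X has shape (T, d); base case P_T = pinv(Pi_T) with Pi_T = X^T X."""
    T, d = X.shape
    P = [None] * (T + 1)
    P[T] = np.linalg.pinv(X.T @ X)
    for t in range(T - 1, -1, -1):                 # P_t from P_{t+1} and x_{t+1}
        x = X[t]                                   # row t is x_{t+1} (0-indexed)
        P[t] = P[t + 1] + P[t + 1] @ np.outer(x, x) @ P[t + 1]
    return P

def mms_fixed_design(X, y):
    """Fixed-design (MMS): y_hat_{t+1} = x_{t+1}^T P_{t+1} s_t."""
    T, d = X.shape
    P = backward_coefficients(X)
    s = np.zeros(d)
    y_hat = np.zeros(T)
    for t in range(T):
        y_hat[t] = X[t] @ P[t + 1] @ s             # s currently holds s_t
        s += y[t] * X[t]                           # s_{t+1} = s_t + y_{t+1} x_{t+1}
    return y_hat
```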
Note that $P_t$ is a function of every covariate in $x_1^T$. The minimax strategy is similar to follow-the-leader, which would predict with $\Pi_t^\dagger$ in place of $P_t$; however, $P_t$ is a shrunken version of $\Pi_t^\dagger$ that takes future covariances into account. The main result of Bartlett et al. [2015] is the minimax optimality of (MMS) for the following classes. For some fixed sequence of positive label budgets $B_1, \dots, B_T > 0$, define:

1. Label constraints on $y_t$: $\mathcal{L}(B_1^T) := \big\{ y_1^T : |y_t| \le B_t \ \forall t = 1, \dots, T \big\}$.
2. Box constraints on $x_t$: $\mathcal{B}(B_1^T) := \big\{ x_1^T : B_t\, x_t^\top P_t x_t \ge \sum_{s=1}^{t-1} \big|x_t^\top P_t x_s\big|\, B_s \ \text{for all } 2 \le t \le T \big\}$.
3. Ellipsoidal constraints: $\mathcal{E}(x_1^T, R) := \big\{ y_1^T : \sum_{t=1}^{T} y_t^2\, x_t^\top P_t x_t \le R \big\}$.

Theorem 1 [Bartlett et al., 2015, Theorems 2 and 10]. For each $x_1^T$, the corresponding strategy (MMS) is minimax optimal with respect to $\mathcal{L}(B_1^T)$ whenever $x_1^T \in \mathcal{B}(B_1^T)$, and with respect to $\mathcal{E}(x_1^T, R)$, for any $B_t > 0$ sequence and any $R > 0$, in the following sense:

(1) If $x_1^T \in \mathcal{B}(B_1^T)$, then
$$\sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big((\text{MMS}), x_1^T, y_1^T\big) = \min_s \sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s, x_1^T, y_1^T\big) = \sum_{t=1}^{T} B_t^2\, x_t^\top P_t x_t;$$

(2) $$\sup_{y_1^T \in \mathcal{E}(x_1^T, R)} R_T\big((\text{MMS}), x_1^T, y_1^T\big) = \min_s \sup_{y_1^T \in \mathcal{E}(x_1^T, R)} R_T\big(s, x_1^T, y_1^T\big) = R.$$

3 The Forward Algorithm

The previous section described the fixed-design minimax strategy and established sufficient conditions for its optimality. Unfortunately, $P_t$ is recursively defined as a function of the entire $x_1^T$ sequence. In this section, we show that it is possible to remove the fixed-design and known-game-length requirements if we limit the adversary to play sequences that satisfy the adversarial covariate conditions. Letting $\mathbb{X} = \bigcup_{T > 0} \mathbb{R}^{d \times T}$ denote the set of covariate sequences of finite length, define

$$\mathcal{A}(\Sigma) := \big\{ x_1^T \in \mathbb{X} : \text{for } P_0, \dots, P_T \text{ defined by (1)},\ P_0 \preceq \Sigma \big\} \quad \text{and} \quad \bar{\mathcal{A}}(\Sigma) := \big\{ x_1^T \in \mathbb{X} : \text{for } P_0, \dots, P_T \text{ defined by (1)},\ P_0 = \Sigma \big\}; \tag{2}$$

that is, $x_1^T \in \mathcal{A}(\Sigma)$ if the $P_t$ computed by applying (1) to the sequence $x_1^T$ result in $P_0 \preceq \Sigma$. The key insight of this section is that it is possible to invert the $P_t$ recursion: we can compute $P_t$ from $P_{t-1}$ and $x_t$. Hence, if we are given $P_0$, then we can compute every $P_t$ online. For some initial condition $\Sigma$, define the forward recursion with base case $P_0 = \Sigma$ and induction step

$$P_t := P_{t-1} - \frac{a_t}{b_t^2}\, P_{t-1} x_t x_t^\top P_{t-1}, \quad \text{where} \quad b_t^2 := x_t^\top P_{t-1} x_t, \qquad a_t := \frac{\sqrt{4 b_t^2 + 1} - 1}{\sqrt{4 b_t^2 + 1} + 1}. \tag{3}$$

The prediction matrix $P_t$ is a function of $\Sigma$ and $x_1^t$ only. For the rest of the paper, we will define (MMS) with respect to the forward recursion, i.e. $\hat y_t := x_t^\top P_t s_{t-1}$, where $P_t$ is defined by recursion (3). The calculation of $\hat y_t$ only requires knowledge of $\Sigma$, $x_1^t$, and $y_1^{t-1}$, all of which are available to the learner when choosing $\hat y_t$. The algorithm needs $O(d^2)$ memory, and the computational cost of each round is $O(d^2)$.
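One step of the forward recursion (3) is a rank-one downdate of $P_{t-1}$. Here is a minimal sketch (ours), with a guard for the degenerate case $b_t^2 = 0$, which we handle by leaving $P$ unchanged — an assumption on our part, not something the paper specifies:

```python
import numpy as np

def forward_step(P_prev, x):
    """P_t from P_{t-1} and x_t via the forward recursion (3)."""
    b2 = float(x @ P_prev @ x)                     # b_t^2 = x_t^T P_{t-1} x_t
    if b2 <= 0.0:                                  # degenerate direction (assumed no-op)
        return P_prev
    r = np.sqrt(4.0 * b2 + 1.0)
    a = (r - 1.0) / (r + 1.0)                      # a_t as defined in (3)
    Px = P_prev @ x
    return P_prev - (a / b2) * np.outer(Px, Px)
```

Maintaining $s_{t-1}$ alongside $P_t$ then yields the online prediction $\hat y_t = x_t^\top P_t s_{t-1}$ in $O(d^2)$ time per round, matching the cost quoted above.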
It is essential that the two recursions are equivalent, which is guaranteed by the following lemma.

Lemma 1. Let $\Sigma \succeq 0$ be a positive semidefinite matrix. For any covariate sequence $x_1^T \in \bar{\mathcal{A}}(\Sigma)$, the $P_t$ matrices defined by the backwards recursion (1) applied to $x_1^T$ are identical to the $P_t$ matrices defined by the forward recursion (3) with base case $P_0 = \Sigma$ and updates given by $x_1^T$.

Proof. Let $P'_t$ be defined by the forward recursion starting from $P'_0 = \Sigma$ and let $P_t$ be defined by the backwards recursion (1). Our goal is to show that $P_t = P'_t$ for all $t$. The base case $P_0 = P'_0$ is a simple consequence of [Bartlett et al., 2015, Lemma 11], which uses repeated applications of Sherman–Morrison to show that

$$P_t^\dagger = \Pi_t + \sum_{s=t+1}^{T} \frac{x_s^\top P_s x_s}{1 + x_s^\top P_s x_s}\, x_s x_s^\top. \tag{4}$$

Now, assuming the induction hypothesis $P'_{t-1} = P_{t-1}$, we can evaluate

$$P'_t = P_{t-1} - \frac{a_t}{b_t^2}\, P_{t-1} x_t x_t^\top P_{t-1} = P_t + P_t x_t x_t^\top P_t - \frac{a_t}{b_t^2}\big(P_t + P_t x_t x_t^\top P_t\big) x_t x_t^\top \big(P_t + P_t x_t x_t^\top P_t\big) = P_t + P_t x_t \left( 1 - \frac{a_t}{b_t^2}\Big(1 + 2\, x_t^\top P_t x_t + \big(x_t^\top P_t x_t\big)^2\Big) \right) x_t^\top P_t. \tag{5}$$

By definition, we have $b_t^2 = x_t^\top P_{t-1} x_t = x_t^\top P_t x_t + \big(x_t^\top P_t x_t\big)^2$, which we can invert to find that

$$x_t^\top P_t x_t = \frac{\sqrt{4 b_t^2 + 1} - 1}{2}.$$

Plugging this in, the term in parentheses in (5) is

$$1 - \frac{a_t}{b_t^2}\Big(1 + 2\, x_t^\top P_t x_t + \big(x_t^\top P_t x_t\big)^2\Big) = 1 - \frac{a_t}{b_t^2}\left(\frac{\sqrt{4 b_t^2 + 1} + 1}{2}\right)^2 = 1 - \frac{\big(\sqrt{4 b_t^2 + 1} - 1\big)\big(\sqrt{4 b_t^2 + 1} + 1\big)}{4\, b_t^2} = 0,$$

implying that $P'_t = P_t$, as desired.

Our first result is that this algorithm is actually minimax optimal if we constrain the adversary to play in $\bar{\mathcal{A}}(\Sigma)$. Another interpretation is that $\Sigma$ encodes all the scale information the learner needs to respond optimally. That is, (MMS) performs as well as the best strategy that sees the covariate sequence in advance. In particular, knowledge of $\Sigma$, not of $T$, is what the learner needs.

Theorem 2. For all positive semidefinite $\Sigma$, label bounds $B_1, B_2, \dots \ge b > 0$, and any constant $R > 0$, the minimax strategy (MMS) using the forward recursion (3) starting from $P_0 = \Sigma$ is horizon-independent minimax optimal, i.e.

$$\sup_T \sup_{x_1^T \in \mathcal{X}} \left( \sup_{y_1^T \in \mathcal{Y}(x_1^T)} R_T\big(s^*, x_1^T, y_1^T\big) - \min_s \sup_{y_1^T \in \mathcal{Y}(x_1^T)} R_T\big(s, x_1^T, y_1^T\big) \right) \le 0$$

for $\big(\mathcal{X}, \mathcal{Y}(x_1^T)\big)$ equal to either $\big(\bar{\mathcal{A}}(\Sigma),\ \mathcal{E}(x_1^T, R)\big)$ or $\big(\mathcal{B}(B_1^T) \cap \bar{\mathcal{A}}(\Sigma),\ \mathcal{L}(B_1^T)\big)$.

Proof of Theorem 2. Since $x_1^T \in \bar{\mathcal{A}}(\Sigma)$, Lemma 1 implies that the $P_t$ matrices from the forwards and backwards recursions are equivalent, and therefore (MMS) corresponds to the minimax strategy for the fixed-design game with $P_0 = \Sigma$. We can then apply Theorem 1, part (1), which yields

$$\sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s^*, x_1^T, y_1^T\big) - \min_s \sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s, x_1^T, y_1^T\big) = 0.$$

Since this holds for all $x_1^T$, we actually get the stronger result

$$\sup_T \sup_{x_1^T \in \mathcal{B}(B_1^T) \cap \bar{\mathcal{A}}(\Sigma)} \left( \sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s^*, x_1^T, y_1^T\big) - \min_s \sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s, x_1^T, y_1^T\big) \right) \le 0.$$

Identical reasoning extends part (2) of Theorem 1 to the adversarial covariate context.

The adversarial covariate conditions are defined for entire $x_1^T$ sequences, but there is an online characterization, derived from the following lemma.

Lemma 2. Consider any $t \ge 0$, $x_1, \dots, x_t$, and symmetric matrix $P \succeq 0$. We have that $P^\dagger \succeq \Pi_t$ if and only if, for any $T \ge t + \operatorname{rank}\big(P^\dagger - \Pi_t\big)$, there is a continuation of the covariate sequence, $x_{t+1}, \dots, x_T$, such that setting $P_t = P$ and defining $P_{t+1}, \dots, P_T$ by the forward recursion (3) gives $P_T = \Pi_T^\dagger$.

A stronger version with proof is presented in the Appendix as Theorem 6; it explicitly derives conditions on $x_{t+1}$ that ensure $P^\dagger \succeq \Pi_t$. In words, a sequence of covariates $x_1^t$ is the prefix of some $x_1^T \in \bar{\mathcal{A}}(\Sigma)$ if $P_s^\dagger \succeq \Pi_s$ for all $s \le t$, where $P_s$ corresponds to the forward recursion (3) defined by initial condition $P_0 = \Sigma$ and covariates $x_1^t$. Hence, it is equivalent to constrain the adversary to play $x_t$ satisfying this condition at every round, and we do not require the adversary to fix the covariate sequence in advance; that is, it is equivalent to define

$$\mathcal{A}(\Sigma) = \big\{ x_1^T \in \mathbb{X} : P_0 = \Sigma \ \text{and} \ P_t^\dagger \succeq \Pi_t \ \forall t \ge 1 \big\}, \quad \text{and} \tag{6}$$

$$\bar{\mathcal{A}}(\Sigma) = \big\{ x_1^T \in \mathbb{X} : P_0 = \Sigma,\ P_t^\dagger \succeq \Pi_t \ \forall t \ge 1,\ \text{and} \ P_T = \Pi_T^\dagger \big\}. \tag{7}$$
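The online membership test in (6) is straightforward to monitor numerically. Below is a hedged rendering (ours, not the paper's): it reads the condition as the PSD ordering $P_t^\dagger \succeq \Pi_t$ and absorbs floating-point error with a tolerance.

```python
import numpy as np

def prefix_admissible(P_t, Pi_t, tol=1e-9):
    """Check P_t^dagger >= Pi_t in the PSD order: by Lemma 2, the prefix
    x_1^t can then still be continued to exhaust the covariance budget."""
    diff = np.linalg.pinv(P_t) - Pi_t
    return bool(np.linalg.eigvalsh(diff).min() >= -tol)
```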
4 Expanding the Minimax Conditions

The strategy (MMS) is minimax optimal for any covariate sequence $x_1^T \in \bar{\mathcal{A}}(\Sigma)$, i.e., when the adversary plays covariates that meet the $\Sigma$ constraint with equality, which is quite restrictive. In this section, we identify a much broader set of constraints on the adversary's actions under which (MMS) remains the best learner response. These conditions allow for adversarial design; the data may be chosen in response to the learner's actions.

A natural relaxation is to remove the equality constraint; this results in a set of constraints on the adversary where the labels $\{y_t\}$ are in $\mathcal{L}(\{B_t\}) := \{\{y_t\} : |y_t| \le B_t \ \forall t \ge 1\}$, and the covariates $\{x_t\}$ are in $\mathcal{A}(\Sigma) \cap \mathcal{B}(\Sigma)$, where

$$\mathcal{B}(\Sigma) = \Big\{ \{x_t\} : B_t\, x_t^\top P_t x_t \ge \sum_{s=1}^{t-1} \big|x_t^\top P_t x_s\big|\, B_s \ \ \forall t \ge 1 \Big\}.$$

The $\mathcal{B}(\Sigma)$ condition is necessary for an efficient algorithm [Bartlett et al., 2015], and without the $\mathcal{A}(\Sigma)$ condition, the adversary could choose $x_t$ to be a scaled version of $s_{t-1}$ and $y_t = \theta_{t-1}^\top x_t$, where $\theta_{t-1}$ is the best least-squares predictor for $x_1^{t-1}$ and $y_1^{t-1}$. The comparator suffers no additional loss, the algorithm suffers some loss, and we can scale $x_t$ such that the $\mathcal{B}(\Sigma)$ conditions are satisfied. To summarize, without the $\mathcal{A}$ constraint, the adversary can cause arbitrary regret. However, the $\mathcal{A}$ and $\mathcal{B}$ constraints alone are not sufficient to guarantee a solvable game:

Lemma 3. Fix any $\Sigma$ and any $\{B_t\}$ with $B_t \ge b > 0$ for all $t$. Then, for any $M > 0$, there exist $x_1^T \in \mathcal{A}(\Sigma) \cap \mathcal{B}(\Sigma)$ and $y_1^T \in \mathcal{L}(B_1^T)$ such that the minimax regret is larger than $M$.

A covariate budget alone is not sufficient for a minimax algorithm; it is not even clear how to define minimax strategies when the regrets are unbounded. Hence, we introduce continuation constraints (the name will become clear soon). Let $\gamma_0 > 0$ be some initial label budget and define

$$\gamma_t = \gamma_{t-1} - B_t^2\, x_t^\top P_t x_t,$$

with $P_t$ defined by the forward recursion (3). Let $\mathbb{B}(B_1^t) := \{\xi \in \mathbb{R}^t : |\xi_i| \le B_i,\ i = 1, \dots, t\}$ be the hypercube with sides of length $B_1, \dots, B_t$, and let $X_t$ be the matrix with columns $x_1, \dots, x_t$. For a given covariate budget $\Sigma$ and label budget $\gamma_0$, define the continuation condition

$$\mathcal{C}(\Sigma, \gamma_0) := \Big\{ x_1^T : \gamma_t \ge \xi^\top X_t^\top \big(\Pi_t^\dagger - P_t\big) X_t\, \xi \ \ \forall \xi \in \mathbb{B}(B_1^t) \ \text{and} \ t = 1, \dots, T \Big\}, \tag{8}$$

which is equivalent to requiring that $s_t^\top \big(\Pi_t^\dagger - P_t\big) s_t \le \gamma_t$ for all possible $s_t$. The rest of this section proves the main result of this paper: if the adversary plays in $\mathcal{ABC}(\Sigma, \gamma_0) := \mathcal{A}(\Sigma) \cap \mathcal{B}(\Sigma) \cap \mathcal{C}(\Sigma, \gamma_0)$, then (MMS) is minimax optimal.

Theorem 3. Consider the two-player game defined in Figure 1. For any $\{B_t\}$ with $B_t > 0$, $\Sigma \succ 0$, and $\gamma_0 \ge 0$, the learner strategy (MMS) has minimax regret $\gamma_0$ and is horizon-independent minimax optimal for $x_1^T \in \mathcal{X} = \mathcal{ABC}(\Sigma, \gamma_0)$ and $y_1^T \in \mathcal{Y} = \mathcal{L}(\{B_t\})$. That is,

$$\sup_T \left( \sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}} R_T\big((\text{MMS}), x_1^T, y_1^T\big) - \min_s \sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}} R_T\big(s, x_1^T, y_1^T\big) \right) \le 0.$$

We will prove Theorem 3 by first considering adversarial strategies under $\mathcal{A}(\Sigma)$ with a fixed game length. We show that, somewhat counterintuitively, the adversary may cause more regret by not using the entire $\Sigma$ budget. Then, we show that the $\mathcal{C}$ condition eliminates these troublesome cases and the adversary exhausts the budget; therefore, the adversary plays $x_1^T \in \bar{\mathcal{A}}(\Sigma)$, which implies that (MMS) is minimax optimal by the results of the previous section. Finally, we note that all the previous arguments apply uniformly across $T$, and since (MMS) is ignorant of $T$, it must be horizon-independent minimax optimal. The $\Sigma$ constraint, not the game length, seems to be the correct notion of game size.
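The label budget is a single scalar tracked alongside the forward recursion. Below is a minimal sketch (ours) of the $\gamma_t$ update and a spot-check of the continuation condition (8) for one realized $s_t$; note that (8) itself quantifies over the whole hypercube $\mathbb{B}(B_1^t)$, not a single point.

```python
import numpy as np

def update_gamma(gamma_prev, B_t, x_t, P_t):
    """Label-budget update: gamma_t = gamma_{t-1} - B_t^2 * x_t^T P_t x_t."""
    return gamma_prev - B_t ** 2 * float(x_t @ P_t @ x_t)

def continuation_ok(gamma_t, s_t, P_t, Pi_t, tol=1e-9):
    """One-point check of (8): s_t^T (Pi_t^dagger - P_t) s_t <= gamma_t."""
    q = float(s_t @ (np.linalg.pinv(Pi_t) - P_t) @ s_t)
    return q <= gamma_t + tol
```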
4.1 Limiting $T$

Consider a fixed $T > 0$ and define

$$\mathcal{A}_T(\Sigma) := \big\{ x_1^T \in \mathbb{R}^{d \times T} : P_0 = \Sigma \ \text{and} \ P_t^\dagger \succeq \Pi_t \ \forall\, 1 \le t \le T \big\},$$

the restriction of $\mathcal{A}(\Sigma)$ to sequences of length $T$. The goal of this section is to show (i) that it is possible for the adversary to cause more regret by not using up the covariance budget, i.e. $P_T \neq \Pi_T^\dagger$, and (ii) that the $\mathcal{C}$ conditions are sufficient to prevent this. We cannot calculate the minimax solution of the $\mathcal{A}_T(\Sigma)$ game directly: Section G in the appendix explicitly evaluates the first backwards induction step; it is quite complicated and has no closed-form solution, which suggests that efficient backwards induction is unlikely. Instead, we will study the related fixed-design early-stopping game. For some fixed $x_1^T$, the game protocol is: at round $t$, the learner predicts $\hat y_t$, then the adversary chooses $e_t \in \{0, 1\}$ and $y_t \in \mathcal{L}(B_1^T)$. If $e_t = 1$, the learner incurs loss $(\hat y_t - y_t)^2$ and the game continues; if $e_t = 0$, the game ends.

Intuitively, the adversary may be able to cause more regret because the learner is regularizing for a covariance budget corresponding to $x_1^T$, and therefore ending the game early causes the learner to over-regularize. We will derive $\mathcal{C}$ as a condition under which the adversary always continues to $T$. In turn, this implies that the adversary will use up the $\Sigma$ budget in the $\mathcal{A}_T$ game: any $x_1^T$ with remaining $\Sigma$ budget has a continuation $x_1^{T+k} \in \bar{\mathcal{A}}(\Sigma)$ by Lemma 2, and the $\mathcal{C}$ condition implies that the adversary will continue until $T + k$ and use up the budget. We will make this argument formal.

We begin by defining an incremental version of regret. Define

$$\Delta_t := \min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t} \big(\theta^\top x_s - y_s\big)^2 - \min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t-1} \big(\theta^\top x_s - y_s\big)^2,$$

the additional loss suffered by the comparator from playing $t$ rounds instead of $t - 1$ rounds. We have $\Delta_t \ge 0$ and $L_T^* = \sum_{t=1}^{T} \Delta_t$, where $L_T^*$ denotes the comparator's cumulative loss. The regret of the game with early stopping can be written as

$$R_T = \sum_{t=1}^{T} \left( \prod_{s=1}^{t} e_s \right) \Big( (y_t - \hat y_t)^2 - \Delta_t \Big).$$

One might notice that $\Delta_t = 0$ for the choice $y_t = \theta_{t-1}^\top x_t$, where $\theta_{t-1}$ is the ordinary least-squares solution on the data through time $t - 1$, so under this choice the regret always increases. However, this choice of $y_t$ may violate the label constraints — for example, when $B_t = 1$ and $\|x_t\|$ is increasing. Additionally, we want a constraint under which the adversary wants to play all remaining rounds, not just the next one, and hence the constraint on $y_t$ will depend on the future covariates.

The value-to-go definition also needs to be adapted to the incremental setting. To this end, we define the instantaneous value-to-go $W(s_t, \sigma_t^2, t, x_1^T)$ by $W(s_T, \sigma_T^2, T, x_1^T) = 0$ and

$$W\big(s_{t-1}, \sigma_{t-1}^2, t-1, x_1^T\big) = \max_{e_t \in \{0,1\}} e_t \Big( \min_{\hat y_t} \max_{y_t} \big\{ (\hat y_t - y_t)^2 - \Delta_t + W\big(s_t, \sigma_t^2, t, x_1^T\big) \big\} \Big),$$

where the statistics are updated as $s_t = s_{t-1} + y_t x_t$ and $\sigma_t^2 = \sigma_{t-1}^2 + y_t^2$. It is easy to check that $W_0$ is the minimax regret for this game and that it equals the regret of the fixed-design game when the adversary plays every round.
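The increment $\Delta_t$ can be evaluated directly by differencing two least-squares fits; here is a numpy sketch (ours, with 1-indexed rounds) that is useful for sanity-checking Lemma 4 below:

```python
import numpy as np

def marginal_comparator_loss(X, y, t):
    """Delta_t: extra hindsight-comparator loss from playing round t."""
    def best_loss(k):
        if k == 0:
            return 0.0
        theta, *_ = np.linalg.lstsq(X[:k], y[:k], rcond=None)
        return float(np.sum((X[:k] @ theta - y[:k]) ** 2))
    return best_loss(t) - best_loss(t - 1)
```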
4.2 Calculating the Instantaneous Value-to-go

This section derives $\mathcal{C}$ as the condition under which $e_t = 1$ for all $t$ and evaluates $W_t$. Throughout, $\mathcal{R}(M)$ denotes the row space of a matrix $M$. The proofs in this section are heavy on calculation and have been collected in Appendix B. We begin by explicitly calculating $\Delta_t$.

Lemma 4. The marginal loss for the comparator of playing another round with covariate $x_t = \bar x_t + x_t^\perp$, where $\bar x_t \in \mathcal{R}(\Pi_{t-1})$ and $x_t^\perp$ is the component orthogonal to it, is

$$\Delta_t = y_t^2 \big(1 - x_t^\top \Pi_t^\dagger x_t\big) - 2\, y_t\, s_{t-1}^\top \Pi_t^\dagger x_t + \frac{\big(s_{t-1}^\top \Pi_t^\dagger x_t\big)^2\, \bar x_t^\top \Pi_{t-1}^\dagger \bar x_t}{x_t^\top \Pi_t^\dagger x_t}.$$

Theorem 4. Consider the fixed-design game with early stopping, with covariates $x_1^T$. Define the $P_t$ by the backwards recursion (1) and define $\gamma_t = \sum_{s=t+1}^{T} B_s^2\, x_s^\top P_s x_s$. Suppose that, for all $t$, $\gamma_t \ge s_t^\top \big(\Pi_t^\dagger - P_t\big) s_t$. Then the instantaneous value-to-go is $W\big(s_t, \sigma_t^2, t, x_1^T\big) = s_t^\top \big(P_t - \Pi_t^\dagger\big) s_t + \gamma_t$, the adversary causes more regret by continuing the game, and the optimal learner strategy is (MMS).

Proof outline. The proof is by induction, where the base case is easily established with $\gamma_T = 0$ and $P_T = \Pi_T^\dagger$. Now, assuming that $W\big(s_t, \sigma_t^2, t, x_1^T\big) = s_t^\top \big(P_t - \Pi_t^\dagger\big) s_t + \gamma_t$, we wish to calculate the $t - 1$ case by evaluating

$$W\big(s_{t-1}, \sigma_{t-1}^2, t-1, x_1^T\big) = \max_{e_t \in \{0,1\}} e_t \Big( \min_{\hat y_t} \max_{y_t} \big\{ (\hat y_t - y_t)^2 - \Delta_t + W\big(s_t, \sigma_t^2, t, x_1^T\big) \big\} \Big).$$

We use our expression for $\Delta_t$, perform elementary calculations to evaluate the saddle point, and show that the above evaluates to

$$\max\left\{ \big(s_{t-1}^\top P_t x_t\big)^2 + B_t^2\, x_t^\top P_t x_t - \frac{\big(s_{t-1}^\top \Pi_t^\dagger x_t\big)^2\, \bar x_t^\top \Pi_{t-1}^\dagger \bar x_t}{x_t^\top \Pi_t^\dagger x_t} + s_{t-1}^\top \big(P_t - \Pi_t^\dagger\big) s_{t-1} + \gamma_t,\ 0 \right\},$$

which can be shown to always take the first value so long as $\gamma_{t-1} \ge s_{t-1}^\top \big(\Pi_{t-1}^\dagger - P_{t-1}\big) s_{t-1}$. In this case, the induction hypothesis is verified with the $P_t$ update described in the theorem.

This implies that the instantaneous value-to-go is always positive and that an optimal adversary will always continue. As a consequence, the covariate sequence satisfies $x_1^T \in \bar{\mathcal{A}}(P_0)$, which confirms that (MMS) using the forward recursion is minimax optimal via Theorem 2.

All the ingredients are in place to prove our main result. For convenience, define $\overline{\mathcal{ABC}}(\Sigma, \gamma_0) := \big\{x_1^T \in \mathcal{ABC}(\Sigma, \gamma_0) : P_T = \Pi_T^\dagger,\ \gamma_T = 0\big\}$, the set of sequences that deplete the $\Sigma$ and $\gamma_0$ budgets. Roughly, we will argue that, under $\mathcal{C}(\Sigma, \gamma_0)$, the adversary causes the most regret by playing $x_1^T \in \bar{\mathcal{A}}(\Sigma)$, which implies that $x_1^T \in \overline{\mathcal{ABC}}(\Sigma, \gamma_0)$ and the regret is $\gamma_0$. The first step in the analysis is to check that the constraint set is non-trivial.

Lemma 5. Consider the game defined by $\Sigma \succ 0$, $\gamma_0 > 0$, and a $\{B_t\}$ sequence. If there exists some $T$ such that $\sum_{t=1}^{T} \frac{B_t^2}{t + \log(T+1)} \ge \gamma_0$, then there exists a covariate sequence $x_1^T \in \overline{\mathcal{ABC}}(\Sigma, \gamma_0)$. In particular, any $B_t$ that are bounded below satisfy this condition.

In reasoning about optimal strategies, Theorem 4 allows us to easily establish conditions under which the learner is playing suboptimally and could be causing more regret. However, Theorem 4 applies to a fixed-design game that is allowed to stop early, and we wish to reason about the adversarial covariate case. The next lemma makes the crucial connection.

Lemma 6. Suppose $x_1^t \in \mathcal{ABC}(\Sigma, \gamma_0)$ but $\gamma_t > 0$. Then there exists an extension $x_{t+1}, \dots, x_T$ such that $x_1^T \in \overline{\mathcal{ABC}}(\Sigma, \gamma_0)$ and the instantaneous value-to-go equals $W\big(s_t, \sigma_t^2, t, x_1^T\big) = s_t^\top \big(P_t - \Pi_t^\dagger\big) s_t + \gamma_t$.

The proof is a simple consequence of checking that the extension of Lemma 2 is compatible with condition $\mathcal{C}$. We can now prove the minimax optimality of (MMS) for the $\mathcal{ABC}$ game.

Proof of Theorem 3. We will show something stronger: the optimal adversary strategy for the game in Figure 1 plays an $x_1^T$ sequence in $\overline{\mathcal{ABC}}$ and causes exactly $\gamma_0$ regret against (MMS). First, assume that the game stops before round $T + 1$, so that $x_1, \dots, x_T$ have been played. There are four possible scenarios depending on whether the $\Sigma$ and $\gamma_0$ budgets are exhausted.

Case: both budgets exhausted. In this case, $x_1^T \in \overline{\mathcal{ABC}}(\Sigma, \gamma_0)$ and optimality holds by the results of Section 3.

Case: neither budget exhausted. We apply Lemma 2 to conclude that there exists a covariate sequence $x_{T+1}^{T+k}$ that uses up the $\Sigma$ budget. The $\mathcal{C}(\Sigma, \gamma_0)$ constraint guarantees that the adversary can cause more regret by playing these rounds. Hence, an adversary that exhausts neither budget is suboptimal.

Case: only the $\gamma_0$ budget exhausted. Since $P_t - \Pi_t^\dagger \preceq 0$ on the span of the observed covariates, we cannot exhaust the $\gamma_0$ budget before the $\Sigma$ budget and still satisfy the $\mathcal{C}$ constraint.

Case: only the $\Sigma$ budget exhausted. If the $\Sigma$ budget is exhausted, then $x_1^T \in \bar{\mathcal{A}}(\Sigma)$ and hence the minimax regret is $\sum_{t=1}^{T} B_t^2\, x_t^\top P_t x_t$ by Theorem 2. Since $\gamma_T = \gamma_0 - \sum_{t=1}^{T} B_t^2\, x_t^\top P_t x_t$, the adversary's strategy is suboptimal if $\gamma_T > 0$, since it is possible to cause $\gamma_0$ regret.
These arguments cover all four cases: the adversary can cause at most $\gamma_0$ regret, and any adversary strategy that causes $\gamma_0$ regret must exhaust both the $\Sigma$ and $\gamma_0$ budgets, i.e. play $x_1^T \in \overline{\mathcal{ABC}}(\Sigma, \gamma_0)$, which implies that (MMS) is optimal. In other words, for $x_1^T \in \mathcal{X} = \mathcal{ABC}(\Sigma, \gamma_0)$ and $y_1^T \in \mathcal{Y} = \mathcal{L}(\{B_t\})$, we have

$$\sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}} R_T\big((\text{MMS}), x_1^T, y_1^T\big) - \min_s \sup_{x_1^T \in \mathcal{X},\ y_1^T \in \mathcal{Y}} R_T\big(s, x_1^T, y_1^T\big) = 0$$

for all $T > 0$, which implies the result.

The Necessity of a $\gamma_0$ Bound. Requiring a $\gamma_0$ bound may seem artificial at first, especially since it translates directly into a bound on the regret. However, it is a reasonable constraint to impose, for several reasons. First, recall that Lemma 3 shows that the regret of the game with only the $\mathcal{A}(\Sigma) \cap \mathcal{B}(\Sigma)$ constraints is infinite. Second, the restriction on the adversary is mild: if $x_1^T \in \mathcal{ABC}(\Sigma, \gamma_0)$, then $x_1^T \in \mathcal{ABC}(\Sigma, \gamma')$ for any $\gamma' \ge \gamma_0$, and so the budget can be adjusted online. Finally, we emphasize that the learner does not need to know $\gamma_0$ to play (MMS).

5 Follow the Regularized Leader

The minimax strategy (MMS) can be interpreted as playing follow-the-regularized-leader with a certain data-dependent regularizer.

Lemma 7. The minimax strategy (MMS) is exactly follow-the-regularized-leader, predicting $\hat y_t = \theta^\top x_t$ at round $t$, where the regularization matrices $R_t$ are given by $R_0 := P_0^{-1}$ and

$$R_t := R_{t-1} + \frac{1}{1 + x_t^\top P_t x_t}\, x_t x_t^\top - x_{t-1} x_{t-1}^\top, \tag{9}$$

and $\theta$ is the solution to $\min_\theta \sum_{s=1}^{t-1} \big(\theta^\top x_s - y_s\big)^2 + \theta^\top R_t\, \theta$.

It is also possible to derive an $R_t$ recursion without referring to $P_t$; see Lemma 11. For comparison, the last-step minimax algorithm [Azoury and Warmuth, 2001] plays $\hat y_t = x_t^\top \big( \sum_{s=1}^{t} x_s x_s^\top \big)^{-1} s_{t-1}$, so we can also view the minimax algorithm as last-step minimax with an additional regularization of $\sum_{s=t+1}^{T} \frac{x_s^\top P_s x_s}{1 + x_s^\top P_s x_s}\, x_s x_s^\top$.
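In code, the FTRL view amounts to maintaining $R_t$ by (9) and solving a ridge-type system each round. A sketch under our own conventions — it assumes $P_0 \succ 0$, so that $R_0 = P_0^{-1}$ exists and the regularized system is nonsingular:

```python
import numpy as np

def update_R(R_prev, x_prev, x_t, P_t):
    """Regularizer update (9); pass x_prev=None at t = 1."""
    R = R_prev + np.outer(x_t, x_t) / (1.0 + float(x_t @ P_t @ x_t))
    if x_prev is not None:
        R -= np.outer(x_prev, x_prev)
    return R

def ftrl_predict(X_past, y_past, x_t, R_t):
    """Lemma 7: theta minimizes sum_{s<t} (theta^T x_s - y_s)^2 + theta^T R_t theta;
    predict y_hat_t = theta^T x_t via the normal equations."""
    A = R_t.copy()
    b = np.zeros(x_t.shape[0])
    for x_s, y_s in zip(X_past, y_past):
        A += np.outer(x_s, x_s)
        b += y_s * x_s
    theta = np.linalg.solve(A, b)   # nonsingular by the P_0 > 0 assumption
    return float(theta @ x_t)
```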
We have shown that for the adversarial covariates protocol with $\mathcal{X} = \mathcal{ABC}(\Sigma, \gamma_0)$, (MMS) is the minimax optimal strategy and incurs $\gamma_0$ regret. Our last result helps quantify this regret by proving an $O(\log T)$ regret bound for the games analyzed in Section 3.

Theorem 5. For any fixed $T$ and $B_1^T$, the minimax regret of the box-constrained game has the bound

$$\sup_{x_1^T \in \bar{\mathcal{A}}(\Sigma)}\ \sup_{y_1^T \in \mathcal{L}(B_1^T)} R_T\big(s^*, x_1^T, y_1^T\big) \le d\, \big\|B_1^T\big\|_\infty^2\, \|\Sigma\|_2 \left( 1 + 2 \ln\!\left( 1 + \frac{\|\Sigma\|_2^2}{2} \cdot \frac{\big\|B_1^T\big\|_2^2}{\big\|B_1^T\big\|_\infty^2} \right) \right).$$

6 Conclusion

We have presented the minimax optimal strategy for online linear regression where the covariate and label sequences are chosen adversarially and the measure of game length is a covariance budget instead of the number of rounds. Because the strategy has access to a more informative measure of game size, $\Sigma$, it can compete with strategies that know the number of rounds. The minimax strategy is efficient and only needs to update $P_t$ and $s_t$.

One could interpret the results of our paper as finding a more natural way to measure the length of the game, one that admits a tractable minimax strategy. What other game protocols can be reparameterized to admit efficient minimax strategies? As a general method, one could start with minimax algorithms for constrained cases and then search for parameterizations that preserve optimality. We have also provided an intuitive view of the algorithm as follow-the-regularized-leader with a specific data-dependent regularizer. This interpretation can be used to bound the excess regret when the budget $\Sigma$ is misspecified, perhaps allowing for adaptation to $\Sigma$.

Acknowledgements

We gratefully acknowledge the support of the NSF through grant IIS-1619362.

References

Katy S. Azoury and Manfred K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

Peter L. Bartlett, Wouter M. Koolen, Alan Malek, Manfred K. Warmuth, and Eiji Takimoto. Minimax fixed-design linear regression. In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of The 28th Annual Conference on Learning Theory (COLT), pages 226–239, 2015.

Nicolò Cesa-Bianchi, Philip M. Long, and Manfred K. Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604–619, 1996.

Jürgen Forster. On relative loss bounds in generalized linear regression. In Fundamentals of Computation Theory, pages 269–280. Springer, 1999.

Dean P. Foster. Prediction in the worst case. Annals of Statistics, 19(2):1084–1090, 1991.

David A. Harville. Matrix Algebra From a Statistician's Perspective, volume 1. Springer, 1997.

Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

Wouter M. Koolen, Alan Malek, and Peter L. Bartlett. Efficient minimax strategies for square loss games. In Advances in Neural Information Processing Systems, pages 3230–3238, 2014.

Wouter M. Koolen, Alan Malek, Peter L. Bartlett, and Yasin Abbasi. Minimax time series prediction. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2548–2556. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5730-minimax-time-series-prediction.pdf.

Edward Moroshko and Koby Crammer. Weighted last-step min-max algorithm with improved sub-logarithmic regret. Theoretical Computer Science, 558:107–124, 2014.

Eiji Takimoto and Manfred K. Warmuth. The minimax strategy for Gaussian density estimation. In Proceedings of the 13th Annual Conference on Computational Learning Theory (COLT), pages 100–106, 2000.

Volodimir G. Vovk. Aggregating strategies. In Proceedings of the Third Workshop on Computational Learning Theory, pages 371–383. Morgan Kaufmann, 1990.

Volodya Vovk. Competitive on-line linear regression. In Advances in Neural Information Processing Systems, pages 364–370, 1998.