# Online Learning in the Random-Order Model

Martino Bernasconi*1, Andrea Celli*1, Riccardo Colini-Baldeschi*2, Federico Fusco*3, Stefano Leonardi*3, Matteo Russo*3

*Equal contribution. 1Department of Computing Sciences, Bocconi University, Milan, Italy. 2Central Applied Science, Meta, London, UK. 3Department of Computer, Control and Management Engineering "Antonio Ruberti", Sapienza University, Rome, Italy. Correspondence to: Federico Fusco.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract. In the random-order model for online learning, the sequence of losses is chosen upfront by an adversary and presented to the learner after a random permutation. Any random-order input is asymptotically equivalent to a stochastic i.i.d. one but, for finite times, it may exhibit significant non-stationarity, which can hinder the performance of stochastic learning algorithms. While algorithms for adversarial inputs naturally maintain their regret guarantees in random order, simple no-regret algorithms exist for the stochastic model that fail against random-order instances. In this paper, we propose a general template to adapt stochastic learning algorithms to the random-order model without substantially affecting their regret guarantees. This allows us to recover improved regret bounds for prediction with delays, online learning with constraints, and bandits with switching costs. Finally, we investigate online classification and prove that, in random order, learnability is characterized by the VC dimension rather than the Littlestone dimension, thus providing a further separation from the general adversarial model.

1. Introduction

Random order is a natural input model for online algorithms: the input is generated upfront by an adversary but is presented to the algorithm after a uniform random permutation. The random-order model is extensively studied across various areas of computer science and economics, such as optimal stopping problems (Dynkin, 1963; Krengel & Sucheston, 1977; Peng & Tang, 2022), online matching and matroid selection (Babaioff et al., 2018; Korula & Pál, 2009; Dimitrov & Plaxton, 2012; Ezra et al., 2022), and online approximation algorithms (Gupta & Singla, 2020). However, it has received little attention in online learning, where the standard input regimes are either (i.i.d.) stochastic or adversarial (see, e.g., Lattimore & Szepesvári, 2020). The study of online learning in the random-order model is part of a broader research agenda that explores meaningful models beyond worst-case analysis (Roughgarden, 2020). Examples of other intermediate input models for online learning can be found in Ben-David et al. (1997); Slivkins & Upfal (2008); Lykouris et al. (2018); Sachs et al. (2022); Haghtalab et al. (2022); Mazzetto & Upfal (2023).

In random-order online learning, the T loss vectors are generated arbitrarily by some adversary and then undergo a uniform shuffling before being presented to the learner. This input model is, to some extent, intermediate between the stochastic model, where losses are drawn i.i.d. from a fixed but unknown distribution, and the adversarial one, where the adversary decides both the losses and the order in which they are presented to the learner.
A classical probabilistic result by De Finetti (1929) (see also De Finetti, 1937; Hewitt & Savage, 1955; Diaconis & Freedman, 1980) on sequences of exchangeable random variables implies that sampling without replacement (i.e., the distribution over random-order loss sequences) is asymptotically indistinguishable from i.i.d. sampling from some fixed distribution over losses (for more details, we refer to Appendix A). However, in the standard finite-time online learning framework, the non-stationarity of the random-order model can undermine learning algorithms that perform well in the stochastic setting. This observation raises several questions about the random-order model and its relationship with other input models.

1.1. Our Results

In this paper, we initiate the systematic study of the random-order input model in online learning. Since any realization of a random-order sequence of losses can be naturally captured by the standard adversarial model, it is immediate to see that any algorithm for the adversarial model maintains its guarantees in random order. Conversely, any i.i.d. sequence can be simulated by a random-order one by first sampling T losses and then applying a uniform permutation. These two considerations show that the random-order model lies between the i.i.d. and adversarial input models (for a formal proof, see Appendix B):

i.i.d. $\preceq$ random-order $\preceq$ adversarial.

The hierarchy between the three models implies that investigating the random-order model is especially compelling in problems that exhibit a performance gap between the stochastic and adversarial models. In such problems, a natural question is: What happens when we apply an algorithm designed for the stochastic setting to the random-order model? One might expect such algorithms to perform reasonably well since the adversary can only select the support of the distribution. However, we show an algorithm that behaves nicely against stochastic inputs, but fails in a carefully designed random-order instance (Section 3).

We complement this negative result with a positive one. While we cannot directly apply an algorithm designed for the stochastic setting to the random-order model, we propose a general template, SIMULATION, to adapt learning algorithms from the i.i.d. setting to the random-order model, without significantly affecting their regret guarantees (Section 4). This implies that the minimax regret regimes in the random-order model essentially collapse to stochastic i.i.d. ones. We demonstrate the applicability of our template across multiple problems that exhibit a stark separation between the adversarial and stochastic regimes.

Prediction with Delayed Feedback. The interplay between online learning and delayed feedback has been studied extensively. While many delay models exist, the general finding is that the delay parameter d influences regret bounds additively in the i.i.d. model and multiplicatively in the adversarial one (Desautels et al., 2012; Joulani et al., 2013; Masoudian et al., 2022). In Corollary 4.4, we show an algorithm with $\tilde{O}(\sqrt{T} + d)$ regret rate in the random-order model, which qualitatively matches the stochastic result.

Online Learning with Constraints. In online learning problems with time-varying constraints there is a strong separation between the adversarial case, where it is impossible to attain sublinear regret and constraint violations (Mannor et al., 2009), and the stochastic case, where $\tilde{O}(\sqrt{T})$ regret is achievable (Yu et al., 2017).
We present an algorithm (Corollary 4.8), based on SIMULATION, that guarantees regret of order $\tilde{O}(\sqrt{T})$ and zero constraint violations, matching the results for stochastic environments.

Bandits with Switching Costs. In the bandits with switching costs problem, the learning algorithm incurs an additional unitary loss every time it changes action. The adversarial minimax regret is $\Theta(T^{2/3})$, while the stochastic one is $\Theta(\sqrt{T})$ (Cesa-Bianchi et al., 2013; Dekel et al., 2014). We explore a simplified version of SIMULATION based on SUCCESSIVE-ELIMINATION, similar to the algorithm by Agrawal et al. (1988), and show that this approach recovers the $\sqrt{T}$ rate in the random-order model.

Online Classification. For (binary) classification, there is a well-known separation between statistical (offline) learning and online learning (note that statistical offline learning is here equivalent to stochastic online learning, due to the standard online-to-offline reduction). While learnability in the former setting is characterized by the Vapnik-Chervonenkis (VC) dimension, the latter is determined by the Littlestone dimension (Ben-David et al., 2009). In general, the Littlestone dimension dominates the VC dimension, and there are simple examples (e.g., one-dimensional thresholds) of families with constant VC dimension and infinite Littlestone dimension. In Appendix H, we prove that the VC dimension characterizes learnability also in the random-order model, thus providing a further separation between this model and the general adversarial one. Notably, results similar in spirit to ours have been recently obtained in Haghtalab et al. (2022); Raman & Tewari (2024), where they prove that online binary classification is characterized by the VC dimension even in a non-stationary setting, as long as the adversarial input is smooth (Haghtalab et al., 2022) or the learner has access to good predictions (Raman & Tewari, 2024).

Discussion. On a theoretical level, our results have two main implications: (i) the minimax regret rates for the random-order and the stochastic model are typically the same, and (ii) it is fairly easy to construct the desired random-order algorithm starting from a stochastic one. These results also have important practical implications: if the order in which the input is presented can be controlled, the same regret rate achievable in the stochastic setting can be obtained by simply shuffling the dataset.

1.2. Our Techniques

The Birthday Paradox. We show that algorithms designed for the stochastic setting may fail in the random-order model, by leveraging the birthday paradox (from a pool of n elements, a duplicate appears after about $\sqrt{n}$ samples; see the short simulation sketch at the end of this subsection). In a pool of T loss vectors, a random-order instance corresponds to uniform sampling without replacement, whereas sampling with replacement mimics an i.i.d. distribution. When the support is finite, these processes are statistically distinct (Diaconis & Freedman, 1980).

SIMULATION. In contrast to the previous negative result, we develop a template, SIMULATION, to adapt stochastic algorithms to the random-order model. It partitions the time horizon into $\log T$ geometrically increasing windows. At the start of each window, past data is used to simulate an i.i.d. distribution matching the expected values of the random-order instance, allowing the algorithm to train in an i.i.d. setting without incurring real loss. The action frequencies during training are then used on the actual instance within the current block. The analysis leverages the fact that sampling without replacement, even if statistically distinct from i.i.d. sampling, concentrates around the same mean.
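To make the birthday-paradox calculation concrete, here is a minimal Python simulation (with illustrative names of our choosing) contrasting sampling with and without replacement from a pool of n distinct values: with replacement, the first repeated value typically appears after roughly $\sqrt{n}$ draws, while without replacement, i.e., in the random-order case, it never does.

```python
import random

def first_collision_time(n: int, with_replacement: bool) -> int:
    """Index of the first repeated draw from a pool of n distinct values
    (returns n if no repetition ever occurs)."""
    pool = list(range(n))
    draws = random.choices(pool, k=n) if with_replacement else random.sample(pool, n)
    seen = set()
    for t, x in enumerate(draws, start=1):
        if x in seen:
            return t
        seen.add(x)
    return n  # without replacement this branch is always reached

n, trials = 10_000, 1_000
avg_iid = sum(first_collision_time(n, True) for _ in range(trials)) / trials
avg_ro = sum(first_collision_time(n, False) for _ in range(trials)) / trials
print(f"with replacement (i.i.d.): first duplicate after ~{avg_iid:.0f} draws; sqrt(n) = {n ** 0.5:.0f}")
print(f"without replacement (random order): {avg_ro:.0f} (no duplicate ever appears)")
```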
1.3. Further Related Work

Random-Order Online Convex Optimization. Random-order inputs in online convex optimization have been studied in Garber et al. (2020) and Sherman et al. (2021). Their main result is that such random-order instances allow for logarithmic regret even if the single losses are not strongly convex, as long as the cumulative loss is strongly convex.

Online Combinatorial Problems in Random Order. Random-order inputs have also been studied in the context of online combinatorial optimization in Dong & Yoshida (2023). They consider approximate follow-the-leader algorithms and prove that when the offline optimization algorithm has low average sensitivity (i.e., the average impact of single points on the output is small, a notion introduced in Varma & Yoshida (2023)), then the offline-to-online reduction carries over also in the random-order model.

Online Learning with Long-Term Constraints. Online learning under time-varying constraints was first studied by Mannor et al. (2009), who showed that achieving both sublinear regret and constraint violations is impossible when costs and rewards are linear but adversarially generated. Therefore, a significant gap exists between the stochastic setting, where $\tilde{O}(\sqrt{T})$ regret and constraint violation are achievable (Yu et al., 2017; Badanidiyuru et al., 2013; Agrawal & Devanur, 2014), and the adversarial setting, where guarantees typically take the form of no-$\alpha$-regret bounds, with $\alpha \in (0, 1)$ representing the competitive ratio (Immorlica et al., 2022; Castiglioni et al., 2022a;b; Kesselheim & Singla, 2020; Bernasconi et al., 2024). Some studies achieve improved results under adversarially generated constraints, but at the cost of using a weaker baseline, such as static regret (Sun et al., 2017). Another line of work (Balseiro et al., 2023; Fikioris & Tardos, 2023) bridges the gap between stochastic and adversarial settings by making assumptions about the environment's evolution over time.

2. Preliminaries

A learner repeatedly chooses one of k actions over a time horizon T. At each time step t, the loss associated with action a is denoted by $\ell_t(a) \in [0, 1]$. The loss vectors in the random-order model are generated by an adversary and are presented to the learner in random order: the adversary generates a (multi-)set of T loss vectors $h_t \in [0, 1]^k$, which are then presented to the learner uniformly at random, i.e., a permutation $\pi$ is drawn and the loss at time t is $h_{\pi(t)} = \ell_t$. We adopt the notation $[n]$ to denote the set $\{1, 2, \ldots, n\}$, and we denote by $\Delta_n$ the n-dimensional probability simplex. The model and results for online classification are deferred to the Supplementary Materials (Appendix H).

2.1. Prediction with Delayed Feedback

In Prediction with Delayed Feedback, the loss vectors are revealed to the learner after a known and fixed delay d. More precisely, for any t, the learner only observes the loss vector $\ell_t$ after d time steps. We have the following definition of (expected) regret of algorithm A choosing actions $a_t$ against the random-order input S of losses $\ell_1, \ldots, \ell_T$:

$$R_T(A, S) = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(a_t)\right] - \min_{a \in [k]} \sum_{t=1}^{T} \ell_t(a). \tag{1}$$

We denote with regret of the algorithm A its worst-case regret, i.e., $\sup_S R_T(A, S)$. Note that the benchmark, i.e., the performance of the best fixed action in hindsight, is not influenced by the realization of the random permutation (in fact, we can equivalently write the benchmark as $\min_{a \in [k]} \sum_{t=1}^{T} h_t(a)$).
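As a concrete illustration of the model and of the regret in Equation (1), the following minimal Python sketch generates a random-order instance by shuffling an adversarial loss matrix and estimates the regret of a simple policy against the permutation-free benchmark. For simplicity it assumes immediate (undelayed) full feedback; all function names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_order_instance(H: np.ndarray) -> np.ndarray:
    """Return the rows of the adversarial loss matrix H (T x k) in uniform random order."""
    return H[rng.permutation(H.shape[0])]

def estimated_regret(H: np.ndarray, policy, n_perms: int = 100) -> float:
    """Monte Carlo estimate of the random-order regret of Equation (1).

    `policy(history)` returns the action to play given the list of previously
    observed loss vectors (immediate full feedback is assumed here).
    """
    benchmark = H.sum(axis=0).min()      # min_a sum_t h_t(a): permutation-free
    totals = []
    for _ in range(n_perms):
        losses, history, total = random_order_instance(H), [], 0.0
        for t in range(H.shape[0]):
            total += losses[t, policy(history)]
            history.append(losses[t])
        totals.append(total)
    return float(np.mean(totals) - benchmark)

# Toy usage: follow-the-leader on a 200-round, 3-action instance.
ftl = lambda hist: int(np.argmin(np.sum(hist, axis=0))) if hist else 0
print(estimated_regret(rng.uniform(size=(200, 3)), ftl))
```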
There is a plethora of delayed feedback models studied in the literature; for the sake of simplicity, here we adopt arguably the simplest one, with a fixed and known delay d, as in, e.g., Masoudian et al. (2022). With minimal effort, our results also apply to other models, e.g., random delay or unknown delay.

2.2. Online Learning with Constraints

We consider a setting similar to Balseiro et al. (2023), in which the learner has m resource-consumption constraints. At each t, the learner plays $x_t \in \Delta_k$ and subsequently observes a reward vector $r_t \in [0, 1]^k$ and m cost vectors $(c_{1,t}, \ldots, c_{m,t}) \in [0, 1]^{k \times m}$, one for each available resource (in the literature on Online Learning with Constraints and Bandits with Knapsacks, it is customary to use rewards instead of losses, and we follow the same notation here). The objective of the learner is to maximize the cumulative reward $\sum_{t=1}^{T} r_t^\top x_t$ while guaranteeing that, for each $j \in [m]$, $\sum_{t=1}^{T} c_{j,t}^\top x_t \le B$, where B is the available budget for each resource (assuming that all budgets are the same comes without loss of generality; see the discussion in Immorlica et al. (2022)). The per-iteration budget is defined as $\rho = B/T$. We assume that there is a known action $\varnothing$ such that $c_{j,t}(\varnothing) = 0$ for all j and t (this can be thought of as a skipping-turn action, usually employed in the Bandits with Knapsacks literature, see, e.g., Badanidiyuru et al. (2013); Immorlica et al. (2022)).

In the random-order input model, an instance S is represented by a multi-set of T tuples $(r, c_1, \ldots, c_m)$ presented to the learner in uniform random order. We denote with $\bar{r}$ the average reward of the tuples in S and, for each resource j, we let $\bar{c}_j$ be its average consumption. The natural benchmark for the problem is given by the solution to the following LP (see, e.g., Immorlica et al. (2022)):

$$\mathrm{OPT}_{\mathrm{LP}}(\bar{r}, \bar{c}_1, \ldots, \bar{c}_m) = \max_{x \in \Delta_k} \left\{ \langle \bar{r}, x \rangle \ : \ \langle \bar{c}_j, x \rangle \le \rho \ \ \forall j \in [m] \right\}.$$

With $\mathrm{OPT}_{\mathrm{LP}}$, we denote the solution to the above LP instantiated with the average rewards and costs computed from S. Then, we define the regret in this setting as

$$R_T(A, S) = T \cdot \mathrm{OPT}_{\mathrm{LP}} - \mathbb{E}\left[\sum_{t=1}^{\tau} r_t^\top x_t\right],$$

where $\tau \in [T]$ is the stopping time of the algorithm (i.e., the time at which the first of the m resources is fully depleted). Once $\tau$ is reached, the algorithm plays $\varnothing$ for the remainder of the time horizon. Therefore, our algorithm enforces resource-consumption constraints strictly, similar to what happens in bandits with knapsacks (Immorlica et al., 2022). This is a stronger requirement than ensuring sublinear constraint violations, which is the typical approach in online learning with long-term constraints (Mannor et al., 2009; Yu et al., 2017; Castiglioni et al., 2022b).
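The benchmark $\mathrm{OPT}_{\mathrm{LP}}$ is an ordinary linear program over the simplex and can be computed, for instance, with scipy; the following is a minimal sketch under our own naming conventions, not an implementation from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def opt_lp(r_bar: np.ndarray, C_bar: np.ndarray, rho: float) -> float:
    """Value of OPT_LP: max_{x in simplex} <r_bar, x> s.t. <c_bar_j, x> <= rho for all j.

    r_bar: (k,) average rewards; C_bar: (m, k) average per-resource costs.
    """
    k = r_bar.shape[0]
    res = linprog(
        c=-r_bar,                                        # linprog minimizes, so negate rewards
        A_ub=C_bar, b_ub=np.full(C_bar.shape[0], rho),   # resource constraints
        A_eq=np.ones((1, k)), b_eq=[1.0],                # x lies on the simplex
        bounds=[(0.0, 1.0)] * k,
    )
    return -res.fun

# Toy example: 3 actions, 2 resources, per-round budget rho = 0.25.
r_bar = np.array([0.9, 0.5, 0.0])              # last action is a zero-cost "void" action
C_bar = np.array([[0.6, 0.2, 0.0],
                  [0.5, 0.3, 0.0]])
print(opt_lp(r_bar, C_bar, rho=0.25))
```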
2.3. Bandits with Switching Costs

In the bandits with switching costs problem, at the end of each time t, the algorithm only observes its loss $\ell_t(a_t)$ and suffers an additional unitary loss every time it changes action. For instance, this is the same model studied in Cesa-Bianchi et al. (2013); Dekel et al. (2014). The goal is to devise a learning strategy that minimizes the suffered loss or, equivalently, minimizes the regret with respect to the best fixed action in hindsight. We then have the following definition of (expected) regret of algorithm A choosing actions $a_t$ (for simplicity, we define $a_{T+1} = a_T$) against the random-order input S of losses $\ell_1, \ldots, \ell_T$:

$$R_T(A, S) = \mathbb{E}\left[\sum_{t=1}^{T} \left(\ell_t(a_t) + \mathbb{I}\{a_t \neq a_{t+1}\}\right)\right] - \min_{a \in [k]} \sum_{t=1}^{T} \ell_t(a).$$

We denote with regret of the algorithm A its worst-case regret, i.e., $\sup_S R_T(A, S)$. Note that the benchmark is not influenced by the switching cost or by the random permutation (in fact, we can equivalently write the benchmark as $\min_{a \in [k]} \sum_{t=1}^{T} h_t(a)$).

3. The Birthday Paradox

In this section, we analyze a learning algorithm that exhibits the no-regret property against any stochastic input but fails against a random-order one. We consider the basic Prediction with Experts problem, where the learner immediately observes the loss vector $\ell_t$ upon playing $a_t$. The regret in the stochastic setting is defined with respect to the expected performance of the best fixed arm, while in the random-order model it is defined as in Equation (1) (see also Appendix B for a comparison of the regret definitions).

Consider the following simple BIRTHDAY-TEST algorithm: it repeatedly plays the first action until a certain stopping time $\tau$ is realized, after which it starts to run some no-regret algorithm such as FOLLOW-THE-LEADER (Hannan, 1957) from scratch (we only need the algorithm to be no-regret on stochastic i.i.d. inputs). The stopping time $\tau$ is defined as the first time step when one of two things happens: either $\ell_t(1) \notin \{i/T,\ \text{for } i = 1, 2, \ldots, T\}$, or $\ell_s(1) = \ell_t(1)$ for some $s \neq t$. Intuitively, up to time $\tau$, the algorithm's behavior while playing action 1 is essentially a test to determine whether the losses for this action are drawn uniformly at random from the set $\{i/T \mid i = 1, 2, \ldots, T\}$. If the underlying distribution is i.i.d., then $\tau$ realizes pretty soon, while it never does under the random-order model. This result is based on folklore calculations related to the birthday paradox, but we report a self-contained proof for completeness in Appendix D.

Lemma 3.1. For any i.i.d. input and T sufficiently large, it holds that $\mathbb{E}[\tau] \le 2\sqrt{T}$.

Consider now the performance of BIRTHDAY-TEST against any i.i.d. input: it suffers at most constant instantaneous regret up to the stopping time $\tau$ (for an overall expected regret of $4\sqrt{T}$, as proved in Lemma 3.1) and then runs FOLLOW-THE-LEADER, for an overall regret rate of $O(\sqrt{T \log T})$. It turns out that there exists a random-order instance that fools BIRTHDAY-TEST. The instance has only two actions: action 1, whose losses are given by a permutation of the set $\{i/T,\ \text{for } i = 1, 2, \ldots, T\}$, and action 2, which always yields 0 loss. BIRTHDAY-TEST on this instance always plays the first action, accumulating regret at each time step (as $\tau$ never realizes, by definition), for a cumulative regret of $\Omega(T)$. All in all, we have proved the following result.

Theorem 3.2. Consider the Prediction with Experts problem. BIRTHDAY-TEST exhibits $O(\sqrt{T \log T})$ regret against any stochastic i.i.d. instance, but there exists a random-order instance against which it suffers $\Omega(T)$ regret.
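A minimal Python rendering of BIRTHDAY-TEST as described above, with FOLLOW-THE-LEADER as the fallback routine; the stateful-policy interface and all names are our own illustrative choices.

```python
import numpy as np

def birthday_test_policy(T: int):
    """Stateful BIRTHDAY-TEST policy (full feedback, prediction with experts).

    Plays action 0 until the stopping time tau is realized, i.e., until a loss
    of action 0 falls outside the grid {1/T, ..., T/T} or repeats a previously
    observed value; afterwards, plays follow-the-leader on all observed losses.
    """
    grid = {round(i / T, 12) for i in range(1, T + 1)}
    seen, history = set(), []
    stopped = False

    def act() -> int:
        if not stopped:
            return 0
        return int(np.argmin(np.sum(history, axis=0)))  # follow-the-leader

    def observe(loss_vec) -> None:
        nonlocal stopped
        history.append(np.asarray(loss_vec, dtype=float))
        if not stopped:
            v = round(float(loss_vec[0]), 12)
            if v not in grid or v in seen:
                stopped = True  # the stopping time tau has been realized
            seen.add(v)

    return act, observe
```

Against i.i.d. losses drawn uniformly from the grid $\{i/T\}$, the internal flag flips after roughly $\sqrt{T}$ rounds; on the two-action random-order instance of Theorem 3.2, it never flips and the policy keeps playing the first action.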
4. A General Template: SIMULATION

In this section, we present a general reduction template: SIMULATION. We then show how this general idea can be applied to prediction with delayed feedback, to online learning with constraints, and to bandits with switching costs.

Algorithm 1 (The SIMULATION Template)
    Input: stochastic algorithm A
    for $i = 0, 1, 2, \ldots, \log T$ do
        (i) iid-ify past data: construct a distribution $D_i$ on past data
        (ii) Train A on a simulated past: run algorithm A over $2^i$ i.i.d. samples of $D_i$
        (iii) Test over new data: let $n_i(a)$ be the number of times that A plays action a on $D_i$
        for $t = 2^i + 1, 2^i + 2, \ldots, 2^{i+1}$ do
            Play action $a_t \sim (n_i(1)/2^i, \ldots, n_i(k)/2^i)$
        end for
    end for

SIMULATION. The core idea of our approach (see the pseudocode for details) is to divide the time horizon into blocks of geometrically increasing length, performing three key operations within each block. At the beginning of the i-th block, which has a length of roughly $2^i$, we look at the feedback received during the previous time steps and construct a distribution $D_i$ (for instance, by uniform sampling on the previous $2^i$ samples). Next, we train our stochastic algorithm A on $D_i$ (without incurring real losses). Finally, we use A's observed performance over $D_i$ to play against the actual losses in the time block. We observe that SIMULATION is not a black-box reduction but a general recipe for adapting stochastic algorithms to the random-order setting.

To build a high-level intuition on the effectiveness of SIMULATION, we can look at its performance when the underlying stochastic algorithm is BIRTHDAY-TEST. By fictitiously letting SIMULATION play against the distribution $D_i$, we trigger the stopping time $\tau$, so that BIRTHDAY-TEST starts playing as FOLLOW-THE-LEADER on the testing part.
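For concreteness, here is a compact, illustrative Python rendering of the SIMULATION template of Algorithm 1 in the plain full-feedback setting without delays; the interface of the stochastic routine (`act`/`update`) and all names are assumptions of ours, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulation_template(losses: np.ndarray, make_routine) -> float:
    """Run the SIMULATION template on a random-order loss matrix (T x k).

    `make_routine(k)` must return a fresh stochastic full-feedback algorithm
    exposing `act() -> action` and `update(loss_vector)`.  The routine is
    trained only on simulated i.i.d. data, never on the real sequence.
    """
    T, k = losses.shape
    # Play an arbitrary action at time 1 and record the revealed loss vector.
    total_loss = losses[0, 0]
    observed = [losses[0]]
    t, i = 1, 0
    while t < T:
        block = min(2 ** i, T - t)
        past = np.array(observed)
        # (i) iid-ify past data: D_i is uniform over the observed loss vectors.
        # (ii) Train A on 2^i i.i.d. samples from D_i (no real loss incurred).
        A, counts = make_routine(k), np.zeros(k)
        for _ in range(2 ** i):
            a = A.act()
            counts[a] += 1
            A.update(past[rng.integers(len(past))])
        freq = counts / counts.sum()
        # (iii) Test: play i.i.d. draws from the learned action frequencies.
        for s in range(block):
            a = rng.choice(k, p=freq)
            total_loss += losses[t + s, a]
            observed.append(losses[t + s])
        t += block
        i += 1
    return float(total_loss)
```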
4.1. Prediction with Delayed Feedback

In the prediction with delayed feedback model, the learner has access to the loss vector only after d time steps. To specialize the SIMULATION template to this model, we need to account for the delayed feedback by adding a buffer of d time steps between the blocks, allowing enough time to receive the feedback corresponding to the previous block (we refer to the pseudocode for further details).

Algorithm 2 (SIMULATION for delayed feedback)
    Input: stochastic algorithm A
    Environment: k actions, time horizon T
    Model: full feedback with delay d
    Play any action at time 1; initialize the multiset $O = \{\ell_1\}$
    for $i = 0, 1, 2, \ldots, \log T$ do
        Let $D_i$ be the uniform distribution on O    {iid-ify past}
        Run A over $2^i$ i.i.d. samples of $D_i$    {Training}
        Let $n_i(a)$ be the number of times that A plays a
        $t_i \leftarrow i \cdot d + 2^i$    {Beginning of the i-th block}
        for $t = t_i, t_i + 1, \ldots, t_i + 2^i - 1$ do
            Play action $a_t \sim (n_i(1)/2^i, \ldots, n_i(k)/2^i)$    {Test}
            Add loss $\ell_t$ to O when revealed
            if $t = T$, then terminate
        end for
        Play arbitrarily for the next d time steps
    end for

At the beginning of the generic time block i, the distribution $D_i$ that is used to simulate past data is simply the uniform distribution over all the loss vectors observed in the blocks before the i-th, which are contained in the multiset O (the losses observed in the $i \cdot d$ time steps corresponding to the previous buffers are discarded). To simplify the notation, we denote with $T_i$ the time steps in the i-th time block, i.e., $T_i = \{t_i, \ldots, t_i + 2^i - 1\}$, where $t_i = 2^i + i \cdot d$.

Fix any sequence of losses and any block i. At the beginning of $T_i$, the multiset O contains all the losses observed in the previous time blocks, which are $2^i$. Therefore, sampling according to $D_i$ yields an unbiased estimator of the average loss in the previous blocks. This is formalized in the following lemma.

Lemma 4.1. Fix any sequence of losses and time block i. Then we have:
$$\mathbb{E}_{\ell \sim D_i}[\ell(a)] = \frac{1}{2^i} \sum_{t \in \overline{T}_i} \ell_t(a) \qquad \forall a \in [k],$$
where $\overline{T}_i$ denotes the $2^i$ non-buffer time steps preceding block i.

In the analysis of SIMULATION for prediction with delayed feedback, we employ a generic stochastic routine A which is guaranteed an i.i.d. regret bound of $R^{\mathrm{iid}}_{T'}(A)$ against any stochastic input of length $T'$.

Theorem 4.2. Consider the problem of online prediction with delayed feedback in the random-order model, and let d be the delay parameter. Running SIMULATION for delayed feedback with stochastic routine A (SIM-A) yields the following regret bound:
$$R_T(\text{SIM-}A) \le 5\sqrt{T \log T} + d \log T + \sum_{i=0}^{\log T} R^{\mathrm{iid}}_{2^i}(A).$$

Proof. Fix any input sequence $S = \{h_1, \ldots, h_T\}$, any realization of the permutation $\{\ell_1, \ldots, \ell_T\}$, and denote with $a^*$ the corresponding best action ($a^*$ only depends on S, not on the specific permutation). For any block i, the stochastic algorithm A is trained over $2^i$ i.i.d. samples from $D_i$, and the empirical frequencies $n_i(a)/2^i$ are then used to play in $T_i$. For any action a, denote with $\Delta_i(a)$ the gap between the loss of action a and that of action $a^*$ according to $D_i$:
$$\Delta_i(a) = \mathbb{E}_{\ell \sim D_i}[\ell(a) - \ell(a^*)].$$
From the guarantees on the regret bound of A, we get
$$\sum_{a} \mathbb{E}[n_i(a)] \, \Delta_i(a) \le R^{\mathrm{iid}}_{2^i}(A), \tag{2}$$
where the expectation is with respect to the randomness in the i.i.d. sampling from $D_i$. The gaps $\Delta_i(a)$ induced by the distribution $D_i$ are related to the empirical gaps $\Delta^p_i(a)$ experienced in the previous time blocks by Lemma 4.1. In particular, for any $\Delta^p_i(a)$ (the superscript stands for "past") we have:
$$\Delta^p_i(a) = \frac{1}{2^i} \sum_{t \in \overline{T}_i} \left(\ell_t(a) - \ell_t(a^*)\right) \ \ \text{(by definition)} \ = \ \mathbb{E}_{\ell \sim D_i}[\ell(a) - \ell(a^*)] \ \ \text{(by Lemma 4.1)} \ = \ \Delta_i(a) \ \ \text{(by definition of } \Delta_i(a)\text{)}.$$
Plugging the above equality into the stochastic regret guarantee of Equation (2), we get:
$$\sum_{a} \mathbb{E}[n_i(a)] \, \Delta^p_i(a) \le R^{\mathrm{iid}}_{2^i}(A). \tag{3}$$
While the above inequality characterizes the performance of SIMULATION if run on past losses, we need to relate it to the future, or actual, ones, i.e., the ones appearing in time block i. We denote with $\bar\Delta_i(a) = \frac{1}{2^i}\sum_{t \in T_i} (\ell_t(a) - \ell_t(a^*))$ the gaps in the current time block. Let us now look at the actual regret suffered by the algorithm during $T_i$:
$$\sum_{t \in T_i} \left(\mathbb{E}[\ell_t(a_t)] - \ell_t(a^*)\right) = \sum_{a} \mathbb{E}[n_i(a)] \, \frac{1}{2^i}\sum_{t \in T_i} \left(\ell_t(a) - \ell_t(a^*)\right) \quad \text{(by design)}$$
$$= \sum_{a} \mathbb{E}[n_i(a)] \, \bar\Delta_i(a) \quad \text{(by definition of } \bar\Delta_i(a)\text{)}$$
$$= \sum_{a} \mathbb{E}[n_i(a)] \, \Delta^p_i(a) + \sum_{a} \mathbb{E}[n_i(a)] \left(\bar\Delta_i(a) - \Delta^p_i(a)\right)$$
$$\le R^{\mathrm{iid}}_{2^i}(A) + 2^i \max_{a} \left|\bar\Delta_i(a) - \Delta^p_i(a)\right|, \tag{4}$$
where the last inequality follows from Equation (3) and the fact that $\sum_{a \in [k]} n_i(a) = 2^i$.

So far, our argument works for any input model, as the sequence of losses was arbitrary; now it is time to exploit the fact that random-order sequences exhibit concentration. In particular, we want to argue that, for any block i and action a, both the past and future empirical gaps $\Delta^p_i(a)$ and $\bar\Delta_i(a)$ are close to the actual gap
$$\Delta(a) = \frac{1}{T}\sum_{i \in [T]} \left(h_i(a) - h_i(a^*)\right) = \frac{1}{T}\sum_{t \in [T]} \left(\ell_t(a) - \ell_t(a^*)\right).$$
We introduce the precision $\epsilon_i = 2\sqrt{\log T / 2^i}$ and the corresponding clean event $E_i$:
$$E_i = \left\{\max\left\{\left|\bar\Delta_i(a) - \Delta(a)\right|, \left|\Delta^p_i(a) - \Delta(a)\right|\right\} \le \epsilon_i, \ \forall a\right\}.$$
We can show that such an event is realized with high probability (see Appendix E for omitted proofs).

Claim 4.3. For any block i, the corresponding clean event is realized with probability at least $1 - 1/T$.

Then, conditioning on the clean event $E_i$, we can improve the bound in Equation (4):
$$\sum_{t \in T_i} \left(\mathbb{E}[\ell_t(a_t)] - \ell_t(a^*)\right) \le R^{\mathrm{iid}}_{2^i}(A) + 2^i \max_{a} \left|\bar\Delta_i(a) - \Delta^p_i(a)\right| \le R^{\mathrm{iid}}_{2^i}(A) + 2^{i+1} \max_{a} \max\left\{\left|\Delta(a) - \Delta^p_i(a)\right|, \left|\Delta(a) - \bar\Delta_i(a)\right|\right\} \le R^{\mathrm{iid}}_{2^i}(A) + 2^{i+1}\epsilon_i.$$
By removing the conditioning with respect to the clean event (and using the probability bound of Claim 4.3, together with the fact that the regret suffered under the bad event is at most $2^i$), we get:
$$\sum_{t \in T_i} \mathbb{E}\left[\ell_t(a_t) - \ell_t(a^*)\right] \le R^{\mathrm{iid}}_{2^i}(A) + 5\sqrt{2^i \log T}.$$
We are ready to bound the overall regret: for any time block, we have the above inequality, while for the $\log T$ buffers, we suffer an overall regret of at most $d \log T$. All in all, the regret $R_T$ of SIMULATION for delayed feedback with routine A is at most
$$\sum_{i=0}^{\log T} \left(R^{\mathrm{iid}}_{2^i}(A) + 5\sqrt{2^i \log T} + d\right) \le 5\sqrt{T \log T} + d \log T + \sum_{i=0}^{\log T} R^{\mathrm{iid}}_{2^i}(A).$$
This concludes the proof.

In particular, if we instantiate algorithm A to be FOLLOW-THE-LEADER, we get the following guarantee.

Corollary 4.4. There exists an algorithm for online prediction with delayed feedback in the random-order model that enjoys an $O(\sqrt{T \log T} + d \log T)$ regret bound.
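The clean event rests on the fact that averages of samples drawn uniformly without replacement concentrate around the full-sequence mean (Hoeffding, 1963; see Appendix C). The following small numerical sketch, with names of our choosing, compares the observed deviations with the precision $\epsilon_i = 2\sqrt{\log T / 2^i}$ used in the proof.

```python
import numpy as np

rng = np.random.default_rng(1)

def max_deviation(values: np.ndarray, s: int, trials: int = 2000) -> float:
    """Largest |sample mean - population mean| over `trials` samples of size s
    drawn uniformly without replacement from `values`."""
    mu = values.mean()
    return max(abs(rng.choice(values, size=s, replace=False).mean() - mu)
               for _ in range(trials))

T = 4096
column = rng.uniform(size=T)   # one coordinate h_1(a), ..., h_T(a) of an instance
for i in (4, 6, 8, 10):
    s = 2 ** i
    eps_i = 2 * np.sqrt(np.log(T) / s)   # the precision used in the clean event
    print(f"block size {s:5d}: max observed deviation {max_deviation(column, s):.3f}"
          f"  vs  eps_i = {eps_i:.3f}")
```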
4.2. Online Learning with Constraints

In the Online Learning with Constraints model, the learner has to play against an environment in which each action $a \in [k]$ has an associated reward and a vector of costs. We analyze the problem within the random-order model and show that, in this setting, it is possible to achieve regret guarantees comparable to those of the stochastic case, despite the impossibility results for the adversarial setting (Mannor et al., 2009). Here, we show how to adapt SIMULATION in the presence of long-term constraints. The main difficulty in adapting SIMULATION to this setting is the fact that the costs impose constraints on the entire time horizon, while SIMULATION naturally works on each block independently.

Algorithm 3 (SIMULATION for Online Learning with Constraints)
    Input: stochastic algorithm A, budget B, parameter $\delta$
    Environment: k actions, time horizon T
    Play any action at time 1; initialize $O = \{r_1, c_{1,1}, \ldots, c_{m,1}\}$
    for $i = 0, 1, 2, \ldots, \log T$ do
        Let $D_i$ be the uniform distribution over O    {iid-ify past}
        Run algorithm A over $2^i$ i.i.d. samples of $D_i$ with budget $\rho - \tilde{O}(2^{-i/2})$ and confidence $\delta' = \delta/(3 \log T)$    {Training}
        Let $n_i(a)$ be the number of times that A plays action a
        Compute $x_i(a)$ as $n_i(a)/2^i$ for each action a
        for $t = 2^i + 1, 2^i + 2, \ldots, 2^{i+1}$ do    {Play against the actual sequence}
            Play $x_i$ and subsequently add $(r_t, c_{1,t}, \ldots, c_{m,t})$ to O
        end for
    end for

To run SIMULATION in this setting, we need access to an algorithm A for stochastic online learning with constraints (see, e.g., Castiglioni et al., 2022a; Bernasconi et al., 2024). In particular, in a stochastic environment with mean reward $\bar{r}$ and mean costs $(\bar{c}_1, \ldots, \bar{c}_m)$, A guarantees a regret upper bound, with probability at least $1 - \delta$, of
$$T \cdot \mathrm{OPT}_{\mathrm{LP}}(\bar{r}, \bar{c}_1, \ldots, \bar{c}_m) - \sum_{t=1}^{\tau} \bar{r}^\top x_t \le R^{\mathrm{iid}, \delta}_{T}(A, \rho),$$
where $\tau$ denotes the first time at which a resource j is fully depleted, that is, $\sum_{t=1}^{\tau} c_{j,t}^\top x_t > \rho T$. We assume that A guarantees that $\rho \mapsto R^{\mathrm{iid},\delta}_T(A, \rho)$ is monotone decreasing; note that regret itself respects this property (a lower budget implies weaker performance), and we assume that the upper bound follows the same behavior, which is true for all known algorithms for the problem.

In each block i, we train A on i.i.d. samples $(r^{\mathrm{iid},i}_t, c^{\mathrm{iid},i}_{t,1}, \ldots, c^{\mathrm{iid},i}_{t,m}) \sim D_i$ (to simplify the notation, we drop the block index i when clear). Then, we take the empirical frequency of play $x_i$ obtained in the training step and use it against the actual environment in the time block $2^i + 1, \ldots, 2^{i+1}$. Crucially, during each training phase, we instantiate A with a slightly reduced per-iteration budget of $\rho - \tilde{O}(\sqrt{2^{-i}})$. This allows us to control concentration terms in the analysis. Then, one of the main challenges is proving that this budget rescaling does not significantly affect the guarantees of the algorithm A during the simulation phase.
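A small numerical sketch of the budget rescaling: following the analysis below, the training budget on block i is shrunk by twice the concentration slack $\epsilon_i = \sqrt{6 \cdot 2^{-i} \log(mk \log(T)/\delta)}$, and early blocks where the shrunk budget drops below $\rho/2$ are skipped (the void action is played). Function and variable names are ours.

```python
import numpy as np

def block_budget(rho: float, i: int, T: int, m: int, k: int, delta: float) -> float:
    """Per-iteration budget used when training the routine A on block i.

    eps_i is the concentration slack for averages over 2^i samples; the budget
    is shrunk by 2 * eps_i so that plays feasible for the simulated past stay
    (approximately) feasible on the real block.
    """
    eps_i = np.sqrt(6 * np.log(m * k * np.log(T) / delta) / 2 ** i)
    return rho - 2 * eps_i

# Blocks whose shrunk budget falls below rho / 2 are skipped (the void action
# is played), contributing the additive O~(1 / rho^2) term in the analysis.
rho, T, m, k, delta = 0.2, 2 ** 16, 2, 5, 0.05
for i in range(int(np.log2(T)) + 1):
    b = block_budget(rho, i, T, m, k, delta)
    print(i, round(b, 3), "skip" if b < rho / 2 else "train")
```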
Theorem 4.5. Consider the problem of online learning with long-term constraints in the random-order model. Running SIMULATION with stochastic routine A (SIM-A) yields the following regret bound with probability at least $1 - \delta$:
$$R_T(\text{SIM-}A) \le \tilde{O}\!\left(\frac{1}{\rho^2} + \frac{\sqrt{T}}{\rho}\right) + \sum_{i=1}^{\log T} R^{\mathrm{iid},\delta'}_{2^i}(A, \rho/2), \qquad \text{where } \delta' = \tilde{\Theta}(\delta).$$

Proof. Fix any sequence $S \in [0, 1]^{T \times k \times (m+1)}$ and any permutation $\{(r_t, c_{t,1}, \ldots, c_{t,m})\}_{t \in [T]}$. Let $x^*$ be the solution to $\mathrm{OPT}_{\mathrm{LP}}$. For each time block i, we define the following quantities: $\bar{r}_i = \frac{1}{2^i}\sum_{t=2^i+1}^{2^{i+1}} r_t$ is the empirical mean of the rewards in the next block, $\bar{r}^p_i = \frac{1}{2^i}\sum_{t=1}^{2^i} r_t$ is the empirical mean of the rewards in the past blocks, and $\tilde{r}_i = \mathbb{E}_{D_i}[r]$ is the expected reward w.r.t. $D_i$. Similarly, we define $\bar{c}_{i,j}$, $\bar{c}^p_{i,j}$, and $\tilde{c}_{i,j}$ for each resource $j \in [m]$. Clearly, $\tilde{r}_i = \bar{r}^p_i$ and $\tilde{c}_{i,j} = \bar{c}^p_{i,j}$.

We define the benchmark $\mathrm{OPT}^i_{\mathrm{LP}}$ on the past block as
$$\mathrm{OPT}^i_{\mathrm{LP}} = \max_{x \in \Delta_k} \left\{ x^\top \bar{r}^p_i \ : \ \max_{j \in [m]} x^\top \bar{c}^p_{i,j} \le \rho - 2\epsilon_i = \rho_i \right\},$$
where $\epsilon_i = \sqrt{6 \cdot 2^{-i} \log(mk \log(T)/\delta)} = \tilde{O}(\sqrt{2^{-i}})$. We define the event
$$E_i = \left\{\|\bar{r}_i - \bar{r}\|_\infty \le \epsilon_i, \ \|\tilde{r}_i - \bar{r}\|_\infty \le \epsilon_i, \ \|\bar{c}_{i,j} - \bar{c}_j\|_\infty \le \epsilon_i, \ \|\tilde{c}_{i,j} - \bar{c}_j\|_\infty \le \epsilon_i \ \ \forall j \in [m]\right\}.$$
From now on, we condition on the event $E_i$, which we show to hold with high probability with respect to the permutation. The regret of SIMULATION can be written as
$$R_T(\text{SIM-}A, S) = T \cdot \mathrm{OPT}_{\mathrm{LP}} - \sum_{i} \sum_{t=2^i+1}^{2^{i+1}} r_t^\top x_t = \sum_{i=1}^{\log T} 2^i \left(\bar{r}^\top x^* - \bar{r}_i^\top x_i\right), \tag{5}$$
where the second equality holds thanks to the fact that we are playing the fixed learned distribution $x_i$ in the i-th block, which is the one learned while playing against $D_i$. We only consider blocks with index $i > i^*$, where $i^* = O(\log(\log(mk \log(T)/\delta)/\rho^2))$ is such that $\rho_i = \rho - 2\epsilon_i \ge \rho/2$ for every $i > i^*$ (for smaller indices we could even have $\rho_i < 0$). For each block $i \le i^*$ we suffer linear regret (since we are forced to play the null action $\varnothing$), and this contributes an additional $\sum_{i=1}^{i^*} 2^i = \tilde{O}(1/\rho^2)$ regret term.

The no-regret properties of A guarantee that, with probability at least $1 - \delta'$,
$$2^i \, x^{*,\top}_i \bar{r}^p_i - \sum_{t=1}^{2^i} x_t^\top \tilde{r}_i = 2^i \left(x^{*,\top}_i \bar{r}^p_i - x_i^\top \tilde{r}_i\right) \le R^{\mathrm{iid},\delta'}_{2^i}(A, \rho - 2\epsilon_i),$$
where $x^*_i$ is the solution to $\mathrm{OPT}^i_{\mathrm{LP}}$, and
$$\sum_{s=1}^{t} x_s^\top c^{\mathrm{iid}}_{s,j} \le (\rho - 2\epsilon_i) \, 2^i, \tag{6}$$
which holds for all $t \in [2^i]$ and for all $j \in [m]$.

Consider one term of the regret decomposition of Equation (5):
$$\bar{r}^\top x^* - \bar{r}_i^\top x_i = \left(\bar{r}^\top x^* - \bar{r}^{p,\top}_i x^*_i\right) + \left(\bar{r}^{p,\top}_i x^*_i - \bar{r}_i^\top x_i\right) \le \left(\bar{r}^\top x^* - \bar{r}^{p,\top}_i x^*_i\right) + \frac{R^{\mathrm{iid},\delta'}_{2^i}(A, \rho - 2\epsilon_i)}{2^i} + \left(\tilde{r}_i - \bar{r}_i\right)^\top x_i, \tag{7}$$
where the last inequality holds by the regret properties of A. Now, we have to analyze the first and last terms of the inequality above. The last term is easy to bound: conditioning on $E_i$, it holds that
$$\left(\tilde{r}_i - \bar{r}_i\right)^\top x_i \le \|\tilde{r}_i - \bar{r}\|_\infty + \|\bar{r} - \bar{r}_i\|_\infty \le 2\epsilon_i.$$
Bounding the first term is trickier, as it is equivalent to bounding the difference between the values of two LPs. Indeed, it can be rewritten as $\mathrm{OPT}_{\mathrm{LP}} - \mathrm{OPT}^i_{\mathrm{LP}}$. The following claim (whose proof can be found in Appendix F) shows that this difference is not too large.

Claim 4.6. For all $i > i^*$, under the event $E_i$, we have that $\mathrm{OPT}_{\mathrm{LP}} - \mathrm{OPT}^i_{\mathrm{LP}} \le \epsilon_i \left(1 + \frac{3}{\rho}\right)$.

Plugging everything back into Equation (7), we get that, conditioning on the events $E_i$,
$$\bar{r}^\top x^* - \bar{r}_i^\top x_i \le \epsilon_i \left(1 + \frac{3}{\rho}\right) + 2\epsilon_i + \frac{R^{\mathrm{iid},\delta'}_{2^i}(A, \rho/2)}{2^i},$$
where we used that $\rho \mapsto R^{\mathrm{iid},\delta'}_t(A, \rho)$ is monotone decreasing and positive. Moreover, note that the probability measure is the one induced by the random permutation and does not depend on the algorithm.

Next, we analyze the violations, which for each t and each resource $j \in [m]$ are given by:
$$\sum_{s=1}^{t} x_s^\top c_{s,j} = \sum_{i} \sum_{s=2^i+1}^{2^{i+1}} x_i^\top c_{s,j} = \sum_{i} 2^i \, \bar{c}_{i,j}^\top x_i = \sum_{i} 2^i \, \tilde{c}_{i,j}^\top x_i + \sum_{i} 2^i \left(\bar{c}_{i,j} - \tilde{c}_{i,j}\right)^\top x_i \le \sum_{i} 2^i \, \tilde{c}_{i,j}^\top x_i + \sum_{i} 2^i \epsilon_i, \tag{8}$$
where the last inequality holds by conditioning on all the events $E_i$. To analyze the first term, consider the sequence of random variables $Z^{i,j}_t = \sum_{s=1}^{t} \left(c^{\mathrm{iid}}_{s,j} - \tilde{c}_{i,j}\right)^\top x_s$.
This is a martingale sequence (since $\mathbb{E}_{D_i}[c^{\mathrm{iid}}_{s,j}] = \tilde{c}_{i,j}$ and its differences are bounded by 1) with $Z^{i,j}_0 = 0$; thus, by Azuma-Hoeffding, with probability at least $1 - \frac{\delta}{3 m \log T}$ with respect to the randomness of $D_i$,
$$\left|Z^{i,j}_{2^i}\right| = \left|\sum_{s=1}^{2^i} \left(c^{\mathrm{iid}}_{s,j} - \tilde{c}_{i,j}\right)^\top x_s\right| \le 2^i \epsilon_i,$$
and thus, for all $i \in [\log T]$ and $j \in [m]$,
$$2^i \, \tilde{c}_{i,j}^\top x_i \le 2^i \epsilon_i + \sum_{s=1}^{2^i} c^{\mathrm{iid},\top}_{s,j} x_s \le 2^i \epsilon_i + 2^i (\rho - 2\epsilon_i)$$
with probability at least $1 - \delta/3$ (by a union bound over all $i \in [\log T]$ and $j \in [m]$); the second inequality holds thanks to the properties of A (Equation (6)). Plugged back into Equation (8), we get that
$$\sum_{s=1}^{t} x_s^\top c_{s,j} \le \sum_{i} \left(2^i \epsilon_i + 2^i \epsilon_i + 2^i(\rho - 2\epsilon_i)\right) \le \rho \, t.$$
Finally, we prove that all of the previous results hold with probability at least $1 - \delta$. We have three sources of randomness. The first is due to the permutation, which is encoded by the events $E_i$; the following claim proves that these hold with high probability (the proof is similar to that of Claim 4.3, relying on Theorem C.1 with additional union bounds on the constraints).

Claim 4.7. The event $\bigcap_{i=1}^{\log T} E_i$ holds with probability at least $1 - \delta/3$.

Second, we have the randomness of the i.i.d. sampling: by the Azuma-Hoeffding inequality, we proved that the corresponding bounds hold with probability at least $1 - \delta/3$. Finally, the algorithm A has guarantees with probability $1 - \delta' = 1 - \delta/(3 \log T)$ each time it is run on a block i. By a further union bound, all these events hold jointly with probability at least $1 - \delta$. This concludes the proof.

In particular, we can instantiate A to be the algorithm of Castiglioni et al. (2022a), which guarantees $R^{\mathrm{iid},\delta}_T(A, \rho) \le O\!\left(\frac{1}{\rho}\sqrt{T \log(kmT/\delta)}\right)$ (that algorithm is, in turn, instantiated with ONLINE-MIRROR-DESCENT with negative entropy as both the primal and the dual algorithm), and we can prove the following.

Corollary 4.8. There exists an algorithm for online learning with long-term constraints in the random-order model with regret $\tilde{O}(1/\rho^2 + \sqrt{T}/\rho)$ with high probability.

Our reduction maintains the same regret guarantees as A, but with an extra $1/\rho^2$ term instead of $1/\rho$. However, this term does not depend on T, excluding poly-logarithmic factors.

4.3. Bandits with Switching Costs

We present an algorithm for the Bandits with Switching Costs problem that suffers $O(\sqrt{T})$ regret in the random-order model, as opposed to the adversarial setting, where the minimax regret rate is $\Theta(T^{2/3})$ (Dekel et al., 2014). Our algorithm, SIMULATION-SUCCESSIVE-ELIMINATION, maintains a set of active actions and proceeds in geometrically increasing time blocks. In each time block, it simply plays the actions that are still active in a low-switching, round-robin way; then it updates the confidence bounds on the actions played and discards the ones that are proved to be, with high probability, suboptimal. We refer to the pseudocode for more details. Note that SIMULATION-SUCCESSIVE-ELIMINATION is an application of the SIMULATION paradigm to this problem, using SUCCESSIVE-ELIMINATION as the stochastic subroutine (e.g., Chapter 1 of Slivkins, 2019): running SUCCESSIVE-ELIMINATION on the i.i.d. version of past data quickly converges to playing round-robin on the active actions. For simplicity, we analyze this simpler version of the algorithm, already specialized for the routine SUCCESSIVE-ELIMINATION. We note that this algorithm is similar to a low-switching algorithm in the literature (Agrawal et al., 1988) that achieves $\sqrt{T}$ regret in the stochastic case.
It shares the geometrically increasing blocks and the idea of playing each active arm in contiguous intervals within each block; the main difference is that they use UCB as the routine to choose the next action, while our algorithm uses SUCCESSIVE-ELIMINATION to simplify the analysis in the random-order input model.

Algorithm 4 (SIMULATION-SUCCESSIVE-ELIMINATION)
    Environment: k actions and time horizon T
    Model: bandit feedback with switching costs
    $\hat{\ell}(a) \leftarrow 0$, $n(a) \leftarrow 0$ for all actions a; $A \leftarrow [k]$ (set of active actions); $i_0 \leftarrow \log k$
    for $t = 1, \ldots, 2^{i_0}$ do
        Play $a_t = t - 1$ (modulo k)    {Play each action once}
        $n(a_t) \leftarrow n(a_t) + 1$; $\hat{\ell}(a_t) \leftarrow \frac{1}{n(a_t)} \sum_{s=1}^{t} \mathbb{I}\{a_s = a_t\} \, \ell_s(a_s)$
    end for
    for $i = i_0, \ldots, \log T$ do
        $\epsilon_i \leftarrow \sqrt{10 \, k \log^3 T \cdot 2^{-i}}$    {Precision for block i}
        for $a \in A$ do
            Play action a for $2^i / |A|$ rounds
            Let $\hat{\ell}(a) \leftarrow \frac{1}{n(a)} \sum_{s=1}^{t} \mathbb{I}\{a_s = a\} \, \ell_s(a)$
        end for
        Play the last active action a up to time $\min\{2^{i+1} - 1, T\}$, updating $n(a)$ and $\hat{\ell}(a)$ accordingly
        Remove from A all actions a such that $\hat{\ell}(a) - \epsilon_i > \min_{a' \in A} \left(\hat{\ell}(a') + \epsilon_i\right)$
    end for

We state here the main result concerning SIMULATION-SUCCESSIVE-ELIMINATION; we defer the complete proof to the Supplementary Materials (Appendix G).

Theorem 4.9. There exists an algorithm for bandits with switching costs in the random-order model with regret $O(\sqrt{k T \log^3 T})$.

Proof Sketch. SIMULATION-SUCCESSIVE-ELIMINATION switches actions at most $O(k)$ times during each one of the $O(\log T)$ phases. Therefore, the overall switching cost is always $O(k \log T)$. We move our attention to the regret incurred by playing suboptimal actions. We can argue by concentration for random-order sequences (see Appendix C) that in the generic phase i only actions with overall suboptimality gap $O(\epsilon_{i-1})$ are still active, with high probability. Moreover, once again by concentration, we can argue that these actions' actual (permuted) losses during the generic block i are close to their actual suboptimality gap. Summing up over all blocks and plugging in our choice of the precision parameters $\epsilon_i$ yields the desired regret bound.

5. Conclusion and Further Directions

We study online learning in the random-order model, which lies between the adversarial and i.i.d. settings, contributing to the analysis of problems beyond worst-case instances. We show that stochastic algorithms may fail in the random-order model and, to address this, we propose SIMULATION, a general template for designing effective random-order algorithms. We study various relevant online learning problems and prove that random-order and stochastic inputs share the same minimax regret, improving on the adversarial case. While SIMULATION successfully provides black-box reductions for full feedback, the bandit model still requires a more specialized approach. Providing black-box constructions for broader feedback settings, including bandit feedback, is an interesting avenue for future research.

Acknowledgements

The work of MB, AC, FF, SL, and MR was partially funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
AC is partially supported by MUR - PRIN 2022 project 2022R45NBB funded by the Next Generation EU program and by the ERC grant Project 101165466 PLA-STEER. FF, SL, MB and MR are also partially supported by the FAIR (Future Artificial Intelligence Research) project PE0000013, funded by the Next Generation EU program within the PNRR-PE-AI scheme (M4C2, investment 1.3, line on Artificial Intelligence). FF, SL, and MR are supported by the Meta/Sapienza project on Online Constrained Optimization and Multi-Calibration in Algorithm and Mechanism Design and by the PNRR MUR project IR0000013So Big Data.it. SL and MR are also partially supported by the MUR PRIN grant 2022EKNE5K (Learning in Markets and Society). Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Agrawal, R., Hedge, M., and Teneketzis, D. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control, 33(10):899 906, 1988. Agrawal, S. and Devanur, N. R. Bandits with concave rewards and convex knapsacks. In EC, pp. 989 1006. ACM, 2014. Alvarez-Melis, D. and Broderick, T. A translation of the characteristic function of a random phenomenon by bruno de finetti. ar Xiv e-prints, pp. ar Xiv 1512, 2015. Babaioff, M., Immorlica, N., Kempe, D., and Kleinberg, R. Matroid secretary problems. J. ACM, 65(6):35:1 35:26, 2018. Badanidiyuru, A., Kleinberg, R., and Slivkins, A. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 207 216. IEEE, 2013. Balseiro, S. R., Lu, H., and Mirrokni, V. The best of many worlds: Dual mirror descent for online allocation problems. Operations Research, 71(1):101 119, 2023. Ben-David, S., Kushilevitz, E., and Mansour, Y. Online learning versus offline learning. Machine Learning, 29: 45 63, 1997. Ben-David, S., P al, D., and Shalev-Shwartz, S. Agnostic online learning. In COLT, 2009. Bernasconi, M., Castiglioni, M., Celli, A., and Fusco, F. Beyond primal-dual methods in bandits with stochastic and adversarial constraints. In Neur IPS, 2024. Bousquet, O., Boucheron, S., and Lugosi, G. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pp. 169 207. Springer, 2003. Castiglioni, M., Celli, A., and Kroer, C. Online learning with knapsacks: the best of both worlds. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 2767 2783. PMLR, 17 23 Jul 2022a. Castiglioni, M., Celli, A., Marchesi, A., Romano, G., and Gatti, N. A unifying framework for online optimization with long-term constraints. In Neur IPS, 2022b. Cesa-Bianchi, N., Dekel, O., and Shamir, O. Online learning with switching costs and other adaptive adversaries. In NIPS, pp. 1160 1168, 2013. De Finetti, B. Funzione caratteristica di un fenomeno aleatorio. In Atti del Congresso Internazionale dei Matematici, Tomo VI, pp. 179 190, 1929. De Finetti, B. La pr evision: ses lois logiques, ses sources subjectives. In Annales de l institut Henri Poincar e, volume 7, pp. 1 68, 1937. de Finetti, B. Sulla proseguibilit a di processi aleatori scambiabili. Rendiconti dell Istituto di Matematica dell Universit a di Trieste. An International Journal of Mathematics, 1969. Dekel, O., Ding, J., Koren, T., and Peres, Y. 
Bandits with switching costs: T 2/3 regret. In STOC, pp. 459 467. ACM, 2014. Desautels, T., Krause, A., and Burdick, J. W. Parallelizing exploration-exploitation tradeoffs with gaussian process bandit optimization. In ICML, 2012. Diaconis, P. and Freedman, D. Finite Exchangeable Sequences. The Annals of Probability, 8(4):745 764, 1980. Online Learning in the Random Order Model Dimitrov, N. B. and Plaxton, C. G. Competitive weighted matching in transversal matroids. Algorithmica, 62(1-2): 333 348, 2012. Dong, J. and Yoshida, Y. A batch-to-online transformation under random-order model. In Neur IPS, 2023. Dynkin, E. B. The optimum choice of the instant for stopping a markov process. Sov. Math. Dokl., 4(01):627 629, 1963. Ezra, T., Feldman, M., Gravin, N., and Tang, Z. G. General graphs are easier than bipartite graphs: Tight bounds for secretary matching. In EC, pp. 1148 1177. ACM, 2022. Fikioris, G. and Tardos, E. Approximately stationary bandits with knapsacks. In COLT, volume 195 of Proceedings of Machine Learning Research, pp. 3758 3782. PMLR, 2023. Garber, D., Korcia, G., and Levy, K. Y. Online convex optimization in the random order model. In ICML, volume 119 of Proceedings of Machine Learning Research, pp. 3387 3396. PMLR, 2020. Gupta, A. and Singla, S. Random-order models. In Beyond the Worst-Case Analysis of Algorithms, pp. 234 258. Cambridge University Press, 2020. Haghtalab, N., Han, Y., Shetty, A., and Yang, K. Oracleefficient online learning for smoothed adversaries. In Neur IPS, 2022. Hannan, J. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3(2):97 139, 1957. Hewitt, E. and Savage, L. J. Symmetric measures on cartesian products. Transactions of the American Mathematical Society, 80(2):470 501, 1955. Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13 30, 1963. Immorlica, N., Sankararaman, K., Schapire, R., and Slivkins, A. Adversarial bandits with knapsacks. Journal of the ACM (JACM), 69(6):1 47, 2022. Joulani, P., Gy orgy, A., and Szepesv ari, C. Online learning under delayed feedback. In ICML (3), volume 28 of JMLR Workshop and Conference Proceedings, pp. 1453 1461. JMLR.org, 2013. Kesselheim, T. and Singla, S. Online learning with vector costs and bandits with knapsacks. In COLT, volume 125 of Proceedings of Machine Learning Research, pp. 2286 2305. PMLR, 2020. Korula, N. and P al, M. Algorithms for secretary problems on graphs and hypergraphs. In ICALP (2), volume 5556 of Lecture Notes in Computer Science, pp. 508 520. Springer, 2009. Krengel, U. and Sucheston, L. Semiamarts and finite values. Bulletin of The American Mathematical Society, 83(10): 627 629, 1977. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Littlestone, N. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Mach. Learn., 2(4):285 318, 1987. Lykouris, T., Mirrokni, V. S., and Leme, R. P. Stochastic bandits robust to adversarial corruptions. In STOC, pp. 114 122. ACM, 2018. Mannor, S., Tsitsiklis, J. N., and Yu, J. Y. Online learning with sample path constraints. Journal of Machine Learning Research, 10(3), 2009. Masoudian, S., Zimmert, J., and Seldin, Y. A best-of-bothworlds algorithm for bandits with delayed feedback. In Neur IPS, 2022. Mazzetto, A. and Upfal, E. An adaptive algorithm for learning with unknown distribution drift. In Neur IPS, 2023. Peng, B. and Tang, Z. G. 
Order selection prophet inequality: From threshold optimization to arrival time design. In FOCS, pp. 171 178. IEEE, 2022. Raman, V. and Tewari, A. Online classification with predictions. In Neur IPS, 2024. Roughgarden, T. (ed.). Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press, 2020. Sachs, S., Hadiji, H., van Erven, T., and Guzm an, C. Between stochastic and adversarial online convex optimization: Improved regret bounds via smoothness. In Neur IPS, 2022. Sauer, N. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145 147, 1972. Serfling, R. J. Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, pp. 39 48, 1974. Shelah, S. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41(1):247 260, 1972. Sherman, U., Koren, T., and Mansour, Y. Optimal rates for random order online optimization. In Neur IPS, pp. 2097 2108, 2021. Online Learning in the Random Order Model Slivkins, A. Introduction to multi-armed bandits. Found. Trends Mach. Learn., 12(1-2):1 286, 2019. Slivkins, A. and Upfal, E. Adapting to a changing environment: the brownian restless bandits. In COLT, pp. 343 354. Omnipress, 2008. Sun, W., Dey, D., and Kapoor, A. Safety-aware algorithms for adversarial contextual bandit. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 3280 3288. PMLR, 2017. Vapnik, V. N. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264 280, 1971. Varma, N. and Yoshida, Y. Average sensitivity of graph algorithms. SIAM J. Comput., 52(4):1039 1081, 2023. Yu, H., Neely, M. J., and Wei, X. Online convex optimization with stochastic constraints. In NIPS, pp. 1428 1438, 2017. Online Learning in the Random Order Model A. The De Finetti Theorem The De Finetti Theorem states that any (infinite) sequence of exchangeable12 random variables behaves (in distribution) as a mixture of i.i.d. distribution. This result has been initially proven for binary random variables in De Finetti (1929) (a translated version is available on the ar Xiv Alvarez-Melis & Broderick (2015)), and then generalized to real random variables in De Finetti (1937); Hewitt & Savage (1955). However, this representation theorem does not hold (de Finetti, 1969; Diaconis & Freedman, 1980) for finite sequences. In particular, Diaconis & Freedman (1980) provides tight quantitative bounds on the difference between exchangeable sequences and their closest mixture of i.i.d. distributions. We report here a lower bound on such distance. Let U2n be an urn containing n red and n black balls. Let H2n,2k, respectively M2n,2k, represent the distribution of the number of red balls out of 2k draws made at random without, respectively with, replacement from U2n. Furthermore, for any random variable P in [0, 1], consider the random experiment that consists of first sampling p according to P and then sampling k i.i.d. Bernoulli with parameter p; denote with PP,k the distribution of the number of successes according to this experiment. We have the following Theorem (Proposition 31 and Theorem 40 of Diaconis & Freedman (1980)): Theorem A.1. Let n tend to , and k o(n). 
Then, for any random variable P in [0, 1], the following inequality holds: ||H2n,2k PP,k||TV ||H2n,2k M2n,2k||TV = 2 πe k n + O k Stated differently, sampling with replacement and without replacement induce different distributions whose total variation distance roughly scales linearly in the sampled fraction of the urn. B. The Hierarchy between the Three Models The random-order model is intermediate between the i.i.d. and adversarial input model. We provide here a formal proof of this fact for the standard online learning setting with k actions and loss minimization. Stochastic i.i.d. Input Model. The input S is defined by a distribution D over loss vectors, from which T i.i.d. samples are drawn. The regret suffered by algorithm A on input S is defined as follows: Ri.i.d. T (A, S) = max a [k] E t=1 ℓt(at) ℓt(a) The stochastic regret Ri.i.d. T (A) of an algorithm A is its worst-case regret against all stochastic inputs. Random-Order Model. The input S is defined by a set {h1, . . . , h T } of T loss vectors, that are presented in random order to the learner, so that the generic ℓt is hπ(t) for a random permutation π. The regret suffered by algorithm A on input S is defined as follows: RRO T (A, S) = max a [k] E t=1 ℓt(at) ℓt(a) The random-order regret RRO T (A) of an algorithm A is its worst-case regret against all random-order inputs. (Oblivious) Adversarial Model. The input S is defined by a sequence of T loss vectors. The regret suffered by algorithm A on input S is defined as follows: RADV T (A, S) = max a [k] E t=1 ℓt(at) ℓt(a) The adversarial regret RADV T (A) of an algorithm A is its worst-case regret against all adversarial inputs. 12A sequence X1, X2, . . . , Xn of random variables is exchangeable if for any finite set of indices I = (i1, . . . , iℓ), and any permutation π over I it holds that (Xi1, . . . , Xiℓ) is distributed as (Xπ(i1), . . . , Xπ(iℓ)). Online Learning in the Random Order Model No-regret. An algorithm is said to enjoy the no-regret property in a certain input model if it exhibits sublinear regret (in the time horizon T) uniformly over all possible inputs in that model. The informal statement contained in the main body that i.i.d. random-order adversarial is formalized in the following theorem. Theorem B.1. For any learning algorithm A, the following regret hierarchy holds: Ri.i.d. T (A) RRO T (A) RADV T (A). Proof. We start with the first inequality of the statement. Consider any random instance S, and let X [0, 1]k be the support of the random variable underlying S,13 and denote with ΣX the multiset of all the subsets of T elements in X (possibly with repetitions). Finally, for any σ ΣX, denote with Eσ the event that the set of the realized losses is σ. We have the following: max a [k] E t=1 ℓt(at) ℓt(a) = max a [k] σ ΣX P [Eσ] E t=1 ℓt(at) ℓt(a)|Eσ max a [k] E t=1 ℓt(at) ℓt(a)|Eˆσ t=1 ℓt(at) ℓt(a)|Eˆσ = RRO T (A, Sπ), where the first inequality follows by an averaging argument (i.e., there exists a ˆσ ΣX that dominates the average stochastic regret), and the last equality holds because once the multiset of realized values ˆσ is fixed (we need multisets because there may be repeated elements), then all possible permutations are equally likely and thus we are back at the random-order model. Moreover, note that the maxi [k] only depends on the multiset ˆσ and not on the specific realizations (so that we can safely move it into the expectation). Taking the sup with respect to the stochastic inputs yields the first inequality. 
The argument for the second inequality is similar. Fix any random-order instance S, characterized by a set of T losses {h1, . . . , h T } that are randomly permuted according to π, and denote with Sπ the corresponding adversarial input. Denote with Π the set of all permutations over the T rounds. We have the following: max a [k] E t=1 ℓt(at) ℓt(a) = max a [k] 1 T! t=1 hπ(t)(at) hπ(t)(a) max π Π RADV T (A, Sπ) = RADV T (A, Sπ ), where π is the permutation that maximizes the regret. Taking the sup with respect to the random-order inputs yields the second inequality. The above results tell that the minimax regret for the i.i.d. setting is at most that for the random-order one, which, in turn, is upper bounded by that of the adversarial input model. C. Concentration for Sampling without Replacement In the original paper by Hoeffding (Hoeffding, 1963), a concentration bound for sampling without replacement is provided. Theorem C.1. Let y1, y2, . . . , y T be a sequence of numbers in [0, 1], and consider an uniform random subset S {y1, . . . , y T }, where s = |S|. Then, the following inequality holds for any λ > 0: 2 exp( sλ2), 13Out of simplicity, we assume such support finite. The general statement can be proved with minimal arrangement. Online Learning in the Random Order Model where µ = 1 t [T ] yt is the average over the whole sequence. Stated differently, if we have access to s samples in the random-order model, we have that, with probability at least (1 δ), the following concentration holds: 1 For completeness, we note that random-order sequences typically concentrate faster than i.i.d. ones. There is, in fact, a refined analysis of Hoeffding inequalities that holds in the random-order model, due to Serfling (1974), that provides the following tighter concentration bound: D. Missing Proof from Section 3 Lemma 3.1. For any i.i.d. input and T sufficiently large, it holds that E [τ] 2 Proof. It is immediate to observe that E [τ] is maximized when the distribution of the losses on arm 1 is the uniform distribution on the support {i/T, for i = 1, 2, . . . , T}, so we analyze that case. As a first step in the analysis, we find an integral formula for the expected value of τ (when ℓt(1) is drawn i.i.d. from {i/T, for i = 1, 2, . . . , T}), and then we study its asymptotic behaviour. We have the following chain of inequalities: t=0 P [τ > t] T T (Avoid collision up to time t) t e xdx ( R xje xdx = j!) We then focus on the integral term, at the right-most side of Equation (11), and study the convergence of E[τ] T . Denote the integrand with f T (y), and consider its limit for large T: lim T + f T (y) = lim T + exp h T log 1 + y Moreover, f T (y) is monotonically decreasing in T and f1(y) is integrable; therefore, we can apply the dominated convergence theorem and argue that lim T + E [τ] 0 lim T + f T (y)dy = Z + which concludes the proof. Online Learning in the Random Order Model E. Missing Proof from Section 4.1 Claim 4.3. For any block i, the corresponding clean event is realized with probability at least 1 1/T. Proof of Claim 4.3. i(a) and p i (a) have the same distribution, so we focus only on the former. By Hoedffding inequality for sampling without replacement (Theorem C.1), we have that, for each a [k]: | i(a) (a)| with probability at least 1 1 2T 2 . The statement follows by union bounding over all 2k events (and k T). F. Missing Proof from Section 4.2 Claim 4.6. For all i > i , under the event Ei, we have that OPTLP OPTi LP ϵi Proof of Claim 4.6. 
F. Missing Proof from Section 4.2

Claim 4.6. For all $i > i^\star$, under the event $E_i$, we have that $\mathrm{OPT}_{\mathrm{LP}} - \mathrm{OPT}^i_{\mathrm{LP}} \le \epsilon_i\left(1 + \frac{3}{\rho}\right)$.

Proof of Claim 4.6. First, we consider the following inequalities:
$$\mathrm{OPT}_{\mathrm{LP}} - \mathrm{OPT}^i_{\mathrm{LP}} = \bar r \cdot x^\star - r^{p,i} \cdot x^\star_i = (\bar r - r^{p,i}) \cdot x^\star + r^{p,i} \cdot (x^\star - x^\star_i) \le \epsilon_i + r^{p,i} \cdot (x^\star - x^\star_i) \qquad (\text{under } E_i).$$
Now, we consider the second term. We will show that $x^\star$ is approximately feasible for the LP for which $x^\star_i$ is optimal and, thanks to the strict feasibility of action $\varnothing$, we obtain that $x^\star$ cannot be too much better than $x^\star_i$. Under the event $E_i$ we have that
$$c^{p,i}_j \cdot x^\star \le \bar c_j \cdot x^\star + \epsilon_i \le \rho + \epsilon_i,$$
which shows that $x^\star$ is $3\epsilon_i$-feasible for the constraints of $\mathrm{OPT}^i_{\mathrm{LP}}$. Then consider $\tilde x = \beta x^\star + (1 - \beta)\varnothing$ with $\beta = \frac{\rho}{\rho + 2\epsilon_i}$: clearly $c^{p,i}_j \cdot \tilde x \le \rho - \epsilon_i$, so that $\tilde x$ is feasible for $\mathrm{OPT}^i_{\mathrm{LP}}$, and $r^{p,i} \cdot \tilde x = \beta\, r^{p,i} \cdot x^\star$. Then
$$r^{p,i} \cdot (x^\star - x^\star_i) = r^{p,i} \cdot (\tilde x - x^\star_i) + r^{p,i} \cdot (x^\star - \tilde x) \le 0 + r^{p,i} \cdot (x^\star - \tilde x) \qquad (\tilde x \text{ is feasible for } \mathrm{OPT}^i_{\mathrm{LP}})$$
$$\le (1 - \beta)\, r^{p,i} \cdot x^\star \le \frac{3\epsilon_i}{\rho} \qquad (\|r^{p,i}\|_\infty \le 1),$$
which concludes the proof.

G. Missing Proofs from Section 4.3

We start analyzing the algorithm by counting the number of switches. Regardless of the instance and of the randomness of the algorithm, it switches action at most $k + 1$ times in each of the $O(\log T)$ time blocks. This is summarized in the following lemma.

Lemma G.1 (Low-Switching). For any realization of SIMULATION-SUCCESSIVE-ELIMINATION, the total switching cost is at most $2k \log T$.

We now restate and prove the regret guarantees of SIMULATION-SUCCESSIVE-ELIMINATION.

Theorem 4.9. There exists an algorithm for bandits with switching costs in the random-order model with regret $O\big(\sqrt{k\,T \log^3 T}\big)$.

Proof. Fix any random-order instance $S$ on a set of $T$ vectors $\{h_1, \ldots, h_T\}$, and denote with $\bar\ell(a)$ the average loss associated with each action $a$: $\bar\ell(a) = \sum_{t \in [T]} \ell_t(a)/T$. Denote with $a^\star$ any action with minimum average loss $\bar\ell$, and introduce the gap $\Delta(a) = \bar\ell(a) - \bar\ell(a^\star)$. For any time block $i$, let $\hat\ell_i(a)$ denote the value of the estimator of $\bar\ell(a)$ at the end of that time block, and denote with $N_i(a)$ the number of blocks (including the $i$-th) in which action $a$ has been active. Note that, once an action is deactivated, it remains inactive for the rest of the algorithm.

For each time block $T_i$, there are at most $k$ ways in which it can be partitioned into the sub-blocks used by the round-robin phase of the algorithm, depending on the cardinality of the active set $A$. We make a simplifying assumption for the calculations: we assume that the time blocks are perfectly divisible into sub-blocks. This avoids having to deal with the suffix of each time block corresponding to the remainder of the division into sub-blocks; having extra samples for the last active action would only improve concentration.

For each action $a$ and block $i$, we define the following clean event, which requires that $\bar\ell(a)$ is well estimated across all possible sub-blocks. In block $i$ we denote with $B^j_i(g)$ the $g$-th sub-block of $2^i/j$ time steps, i.e., $B^j_i(g) = \{2^i(1 + g/j) + 1, \ldots, 2^i(1 + (g+1)/j)\}$:
$$E^a_i = \left\{ \left|\frac{j}{2^i}\sum_{t \in B^j_i(g)} \ell_t(a) - \bar\ell(a)\right| \le \sqrt{\frac{j}{2^i}\log\!\left(2T^5\right)} \quad \text{for all } j \in [k] \text{ and } g \in [j] \right\}.$$

Claim G.2. For any block $i$ and action $a$, the clean event $E^a_i$ has probability at least $1 - 1/T^3$.

Proof. Fix any $B^j_i(g)$; by applying the Hoeffding inequality for sampling without replacement (Theorem C.1), we get that the following inequality holds with probability at least $1 - T^{-5}$:
$$\left|\frac{j}{2^i}\sum_{t \in B^j_i(g)} \ell_t(a) - \bar\ell(a)\right| \le \sqrt{\frac{j}{2^i}\log\!\left(2T^5\right)}.$$
Union bounding over all $k$ choices of $j$ and (at most) $k$ choices of $g$ yields the desired statement.

It is now natural to look at the overall clean event $E$ obtained by intersecting the $E^a_i$ over all $\log T$ time blocks and $k$ actions. By union bounding, we get that the clean event holds with high probability.

Claim G.3 (Clean event). The clean event has probability at least $1 - 1/T$.
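The proofs in this appendix refer to the block structure of SIMULATION-SUCCESSIVE-ELIMINATION. The following sketch is our illustration of a scheme of that kind (doubling blocks, round-robin over the active actions, elimination at the end of each block); the actual algorithm and the exact confidence widths $\epsilon_i$ are those of Section 4.3, and the width `eps` below is only a stand-in. It also reports the number of switches, which stays within a few per block, as in Lemma G.1.

```python
import numpy as np

def block_successive_elimination(stream):
    """Doubling blocks; each block is split evenly (round-robin) among the active actions.
    At the end of a block, actions whose estimated mean loss is clearly suboptimal
    are deactivated. Returns (total loss, number of action switches)."""
    T, k = stream.shape
    active = list(range(k))
    sums, counts = np.zeros(k), np.zeros(k)
    total_loss, switches, last = 0.0, 0, None
    t, i = 0, 0
    while t < T:
        block = min(2 ** i, T - t)
        per_action = max(block // len(active), 1)        # even split; remainder ignored as in the proof
        for a in active:
            for _ in range(per_action):
                if t >= T:
                    break
                if a != last:
                    switches += 1
                    last = a
                total_loss += stream[t, a]
                sums[a] += stream[t, a]
                counts[a] += 1
                t += 1
        eps = np.sqrt(10 * k * np.log(T) ** 2 / 2 ** i)  # stand-in for the width eps_i
        means = sums / np.maximum(counts, 1)
        best = min(means[a] for a in active)
        active = [a for a in active if means[a] - eps <= best + eps]  # deactivation rule
        i += 1
    return total_loss, switches

rng = np.random.default_rng(1)
T, k = 4096, 4
mean_losses = np.linspace(0.2, 0.6, k)                   # losses chosen upfront by the adversary
H = np.clip(rng.normal(mean_losses, 0.2, size=(T, k)), 0, 1)
stream = H[rng.permutation(T)]                           # random-order presentation
loss, sw = block_successive_elimination(stream)
print(f"algorithm loss {loss:.0f} vs best fixed action {stream.sum(0).min():.0f}, switches = {sw}")
```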
Before proceeding, we underline that the clean event does not depend in any way on the algorithm: it is only a probabilistic statement about the random permutation, over fixed intervals. The crucial observation is that, under the clean event, SIMULATION-SUCCESSIVE-ELIMINATION is guaranteed to maintain precise estimates throughout. More precisely, denoting with $\hat\ell_i(a)$ the estimate of $\bar\ell(a)$ maintained at the end of block $i$, we have the following claim.

Claim G.4. For all blocks $i$ and active actions $a$, conditioning on the clean event, we have that $|\hat\ell_i(a) - \bar\ell(a)| \le \epsilon_i$.

Proof. If action $a$ is active at the beginning of block $i$, then it has been explored in all previous blocks. Denote with $T(a)$ the number of times action $a$ has been played up to block $i$ included, and with $T_j(a)$ the times in the generic block $T_j$ in which action $a$ has been played; clearly $|T_j(a)|$ is at least $2^j/k$ and $T(a) = \sum_{j \le i} |T_j(a)|$. We have the following:
$$\left|\hat\ell_i(a) - \bar\ell(a)\right| = \left|\frac{1}{T(a)}\sum_{j \le i}\sum_{t \in T_j(a)} \big(\ell_t(a) - \bar\ell(a)\big)\right| \le \frac{1}{T(a)}\sum_{j \le i} |T_j(a)|\left|\frac{1}{|T_j(a)|}\sum_{t \in T_j(a)} \ell_t(a) - \bar\ell(a)\right|$$
$$\le \frac{1}{T(a)}\sum_{j \le i} \sqrt{|T_j(a)|\,\log\!\left(2T^5\right)} \qquad \text{(by definition of the clean event)}$$
$$\le \sqrt{\frac{i\,\log\!\left(2T^5\right)}{T(a)}} \le \sqrt{\frac{10\,k\log^2 T}{2^i}} \le \epsilon_i \qquad \text{(by Jensen's inequality)},$$
where in the last inequality we used that an active action is played with frequency at least $1/k$, so that $T(a) \ge 2^i/k$.

The fact that all active actions are well estimated (and that there always exists at least one active action) implies that, under the clean event, the best action $a^\star$ is always active. In fact, for any block $i$ and active action $a$ it holds that
$$\hat\ell_i(a^\star) - \epsilon_i \le \bar\ell(a^\star) \le \bar\ell(a) \le \hat\ell_i(a) + \epsilon_i,$$
so the deactivation rule is never triggered for $a^\star$. We have then proved that, under the clean event, the instantaneous regret in the generic block $i$ is at most $2\epsilon_i$. We know that the clean event is realized with high probability (Claim G.3) and that the number of switches is $O(k \log T)$ (Lemma G.1). All in all, we have the following bound on the regret:
$$R_T(\mathrm{SSE}) \le 2k\log T + 4\sum_{i=1}^{\log T} 2^i \epsilon_i \le O\big(\sqrt{k\,T\log^3 T}\big),$$
thus concluding the proof.

H. Classification in the Random-Order Input Model

We study the classical model of binary classification: we have a family $H$ of hypotheses $h$ defined on a set $X$, for which a bounded loss function $\ell$ is defined. Formally, for any hypothesis $h$ and pair $(x, y)$ with $x \in X$ and $y \in \{0, 1\}$, $\ell(h(x), y) \in [0, 1]$ denotes the cost incurred by predicting label $h(x)$ on $(x, y)$ (see, e.g., Bousquet et al., 2003). Note that, to be consistent with the literature, in this section we use $\ell$ to denote the loss function (as opposed to the loss vectors in the rest of the paper) and $h$ for classifiers (as opposed to the non-shuffled input loss vectors in the rest of the paper).

PAC Model. In the PAC learning model, the learner has access to i.i.d. samples from a joint distribution $\mathcal{D}$ from which pairs $(X, Y)$ are drawn, and its goal is to quickly estimate or approximate the best classifier, i.e., the classifier that minimizes $\mathbb{E}[\ell(h(X), Y)]$. It is a standard result in learning theory that a class $H$ is learnable, i.e., the best classifier can be approximated to arbitrary precision, if and only if the VC dimension $d_{\mathrm{VC}}$ of $H$ is bounded, and that the convergence rate also depends on $d_{\mathrm{VC}}$ (see, e.g., Vapnik & Chervonenkis, 1971).

(Adversarial) Online Model. In the online model (Littlestone, 1987), the input-label pairs are generated adversarially, and the performance of a learning algorithm is measured with respect to the best hypothesis in $H$. Surprisingly, the online learnability of a family $H$ is governed by a stricter notion of dimension: the Littlestone dimension.
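The gap between the two notions of dimension is already visible for one-dimensional thresholds, which have VC dimension 1 but unbounded Littlestone dimension. The sketch below (ours, purely illustrative) runs the same consistent learner on the same labeled multiset presented in a crafted order and in a uniformly random order: the crafted order forces a mistake on roughly half of the rounds, whereas in random order the number of mistakes grows only logarithmically.

```python
import numpy as np

def consistent_threshold_mistakes(xs, ys, order):
    """A consistent (ERM-style) learner for thresholds on [0, 1]: it predicts label 1 only when
    the point is >= the smallest 1-labelled example seen so far, i.e., it breaks ties in the
    ambiguous region towards label 0. Returns the number of online mistakes."""
    smallest_one = np.inf          # smallest point seen so far with label 1
    mistakes = 0
    for t in order:
        x, y = xs[t], ys[t]
        pred = 1 if x >= smallest_one else 0
        mistakes += int(pred != y)
        if y == 1:
            smallest_one = min(smallest_one, x)
    return mistakes

rng = np.random.default_rng(2)
T = 1024
xs = np.arange(1, T + 1) / T                 # fixed instance: the points 1/T, 2/T, ..., 1
ys = (xs > 0.5).astype(int)                  # labels realizable by the threshold 1/2

adversarial_order = np.argsort(-xs)          # present the largest (1-labelled) points first
random_order = rng.permutation(T)

print("adversarial order:", consistent_threshold_mistakes(xs, ys, adversarial_order), "mistakes")
print("random order:     ", consistent_threshold_mistakes(xs, ys, random_order), "mistakes")
```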
Random-Order Model. In this paper, we study the random-order model: the input is given by a multiset $S$ of $(x, y)$ pairs, which are presented to the learner in uniform random order. The performance measure of a learning algorithm is its regret with respect to the best classifier on $S$. As the input arrives according to some random permutation $\pi$, we let $\ell^\pi_t(h) = \ell(h(x_{\pi(t)}), y_{\pi(t)})$ be the loss of hypothesis $h \in H$ at time $\pi(t)$. Similarly, we let $\hat\ell^\pi_t(h) = \frac{1}{t}\sum_{\tau : \pi(\tau) \le \pi(t)} \ell^\pi_\tau(h)$ be the empirical loss accrued by time $\pi(t)$, and $\hat\ell^\pi_{t:T}(h) = \frac{1}{T - t + 1}\sum_{\tau : \pi(\tau) \ge \pi(t)} \ell^\pi_\tau(h)$. Moreover, we denote by $\bar\ell(h) = \frac{1}{T}\sum_{\tau=1}^{T} \ell(h(x_\tau), y_\tau)$ the average loss of hypothesis $h$ on the whole dataset composed of the $T$ input-label pairs.

H.1. Sample Complexity with Random-Order Input

Lemma H.1. Consider the problem of online binary classification in the random-order model. For a hypothesis class $H$ with VC dimension $d_{\mathrm{VC}}$, given the first $t - 1$ out of $T$ samples from a uniform permutation $\pi$, it holds that, for all $h \in H$,
$$\big|\bar\ell(h) - \hat\ell^\pi_{t-1}(h)\big| \;\le\; \sqrt{\frac{8\,d_{\mathrm{VC}}}{t-1}\log\frac{2e(t-1)}{d_{\mathrm{VC}}}} + \sqrt{\frac{8}{t-1}\log\frac{2}{\delta}}$$
with probability at least $1 - \delta$.

The proof of Lemma H.1 implements a version of the classical symmetrization argument for uniform convergence in the i.i.d. setting (Bousquet et al., 2003). We consider a ghost sample made of $t - 1$ out of the $T$ samples, that is, the set of input-label pairs $(x_{\pi'(\tau)}, y_{\pi'(\tau)})_{\tau : \pi'(\tau) < \pi'(t)}$ extracted according to an independent uniform permutation $\pi'$.

Claim H.2. Fix any time $t > 1$ and precision $\epsilon_t > 0$ such that $(t-1)\epsilon_t^2 \ge 8$; then the following inequality holds:
$$\mathbb{P}_\pi\left[\sup_{h \in H}\big|\bar\ell(h) - \hat\ell^\pi_{t-1}(h)\big| \ge \epsilon_t\right] \le 2\,\mathbb{P}_{\pi,\pi'}\left[\sup_{h \in H}\big|\hat\ell^\pi_{t-1}(h) - \hat\ell^{\pi'}_{t-1}(h)\big| \ge \frac{\epsilon_t}{2}\right].$$

Proof. For any realization of $\pi$ and $\pi'$, and any classifier $h \in H$, we have that
$$\mathbb{I}\left\{\big|\bar\ell(h) - \hat\ell^\pi_{t-1}(h)\big| > \epsilon_t\right\}\,\mathbb{I}\left\{\big|\bar\ell(h) - \hat\ell^{\pi'}_{t-1}(h)\big| < \frac{\epsilon_t}{2}\right\} \;\le\; \mathbb{I}\left\{\big|\hat\ell^\pi_{t-1}(h) - \hat\ell^{\pi'}_{t-1}(h)\big| > \frac{\epsilon_t}{2}\right\},$$
since, for any three reals $a, b, c$ and $\epsilon > 0$, if $|a - b| > \epsilon$ and $|a - c| < \epsilon/2$, then $\epsilon < |a - b| \le |a - c| + |b - c|$, so that $|b - c| > \epsilon/2$. We take expectations with respect to the second sample, extracted according to $\pi'$, and get
$$\mathbb{I}\left\{\big|\bar\ell(h) - \hat\ell^\pi_{t-1}(h)\big| > \epsilon_t\right\}\,\mathbb{P}_{\pi'}\left[\big|\bar\ell(h) - \hat\ell^{\pi'}_{t-1}(h)\big| < \frac{\epsilon_t}{2}\right] \;\le\; \mathbb{P}_{\pi'}\left[\big|\hat\ell^\pi_{t-1}(h) - \hat\ell^{\pi'}_{t-1}(h)\big| > \frac{\epsilon_t}{2}\right].$$
Since $\bar\ell(h) = \mathbb{E}_{\pi'}\big[\hat\ell^{\pi'}_{t-1}(h)\big]$ for all $h \in H$ and $t > 1$, we can apply Chebyshev's inequality and obtain
$$\mathbb{P}_{\pi'}\left[\big|\bar\ell(h) - \hat\ell^{\pi'}_{t-1}(h)\big| \ge \frac{\epsilon_t}{2}\right] \le \frac{4\,\mathbb{E}_{\pi'}\big[\big(\hat\ell^{\pi'}_{t-1}(h) - \bar\ell(h)\big)^2\big]}{\epsilon_t^2} \le \frac{4(t-1)}{(t-1)^2\,\epsilon_t^2} = \frac{4}{(t-1)\,\epsilon_t^2}.$$
The second inequality follows because the second moment of each loss is bounded by 1, as losses take values in $[0, 1]$. Therefore,
$$\mathbb{I}\left\{\big|\bar\ell(h) - \hat\ell^\pi_{t-1}(h)\big| > \epsilon_t\right\}\left(1 - \frac{4}{(t-1)\epsilon_t^2}\right) \le \mathbb{P}_{\pi'}\left[\big|\hat\ell^\pi_{t-1}(h) - \hat\ell^{\pi'}_{t-1}(h)\big| > \frac{\epsilon_t}{2}\right] \le \mathbb{P}_{\pi'}\left[\sup_{h \in H}\big|\hat\ell^\pi_{t-1}(h) - \hat\ell^{\pi'}_{t-1}(h)\big| > \frac{\epsilon_t}{2}\right].$$
We can now take the sup over all $h \in H$ on the left-hand side of the above inequality and then take the expectation also with respect to the randomness of $\pi$. This implies the statement, since $(t-1)\epsilon_t^2 \ge 8$ guarantees that $1 - \frac{4}{(t-1)\epsilon_t^2} \ge \frac{1}{2}$.
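Before the proof, here is a quick numerical sanity check of the statement (ours, not part of the paper; a finite grid of thresholds stands in for $H$, so the sup is over that grid): the uniform deviation between the prefix empirical loss and the full-sample loss decays at roughly the $1/\sqrt{t}$ rate predicted by Lemma H.1.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 4096
xs = rng.random(T)
ys = (xs > 0.3).astype(int)
ys ^= (rng.random(T) < 0.1).astype(int)            # 10% label noise: no hypothesis is perfect

grid = np.linspace(0, 1, 201)                      # finite grid of thresholds standing in for H
preds = (xs[:, None] >= grid[None, :]).astype(int) # prediction of each threshold on each point
losses = (preds != ys[:, None]).astype(float)      # 0/1 loss matrix, shape (T, |H|)
full = losses.mean(axis=0)                         # average loss of each hypothesis on the whole set

for t in (64, 256, 1024, 4096):
    sup_dev = np.mean([
        np.abs(losses[rng.permutation(T)[:t]].mean(axis=0) - full).max()
        for _ in range(200)                        # 200 independent random permutations
    ])
    print(f"t = {t:4d}   mean sup_h |prefix - full| ~ {sup_dev:.3f}"
          f"   sqrt(log t / t) = {np.sqrt(np.log(t) / t):.3f}")
```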
Proof of Lemma H.1. Claim H.2 allows us to replace $\bar\ell(h)$ with an empirical average over the ghost sample. As a result, the right-hand side only depends on the projection of the class $H$ on the double sample drawn from $\pi, \pi'$,
$$H\Big((x_{\pi(\tau)}, y_{\pi(\tau)})_{\tau \in [t]},\, (x_{\pi'(\tau)}, y_{\pi'(\tau)})_{\tau \in [t]}\Big),$$
denoted succinctly as $H^{\pi,\pi'}_t$, which gives
$$\mathbb{P}_{\pi,\pi'}\left[\sup_{h \in H}\big|\hat\ell^\pi_{t-1}(h) - \hat\ell^{\pi'}_{t-1}(h)\big| \ge \frac{\epsilon_t}{2}\right] = \mathbb{P}_{\pi,\pi'}\left[\max_{h \in H^{\pi,\pi'}_t}\big|\hat\ell^\pi_{t-1}(h) - \hat\ell^{\pi'}_{t-1}(h)\big| \ge \frac{\epsilon_t}{2}\right].$$
We now apply Hoeffding's inequality (as in Theorem C.1) to $\hat\ell^\pi_{t-1}(h) - \hat\ell^{\pi'}_{t-1}(h)$, for a fixed $h \in H$. To see why this is possible, we construct a process equivalent to extracting a sample and a ghost sample according to $\pi$ and $\pi'$, respectively. Consider the losses presented in an arbitrary (fixed) ordering $\ell_1(h), \ldots, \ell_T(h)$ and the ordering induced by the uniform random permutation $\pi'$, namely $\ell^{\pi'}_1(h), \ldots, \ell^{\pi'}_T(h)$. Let $w_\tau(h) = (\ell_\tau(h), \ell^{\pi'}_\tau(h))$ be the pair of variables matched by index, $f(w_\tau(h)) = \ell_\tau(h) - \ell^{\pi'}_\tau(h)$, and $F(h) = \{f(w_\tau(h)) \mid \tau \in [T]\}$. Note that, for every $\pi'$,
$$\sum_{\tau \in [T]} f(w_\tau(h)) = \sum_{\tau \in [T]} \ell_\tau(h) - \sum_{\tau \in [T]} \ell^{\pi'}_\tau(h) = 0.$$
We extract a uniform random subset $F \subseteq F(h)$ composed of $t - 1$ functions $f(w_\tau(h))$, and get
$$\mathbb{P}\left[\left|\frac{1}{t-1}\sum_{\tau \in F} f(w_\tau(h))\right| \ge \frac{\epsilon_t}{2}\right] \le 2\exp\left(-\frac{(t-1)\,\epsilon_t^2}{4}\right)$$
by Hoeffding's inequality (as in Theorem C.1). Therefore, we have
$$\mathbb{P}_{\pi,\pi'}\left[\big|\hat\ell^\pi_{t-1}(h) - \hat\ell^{\pi'}_{t-1}(h)\big| \ge \frac{\epsilon_t}{2}\right] \le 2\exp\left(-\frac{(t-1)\,\epsilon_t^2}{4}\right). \tag{13}$$
Combining everything, we obtain
$$\mathbb{P}_\pi\left[\sup_{h \in H}\big|\bar\ell(h) - \hat\ell^\pi_{t-1}(h)\big| \ge \epsilon_t\right] \le 2\,\mathbb{P}_{\pi,\pi'}\left[\sup_{h \in H}\big|\hat\ell^\pi_{t-1}(h) - \hat\ell^{\pi'}_{t-1}(h)\big| \ge \frac{\epsilon_t}{2}\right] \le 2\left(\frac{2e(t-1)}{d_{\mathrm{VC}}}\right)^{d_{\mathrm{VC}}} \max_{h \in H^{\pi,\pi'}_t} \mathbb{P}_{\pi,\pi'}\left[\big|\hat\ell^\pi_{t-1}(h) - \hat\ell^{\pi'}_{t-1}(h)\big| \ge \frac{\epsilon_t}{2}\right] \le 4\left(\frac{2e(t-1)}{d_{\mathrm{VC}}}\right)^{d_{\mathrm{VC}}} \exp\left(-\frac{(t-1)\,\epsilon_t^2}{4}\right).$$
The second inequality is an application of the union bound together with the Sauer-Shelah lemma (Vapnik & Chervonenkis, 1971; Sauer, 1972; Shelah, 1972), and the last follows from the expression in (13). The proof concludes by setting
$$\epsilon_t = \sqrt{\frac{8\,d_{\mathrm{VC}}}{t-1}\log\frac{2e(t-1)}{d_{\mathrm{VC}}}} + \sqrt{\frac{8}{t-1}\log\frac{2}{\delta}}.$$

H.2. Regret of an Empirical Risk Minimizer with Random-Order Input

Theorem H.3. Consider the problem of online binary classification in the random-order model. For a hypothesis class $H$ with VC dimension $d_{\mathrm{VC}}$, the algorithm $\mathcal{A}^{\mathrm{RO}}$ that, given the first $t - 1$ out of $T$ samples from a uniform permutation $\pi$, plays the empirical risk minimizer
$$h_t \in \arg\min_{h \in H} \sum_{\tau : \pi(\tau) < \pi(t)} \ell^\pi_\tau(h)$$
achieves a regret of at most $R_T(\mathcal{A}^{\mathrm{RO}}, S) \le 8\sqrt{d_{\mathrm{VC}}\,T\log T}$.

Proof. Define, for any $t \in [T]$, the clean event in the past as $E^{\mathrm{past}}_t = \{\pi \mid |\bar\ell(h) - \hat\ell^\pi_{t-1}(h)| \le \epsilon_t\ \ \forall h \in H\}$ and in the future as $E^{\mathrm{future}}_t = \{\pi \mid |\bar\ell(h) - \hat\ell^\pi_{t:T}(h)| \le \epsilon_{T-t+1}\ \ \forall h \in H\}$. The clean event at time $t$ is $E_t = E^{\mathrm{past}}_t \cap E^{\mathrm{future}}_t$. The instantaneous regret at time $\pi(t)$ under the clean event is
$$\ell^\pi_t(h_t) - \ell^\pi_t(h^\star) \le \ell^\pi_t(h_t) - \hat\ell^\pi_{t-1}(h_t) + \hat\ell^\pi_{t-1}(h^\star) - \ell^\pi_t(h^\star)$$
$$= \big(\ell^\pi_t(h_t) - \bar\ell(h_t)\big) + \big(\bar\ell(h_t) - \hat\ell^\pi_{t-1}(h_t)\big) + \big(\hat\ell^\pi_{t-1}(h^\star) - \bar\ell(h^\star)\big) + \big(\bar\ell(h^\star) - \ell^\pi_t(h^\star)\big)$$
$$\le \big(\ell^\pi_t(h_t) - \bar\ell(h_t)\big) + \big(\bar\ell(h^\star) - \ell^\pi_t(h^\star)\big) + 2\epsilon_t.$$
The first inequality above follows by the definition of $h_t$ as the empirical loss minimizer, which gives $\hat\ell^\pi_{t-1}(h^\star) - \hat\ell^\pi_{t-1}(h_t) \ge 0$; the last inequality uses $E^{\mathrm{past}}_t$. Note that the clean event does not depend on the order in which the elements arrive, but only on the following two multisets: $S_{\mathrm{past}}$ and $S_{\mathrm{future}}$, i.e., the multisets that represent the past (from time $\pi(1)$ until $\pi(t-1)$) and the future (from time $\pi(t)$ until $\pi(T)$). We now take the expectation conditioning on the clean event and on the two multisets $S_{\mathrm{past}}, S_{\mathrm{future}}$. Therefore,
$$\mathbb{E}\big[\ell^\pi_t(h_t) - \ell^\pi_t(h^\star) \mid E_t, S_{\mathrm{past}}, S_{\mathrm{future}}\big] \le 2\epsilon_t + \mathbb{E}\big[\big(\ell^\pi_t(h_t) - \bar\ell(h_t)\big) + \big(\bar\ell(h^\star) - \ell^\pi_t(h^\star)\big) \mid E_t, S_{\mathrm{past}}, S_{\mathrm{future}}\big]$$
$$= 2\epsilon_t + \sum_{(x,y) \in S_{\mathrm{future}}} \frac{1}{T - t + 1}\,\mathbb{E}\big[\big(\ell(h_t(x), y) - \bar\ell(h_t)\big) + \big(\bar\ell(h^\star) - \ell(h^\star(x), y)\big) \mid E_t, S_{\mathrm{past}}, S_{\mathrm{future}}\big]$$
$$\le 2\epsilon_t + 2\,\mathbb{E}\left[\sup_{h \in H}\big|\bar\ell(h) - \hat\ell^\pi_{t:T}(h)\big| \;\Big|\; E_t, S_{\mathrm{past}}, S_{\mathrm{future}}\right] \le 2\big(\epsilon_t + \epsilon_{T-t+1}\big).$$
The equality above holds because, once $S_{\mathrm{past}}$ and $S_{\mathrm{future}}$ are fixed, the loss at time $\pi(t)$ is uniformly sampled from the future. By the tower property of conditional expectation (over all the possible values of $S_{\mathrm{past}}, S_{\mathrm{future}}$), we have proven that the expected instantaneous regret conditioned on the clean event $E_t$ is at most
$$\mathbb{E}\big[\ell^\pi_t(h_t) - \ell^\pi_t(h^\star) \mid E_t\big] \le 2\big(\epsilon_t + \epsilon_{T-t+1}\big) \le 2\sqrt{32\,d_{\mathrm{VC}}\log(2eT)}\left(\frac{1}{\sqrt{t-1}} + \frac{1}{\sqrt{T-t}}\right),$$
where the last inequality holds by taking $\delta = 1/T$ in Lemma H.1. By the definition of regret and under this choice of $\delta$, we have
$$R_T(\mathcal{A}^{\mathrm{RO}}, S) = \sum_{t \in [T]} \Big(\mathbb{E}\big[\ell^\pi_t(h_t) - \ell^\pi_t(h^\star) \mid E_t\big]\,\mathbb{P}[E_t] + \mathbb{E}\big[\ell^\pi_t(h_t) - \ell^\pi_t(h^\star) \mid \bar E_t\big]\,\mathbb{P}[\bar E_t]\Big) \le \sum_{t \in [T]} 2\big(\epsilon_t + \epsilon_{T-t+1}\big) + \sum_{t \in [T]} \mathbb{P}[\bar E_t] \le 8\sqrt{d_{\mathrm{VC}}\,T\log T},$$
which concludes the proof.
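As a complement to Theorem H.3, here is a small end-to-end simulation (ours, not part of the paper; thresholds stand in for $H$, so $d_{\mathrm{VC}} = 1$, and the quoted bound is shown only for scale) of the algorithm $\mathcal{A}^{\mathrm{RO}}$: at each round it plays an empirical risk minimizer on the prefix of the random-order stream, and its realized regret is compared with the best fixed hypothesis on $S$.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 2048
xs = rng.random(T)
ys = ((xs > 0.4) ^ (rng.random(T) < 0.15)).astype(int)     # fixed multiset S, 15% label noise

grid = np.linspace(0, 1, 257)                              # thresholds standing in for H
loss = ((xs[:, None] >= grid[None, :]).astype(int) != ys[:, None]).astype(float)  # (T, |H|) 0/1 losses

perm = rng.permutation(T)                                  # uniform random order of presentation
prefix = np.zeros(grid.size)                               # cumulative loss of every h on the prefix
alg_loss = 0.0
for t in perm:
    h_t = int(np.argmin(prefix))      # empirical risk minimizer on the samples seen so far
    alg_loss += loss[t, h_t]
    prefix += loss[t]

best = loss.sum(axis=0).min()
bound = 8 * np.sqrt(T * np.log(T))    # the bound of Theorem H.3 with d_VC = 1
print(f"ERM loss = {alg_loss:.0f}, best fixed hypothesis = {best:.0f}, "
      f"regret = {alg_loss - best:.0f} (bound ~ {bound:.0f})")
```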