# smoothed_analysis_of_sequential_probability_assignment__cbd35eae.pdf

Smoothed Analysis of Sequential Probability Assignment

Alankrita Bhatt California Institute of Technology abhatt@caltech.edu

Nika Haghtalab University of California, Berkeley nika@berkeley.edu

Abhishek Shetty University of California, Berkeley shetty@berkeley.edu

We initiate the study of smoothed analysis for the sequential probability assignment problem with contexts. We study information-theoretically optimal minmax rates as well as a framework for algorithmic reduction involving the maximum likelihood estimator oracle. Our approach establishes a general-purpose reduction from minimax rates for sequential probability assignment for smoothed adversaries to minimax rates for transductive learning. This leads to optimal (logarithmic) fast rates for parametric classes and classes with finite VC dimension. On the algorithmic front, we develop an algorithm that efficiently taps into the MLE oracle, for general classes of functions. We show that under general conditions this algorithmic approach yields sublinear regret.

1 Introduction

Sequential probability assignment also known as online learning under the logarithmic loss is a fundamental problem with far-reaching impact on information theory, statistics, finance, optimization, and sequential decision making [Rissanen, 1983, 1984, Cover, 1991, Feder et al., 1992, Xie and Barron, 1997, Merhav and Feder, 1998, Xie and Barron, 2000, Yang and Barron, 1999, Jiao et al., 2013, Orabona and P al, 2016, Foster et al., 2018]. In recent years, methods for incorporating contexts or side information into sequential probability assignment have gained much attention [Rakhlin and Sridharan, 2015, Fogel and Feder, 2017, 2018, Foster et al., 2018, Bhatt and Kim, 2021, Bilodeau et al., 2021, Wu et al., 2022a], in part due to their newly forged connection to sequential decision making applications, the contextual bandit problem, and learning in Markov Decision Processes (MDPs) (see e.g. [Foster and Krishnamurthy, 2021] and [Foster et al., 2021a]). In this setting, a forecaster who has access to historical data x1:t 1, y1:t 1 consisting of contexts xτ (e.g., day τ s meteorological information) and the outcomes yτ {0, 1} (e.g., whether τ was a rainy day) wishes to predict yt given a new context xt. The forecaster uses a probability assignment pt to estimate the probability of yt = 1 outcome and incurs the logarithmic loss, i.e., log pt(yt), which rewards the forecasters for having assigned high probability to the realized outcome. The goal of the forecaster is to suffer low regret against a chosen reference class of predictors.

A large body of prior work on sequential probability assignment with contexts has focused on settings where contexts are presented i.i.d. from an unknown distribution (see [Fogel and Feder, 2017, Bhatt and Kim, 2021, Bilodeau et al., 2021, Wu et al., 2022a] and the references within); this problem is also referred to as conditional density estimation. In these cases, sequential probability assignment is known to enjoy small regret for several reference classes such as Vapnik Chervonenkis (VC) classes. On the other hand, attempts to consider context distributions that evolve unpredictably and

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

adversarially have faced strong impossibility results even for simple reference classes of predictors. For example, for the reference class of simple one-dimensional thresholds assigning pt = θ01{xt a} + θ11{xt > a} for a [0, 1], regret is bounded by O(log T) in the i.i.d. case [Fogel and Feder, 2017] but is lower bounded by Ω(T) when the sequence of contexts is chosen adversarially (folklore e.g. Littlestone [1988]). In the face of the increasing need to adapt to evolving contexts in modern applications, these impossibility results indicate that new models of adversarial behavior must be considered for obtaining rigorous guarantees that guide the design of sequential probability assignment in practical applications.

In recent years, smoothed analysis of adaptive adversaries [Haghtalab et al., 2021, Rakhlin et al., 2011] has emerged as a framework for going beyond the worst-case adversaries while making minimal assumptions about the adaptive process that generates a sequence. In this setting, contexts are chosen from an evolving sequence of so-called σ-smooth distributions, whose density is bounded above by 1/σ times that of a base measure (such as the uniform distribution). Remarkably, these methods, established by Haghtalab et al. [2021] for 0-1 loss and extended to regression by Haghtalab et al. [2022], Block et al. [2022], have established performance guarantees for the sequential prediction problem that matches the optimal performance in the i.i.d setting. This raises the question as to whether the sequential probability assignment problem may similarly enjoy improved minmax regret bounds for smoothed adaptive sequences.

Beyond minmax rates, an important feature of probability assignment and, its analogue, density estimation is the availability of fundamental and natural estimation techniques such as maximum likelihood estimation (MLE). For i.i.d. sequences, under general conditions, MLE is known to be optimal asymptotically and often serves as a starting point for designing more sophisticated estimators. Going beyond i.i.d. sequences, we ask whether MLE can be made to achieve good statistical behavior on adaptive sequences. More generally, algorithmic perspective is increasingly important for the sequential probability assignment problem and its applications to contextual bandits and reinforcement learning (where algorithm design is as fundamental a consideration as minmax rates [Agarwal et al., 2014, Simchi-Levi and Xu, 2022, Foster and Rakhlin, 2020, Langford et al., 2007, Foster et al., 2021b]). In this space, oracle-efficient sequential decision making algorithms that repurpose existing offline algorithmic routines have received special interest [Kalai and Vempala, 2005, Dud ık et al., 2020, Wang et al., 2022, Kakade et al., 2007, Simchi-Levi and Xu, 2022]. Here again, recent progress on smoothed analysis for sequential prediction with 0-1 loss and regression loss [Haghtalab et al., 2021, Block et al., 2022] has shown promise in bridging the computational and information-theoretical gaps between what is obtainable in the i.i.d. case and for smoothed adaptive sequences.

In this paper, we initiate the study of smoothed analysis for sequential probability assignment and seek to understand fundamental information-theoretic limits on the minmax regret as well as design natural and oracle-efficient algorithms for this problem. Additionally, we investigate whether in the smoothed analysis setting, maximum likelihood estimation can efficiently address sequential probability assignment while achieving small regret. To the best of our knowledge, our work is the first to consider oracle-efficient algorithms (and particularly the MLE) for the sequential probability assignment problem.

1.1 Main results

Reduction to transductive learning. Our first main result is a reduction from regret bounds against a smoothed adversary to regret bounds against (a generalized version of) a transductive adversary. That is, we show that the minimax regret in the smoothed analysis setting is upper bounded by the minimax regret in the setting where a set of contexts is provided to the learner and the adversary is constrained to picking the contexts from this set. For F, a class of hypotheses mapping contexts to [0, 1], let us define the minmax regret in the transductive case over T times steps when a context set of size M is provided to the learner to be RM T (F). We establish in Theorem 3.1 that for all σ-smooth sequences the minmax regret RT (F, σ) satisfies, for any k > 1,

RT (F, σ) Rk T T (F) + T 2(1 σ)k.

Furthermore, in Theorem 3.4 we upper bound Rk T T (F) by connecting the worst case adversarial regret in this setting to the scale-sensitive VC dimension of F which is a prototypical offline complexity of the class. Our results obtain a logarithmic dependence on 1/σ. In particular, in

Corollary 3.4.1, we show that for VC classes (and parametric classes) the regret is bounded by

RT (F, σ) O d log T σ , where d is the VC dimension of class F.

Efficient Reduction from Sequential Probability Assignment to MLE. Our second contribution is initiating the study of oracle-efficient algorithm design for sequential probability assignment. In particular, for small alphabet size, we design a natural algorithm (Algorithm 1) that efficiently uses an MLE oracle and achieves sublinear regret in the smoothed setting. Our Theorem 4.1 gives a general regret bound in terms of the statistical complexity of the class F and the smoothness parameter σ.

For VC classes, this achieves regret rate of T 4/5 q

d σ. To the best of our knowledge, this is the first oracle-efficient algorithm and analysis of the follow-the-perturbed-leader style algorithms for the logarithmic loss.

Probability assignment for VC classes. For VC classes F, we explicitly construct sequential probability assignments and establish their regret guarantees in the smoothed setting. That is, we construct a probability assignment based on a Bayesian mixture over F that satisfies RT (F, σ)

Cd log T σ where d is the VC dimension of class F. While this approach is not oracle-efficient, it indeed achieves regret bound with optimal dependence on T and σ. This motivates a natural direction for future work as to whether such mixture-based methods can be implemented oracle efficiently or if there is a tradeoff between the regret and the computational complexity of sequential probability assignment.

2 Preliminaries

Let X be a set of contexts and Y = {0, 1}. Then, the problem being studied entails a sequential game where at each timestep t, based on the history of contexts x1:t := (x1, . . . , xt) where xi X and associated bits y1:t 1 {0, 1}t 1, the player must assign a probability q( |x1:t, y1:t 1) to what the upcoming bit yt will be. Once the bit yt is revealed (possibly in an adversarial fashion) the player incurs loss log q(yt|x1:t, y1:t 1) and the game proceeds to the next step. For a hypothesis class F {X [0, 1]}, the associated regret for a probability assignment strategy Q = {q( |x1:t, y1:t 1)}n t=1 for a fixed x1:T , y1:T is

RT (F, x1:T , y1:T , Q) =

t=1 log 1 q(yt|x1:t, y1:t 1) inf f F

t=1 log 1 pf(yt|xt) (1)

where pf(1|xt) = f(xt); i.e. the function f assigns probability Bern(f(xt)) to the upcoming bit given the context xt.

Our statistical results apply to a general loss function ℓand general actions of the learner at. For a set of inputs (xi, yi) T i=1 specified by the adversary and a set of actions {ai}T i=1 of the learner, regret is defined by

RT (F, x1:T , y1:T , a1:T ) =

t=1 ℓ(at, (xt, yt)) inf f F

t=1 ℓ(f(xt), (xt, yt)),

where for log-loss at = q( |x1:t, y1:t 1); the action is a probability mass function (pmf) over {0, 1}.

The regret in (1) is often studied under various adversary models; i.e. various different probabilistic assumptions (or lack thereof) on the model generating xt and yt. In this work, we consider worst-case yt (in contrast to the realizable setting where Yt Bern(f (xt)) for a fixed unknown f F) and Xt Dt where Dts form an adaptive sequence of smooth distributions.

Definition 2.1 (Smooth distribution and adversary [Haghtalab et al., 2020]). Consider a fixed and known base distribution µ on X (such as the uniform distribution if X supports it). A distribution D on X is said to be σ-smooth if for all measurable sets A X, D(A) µ(A)

σ . We denote the set of all σ-smooth distributions by σ (µ). An adversary, characterized by a joint distribution D with Xt Dt (where Dt may possibly depend on the history) is said to be a σ-smooth adaptive adversary if Dt σ (µ) for all t {1, . . . , T}.

The minmax regret for σ-smooth adaptive adversaries is then given by

RT (F, σ) = inf Q sup σ-smoothed D EX1:T D

max y1:T RT (F, X1:T , y1:T , Q) ,

where Q is the set of all probability assignment strategies.

We are particularly interested in how geometric properties of the function class F affect RT (F, σ). There are several notions of covering numbers and combinatorial dimensions that quantify the richness and complexity of a class, but the scale-sensitive VC dimension will be of particular interest to us and is invoked in our results.

Definition 2.2 (Scale-sensitive VC dimension). Let F be a function class. For any α > 0 and points x1, . . . , xm X, we say that F shatters the set x1, . . . , xm at scale α if there exist s1 . . . sm R such that for each ϵ { 1, 1}n there exists a function f F such that ϵi f(xi) si α

2 . The scale sensitive VC dimension at scale α of F, denoted by VC (F, α) is defined as the largest m such that there is a set of m points x1, . . . , xm X such that F shatters the set at scale α. The (traditional) VC dimension of a binary class F is defined as VC (F) = limα 0+ VC (F, α).

Throughout, we use the following result of Haghtalab et al. [2021] about σ-smooth distributions. This results aids us in reduction from smoothed learning to transductive learning. Theorem 2.1 (Coupling Lemma of Haghtalab et al. [2021]). Let Dσ be an adaptive sequence of t σ-smooth distributions on X. There is a coupling Π such that

(X1, Z1,1, . . . , Z1,K, . . . , Xt, Zt,1, . . . , Zt,K) Π

1. X1, . . . , Xt is distributed according Dσ,

2. For every j t, {Zi,k}i j,k [K] are uniformly and independently distributed on X, conditioned on X1, . . . , Xj 1.

3. For any t, with probability at least 1 (1 σ)K, Xt {Zt,k}k=1:K.

3 General reduction to transductive learning

In this section, we will consider the minimax regret for the smoothed online learning game with respect to the loss function1 ℓagainst a general class of functions F. In Section 3.1, we will show that the minimax regret can be reduced to the minimax regret for a version of transductive learning with respect to the same loss function and class of functions. In Section 3.2, we give general upper bounds for the transductive setting. There is a subtle but important difference between reduction that directly involve regret compared to recent efforts (such as Haghtalab et al. [2021], Block et al. [2022], Haghtalab et al. [2022]) using reductions between proxies of regret, such as covering numbers and sequential complexities. This is particularly important for log loss since its complexity is not captured by covering numbers. We discuss this point further in Section 3.2.2.

3.1 Regret-to-regret Reduction

We work with a general loss function ℓand general actions of the learner at. We note that, we can write the minmax value of the smoothed setting in extensive form as

RT (F, σ) = sup D1 σ(µ) EX1 D1 inf a1 sup y1 sup D2 σ(µ) EX2 D2 inf a2 sup y2 . . . (2)

. . . sup DT σ(µ) EXT DT inf a T sup y T R(F, X1:T , y1:T , a1:T ).

In order to bound this, we consider a generalization of the notion of online learning that is referred to as transductive learning. In this setting, at the start of the interaction the adversary chooses a set of contexts X = {Xi}M i=1 for some M T and provides this to the player. The game proceeds as

1The reduction works for general loss functions with the property that the worst-case regret for horizon T is bounded by T, but one can think about ℓbeing the log-loss throughout this section for concreteness.

before with the adversary picking (xt, yt) at time t and the learner picking an action at and suffering a loss ℓ(at, (xt, yt)). However, the adversary is now constrained to pick xt X at all times t. We can then define the minmax regret indexed by X as

RT (F, X) :=

max x1 X inf a1 sup y1 . . . max x T X inf a T sup y T R(F, x1:T , y1:T , a1:T )

Furthermore, define the worst-case transductive learning regret for sets of size M as RM T (F) = max X X,|X|=M RT (F, X). In the following theorem, we show that the regret against σ-smoothed adversaries is bounded by the regret in the transductive learning setting when the set of contexts is drawn from the base distribution µ.

Theorem 3.1. Let F be any class of functions from X to R and let σ (0, 1]. Then, for any T and k, we have

RT (F, σ) Rk T T (F) + T 2(1 σ)k.

Theorem 3.1 shows that we can reduce the problem of evaluating the minimax regret for smoothed adversaries to evaluating the minimax regret for transductive learning. Note that the second term (1 σ)k e kσ and thus, in order to get bounds that are sublinear one needs to consider k = c log T/σ for an appropriate absolute constant c. As we will see in the next section, this leads to logarithmic dependence on σ 1. Moreover, by Theorem 3.1 and since EX µT RT (F, X) RT (F, σ), we can see that the transductive learning regret exactly captures the smoothed regret up to

polylog T σ factors.

3.2 Bounds for Transductive Learning

In this section, we discuss ways to upper bound transductive learning regret RM T (F) so as to achieve bounds on RT (F, σ) via Theorem 3.1.

3.2.1 Using Covering Numbers

One of the approaches common in online learning is to characterize the regret in terms of geometric properties (such as covering numbers) of the function class F. The notion of covering required varies depending on the loss function and the stochastic properties of the data typically completely adversarial problems require stronger notions of sequential coverings [Rakhlin et al., 2015a,b] while for stochastic problems usually weaker offline coverings suffice. In our smoothed case, we show that the offline complexity notion of scale-sensitive VC dimension as defined in Definition 2.2 is adequate. Similar ideas were considered for the case of regression and convex Lipshitz losses in Haghtalab et al. [2022], Block et al. [2022].

Let us first define the notion of approximation according to which a cover will be constructed; we will consider a pointwise approximation. This notion is similar to the notion of global sequential covering in Wu et al. [2022b].

Definition 3.1. Let F be a function class. A set of functions F is said to be a ϵ-covering of F if for any f F there exists g F such that supx X |f(x) g(x)| ϵ. We will use N (F, ϵ) to denote the size of the minimal ϵ-covering of F.

Note that while the metric in Definition 3.1 is quite stringent, using this cover in the transductive learning case requires us to only consider function classes with bounded domain size. We capture this using the following theorem.

Theorem 3.2 (Upper bound on transductive learning). Let F be a function class and ϵ > 0. Then,

Rk T T (F) inf ϵ

sup Z X,|Z|=k T log N F|Z, ϵ + 2ϵT

where F|Z is the projection of hypothesis class F on the set Z.

The proof of this theorem follows from relating transductive learning to the worst sequential prediction on a finite set of points using formalism presented in Wu et al. [2022b]. The proof of this theorem is deferred to the Appendix C.

Next, we recall that the covering number N(F, ϵ) is bounded as a function of the scale sensitive VC dimension of the class F and the number of points in the domain. Theorem 3.3 (Rudelson and Vershynin [2006]). There exist universal constants c, C such that for all α > 0, any function class F defined on a finite set X, and ϵ > 0, we have

log N(F, ϵ) C VC(F, cαϵ) log1+α C|X| VC(F, cϵ)ϵ

Finally, putting together Theorem 3.1, Theorem 3.2 and Theorem 3.3 we get the following. Theorem 3.4 (Minimax smoothed regret and scale-sensitive VC dimension).

RT (F, σ) inf k,α,ϵ>0

C VC(F, cαϵ) log1+α Ck T VC(F, cϵ)ϵ

+ 2ϵT + T 2 (1 σ)k )

We can instantiate the bound in Theorem 3.4 in terms of T and σ for two particularly interesting cases: when VC(F, ϵ) scales as d log 1/ϵ (often referred to as parametric classes) and when VC(F, ϵ) scales as ϵ p (often referred to as nonparametric classes). A canonical example of the former are VC classes; for a class with VC dimension d, VC(FVC, ϵ) = Cd log 1

ϵ (see for example [Vershynin, 2018, Theorem 8.3.18]). A canonical example of the latter are functions of bounded variation, FBV which have VC(FBV, ϵ) = C

ϵ (see for example [Musayeva, 2020, Bartlett et al., 1997]). This class is known to have unbounded sequential covering numbers [Rakhlin et al., 2010] and therefore is not learnable with a worst-case adversary this can be seen as a simple consequence of the fact that FBV contains all one-dimensional thresholds. Corollary 3.4.1 (Rates for parametric and nonparametric classes). If VC(F, ϵ) = d log 1/ϵ , then for a large enough T

RT (F, σ) O

d poly log T

If VC(F, ϵ) = ϵ p, then

RT (F, σ) O

p p+1 poly log T

In particular, note that Corollary 3.4.1 shows that for VC classes RT (FVC, σ) = e O d log T σ

(tight, see also concurrent work [Wu et al., 2023] for a similar bound) and for functions of bounded variation RT (FBV, σ) = e O

σ ; note that the minmax rates for the worst-case adversary

scale as Ω(T) for both these cases. We note that the above bound may be loose for general nonparametric classes, but should be improvable using a multiscale (chaining) version of Theorem 3.4 but we do not focus on this here.

Though the above results give satisfactory bounds in the minimax sense for many classes F of interest, it is useful to consider explicit constructions of probability assignment rules. For the case of finite VC dimension, we give an explicit probability assignment rule by considering a discretization of the class and using a mixture probability assignment rule. In particular, this strategy (denoted by QVC)

yields optimal regret RT (FVC, σ, QVC) O d log T σ . For the formal statements, proofs and

detailed discussion see Appendix D.

3.2.2 Examples without Covering numbers

This reduction approach to characterizing minmax regret in the log-loss is interesting since all previous approaches have used covering numbers of some kind either sequential covering numbers

or stronger notions of global covering. However, in stark contrast to the 0/1 loss and several other loss functions [Rakhlin et al., 2015b], covering numbers cannot capture the minmax regret for the log-loss, at least in the adversarial case. Consider the following class of functions on context set

X = B2 (where B2 denotes the unit ℓ2 Euclidean ball) FLin :=

this class, Rakhlin and Sridharan [2015] construct a follow-the-regularized leader (FTRL) based algorithm achieving regret O(

T). However, Bilodeau et al. [2020] show an upper bound on the regret in terms of sequential covering numbers which is not improvable in general this shows that sequential covering numbers are not adequate to capture the minmax regret rates for the log-loss. Wu et al. [2022b] further consolidate this by considering the following class, closely resembling FLin,

FAbs Lin := x 7 x, w w B2

. [Wu et al., 2022b, Example 2, Theorem 6] establish that the

minmax regret for FAbs Lin is eΘ(T 2/3), demonstrating the surprising fact that by a simple linear transformation of the hypothesis class (which does not change its covering number) one can obtain minmax rates that differ by a polynomial factor! On the other hand, our reduction-based approach bypasses the need for using any covering based arguments and therefore would lead to tight (at least up to poly log(T/σ)) rates.

We remark that exact characterizations of the minmax regret with log-loss (often referred to as the minmax redundancy in the information theory literature) in the no-context (adversarial) case is most often calculated by studying the so-called stochastic complexity of the class F [Rissanen, 1996]. This can be extended to (worst-case) transductive learning with contexts x1, . . . , x T ; in this case the minmax optimal regret for a fixed horizon is achieved by the normalized maximum likelihood (NML) probability assignment [Shtar kov, 1987], and can be expressed as RT T (F) = maxx1:T log P

y1:T {0,1}T maxf F QT t=1 pf(yt|xt) . This expression has been evaluated previously for online logistic regression [Jacquet et al., 2021] and more general hypothesis classes [Wu et al., 2022b]. It is an intriguing question to understand what properties of F the stochastic complexity depends on, given that the above examples illustrates that covering numbers do not capture it. Our reduction provides a technique to use such fine-grained understanding of the regret to directly lift the bounds to the more general smoothed adversary setting.

4 Oracle-Efficient Smoothed Sequential Probability Assignment

In the previous section, we consider a purely statistical perspective on the minimax value of the sequential probability assignment problem for smoothed adversaries. In this section, we will focus on an algorithmic perspective and design an algorithm that is efficiently implemented using calls to an MLE oracle. We will focus on the setting when the base measure is the uniform measure on the input space X and the label space Y = {0, 1}. In this setting, we are given access to an oracle OPT which given a data set S = {xi, yi}m i=1 outputs a hypothesis that minimizes the loss on S. That is,

OPT (S) = argmin h F

i=1 ℓ h, (xi, yi) .

In the context of the logarithmic loss, this corresponds to maximum likelihood estimation. Most of the analysis holds for a general loss function ℓ(with regret scaling appropriate bounds on the values and derivatives), but for clarity one can think of ℓas the log-loss. In particular, Algorithm 1 is written for the log-loss.

The main framework we work in is the follow-the-perturbed-leader (FTPL) framework. Here, our algorithm uses the oracle on a data set consisting of the historical samples and a set of hallucinated samples. The hallucinated samples are intended to stabilize the predictions of the algorithm. This gives us a probability assignment QFTPL = {q FTPL( |x1:t, y1:t 1)}T t=1.

In order to state the regret bound, we need the following notions. For any class F, define the truncated class Fα as Fα = n fα : fα(x) = f(x)+α

1+2α where f F o . We will also need the notion of

Rademacher complexity Rad (F, T) = sup X X,|X|=T Eϵ h supf F 1 T P

x X ϵxf (x) i .

Algorithm 1: Probability assignment QFTPL

Input: time horizon T, smoothness parameter σ, VC dimension d, Samples per step Parameter n, Truncation parameter α

1 for t 1 to T do

2 Generate N (t) Poi(n) fresh hallucinated samples (ex(t) 1 , ey(t) 1 ), , (ex(t) N , ey(t) N ), which are i.i.d. conditioned on N with ex(t) i U(X) and ey(t) i U({0, 1}) Call the oracle to compute ht OPT {(ex(t) i , ey(t) i )}i [N(t)] {xτ, yτ}τ [t 1]

3 Observe xt

4 Assign probability q FTPL(1|x1:t, y1:t 1) = ht(xt)+α

1+2α 5 Receive yt 6 end

Theorem 4.1 (Main Regret Bound). For any hypothesis class F and parameters n, α, we have that the regret of Algorithm 1 for σ-smoothed adversaries is bounded as

RT (F, σ, QFTPL) n log 1

+ T inf m n

( 1 αRad Fα, n/m + n (1 σ)m log 1/α

m + e n/8 )

We will instantiate this bound for case when the class F has bounded VC dimension. For such classes, it is known that the Rademacher complexity is bounded. We state this in the following corollary. Corollary 4.1.1. Let F be a hypothesis class such that the Rademacher complexity is bounded as

Rad (Fα, T) = c T ω, then we have RT (F, σ, QFTPL) T 2 2+ω q

1 σ poly log T d

Note that, in particular, for VC classes we have RT FVC, σ scales as T 4/5. Improving this to achieve the minimax rate discussed in Corollary 3.4.1 is an interesting open question. Remark. A slightly improved regret scaling as T 3/4 can be achieved by assuming access to an oracle that can optimize a mixed objective function involving the log-loss and signed sum of the functions in class. However, these oracles do not have the natural interpretation in terms of maximum likliehood estimation.

4.1 Analysis

The main challenge in designing algorithms in the follow-the-perturbed-leader framework is designing the distribution of the hallucinated samples so as to balance the tradeoff between the stability of the algorithm i.e. how little the algorithm changes its prediction from time step to time step, and the perturbation i.e. how much the addition of the hallucinated samples changes the prediction of the algorithm from the outputting the best hypothesis on the historical samples. This is captured by the following lemma. Lemma 4.2 (Follow the Perturbed Leader bound). Let ℓbe a convex loss function, and let F be a hypothesis class. Let Dt denote the distribution of the adversary at time t and let Qt denote the distribution of the hypothesis ht output by Algorithm 1. Then, we have that the regret of Algorithm 1 (where we use est = (ext, eyt) to denote a hallucinated data point) 2 is bounded by

i=1 E st Dt

E ht Qt[ℓ(ht, st)] E ht+1 Qt+1[ℓ(ht+1, st)]

| {z } Stability

i=1 ℓ(h, st) ℓ(h , st)

| {z } Perturbation

where h = argminh Fα PT i=1 ℓ(h, st).

2With some abuse of notation, we consider Dt to be over X {0, 1}.

We provide a proof in Appendix E for completeness. Given this decomposition of the regret, we need to handle both terms carefully. Just to appreciate the tradeoff, note that as we increase the number of hallucinated examples, the perturbation term generally increases, but the stability term generally decreases. First, let us focus on the stability term which is harder to deal with. The main approach we will use is a generalization of the decomposition of the stability term introduced in Haghtalab et al. [2022] even when the losses are unbounded, as is the case with the log-loss. The main idea is to decompose the stability term in terms of the distance between the distribution of the average prediction at the next time step and the distribution of the current time step, as captured by the χ2 distance and a term that captures how different the predictions of the algorithm are when the sample is resampled from the same distribution. The proof can be found in Appendix F. 3

Lemma 4.3 ( χ2 + Generalization Stability). Let Qt denote learner s distribution over F in at round t, Dt be adversary s distribution at time t (given the history s1, , st 1), st Dt be the realized adversarial instance at time t, and s t be an independent copy s t Dt. Let R(t+1) refers to the randomness used by the algorithm in round t + 1. Then,

E ht Qt[ℓ(ht, st)] E ht+1 Qt+1[ℓ(ht+1, st)]

1 2χ2( E st Dt[Qt+1], Qt) log 1

+ E st,s t Dt;R(t+1)[ℓ(ht+1, s t) ℓ(ht+1, st)].

Given this lemma, we move on to bounding the χ2 divergence between the distribution of the average prediction at the next time step and the distribution of the current time step. This is done using the Ingster method to bound the divergence of mixtures. We include a proof in Appendix G for completeness. Lemma 4.4 (Bound on χ2). χ2 Est Dt[Qt+1], Qt 2 σn.

Next, we move on second term in Lemma 4.3. Note that this term involves the difference between the loss of the hypothesis output at time t + 1 evaluated on two independent points st and s t drawn from Dt. The main idea to bound the term is to use a stronger version of the coupling lemma Theorem 2.1 which allows us to extract subsequences of points sampled according to smooth distributions from iid samples from the base measure. This allows us to relate the required generalization bound to the Rademacher complexity of the class composed with the loss. Using the truncation and the contraction principle, we get the desired bound. The proof can be found in Appendix H. Lemma 4.5 (Generalization). Let ht+1 denote the hypothesis output by the Algorithm 1 at time t + 1. Then, for any m n, we have

E st,s t Dt;R(t+1) ℓ(ht+1, s t) ℓ(ht+1, st) 1

αRAD Fα, n/m + n (1 σ)m log 1/α

The final term that we want to bound is the perturbation term. In order to bound this note that we set our truncation parameter α such that the loss our the prediction made by our algorithm is bounded and consequently the perturbation term is bounded by n log 1

α, see Appendix I. Theorem 4.1 follows by combining the above results.

5 Conclusions and Open Problems

In this paper, we initiated the study of sequential probability assignment with smoothed adversaries. We characterize the minimax regret in terms of the minimax regret for transductive learning and use this to provide tight regret bounds, e.g., for VC classes. Furthermore, we initiate the study of oracle efficiency in this setting and show that sublinear regret can be achieve for general classes.

Our work motivates several directions for future work. An interesting direction is whether the optimal O(log(T)) regret is achievable for some classes, such as VC classes, using oracle-efficient algorithms. More generally, are there computational barriers to obtaining fast rates in prediction with log-loss in this setting?

3For the particular use in our analysis a simpler version of the lemma similar to Haghtalab et al. [2022] suffices but we prove a general version since we believe such a version is useful in providing improved regret bounds for the problem.

Acknowledgments and Disclosure of Funding

This work was supported in part by the National Science Foundation under grant CCF-2145898, a C3.AI Digital Transformation Institute grant, and Berkeley AI Research Commons grants. This work was partially done while authors were visitors at the Simons Institute for the Theory of Computing.

Jorma Rissanen. A universal data compression system. IEEE Transactions on information theory, 29 (5):656 664, 1983.

Jorma Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information theory, 30(4):629 636, 1984.

Thomas M Cover. Universal portfolios. Mathematical finance, 1(1):1 29, 1991.

Meir Feder, Neri Merhav, and Michael Gutman. Universal prediction of individual sequences. IEEE transactions on Information Theory, 38(4):1258 1270, 1992.

Qun Xie and Andrew R Barron. Minimax redundancy for the class of memoryless sources. IEEE Transactions on Information Theory, 43(2):646 657, 1997.

Neri Merhav and Meir Feder. Universal prediction. IEEE Transactions on Information Theory, 44(6): 2124 2147, 1998.

Qun Xie and Andrew R Barron. Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Transactions on Information Theory, 46(2):431 445, 2000.

Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, pages 1564 1599, 1999.

Jiantao Jiao, Haim H Permuter, Lei Zhao, Young-Han Kim, and Tsachy Weissman. Universal estimation of directed information. IEEE Transactions on Information Theory, 59(10):6220 6242, 2013.

Francesco Orabona and D avid P al. Coin betting and parameter-free online learning. Advances in Neural Information Processing Systems, 29, 2016.

Dylan J Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, and Karthik Sridharan. Logistic regression: The importance of being improper. In Conference On Learning Theory, pages 167 208. PMLR, 2018.

Alexander Rakhlin and Karthik Sridharan. Sequential probability assignment with binary alphabets and large classes of experts. ar Xiv preprint ar Xiv:1501.07340, 2015.

Yaniv Fogel and Meir Feder. On the problem of on-line learning with log-loss. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 2995 2999. IEEE, 2017.

Yaniv Fogel and Meir Feder. Universal batch learning with log-loss. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 21 25. IEEE, 2018.

Alankrita Bhatt and Young-Han Kim. Sequential prediction under log-loss with side information. In Algorithmic Learning Theory, pages 340 344. PMLR, 2021.

Blair Bilodeau, Dylan J Foster, and Daniel M Roy. Minimax rates for conditional density estimation via empirical entropy. ar Xiv preprint ar Xiv:2109.10461, 2021.

Changlong Wu, Mohsen Heidari, Ananth Grama, and Wojciech Szpankowski. Expected worst case regret via stochastic sequential covering. ar Xiv preprint ar Xiv:2209.04417, 2022a.

Dylan J Foster and Akshay Krishnamurthy. Efficient first-order contextual bandits: Prediction, allocation, and triangular discrimination. Advances in Neural Information Processing Systems, 34: 18907 18919, 2021.

Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. ar Xiv preprint ar Xiv:2112.13487, 2021a.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine learning, 2:285 318, 1988.

Nika Haghtalab, Tim Roughgarden, and Abhishek Shetty. Smoothed analysis with adaptive adversaries. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 942 953. IEEE, 2021.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. In Advances in neural information processing systems, pages 1764 1772, 2011.

Nika Haghtalab, Yanjun Han, Abhishek Shetty, and Kunhe Yang. Oracle-efficient online learning for beyond worst-case adversaries. Advances in Neural Information Processing Systems, 2022.

Adam Block, Yuval Dagan, Noah Golowich, and Alexander Rakhlin. Smoothed online learning is as easy as statistical learning. In Conference on Learning Theory, pages 1716 1786. PMLR, 2022.

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638 1646. PMLR, 2014.

David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 47(3):1904 1931, 2022.

Dylan Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199 3210. PMLR, 2020.

John Langford, Lihong Li, and Alexander Strehl. Vowpal wabbit open source project. 2007.

Dylan J Foster, Alexander Rakhlin, David Simchi-Levi, and Yunzong Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. 2021b.

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291 307, 2005.

Miroslav Dud ık, Nika Haghtalab, Haipeng Luo, Robert E Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. Journal of the ACM (JACM), 67(5):1 57, 2020.

Guanghui Wang, Zihao Hu, Vidya Muthukumar, and Jacob Abernethy. Adaptive oracle-efficient online learning. Advances in Neural Information Processing Systems, 2022.

Sham M Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 546 555, 2007.

Nika Haghtalab, Tim Roughgarden, and Abhishek Shetty. Smoothed analysis of online and differentially private learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9203 9215. Curran Associates, Inc., 2020.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 161(1-2):111 153, 2015a.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. J. Mach. Learn. Res., 16(1):155 186, 2015b.

Changlong Wu, Mohsen Heidari, Ananth Grama, and Wojciech Szpankowski. Precise regret bounds for log-loss via a truncated bayesian algorithm. Advances in Neural Information Processing Systems, 2022b.

Mark Rudelson and Roman Vershynin. Combinatorics of random processes and sections of convex bodies. Annals of Mathematics, pages 603 648, 2006.

Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018.

Khadija Musayeva. Sample complexity result for multi-category classifiers of bounded variation. ar Xiv preprint ar Xiv:2003.09176, 2020.

Peter L Bartlett, Sanjeev R Kulkarni, and S Eli Posner. Covering numbers for real-valued function classes. IEEE transactions on information theory, 43(5):1721 1724, 1997.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. Advances in Neural Information Processing Systems, 23, 2010.

Changlong Wu, Ananth Grama, and Wojciech Szpankowski. Online learning in dynamically changing environments. ar Xiv preprint ar Xiv:2302.00103, 2023.

Blair Bilodeau, Dylan Foster, and Daniel Roy. Tight bounds on minimax regret under logarithmic loss via self-concordance. In International Conference on Machine Learning, pages 919 929. PMLR, 2020.

Jorma J Rissanen. Fisher information and stochastic complexity. IEEE transactions on information theory, 42(1):40 47, 1996.

Yurii Mikhailovich Shtar kov. Universal sequential coding of single messages. Problemy Peredachi Informatsii, 23(3):3 17, 1987.

Philippe Jacquet, Gil Shamir, and Wojciech Szpankowski. Precise minimax regret for logistic regression with categorical feature values. In Algorithmic Learning Theory, pages 755 771. PMLR, 2021.

Raphail Krichevsky and Victor Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199 207, 1981.

Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In Conference on Learning Theory, pages 1066 1100. PMLR, 2015.

Daniel A Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385 463, 2004.

Adam Block and Max Simchowitz. Efficient and near-optimal smoothed online learning for generalized linear functions. Advances in neural information processing systems, 2022.

Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for universal portfolios. Journal of Machine Learning Research, pages 423 440, 2002.

Haipeng Luo, Chen-Yu Wei, and Kai Zheng. Efficient online portfolio with logarithmic regret. Advances in neural information processing systems, 31, 2018.

Julian Zimmert, Naman Agarwal, and Satyen Kale. Pushing the efficiency-regret pareto frontier for online learning of portfolios and quantum states. In Conference on Learning Theory, pages 182 226. PMLR, 2022.

R emi J ez equel, Dmitrii M Ostrovskii, and Pierre Gaillard. Efficient and near-optimal online portfolio selection. ar Xiv preprint ar Xiv:2209.13932, 2022.

Tim Van Erven, Dirk Van der Hoeven, Wojciech Kotlowski, and Wouter M Koolen. Open problem: Fast and optimal online portfolio selection. In Conference on Learning Theory, pages 3864 3869. PMLR, 2020.

Thomas M Cover and Erik Ordentlich. Universal portfolios with side information. IEEE Transactions on Information Theory, 42(2):348 363, 1996.

Jason E Cross and Andrew R Barron. Efficient universal portfolios for past-dependent target classes. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics, 13(2):245 276, 2003.

L aszl o Gy orfi, G abor Lugosi, and Frederic Udina. Nonparametric kernel-based sequential investment strategies. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics, 16(2):337 357, 2006.

Alankrita Bhatt, J Jon Ryu, and Young-Han Kim. On universal portfolios with continuous side information. In International Conference on Artificial Intelligence and Statistics. PMLR, 2023.

Nicolo Cesa-Bianchi and G abor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.

A Related work

Sequential probability assignment is a classic topic in information theory with extensive literature, see the survey by Merhav and Feder [1998] and the references within. In particular, the idea of probability assignments that are Bayesian mixtures over the reference class of distributions [Krichevsky and Trofimov, 1981] is of central importance such mixture probability assignments arise as the optimal solution to several operational information theoretic and statistical problems [Kamath et al., 2015]. It is also known that the Bayesian mixture approach often outperforms the plug-in approach of estimating a predictor from the reference class and then playing it [Merhav and Feder, 1998]. A similar Bayesian mixture probability assignment in the contextual probability assignment problem was used by Bhatt and Kim [2021], where the covering over the VC function class was obtained in a data-dependent manner. This idea of using a mixture over an empirical covering along with a so-called add-β probability assignment was then used by Bilodeau et al. [2021]. Combining this with the key idea of discretizing the class of functions as per the Hellinger divergence induced metric, they obtained matching rates for several interesting classes in the realizable case (i.e. yt xt generated according to a fixed unknown distribution in the reference class); see also Yang and Barron [1999] for more intuition behind usage of Hellinger coverings for stochastic data. Recent work of Wu et al. [2022a,b] has also employed an empirical covering with an add-β probability assignment for both stochastic and adversarial adversaries.

A complementary approach, more common in the online learning literature is to study fundamental limits of sequential decision making problems non-constructively (i.e. providing bounds on the minmax regret without providing a probability assignment that achieves said regret). This sequential complexities based approach of Rakhlin et al. [2015b,a] has been employed for the log-loss by Rakhlin and Sridharan [2015] and Bilodeau et al. [2020]; however the latter suggests that sequential complexities might not fully capture the log-loss problem.

Smoothed analysis, initiated by Spielman and Teng [2004] for the study of efficiency of algorithms such as the simplex method, has recently shown to be effective in circumventing both statistical and computatonal lower bounds in online learning for classification and regression Haghtalab et al. [2021], Rakhlin et al. [2011], Block et al. [2022], Haghtalab et al. [2020], Block and Simchowitz [2022]. This line of work establishes that smoothed analysis is a viable line of attack to construct statistically and computationally efficient algorithms for sequential decision making problems and has developed tools such as the coupling lemma to understand smoothed sequential decision-making. But handling the unbounded and non-Lipschitz log-loss and MLE oracle is the main conceptual novelty introduced in this work.

Due to the fundamental nature of the problem, the notion of computational efficiency for sequential probability assignment and the closely related problem of portfolio selection has been considered in the literature. Kalai and Vempala [2002] presents an efficient implementation of Cover s universal portfolio algorithm using techniques from Markov chain Monte Carlo. Recently, there has been a flurry of interest in using follow the regularized leader (FTRL) type techniques to achieve low regret and low complexity simultaneously [Luo et al., 2018, Zimmert et al., 2022, J ez equel et al., 2022], see also Van Erven et al. [2020] and the references within. However, none of these methods consider the contextual version of the problem and are considerably different from the oracle-efficient approach. On the other hand, work studying portfolio selection with contexts [Cover and Ordentlich, 1996, Cross and Barron, 2003, Gy orfi et al., 2006, Bhatt et al., 2023] does not take oracle-efficiency into account.

Concurrent Work: Wu et al. [2023] also study the problem of sequential probability assignment (and general mixable losses) and for VC classes achieve the optimal regret of O(d log(T/σ)). In addition to the smooth adversaries, they also studied general models capturing the setting where the base measures are not known. They work primarily in the information theoretical setting and do not present any results regarding efficient algorithms. A major contribution of our work is the study of oracle-efficient online learning for the log-loss, which has not been considered by Wu et al. [2023] Indeed, to the best of knowledge, oracle efficiency has not been considered at all for losses such as the log loss even for worst-case settings. Considering oracle-efficiency for log-loss is an important and natural line of research as the ERM oracle corresponds to the maximum likelihood estimator (MLE) that is commonly used in practice. Our paper presents sublinear regret oracle efficient algorithms for the log loss (which does not have an analog in Wu et al. [2023]).

B Deferred Proof from Section 3

In order to obtain an upper bound on RT (F, σ) in terms of Rk T T (F) for some k, we will consider (2) and proceed inductively. The main idea is to note that since Di is σ-smoothed, conditioned on the history thus far, we can invoke the coupling lemma given in Theorem 2.1.

For the sake of illustration, first consider the simple case of T = 1. Let X1, Z1 . . . Zk denote the coupling alluded to in Theorem 2.1. Recall that X1 D1 and Z1:k µk. Defining the event E1 := {X1 Z1:k}, we have R1(F, D) = EX1 D1 inf a1 sup y1 R1(F, X1, y1, a1)

inf a1 sup y1 R1(F, X1, y1, a1)

1{E1} inf a1 sup y1 R1(F, X1, y1, a1)

1{EC 1 } inf a1 sup y1 R1(F, X1, y1, a1)

1{E1} inf a1 sup y1 R1(F, X1, y1, a1)

+ P(Ec 1) (3)

max X1 Z1:k inf a1 sup y1 R1(F, X1, y1, a1)

+ (1 σ)k (4)

= Rk T T (F) + (1 σ)k, (5)

where (3) uses that infa1 supy1 R1(F, X1, y1, a1) 1 4, (4) follows by the coupling lemma and (5) follows from the definition of transductive learning regret. The next step is to generalize this to arbitrary T. The key aspect that makes this possible is that for all t T, we have Dt σ (µ), even conditioned on the past, allowing us to apply the coupling lemma. Furthermore, we need that RT T for arbitrary sequences which is indeed guaranteed for reasonable losses such as the log-loss as noted above.

We now move to general case. We will prove this inductively. Assume that we have used the coupling lemma till time t 1 and replaced the samples from the smooth distributions with samples from the uniform. That is assume the induction hypothesis, for time t as RT E {Z1:k} µ max X1 Zk 1 inf a1 sup y1 . . . sup Dt E Xt Dt inf at sup yt . . .

. . . sup DT E XT DT inf a T sup y T R(F, X1:T , y1:T , a1:T ) + T(t 1)(1 σ)k

Using the coupling lemma, we have that there exists a coupling Πt such that Xt, Zt,1 . . . Zt,k Πt and an event Et = n Xt Zt,1 . . . Zt,k o that occurs with probability 1 (1 σ)k. Using

Zt := Zt,1 . . . Zt,k we have E Z1 µk max X1 Z1 inf a1 sup y1 . . . sup Dt E Xt Dt inf at sup yt . . . sup DT E XT DT inf a T sup y T R(F, X1:T , y1:T , a1:T )

E Z1 µk max X1 Z1 inf a1 sup y1 . . . sup Dt E Xt,Zt Πt inf at sup yt . . . sup DT E XT DT inf a T sup y T R(F, X1:T , y1:T , a1:T )

E Z1 µk max X1 Z1 inf a1 sup y1 . . . sup Dt E Xt,Zt Πt

inf at sup yt . . . sup DT E XT DT inf a T sup y T R(F, X1:T , y1:T , a1:T )

+ E Z1 µk max X1 Z1 inf a1 sup y1 . . . sup Dt E Xt,Zt Πt

inf at sup yt . . . sup DT E XT DT inf a T sup y T R(F, X1:T , y1:T , a1:T )

4This holds for the log-loss by using the trivial strategy of using a uniform probability assignment at each step.

E Z1 µk max X1 Z1 inf a1 sup y1 . . . sup Dt E Xt,Zt Πt

inf at sup yt . . . sup DT E XT DT inf a T sup y T R(F, X1:T , y1:T , a1:T )

E Z1 µk max X1 Z1 inf a1 sup y1 . . . sup Dt E Zt Πt max Xt Zt

inf at sup yt . . . sup DT E XT DT inf a T sup y T R(F, X1:T , y1:T , a1:T )

= E Z1 µk max X1 Z1 inf a1 sup y1 . . . E Zt µk max Xt Zt

inf at sup yt . . . sup DT E XT DT inf a T sup y T R(F, X1:T , y1:T , a1:T )

Combining with the induction hypothesis, gives us the induction hypothesis for the next t as required. The desired result follows by upper bounding the average with the supremum over all subsets of size k T.

C Proof of Theorem 3.2

First, recall the notion of the global sequential covering for a class Wu et al. [2022b]. Definition C.1 (Global Sequential Covering Wu et al. [2022b]). For any class F, we say that F α X [0, 1] is a global sequential α-covering of F at scale T if for any sequence x1:T and h F, there is a h F such that for all i, h(xi) h (x1:i) α.

Theorem C.1 (Wu et al. [2022b]). If F α is a global sequential α-covering of F at scale T, then

RT (F) inf α>0

n 2αT + log F α o .

To finish the proof note that a ϵ-cover in the sense of Definition 3.1 gives a global sequential cover in the sense of Definition C.1.

D VC Classes

In this section, we construct a probability assignment for the case when F {X [0, 1]} is a VC class. To motivate this probability assignment, consider the no-context case, which is a classic problem in information theory, where the (asymptotically) optimal probability assignment is known to be the Krichevsky and Trofimov [1981] (KT) probability assignment which is a Bayesian mixture of the form

q KT(y1:T ) = Z 1

0 pθ(y1:T )w(θ)dθ

for a particular prior w(θ). This can be written sequentially as q KT(1|y1:t 1) =

Pt 1 i=1 yi+1/2

t 1+1 leading to it sometimes being called the add-1/2 probability assignment; by choosing w(θ) to be Beta(β, β) prior one can achieve a corresponding add-β probability assignment. We extend the mixture idea to the contextual case. In particular, for functions f1, . . . , fm F, one can choose a mixture probability assignment as 5

i=1 q(yi|x1:i, y1:i 1) =: q(y1:t x1:t) = 1

pfj(yi|xi) + β

5Note that once a mixture q(y1:t x1:t) has been defined for arbitrary x1:t, y1:t , the probability assignment at time t (or equivalently, the predicted probability with which the upcoming bit is 1) can be defined as q(1|x1:t, y1:t 1) = q(y1:t 11 x1:t) q(y1:t 1 x1:t 1); in particular, this prediction depends only on the observed history x1:t, y1:t 1 and not the future yt.

This is the approach employed presently with a carefully chosen f1, . . . , fm. We remark that for VC classes this mixture approach may be extended to any mixable [Cesa-Bianchi and Lugosi, 2006, Chapter 3] loss.

First consider VC classes more carefully: i.e. each f FVC is characterized by three things: a set A X, where A A 2X with the VC dimension of the collection A being d < ; as well as two numbers θ0, θ1 [0, 1]. Then, we have

f A,θ0,θ1(x) = θ01{x A} + θ11{x AC}. The following equivalent representation of this hypothesis class is more convenient to use. We consider each f to be characterized by a tuple f = (g, θ0, θ1) where

1. θ0, θ1 [0, 1] 2. g G {X {0, 1}}.

In other words, g belongs to a class G of binary functions this is simply the class of functions {x 7 1{x / A} A A} in the original notation; so that clearly VCdim(G) = d. Then, we have pf( |x) = pg,θ0,θ1( |x) = Bernoulli(θ0) if g(x) = 0; and pg,θ0,θ1( |x) = Bernoulli(θ1) otherwise.

Recalling the definition of regret against a particular f = (g, θ0, θ1) for a sequential probability assignment strategy Q = {q( |x1:t, y1:t 1)}T t=1

RT (f, x1:T , y1:T , Q) =

t=1 log 1 q(yt|x1:t, y1:t 1)

t=1 log 1 pf(yt|xt) (6)

= log pf(y1:T |x1:T )

q(y1:T x1:T )

where q(y1:T x1:T ) := QT t=1 q(yt|x1:t, y1:t 1).

In the smoothed analysis case, we have Xt Dt where Dt for all t is σ-smoothed. Recall that in this case, we are concerned with the regret

RT (F, σ, Q) = max D:σ-smoothed EX1:T

max y1:T sup f F

pf(y1:T |X1:T )

q(y1:T X1:T )

= max D:σ-smoothed EX1:T

max y1:T sup g G max θ0,θ1 pg,θ0,θ1(y1:T |X1:T )

q(y1:T X1:T )

D.1 Proposed probability assignment

Let µ be the dominating measure for the σ-smoothed distribution of X1:T . Let g1, . . . , gmϵ G be an ϵ-cover of the function class G under the metric δµ(g1, g2) = Pr X µ(g1(X) = g2(X)). The following lemma bounds mϵ. Lemma D.1 (Covering number of VC classes under the metric δ, Vershynin [2018]).

for an absolute constant c.

Following the idea of using a mixture probability assignment, we take a uniform mixture over g1, . . . , gmϵ and θ0, θ1 so that

q(y1:t x1:t) = 1

0 pgi,θ0,θ1(y1:t|x1:t)dθ0dθ1

and consequently the sequential probability assignment (or equivalently, the probability assigned to 1) is

q(1|x1:t, y1:t 1) = q(y1:t 11 x1:t)

q(y1:t 1 x1:t 1).

One can observe that q(0|x1:t, y1:t 1), q(1|x1:t, y1:t 1) > 0 and q(0|x1:t, y1:t 1) + q(1|x1:t, y1:t 1) = 1 so that q is a legitimate probability assignment. Let the strategy induced by this uniform mixture be called QVC.

D.2 Analysis of QVC for smoothed adversaries

We note from (6) that for the QVC as defined in the last section, we have

RT ((g , θ 0, θ 1), x1:T , y1:T , QVC) = log mϵ + log pg ,θ 0,θ 1(y1:T |x1:T ) Pmϵ i=1 R 1 0 R 1 0 pgi,θ0,θ1(y1:T |x1:T )dθ0dθ1

log mϵ + log pg ,θ 0,θ 1(y1:T |x1:T ) R 1 0 R 1 0 pgi ,θ0,θ1(y1:T |x1:T )dθ0dθ1 (7)

where gi {g1, . . . , gmϵ} is the function i [m] that minimizes the Hamming distance between the binary strings (gi (x1), . . . , gi (x T )) and (g (x1), . . . , g (x T )).

We now take a closer look at the second term of (7). Firstly, note that for any (g, θ0, θ1) we have pg,θ0,θ1(y1:T |x1:T ) =

t=1 pg,θ0,θ1(yt|xt) =

t=1 θyt g(xt)(1 θg(xt))1 yt = Y

t:g(xt)=0 θyt 0 (1 θ0)1 yt Y

t:g(xt)=1 θyt 1 (1 θ1)1 yt

= θk0(g;x1:T ,y1:T ) 0 (1 θ0)n0(g;x1:T ) k0(g;x1:T ,y1:T )

θk1(g;x1:T ,y1:T ) 1 (1 θ1)n1(g;x1:T ) k1(g;x1:T ,y1:T ),

where for j {0, 1}

kj(g; x1:T , y1:T ) = |{t : yt = 1, g(xt) = j}| n0(g; x1:T ) = |{t : g(xt) = j}|.

Next, we note that for any g G Z 1

0 pg,θ0,θ1(y1:T |x1:T )dθ0dθ1

0 θk0(g;x1:T ,y1:T ) 0 (1 θ0)n0(g;x1:T ) k0(g;x1:T ,y1:T )dθ0

0 θk1(g;x1:T ,y1:T ) 1 (1 θ1)n1(g;x1:T ) k0(g;x1:T ,y1:T )dθ1

= 1 n0(g;x1:T ) k0(g;x1:T ,y1:T ) (n0(g; x1:T ) + 1)

1 n1(g;x1:T ) k1(g;x1:T ,y1:T ) (n1(g; x1:T ) + 1) (8)

n2 n0(g;x1:T ) k0(g;x1:T ,y1:T ) n1(g;x1:T ) k1(g;x1:T ,y1:T )

where (8) follows from properties of the Laplace probability assignment (or that of the Beta/Gamma functions), captured by Lemma D.2. Lemma D.2. For k n N, Z 1

0 tk(1 t)n kdt = Γ(k + 1)Γ(n k + 1)

Γ(n + 2) = 1 (n + 1) n k

where Γ( ) represents the Gamma function.

Putting this back into (7) (and rearranging), we have

RT ((g , θ 0, θ 1), x1:T , y1:T , QVC) log mϵ 2 log n

j {0,1} log

nj(gi ; x1:T ) kj(gi ; x1:T , y1:T )

(θ j )kj(g ;x1:T ,y1:T )(1 θ j )nj(g ;x1:T ) kj(g ;x1:T ,y1:T ) !

j {0,1} log

nj(gi ; x1:T ) kj(gi ; x1:T , y1:T )

nj(g ; x1:T ) kj(g ; x1:T , y1:T )

nj(g ; x1:T ) kj(g ; x1:T , y1:T )

(θ j )kj(g ;x1:T ,y1:T )(1 θ j )nj(g ;x1:T ) kj(g ;x1:T ,y1:T ) !

j {0,1} log

nj(gi ; x1:T ) kj(gi ; x1:T , y1:T )

nj(g ; x1:T ) kj(g ; x1:T , y1:T )

where (9) follows since for any natural numbers k n and θ [0, 1] we have n k θk(1 θ)n k 1. Now, note that

n k = log n!

n ! + log k !

k! + log (n k )!

log (n + |n n |)!

n ! + log (k + |k k |)!

k! + log ((n k) + |n n | + |k k |)!

If |k k |, |n n | δ, and max{n, n } N then by for example [Bhatt and Kim, 2021, Proposition 6] we have that

n k 2δ log(n + 2δ) + 2δ log(k + 2δ) + 4δ log((n k) + 4δ)

16δ log N. (10)

We now wish to use this bound in (9). For this, we will recall the definitions of n0(g; x1:T ) and k0(g; x1:T , y1:T ) for a particular function g and observe that for two functions g, g we have that both |n0(g; x1:T ) n0(g ; x1:T )|, |k0(g; x1:T , y1:T ) k0(g ; x1:T , y1:T )| d H(g(x1:T ), g (x1:T )) where d H( , ) denotes the Hamming distance and g(x1:T ) := (g(x1), . . . , g(x T )) {0, 1}T . Thus, by using (10) in (9) with δ = d H(g (x1:T ), gi (x1:T )), N = T, we get

RT ((g , θ 0, θ 1), x1:T , y1:T , QVC) log mϵ + 2 log T + 32d H(g (x1:T ), gi (x1:T )) log T. (11)

Note that (11) has effectively removed any dependence on y, θ 0, θ 1. We then have for some absolute constant C, (recalling the definition of i and F from earlier)

RT (F, σ, QVC) C log mϵ + C log T max D:σ-smoothed E

sup g G min i [mϵ] d H(g (X1:T ), gi(X1:T ))

Finally, we can control the last term in (12) by the following result, which follows from the coupling lemma and variance sensitive upper bounds on suprema over VC classes. Lemma D.3 (Lemma 3.3 of Haghtalab et al. [2021]).

sup g G min i [m] d H(g (X1:T ), gi(X1:T ))

ϵ σ T log Td log 1

+ T log T ϵ

Plugging the above into (12) and taking ϵ = σ T 2 gives us

RT (F, σ, QVC) O

E Proof of Lemma 4.2

Proof. Note that this proof holds for general loss functions. Let RT denote the regret.

i=1 ℓ(ht, st) inf h F

i=1 ℓ(h, st)

i=1 ℓ(ht, st)

t=1 ℓ(ht+1, st) +

t=1 ℓ(ht+1, st) inf h F

i=1 ℓ(h, st)

i=1 ℓ(ht, st)

t=1 ℓ(ht+1, st)

t=1 ℓ(ht+1, st) inf h F

i=1 ℓ(h, st)

Let us focus on the second term.

t=1 ℓ(ht+1, st) inf h F

i=1 ℓ(h, st)

t=1 ℓ(ht+1, st) inf h Fα

i=1 ℓ(h, st) + inf h Fα

i=1 ℓ(h, st) inf h F

i=1 ℓ(h, st)

t=1 ℓ(ht+1, st) inf h Fα

i=1 ℓ(h, st)

t=1 ℓ(ht, st) ℓ(h , st)

t=1 ℓ(h, st) ℓ(h , st)

where h = infh Fα PT i=1 ℓ(h, st) . (13) follows by comparing the optimal of the truncated class with the whole class, see [Cesa-Bianchi and Lugosi, 2006, Lemma 9.5]. (14) follows from the Be-the-leader lemma Cesa-Bianchi and Lugosi [2006].

F Proof of Lemma 4.3

Denote by R(t) = (N (t), {esi}i N(t)) the fresh randomness generated at the beginning of time t, which is independent of {sτ}τ<t generated by the adversary. Let Qt be the distribution of the learner s action ht H in Algorithm 1, Formally,

i=1 ey (t+1) i 1(ex (t+1) i = x) +

τ=1 yτ 1(xτ = x).

Let Pt be the distribution of rt. The reason why we introduce this notion is that ht in Algorithm 1 only depends on the vector rt 1.

The main step in the proof is to introduce an independent sample from the distribution Dt in order to decouple the dependence of the distribution Qt+1 on the test point st.

E st Dt E ht Qt[ℓ(ht, st)] E st Dt E ht+1 Qt+1[ℓ(ht+1, st)]

= E st Dt E ht Qt[ℓ(ht, st)] E st,s t Dt E ht+1 Qt+1[ℓ(ht+1, s t)]

+ E st,s t Dt E ht+1 Qt+1[ℓ(ht+1, s t)] E st Dt E ht+1 Qt+1[ℓ(ht+1, st)] (15)

= E s t Dt E ht Qt[ℓ(ht, s t)] E s t Dt E ht+1 Est Dt[Qt+1][ℓ(ht+1, s t)] (16)

+ E st,s t Dt E ht+1 Qt+1[ℓ(ht+1, s t)] E st Dt E ht+1 Qt+1[ℓ(ht+1, st)]

where we get (15) by adding and subtracting the middle term corresponding to evaluating the loss on an independent sample s t and (16) by observing that st and s t are equally distributed. Since the second term is the same in the required equation, we can focus on the first term.

E s t Dt[ℓ(ht, s t)]

E ht+1 Qt+1

E s t Dt[ℓ(ht+1, s t)]

Here we use the notation ] Qt+1 = Est Dt[Qt+1] for the mixture distribution. In order to bound this, we look a variational interpretation of the χ2 distance between two distributions P and Q. Lemma F.1 (Hammersley Chapman Robbins bound). For any pair of measures P and Q and any measurable function h : X R, we have EX P [h(X)] EX Q[h(X)] q

χ2 (P, Q) Var X Q h (X)

1 2χ2 (P, Q) EX,X Q h (X) h (X ) 2.

Applying this to (17), we get

E s t Dt[ℓ(ht, s t)]

E ht+1 Qt+1

E s t Dt[ℓ(ht+1, s t)]

1 2χ2( E st Dt[Qt+1], Qt) Eht,h t Qt Est Dt ℓ(ht, st) ℓ(h t, st) 2 .

as required. As noted before, for the particular use in our analysis a simpler version of the lemma similar to Haghtalab et al. [2022] suffices but we include the general version since we believe such a version is useful in providing improved regret bounds for the problem.

G Upper Bounding χ2 Distance: Proof of Lemma 4.4

In this section, we will focus on bounding the χ2 distance between the distribution of actions at time steps. The reasoning in this section closely follows Haghtalab et al. [2022]. We reproduce it here for completeness.

We assume that X is discrete. Define

i=1 1(exi = x, eyi = 0) and n1(x) =

i=1 1(exi = x, eyi = 1).

As each exi is uniformly distributed on X and eyi U({0, 1}), by the subsampling property of the Poisson distribution, the 2|X| random variables {n0(x), n1(x)}x X are i.i.d. distributed as Poi(n/2|X|).

Since the historic data is only a translation, it suffices to consider the distributions at time t = 0 and t = 1. Let n1 0(x) = n0(x) + 1(x1 = x, y1 = 0) with n1 1 definied similarly. Let P and Q be the probability distributions of {n0(x), n1(x)}x X and {n1 0(x), n1 1(x)}x X , respectively. Note that the output of the oracle depends only on this vector and thus by the data processing inequality it suffices to bound χ2(P, Q).

Note that the distribution P is a product Poisson distribution:

P({n0(x), n1(x)}) = Y

y {0,1} P(Poi(n/2|X|) = ny(x)).

As for the distribution Q, it could be obtained from P in the following way: the smooth adversary draws x D, independent of {n0(x), n1(x)}x X P, for some σ-smooth distribution D σ(X). He then chooses a label y = y(x ) {0, 1} as a function of x , and sets

n1 y(x )(x ) = ny(x )(x ) + 1, and n1 y(x) = ny(x), (x, y) = (x , y(x )).

Consequently, given a σ-smooth distribution D and a labeling function y : X {0, 1} used by the adversary, the distribution Q is a mixture distribution Q = Ex DX [Qx ], with

Qx ({n1 0(x), n1 1(x)}) = P(Poi(n/2|X|) = ny(x )(x ) 1) Y

(x,y) =(x ,y(x )) P(Poi(n/2|X|) = ny(x)).

We will use the Ingster method to control the χ2 between the mixture distribution Q and the base distribution P.

Lemma G.1 (Ingster s χ2 method). For a mixture distribution Eθ π[Qθ] and a generic distribution P, the following identity holds:

χ2 E θ π[Qθ], P = E θ,θ π

Qθ(x)Qθ (x)

where θ is an independent copy of θ.

Let x 1, x 2 be an arbitrary pair of instance. Using the closed-form expressions of distributions P and Qx , it holds that

Qx 1Qx 2 P 2 = 2|X|ny(x 1)(x 1) n 2|X|ny(x 2)(x 2) n .

Using the fact that {n0(x), n1(x)}x X are i.i.d. distributed as Poi(n/2|X|) under P, we have

E {n0(x),n1(x)} P

Qx 1({n1 0(x), n1 1(x)})Qx 2({n1 0(x), n1 1(x)}) P({n0(x), n1(x)})2

n 1(x 1 = x 2).

We will use the fact that the probability of collision between two independent draws x 1, x 2 D is small. That is using the Lemma G.1, we have r

χ2(Ex D[Qx ], P)

n E x 1,x 2 D[1(x 1 = x 2)]

x X D(x) 1 σ|X| = 1 σn,

where the last inequality follows from the definition of a σ-smooth distribution.

H Upper Bounding Generalization Error: Proof of Lemma 4.5

The proof of the theorem is similar to the [Haghtalab et al., 2022, Section 4.2.2]. In our setting, we need to deal with general losses. We shall need the following property of smooth distributions which is a slightly strengthened version of the coupling lemma in Theorem 2.1 shown in Haghtalab et al. [2022]. Lemma H.1. Let X1, , Xm Q and P be another distribution with a bounded likelihood ratio: d P/d Q 1/σ. Then using external randomness R, there exists an index I = I(X1, , Xm, R) [m] and a success event E = E(X1, , Xm, R) such that Pr[Ec] (1 σ)m, and

(XI | E, X\I) P.

Fix any realization of the Poissonized sample size N Poi(n). Choose m in Lemma H.1. Since for any σ-smooth Dt, it holds that

Dt(s) U(X {0, 1})(s) = Dt(x) U(X)(x) Dt(y | x) U({0, 1})(y) 2

the premise of Lemma H.1 holds with parameter σ/2 for P = Dt, Q = U(X {0, 1}). Consequently, dividing the self-generated samples es1, , es N into N/m groups each of size m, and running the procedure in Lemma H.1, we arrive at N/m independent events E1, , EN/m, each with probability at least 1 (1 σ/2)m 1 T 2. Moreover, conditioned on each Ej, we can pick an element uj {es(j 1)m+1, , esjm} such that

(uj | Ej, {es(j 1)m+1, , esjm}\{uj}) Dt.

For notational simplicity we denote the set of unpicked samples {es(j 1)m+1, , esjm}\{uj} by vj. As a result, thanks to the mutual independence of different groups and st Dt conditioned on s1:t 1 (note that we draw fresh randomness at every round), for E j [N/m]Ej we have

(u1, , u N/m, st) | (E, s1:t 1, v1, , v N/m) iid Dt.

Let us denote ht+1 = Ot+1(es1, , es N, s1:t 1, st) the output of the algorithm at time t when es1, , es N denotes the hallucinated data points and s1:t 1, st denotes the observed data points. We will use the fact that Ot+1 is a permutation invariant function. Consequently, for each j [N/m] we have

E st Dt,R(t+1)[ℓ(ht+1, st) | E]

= E st Dt,es1, ,es N

ℓ(Ot+1(es1, , es N, s1:t 1, st), st) | E

= E v,s1:t 1|E

E st,u1, ,u N/m

h ℓ(Ot+1(s1:t 1, v, u1, , u N/m, st), st) | E, s1:t 1, v i!

= E v,s1:t 1|E

E st,u1, ,u N/m

h ℓ(Ot+1(s1:t 1, v, u1, , uj 1, st, uj+1, , u N/m, uj), uj) | E, s1:t 1, v i!

= E v,s1:t 1|E

E st,u1, ,u N/m[ℓ(Ot+1(s1:t 1, v, u1, , u N/m, st), uj) | E, s1:t 1, v]

= E st Dt,R(t+1)[ℓ(ht+1, uj) | E],

where (18) follows from the conditional iid (and thus exchangeable) property of (u1, , u N/m, st) after the conditioning, and (19) is due to the invariance of the Ot+1 after any permutation of the inputs. On the other hand, if s t, u 1, , u N/m are independent copies of st Dt, by independence it is clear that

E st,s t Dt,R(t+1)[ℓ(ht+1, s t) | E] = E st,s t Dt,R(t+1)[ℓ(ht+1, u j) | E], j [N/m].

Consequently, using the shorthand u0 = st, u 0 = s t, we have

E st,s t Dt,R(t+1)[ℓ(ht+1, s t) ℓ(ht+1, st) | E]

= 1 N/m + 1 E st,s t Dt,R(t+1)

j=0 (ℓ(ht+1, u j) ℓ(ht+1, uj)) E

1 N/m + 1 E u0, ,u N/m,u 0, ,u N/m Dt

j=0 (ℓ(h, u j) ℓ(h, uj))

2α N/m + 1 E u0, ,u N/m Dt E ϵ1...ϵN/m

j=0 ϵjh(uj)

αRad Fα, N/m .

The last inequality uses the fact that the algorithm always outputs a function in Fα. Further, we have used the Ledoux-Talagrand contraction inequality. Theorem H.2 (Ledoux-Talagrand Contraction). Let g : R R be a L-Lipschitz function. For a function class F, denote by g F the compositions of function in F with g. Then, for all n,

Rad (g F, n) L Rad (F, n) .

Last inequality follows from the fact that the derivative of the log loss is bounded by 1/α when truncated at level α. Note that the union bound gives

j=1 Pr[Ec j] N (1 σ)m

Thus,the law of total expectation gives

E st,s t Dt,R(t+1)[ℓ(ht+1, s t) ℓ(ht+1, st)]

E st,s t Dt,R(t+1)[ℓ(ht+1, s t) ℓ(ht+1, st) | E] + Pr[Ec] log(1/α)

αRad Fα, N/m + N (1 σ)m log 1/α

The last equation follows from the fact that the output of the algorithm has loss always bounded by log 1/α .

We get the desired result by taking the expectation of N Poi(n), and using Pr[N > n/2] 1 e n/8 in the above inequality completes the proof.

I Bound on the Perturbation Term

Lemma I.1 (Perturbation).

i=1 L(ˆh, st) L(h , st)

Proof. Note from the truncation step in Algorithm 1, we have that L(ˆh, st) log (α). We get the desired bound by taking expectations.