# Strategic Classification under Unknown Personalized Manipulation

Han Shao, Toyota Technological Institute at Chicago, Chicago, 60637, han@ttic.edu
Avrim Blum, Toyota Technological Institute at Chicago, Chicago, 60637, avrim@ttic.edu
Omar Montasser, Toyota Technological Institute at Chicago, Chicago, 60637, omar@ttic.edu

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

We study the fundamental mistake bound and sample complexity in strategic classification, where agents can strategically manipulate their feature vector up to some extent in order to be predicted as positive. For example, given a classifier determining college admission, student candidates may try to take easier classes to improve their GPA, retake the SAT, and change schools in an effort to fool the classifier. Ball manipulations are a widely studied class of manipulations in the literature, where agents can modify their feature vector within a bounded-radius ball. Unlike most prior work, our work considers manipulations to be personalized, meaning that agents can have different levels of manipulation ability (e.g., varying radii for ball manipulations), which are unknown to the learner. We formalize the learning problem in an interaction model where the learner first deploys a classifier and the agent then manipulates the feature vector within their manipulation set to game the deployed classifier. We investigate various scenarios in terms of the information available to the learner during the interaction, such as observing the original feature vector before or after deployment, observing the manipulated feature vector, or observing neither the original nor the manipulated feature vector. We begin by providing online mistake bounds and PAC sample complexity in these scenarios for ball manipulations. We also explore non-ball manipulations and show that, even in the simplest scenario where both the original and the manipulated feature vectors are revealed, the mistake bounds and sample complexity are lower bounded by Ω(|H|) when the target function belongs to a known class H.

## 1 Introduction

Strategic classification addresses the problem of learning a classifier robust to manipulation and gaming by self-interested agents (Hardt et al., 2016). For example, given a classifier determining loan approval based on credit scores, applicants could open or close credit cards and bank accounts to increase their credit scores. In the case of a college admission classifier, students may try to take easier classes to improve their GPA, retake the SAT, or change schools in an effort to be admitted. In both cases, such manipulations do not change their true qualifications. Recently, a collection of papers has studied strategic classification in both the online setting, where examples are chosen by an adversary in a sequential manner (Dong et al., 2018; Chen et al., 2020; Ahmadi et al., 2021, 2023), and the distributional setting, where examples are drawn from an underlying data distribution (Hardt et al., 2016; Zhang and Conitzer, 2021; Sundaram et al., 2021; Lechner and Urner, 2022). Most existing works assume that manipulation ability is uniform across all agents or is known to the learner. However, in reality, this may not always be the case.
For instance, low-income students may have a lower ability to manipulate the system compared to their wealthier peers due to factors such as the high cost of retaking the SAT or enrolling in additional classes, as well as greater barriers to accessing information about college (Milli et al., 2019). Moreover, it is impossible for the learner to know the highest achievable GPA or the maximum number of times a student may retake the SAT, as these depend on external factors such as socio-economic background and personal circumstances. We characterize the manipulation of an agent by a set of alternative feature vectors that she can modify her original feature vector to, which we refer to as the manipulation set. Ball manipulations are a widely studied class of manipulations in the literature, where agents can modify their feature vector within a bounded-radius ball. For example, Dong et al. (2018), Chen et al. (2020), and Sundaram et al. (2021) studied ball manipulations with the distance function being some norm, and Zhang and Conitzer (2021), Lechner and Urner (2022), and Ahmadi et al. (2023) studied a manipulation graph setting, which can be viewed as ball manipulation w.r.t. the graph distance on a predefined, known graph.

In the online learning setting, strategic agents arrive sequentially and try to game the current classifier. Following previous work, we model the learning process as a repeated Stackelberg game over T time steps. In round t, the learner proposes a classifier f_t and then the agent, with a manipulation set unknown to the learner, manipulates her feature vector in an effort to receive a positive prediction from f_t. There are several settings based on what information about the original and manipulated feature vectors is revealed, and when, during the game. The simplest setting for the learner is observing the original feature vector before choosing f_t and the manipulated vector after. In a slightly harder setting, the learner observes both the original and manipulated vectors after selecting f_t. An even harder setting involves observing only the manipulated feature vector after selecting f_t. The hardest and least informative scenario occurs when neither the original nor the manipulated feature vectors are observed.

In the distributional setting, the agents are sampled from an underlying data distribution. Previous work assumes that the learner has full knowledge of the original feature vector and the manipulation set, views learning as a one-shot game, and solves it by computing its Stackelberg equilibria. However, when manipulations are personalized and unknown, we cannot compute an equilibrium and study learning as a one-shot game. In this work, we extend the iterative online interaction model from the online setting to the distributional setting, where the sequence of agents is sampled i.i.d. from the data distribution. After repeated learning for T (which is equal to the sample size) rounds, the learner has to output a strategy-robust predictor for future use. In both the online and distributional settings, examples are viewed through the lens of the current predictor, and the learner does not have the ability to inquire about the strategies the previous examples would have adopted under a different predictor.

**Related work** Our work is primarily related to strategic classification in online and distributional settings. Strategic classification was first studied in a distributional model by Hardt et al. (2016) and subsequently by Dong et al. (2018) in an online model. Hardt et al.
(2016) assumed that agents manipulate by best responding with respect to a uniform cost function known to the learner. Building on the framework of Hardt et al. (2016), Lechner and Urner (2022); Sundaram et al. (2021); Zhang and Conitzer (2021); Hu et al. (2019); Milli et al. (2019) studied the distributional learning problem, and all of them assumed that the manipulations are predefined and known to the learner, either via a cost function or a predefined manipulation graph. For online learning, Dong et al. (2018) considered a manipulation setting similar to ours, where manipulations are personalized and unknown. However, they studied linear classification with ball manipulations in the online setting and focused on finding appropriate conditions on the cost function to achieve sub-linear Stackelberg regret. Chen et al. (2020) also studied Stackelberg regret in linear classification with uniform ball manipulations. Ahmadi et al. (2021) studied the mistake bound under uniform (possibly unknown) ball manipulations, and Ahmadi et al. (2023) studied regret under a predefined and known manipulation graph. The most relevant work is a recent concurrent study by Lechner et al. (2023), which also explores strategic classification involving unknown personalized manipulations but with a different loss function. In their work, a predictor incurs a loss of 0 if and only if the agent refrains from manipulation and the predictor correctly predicts at the unmanipulated feature vector. In our work, the predictor's loss is 0 if it correctly predicts at the manipulated feature vector, even when the agent manipulates. As a result, their loss function serves as an upper bound on our loss function.

There has been a lot of research on various other issues and models in strategic classification. Beyond sample complexity, Hu et al. (2019); Milli et al. (2019) focused on other social objectives, such as social burden and fairness. Recent works also explored different models of agent behavior, including proactive agents (Zrnic et al., 2021), non-myopic agents (Haghtalab et al., 2022), and noisy agents (Jagadeesan et al., 2021). Ahmadi et al. (2023) consider two agent models for randomized learners: a randomized algorithm model where agents respond to the realization, and a fractional classifier model where agents respond to the expectation; our model corresponds to the randomized algorithm model. Additionally, there is a line of research on agents interested in improving their qualifications instead of gaming the classifier (Kleinberg and Raghavan, 2020; Haghtalab et al., 2020; Ahmadi et al., 2022). Strategic interactions in the regression setting have also been studied (e.g., Bechavod et al. (2021)). Beyond strategic classification, there is a more general research area of learning from data provided by strategic sources, such as a single data-generation player who manipulates the data distribution (Brückner and Scheffer, 2011; Dalvi et al., 2004). Adversarial perturbations can be viewed as another type of strategic source (Montasser et al., 2019).

## 2 Strategic Classification

Throughout this work, we consider the binary classification task. Let X denote the feature vector space, Y = {+1, −1} denote the label space, and H ⊆ Y^X denote the hypothesis class.
In the strategic setting, instead of an example being a pair (x, y), an example, or agent, is a triple (x, u, y), where x ∈ X is the original feature vector, y ∈ Y is the label, and u ⊆ X is the manipulation set, i.e., the set of feature vectors that the agent can modify their original feature vector x to. In particular, given a hypothesis h ∈ Y^X, the agent will try to manipulate her feature vector x to another feature vector x′ ∈ u in order to receive a positive prediction from h. The manipulation set u is unknown to the learner. In this work, we consider several settings based on what information is revealed to the learner (both the original and manipulated feature vectors, the manipulated feature vector only, or neither) and when the information is revealed.

More formally, for agent (x, u, y), given a predictor h, if h(x) = −1 and her manipulation set overlaps the positive region of h, i.e., u ∩ X_{h,+} ≠ ∅ with X_{h,+} := {x′ ∈ X | h(x′) = +1}, the agent will manipulate x to Δ(x, h, u) ∈ u ∩ X_{h,+}¹ to receive a positive prediction from h. Otherwise, the agent will do nothing and maintain her feature vector at x, i.e., Δ(x, h, u) = x. We call Δ(x, h, u) the manipulated feature vector of agent (x, u, y) under predictor h.

A general and fundamental type of manipulation is ball manipulations, where agents can manipulate their feature vector within a ball of personalized radius. More specifically, given a metric d over X, the manipulation set is a ball B(x; r) = {x′ | d(x, x′) ≤ r} centered at x with radius r for some r ∈ R_{≥0}. Note that we allow different agents to have different manipulation power, and the radius can vary over agents. Let Q denote the set of allowed pairs (x, u), which we refer to as the feature-manipulation set space. For ball manipulations, we have Q = {(x, B(x; r)) | x ∈ X, r ∈ R_{≥0}} for some known metric d over X. In the context of ball manipulations, we use (x, r, y) to represent (x, B(x; r), y) and Δ(x, h, r) to represent Δ(x, h, B(x; r)) for notational simplicity.

For any hypothesis h, the strategic loss ℓ^str(h, (x, u, y)) of h is defined as the loss at the manipulated feature vector, i.e., ℓ^str(h, (x, u, y)) := 1(h(Δ(x, h, u)) ≠ y). According to our definition of Δ(·), we can write down the strategic loss explicitly as

ℓ^str(h, (x, u, y)) =
  1 if y = −1 and h(x) = +1,
  1 if y = −1, h(x) = −1 and u ∩ X_{h,+} ≠ ∅,
  1 if y = +1, h(x) = −1 and u ∩ X_{h,+} = ∅,
  0 otherwise.

For any randomized predictor p (a distribution over hypotheses), the strategic behavior depends on the realization of the predictor, and the strategic loss of p is ℓ^str(p, (x, u, y)) := E_{h∼p}[ℓ^str(h, (x, u, y))].

¹For ball manipulations, agents break ties by selecting the closest vector. When there are multiple closest vectors, agents break ties arbitrarily. For non-ball manipulations, agents break ties in any fixed way.

**Online learning** We consider the task of sequential classification, where the learner aims to classify a sequence of agents (x_1, u_1, y_1), (x_2, u_2, y_2), . . . , (x_T, u_T, y_T) ∈ (Q × Y)^T that arrives in an online manner. At each round, the learner feeds a predictor to the environment and then observes the prediction ŷ_t, the true label y_t, and possibly some additional information about the original/manipulated feature vectors. We say the learner makes a mistake at round t if ŷ_t ≠ y_t, and the learner's goal is to minimize the number of mistakes on the sequence.
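Before describing the interaction protocol, the following is a minimal sketch (ours, not from the paper) of the agent best response Δ and the strategic loss ℓ^str for ball manipulations. It assumes a finite candidate feature space so that the argmin can be computed by enumeration; all names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List

Feature = tuple                              # a feature vector, e.g., a tuple of floats
Hypothesis = Callable[[Feature], int]        # maps a feature vector to +1 / -1

@dataclass
class Agent:
    x: Feature   # original feature vector
    r: float     # personalized manipulation radius (unknown to the learner)
    y: int       # true label, +1 or -1

def best_response(agent: Agent, h: Hypothesis,
                  d: Callable[[Feature, Feature], float],
                  candidates: List[Feature]) -> Feature:
    """Ball manipulation: if x is classified negative, move to the closest
    positively-classified point within radius r (if any); otherwise stay at x."""
    if h(agent.x) == +1:
        return agent.x
    reachable_pos = [z for z in candidates if d(agent.x, z) <= agent.r and h(z) == +1]
    if not reachable_pos:
        return agent.x
    return min(reachable_pos, key=lambda z: d(agent.x, z))  # ties broken arbitrarily

def strategic_loss(agent: Agent, h: Hypothesis, d, candidates) -> int:
    """1 if h mispredicts the *manipulated* feature vector, else 0."""
    manipulated = best_response(agent, h, d, candidates)
    return int(h(manipulated) != agent.y)
```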
The interaction protocol (which repeats for t = 1, . . . , T) is described below.

**Protocol 1** Learner-Agent Interaction at round t
1: The environment picks an agent (x_t, u_t, y_t) and reveals some context C(x_t). In the online setting, the agent is chosen adversarially, while in the distributional setting, the agent is sampled i.i.d.
2: The learner A observes C(x_t) and picks a hypothesis f_t ∈ Y^X.
3: The learner A observes the true label y_t, the prediction ŷ_t = f_t(Δ_t), and some feedback F(x_t, Δ_t), where Δ_t = Δ(x_t, f_t, u_t) is the manipulated feature vector.

The context function C(·) and feedback function F(·) reveal information about the original feature vector x_t and the manipulated feature vector Δ_t. C(·) reveals its information before the learner picks f_t, while F(·) does so after. We study several settings based on what information is revealed and when.

The simplest setting for the learner is observing the original feature vector x_t before choosing f_t and the manipulated vector Δ_t after. Consider a teacher giving students a writing assignment or take-home exam. The teacher might have good knowledge of the students' abilities (which correspond to the original feature vector x_t) based on their performance in class, but the grade has to be based on how well they do on the assignment. The students might manipulate by using the help of ChatGPT, Google, Wolfram Alpha, their parents, etc. The teacher wants to create an assignment that will work well even in the presence of these manipulation tools. In addition, if we think of each example as representing a subpopulation (e.g., an organization is thinking of offering loans to a certain group), then there might be known statistics about that population, even though the individual classification (loan) decisions have to be made based on responses to the classifier. This setting corresponds to C(x_t) = x_t and F(x_t, Δ_t) = Δ_t. We denote a setting by its values of (C, F) and thus denote this setting by (x, Δ).

In a slightly harder setting, the learner observes both the original and manipulated vectors after selecting f_t, and thus f_t cannot depend on the original feature vector. For example, if a high-school student takes the SAT multiple times, most colleges promise to only consider the highest score (or even to "superscore" the test by considering the highest score separately in each section), but they do require the student to submit all of them. Then C(x_t) = ⊥ and F(x_t, Δ_t) = (x_t, Δ_t), where ⊥ is a token for "no information", and this setting is denoted by (⊥, (x, Δ)). An even harder setting involves observing only the manipulated feature vector after selecting f_t (which can only be revealed after f_t since Δ_t depends on f_t). Then C(x_t) = ⊥ and F(x_t, Δ_t) = Δ_t, and this setting is denoted by (⊥, Δ). The hardest and least informative scenario occurs when neither the original nor the manipulated feature vectors are observed. Then C(x_t) = ⊥ and F(x_t, Δ_t) = ⊥, and it is denoted by (⊥, ⊥).
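A minimal simulation of Protocol 1 under the four information settings, reusing the ball-manipulation helpers sketched above (`best_response`); the learner interface (`choose`, `update`) and the setting strings are our own assumptions.

```python
def run_protocol(learner, agents, d, candidates, setting="(x, Delta)"):
    """Simulate Protocol 1. `setting` controls the context C(x_t) revealed before f_t
    and the feedback F(x_t, Delta_t) revealed after f_t:
      "(x, Delta)":        C = x_t,  F = Delta_t
      "(bot, (x, Delta))": C = bot,  F = (x_t, Delta_t)
      "(bot, Delta)":      C = bot,  F = Delta_t
      "(bot, bot)":        C = bot,  F = bot
    Returns the number of mistakes over the sequence."""
    mistakes = 0
    for agent in agents:
        context = agent.x if setting == "(x, Delta)" else None   # C(x_t)
        f_t = learner.choose(context)                            # learner picks f_t
        delta_t = best_response(agent, f_t, d, candidates)       # agent best responds
        y_hat = f_t(delta_t)
        mistakes += int(y_hat != agent.y)
        if setting == "(bot, (x, Delta))":
            feedback = (agent.x, delta_t)
        elif setting in ("(x, Delta)", "(bot, Delta)"):
            feedback = delta_t
        else:
            feedback = None                                      # (bot, bot)
        learner.update(agent.y, y_hat, feedback, context)        # y_t, y_hat_t, F, C
    return mistakes
```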
Throughout this work, we focus on the realizable setting, where there exists a perfect classifier in H that never makes a mistake on the sequence of strategic agents. More specifically, there exists a hypothesis h* ∈ H such that for every t ∈ [T], we have y_t = h*(Δ(x_t, h*, u_t)).² We then define the mistake bound as follows.

**Definition 1.** For any choice of (C, F), let A be an online learning algorithm under Protocol 1 in the setting of (C, F). Given any realizable sequence S = ((x_1, u_1, h*(Δ(x_1, h*, u_1))), . . . , (x_T, u_T, h*(Δ(x_T, h*, u_T)))) ∈ (Q × Y)^T, where T is any integer and h* ∈ H, let M_A(S) be the number of mistakes A makes on the sequence S. The mistake bound of (H, Q), denoted MB_{C,F}, is the smallest number B ∈ N such that there exists an algorithm A with M_A(S) ≤ B over all realizable sequences S of the above form.

²It is possible that there is no hypothesis h ∈ Y^X s.t. y_t = h(x_t) for all t ∈ [T].

According to the ranking of difficulty of the four settings with different choices of (C, F), the mistake bounds are ordered as MB_{x,Δ} ≤ MB_{⊥,(x,Δ)} ≤ MB_{⊥,Δ} ≤ MB_{⊥,⊥}.

**PAC learning** In the distributional setting, the agents are sampled from an underlying distribution D over Q × Y. The learner's goal is to find a hypothesis h with low population loss L^str_D(h) := E_{(x,u,y)∼D}[ℓ^str(h, (x, u, y))]. One might think of running empirical risk minimization (ERM) over samples drawn from the underlying data distribution, i.e., returning argmin_{h∈H} (1/m) Σ_{i=1}^m ℓ^str(h, (x_i, u_i, y_i)), where (x_1, u_1, y_1), . . . , (x_m, u_m, y_m) are sampled i.i.d. from D. However, ERM is unimplementable because the manipulation sets u_i are never revealed to the algorithm, and only partial feedback in response to the deployed classifier is provided. In particular, in this work we consider using the same interaction protocol as in the online setting, i.e., Protocol 1, with agents (x_t, u_t, y_t) sampled i.i.d. from the data distribution D. After T rounds of interaction (i.e., T i.i.d. agents), the learner has to output a predictor f^out for future use. Again, we focus on the realizable setting, where the sequence of sampled agents (with manipulation) can be perfectly classified by a target function in H. Equivalently, there exists a classifier with zero population loss, i.e., there exists a hypothesis h* ∈ H such that L^str_D(h*) = 0. We then formalize the notion of PAC sample complexity under strategic behavior as follows.

**Definition 2.** For any choice of (C, F), let A be a learning algorithm that interacts with agents using Protocol 1 in the setting of (C, F) and outputs a predictor f^out in the end. For any ε, δ ∈ (0, 1), the sample complexity of realizable (ε, δ)-PAC learning of (H, Q), denoted SC_{C,F}(ε, δ), is defined as the smallest m ∈ N for which there exists a learning algorithm A of the above form such that for any distribution D over Q × Y for which there exists a predictor h* ∈ H with zero loss, L^str_D(h*) = 0, with probability at least 1 − δ over (x_1, u_1, y_1), . . . , (x_m, u_m, y_m) drawn i.i.d. from D, we have L^str_D(f^out) ≤ ε.

Similar to the mistake bounds, the sample complexities are ordered as SC_{x,Δ} ≤ SC_{⊥,(x,Δ)} ≤ SC_{⊥,Δ} ≤ SC_{⊥,⊥} according to the ranking of difficulty of the four settings.

## 3 Overview of Results

In classic (non-strategic) online learning, the Halving algorithm achieves a mistake bound of log(|H|) by predicting with the majority vote and eliminating inconsistent hypotheses at each round. In classic PAC learning, a sample complexity of O(log(|H|)/ε) is achievable via ERM. Both the mistake bound and the sample complexity exhibit a logarithmic dependence on |H|. This logarithmic dependence on |H| (absent further structural assumptions) is tight in both settings, i.e., there exist examples of H with mistake bound Ω(log(|H|)) and sample complexity Ω(log(|H|)/ε). In the setting where the manipulation is known beforehand and only Δ_t is observed, Ahmadi et al. (2023) proved a lower bound of Ω(|H|) on the mistake bound. Since in the strategic setting we can achieve a linear dependence on |H| by trying each hypothesis in H one by one and discarding it once it makes a mistake (see the sketch below), the question arises: can we achieve a logarithmic dependence on |H| in strategic classification?
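For contrast, here is a minimal sketch (ours, not from the paper) of the two baselines referenced above: classic Halving, which requires seeing each hypothesis's prediction on x_t, and the naive strategic baseline that deploys hypotheses one by one, discarding one per mistake, for at most |H| − 1 mistakes in the realizable case. It reuses `Agent` and `strategic_loss` from the earlier sketch.

```python
def classic_halving(version_space, rounds):
    """Classic (non-strategic) Halving: predict by majority vote over the version space,
    then eliminate every hypothesis that disagrees with the revealed label.
    `rounds` yields (x_t, y_t) pairs; assumes full-information feedback."""
    VS = list(version_space)
    mistakes = 0
    for x_t, y_t in rounds:
        votes = sum(h(x_t) for h in VS)            # hypotheses output +1 / -1
        y_hat = +1 if votes >= 0 else -1
        mistakes += int(y_hat != y_t)
        VS = [h for h in VS if h(x_t) == y_t]      # each mistake halves |VS|
    return mistakes

def try_one_by_one(hypotheses, agents, d, candidates):
    """Naive strategic baseline: deploy the current hypothesis until it makes a mistake,
    then discard it and move to the next. Uses only the (y_t, y_hat_t) feedback; in the
    realizable case the target h* is never discarded, so at most |H| - 1 mistakes occur."""
    remaining = list(hypotheses)
    mistakes = 0
    for agent in agents:
        if strategic_loss(agent, remaining[0], d, candidates) == 1:
            mistakes += 1
            remaining.pop(0)                       # discard the hypothesis that just erred
    return mistakes
```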
In this work, we show that the dependence on |H| varies across different settings and that, in some settings, the mistake bound and the PAC sample complexity exhibit different dependences on |H|. We start by presenting our results for ball manipulations in the four settings.

**Setting (x, Δ) (observing x_t before choosing f_t and observing Δ_t after):** For online learning, we propose a variant of the Halving algorithm, called Strategic Halving (Algorithm 1), which eliminates half of the remaining hypotheses whenever it makes a mistake. The algorithm depends on observing x_t before choosing the predictor f_t. Then, by applying the standard technique of converting a mistake bound to a PAC bound, we achieve a sample complexity of O(log(|H|) loglog(|H|)/ε).

**Setting (⊥, (x, Δ)) (observing both x_t and Δ_t after selecting f_t):** We prove that there exists an example of (H, Q) such that the mistake bound is lower bounded by Ω(|H|). This implies that no algorithm can perform significantly better than sequentially trying each hypothesis, which makes at most |H| mistakes before finding the correct hypothesis. However, unlike the constructions of mistake lower bounds in classic online learning, where all mistakes can be forced to occur in the initial rounds, we show that Θ(|H|²) rounds are required to force all of these mistakes. In the PAC setting, we first show that any learning algorithm with proper output f^out, i.e., f^out ∈ H, needs a sample size of Ω(|H|/ε). We can achieve a sample complexity of O(log²(|H|)/ε) by executing Algorithm 2, which is a randomized algorithm with improper output.

**Setting (⊥, Δ) (observing only Δ_t after selecting f_t):** The mistake bound of Ω(|H|) also holds in this setting, as it is harder than the previous setting. For PAC learning, we show that any conservative algorithm, which uses only the information from the mistake rounds, requires Ω̃(|H|/ε) samples. The optimal sample complexity is left as an open problem.

**Setting (⊥, ⊥) (observing neither x_t nor Δ_t):** Similarly, the mistake bound of Ω(|H|) still holds. For PAC learning, we show that the sample complexity is Ω̃(|H|/ε) by reducing the problem to a stochastic linear bandit problem.

Then we move on to non-ball manipulations. We show that even in the simplest setting of observing x_t before choosing f_t and observing Δ_t after, there is an example of (H, Q) such that the sample complexity is Ω̃(|H|/ε). This implies that in all four settings of revealed information, we have sample complexity Ω̃(|H|/ε) and mistake bound Ω̃(|H|). We summarize our results in Table 1.

Table 1: Summary of results. Õ and Ω̃ ignore logarithmic factors in |H| and 1/ε. The superscript "prop" stands for proper learning algorithms and "csv" for conservative learning algorithms.
- Setting (x, Δ): mistake bound Θ(log(|H|)) (Thm 1); sample complexity Õ(log(|H|)/ε)ᵃ (Thm 2), Ω(log(|H|)/ε).
- Setting (⊥, (x, Δ)): mistake bound O(min(√(log(|H|)·T), |H|)) (Thm 4), Ω(min(T/(|H| log(|H|)), |H|)) (Thm 3); sample complexity O(log²(|H|)/ε) (Thm 6), Ω(log(|H|)/ε), and SC^prop = Ω(|H|/ε) (Thm 5).
- Setting (⊥, Δ): mistake bound Θ(|H|) (implied by Thm 3); sample complexity SC^csv = Ω̃(|H|/ε) (Thm 7).
- Setting (⊥, ⊥): mistake bound Θ(|H|) (implied by Thm 3); sample complexity Õ(|H|/ε), Ω̃(|H|/ε) (Thm 8).
- Non-ball manipulations (all settings): mistake bound Ω̃(|H|) (Cor 1), O(|H|); sample complexity Õ(|H|/ε), Ω̃(|H|/ε) (Cor 1).

ᵃ A factor of loglog(|H|) is neglected.

All lower bounds in the non-strategic setting also apply to the strategic setting, implying MB_{C,F} = Ω(log(|H|)) and SC_{C,F} = Ω(log(|H|)/ε) for all settings of (C, F).
In all four settings, a mistake bound of O(|H|) can be achieved by simply trying each hypothesis in H, and a sample complexity of Õ(|H|/ε) can be achieved by converting the O(|H|) mistake bound to a PAC bound using standard techniques.

## 4 Ball Manipulations

In ball manipulations, when B(x; r) ∩ X_{h,+} has multiple elements, the agent always breaks ties by selecting the one closest to x, i.e., Δ(x, h, r) = argmin_{x′ ∈ B(x;r) ∩ X_{h,+}} d(x, x′). In round t, the learner deploys predictor f_t, and once he knows x_t and ŷ_t, he can compute Δ_t himself without needing knowledge of r_t via

Δ_t = argmin_{x′ ∈ X_{f_t,+}} d(x_t, x′) if ŷ_t = +1, and Δ_t = x_t if ŷ_t = −1.

Thus, for ball manipulations, knowing x_t is equivalent to knowing both x_t and Δ_t.

### 4.1 Setting (x, Δ): Observing x_t Before Choosing f_t

**Online learning** We propose a new algorithm with mistake bound log(|H|) in setting (x, Δ). To achieve a logarithmic mistake bound, we must construct a predictor f_t such that if it makes a mistake, we can eliminate a constant fraction of the remaining hypotheses. The primary challenge is that we do not have access to full information, and the predictions of other hypotheses are hidden. To extract information about the predictions of other hypotheses, we take advantage of ball manipulations, which induce an ordering over all hypotheses. Specifically, for any hypothesis h and feature vector x, we define the distance between x and h as the distance between x and the positive region of h, X_{h,+}, i.e.,

d(x, h) := min{d(x, x′) | x′ ∈ X_{h,+}} .   (2)

At each round t, given x_t, the learner calculates the distance d(x_t, h) for all h in the version space (the hypotheses consistent with the history) and selects a hypothesis f_t such that d(x_t, f_t) is the median among all distances d(x_t, h) for h in the version space. We show that by selecting f_t in this way, the learner can eliminate half of the version space whenever f_t makes a mistake. We refer to this algorithm as Strategic Halving and provide a detailed description in Algorithm 1.

**Theorem 1.** For any feature-ball manipulation set space Q and hypothesis class H, Strategic Halving achieves mistake bound MB_{x,Δ} ≤ log₂(|H|).

**Algorithm 1** Strategic Halving
1: Initialize the version space VS = H.
2: for t = 1, . . . , T do
3:   pick an f_t ∈ VS such that d(x_t, f_t) is the median of {d(x_t, h) | h ∈ VS}.
4:   if ŷ_t ≠ y_t and y_t = +1 then VS ← VS \ {h ∈ VS | d(x_t, h) ≥ d(x_t, f_t)};
5:   else if ŷ_t ≠ y_t and y_t = −1 then VS ← VS \ {h ∈ VS | d(x_t, h) ≤ d(x_t, f_t)}.
6: end for

To prove Theorem 1, we only need to show that each mistake reduces the version space by half. Suppose that f_t misclassifies a true positive example (x_t, r_t, +1) as negative; then d(x_t, f_t) > r_t, while the target hypothesis h* must satisfy d(x_t, h*) ≤ r_t. Hence any h with d(x_t, h) ≥ d(x_t, f_t) cannot be h* and should be eliminated. Since d(x_t, f_t) is the median of {d(x_t, h) | h ∈ VS}, we can eliminate half of the version space. The case where f_t misclassifies a true negative is similar. The detailed proof is deferred to Appendix B.
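A minimal sketch of Strategic Halving (ours, not the paper's code), reusing the finite candidate space and the `Agent`/`best_response` helpers from the earlier sketches; d(x, h) is computed by enumerating the positive region of h.

```python
def dist_to_hypothesis(x, h, d, candidates):
    """d(x, h): distance from x to the positive region of h (eq. (2));
    infinity if h predicts everything negative."""
    positives = [z for z in candidates if h(z) == +1]
    return min((d(x, z) for z in positives), default=float("inf"))

def strategic_halving(hypotheses, agents, d, candidates):
    """Algorithm 1: deploy the hypothesis whose distance d(x_t, .) is the median over
    the version space; on a mistake, eliminate the half that cannot contain h*."""
    VS = list(hypotheses)
    mistakes = 0
    for agent in agents:
        ordered = sorted(VS, key=lambda h: dist_to_hypothesis(agent.x, h, d, candidates))
        f_t = ordered[len(ordered) // 2]                    # median-distance hypothesis
        d_ft = dist_to_hypothesis(agent.x, f_t, d, candidates)
        y_hat = f_t(best_response(agent, f_t, d, candidates))
        if y_hat != agent.y:
            mistakes += 1
            if agent.y == +1:   # f_t was too far from x_t: h* must be strictly closer
                VS = [h for h in VS
                      if dist_to_hypothesis(agent.x, h, d, candidates) < d_ft]
            else:               # f_t was too close to x_t: h* must be strictly farther
                VS = [h for h in VS
                      if dist_to_hypothesis(agent.x, h, d, candidates) > d_ft]
    return mistakes, VS
```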
**PAC learning** We can convert Strategic Halving to a PAC learner via the standard technique of converting a mistake bound to a PAC bound (Gallant, 1986). Specifically, the learner runs Strategic Halving until it produces a hypothesis f_t that survives for (1/ε) log(log(|H|)/δ) rounds and outputs this f_t. This yields Theorem 2; the proof is included in Appendix C.

**Theorem 2.** For any feature-ball manipulation set space Q and hypothesis class H, we can achieve SC_{x,Δ}(ε, δ) = O((log(|H|)/ε) · log(log(|H|)/δ)) by combining Strategic Halving with the standard technique of converting a mistake bound to a PAC bound.

### 4.2 Setting (⊥, (x, Δ)): Observing x_t After Choosing f_t

When x_t is not revealed before the learner chooses f_t, Strategic Halving no longer works. We demonstrate that it is impossible to eliminate a constant fraction of the version space on every mistake, and we prove that the mistake bound is lower bounded by Ω(|H|) by constructing a hard example of (H, Q). However, we can still achieve a sample complexity with poly-logarithmic dependence on |H| in the distributional setting.

#### 4.2.1 Results in the Online Learning Model

To offer readers an intuitive understanding of the distinctions between the strategic setting and standard online learning, we begin by presenting an example in which no deterministic learner, including the Halving algorithm, can make fewer than |H| − 1 mistakes.

**Example 1.** Consider a star-shaped metric space (X, d), where X = {0, 1, . . . , n}, d(i, j) = 2 and d(0, i) = 1 for all i, j ∈ [n] with i ≠ j. The hypothesis class is composed of singletons over [n], i.e., H = {2·1_{{i}} − 1 : i ∈ [n]}. When the learner is deterministic, the environment can pick an agent (x_t, r_t, y_t) depending on f_t. If f_t is all-negative, the environment picks (x_t, r_t, y_t) = (0, 1, +1); the learner makes a mistake but no hypothesis can be eliminated. If f_t predicts 0 as positive, the environment picks (x_t, r_t, y_t) = (0, 0, −1); the learner makes a mistake but no hypothesis can be eliminated. If f_t predicts some i ∈ [n] as positive, the environment picks (x_t, r_t, y_t) = (i, 0, −1); the learner makes a mistake and only the single hypothesis 2·1_{{i}} − 1 is eliminated. Therefore, the learner makes n − 1 mistakes.

In this work, we allow the learner to be randomized. When an (x_t, r_t, y_t) is generated by the environment, the learner can randomly pick an f_t; the environment does not know the realization of f_t but knows the distribution from which f_t is drawn. It turns out that randomization does not help much: we prove that there exists an example in which any (possibly randomized) learner incurs Ω(|H|) mistakes.

**Theorem 3.** There exists a feature-ball manipulation set space Q and hypothesis class H s.t. the mistake bound satisfies MB_{⊥,(x,Δ)} ≥ |H| − 1. Moreover, for any (randomized) algorithm A and any T ∈ N, there exists a realizable sequence (x_t, r_t, y_t)_{1:T} such that with probability at least 1 − δ (over the randomness of A), A makes at least min(T/(5|H| log(|H|/δ)), |H| − 1) mistakes.

Essentially, we design an adversarial environment such that, before identifying the target function h*, the learner makes a mistake with probability roughly 1/|H| at each round, and only gains information about the target function when a mistake is made. The detailed proof is deferred to Appendix D. Theorem 3 establishes a lower bound of |H| − 1 on the mistake bound. However, achieving this bound requires a sufficiently large number of rounds, specifically T = Ω̃(|H|²). This raises the question of whether there exists a learning algorithm that makes o(T) mistakes for any T ≤ |H|². In Example 1, we observed that the adversary can force any deterministic learner to make |H| − 1 mistakes within |H| − 1 rounds. Consequently, no deterministic algorithm can achieve o(T) mistakes.
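A small simulation of the adversary from Example 1 against an arbitrary deterministic learner (a sketch under our own interface assumptions), using the star metric and singleton class above. Because the learner is deterministic, the adversary can inspect f_t before choosing the agent.

```python
def example1_adversary(n, learner, T):
    """Star metric on {0, 1, ..., n}: d(0, i) = 1, d(i, j) = 2 for i != j in [n].
    Hypotheses are singletons over [n]. Each round forces a mistake while letting the
    learner eliminate at most one hypothesis, so n - 1 mistakes are unavoidable."""
    mistakes = 0
    for _ in range(T):
        f_t = learner.choose()                       # deterministic: adversary sees f_t
        positives = [i for i in range(1, n + 1) if f_t(i) == +1]
        if not positives and f_t(0) == -1:
            x, r, y = 0, 1, +1                       # all-negative f_t: positive point out of reach
        elif f_t(0) == +1:
            x, r, y = 0, 0, -1                       # f_t positive at the center: nothing eliminated
        else:
            x, r, y = positives[0], 0, -1            # f_t positive at some i: only h_i eliminated
        # agent best response: with r = 1 from the center it can reach any point i in [n]
        y_hat = +1 if (f_t(x) == +1 or (r >= 1 and positives)) else -1
        mistakes += int(y_hat != y)
        learner.update(x, r, y, y_hat)
    return mistakes
```

Randomization is what lets a learner escape this trap, which motivates the algorithm described next.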
To address this, we propose a randomized algorithm that closely resembles Algorithm 1, with a modification in the selection of f_t: instead of line 3, we choose f_t uniformly at random from VS, since we lack prior knowledge of x_t. This algorithm can be viewed as a variant of the well-known multiplicative weights method, applied exclusively in mistake rounds. Due to space limitations, we present this algorithm as Algorithm 3 in Appendix E.

**Theorem 4.** For any T ∈ N, Algorithm 3 makes at most min(√(4 log(|H|)·T), |H| − 1) mistakes in expectation over T rounds.

Note that the T-dependent upper bound in Theorem 4 matches the lower bound in Theorem 3 up to a logarithmic factor when T = |H|². This implies that approximately |H|² rounds are needed to force |H| − 1 mistakes, which is tight up to a logarithmic factor. The proof of Theorem 4 is included in Appendix E.

#### 4.2.2 Results in the PAC Learning Model

In the PAC setting, the goal of the learner is to output a predictor f^out after the repeated interactions. A common class of learning algorithms, which output a hypothesis f^out ∈ H, is called proper. Proper learning algorithms are a common starting point when designing algorithms for new learning problems due to their natural appeal and ability to achieve good performance, such as ERM in classic PAC learning. However, in the current setting, we show that proper learning algorithms do not work well and require a sample size linear in |H|. The formal theorem is stated as follows, and the proof is deferred to Appendix F.

**Theorem 5.** There exists a feature-ball manipulation set space Q and hypothesis class H s.t. SC^prop_{⊥,(x,Δ)}(ε, 7/8) = Ω(|H|/ε), where SC^prop_{⊥,(x,Δ)}(ε, δ) is the (ε, δ)-PAC sample complexity achievable by proper algorithms.

Theorem 5 implies that any algorithm achieving sample complexity sub-linear in |H| must be improper. As a result, we are inspired to devise an improper learning algorithm. Before presenting the algorithm, we introduce some notation. For two hypotheses h_1, h_2, let h_1 ∪ h_2 denote their union, i.e., (h_1 ∪ h_2)(x) = +1 iff h_1(x) = +1 or h_2(x) = +1. Similarly, we can define the union of more than two hypotheses. Then, for any union of k hypotheses, f = ∪_{i=1}^k h_i, the positive region of f is the union of the positive regions of the k hypotheses, and thus d(x, f) = min_{i∈[k]} d(x, h_i). Therefore, we can decrease the distance between f and any feature vector x by increasing k. Based on this, we devise a new randomized algorithm with improper output, described in Algorithm 2.

**Theorem 6.** For any feature-ball manipulation set space Q and hypothesis class H, we can achieve SC_{⊥,(x,Δ)}(ε, δ) = O((log²(|H|) + log(1/δ)) log(1/δ)/ε) by combining Algorithm 2 with a standard confidence-boosting technique. Note that the algorithm is improper.

**Algorithm 2**
1: Initialize the version space VS_0 = H.
2: for t = 1, . . . , T do
3:   randomly pick k_t ∼ Unif({1, 2, 2², . . . , 2^{⌈log₂(n_t)⌉−1}}), where n_t = |VS_{t−1}|;
4:   sample k_t hypotheses h_1, . . . , h_{k_t} independently and uniformly at random from VS_{t−1};
5:   let f_t = ∪_{i=1}^{k_t} h_i.
6:   if ŷ_t ≠ y_t and y_t = +1 then VS_t = VS_{t−1} \ {h ∈ VS_{t−1} | d(x_t, h) ≥ d(x_t, f_t)};
7:   else if ŷ_t ≠ y_t and y_t = −1 then VS_t = VS_{t−1} \ {h ∈ VS_{t−1} | d(x_t, h) ≤ d(x_t, f_t)};
8:   else VS_t = VS_{t−1}.
9: end for
10: randomly pick τ from [T] and randomly sample h_1, h_2 from VS_{τ−1} with replacement.
11: output h_1 ∪ h_2
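A minimal sketch of Algorithm 2 (ours), under the same helper assumptions as the earlier sketches (finite candidate space, `dist_to_hypothesis`, `best_response`); the improper predictor is represented as a union of hypotheses.

```python
import math, random

def union_of(hyps):
    """Improper predictor: positive iff any member hypothesis is positive."""
    return lambda z: +1 if any(h(z) == +1 for h in hyps) else -1

def algorithm2(hypotheses, agents, d, candidates, rng=random):
    VS = list(hypotheses)
    snapshots = []                                   # VS_{t-1} for each round t
    for agent in agents:
        snapshots.append(list(VS))
        n_t = len(VS)
        # k_t uniform over {1, 2, 4, ..., 2^(ceil(log2 n_t) - 1)}
        exponent = rng.randrange(max(1, math.ceil(math.log2(n_t)))) if n_t > 1 else 0
        k_t = 2 ** exponent
        members = [rng.choice(VS) for _ in range(k_t)]
        f_t = union_of(members)
        d_ft = min(dist_to_hypothesis(agent.x, h, d, candidates) for h in members)
        y_hat = f_t(best_response(agent, f_t, d, candidates))
        if y_hat != agent.y:                         # update only on mistake rounds
            if agent.y == +1:
                VS = [h for h in VS
                      if dist_to_hypothesis(agent.x, h, d, candidates) < d_ft]
            else:
                VS = [h for h in VS
                      if dist_to_hypothesis(agent.x, h, d, candidates) > d_ft]
    tau = rng.randrange(len(snapshots))              # uniform round tau
    h1, h2 = rng.choice(snapshots[tau]), rng.choice(snapshots[tau])   # with replacement
    return union_of([h1, h2])
```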
Now we outline the high-level ideas behind Algorithm 2. In correct rounds, where f_t makes no mistake, the prediction of every other hypothesis is either correct or unknown, and thus it is hard to determine how to update. In mistake rounds, we can always update the version space in a way similar to Strategic Halving. To achieve a poly-logarithmic dependence on |H|, we aim to eliminate a significant number of misclassifying hypotheses in mistake rounds; the most we can hope to eliminate is a constant fraction of the misclassifying hypotheses. We achieve this by randomly sampling an f_t (lines 3-5) such that f_t makes a mistake and d(x_t, f_t) is greater (respectively, smaller) than the median of d(x_t, h) over all misclassifying hypotheses h for true positive (respectively, true negative) examples. However, due to the asymmetric nature of manipulation, which always aims at a positive prediction, the rate of decreasing misclassifications over true positives is slower than over true negatives. To compensate for this asymmetry, we output f^out = h_1 ∪ h_2 with two selected hypotheses h_1, h_2 (lines 10-11) instead of a single one, to increase the chance of a positive prediction. We prove that Algorithm 2 achieves small strategic loss in expectation, as stated in Lemma 1. We then obtain the sample complexity in Theorem 6 by boosting Algorithm 2 to a strong learner, which is accomplished by running Algorithm 2 multiple times until we obtain a good predictor. The proofs of Lemma 1 and Theorem 6 are deferred to Appendix G.

**Lemma 1.** Let S = (x_t, r_t, y_t)_{t=1}^T ∼ D^T denote the i.i.d. sampled agents in T rounds and let A(S) denote the output of Algorithm 2 after interacting with S. For any feature-ball manipulation set space Q and hypothesis class H, when T ≥ 320 log²(|H|)/ε, we have E_{A,S}[L^str(A(S))] ≤ ε.

### 4.3 Settings (⊥, Δ) and (⊥, ⊥)

**Online learning** As mentioned in Section 2, the settings (⊥, Δ) and (⊥, ⊥) are both harder than the setting (⊥, (x, Δ)), so all lower bounds in the setting (⊥, (x, Δ)) also hold in these two settings. Therefore, by Theorem 3, we have MB_{⊥,⊥} ≥ MB_{⊥,Δ} ≥ MB_{⊥,(x,Δ)} ≥ |H| − 1.

**PAC learning** In the setting (⊥, Δ), Algorithm 2 is no longer applicable since the learner does not observe x_t, making it impossible to replicate the version space updates in lines 6-7. It is worth noting that both PAC learning algorithms we have discussed so far fall under a general category called conservative algorithms, which depend only on information from the mistake rounds. Specifically, an algorithm is said to be conservative if for any t, the predictor f_t depends only on the history of mistake rounds up to t, i.e., rounds τ < t with ŷ_τ ≠ y_τ, and the output f^out depends only on the history of mistake rounds, i.e., (f_t, ŷ_t, y_t, Δ_t)_{t: ŷ_t ≠ y_t}. Any algorithm that goes beyond this category would need to utilize the information in correct rounds; as mentioned earlier, in correct rounds the predictions of all hypotheses are either correct or unknown, which makes it challenging to determine how to update. For conservative algorithms, we present a lower bound of Ω̃(|H|/ε) on the sample complexity in the following theorem, whose proof is included in Appendix H. The optimal sample complexity in the setting (⊥, Δ) is left as an open problem.

**Theorem 7.** There exists a feature-ball manipulation set space Q and hypothesis class H s.t. SC^csv_{⊥,Δ}(ε, 7/8) = Ω̃(|H|/ε), where SC^csv_{⊥,Δ}(ε, δ) is the (ε, δ)-PAC sample complexity achievable by conservative algorithms.
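To make the notion of a conservative algorithm concrete, here is a sketch (ours, not from the paper) of the standard longest-survivor conversion from a conservative mistake-bound learner to a PAC learner, as used for Theorem 2 and in Appendix A.2; the online-learner interface and `play_round` callback are assumptions.

```python
def longest_survivor_pac(online_learner, play_round, num_rounds, survival_budget):
    """Wrap a conservative mistake-bound learner into a PAC learner (longest survivor,
    cf. Gallant 1986): deploy the learner's current predictor until some predictor survives
    `survival_budget` consecutive rounds without a mistake, then output it. Only mistake
    rounds are fed back, so the wrapped learner stays conservative. `play_round(f)` runs
    one round of Protocol 1 with predictor f and returns (y_t, y_hat_t, feedback)."""
    streak = 0
    f_current = online_learner.current_predictor()
    for _ in range(num_rounds):
        y, y_hat, feedback = play_round(f_current)
        if y_hat == y:
            streak += 1
            if streak >= survival_budget:         # e.g., ceil((1/eps) * log(B / delta))
                return f_current
        else:
            online_learner.update_on_mistake(f_current, y, y_hat, feedback)
            f_current = online_learner.current_predictor()
            streak = 0
    return f_current                              # fallback if the budget is never reached
```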
In the setting (⊥, ⊥), our problem reduces to a best-arm identification problem in stochastic bandits. We prove a lower bound of Ω̃(|H|/ε) on the sample complexity in Theorem 8 by a reduction to stochastic linear bandits, applying tools from information theory. The proof is deferred to Appendix I.

**Theorem 8.** There exists a feature-ball manipulation set space Q and hypothesis class H s.t. SC_{⊥,⊥}(ε, 7/8) = Ω̃(|H|/ε).

## 5 Non-ball Manipulations

In this section, we move on to non-ball manipulations. In ball manipulations, for any feature vector x, we have an ordering of hypotheses according to their distances to x, which helps to infer the predictions of some hypotheses without deploying them. In non-ball manipulations, we no longer have such structure. Therefore, even in the simplest setting of observing x_t before choosing f_t and observing Δ_t after, the PAC sample complexity is lower bounded by Ω̃(|H|/ε).

**Theorem 9.** There exists a feature-manipulation set space Q and hypothesis class H s.t. SC_{x,Δ}(ε, 7/8) = Ω̃(|H|/ε).

The proof is deferred to Appendix J. It is worth noting that in the construction used in the proof, all agents have their original feature vector x_t = 0, so that x_t does not provide any information. Since (x, Δ) is the simplest setting and any mistake bound can be converted to a PAC bound via standard techniques (see Section A.2 for more details), we have the following corollary.

**Corollary 1.** There exists a feature-manipulation set space Q and hypothesis class H s.t. for all choices of (C, F), SC_{C,F}(ε, 7/8) = Ω̃(|H|/ε) and MB_{C,F} = Ω̃(|H|).

## 6 Discussion and Open Problems

In this work, we investigate the mistake bound and sample complexity of strategic classification across multiple settings. Unlike prior work, we assume that the manipulation is personalized and unknown to the learner, which makes the strategic classification problem more challenging. In the case of ball manipulations, when the original feature vector x_t is revealed prior to choosing f_t, the problem exhibits a similar level of difficulty as the non-strategic setting (see Table 1 for details). However, when the original feature vector x_t is not revealed beforehand, the problem becomes significantly more challenging. Specifically, any learner will experience a mistake bound that scales linearly with |H|, and any proper learner will face a sample complexity that also scales linearly with |H|. In the case of non-ball manipulations, the situation worsens: even in the simplest setting, where the original feature is observed before choosing f_t and the manipulated feature is observed afterward, any learner incurs a linear mistake bound and sample complexity. Besides the question of the optimal sample complexity in the setting (⊥, Δ) mentioned in Section 4.3, there are some other fundamental open questions.

**Combinatorial measure** Throughout this work, our main focus is on analyzing the dependence on the size of the hypothesis class |H| without assuming any specific structure on H. Just as the VC dimension tightly characterizes PAC learnability and the Littlestone dimension characterizes online learnability, we are curious whether there exists a combinatorial measure that captures the essence of strategic classification in this context. In the proofs of most lower bounds in this work, we consider hypothesis classes of singletons, for which both the VC dimension and the Littlestone dimension are 1. Therefore, they cannot be candidates for characterizing learnability in the strategic setting.

**Agnostic setting** We primarily concentrate on the realizable setting in this work.
However, investigating the sample complexity and regret bounds in the agnostic setting would be an interesting avenue for future research.

**Acknowledgements** This work was supported in part by the National Science Foundation under grant CCF-2212968, by the Simons Foundation under the Simons Collaboration on the Theory of Algorithmic Fairness, and by the Defense Advanced Research Projects Agency under cooperative agreement HR00112020003. The views expressed in this work do not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. Approved for public release; distribution is unlimited.

## References

Ahmadi, S., Beyhaghi, H., Blum, A., and Naggita, K. (2021). The strategic perceptron. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 6-25.

Ahmadi, S., Beyhaghi, H., Blum, A., and Naggita, K. (2022). On classification of strategic agents who can both game and improve. arXiv preprint arXiv:2203.00124.

Ahmadi, S., Blum, A., and Yang, K. (2023). Fundamental bounds on online strategic classification. arXiv preprint arXiv:2302.12355.

Bechavod, Y., Ligett, K., Wu, S., and Ziani, J. (2021). Gaming helps! Learning from strategic interactions in natural dynamics. In International Conference on Artificial Intelligence and Statistics, pages 1234-1242. PMLR.

Brückner, M. and Scheffer, T. (2011). Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547-555.

Chen, Y., Liu, Y., and Podimata, C. (2020). Learning strategy-aware linear classifiers. Advances in Neural Information Processing Systems, 33:15265-15276.

Dalvi, N., Domingos, P., Sanghai, S., and Verma, D. (2004). Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-108.

Dong, J., Roth, A., Schutzman, Z., Waggoner, B., and Wu, Z. S. (2018). Strategic classification from revealed preferences. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 55-70.

Gallant, S. I. (1986). Optimal linear discriminants. In Eighth International Conference on Pattern Recognition, pages 849-852.

Haghtalab, N., Immorlica, N., Lucier, B., and Wang, J. Z. (2020). Maximizing welfare with incentive-aware evaluation mechanisms. arXiv preprint arXiv:2011.01956.

Haghtalab, N., Lykouris, T., Nietert, S., and Wei, A. (2022). Learning in Stackelberg games with non-myopic agents. In Proceedings of the 23rd ACM Conference on Economics and Computation, pages 917-918.

Hardt, M., Megiddo, N., Papadimitriou, C., and Wootters, M. (2016). Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, pages 111-122.

Hu, L., Immorlica, N., and Vaughan, J. W. (2019). The disparate effects of strategic manipulation. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 259-268.

Jagadeesan, M., Mendler-Dünner, C., and Hardt, M. (2021). Alternative microfoundations for strategic classification. In International Conference on Machine Learning, pages 4687-4697. PMLR.

Kleinberg, J. and Raghavan, M. (2020). How do classifiers induce agents to invest effort strategically? ACM Transactions on Economics and Computation (TEAC), 8(4):1-23.

Lechner, T. and Urner, R. (2022). Learning losses for strategic classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7337-7344.

Lechner, T., Urner, R., and Ben-David, S. (2023). Strategic classification with unknown user manipulations.

Milli, S., Miller, J., Dragan, A. D., and Hardt, M. (2019). The social cost of strategic classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 230-239.

Montasser, O., Hanneke, S., and Srebro, N. (2019). VC classes are adversarially robustly learnable, but only improperly. In Conference on Learning Theory, pages 2512-2530. PMLR.

Rajaraman, N., Han, Y., Jiao, J., and Ramchandran, K. (2023). Beyond UCB: Statistical complexity and optimal algorithms for non-linear ridge bandits. arXiv preprint arXiv:2302.06025.

Sundaram, R., Vullikanti, A., Xu, H., and Yao, F. (2021). PAC-learning for strategic classification. In International Conference on Machine Learning, pages 9978-9988. PMLR.

Zhang, H. and Conitzer, V. (2021). Incentive-aware PAC learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 5797-5804.

Zrnic, T., Mazumdar, E., Sastry, S., and Jordan, M. (2021). Who leads and who follows in strategic classification? Advances in Neural Information Processing Systems, 34:15257-15269.

## A Technical Lemmas

### A.1 Boosting an expected guarantee to a high-probability guarantee

Consider any (possibly randomized) PAC learning algorithm A in the strategic setting which outputs a predictor A(S) after T steps of interaction with i.i.d. agents S ∼ D^T s.t. E[L^str(A(S))] ≤ ε, where the expectation is taken over both the randomness of S and the randomness of the algorithm. One standard way in classic PAC learning of boosting an expected loss guarantee to a high-probability loss guarantee is: run A on fresh data S and verify the loss of A(S) on a validation data set; if the validation loss is low, output the current A(S), and otherwise repeat the process. We adopt this method to boost the confidence as well. The only difference in our strategic setting is that we cannot re-use a validation data set, as we are only allowed to interact with the data through the interaction protocol. Our boosting scheme is described as follows. For rounds r = 1, . . . , R:

- Run A for T steps of interaction to obtain a predictor h_r.
- Apply h_r for the following m_0 rounds to obtain the empirical strategic loss over these m_0 rounds, denoted l̂_r = (1/m_0) Σ_{t=t_r+1}^{t_r+m_0} ℓ^str(h_r, (x_t, r_t, y_t)), where t_r + 1 is the starting time of these m_0 rounds.
- Break and output h_r if l̂_r ≤ 4ε.

If l̂_r > 4ε for all r ∈ [R], output an arbitrary hypothesis.

**Lemma 2.** Given an algorithm A which outputs a predictor A(S) after T steps of interaction with i.i.d. agents S ∼ D^T s.t. the expected loss satisfies E[L^str(A(S))] ≤ ε, let h_A denote the output of the above boosting scheme given algorithm A as input. By setting R = log(2/δ) and m_0 = 3 log(4R/δ)/(2ε), we have L^str(h_A) ≤ 8ε with probability at least 1 − δ. The total sample size is R(T + m_0) = O(log(1/δ)(T + log(1/δ)/ε)).

Proof. For all r = 1, . . . , R, we have E[L^str(h_r)] ≤ ε. By Markov's inequality, we have Pr(L^str(h_r) > 2ε) ≤ 1/2. For any fixed h_r, if L^str(h_r) ≥ 8ε, we will have l̂_r ≤ 4ε with probability at most e^{−m_0 ε}; if L^str(h_r) ≤ 2ε, we will have l̂_r ≤ 4ε with probability at least 1 − e^{−2m_0 ε/3}, by the Chernoff bound. Let E denote the event {∃r ∈ [R], L^str(h_r) ≤ 2ε} and F denote the event {l̂_r > 4ε for all r ∈ [R]}. When F does not hold, our boosting will output h_r for some r ∈ [R].
Lechner, T., Urner, R., and Ben-David, S. (2023). Strategic classification with unknown user manipulations. Milli, S., Miller, J., Dragan, A. D., and Hardt, M. (2019). The social cost of strategic classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 230 239. Montasser, O., Hanneke, S., and Srebro, N. (2019). Vc classes are adversarially robustly learnable, but only improperly. In Conference on Learning Theory, pages 2512 2530. PMLR. Rajaraman, N., Han, Y., Jiao, J., and Ramchandran, K. (2023). Beyond ucb: Statistical complexity and optimal algorithms for non-linear ridge bandits. ar Xiv preprint ar Xiv:2302.06025. Sundaram, R., Vullikanti, A., Xu, H., and Yao, F. (2021). Pac-learning for strategic classification. In International Conference on Machine Learning, pages 9978 9988. PMLR. Zhang, H. and Conitzer, V. (2021). Incentive-aware pac learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 5797 5804. Zrnic, T., Mazumdar, E., Sastry, S., and Jordan, M. (2021). Who leads and who follows in strategic classification? Advances in Neural Information Processing Systems, 34:15257 15269. A Technical Lemmas A.1 Boosting expected guarantee to high probability guarantee Consider any (possibly randomized) PAC learning algorithm A in strategic setting, which can output a predictor A(S) after T steps of interaction with i.i.d. agents S DT s.t. E [Lstr(A(S))] ε, where the expectation is taken over both the randomness of S and the randomness of algorithm. One standard way in classic PAC learning of boosting the expected loss guarantee to high probability loss guarantee is: running A on new data S and verifying the loss of A(S) on a validation data set; if the validation loss is low, outputting the current A(S), and repeating this process otherwise. We will adopt this method to boost the confidence as well. The only difference in our strategic setting is that we can not re-use validation data set as we are only allowed to interact with the data through the interaction protocol. Our boosting scheme is described in the following. For round r = 1, . . . , R, Run A for T steps of interactions to obtain a predictor hr. Apply hr for the following m0 rounds to obtain the empirical strategic loss on m0, denoted as blr = 1 m0 Ptr+m0 t=tr+1 ℓstr(hr, (xt, rt, yt)), where tr + 1 is the starting time of these m0 rounds. Break and output hr if blr 4ε. If for all r [R], blr > 4ε, output an arbitrary hypothesis. Lemma 2. Given an algorithm A, which can output a predictor A(S) after T steps of interaction with i.i.d. agents S DT s.t. the expected loss satisfies E [Lstr(A(S))] ε. Let h A denote the output of the above boosting scheme given algorithm A as input. By setting R = log 2 δ and m0 = 3 log(4R/δ) 2ε , we have Lstr(h A) 8ε with probability 1 δ. The total sample size is R(T + m0) = O(log( 1 δ )(T + log(1/δ) Proof. For all r = 1, . . . , R, we have E [Lstr(hr)] ε. By Markov s inequality, we have Pr(Lstr(hr) > 2ε) 1 For any fixed hr, if Lstr(hr) 8ε, we will have blr 4ε with probability e m0ε; if Lstr(hr) 2ε, we will have blr 4ε with probability 1 e 2m0ε/3 by Chernoff bound. Let E denote the event of { r [R], Lstr(hr) 2ε} and F denote the event of {blr > 4ε for all r [R]}. When F does not hold, our boosting will output hr for some r [R]. 
Pr(Lstr(h A) > 8ε) Pr(E, F) Pr(Lstr(h A) > 8ε|E, F) + Pr(E, F) + Pr( E) r=1 Pr(h A = hr, Lstr(hr) > 8ε|E, F) + Pr(E, F) + Pr( E) Re m0ε + e 2m0ε/3 + 1 by setting R = log 2 δ and m0 = 3 log(4R/δ) A.2 Converting mistake bound to PAC bound In any setting of (C, F), if there is an algorithm A that can achieve the mistake bound of B, then we can convert A to a conservative algorithm by not updating at correct rounds. The new algorithm can still achieve mistake bound of B as A still sees a legal sequence of examples. Given any conservative online algorithm, we can convert it to a PAC learning algorithm using the standard longest survivor technique (Gallant, 1986). Lemma 3. In any setting of (C, F), given any conservative algorithm A with mistake bound B, let algorithm A run A and output the first ft which survives over 1 δ ) examples. A can achieve sample complexity of O( B Proof of Lemma 3. When the sample size m B δ ), the algorithm A will produce at most B different hypotheses and there must exist one surviving for 1 δ ) rounds since A is a conservative algorithm with at most B mistakes. Let h1, . . . , h B denote these hypotheses and let t1, . . . , t B denote the time step they are produced. Then we have Pr(fout = hi and Lstr(hi) > ε) = E Pr(fout = hi and Lstr(hi) > ε|ti, z1:ti 1) ε) i=1 Pr z1:T(fout = hi and Lstr(hi) > ε) < δ. We are done. A.3 Smooth the distribution Lemma 4. For any two data distribution D1 and D2, let D3 = (1 p)D1 + p D2 be the mixture of them. For any setting of (C, F) and any algorithm, let PD be the dynamics of (C(x1), f1, y1, by1, F(x1, 1), . . . , C(x T ), f T , y T , by T , F(x T , T )) under the data distribution D. Then for any event A, we have |PD3(A) PD1(A)| 2p T. Proof. Let B denote the event of all (xt, ut, yt)T t=1 being sampled from D1. Then PD3( B) p T. Then PD3(A) = PD3(A|B)PD3(B) + PD3(A| B)PD3( B) = PD1(A)PD3(B) + PD3(A| B)PD3( B) = PD1(A)(1 PD3( B)) + PD3(A| B)PD3( B) . By re-arranging terms, we have |PD1(A) PD3(A)| = |PD1(A)PD3( B) PD3(A| B)PD3( B)| 2p T . B Proof of Theorem 1 Proof. When a mistake occurs, there are two cases. If ft misclassifies a true positive example (xt, rt, +1) by negative, we know that d(xt, ft) > rt while the target hypothesis h must satisfy that d(xt, h ) rt. Then any h VS with d(xt, h) d(xt, ft) cannot be h and are eliminated. Since d(xt, ft) is the median of {d(xt, h)|h VS}, we can eliminate half of the version space. If ft misclassifies a true negative example (xt, rt, 1) by positive, we know that d(xt, ft) rt while the target hypothesis h must satisfy that d(xt, h ) > rt. Then any h VS with d(xt, h) d(xt, ft) cannot be h and are eliminated. Since d(xt, ft) is the median of {d(xt, h)|h VS}, we can eliminate half of the version space. Each mistake reduces the version space by half and thus, the algorithm of Strategic Halving suffers at most log2(|H|) mistakes. C Proof of Theorem 2 Proof. In online learning setting, an algorithm is conservative if it updates it s current predictor only when making a mistake. It is straightforward to check that Strategic Halving is conservative. Combined with the technique of converting mistake bound to PAC bound in Lemma 3, we prove Theorem 2. D Proof of Theorem 3 Proof. Consider the feature space X = {0, e1, . . . , en, 0.9e1, . . . , 0.9en}, where ei s are standard basis vectors in Rn and metric d(x, x ) = x x 2 for all x, x X. Let the hypothesis class be a set of singletons over {ei|i [n]}, i.e., H = {21{ei} 1|i [n]}. 
We divide all possible hypotheses (not necessarily in H) into three categories: The hypothesis 21 1, which predicts all negative. For each x {0, 0.9e1, . . . , 0.9en}, let Fx,+ denote the class of hypotheses h predicting x as positive. For each i [n], let Fi denote the class of hypotheses h satisfying h(x) = 1 for all x {0, 0.9e1, . . . , 0.9en} and h(ei) = +1. And let F = i [n]Fi denote the union of them. Note that all hypotheses over X fall into one of the three categories. Now we consider a set of adversaries E1, . . . , En, such that the target function in the adversarial environment Ei is 21{ei} 1. We allow the learners to be randomized and thus, at round t, the learner draws an ft from a distribution D(ft) over hypotheses. The adversary, who only knows the distribution D(ft) but not the realization ft, picks an agent (xt, rt, yt) in the following way. Case 1: If there exists x {0, 0.9e1, . . . , 0.9en} such that Prft D(ft)(ft Fx,+) c for some c > 0, then for all j [n], the adversary Ej picks (xt, rt, yt) = (x, 0, 1). Let Bt 1,x denote the event of ft Fx,+. In this case, the learner will make a mistake with probability c. Since for all h H, h( (x, h, 0)) = h(x) = 1, they are all consistent with (x, 0, 1). Case 2: If Prft D(ft)(ft = 21 1) c, then for all j [n], the adversary Ej picks (xt, rt, yt) = (0, 1, +1). Let Bt 2 denote the event of ft = 21 1. In this case, with probability c, the learner will sample a ft = 21 1 and misclassify (0, 1, +1). Since for all h H, h( (0, h, 1)) = +1, they are all consistent with (0, 1, +1). Case 3: If the above two cases do not hold, let it = arg maxi [n] Pr(ft(ei) = 1|ft F ), xt = 0.9eit. For radius and label, different adversaries set them differently. Adversary Eit will set (rt, yt) = (0, 1) while other Ej for j = it will set (rt, yt) = (0.1, 1). Since Cases 1 and 2 do not hold, we have Prft D(ft)(ft F ) 1 (n + 2)c. Let Bt 3 denote the event of ft F and Bt 3,i denote the event of ft Fi. (a) With probability Pr(Bt 3,it) 1 n Pr(Bt 3) 1 (n+2)c n , the learner samples a ft Fit, and thus misclassifies (0.9eit, 0.1, 1) in Ej for j = it but correctly classifies (0.9eit, 0, 1). In this case, the learner observes the same feedback in all Ej for j = it and identifies the target function 21{eit} 1 in Eit. (b) If the learner samples a ft with ft(eit) = ft(0.9eit) = 1, then the learner observes xt = 0.9eit, yt = 1 and byt = 1 in all Ej for j [n]. Therefore the learner cannot distinguish between adversaries in this case. (c) If the learner samples a ft with ft(0.9eit) = +1, then the learner observes xt = 0.9eit, yt = 1 and byt = +1 in all Ej for j [n]. Again, since the feedback are identical in all Ej and the learner cannot distinguish between adversaries in this case. For any learning algorithm A, his predictions are identical in all of adversarial environments {Ej|j [n]} before he makes a mistake in Case 3(a) in one environment Eit. His predictions in the following rounds are identical in all of adversarial environments {Ej|j [n]} \ {Eit} before he makes another mistake in Case 3(a). Suppose that we run A in all adversarial environment of {Ej|j [n]} simultaneously. Note that once we make a mistake, the mistake must occur simultaneously in at least n 1 environments. Specifically, if we make a mistake in Case 1, 2 or 3(c), such a mistake simultaneously occur in all n environments. If we make a mistake in Case 3(a), such a mistake simultaneously occur in all n environments except Eit. 
Since we will make a mistake with probability at least min(c, 1 (n+2)c n ) at each round, there exists one environment in {Ej|j [n]} in which A will make n 1 mistakes. Now we lower bound the number of mistakes dependent on T. Let t1, t2, . . . denote the time steps in which we makes a mistake. Let t0 = 0 for convenience. Now we prove that Pr(ti > ti 1 + k|ti 1) = τ=ti 1+1 Pr(we don t make a mistake in round τ) τ=ti 1+1 (1(Case 3 at round τ)(1 1 (n + 2)c n ) + 1(Case 1 or 2 at round τ)(1 c)) (1 min(1 (n + 2)c n , c))k (1 1 2(n + 2))k , by setting c = 1 2(n+2). Then by letting k = 2(n + 2) ln(n/δ), we have Pr(ti > ti 1 + k|ti 1) δ/n . Pr(# of mistakes < min( T k + 1, n 1)) = Pr( i [n 1], ti ti 1 > k) i=1 Pr(ti ti 1 > k) δ . Therefore, we have proved that for any T, with probability at least 1 δ, we will make at least min( T 2(n+2) ln(n/δ)+1, n 1) mistakes. E Proof of Theorem 4 Algorithm 3 MWMR (Multiplicative Weights on Mistake Rounds) 1: Initialize the version space VS = H. 2: for t=1,...,T do 3: Pick one hypotheses ft from VS uniformly at random. 4: if byt = yt and yt = + then 5: VS VS \ {h VS|d(xt, h) d(xt, ft)}. 6: else if byt = yt and yt = then 7: VS VS \ {h VS|d(xt, h) d(xt, ft)}. 8: end if 9: end for Proof. First, when the algorithm makes a mistake at round t, he can at least eliminate ft. Therefore, the total number of mistakes will be upper bounded by |H| 1. Let pt denote the fraction of hypotheses misclassifying xt. We say a hypothesis h is inconsistent with (xt, ft, yt, byt) iff (d(xt, h) d(xt, ft) byt = yt = +) or (d(xt, h) d(xt, ft) byt = + yt = ). Then we define the following events. Et denotes the event that MWMR makes a mistake at round t. We have Pr(Et) = pt. Bt denotes the event that at least pt 2 fraction of hypotheses are inconsistent with (xt, ft, yt, byt). We have Pr(Bt|Et) 1 Let n = |H| denote the cardinality of hypothesis class and nt denote the number of hypotheses in VS after round t. Then we have t=1 (1 1(Et)1(Bt)pt By taking logarithm of both sides, we have 0 ln(n T ) = ln(n) + t=1 ln(1 1(Et)1(Bt)pt t=1 1(Et)1(Bt)pt where the last inequality adopts ln(1 x) x for x [0, 1). Then by taking expectation of both sides, we have t=1 Pr(Et Bt)pt Since Pr(Et) = pt and Pr(Bt|Et) 1 2, then we have t=1 p2 t ln(n) . Then we have the expected number of mistakes E [MMWMR(T)] as E [MMWMR(T)] = where the first inequality applies Cauchy-Schwarz inequality. F Proof of Theorem 5 Proof. Construction of Q, H and a set of realizable distributions Let feature space X = {0, e1, . . . , en} X0, where X0 = { σ(0,1,...,n 1) z |σ Sn} with 12+...+(n 1)2 α for some small α = 0.1. Here Sn is the set of all permutations over n elements. So X0 is the set of points whose coordinates are a permutation of {0, 1/z, . . . , (n 1)/z} and all points in X0 have the ℓ2 norm equal to α. Define a metric d by letting d(x1, x2) = x1 x2 2 for all x1, x2 X. Then for any x X0 and i [n], d(x, ei) = x ei 2 = q (xi 1)2 + P j =i x2 j = q 1 + Pn j=1 x2 j 2xi = 1 + α2 2xi. Note that we consider space (X, d) rather than (Rn, 2). Let the hypothesis class be a set of singletons over {ei|i [n]}, i.e., H = {21{ei} 1|i [n]}. We now define a collection of distributions {Di|i [n]} in which Di is realized by 21{ei} 1. For any i [n], Di puts probability mass 1 3nε on (0, 0, 1). For the remaining 3nε probability mass, Di picks x uniformly at random from X0 and label it as positive. If xi = 0, set radius r(x) = ru := 1 + α2; otherwise, set radius r(x) = rl := q z). Hence, X0 are all labeled as positive. 
For j = i, hj = 21{ej} 1 labels {x X0|xj = 0} negative since r(x) = rl and d(x, hj) = ru > r(x). Therefore, Lstr(hj) = 1 n 3nε = 3ε. To output fout H, we must identify the true target function. Information gain from different choices of ft Let h = 21{ei } 1 denote the target function. Since (0, 0, 1) is realized by all hypotheses, we can only gain information about the target function when xt X0. For any xt X0, if d(xt, ft) rl or d(xt, ft) > ru, we cannot learn anything about the target function. In particular, if d(xt, ft) rl, the learner will observe xt Unif(X0), yt = +1, byt = +1 in all {Di|i [n]}. If d(xt, ft) > ru, the learner will observe xt Unif(X0), yt = +1, byt = 1 in all {Di|i [n]}. Therefore, we cannot obtain any information about the target function. Now for any xt X0, with the it-th coordinate being 0, we enumerate the distance between x and x for all x X. For all x X0, d(x, x ) x + x 2α < rl; For all j = it, d(x, ej) = p 1 + α2 2xj rl; d(x, eit) = ru; d(x, 0) = α < rl. Only ft = 21{eit} 1 satisfies that rl < d(xt, ft) ru and thus, we can only obtain information when ft = 21{eit} 1. And the only information we learn is whether it = i because if it = i , no matter which i is, our observation is identical. If it = i , we can eliminate 21{eit} 1. Sample size analysis For any algorithm A, his predictions are identical in all environments {Di|i [n]} before a round t in which ft = 21{eit} 1. Then either he learns it in Dit or he eliminates 21{eit} 1 and continues to perform the same in the other environments {Di|i = it}. Suppose that we run A in all stochastic environments {Di|i [n]} simultaneously. When we identify it in environment Dit, we terminate A in Dit. Consider a good algorithm A which can identify i in Di with probability 7 8 after T rounds of interaction for each i [n], that is, Pr Di,A(iout = i) 1 8, i [n] . (3) Therefore, we have X i [n] Pr Di,A(iout = i) n Let n T denote the number of environments that have been terminated by the end of round T. Let Bt denote the event of xt being in X0 and Ct denote the event of ft = 21{eit} 1. Then we have Pr(Bt) = 3nε and Pr(Ct|Bt) = 1 n, and thus Pr(Bt Ct) = 3nε 1 n. Since at each round, we can eliminate one environment only when Bt Ct is true, then we have t=1 1(Bt Ct) Therefore, by setting T = n 6ε and Markov s inequality, we have When there are n 2 + 1 environments remaining, the algorithm has to pick one iout, which fails in at least n 2 of the environments. Then we have X i [n] Pr Di,A(iout = i) ln m Pr(n T jn which conflicts with Eq (4). Therefore, for any algorithm A, to achieve Eq (3), it requires T n G Proof of Theorem 6 Given Lemma 1, we can upper bound the expected strategic loss, then we can boost the confidence of the algorithm through the scheme in Section A.1. Theorem 6 follows by combining Lemma 1 and Lemma 2. Now we only need to prove Lemma 1. Proof of Lemma 1. For any set of hypotheses H, for every z = (x, r, y), we define κp(H, z) := |{h H|h( (x, h, r)) = }| if y = + , 0 otherwise. So κp(H, z) is the number of hypotheses mislabeling z for positive z s and 0 for negative z s. Similarly, we define κn as follows, κn(H, z) := |{h H|h( (x, h, r)) = +}| if y = , 0 otherwise. So κn(H, z) is the number of hypotheses mislabeling z for negative z s and 0 for positive z s. In the following, we divide the proof into two parts. First, recall that in Algorithm 2, the output is constructed by randomly sampling two hypotheses with replacement and taking the union of them. 
We represent the loss of such a random predictor using κp(H, z) and κn(H, z) defined above. Then we show that whenever the algorithm makes a mistake, with some probability, we can reduce κp(VSt 1,zt) 2 or κn(VSt 1,zt) 2 hypotheses and utilize this to provide a guarantee on the loss of the final output. Upper bounds on the strategic loss For any hypothesis h, let fpr(h) and fnr(h) denote the false positive rate and false negative rate of h respectively. Let p+ denote the probability of drawing a positive sample from D, i.e., Pr(x,r,y) D(y = +) and p denote the probability of drawing a negative sample from D. Let D+ and D denote the data distribution conditional on that the label is positive and that the label is negative respectively. Given any set of hypotheses H, we define a random predictor R2(H) = h1 h2 with h1, h2 randomly picked from H with replacement. For a true positive z, R2(H) will misclassify it with probability κp(H,z)2 |H|2 . Then we can find that the false negative rate of R2(H) is fnr(R2(H)) = Ez=(x,r,+) D+ [Pr(R2(H)(x) = )] = Ez=(x,r,+) D+ " κp(H, z)2 Similarly, for a true negative z, R2(H) will misclassify it with probability 1 (1 κn(H,z) |H| . Then the false positive rate of R2(H) is fpr(R2(H)) = Ez=(x,r, ) D [Pr(R2(H)(x) = +)] Ez=(x,r, ) D+ Hence the loss of R2(H) is Lstr(R2(H)) p+Ez D+ " κp(H, z)2 " κp(H, z)2 |H|2 + 2κn(H, z) where the last equality holds since κp(H, z) = 0 for true negatives and κn(H, z) = 0 for true positives. Loss analysis In each round, the data zt = (xt, rt, yt) is sampled from D. When the label yt is positive, if the drawn ft satisfying that 1) ft( (xt, ft, rt)) = and 2) d(xt, ft) median({d(xt, h)|h VSt 1, h( (xt, h, rt)) = }), then we are able to remove κp(VSt 1,zt) 2 hypotheses from the version space. Let Ep,t denote the event of ft satisfying the conditions 1) and 2). With probability 1 log2(nt) , we sample kt = 1. Then we sample an ft Unif(VSt 1). With probability κp(VSt 1,zt) 2nt , the sampled ft satisfies the two conditions. So we have Pr(Ep,t|zt, VSt 1) 1 log2(nt) κp(VSt 1, zt) The case of yt being negative is similar to the positive case. Let En,t denote the event of ft satisfying that 1) ft( (xt, ft, rt)) = + and 2) d(xt, ft) median({d(xt, h)|h VSt 1, h( (xt, h, rt)) = +}). If κn(VSt 1, zt) nt 2 , then with probability 1 log2(nt) , we sample kt = 1. Then with probability greater than 1 4 we will sample an ft satisfying that 1) ft( (xt, ft, rt)) = + and 2) d(xt, ft) median({d(xt, h)|h VSt 1, h( (xt, h, rt)) = +}). If κn(VSt 1, zt) < nt 2 , then with probability 1 log2(nt) , we sampled a kt satisfying nt 4κn(VSt 1, zt) < kt nt 2κn(VSt 1, zt) . Then we randomly sample kt hypotheses and the expected number of sampled hypotheses which mislabel zt is kt κn(VSt 1,zt) 2]. Let gt (given the above fixed kt) denote the number of sampled hypotheses which mislabel xt and we have E [gt] ( 1 2]. When gt > 0, ft will misclassify zt by positive. We have Pr(gt = 0) = (1 κn(VSt 1, zt) nt )kt < (1 κn(VSt 1, zt) nt 4κn(VSt 1,zt) e 1/4 0.78 and by Markov s inequality, we have Pr(gt 3) E [gt] Thus Pr(gt {1, 2}) 0.05. Conditional on gt is either 1 or 2, with probability 1 4, all of these gt hypotheses h satisfies d(xt, h ) median({d(xt, h)|h VSt 1, h( (xt, h, rt)) = +}), which implies that d(xt, ft) median({d(xt, h)|h VSt 1, h( (xt, h, rt)) = +}). Therefore, we have Pr(En,t|zt, , VSt 1) 1 80 log2(nt) . (7) Let vt denote the fraction of hypotheses we eliminated at round t, i.e., vt = 1 nt+1 nt . 
Then we have vt 1(Ep,t)κp(VSt 1, zt) 2nt + 1(En,t)κn(VSt 1, zt) Since nt+1 = nt(1 vt), we have 1 n T +1 = n t=1 (1 vt) . By taking logarithm of both sides, we have 0 ln n T +1 = ln n + t=1 ln(1 vt) ln n where we use ln(1 x) x for x [0, 1) in the last inequality. By re-arranging terms, we have t=1 vt ln n . Combined with Eq (8), we have t=1 1(Ep,t)κp(VSt 1, zt) 2nt + 1(En,t)κn(VSt 1, zt) By taking expectation w.r.t. the randomness of f1:T and dataset S = z1:T on both sides, we have t=1 Ef1:T ,z1:T 1(Ep,t)κp(VSt 1, zt) 2nt + 1(En,t)κn(VSt 1, zt) Since the t-th term does not depend on ft+1:T , zt+1:T and VSt 1 is determined by z1:t 1 and f1:t 1, the t-th term becomes 1(Ep,t)κp(VSt 1, zt) 2nt + 1(En,t)κn(VSt 1, zt) =Ef1:t 1,z1:t 1(Ep,t)κp(VSt 1, zt) 2nt + 1(En,t)κn(VSt 1, zt) 2nt |f1:t 1, z1:t =Ef1:t 1,z1:t Eft [1(Ep,t)|f1:t 1, z1:t] κp(VSt 1, zt) 2nt + Eft [1(En,t)|f1:t 1, z1:t] κn(VSt 1, zt) Ef1:t 1,z1:t " 1 log2(nt) κ2 p(VSt 1, zt) 4n2 t + 1 80 log2(nt) κn(VSt 1, zt) where Eq (9) holds due to that VSt 1 is determined by f1:t 1, z1:t 1 and does not depend on ft and Eq (10) holds since Prft(Ep,t|f1:t 1, z1:t) = Prft(Ep,t|VSt 1, zt) 1 log2(nt) κp(VSt 1,zt) 2nt by Eq (6) and Prft(En,t|f1:t 1, z1:t) = Prft(En,t|VSt 1, zt) 1 80 log2(nt) by Eq (7). Thus, we have t=1 Ef1:t 1,z1:t " 1 log2(nt) κ2 p(VSt 1, zt) 4n2 t + 1 80 log2(nt) κn(VSt 1, zt) Since zt D and zt is independent of z1:t 1 and f1:t 1, thus, we have the t-th term on the LHS being Ef1:t 1,z1:t " 1 log2(nt) κ2 p(VSt 1, zt) 4n2 t + 1 80 log2(nt) κn(VSt 1, zt) =Ef1:t 1,z1:t 1 " 1 log2(nt) κ2 p(VSt 1, zt) 4n2 t + 1 80 log2(nt) κn(VSt 1, zt) 1 320 log2(n)Ef1:t 1,z1:t 1 " κ2 p(VSt 1, z) n2 t + 2κn(VSt 1, z) 1 320 log2(n)Ef1:t 1,z1:t 1 Lstr(R2(VSt 1)) , where the last inequality adopts Eq (5). By summing them up and re-arranging terms, we have Ef1:T ,z1:T t=1 Lstr(R2(VSt 1)) t=1 Ef1:t 1,z1:t 1 Lstr(R2(VSt 1)) 320 log2(n) ln(n) For the output of Algorithm 2, which randomly picks τ from [T], randomly samples h1, h2 from VSτ 1 with replacement and outputs h1 h2, the expected loss is E Lstr(A(S)) =ES,f1:T t=1 Eh1,h2 Unif(VSt 1) Lstr(h1 h2) # t=1 Lstr(R2(VSt 1)) 320 log2(n) ln(n) when T 320 log2(n) ln(n) Post proof discussion of Lemma 1 Upon first inspection, readers might perceive a resemblance between the proof of the loss analysis section and the standard proof of converting regret bound to error bound.This standard proof converts a regret guarantee on f1:T to an error guarantee of 1 T PT t=1 ft. However, in this proof, the predictor employed in each round is ft, while the output is an average over R2(VSt 1) for all t [T]. Our algorithm does not provide a regret guarantee on f1:T . Please note that our analysis exhibits asymmetry regarding losses on true positives and true negatives. Specifically, the probability of identifying and reducing half of the misclassifying hypotheses on true positives, denoted as Pr(Ep,t|zt, VSt 1) (Eq (6)), is lower than the corresponding probability for true negatives, Pr(En,t|zt, VSt 1) (Eq (7)). This discrepancy arises due to the different levels of difficulty in detecting misclassifying hypotheses. For example, if there is exactly one hypothesis h misclassifying a true positive zt = (xt, rt, yt), it is very hard to detect this h. We must select an ft satisfying that d(xt, ft) > d(xt, h ) for all h H \ {h} (hence ft will make a mistake), and that d(xt, ft) d(xt, h) (so that we will know h misclassifies zt). Algorithm 2 controls the distance d(xt, ft) through kt, which is the number of hypotheses in the union. 
In this case, we can only detect h when kt = 1 and ft = h, which occurs with probability 1 nt log(nt). However, if there is exactly one hypothesis h misclassifying a true negative zt = (xt, rt, yt), we have that d(xt, h) = minh H d(xt, h ). Then by setting ft = h Hh, which will makes a mistake and tells us h is a misclassifying hypothesis. Our algorithm will pick such an ft with probability 1 log(nt). H Proof of Theorem 7 Proof. We will prove Theorem 7 by constructing an instance of Q and H and showing that for any conservative learning algorithm, there exists a realizable data distribution s.t. achieving ε loss requires at least eΩ( |H| ε ) samples. Construction of Q, H and a set of realizable distributions Let the input metric space (X, d) be constructed in the following way. Consider the feature space X = {e1, . . . , en} X0, where X0 = { σ(0,1,...,n 1) z |σ Sn} with z = 12+...+(n 1)2 α for some small α = 0.1. Here Sn is the set of all permutations over n elements. So X0 is the set of points whose coordinates are a permutation of {0, 1/z, . . . , (n 1)/z} and all points in X0 have the ℓ2 norm equal to α. We define the metric d by restricting ℓ2 distance to X, i.e., d(x1, x2) = x1 x2 2 for all x1, x2 X. Then we have that for any x X0 and i [n], the distance between x and ei is d(x, ei) = x ei 2 = s (xi 1)2 + X j =i x2 j = j=1 x2 j 2xi = p 1 + α2 2xi , which is greater than 1 + α2 2α > 0.8 > 2α. For any two points x, x X0, d(x, x ) 2α by triangle inequality. Let the hypothesis class be a set of singletons over {ei|i [n]}, i.e., H = {21{ei} 1|i [n]}. We now define a collection of distributions {Di|i [n]} in which Di is realized by 21{ei} 1. For any i [n], we define Di in the following way. Let the marginal distribution DX over X be uniform over X0. For any x, the label y is + with probability 1 6ε and with probability 6ε, i.e., D(y|x) = Rad(1 6ε). Note that the marginal distribution DX Y = Unif(X0) Rad(1 6ε) is identical for any distribution in {Di|i [n]} and does not depend on i. If the label is positive y = +, then let the radius r = 2. If the label is negative y = , then let 1 + α2 2(xi + 1 z), which guarantees that x can be manipulated to ej iff d(x, ej) < d(x, ei) for all j [n]. Since xi α and 1 z α, we have q 1 + α2 2(xi + 1 z) 1 4α > 2α. Therefore, for both positive and negative examples, we have radius r strictly greater than 2α in both cases. Randomization and improperness of the output fout do not help Note that algorithms are allowed to output a randomized fout and to output fout / H. We will show that randomization and improperness of fout don t make the problem easier. That is, supposing that the data distribution is Di for some i [n], finding a (possibly randomized and improper) fout is not easier than identifying i . Since our feature space X is finite, we can enumerate all hypotheses not equal to 21{ei } 1 and calculate their strategic population loss as follows. 21 1 predicts all negative and thus Lstr(21 1) = 1 6ε; For any a X s.t. a X0 = , 21a 1 will predict any point drawn from Di as positive (since all points have radius greater than 2α and the distance between any two points in X0 is smaller than 2α) and thus Lstr(21a 1) = 6ε; For any a {e1, . . . , en} satisfying that i = i , ei a, we have Lstr(21a 1) 3ε. This is due to that when y = , x is chosen from Unif(X0) and the probability of d(x, ei) < d(x, ei ) is 1 2. When d(x, ei) < d(x, ei ), 21a 1 will predict x as positive. 
Under distribution Di , if we are able to find a (possibly randomized) fout with strategic loss of Lstr(fout) ε, then we have Lstr(fout) = Eh fout [Lstr(h)] Prh fout(h = 21{ei } 1) 3ε. Thus, Prh fout(h = 21{ei } 1) 2 3. Hence, if we are able to find a (possibly randomized) fout with ε error, then we are able to identify i by checking which realization of fout has probability greater than 2 3. In the following, we will focus on the sample complexity to identify i . Let iout denote the algorithm s answer to question what is i ? . Conservative algorithms When running a conservative algorithm, the rule of choosing ft at round t and choosing the final output fout does not depend on the correct rounds, i.e. {τ [T]|byτ = yτ}. Let s define t = t if byt = yt if byt = yt , (11) where is just a symbol representing no information . Then for any conservative algorithm, the selected predictor ft is determined by (fτ, byτ, yτ, τ) for τ < t and the final output fout is determined by (ft, byt, yt, t)T t=1. From now on, we consider t as the feedback in the learning process of a conservative algorithm since it make no difference from running the same algorithm with feedback t. Smooth the data distribution For technical reasons (appearing later in the analysis), we don t want to analyze distribution {Di|i [n]} directly as the probability of t = ei is 0 when ft(ei) = +1 under distribution Di. Instead, we consider the mixture of Di and another distribution D i , which is identical to Di except that r(x) = d(x, ei) when y = . More specifically, let D i = (1 p)Di + p D i with some extremely small p, where D i s marginal distribution over X Y is still Unif(X0) Rad(1 6ε); the radius is r = 2 when y = +, ; and the radius is r = d(x, ei) when y = . For any data distribution D, let PD be the dynamics of (f1, y1, by1, 1, . . . , f T , y T , by T , T ) under D. According to Lemma 4, by setting p = ε 16n2 , when T n ε , with high probability we never sample from D i and have that for any i, j [n] PDi(iout = j) PD i(iout = j) 1 From now on, we only consider distribution D i instead of Di. The readers might have the question that why not using D i for construction directly. This is because D i does not satisfy realizability and no hypothesis has zero loss under D i. Information gain from different choices of ft In each round of interaction, the learner picks a predictor ft, which can be out of H. Here we enumerate all choices of ft. ft( ) = 21 1 predicts all points in X by negative. No matter what i is, we will observe ( t = xt, yt) Unif(X0) Rad(1 6ε) and byt = . They are identically distributed for all i [n], and thus, t is also identically distributed. We cannot tell any information of i from this round. ft = 21at 1 for some at X s.t. a X0 = . Then t = (xt, ft, rt) = (xt, ft, 2α) since rt > 2α and d(xt, ft) 2α, byt = +, yt Rad(1 6ε). None of these depends on i and again, the distribution of (byt, yt, t) is identical for all i and we cannot tell any information of i from this round. ft = 21at 1 for some non-empty at {e1, . . . , en}. For rounds with yt = +, we have byt = + and t = (xt, ft, 2), which still not depend on i . Thus we cannot learn any information about i . But we can learn when yt = . For rounds with yt = , if t at, then we could observe byt = + and t = t, which at least tells that 21{ t} 1 is not the target function (with high probability); if t / at, then byt = and we observe t = . Therefore, we only need to focus on the rounds with ft = 21at 1 for some non-empty at {e1, . . . , en} and yt = . 
It is worth noting that drawing an example x from X0 uniformly, it is equivalent to uniformly drawing a permutation of H such that the distances between x and h over all h H are permuted according to it. Then t = ej iff ej at, d(x, ej) d(x, ei ) and d(x, ej) d(x, el) for all el at. Let kt = |at| denote the cardinality of at. In such rounds, under distribution D i , the distribution of t are described as follows. 1. The case of ei at: For all j at \ {i }, with probability 1 kt , d(xt, ej) = minel at d(xt, el) and thus, t = t = ej and byt = + (mistake round). With probability 1 kt , we have d(xt, ei ) = minel at d(xt, el). If the example is drawn from Di , we have t = xt and yt = (correct round), thus t = . If the example is drawn from D i , we have we have t = t = ei and yt = + (mistake round). Therefore, according to the definition of t (Eq (11)), we have ej w.p. 1 kt for ej at, j = i ei w.p. 1 kt p w.p. 1 kt (1 p) . We denote this distribution by P (at, i ). 2. The case of ei / at: For all j at, with probability 1 kt+1, then d(xt, ej) = minel at {ei } d(xt, el) and thus, t = ej and byt = + (mistake round). With probability 1 kt+1, we have d(x, ei ) < minel at d(xt, el) and thus, t = xt, byt = (correct round), and t = . Therefore, the distribution of t is ( ej w.p. 1 kt+1 for ej at w.p. 1 kt+1 . We denote this distribution by P/ (at). To measure the information obtained from t, we will utilize the KL divergence of the distribution of t under the data distribution Di from that under a benchmark distribution. Let D = 1 i n D i denote the average distribution. The process of sampling from D is equivalent to sampling i uniformly at random from [n] first and drawing a sample from Di . Then under D, for any ej at, we have Pr( t = ej) = Pr(i = j) Pr( t = ej|i = j) + Pr(i at \ {j}) Pr( t = ej|i at \ {j}) + Pr(i / at) Pr( t = ej|i / at) n 1 kt + 1 = nkt 1 + p(kt + 1) nkt(kt + 1) , Pr( t = ) = Pr(i at) Pr( t = |i at) + Pr(i / at) Pr( t = |i / at) n 1 kt + 1 = n + 1 p(kt + 1) n(kt + 1) . Thus, the distribution of t under D is ( ej w.p. nkt 1+p(kt+1) nkt(kt+1) for ej at w.p. n+1 p(kt+1) We denote this distribution by P(at). Next we will compute the KL divergences of P (at, i ) and P/ (at) from P(at). We will use the inequality log(1+x) x for x 0 in the following calculation. For any i s.t. ei at, we have DKL(P(at) P (at, i )) =(kt 1)nkt 1 + p(kt + 1) nkt(kt + 1) log(nkt 1 + p(kt + 1) nkt(kt + 1) kt) + nkt 1 + p(kt + 1) nkt(kt + 1) log(nkt 1 + p(kt + 1) nkt(kt + 1) kt + n + 1 p(kt + 1) n(kt + 1) log(n + 1 p(kt + 1) n(kt + 1) kt 1 p) 0 + 1 kt + 1 log(1 p) + 2p kt + 1 = 1 kt + 1 log(1 p) + 2p kt + 1 , (13) DKL(P(at) P/ (at)) =kt nkt 1 + p(kt + 1) nkt(kt + 1) log(nkt 1 + p(kt + 1) nkt(kt + 1) (kt + 1)) + n + 1 p(kt + 1) n(kt + 1) log(n + 1 p(kt + 1) n(kt + 1) (kt + 1)) 0 + n + 1 n2(kt + 1) = n + 1 n2(kt + 1) . (14) Lower bound of the information We utilize the information theoretical framework of proving lower bounds for linear bandits (Theorem 11 by Rajaraman et al. (2023)) here. For notation simplicity, for all i [n], let Pi denote the dynamics of (f1, 1, y1, by1, . . . , f T , T , y T , by T ) under D i and P denote the dynamics under D. Let Bt denote the event of {ft = 21at 1 for some non-empty at {e1, . . . , en}}. As discussed before, for any at, conditional on Bt or yt = +1, ( t, yt, byt) are identical in all {D i|i [n]}, and therefore, also identical in D. We can only obtain information at rounds when Bt (yt = 1) occurs. 
In such rounds, we know that ft is fully determined by history (possibly with external randomness , which does not depend on data distribution), yt = 1 and byt is fully determined by t (byt = +1 iff. t at). Therefore, conditional the history Ht 1 = (f1, 1, y1, by1, . . . , ft 1, t 1, yt 1, byt 1) before time t, we have DKL(P(ft, t, yt, byt|Ht 1) Pi(ft, t, yt, byt|Ht 1)) =P(Bt (yt = 1))DKL(P( t|Ht 1, Bt (yt = 1)) Pi( t|Ht 1, Bt (yt = 1))) =6εP(Bt)DKL(P( t|Ht 1, Bt (yt = 1)) Pi( t|Ht 1, Bt (yt = 1))) , (15) where the last equality holds due to that yt Rad(1 6ε) and does not depend on Bt. For any algorithm that can successfully identify i under the data distribution Di with probability 3 4 for all i [n], then PDi(iout = i) 3 4 and PDj(iout = i) 1 4 for all j = i. Recall that Di and D i are very close when the mixture parameter p is small. Combining with Eq (12), we have |Pi(iout = i) Pj(iout = i)| PDi(iout = i) PDj(iout = i) |PDi(iout = i) Pi(iout = i)| PDj(iout = i) Pj(iout = i) Then we have the total variation distance between Pi and Pj TV(Pi, Pj) |Pi(iout = i) Pj(iout = i)| 1 Then we have Ei Unif([n]) TV2(Pi, P(i+1) mod n) 4Ei Unif([n]) TV2(Pi, P) 2Ei DKL(P Pi) (Pinsker s ineq) t=1 DKL(P(ft, t, yt, byt|Ht 1) Pi(ft, t, yt, byt|Ht 1)) (Chain rule) P(Bt)DKL(P( t|Ht 1, Bt (yt = 1)) Pi( t|Ht 1, Bt (yt = 1))) (Apply Eq (15)) i=1 DKL(P( t|Ht 1, Bt (yt = 1)) Pi( t|Ht 1, Bt (yt = 1))) i:i at DKL(P(at) P (at, i)) + X i:i/ at DKL(P(at) P/ (at)) t=1 Ef1:T P 1 kt + 1 log(1 p) + 2p kt + 1 n + 1 n2(kt + 1) (Apply Eq (13),(14)) p) + 2p + 1) 12Tε(log(16n2/ε) + 2) Combining with Eq (16), we have that there exists a universal constant c such that T cn ε(log(n/ε)+1). I Proof of Theorem 8 Proof. We will prove Theorem 8 by constructing an instance of Q and H and then reduce it to a linear stochastic bandit problem. Construction of Q, H and a set of realizable distributions Consider the input metric space in the shape of a star, where X = {0, 1, . . . , n} and the distance function of d(0, i) = 1 and d(i, j) = 2 for all i = j [n]. Let the hypothesis class be a set of singletons over [n], i.e., H = {21{i} 1|i [n]}. We define a collection of distributions {Di|i [n]} in which Di is realized by 21{i} 1. The data distribution Di put 1 3(n 1)ε on (0, 1, +) and 3ε on (i, 1, ) for all i = i . Hence, note that all distributions in {Di|i [n]} share the same distribution support {(0, 1, +)} {(i, 1, )|i [n]}, but have different weights. Randomization and improperness of the output fout do not help. Note that algorithms are allowed to output a randomized fout and to output fout / H. We will show that randomization and improperness of fout don t make the problem easier. Supposing that the data distribution is Di for some i [n], finding a (possibly randomized and improper) fout is not easier than identifying i . Since our feature space X is finite, we can enumerate all hypotheses not equal to 21{i } 1 and calculate their strategic population loss as follows. The hypothesis 21 1 will predict all by negative and thus Lstr(21 1) = 1 3(n 1)ε. For any hypothesis predicting 0 by positive, it will predict all points in the distribution support by positive and thus incurs strategic loss 3(n 1)ε. For any hypothesis predicting 0 by negative and some i = i by positive, then it will misclassify (i, 1, ) and incur strategic loss 3ε. Therefore, for any hypothesis h = 21{i } 1, we have Lstr Di (h) 3ε. 
Similar to the proof of Theorem 7, under distribution Di , if we are able to find a (possibly randomized) fout with strategic loss Lstr(fout) ε. Then Prh fout(h = 21{i } 1) 2 3. We can identify i by checking which realization of fout has probability greater than 2 3. In the following, we will focus on the sample complexity to identify the target function 21{i } 1 or simply i . Let iout denote the algorithm s answer to question of what is i ? . Smooth the data distribution For technical reasons (appearing later in the analysis), we don t want to analyze distribution {Di|i [n]} directly as the probability of (i, 1, ) is 0 under distribution Di. Instead, for each i [n], let D i = (1 p)Di + p D i be the mixture of Di and D i for some small p, where D i = (1 3(n 1)ε)1{(0,1,+)} + 3(n 1)ε1{(i,1, )}. Specifically, 1 3(n 1)ε for z = (0, 1, +) 3(1 p)ε for z = (j, 1, ), j = i 3(n 1)pε for z = (i, 1, ) For any data distribution D, let PD be the dynamics of (f1, y1, by1, . . . , f T , y T , by T ) under D. According to Lemma 4, by setting p = ε 16n2 , when T n ε , we have that for any i, j [n] PDi(iout = j) PD i(iout = j) 1 From now on, we only consider distribution D i instead of Di. The readers might have the question that why not using D i for construction directly. This is because no hypothesis has zero loss under D i, and thus D i does not satisfy realizability requirement. Information gain from different choices of ft Note that in each round, the learner picks a ft and then only observes byt and yt. Here we enumerate choices of ft as follows. 1. ft = 21 1 predicts all points in X by negative. No matter what i is, we observe byt = and yt = 21(xt = 0) 1. Hence (byt, yt) are identically distributed for all i [n], and thus, we cannot learn anything about i from this round. 2. ft predicts 0 by positive. Then no matter what i is, we have byt = + and yt = 1(xt = 0). Thus again, we cannot learn anything about i . 3. ft = 21at 1 for some non-empty at [n]. For rounds with xt = 0, we have byt = yt = + no matter what i is and thus, we cannot learn anything about i . For rounds with yt = , i.e., xt = 0, we will observe byt = ft( (xt, ft, 1)) = 1(xt at). Hence, we can only extract information with the third type of ft at rounds with xt = 0. Reduction to stochastic linear bandits In rounds with ft = 21at 1 for some non-empty at [n] and xt = 0, our problem is identical to a stochastic linear bandit problem. Let us state our problem as Problem 1 and a linear bandit problem as Problem 2. Let A = {0, 1}n \ {0}. Problem 1. The environment picks an i [n]. At each round t, the environment picks xt {ei|i [n]} with P(i) = 1 p n 1 for i = i and P(i ) = p and the learner picks an at A (where we use a n-bit string to represent at and at,i = 1 means that at predicts i by positive). Then the learner observes byt = 1(a t xt > 0) (where we use 0 to represent nagative label). Problem 2. The environment picks a linear parameter w {wi|i [n]} with wi = 1 p n 1 p)ei. The arm set is A. For each arm a A, the reward is i.i.d. from the following distribution: rw(a) = 1, w.p. w a , 0 . (18) If the linear parameter w = wi , the optimal arm is ei . Claim 1. For any δ > 0, for any algorithm A that identify i correctly with probability 1 δ within T rounds for any i [n] in Problem 1, we can construct another algorithm A can also identify the optimal arm in any environment with probability 1 δ within T rounds in Problem 2. This claim follows directly from the problem descriptions. 
Given any algorithm A for Problem 1, we can construct another algorithm A which simulates A. At round t, if A selects predictor at, then A picks arm the same as at. Then A observes a reward rwi (at), which is 1 w.p. wi at and feed rwi (at) to A. Since byt in Problem 1 is 1 w.p. Pn i=1 at,i P(i) = wi at, it is distributed identically as rwi (at). Since A will be able to identify i w.p. 1 δ in T rounds, A just need to output ei as the optimal arm. Then any lower bound on T for Problem 2 also lower bounds Problem 1. Hence, we adopt the information theoretical framework of proving lower bounds for linear bandits (Theorem 11 by Rajaraman et al. (2023)) to prove a lower bound for our problem. In fact, we also apply this framework to prove the lower bounds in other settings of this work, including Theorem 7 and Theorem 9. Lower bound of the information For notation simplicity, for all i [n], let Pi denote the dynamics of (f1, y1, by1, . . . , f T , y T , by T ) under D i and and P denote the dynamics under D = 1 n D i. Let Bt denote the event of {ft = 21at 1 for some non-empty at [n]}. As discussed before, for any at, conditional on Bt or yt = +1, (xt, yt, byt) are identical in all {D i|i [n]}, and therefore, also identical in D. We can only obtain information at rounds when Bt yt = 1 occurs. In such rounds, ft is fully determined by history (possibly with external randomness , which does not depend on data distribution), yt = 1 and byt = rw(at) with rw(at) sampled from the distribution defined in Eq (18). For any algorithm that can successfully identify i under the data distribution Di with probability 3 4 for all i [n], then PDi(iout = i) 3 4 and PDj(iout = i) 1 4 for all j = i. Recall that Di and D i are very close when the mixture parameter p is small. Combining with Eq (17), we have |Pi(iout = i) Pj(iout = i)| PDi(iout = i) PDj(iout = i) |PDi(iout = i) Pi(iout = i)| PDj(iout = i) Pj(iout = i) Let w = 1 n1. Let kl(q, q ) denote the KL divergence from Ber(q) to Ber(q ). Let Ht 1 = (f1, y1, by1, . . . , ft 1, yt 1, byt 1) denote the history up to time t 1. Then we have Ei Unif([n]) TV2(Pi, Pi+1 mod n) 4Ei Unif([n]) TV2(Pi, P) 2Ei DKL(P Pi) (Pinsker s ineq) t=1 DKL(P(ft, yt, byt|Ht 1) Pi(ft, yt, byt|Ht 1)) (Chain rule) P(Bt yt = 1)Ea1:T P DKL(Ber( w, at ) Ber( wi, at )) # P(Bt)Ea1:T P DKL(Ber( w, at ) Ber( wi, at )) # t=1 Ea1:T P i=1 DKL(Ber( w, at ) Ber( wi, at )) t=1 Ea1:T P i:i at kl(kt n , (kt 1)(1 p) n 1 + p) + X i:i/ at kl(kt n , kt(1 p) t=1 Ea1:T P n , (kt 1)(1 p) n 1 + p) + (n kt)kl(kt n , kt(1 p) If kt = 1, then n , (kt 1)(1 p) n 1 + p) = kl( 1 (n kt) kl(kt n , kt(1 p) n 1 ) = (n 1) kl( 1 n 1) 1 (1 p)n(n 2) , where the ineq holds due to kl(q, q ) (q q )2 q (1 q ). If kt = n 1, it is symmetric to the case of kt = 1. We have n , (kt 1)(1 p) n 1 + p) = (n 1)kl(n 1 n 1 + 1 n 1p) = (n 1)kl( 1 1 (1 p)n(n 2) , (n kt) kl(kt n , kt(1 p) n 1 ) = kl(n 1 n , 1 p) = kl( 1 If 1 < kt < n 1, then n , (kt 1)(1 p) n 1 + p) =kt kl(kt n 1 p) (a) kt kl(kt (b) kt ( kt n 1 (1 kt 1 n 1 ) = kt n kt n2(kt 1) kt n(kt 1) 2 where inequality (a) holds due to that kt 1 n and kl(q, q ) is monotonically decreasing in q when q q and inequality (b) adopts kl(q, q ) (q q )2 q (1 q ), and (n kt) kl(kt n , kt(1 p) n 1 ) (n kt) kl(kt n , kt n 1) kt(n kt) n2(n 1 kt) 2kt where the first inequality hold due to that kt(1 p) n , and kl(q, q ) is monotonically increasing in q when q q and the second inequality adopts kl(q, q ) (q q )2 q (1 q ). 
Therefore, we have Eq (20) 6(n 1)ε t=1 Ea1:T P p) 12εT log(1/p) Combining with Eq (19), we have that there exists a universal constant c such that T cn ε(log(n/ε)+1). J Proof of Theorem 9 Proof. We will prove Theorem 9 by constructing an instance of Q and H and showing that for any learning algorithm, there exists a realizable data distribution s.t. achieving ε loss requires at least eΩ( |H| ε ) samples. Construction of Q, H and a set of realizable distributions Let feature vector space X = {0, 1, . . . , n} and let the space of feature-manipulation set pairs Q = {(0, {0} s)|s [n]}. That is to say, every agent has the same original feature vector x = 0 but has different manipulation ability according to s. Let the hypothesis class be a set of singletons over [n], i.e., H = {21{i} 1|i [n]}. We now define a collection of distributions {Di|i [n]} in which Di is realized by 21{i} 1. For any i [n], let Di put probability mass 1 6ε on (0, X, +1) and 6ε uniformly over {(0, {0} sσ,i, 1)|σ Sn}, where Sn is the set of all permutations over n elements and sσ,i := {j|σ 1(j) < σ 1(i)} is the set of elements appearing before i in the permutation (σ(1), . . . , σ(n)). In other words, with probability 1 6ε, we will sample (0, X, +1) and with ε, we will randomly draw a permutation σ Unif(Sn) and return (0, {0} sσ,i, 1). The data distribution Di is realized by 21{i} 1 since for negative examples (0, {0} sσ,i, 1), we have i / s and for positive examples (0, X, +1), we have i X. Randomization and improperness of the output fout do not help Note that algorithms are allowed to output a randomized fout and to output fout / H. We will show that randomization and improperness of fout don t make the problem easier. That is, supposing that the data distribution is Di for some i [n], finding a (possibly randomized and improper) fout is not easier than identifying i . Since our feature space X is finite, we can enumerate all hypotheses not equal to 21{i } 1 and calculate their strategic population loss as follows. 21 1 predicts all points in X by negative and thus Lstr(21 1) = 1 6ε; For any a X s.t. 0 a, 21a 1 will predict 0 as positive and thus will predict any point drawn from Di as positive. Hence Lstr(21a 1) = 6ε; For any a [n] s.t. i = i , i a, we have Lstr(21a 1) 3ε. This is due to that when y = 1, the probability of drawing a permutation σ with σ 1(i) < σ 1(i ) is 1 2. In this case, we have i sσ,i and the prediction of 21a 1 is +1. Under distribution Di , if we are able to find a (possibly randomized) fout with strategic loss Lstr(fout) ε, then we have Lstr(fout) = Eh fout [Lstr(h)] Prh fout(h = 21{i } 1) 3ε. Thus, Prh fout(h = 21{i } 1) 2 3 and then, we can identify i by checking which realization of fout has probability greater than 2 3. In the following, we will focus on the sample complexity to identify the target function 21{i } 1 or simply i . Let iout denote the algorithm s answer to question of what is i ? . Smoothing the data distribution For technical reasons (appearing later in the analysis), we don t want to analyze distribution {Di|i [n]} directly as the probability of t = i is 0 when ft(i ) = +1. Instead, we consider the mixture of Di and another distribution D i to make the probability of t = i be a small positive number. More specifically, let D i = (1 p)Di + p D i , where D i is defined by drawing (0, X, +1) with probability 1 6ε and (0, {0, i}, 1) with probability 6ε. 
When p is extremely small, we will never sample from D i when time horizon T is not too large and therefore, the algorithm behaves the same under D i and Di. For any data distribution D, let PD be the dynamics of (x1, f1, 1, y1, by1, . . . , x T , f T , T , y T , by T ) under D. According to Lemma 4, by setting p = ε 16n2 , when T n ε , we have that for any i, j [n] PDi(iout = j) PD i(iout = j) 1 From now on, we only consider distribution D i instead of Di. The readers might have the question that why not using D i for construction directly. This is because no hypothesis has zero loss under D i, and thus D i does not satisfy realizability requirement. Information gain from different choices of ft In each round of interaction, the learner picks a predictor ft, which can be out of H. Suppose that the target function is 21{i } 1 . Here we enumerate all choices of ft and discuss how much we can learn from each choice. ft = 21 1 predicts all points in X by negative. No matter what i is, we will observe t = xt = 0, yt Rad(1 6ε), byt = 1. They are identically distributed for any i [n] and thus we cannot tell any information of i from this round. ft = 21at 1 for some at X s.t. 0 at. Then no matter what i is, we will observe t = xt = 0, yt Rad(1 6ε), byt = +1. Again, we cannot tell any information of i from this round. ft = 21at 1 for some some non-empty at [n]. For rounds with yt = +1, we have xt = 0, byt = +1 and t = (0, ft, X) Unif(at), which still do not depend on i . For rounds with yt = 1, if the drawn example (0, {0} s, 1) satisfies that s at = , the we would observe t at and byt = +1. At least we could tell that 1{ t} is not the target function. Otherwise, we would observe t = xt = 0 and byt = 1. Therefore, we can only gain some information about i at rounds in which ft = 21at 1 for some non-empty at [n] and yt = 1. In such rounds, under distribution D i , the distribution of t is described as follows. Let kt = |at| denote the cardinality of at. Recall that agent (0, {0} s, 1) breaks ties randomly when choosing t if there are multiple elements in at s. Here are two cases: i at and i / at. 1. The case of i at: With probability p, we are sampling from D i and then t = i . With probability 1 p, we are sampling from Di . Conditional on this, with probability 1 kt , we sample an agent (0, {0} sσ,i , 1) with the permutation σ satisfying that σ 1(i ) < σ 1(j) for all j at \ {i } and thus, t = 0. With probability 1 1 kt , there exists j at \ {i } s.t. σ 1(j) < σ 1(i ) and t = 0. Since all j at \ {i } are symmetric, we have Pr( t = j) = (1 p)(1 1 kt ) 1 kt 1 = 1 p kt . Hence, the distribution of t is kt for j at, j = i i w.p. p 0 w.p. 1 p We denote this distribution by P (at, i ). 2. The case of i / at: With probability p, we are sampling from D i , we have t = xt = 0. With probability 1 p, we are sampling from Di . Conditional on this, with probability of 1 kt+1, σ 1(i ) < σ 1(j) for all j at and thus, t = xt = 0. With probability 1 1 kt+1 there exists j at s.t. σ 1(j) < σ 1(i ) and t at. Since all j at are symmetric, we have Pr( t = j) = (1 p)(1 1 kt+1) 1 kt = 1 p kt+1. Hence the distribution of t is ( j w.p. 1 p kt+1 for j at 0 w.p. p + 1 p We denote this distribution by P/ (at). To measure the information obtained from t, we will use the KL divergence of the distribution of t under the data distribution D i from that under a benchmark data distribution. We use the average distribution over {D i|i [n]}, which is denoted by D = 1 i n D i. 
The sampling process is equivalent to drawing i Unif([n]) first and then sampling from D i . Under D, for any j at, we have Pr( t = j) = Pr(i at \ {j}) Pr( t = j|i at \ {j}) + Pr(i = j) Pr( t = j|i = j) + Pr(i / at) Pr( t = ej|i / at) kt + 1 = (nkt 1)(1 p) nkt(kt + 1) + p Pr( t = 0) = Pr(i at) Pr( t = 0|i at) + Pr(i / at) Pr( t = 0|i / at) kt + 1) = (n + 1)(1 p) n(kt + 1) + (n kt)p Thus, the distribution of t under D is ( j w.p. (nkt 1)(1 p) nkt(kt+1) + p n for j at 0 w.p. (n+1)(1 p) n(kt+1) + (n kt)p We denote this distribution by P(at). Next we will compute the KL divergence of P/ (at) and P (at) from P(at). Since p = ε 16n2 1 16n2 , we have (nkt 1)(1 p) nkt(kt+1) + p n 1 p kt+1 and (n+1)(1 p) n(kt+1) + (n kt)p 1 kt + p. We will also use log(1 + x) x for x 0 in the following calculation. For any i at, we have DKL(P(at) P (at, i )) =(kt 1) (nkt 1)(1 p) nkt(kt + 1) + p log ((nkt 1)(1 p) nkt(kt + 1) + p + (nkt 1)(1 p) nkt(kt + 1) + p log ((nkt 1)(1 p) nkt(kt + 1) + p + (n + 1)(1 p) n(kt + 1) + (n kt)p log (n + 1)(1 p) n(kt + 1) + (n kt)p (kt 1) (nkt 1)(1 p) nkt(kt + 1) + p kt + 1 kt 1 p) + 1 p kt + 1 log(1 1 kt + p) log (1 + pkt) 0 + 1 kt + 1 log(1 kt pkt = 1 kt + 1 log(1 p) + 2p . (22) For P/ (at), we have DKL(P(at) P/ (at)) (nkt 1)(1 p) nkt(kt + 1) + p log ((nkt 1)(1 p) nkt(kt + 1) + p + (n + 1)(1 p) n(kt + 1) + (n kt)p (n + 1)(1 p) n(kt + 1) + (n kt)p (nkt 1)(1 p) nkt(kt + 1) + p kt + 1 kt + 1 kt + p) log (n + 1)(1 p) n(kt + 1) + (n kt)p kt + p) log(1 + 1 p(k2 t + kt + 1) n(1 + ktp) ) kt + p) 1 n(1 + ktp) = 1 nkt . (23) Lower bound of the information Now we adopt the similar framework used in the proofs of Theorem 7 and 8. For notation simplicity, for all i [n], let Pi denote the dynamics of (x1, f1, 1, y1, by1, . . . , x T , f T , T , y T , by T ) under D i and and P denote the dynamics under D. Let Bt denote the event of {ft = 21at 1 for some non-empty at [n]}. As discussed before, for any at, conditional on Bt or yt = +1, (xt, t, yt, byt) are identical in all {D i|i [n]}, and therefore, also identical in D. We can only obtain information at rounds when Bt (yt = 1) occurs. In such rounds, we know that xt is always 0, ft is fully determined by history (possibly with external randomness , which does not depend on data distribution), yt = 1 and byt is fully determined by t (byt = +1 iff. t = 0). Therefore, conditional the history Ht 1 = (x1, f1, 1, y1, by1, . . . , xt 1, ft 1, t 1, yt 1, byt 1) before time t, we have DKL(P(xt, ft, t, yt, byt|Ht 1) Pi(xt, ft, t, yt, byt|Ht 1)) =P(Bt yt = 1)DKL(P( t|Ht 1, Bt yt = 1) Pi( t|Ht 1, Bt yt = 1)) =6εP(Bt)DKL(P( t|Ht 1, Bt yt = 1) Pi( t|Ht 1, Bt yt = 1)) , (24) where the last equality holds due to that yt Rad(1 6ε) and does not depend on Bt. For any algorithm that can successfully identify i under the data distribution Di with probability 3 4 for all i [n], then PDi(iout = i) 3 4 and PDj(iout = i) 1 4 for all j = i. Recall that Di and D i are very close when the mixture parameter p is small. 
Combining with Eq (21), we have |Pi(iout = i) Pj(iout = i)| PDi(iout = i) PDj(iout = i) |PDi(iout = i) Pi(iout = i)| PDj(iout = i) Pj(iout = i) Then we have the total variation distance between Pi and Pj TV(Pi, Pj) |Pi(iout = i) Pj(iout = i)| 1 Then we have Ei Unif([n]) TV2(Pi, P(i+1) mod n) 4Ei Unif([n]) TV2(Pi, P) 2Ei DKL(P Pi) (Pinsker s ineq) t=1 DKL(P(xt, ft, t, yt, byt|Ht 1) Pi(xt, ft, t, yt, byt|Ht 1)) (Chain rule) P(Bt)DKL(P( t|Ht 1, Bt yt = 1) Pi( t|Ht 1, Bt yt = 1)) (Apply Eq (24)) i=1 DKL(P( t|Ht 1, Bt yt = 1) Pi( t|Ht 1, Bt yt = 1)) i:i at DKL(P(at) P (at)) + X i:i/ at DKL(P(at) P/ (at)) 1 kt + 1 log(1 p) + 2p + X (Apply Eq (22),(23)) p) + 2np + 1) 12Tε(log(16n2/ε) + 2) Combining with Eq (25), we have that there exists a universal constant c such that T cn ε(log(n/ε)+1).