# Online Bayesian Persuasion Without a Clue

Francesco Bacchiocchi, Politecnico di Milano, francesco.bacchiocchi@polimi.it
Matteo Bollini, Politecnico di Milano, matteo.bollini@polimi.it
Matteo Castiglioni, Politecnico di Milano, matteo.castiglioni@polimi.it
Alberto Marchesi, Politecnico di Milano, alberto.marchesi@polimi.it
Nicola Gatti, Politecnico di Milano, nicola.gatti@polimi.it

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract. We study online Bayesian persuasion problems in which an informed sender repeatedly faces a receiver with the goal of influencing their behavior through the provision of payoff-relevant information. Previous works assume that the sender knows either the prior distribution over states of nature or the receiver's utilities, or both. We relax such unrealistic assumptions by considering settings in which the sender does not know anything about either the prior or the receiver. We design an algorithm whose regret with respect to an optimal signaling scheme is sublinear in the number of rounds, and we also provide a collection of lower bounds showing that the guarantees of such an algorithm are tight. Our algorithm works by searching a suitable space of signaling schemes in order to learn the receiver's best responses. To do this, we leverage a non-standard representation of signaling schemes that allows us to cleverly overcome the challenge of not knowing anything about the prior over states of nature and the receiver's utilities. Finally, our results also allow us to derive lower/upper bounds on the sample complexity of learning signaling schemes in a related Bayesian persuasion PAC-learning problem.

1 Introduction

Bayesian persuasion was introduced by Kamenica and Gentzkow [2011] to model how strategically disclosing information to decision makers influences their behavior. Over the last years, it has received substantial attention in several fields of science, since it is particularly useful for understanding strategic interactions involving individuals with different levels of information, which are ubiquitous in the real world. As a consequence, Bayesian persuasion has been applied in several settings, such as online advertising [Emek et al., 2014, Badanidiyuru et al., 2018, Bacchiocchi et al., 2022, Agrawal et al., 2023], voting [Alonso and Câmara, 2016, Castiglioni et al., 2020a, Castiglioni and Gatti, 2021], traffic routing [Vasserman et al., 2015, Bhaskar et al., 2016, Castiglioni et al., 2021a], recommendation systems [Cohen and Mansour, 2019, Mansour et al., 2022], security [Rabinovich et al., 2015, Xu et al., 2016], e-commerce [Bro Miltersen and Sheffet, 2012, Castiglioni et al., 2022], medical research [Kolotilin, 2015], and financial regulation [Goldstein and Leitner, 2018].

In its simplest form, Bayesian persuasion involves a sender observing some information about the world, called state of nature, and a receiver who has to take an action. The agents' utilities are misaligned, but they both depend on the state of nature and on the receiver's action. Thus, the sender's goal is to devise a mechanism to (partially) disclose information to the receiver, so as to induce them to take a favorable action. This is accomplished by committing upfront to a signaling scheme, encoding a randomized policy that defines how to send informative signals to the receiver based on the observed state.
Classical Bayesian persuasion models (see, e.g., [Dughmi and Xu, 2016, 2017, Xu, 2020]) rely on rather stringent assumptions that considerably limit their applicability in practice. Specifically, they assume that the sender perfectly knows the surrounding environment, including the receiver's utilities and the probability distribution from which the state of nature is drawn, called prior. This has motivated a recent shift of attention towards Bayesian persuasion models that incorporate concepts and ideas from online learning, with the goal of relaxing some of such limiting assumptions. However, existing works only partially fulfill this goal, as they still assume some knowledge of either the prior (see, e.g., [Castiglioni et al., 2020b, 2021b, 2023, Babichenko et al., 2022, Bernasconi et al., 2023]) or the receiver's utilities (see, e.g., [Zu et al., 2021, Bernasconi et al., 2022, Wu et al., 2022, Lin and Li, 2025]).

1.1 Original contributions

We address, for the first time to the best of our knowledge, Bayesian persuasion settings in which the sender has no clue about the surrounding environment. In particular, we study the online learning problem faced by a sender who repeatedly interacts with a receiver over multiple rounds, without knowing anything about either the prior distribution over states of nature or the receiver's utilities. At each round, the sender commits to a signaling scheme, and, then, they observe a state realization and send a signal to the receiver based on that. After each round, the sender gets partial feedback, namely, they only observe the best-response action played by the receiver in that round. In such a setting, the goal of the sender is to minimize their regret, which measures how much utility they lose with respect to committing to an optimal (i.e., utility-maximizing) signaling scheme in every round.

We provide a learning algorithm that achieves regret of the order of $\tilde{O}(\sqrt{T})$, where $T$ is the number of rounds. We also provide lower bounds showing that the regret guarantees attained by our algorithm are tight in $T$ and in the parameters characterizing the Bayesian persuasion instance, i.e., the number of states of nature $d$ and the number of receiver's actions $n$. Our algorithm implements a sophisticated explore-then-commit scheme, with exploration being performed in a suitable space of signaling schemes so as to learn the receiver's best responses exactly. This is crucial to attain tight regret guarantees, and it is made possible by employing a non-standard representation of signaling schemes, which allows us to cleverly overcome the challenging lack of knowledge about both the prior and the receiver's utilities. Our results also allow us to derive lower/upper bounds on the sample complexity of learning signaling schemes in a related Bayesian persuasion PAC-learning problem, where the goal is to find, with high probability, an approximately-optimal signaling scheme in the minimum possible number of rounds.

1.2 Related works

Castiglioni et al. [2020b] were the first to introduce online learning problems in Bayesian persuasion scenarios, with the goal of relaxing the sender's knowledge about the receiver's utilities (see also the follow-up works [Castiglioni et al., 2021b, 2023, Bernasconi et al., 2023]). In their setting, the sender's uncertainty is modeled by means of an adversary selecting a receiver's type at each round, with types encoding information about the receiver's utilities.
However, in such a setting, the sender still needs knowledge about the finite set of possible receiver's types and their associated utilities, as well as about the prior. A parallel research line has focused on relaxing the sender's knowledge about the prior. Zu et al. [2021] study online learning in a repeated version of Bayesian persuasion. Differently from this paper, they consider the sender's learning problem of issuing persuasive action recommendations (corresponding to signals in their case), where persuasiveness is about correctly incentivizing the receiver to actually follow such recommendations. They provide an algorithm that attains sublinear regret while being persuasive at every round with high probability, despite having no knowledge of the prior. Wu et al. [2022], Gan et al. [2023], Bacchiocchi et al. [2024c] achieve similar results for Bayesian persuasion in episodic Markov decision processes, while Bernasconi et al. [2022] do so in non-Markovian environments. All these works crucially differ from ours, since they strongly rely on the assumption that the receiver's utilities are known to the sender, which is needed in order to meet persuasiveness requirements. As a result, the techniques employed in such works are fundamentally different from ours as well. Finally, learning the receiver's best responses exactly (a fundamental component of our algorithm) is related to learning in Stackelberg games [Letchford et al., 2009, Peng et al., 2019, Bacchiocchi et al., 2024a]. For more details on these works and other related works, we refer the reader to Appendix A.

2 Preliminaries

In Section 2.1, we introduce all the needed ingredients of the classical Bayesian persuasion model by Kamenica and Gentzkow [2011], while, in the following Section 2.2, we formally define the Bayesian persuasion setting faced in the rest of the paper and its related online learning problem.

2.1 Bayesian persuasion

A Bayesian persuasion instance is characterized by a finite set $\Theta := \{\theta_i\}_{i=1}^d$ of $d$ states of nature and a finite set $A := \{a_i\}_{i=1}^n$ of $n$ receiver's actions. The agents' payoffs are encoded by the utility functions $u, u^s : \Theta \times A \to [0, 1]$, with $u_\theta(a) := u(\theta, a)$, respectively $u^s_\theta(a) := u^s(\theta, a)$, denoting the payoff of the receiver, respectively the sender, when action $a \in A$ is played in state $\theta \in \Theta$. The sender observes a state of nature drawn from a commonly-known prior probability distribution $\mu \in \mathrm{int}(\Delta_\Theta)$, with $\mu_\theta \in (0, 1]$ denoting the probability of $\theta \in \Theta$ (given a finite set $X$, we denote by $\Delta_X$ the set of all the probability distributions over $X$). To disclose information about the realized state, the sender can publicly commit upfront to a signaling scheme $\phi : \Theta \to \Delta_S$, which defines a randomized mapping from states of nature to signals being sent to the receiver, for a finite set $S$ of signals. For ease of notation, we let $\phi_\theta := \phi(\theta)$ be the probability distribution over signals prescribed by $\phi$ when the state of nature is $\theta \in \Theta$, with $\phi_\theta(s) \in [0, 1]$ denoting the probability of sending signal $s \in S$. The sender-receiver interaction goes as follows: (1) the sender commits to a signaling scheme $\phi$; (2) the sender observes a state of nature $\theta \sim \mu$ and sends a signal $s \sim \phi_\theta$ to the receiver; (3) the receiver updates their belief over states of nature according to Bayes' rule; and (4) the receiver plays a best-response action $a \in A$, with the sender and the receiver getting payoffs $u^s_\theta(a)$ and $u_\theta(a)$, respectively. Specifically, an action is a best response for the receiver if it maximizes their expected utility given the belief computed in step (3) of the interaction.
Formally, given a signaling scheme $\phi : \Theta \to \Delta_S$ and a signal $s \in S$, we let $A^\phi(s) \subseteq A$ be the set of the receiver's best-response actions, defined as:
$$A^\phi(s) := \Big\{ a_i \in A \ \Big|\ \sum_{\theta \in \Theta} \mu_\theta \phi_\theta(s) u_\theta(a_i) \geq \sum_{\theta \in \Theta} \mu_\theta \phi_\theta(s) u_\theta(a_j) \quad \forall a_j \in A \Big\}.$$
As customary in the literature on Bayesian persuasion (see, e.g., [Dughmi and Xu, 2016]), we assume that, when the receiver has multiple best responses available, they break ties in favor of the sender. In particular, we let $a^\phi(s)$ be the best response that is actually played by the receiver when observing signal $s \in S$ under signaling scheme $\phi$, with $a^\phi(s) \in \arg\max_{a \in A^\phi(s)} \sum_{\theta \in \Theta} \mu_\theta \phi_\theta(s) u^s_\theta(a)$. The goal of the sender is to commit to an optimal signaling scheme, namely, a $\phi : \Theta \to \Delta_S$ that maximizes the sender's expected utility, defined as $u^s(\phi) := \sum_{s \in S} \sum_{\theta \in \Theta} \mu_\theta \phi_\theta(s) u^s_\theta(a^\phi(s))$. In the following, we let $\mathrm{OPT} := \max_\phi u^s(\phi)$ be the optimal value of the sender's expected utility. Moreover, given an additive error $\gamma \in (0, 1)$, we say that a signaling scheme $\phi$ is $\gamma$-optimal if $u^s(\phi) \geq \mathrm{OPT} - \gamma$.

2.2 Learning in Bayesian persuasion

We study settings in which the sender repeatedly interacts with the receiver over multiple rounds, with each round involving a one-shot Bayesian persuasion interaction (as described in Section 2.1). We assume that the sender has no knowledge about both the prior $\mu$ and the receiver's utility $u$, and that the only feedback they get after each round is the best-response action played by the receiver. At each round $t \in [T]$ (we denote by $[n] := \{1, \ldots, n\}$ the set of the first $n \in \mathbb{N}$ natural numbers), the sender commits to a signaling scheme $\phi_t : \Theta \to \Delta_S$ and observes a state of nature $\theta_t \sim \mu$. Then, they draw a signal $s_t \sim \phi_{t,\theta_t}$ and send it to the receiver, who plays a best-response action $a_t := a^{\phi_t}(s_t)$. Finally, the sender gets payoff $u^s_t := u^s_{\theta_t}(a_t)$ and observes a feedback consisting in the action $a_t$ played by the receiver. The goal of the sender is to learn how to maximize their expected utility while repeatedly interacting with the receiver. When the sender commits to a sequence $\{\phi_t\}_{t \in [T]}$ of signaling schemes, their performance over the $T$ rounds is measured by means of the following notion of cumulative (Stackelberg) regret:
$$R_T(\{\phi_t\}_{t \in [T]}) := T \cdot \mathrm{OPT} - \mathbb{E}\Big[\sum_{t \in [T]} u^s_t\Big],$$
where the expectation is with respect to the randomness of the algorithm. In the following, for ease of notation, we omit the dependency on $\{\phi_t\}_{t \in [T]}$ from the cumulative regret, by simply writing $R_T$. Then, our goal is to design no-regret learning algorithms for the sender, which prescribe a sequence of signaling schemes $\phi_t$ that results in the regret $R_T$ growing sublinearly in $T$, namely $R_T = o(T)$.

3 Warm-up: A single signal is all you need

In order to design our learning algorithm in Section 4, we exploit a non-standard representation of signaling schemes, which we introduce in this section. Adopting such a representation is fundamental to be able to learn the receiver's best responses without any knowledge of both the prior $\mu$ and the receiver's utility function $u$. The crucial observation that motivates its adoption is that the receiver's best responses $a^\phi(s)$ only depend on the components of $\phi$ associated with $s \in S$, namely $\phi_\theta(s)$ for $\theta \in \Theta$. Thus, in order to learn them, it is sufficient to learn how $a^\phi(s)$ varies as a function of such components. The signaling scheme representation introduced in this section revolves around the concept of slice.

Definition 1 (Slice). Given a signaling scheme $\phi : \Theta \to \Delta_S$, the slice of $\phi$ with respect to signal $s \in S$ is the $d$-dimensional vector $x \in [0, 1]^d$ with components $x_\theta := \phi_\theta(s)$ for $\theta \in \Theta$.
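To make the notions above concrete, the following minimal sketch computes the receiver's best-response set $A^\phi(s)$ with sender-favorable tie-breaking, the sender's expected utility $u^s(\phi)$, and the slice of $\phi$ with respect to a signal as in Definition 1. The instance (prior, utilities, signaling scheme) and the helper names are invented for illustration and are not part of the paper.

```python
import numpy as np

# Toy instance (all values invented for illustration): d = 2 states, n = 3 actions.
mu  = np.array([0.6, 0.4])                 # prior over states of nature
u_r = np.array([[1.0, 0.2, 0.0],           # receiver's utility u_theta(a)
                [0.0, 0.5, 1.0]])          # rows: states, columns: actions
u_s = np.array([[0.0, 1.0, 0.3],           # sender's utility u^s_theta(a)
                [0.2, 1.0, 0.0]])

# A signaling scheme phi: one distribution over the signals S = {s1, s2} per state.
phi = np.array([[0.7, 0.3],
                [0.1, 0.9]])

def best_response(phi, s):
    """Receiver's best response a^phi(s), breaking ties in favor of the sender."""
    w = mu * phi[:, s]                                        # unnormalized posterior weights
    exp_u = w @ u_r                                           # receiver's expected utility per action
    ties = np.flatnonzero(np.isclose(exp_u, exp_u.max()))     # the set A^phi(s)
    return int(max(ties, key=lambda a: w @ u_s[:, a]))        # sender-preferred action

def sender_utility(phi):
    """u^s(phi) = sum_s sum_theta mu_theta * phi_theta(s) * u^s_theta(a^phi(s))."""
    return sum((mu * phi[:, s]) @ u_s[:, best_response(phi, s)]
               for s in range(phi.shape[1]))

slice_s1 = phi[:, 0]    # the slice of phi with respect to signal s1 (Definition 1)
print(best_response(phi, 0), best_response(phi, 1), round(sender_utility(phi), 3), slice_s1)
```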
In the following, we denote by $\mathcal{X} := [0, 1]^d$ the set of all the possible slices of signaling schemes. Moreover, we let $\bar{\mathcal{X}}$ be the set of normalized slices, which is simply obtained by restricting slices to lie in the $(d-1)$-dimensional simplex. Thus, it holds that $\bar{\mathcal{X}} := \{ x \in [0, 1]^d \mid \sum_{\theta \in \Theta} x_\theta = 1 \}$.

[Figure 1: Representation of the sets $\mathcal{X}(a_i)$ and $\bar{\mathcal{X}}(a_i)$ for an instance with $d = 2$ states of nature and $n = 3$ receiver's actions.]

Next, we show that the receiver's actions induce particular coverings of the sets $\mathcal{X}$ and $\bar{\mathcal{X}}$, which also depend on both the prior $\mu$ and the receiver's utility $u$. First, we introduce $\mathcal{H}_{ij} \subseteq \mathbb{R}^d$ to denote the halfspace of slices under which action $a_i \in A$ is (weakly) better than action $a_j \in A$ for the receiver:
$$\mathcal{H}_{ij} := \Big\{ x \in \mathbb{R}^d \ \Big|\ \sum_{\theta \in \Theta} x_\theta \mu_\theta \big( u_\theta(a_i) - u_\theta(a_j) \big) \geq 0 \Big\}.$$
Moreover, we denote by $H_{ij} := \partial\mathcal{H}_{ij}$ the hyperplane constituting the boundary of the halfspace $\mathcal{H}_{ij}$, which we call the separating hyperplane between actions $a_i$ and $a_j$ (we let $\partial\mathcal{H}$ be the boundary hyperplane of a halfspace $\mathcal{H} \subseteq \mathbb{R}^d$; notice that $H_{ij}$ and $H_{ji}$ actually refer to the same hyperplane, and we use both names for ease of presentation). Then, for every $a_i \in A$, we introduce the polytopes $\mathcal{X}(a_i) \subseteq \mathcal{X}$ and $\bar{\mathcal{X}}(a_i) \subseteq \bar{\mathcal{X}}$:
$$\mathcal{X}(a_i) := \mathcal{X} \cap \bigcap_{a_j \in A : a_j \neq a_i} \mathcal{H}_{ij} \qquad \text{and} \qquad \bar{\mathcal{X}}(a_i) := \bar{\mathcal{X}} \cap \bigcap_{a_j \in A : a_j \neq a_i} \mathcal{H}_{ij}.$$
Clearly, the sets $\mathcal{X}(a_i)$, respectively $\bar{\mathcal{X}}(a_i)$, define a cover of $\mathcal{X}$, respectively $\bar{\mathcal{X}}$. In this paper, given a polytope $P \subseteq \mathbb{R}^D$, we let $V(P)$ be its set of vertices, while we denote by $\mathrm{vol}_m(P)$ its Lebesgue measure in $m$ dimensions; for ease of notation, whenever $m = D - 1$, we simply write $\mathrm{vol}(P)$. Moreover, we let $\mathrm{int}(P)$ be the interior of $P$ relative to a subspace that fully contains $P$ and has minimum dimension; in the case of the polytopes $\bar{\mathcal{X}}(a_i)$, the $(d-1)$-dimensional simplex is one such subspace. Intuitively, the set $\mathcal{X}(a_i)$ encompasses all the slices under which action $a_i$ is a best response for the receiver. The set $\bar{\mathcal{X}}(a_i)$ has the same interpretation, but for normalized slices. Specifically, if $x \in \mathcal{X}(a_i)$ is a slice of $\phi$ with respect to $s \in S$, then $a_i \in A^\phi(s)$. Notice that a slice $x \in \mathcal{X}$ may belong to more than one polytope $\mathcal{X}(a_i)$, when it is the case that $|A^\phi(s)| > 1$ for signaling schemes $\phi$ having $x$ as slice with respect to $s \in S$. Let us remark that, in this paper, we assume that there are no two equivalent receiver's actions $a_i \neq a_j \in A$ such that $u_\theta(a_i) = u_\theta(a_j)$ for all $\theta \in \Theta$; thus, a slice can belong to more than one polytope only if it lies on some separating hyperplane. This assumption is w.l.o.g., since it is possible to account for equivalent actions by introducing at most $n$ additional separating hyperplanes to consider tie-breaking in favor of the sender. In order to denote the best-response action actually played by the receiver under a slice $x \in \mathcal{X}$, we introduce the symbol $a(x)$, where $a(x) := a^\phi(s)$ for any $\phi$ having $x$ as slice with respect to $s \in S$. Figure 1 depicts an example of the polytopes $\mathcal{X}(a_i)$ and $\bar{\mathcal{X}}(a_i)$ in order to help the reader grasp more intuition about them.

A crucial fact exploited by the learning algorithm developed in Section 4 is that knowing the separating hyperplanes $H_{ij}$ defining the polytopes $\bar{\mathcal{X}}(a_i)$ of normalized slices is sufficient to determine an optimal signaling scheme. Indeed, the polytopes $\mathcal{X}(a_i)$ of unnormalized slices can be easily reconstructed by simply removing the normalization constraint $\sum_{\theta \in \Theta} x_\theta = 1$ from $\bar{\mathcal{X}}(a_i)$. Furthermore, as we show in Section 4 (see the proof of Lemma 4 in particular), there always exists an optimal signaling scheme using at most one slice $x^a \in \mathcal{X}(a)$ for each receiver's action $a \in A$.

We conclude the section with some remarks that help to better clarify why we need to work with signaling scheme slices in order to design our learning algorithm in Section 4.

Why we need slices for learning. The coefficients of the separating hyperplanes $H_{ij}$ are products between prior probabilities $\mu_\theta$ and receiver's utility differences $u_\theta(a_i) - u_\theta(a_j)$, as illustrated by the sketch below.
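The following snippet (same kind of illustrative toy instance as above; helper names are ours) builds the coefficient vector of a separating hyperplane $H_{ij}$, whose $\theta$-th entry is exactly $\mu_\theta (u_\theta(a_i) - u_\theta(a_j))$, and uses it to test whether a normalized slice belongs to $\bar{\mathcal{X}}(a_i)$.

```python
import numpy as np

# Illustrative values only (not from the paper).
mu  = np.array([0.6, 0.4])
u_r = np.array([[1.0, 0.2, 0.0],
                [0.0, 0.5, 1.0]])
n = u_r.shape[1]

def hyperplane_coeffs(i, j):
    """Coefficient vector of H_ij: entry theta equals mu_theta * (u_theta(a_i) - u_theta(a_j));
    a slice x satisfies the halfspace constraint when coeffs @ x >= 0."""
    return mu * (u_r[:, i] - u_r[:, j])

def in_best_response_region(x, i, tol=1e-12):
    """Check whether the normalized slice x lies in the polytope of action a_i."""
    assert np.isclose(x.sum(), 1.0)
    return all(hyperplane_coeffs(i, j) @ x >= -tol for j in range(n) if j != i)

x = np.array([0.5, 0.5])
print([i for i in range(n) if in_best_response_region(x, i)])   # regions containing x
```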
In order to design a no-regret learning algorithm, it is fundamental that such coefficients are learned exactly, since even an arbitrarily small approximation error may result in missing some of the receiver's best responses, and this may potentially lead to a large loss in the sender's expected utility (and, in its turn, to large regret). As a result, any naïve approach that learns the prior and the receiver's payoffs separately is doomed to fail, as it would inevitably result in approximate separating hyperplanes being learned. Operating in the space of signaling scheme slices crucially allows us to learn separating hyperplanes exactly. As we show in Section 4, it makes it possible to directly learn the coefficients of separating hyperplanes without splitting them into products of prior probabilities and receiver's utility differences.

Why we need normalized slices. One may wonder why we cannot work with (unnormalized) slices in $\mathcal{X}$, rather than with normalized ones in $\bar{\mathcal{X}}$. Indeed, as we show in Section 4, our procedure to learn separating hyperplanes crucially relies on the fact that we can restrict the search space to a suitable subset of the $(d-1)$-dimensional simplex. This makes it possible to avoid always employing a number of rounds exponential in $d$, which would lead to non-tight regret guarantees.

Why not working with posteriors. In Bayesian persuasion, it is oftentimes useful to work in the space of posterior distributions induced by the sender's signals (see, e.g., [Castiglioni et al., 2020b]). These are the beliefs computed by the receiver according to Bayes' rule at step (3) of the interaction. Notice that posteriors do not only depend on the signaling scheme $\phi$ and the sent signal $s \in S$, but also on the prior distribution $\mu$. Indeed, the same signaling scheme may induce different posteriors for different prior distributions. Thus, since in our setting the sender has no knowledge of $\mu$, we cannot employ posteriors. Looking at signaling scheme slices crucially allows us to overcome the lack of knowledge of the prior. Indeed, one way of thinking of them is as prior-free posterior distributions.

4 Learning to persuade without a clue

In this section, we design our no-regret algorithm (Algorithm 1). We adopt a sophisticated explore-then-commit approach that exploits the signaling scheme representation based on slices introduced in Section 3. Specifically, our algorithm works by first exploring the space $X := \bar{\mathcal{X}}$ of normalized slices in order to learn satisfactory approximations of the polytopes $X(a_i) := \bar{\mathcal{X}}(a_i)$ (in the rest of the paper, for ease of notation, we write $X$ and $X(a_i)$ instead of $\bar{\mathcal{X}}$ and $\bar{\mathcal{X}}(a_i)$). Then, it exploits them in order to compute suitable approximately-optimal signaling schemes to be employed in the remaining rounds. Effectively implementing this approach raises considerable challenges. The first challenge is that the algorithm cannot directly query a slice $x \in X$ to know action $a(x)$, as it can only commit to fully-specified signaling schemes. Indeed, even if the algorithm commits to a signaling scheme including the selected slice $x \in X$, the probability that the signal associated with $x$ is sent depends on the (unknown) prior, as it is equal to $\sum_{\theta \in \Theta} \mu_\theta x_\theta$.
This probability can be arbitrarily small. Thus, in order to observe $a(x)$, the algorithm may need to commit to the signaling scheme for an unreasonably large number of rounds. To circumvent this issue, we show that it is possible to focus on a subset $X_\epsilon \subseteq X$ of normalized slices inducible with at least a suitably-defined probability $\epsilon \in (0, 1)$. Such a set $X_\epsilon$ is built by the algorithm in its first phase. The second challenge that we face is learning the polytopes $X_\epsilon(a_i) := X(a_i) \cap X_\epsilon$. This is done by means of a technically-involved procedure that learns the separating hyperplanes $H_{ij}$ needed to identify them. This procedure is an adaptation to Bayesian persuasion settings of an algorithm recently introduced for a similar problem in Stackelberg games [Bacchiocchi et al., 2024a]. Finally, the third challenge is how to use the polytopes $X_\epsilon(a_i)$ to compute suitable approximately-optimal signaling schemes to commit to after exploration. We show that this can be done by solving an LP, which, provided that the set $X_\epsilon$ is carefully constructed, gives signaling schemes with sender's expected utility sufficiently close to that of an optimal signaling scheme.

Algorithm 1 Learn-to-Persuade-w/o-Clue
Require: $T \in \mathbb{N}$
1: $\delta \leftarrow 1/T$, $\zeta \leftarrow 1/T$
4: $t \leftarrow 1$
5: $X_\epsilon \leftarrow$ Build-Search-Space($T_1$, $\epsilon$)
6: $R_\epsilon \leftarrow$ Find-Polytopes($X_\epsilon$, $\zeta$)
7: while $t \leq T$ do
8:   $\phi_t \leftarrow$ Compute-Signaling($R_\epsilon$, $X_\epsilon$, $\hat\mu_t$)
9:   Commit to $\phi_t$, observe $\theta_t$, and send $s_t$
10:  Observe feedback $a_t$ and receive $u^s_t$
11:  Compute prior estimate $\hat\mu_{t+1}$
12:  $t \leftarrow t + 1$

The pseudocode of our no-regret learning algorithm is provided in Algorithm 1. In the pseudocode, we assume that all the sub-procedures have access to the current round counter $t$, all the observed states of nature $\theta_t$, and all the feedbacks $a_t$, $u^s_t$ received by the sender (notice that the sub-procedures Build-Search-Space and Find-Polytopes perform some rounds of interaction and, thus, they update the current round counter $t$; for ease of presentation, we assume that, whenever $t > T$, their execution is immediately stopped, as well as the execution of Algorithm 1). Moreover, in Algorithm 1 and its sub-procedures, we use $\hat\mu_t \in \Delta_\Theta$ to denote the prior estimate at any round $t > 1$, which is a vector with components defined as $\hat\mu_{t,\theta} := N_{t,\theta}/(t-1)$ for $\theta \in \Theta$, where $N_{t,\theta} \in \mathbb{N}$ denotes the number of times that state of nature $\theta$ is observed up to round $t$ (excluded).

Algorithm 1 can be conceptually divided into three phases. In phase 1, the algorithm employs the Build-Search-Space procedure (Algorithm 2) to build a suitable subset $X_\epsilon \subseteq X$ of inducible normalized slices. Then, in phase 2, the algorithm employs the Find-Polytopes procedure (see Algorithm 6 in Appendix E) to find a collection of polytopes $R_\epsilon := \{R_\epsilon(a)\}_{a \in A}$, where each $R_\epsilon(a)$ is either $X_\epsilon(a)$ or a suitable subset of $X_\epsilon(a)$ that is sufficient for achieving the desired goals (see Section 4.2). Finally, phase 3 uses the remaining rounds to exploit the knowledge acquired in the preceding two phases. Specifically, at each round $t$, this phase employs the Compute-Signaling procedure (Algorithm 3) to compute an approximately-optimal signaling scheme, by using $R_\epsilon$, the set $X_\epsilon$, and the current prior estimate $\hat\mu_t$.

In the rest of this section, we describe in detail the three phases of Algorithm 1, bounding the regret attained by each of them. This allows us to prove the following main result about Algorithm 1 (the regret bound stated in Theorem 1 depends on $B$, which is the bit-complexity of the numbers $\mu_\theta u_\theta(a_i)$, i.e., the number of bits required to represent them; we refer the reader to Appendix E for more details about how we manage the bit-complexity of numbers in this paper).

Theorem 1. The regret attained by Algorithm 1 is $R_T \leq \tilde{O}\Big( \binom{d+n}{d} n^{3/2} d^3 B \sqrt{T} \Big)$.

We observe that the regret bound in Theorem 1 has an exponential dependence on the number of states of nature $d$ and the number of receiver's actions $n$, due to the binomial coefficient. Indeed, when $d$, respectively $n$, is constant, the regret bound is of the order of $\tilde{O}(n^d \sqrt{T})$, respectively $\tilde{O}(d^n \sqrt{T})$. Such a dependence is tight, as shown by the lower bounds that we provide in Section 5.
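The following scaffold shows the explore-then-commit structure of Algorithm 1 in code form. It is only a sketch under our own assumptions: the three phase procedures are passed in as callables, the interaction with the environment is abstracted behind a hypothetical `env.play_round` method, and the concrete choices of $\epsilon$ and $T_1$ (lines 2-3 of Algorithm 1, which are not reproduced above) are taken as inputs rather than computed.

```python
# Sketch of the explore-then-commit structure of Algorithm 1 (not the paper's
# implementation). `env`, the phase procedures, and the choice of epsilon/T1
# are assumptions supplied by the caller.
def learn_to_persuade_without_clue(T, epsilon, T1, env,
                                   build_search_space, find_polytopes,
                                   compute_signaling):
    delta = zeta = 1.0 / T                    # line 1 of Algorithm 1
    counts = {}                               # state counts for the running prior estimate
    t = 1
    # Phase 1: build the search space X_eps of inducible normalized slices.
    X_eps, t = build_search_space(T1, epsilon, env, t, counts)
    # Phase 2: learn the polytopes (or faces thereof) R_eps(a) within X_eps.
    R_eps, t = find_polytopes(X_eps, zeta, env, t, counts)
    # Phase 3: commit to approximately-optimal signaling schemes until round T
    # (assumes the two exploration phases consumed at least one round).
    while t <= T:
        mu_hat = {th: c / (t - 1) for th, c in counts.items()}   # empirical prior
        phi = compute_signaling(R_eps, X_eps, mu_hat)
        theta_t, a_t, payoff_t = env.play_round(phi)   # commit, observe state and feedback
        counts[theta_t] = counts.get(theta_t, 0) + 1
        t += 1
```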
4.1 Phase 1: Build-Search-Space

Given an $\epsilon \in (0, 1/6d)$ and a number of rounds $T_1$, the Build-Search-Space procedure (Algorithm 2) computes a subset $X_\epsilon \subseteq X$ of normalized slices satisfying two crucial properties needed by the learning algorithm to attain the desired guarantees. Specifically, the first property is that any slice $x \in X_\epsilon$ can be induced with sufficiently high probability by a signaling scheme, while the second one is that, if $x \notin X_\epsilon$, then signaling schemes induce such a slice with sufficiently small probability. Intuitively, the first property ensures that it is possible to associate any $x \in X_\epsilon$ with the action $a(x)$ in a number of rounds of the order of $1/\epsilon$, while the second property is needed to bound the loss in sender's expected utility due to only considering signaling schemes with slices in $X_\epsilon$.

Algorithm 2 Build-Search-Space
Require: $\epsilon \in (0, 1/6d)$, number of rounds $T_1$
1: $\widetilde\Theta \leftarrow \emptyset$
2: while $t \leq T_1$ do
3:   Commit to any $\phi_t$, observe $\theta_t$, and send $s_t$
4:   Observe feedback $a_t$ and receive $u^s_t$
5:   $t \leftarrow t + 1$
6:   Compute prior estimate $\hat\mu_t$
7: $\hat\mu \leftarrow \hat\mu_t$
8: for $\theta \in \Theta$ do
9:   if $\hat\mu_\theta > 2\epsilon$ then $\widetilde\Theta \leftarrow \widetilde\Theta \cup \{\theta\}$
10: $X_\epsilon \leftarrow \big\{ x \in X \mid \sum_{\theta \in \widetilde\Theta} \hat\mu_\theta x_\theta \geq 2\epsilon \big\}$
11: return $X_\epsilon$

Algorithm 2 works by simply observing realized states of nature $\theta_t$ for $T_1$ rounds, while committing to any signaling scheme meanwhile. This allows the algorithm to build a sufficiently accurate estimate $\hat\mu$ of the true prior $\mu$. Then, the algorithm uses such an estimate to build the set $X_\epsilon$. Specifically, it constructs $X_\epsilon$ as the set containing all the normalized slices that are inducible with probability at least $2\epsilon$ under the estimated prior $\hat\mu$, after filtering out all the states of nature whose estimated probability is not above the $2\epsilon$ threshold (see Lines 8-10 in Algorithm 2). The following lemma formally establishes the two crucial properties that are guaranteed by Algorithm 2, as informally described above.

Lemma 1. Given $T_1 := \lceil \frac{12}{\epsilon} \log(2d/\delta) \rceil$ and $\epsilon \in (0, 1/6d)$, Algorithm 2 employs $T_1$ rounds and terminates with a set $X_\epsilon \subseteq X$ such that, with probability at least $1 - \delta$: (i) $\sum_{\theta \in \Theta} \mu_\theta x_\theta \geq \epsilon$ for every slice $x \in X_\epsilon$, and (ii) $\sum_{\theta \in \Theta} \mu_\theta x_\theta \leq 10\epsilon$ for every slice $x \in X \setminus X_\epsilon$.

To prove Lemma 1, we employ the multiplicative version of the Chernoff bound, so as to show that it is possible to distinguish whether prior probabilities $\mu_\theta$ are above or below suitable thresholds in a number of rounds of the order of $1/\epsilon$. Specifically, we show that, after $T_1$ rounds and with probability at least $1 - \delta$, the set $\widetilde\Theta$ in the definition of $X_\epsilon$ does not contain states $\theta \in \Theta$ with $\mu_\theta \leq \epsilon$, while it contains all the states with a sufficiently large $\mu_\theta$. This immediately proves properties (i) and (ii) in Lemma 1.
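As a concrete illustration of lines 8-10 of Algorithm 2, the snippet below builds the membership test for $X_\epsilon$ from an invented empirical prior estimate: states with estimated probability at most $2\epsilon$ are filtered out, and a normalized slice is kept if the retained states alone give it inducing probability at least $2\epsilon$. The numerical values and helper names are illustrative only.

```python
import numpy as np

# Illustration of lines 8-10 of Algorithm 2 with a made-up empirical prior
# over d = 4 states; eps is a made-up exploration parameter.
eps = 0.05
mu_hat = np.array([0.52, 0.37, 0.08, 0.03])

theta_tilde = mu_hat > 2 * eps           # states kept in the set Theta_tilde (line 9)

def in_X_eps(x, tol=1e-12):
    """A normalized slice x belongs to X_eps if, counting only the retained states,
    its inducing probability under mu_hat is at least 2 * eps (line 10)."""
    return bool(np.isclose(x.sum(), 1.0) and
                mu_hat[theta_tilde] @ x[theta_tilde] >= 2 * eps - tol)

print(in_X_eps(np.array([1.0, 0.0, 0.0, 0.0])))   # True: mass on a frequent state
print(in_X_eps(np.array([0.0, 0.0, 0.5, 0.5])))   # False: mass only on filtered-out states
```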
Notice that using a multiplicative Chernoff bound is a necessary technicality, since standard concentration inequalities would result in a number of needed rounds of the order of $1/\epsilon^2$, leading to a suboptimal regret bound in the number of rounds $T$. For ease of presentation, we introduce the following clean event for phase 1 of Algorithm 1. This encompasses all the situations in which Algorithm 2 outputs a set $X_\epsilon$ with the desired properties.

Definition 2 (Phase 1 clean event). $\mathcal{E}_1$ is the event in which $X_\epsilon$ meets properties (i) and (ii) in Lemma 1.

4.2 Phase 2: Find-Polytopes

Given a set $X_\epsilon \subseteq X$ computed by the Build-Search-Space procedure and $\zeta \in (0, 1)$ as inputs, the Find-Polytopes procedure (Algorithm 6 in Appendix E) computes a collection $R_\epsilon := \{R_\epsilon(a)\}_{a \in A}$ of polytopes enjoying suitable properties sufficient to achieve the desired goals. Ideally, we would like $R_\epsilon(a) = X_\epsilon(a)$ for every $a \in A$. However, it is not possible to completely identify the polytopes $X_\epsilon(a)$ with $\mathrm{vol}(X_\epsilon(a)) = 0$. Indeed, if $\mathrm{vol}(X_\epsilon(a_i)) = 0$, then $X_\epsilon(a_i) \subseteq X_\epsilon(a_j)$ for some other polytope $X_\epsilon(a_j)$ with positive volume. Thus, due to the receiver's tie breaking, it could be impossible to identify the whole polytope $X_\epsilon(a_i)$. As a result, the Find-Polytopes procedure can output polytopes $R_\epsilon(a) = X_\epsilon(a)$ only if $\mathrm{vol}(X_\epsilon(a)) > 0$. However, we show that, if $\mathrm{vol}(X_\epsilon(a)) = 0$, it is sufficient to guarantee that the polytope $R_\epsilon(a)$ contains a suitable subset $V_\epsilon(a)$ of the vertices of $X_\epsilon(a)$; specifically, those in which the best response actually played by the receiver is $a$. For every $a \in A$, such a set is formally defined as $V_\epsilon(a) := \{x \in V(X_\epsilon(a)) \mid a(x) = a\}$. Thus, we design Find-Polytopes so that it returns a collection $R_\epsilon := \{R_\epsilon(a)\}_{a \in A}$ of polytopes such that: (i) if it holds that $\mathrm{vol}(X_\epsilon(a)) > 0$, then $R_\epsilon(a) = X_\epsilon(a)$, while (ii) if $\mathrm{vol}(X_\epsilon(a)) = 0$, then $R_\epsilon(a)$ is a (possibly improper) face of $X_\epsilon(a)$ with $V_\epsilon(a) \subseteq R_\epsilon(a)$. As a result, each polytope $R_\epsilon(a)$ can be either $X_\epsilon(a)$ or a face of $X_\epsilon(a)$, or it can be empty, depending on the receiver's tie breaking. In all these cases, it is always guaranteed that $V_\epsilon(a) \subseteq R_\epsilon(a)$.

To achieve its goal, the Find-Polytopes procedure works by searching over the space of normalized slices $X_\epsilon$, so as to learn exactly all the separating hyperplanes $H_{ij}$ characterizing the needed vertices. The algorithm does so by using and extending tools that have been developed for a related learning problem in Stackelberg games (see [Bacchiocchi et al., 2024a]). Notice that our Bayesian persuasion setting has some distinguishing features that do not allow us to use such tools off the shelf. We refer the reader to Appendix E for a complete description of the Find-Polytopes procedure.

A crucial component of Find-Polytopes is a tool to query a normalized slice $x \in X_\epsilon$ in order to obtain the action $a(x)$. This is done by using a sub-procedure that we call Action-Oracle (see Algorithm 5 in Appendix E), which works by committing to a signaling scheme including slice $x$ until the signal corresponding to such a slice is actually sent. Under the clean event $\mathcal{E}_1$, the set $X_\epsilon$ is built in such a way that Action-Oracle returns the desired action $a(x)$ in a number of rounds of the order of $1/\epsilon$, with high probability. This is made formal by the following lemma.

Lemma 2. Under event $\mathcal{E}_1$, given any $\rho \in (0, 1)$ and a normalized slice $x \in X_\epsilon$, if the sender commits to a signaling scheme $\phi : \Theta \to \Delta_S$ with $S := \{s_1, s_2\}$ such that $\phi_\theta(s_1) = x_\theta$ for all $\theta \in \Theta$ during $q := \lceil \frac{1}{\epsilon} \log(1/\rho) \rceil$ rounds, then, with probability at least $1 - \rho$, signal $s_1$ is sent at least once.
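The sketch below illustrates the Action-Oracle idea formalized in Lemma 2: to query a normalized slice $x$, commit to the two-signal scheme with $\phi_\theta(s_1) = x_\theta$ and $\phi_\theta(s_2) = 1 - x_\theta$ until $s_1$ is actually sent, and read off the receiver's feedback. The simulated environment (prior and receiver's utility) is invented for illustration, and tie-breaking is ignored for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated environment, unknown to the learner (values invented for illustration).
mu  = np.array([0.6, 0.4])                      # true prior
u_r = np.array([[1.0, 0.2, 0.0],                # receiver's utility u_theta(a)
                [0.0, 0.5, 1.0]])

def receiver_best_response(weights):
    """Best response under unnormalized posterior weights; tie-breaking ignored here."""
    return int(np.argmax(weights @ u_r))

def action_oracle(x, max_rounds):
    """Query a normalized slice x (Lemma 2): commit to the two-signal scheme with
    phi_theta(s1) = x_theta and phi_theta(s2) = 1 - x_theta until s1 is sent."""
    phi = np.stack([x, 1.0 - x], axis=1)         # rows: states, columns: signals s1, s2
    for _ in range(max_rounds):
        theta = rng.choice(len(mu), p=mu)        # state realization theta_t ~ mu
        s = rng.choice(2, p=phi[theta])          # signal s_t ~ phi_theta
        if s == 0:                               # s1 was sent: the feedback reveals a(x)
            return receiver_best_response(mu * x)
    return None                                  # happens with probability at most rho

print(action_oracle(np.array([0.9, 0.1]), max_rounds=200))
```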
Notice that the signaling scheme used to query an $x \in X_\epsilon$ only employs two signals: one is associated with slice $x$, while the other is crafted so as to make the signaling scheme well defined. The following lemma provides the guarantees of Algorithm 6 in Appendix E.

Lemma 3. Given inputs $X_\epsilon \subseteq X$ and $\zeta \in (0, 1)$ for Algorithm 6, let $L := B + B_\epsilon + B_{\hat\mu}$, where $B$, $B_\epsilon$, and $B_{\hat\mu}$ denote the bit-complexity of the numbers $\mu_\theta u_\theta(a_i)$, of $\epsilon$, and of $\hat\mu$, respectively. Then, under event $\mathcal{E}_1$ and with probability at least $1 - \zeta$, Algorithm 6 outputs a collection $R_\epsilon := \{R_\epsilon(a)\}_{a \in A}$, where each $R_\epsilon(a)$ is a (possibly improper) face of $X_\epsilon(a)$ such that $V_\epsilon(a) \subseteq R_\epsilon(a)$, in a number of rounds $T_2 \leq \tilde{O}\Big( \big( d^7 L + \binom{d+n}{d} \big) \frac{1}{\epsilon} \Big)$.

For ease of presentation, we introduce the clean event for phase 2 of Algorithm 1, defined as follows:

Definition 3 (Phase 2 clean event). $\mathcal{E}_2$ is the event in which $V_\epsilon(a) \subseteq R_\epsilon(a)$ for every $a \in A$.

4.3 Phase 3: Compute-Signaling

Given the collection of polytopes $R_\epsilon := \{R_\epsilon(a)\}_{a \in A}$ returned by the Find-Polytopes procedure and an estimated prior $\hat\mu_t \in \Delta_\Theta$, the Compute-Signaling procedure (Algorithm 3) outputs an approximately-optimal signaling scheme by solving an LP (Program (2) in Algorithm 3).

Algorithm 3 Compute-Signaling
Require: $R_\epsilon := \{R_\epsilon(a)\}_{a \in A}$, $X_\epsilon$, $\hat\mu_t \in \Delta_\Theta$
1: Solve Program (2) for $x^\star := (x^{\star,a})_{a \in A}$:
$$\max_{(x^a)_{a \in A}} \ \sum_{a \in A} \sum_{\theta \in \Theta} \hat\mu_{t,\theta} \, x^a_\theta \, u^s_\theta(a) \quad \text{s.t.} \quad x^a \in \mathcal{R}_\epsilon(a) \ \ \forall a \in A, \qquad \sum_{a \in A} x^a_\theta \leq 1 \ \ \forall \theta \in \Theta \tag{2}$$
2: $S \leftarrow \{s^\star\} \cup \{s_a \mid a \in A\}$
3: for $\theta \in \Theta$ do
4:   set $\phi_\theta(s_a) = x^{\star,a}_\theta$ for all $a \in A$ and $\phi_\theta(s^\star) = 1 - \sum_{a \in A} x^{\star,a}_\theta$
5: return $\phi$

Program (2) maximizes an approximated version of the sender's expected utility over a suitable space of (partially-specified) signaling schemes. These are defined by tuples of slices $(x^a)_{a \in A}$ containing an (unnormalized) slice $x^a \in \mathcal{R}_\epsilon(a)$ for every receiver's action $a \in A$. The objective function being maximized by Program (2) accounts for the sender's approximate utility under each of the slices $x^a$, where the approximation comes from the estimated prior $\hat\mu_t$. The intuitive idea exploited by the LP formulation is that, under slice $x^a$, the receiver always plays the same action $a$ as best response, since $a(x^a) = a$ holds by the way in which $R_\epsilon(a)$ is constructed by Find-Polytopes. In particular, each polytope $\mathcal{R}_\epsilon(a)$ is built so as to include all the unnormalized slices corresponding to the normalized slices in the set $R_\epsilon(a)$. Formally, for every receiver's action $a \in A$, it holds:
$$\mathcal{R}_\epsilon(a) := \big\{ x \in \mathcal{X} \mid x = \alpha x', \ x' \in R_\epsilon(a), \ \alpha \in [0, 1] \big\} \cup \{\mathbf{0}\},$$
where $\mathbf{0}$ denotes the vector of all zeros in $\mathbb{R}^d$. We observe that, since the polytopes $R_\epsilon(a)$ are constructed as the intersection of some halfspaces $\mathcal{H}_{ij}$ and $X_\epsilon$, it is possible to easily build the polytopes $\mathcal{R}_\epsilon(a_i)$ by simply removing the normalization constraint $\sum_{\theta \in \Theta} x_\theta = 1$. Notice that, if $R_\epsilon(a) = \emptyset$, then $\mathcal{R}_\epsilon(a) = \{\mathbf{0}\}$, which implies that action $a$ is never induced as a best response, since $x^a = \mathbf{0}$.

After solving Program (2) for an optimal solution $x^\star := (x^{\star,a})_{a \in A}$, Algorithm 3 employs such a solution to build a signaling scheme $\phi$. This employs a signal $s_a$ for every action $a \in A$, plus an additional signal $s^\star$, namely $S := \{s^\star\} \cup \{s_a \mid a \in A\}$. Specifically, the slice of $\phi$ with respect to $s_a$ is set to be equal to $x^{\star,a}$, while its slice with respect to $s^\star$ is set so as to render $\phi$ a valid signaling scheme (i.e., probabilities over signals sum to one for every $\theta \in \Theta$). Notice that this is always possible thanks to the additional constraints $\sum_{a \in A} x^a_\theta \leq 1$ in Program (2). Moreover, such a slice may not correspond to any normalized slice in $X_\epsilon$. Indeed, in instances where there are some $X(a)$ falling completely outside $X_\epsilon$, this is fundamental to build a valid signaling scheme.
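A minimal sketch of Program (2) and of the construction of $\phi$ in Algorithm 3 is given below, assuming each unnormalized polytope $\mathcal{R}_\epsilon(a)$ is available in halfspace form $\{x \geq 0 : A_a x \leq b_a\}$ (as noted above, such a description follows from the halfspaces $\mathcal{H}_{ij}$ and the constraints defining $X_\epsilon$ once the normalization constraint is dropped). The instance data is invented, and the stacked-variable LP encoding is our own choice, solved here with `scipy.optimize.linprog`.

```python
import numpy as np
from scipy.optimize import linprog

# Toy data (illustrative only): d = 2 states, n = 2 actions.
d, n = 2, 2
mu_hat = np.array([0.55, 0.45])              # estimated prior
u_s = np.array([[0.2, 1.0],                  # sender's utility u^s_theta(a)
                [1.0, 0.1]])                 # rows: states, columns: actions

# Halfspace form of each unnormalized polytope R_eps(a): {x >= 0 : A_half[a] @ x <= b_half[a]}.
A_half = [np.array([[-0.3, 0.5]]), np.array([[0.3, -0.5]])]
b_half = [np.zeros(1), np.zeros(1)]

# Stack the variables (x^a)_{a in A} into a single vector of length n * d.
c = -np.concatenate([mu_hat * u_s[:, a] for a in range(n)])   # maximize -> minimize the negation

A_rows, b_rows = [], []
for a in range(n):                                            # x^a must lie in R_eps(a)
    block = np.zeros((A_half[a].shape[0], n * d))
    block[:, a * d:(a + 1) * d] = A_half[a]
    A_rows.append(block)
    b_rows.append(b_half[a])
A_rows.append(np.tile(np.eye(d), n))                          # sum_a x^a_theta <= 1 for every theta
b_rows.append(np.ones(d))

res = linprog(c, A_ub=np.vstack(A_rows), b_ub=np.concatenate(b_rows),
              bounds=[(0, 1)] * (n * d))
x_star = res.x.reshape(n, d)

# Signaling scheme: one signal per action plus the extra signal s*.
phi = {f"s_{a}": x_star[a] for a in range(n)}
phi["s_star"] = 1.0 - x_star.sum(axis=0)
print(phi)
```

In the actual algorithm, the constraints come from the exactly-learned separating hyperplanes, so the receiver's best response under each slice $x^{\star,a}$ is guaranteed to be $a$.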
Intuitively, one may think of $s^\star$ as incorporating all the missing signals in $\phi$, namely those corresponding to actions $a \in A$ with $X(a)$ outside $X_\epsilon$. The following lemma formally states the theoretical guarantees provided by Algorithm 3.

Lemma 4. Given inputs $R_\epsilon := \{R_\epsilon(a)\}_{a \in A}$, $X_\epsilon$, and $\hat\mu_t \in \Delta_\Theta$ for Algorithm 3, under events $\mathcal{E}_1$ and $\mathcal{E}_2$, the signaling scheme $\phi$ output by the algorithm is $O(\epsilon n d + \nu)$-optimal for $\nu := \sum_{\theta \in \Theta} |\hat\mu_{t,\theta} - \mu_\theta|$.

In order to provide some intuition on how Lemma 4 is proved, let us assume that each polytope $X_\epsilon(a)$ is either empty or has volume larger than zero, implying that $R_\epsilon(a) = X_\epsilon(a)$. In Appendix D, we provide the complete formal proof of Lemma 4, working even with zero-measure non-empty polytopes. The first observation that we need is that the sender's expected utility under a signaling scheme $\phi$ can be decomposed across its slices, with each slice $x$ providing a utility of $\sum_{\theta \in \Theta} \mu_\theta x_\theta u^s_\theta(a(x))$. The second crucial observation is that there always exists an optimal signaling scheme $\phi^\star$ that is direct and persuasive, which means that $\phi^\star$ employs only one slice $x^a$ for each action $a \in A$, with $a$ being a best response for the receiver under $x^a$. It is possible to show that the slices $x^a$ that also belong to $X_\epsilon$ can be used to construct a feasible solution to Program (2), since $x^a \in \mathcal{R}_\epsilon(a)$ by definition. Thus, restricted to those slices, the signaling scheme $\phi$ computed by Algorithm 3 achieves an approximate sender's expected utility that is greater than or equal to the one achieved by $\phi^\star$. Moreover, the loss due to dropping the slices that are not in $X_\epsilon$ can be bounded thanks to point (ii) in Lemma 1. Finally, it remains to account for the approximation due to using $\hat\mu_t$ instead of the true prior in the objective of Program (2). All the observations above allow us to bound the sender's expected utility loss as in Lemma 4.

5 Lower bounds for online Bayesian persuasion

In this section, we present two lower bounds on the regret attainable in the setting faced by Algorithm 1. The first lower bound shows that an exponential dependence on the number of states of nature $d$ and the number of receiver's actions $n$ is unavoidable. This shows that one cannot get rid of the binomial coefficient in the regret bound of Algorithm 1 provided in Theorem 1. Formally:

Theorem 2. For any sender's algorithm, there exists a Bayesian persuasion instance in which $n = d + 2$ and the regret $R_T$ suffered by the algorithm is at least $2^{\Omega(d)}$, or, equivalently, $2^{\Omega(n)}$.

Theorem 2 is proved by constructing a collection of Bayesian persuasion instances in which an optimal signaling scheme has to induce the receiver to take an action that is a best response only for a unique posterior belief (among those computable by the receiver at step (3) of the interaction). This posterior belief belongs to a set of possible candidates having size exponential in the number of states of nature $d$. As a result, in order to learn such a posterior belief, any algorithm has to commit to a number of signaling schemes that is exponential in $d$ (and, given how the instances are built, in $n$).

The second lower bound shows that the regret bound attained by Algorithm 1 is tight in $T$.

Theorem 3. For any sender's algorithm, there exists a Bayesian persuasion instance in which the regret $R_T$ suffered by the algorithm is at least $\Omega(\sqrt{T})$.

To prove Theorem 3, we construct two Bayesian persuasion instances with $\Theta = \{\theta_1, \theta_2\}$ such that, in the first instance, $\mu_{\theta_1}$ is slightly greater than $\mu_{\theta_2}$, while the opposite holds in the second instance.
Furthermore, the two instances are built so that the sender does not gain any information that helps to distinguish between them by committing to signaling schemes. As a consequence, to make a distinction, the sender can only leverage the information gained by observing the states of nature realized at each round, and this clearly results in the regret being at least $\Omega(\sqrt{T})$.

6 The sample complexity of Bayesian persuasion

In this section, we show how the no-regret learning algorithm developed in Section 4 can be easily adapted to solve a related Bayesian persuasion PAC-learning problem. Specifically, given an (additive) approximation error $\gamma \in (0, 1)$ and a probability $\eta \in (0, 1)$, the goal of such a problem is to learn a $\gamma$-optimal signaling scheme with probability at least $1 - \eta$, by using the minimum possible number of rounds. This can also be referred to as the sample complexity of learning signaling schemes. As in the regret-minimization problem addressed in Section 4, we assume that the sender does not know anything about both the prior distribution $\mu$ and the receiver's utility function $u$.

Algorithm 4 PAC-Persuasion-w/o-Clue
Require: $\gamma \in (0, 1)$, $\eta \in (0, 1)$
1: $\delta \leftarrow \eta/2$, $\zeta \leftarrow \eta/2$, $\epsilon_1 \leftarrow \gamma/12nd$, $t \leftarrow 1$
2: $\epsilon \leftarrow$ Compute-Epsilon($\epsilon_1$)
3: $T_1 \leftarrow \lceil \frac{1}{2\epsilon^2} \log(2d/\delta) \rceil$
4: $X_\epsilon \leftarrow$ Build-Search-Space($T_1$, $\epsilon$)
5: $R_\epsilon \leftarrow$ Find-Polytopes($X_\epsilon$, $\zeta$)
6: $\phi \leftarrow$ Compute-Signaling($R_\epsilon$, $X_\epsilon$, $\hat\mu$)
7: return $\phi$

We tackle the Bayesian persuasion PAC-learning problem with a suitable adaptation of Algorithm 1, provided in Algorithm 4. The first two phases of the algorithm follow the same lines as Algorithm 1, with the Build-Search-Space and Find-Polytopes procedures being called with suitably-defined parameters $\epsilon$, $\delta$, $T_1$, and $\zeta$ (taking different values with respect to their counterparts in Algorithm 1). In particular, the value of $\epsilon$ depends on $\gamma$ and is carefully computed so as to control the bit-complexity of the numbers used in the Find-Polytopes procedure (see Lemma 3), as detailed in Appendix G. Finally, in its third phase, the algorithm calls Compute-Signaling to compute a signaling scheme $\phi$ that can be proved to be $\gamma$-optimal with probability at least $1 - \eta$.

The most relevant difference between Algorithm 4 and Algorithm 1 is the number of rounds used to build the prior estimate defining $X_\epsilon$. Specifically, while the latter has to employ $T_1$ of the order of $1/\epsilon$ and rely on a multiplicative Chernoff bound to get tight regret guarantees, the former has to use $T_1$ of the order of $1/\epsilon^2$ and standard concentration inequalities to get an $O(\epsilon)$-optimal solution. Formally:

Lemma 5. Given $T_1 := \lceil \frac{1}{2\epsilon^2} \log(2d/\delta) \rceil$ and $\epsilon \in (0, 1)$, Algorithm 2 employs $T_1$ rounds and outputs $X_\epsilon \subseteq X$ such that, with probability at least $1 - \delta$: (i) $\sum_{\theta \in \Theta} \mu_\theta x_\theta \geq \epsilon$ for every slice $x \in X_\epsilon$, (ii) $\sum_{\theta \in \Theta} \mu_\theta x_\theta \leq 6\epsilon$ for every slice $x \in X \setminus X_\epsilon$, and (iii) $|\hat\mu_\theta - \mu_\theta| \leq \epsilon$ for every $\theta \in \Theta$.

By Lemma 5, it is possible to show that the event $\mathcal{E}_1$ holds. Hence, the probability that a signaling scheme including a slice $x \in X_\epsilon$ actually induces such a slice is at least $\epsilon$, and, thus, the results concerning the second phase of Algorithm 1 are valid also in this setting. Finally, whenever the events $\mathcal{E}_1$ and $\mathcal{E}_2$ hold, we can provide an upper bound on the number of rounds required by Algorithm 4 to compute a $\gamma$-optimal signaling scheme as desired. Formally:

Theorem 4. Given $\gamma \in (0, 1)$ and $\eta \in (0, 1)$, with probability at least $1 - \eta$, Algorithm 4 outputs a $\gamma$-optimal signaling scheme in a number of rounds $T$ such that $T \leq \tilde{O}\Big( \frac{n^2 d^2}{\gamma^2} + \Big( d^8 B + d \binom{d+n}{d} \Big) \frac{n}{\gamma} \Big)$.

We conclude by providing two negative results showing that the result above is tight.

Theorem 5.
There exist two absolute constants κ, λ > 0 such that no algorithm is guaranteed to return a κ-optimal signaling scheme with probability of at least 1 λ by employing less than 2Ω(n) and 2Ω(d) rounds, even when the prior distribution µ is known to the sender. Theorem 6. Given γ (0, 1/8) and η (0, 1), no algorithm is guaranteed to return a γ-optimal signaling scheme with probability at least 1 η by employing less than Ω 1 γ2 log(1/η) rounds. In Appendix H , we also study the case in which the prior µ is known to the sender. In such a case, we show that the sample complexity can be improved by a factor 1/γ, which is tight. Acknowledgments This work was supported by the Italian MIUR PRIN 2022 Project Targeted Learning Dynamics: Computing Efficient and Fair Equilibria through No-Regret Algorithms , by the FAIR (Future Artificial Intelligence Research) project, funded by the Next Generation EU program within the PNRRPE-AI scheme (M4C2, Investment 1.3, Line on Artificial Intelligence), and by the EU Horizon project ELIAS (European Lighthouse of AI for Sustainability, No. 101120237). This work was also partially supported by project SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the European Union - Next Generation EU. Francesco Bacchiocchi was also supported by the G-Research November Grant. Shipra Agrawal, Yiding Feng, and Wei Tang. Dynamic pricing and learning with bayesian persuasion. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 59273 59285, 2023. Ricardo Alonso and Odilon Câmara. Persuading voters. American Economic Review, 2016. Yakov Babichenko, Inbal Talgam-Cohen, Haifeng Xu, and Konstantin Zabarnyi. Regret-minimizing Bayesian persuasion. Games and Economic Behavior, 136:226 248, 2022. ISSN 0899-8256. Francesco Bacchiocchi, Matteo Castiglioni, Alberto Marchesi, Giulia Romano, and Nicola Gatti. Public signaling in Bayesian ad auctions. In IJCAI, 2022. Francesco Bacchiocchi, Matteo Bollini, Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. The sample complexity of Stackelberg games. ar Xiv preprint ar Xiv:2405.06977, 2024a. Francesco Bacchiocchi, Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. Learning optimal contracts: How to exploit small action spaces. In The Twelfth International Conference on Learning Representations, 2024b. Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. Markov persuasion processes: Learning to persuade from scratch. ar Xiv preprint ar Xiv:2402.03077, 2024c. Ashwinkumar Badanidiyuru, Kshipra Bhawalkar, and Haifeng Xu. Targeting and signaling in ad auctions. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 2018. Yu Bai, Chi Jin, Huan Wang, and Caiming Xiong. Sample-efficient learning of stackelberg equilibria in general-sum games. Advances in Neural Information Processing Systems, 34:25799 25811, 2021. Martino Bernasconi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti, and Francesco Trovò. Sequential information design: Learning to persuade in the dark. In Advances in Neural Information Processing Systems, volume 35, pages 15917 15928, 2022. Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Francesco Trovò, and Nicola Gatti. Optimal rates and efficient algorithms for online Bayesian persuasion. In Proceedings of the 40th International Conference on Machine Learning, pages 2164 2183, 2023. 
Umang Bhaskar, Yu Cheng, Young Kun Ko, and Chaitanya Swamy. Hardness results for signaling in bayesian zero-sum and network routing games. In EC, 2016. Peter Bro Miltersen and Or Sheffet. Send mixed signals: earn more, work less. In EC, 2012. Modibo K. Camara, Jason D. Hartline, and Aleck Johnsen. Mechanisms for a no-regret agent: Beyond the common prior. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 259 270, 2020. Matteo Castiglioni and Nicola Gatti. Persuading voters in district-based elections. In AAAI, 2021. Matteo Castiglioni, Andrea Celli, and Nicola Gatti. Persuading voters: It s easy to whisper, it s hard to speak loud. In AAAI, 2020a. Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Online Bayesian persuasion. Advances in Neural Information Processing Systems, 33:16188 16198, 2020b. Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Signaling in Bayesian network congestion games: the subtle power of symmetry. In AAAI, 2021a. Matteo Castiglioni, Alberto Marchesi, Andrea Celli, and Nicola Gatti. Multi-receiver online Bayesian persuasion. In Proceedings of the 38th International Conference on Machine Learning, pages 1314 1323, 2021b. Matteo Castiglioni, Giulia Romano, Alberto Marchesi, and Nicola Gatti. Signaling in posted price auctions. In AAAI, 2022. Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Regret minimization in online Bayesian persuasion: Handling adversarial receiver s types under full and partial feedback models. Artificial Intelligence, 314:103821, 2023. Alon Cohen, Argyrios Deligkas, and Moran Koren. Learning approximately optimal contracts. In Algorithmic Game Theory: 15th International Symposium, SAGT 2022, page 331 346, 2022. Lee Cohen and Yishay Mansour. Optimal algorithm for Bayesian incentive-compatible exploration. In EC, 2019. Shaddin Dughmi and Haifeng Xu. Algorithmic Bayesian persuasion. In STOC, 2016. Shaddin Dughmi and Haifeng Xu. Algorithmic persuasion with no externalities. In EC, 2017. Yuval Emek, Michal Feldman, Iftah Gamzu, Renato Paes Leme, and Moshe Tennenholtz. Signaling schemes for revenue maximization. ACM Transactions on Economics and Computation, 2014. Yiding Feng, Wei Tang, and Haifeng Xu. Online bayesian recommendation with no regret. In Proceedings of the 23rd ACM Conference on Economics and Computation, page 818 819, 2022. Michal Forišek. Approximating rational numbers by fractions. In Fun with Algorithms, pages 156 165. Springer Berlin Heidelberg, 2007. Jiarui Gan, Rupak Majumdar, Debmalya Mandal, and Goran Radanovic. Sequential principalagent problems with communication: Efficient computation and learning. ar Xiv preprint ar Xiv:2306.03832, 2023. Itay Goldstein and Yaron Leitner. Stress tests and information disclosure. Journal of Economic Theory, 177:34 69, 2018. Chien-Ju Ho, Aleksandrs Slivkins, and Jennifer Wortman Vaughan. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, page 359 376, 2014. Emir Kamenica and Matthew Gentzkow. Bayesian persuasion. American Economic Review, 101(6): 2590 2615, 2011. Anton Kolotilin. Experimental design to persuade. Games and Economic Behavior, 90:215 226, 2015. Niklas Lauffer, Mahsa Ghasemi, Abolfazl Hashemi, Yagiz Savas, and Ufuk Topcu. No-regret learning in dynamic Stackelberg games. ar Xiv preprint ar Xiv:2202.04786, 2024. 
Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In Algorithmic Game Theory: Second International Symposium, SAGT 2009, pages 250 262, 2009. Tao Lin and Ce Li. Information design with unknown prior, 2025. URL https://arxiv.org/abs/ 2410.05533. Yishay Mansour, Alex Slivkins, Vasilis Syrgkanis, and Zhiwei Steven Wu. Bayesian exploration: Incentivizing exploration in Bayesian games. Operations Research, 70(2):1105 1127, 2022. Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. In AAAI Conference on Artificial Intelligence, volume 33, pages 2149 2156, 2019. Zinovi Rabinovich, Albert Xin Jiang, Manish Jain, and Haifeng Xu. Information disclosure as a means to security. In AAMAS, 2015. Shoshana Vasserman, Michal Feldman, and Avinatan Hassidim. Implementing the wisdom of waze. In IJCAI, 2015. Jibang Wu, Zixuan Zhang, Zhe Feng, Zhaoran Wang, Zhuoran Yang, Michael I. Jordan, and Haifeng Xu. Sequential information design: Markov persuasion process and its efficient reinforcement learning. In EC, 2022. Haifeng Xu. On the tractability of public persuasion with no externalities. In SODA, 2020. Haifeng Xu, Rupert Freeman, Vincent Conitzer, Shaddin Dughmi, and Milind Tambe. Signaling in Bayesian Stackelberg games. In AAMAS, 2016. Banghua Zhu, Stephen Bates, Zhuoran Yang, Yixin Wang, Jiantao Jiao, and Michael I. Jordan. The sample complexity of online contract design. In Proceedings of the 24th ACM Conference on Economics and Computation, page 1188, 2023. You Zu, Krishnamurthy Iyer, and Haifeng Xu. Learning to persuade on the fly: Robustness against ignorance. In EC, pages 927 928, 2021. The appendixes are organized as follows: Appendix A provides a discussion of the previous works most related to ours. Appendix B presents additional preliminaries. Appendix C presents the omitted proofs from Section 4. Appendix D presents the proof of Lemma 4 from Section 4.3. Appendix E provides a description of the results and the procedures employed in phase 2. Appendix F presents the omitted proofs from Section 5. Appendix G presents the omitted proofs and some technical details from Section 6. Appendix H discusses the PAC-learning problem when the prior is known. A Additional Related Works Learning in Bayesian persuasion settings In addition to the works presented in Section 1.2, the problem of learning optimal signaling schemes in Bayesian persuasion settings has received growing attention over the last few years. Camara et al. [2020] study an adversarial setting where the receiver does not know the prior, and the receiver s behavior is aimed at minimizing internal regret. Babichenko et al. [2022] consider an online Bayesian persuasion setting with binary actions when the prior is known, and the receiver s utility function has some regularities. Feng et al. [2022] study the online Bayesian persuasion problem faced by a platform that observes some relevant information about the state of a product and repeatedly interacts with a population of myopic receivers through a recommendation mechanism. Agrawal et al. [2023] design a regret-minimization algorithm in an advertising setting based on the Bayesian persuasion framework, assuming that the receiver s utility function satisfies some regularity conditions. 
Online learning in problems with commitment From a technical point of view, our work is related to the problem of learning optimal strategies in Stackelberg games when the leader has no knowledge of the follower s utility. Letchford et al. [2009] propose the first algorithm to learn optimal strategies in Stackelberg games. Their algorithm is based on an initial random sampling that may require an exponential number of samples, both in the number of leader s actions m and in the representation precision L. Peng et al. [2019] improve the algorithm of Letchford et al. [2009], while Bacchiocchi et al. [2024a] further improve the approach by Peng et al. [2019] by relaxing some of their assumptions. Furthermore, our work is also related to the problem of learning optimal strategies in Stackelberg games where the leader and the follower interaction is modelled by a Markov Decision Process. Lauffer et al. [2024] study Stackelberg games with a state that influences the leader s utility and available actions. Bai et al. [2021] consider a setting where the leader commits to a pure strategy and observes a noisy measurement of their utility. Finally, our work is also related to online hidden-action principal-agent problems, in which a principal commits to a contract at each round to induce an agent to take favorable actions. Ho et al. [2014] initiated the study by proposing an algorithm that adaptively refines a discretization over the space of contracts, framing the model as a multi-armed bandit problem where the discretization provides a finite number of arms to play with. Cohen et al. [2022] similarly work in a discretized space but with milder assumptions. Zhu et al. [2023] provide a more general algorithm that works in hidden-action principal-agent problems with multiple agent types. Finally, Bacchiocchi et al. [2024b] study the same setting and propose an algorithm with smaller regret when the number of agent actions is small. B Additional preliminaries B.1 Additional preliminaries on Bayesian persuasion In step (3) of the sender-receiver interaction presented in Section 2.1, after observing s S, the receiver performs a Bayesian update and infers a posterior belief ξs Θ over the states of nature, according to the following equation: ξs θ = µθ ϕθ(s) P θ Θ µθ ϕθ (s) θ Θ. (3) Consequently, given a signaling scheme ϕ, we can equivalently represent it as a distribution over the set of posteriors it induces. Formally, we say that ϕ induces γ : Θ [0, 1] if, for each posterior distribution ξ Θ, we have: θ Θ µθϕθ(s) and X ξ supp(γ) γ(ξ) = 1. Furthermore, we say that a distribution over a set of posteriors γ is consistent, i.e., there exists a valid signaling scheme ϕ inducing γ if the following holds: X ξ supp(γ) γ(ξ)ξθ = µθ θ Θ. With an abuse of notation, we will sometimes refer to a consistent distribution over a set of posteriors γ as a signaling scheme. This is justified by the fact that there exists a signaling scheme ϕ inducing such distribution, but we are interested only in the distribution over the set of posteriors that ϕ induces. B.2 Additional preliminaries on the representation of numbers In the following, we assume that all the numbers manipulated by our algorithms are rational. Furthermore, we assume that rational numbers are represented as fractions, by specifying two integers which encode their numerator and denominator. 
Given a rational number q Q represented as a fraction b/c with b, c Z, we denote the number of bits that q occupies in memory, called bit-complexity, as Bq := Bb + Bc, where Bb (Bc) is the number of bits required to represent the numerator (denominator). For the sake of the presentation, with an abuse of terminology, given a vector in QD of D rational numbers represented as fractions, we let its bit-complexity be the maximum bit-complexity among its entries. Furthermore, we assume that the bit-complexity encoding both the receiver s utility and the prior distribution is bounded. Formally, we denote by Bµ the bit-complexity of the prior µ, while we assume Bu to be an upper bound to the bit-complexity of each d-dimensional vector uθ(a) with θ Θ. Moreover, we let B := Bµ + Bu. Finally, we also denote with Bϵ the bit-complexity of the parameter ϵ computed by our algorithms, while we denote with Bbµ the bit-complexity of the estimator bµ computed by Algorithm 2. C Omitted proofs from Section 4 Lemma 1. Given T1 := 12 ϵ log (2d/δ) and ϵ (0, 1/6d), Algorithm 2 employs T1 rounds and terminates with a set Xϵ X such that, with probability at least 1 δ: (i) P θ Θ µθxθ ϵ for every slice x Xϵ and (ii) P θ Θ µθxθ 10ϵ for every slice x X \ Xϵ. Proof. For each θ Θ we consider two possible cases. 1. If µθ > ϵ, then we employ the multiplicative Chernoff inequality as follows: P |µθ bµθ| 1 where T1 N+ is the number of rounds employed to estimate bµθ. As a result, by setting the number of rounds to estimate µθ equal to T1 = 12/ϵ log (2d/δ) we get: P |µθ bµθ| 1 since µθ > ϵ. Then, we get: Consequently, with a probability of at least 1 δ/d the estimator bµθ is such that bµθ [µθ/2, 3µθ/2]. Thus, if µθ 6ϵ, then bµθ 3ϵ > 2ϵ and θ eΘ. We also notice that there always exits a θ Θ such that µθ 1/d 6ϵ, since ϵ (0, 1/6d). Consequently, with a probability of at least 1 δ/d, there always exist a θ eΘ. 2. If µθ ϵ, then we employ the multiplicative Chernoff inequality as follows: P (bµθ (1 + c)µθ) e c2T1µθ with c = ϵ/µθ. Thus, by setting the number of rounds employed to estimate µθ equal to T1 = 12/ϵ log (2d/δ) , we get: P (bµθ µθ + ϵ) exp since x/(x + 2) 1/3, for each x 1. As a result, we have: P (bµθ µθ + ϵ) 1 δ Thus, with a probability of at least 1 δ/d, if µθ ϵ, then bµθ µθ + ϵ, which implies that bµθ 2ϵ. Furthermore, if µθ ϵ, then bµθ 2ϵ and θ eΘ. Thus, by employing a union bound over the set of natures, we have that if µθ ϵ, then its corresponding estimate bµθ falls within the interval [0, 2ϵ] and θ eΘ, while if µθ 6ϵ, then its corresponding estimate is such that bµθ > 2ϵ and θ eΘ, with a probability of at least 1 δ. We also notice that, with the same probability, the set eΘ is always non empty. Consequently, for each slice x Xϵ with respect to a signal s, the probability of observing s can be lower bounded as follows: θ eΘ bµθxθ 3 θ eΘ µθxθ X θ eΘ µθxθ X where the inequalities above hold because of the definition of Xϵ and observing that each θ eΘ satisfies Equation 4 with probability at least 1 δ. Furthermore, for each x Xϵ, the two following conditions hold: θ eΘ µθxθ X θ eΘ bµθxθ 2ϵ and X θ eΘ µθxθ 6ϵ X θ eΘ xθ 6ϵ, with probability at least 1 δ. Thus, by putting the two inequalities above together, for each x Xϵ, we have: X θ Θ µθxθ 10ϵ, with probability at least 1 δ, concluding the proof. Lemma 2. 
Under event E1, given any ρ (0, 1) and a normalized slice x Xϵ, if the sender commits to a signaling scheme ϕ : Θ S := {s1, s2} such that ϕθ(s1) = xθ for all θ Θ during q := 1 ϵ log(1/ρ) rounds, then, with probability at least 1 ρ, signal s1 is sent at least once. Proof. In the following, we let τ be the first round in which the sender commits to ϕ. The probability of observing the signal s1 at a given round t τ is can be lower bounded as follows: P st = s1 = X θ Θ µθxθ ϵ, where the inequality holds under the event E1. Thus, at each round, the probability of sampling the signal s1 S is greater or equal to ϵ > 0. Consequently, the probability of never observing the signal s1 S in q rounds is given by: t=τ {st = s1} where the last inequality holds by taking q = l log(ρ) log(1 ϵ) m l log(1/ρ) ϵ m , for each ϵ (0, 1). As a result, the probability of observing the signal s1 at least once in q rounds is greater or equal to: t=τ {st = s1} t=τ {st = s1} concluding the proof. Theorem 1. The regret attained by Algorithm 1 is RT e O d+n d n 3/2d3 Proof. In the following, we let δ = ζ = 1 T , as defined in Algorithm 1. To prove the theorem, we decompose the regret suffered in the three phases of Algorithm 1: 1. Phase 1. We observe that the number of rounds to execute the Build-Search-Space procedure (Algorithm 2) is equal to T1 = O (1/ϵ log (1/δ) log(d)) . Thus, the cumulative regret of Phase 1 can be upper bounded as follows: ϵ log(T) log(d) e O This is because, at each round, the regret suffered during the execution of Algorithm 2 is at most one. 2. Phase 2. Under the event E1, which holds with probability 1 δ, Algorithm 6 correctly terminates with probability 1 ζ. Thus, with probability at least 1 δ ζ, the number of rounds employed by such algorithm is of the order: d7(B + Bϵ + Bbµ) + d + n d ϵ log2(T) d7(B + Bϵ) + d + n d where the last equality holds because Bbµ = O(log(1/ϵ) + log(d) + log(T)). As a result, by taking the expectation, the regret suffered in Phase 2 by Algorithm 1 can be upper bounded as follows: R2 T e O n 3/2d3 d + n d since, at each round, the regret suffered during the execution of Algorithm 6 is at most one. 3. Phase 3. Let τ be the number of rounds required by Phase 1 and Phase 2 to terminate. Under the events E1 and E2, which hold with probability at least 1 δ ζ, thanks to Lemma 4, the solution returned by Algorithm 3 at each round t > τ is O(dnϵ + νt)-optimal, where we define νt = P θ Θ µθ bµt,θ . We introduce the following event: Et = {|µθ bµt,θ| ϵt θ Θ} , where we let ϵt > 0 be defined as follows: log (2d T/ι) 2(t τ) , t > τ. Then, by Hoeffding s inequality and a union bound we have: t>τ Et 1 ι. Thus, by setting ι = 1/T, the regret suffered in Phase 3 by Algorithm 1 can be upper bounded as follows: θ Θ µθ bµt,θ t=τ+1 ϵt + O (dnϵT) T + n 3/2d5 As a result, the regret of Algorithm 1 is in the order of: RT e O n 3/2d3 d + n d BT + n 3/2d5 BT = e O n 3/2d3 d + n d concluding the proof. D Proof of Lemma 4 from Section 4.3 In order to prove Lemma 4, we first consider an auxiliary LP (Program 5a) that works on the vertices of the regions Xϵ(a). This is useful to take into account the polytopes Xϵ(a) with null volume. Indeed, for every action a A such that vol(Xϵ(a)) = 0, Algorithm 3 takes in input only a face Rϵ(a) of Xϵ(a) such that Vϵ(a) V (Rϵ(a)). By working on the vertices of the regions Xϵ(a), we can show that the vertices in Vϵ(a) are sufficient to compute an approximately optimal signaling scheme. 
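Before stating the LP formally, the following sketch (illustrative only, assuming NumPy and SciPy are available; it is not part of our algorithms) shows how a point of a polytope can be written as a convex combination of its vertices by solving a small feasibility LP. This is the decomposition exploited by Lemma 6 and by Program (5a) below, here on a toy triangle in $\mathbb{R}^2$.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative sketch: express a point x of a polytope as a convex combination of its
# vertices (rows of V) by solving a feasibility LP with a zero objective.
V = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])               # vertices of a toy triangle
x = np.array([0.25, 0.5])                # a point inside the triangle

k = V.shape[0]
A_eq = np.vstack([V.T, np.ones((1, k))]) # sum_i alpha_i * V_i = x  and  sum_i alpha_i = 1
b_eq = np.concatenate([x, [1.0]])
res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k)

assert res.success
alpha = res.x                            # convex weights over the vertices
assert np.allclose(alpha @ V, x)
```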
The auxiliary LP that works on the vertices is the following: θ Θ µθxθus θ(a(x)) (5a) x Vϵ αxxθ 1 θ Θ. (5b) Program 5a takes in input the set of vertices Vϵ := S a A V (Xϵ(a)), along with the corresponding best-responses (a(x))x Vϵ and the exact prior µ. It then optimizes over the non-negative variables αx 0, one for vertex x Vϵ. These variables αx act as weights for the corresponding slices x Vϵ, identifying a non-normalized slice αxx X . In the following, we show that the value of an optimal solution v to Program 5a is at least v OPT O(ϵnd). Then, we prove that the signaling scheme ϕ computed by Algorithm 3 achieves a principal s expected utility of at least v minus a quantity related to the difference between the estimated prior bµt and with the actual prior µ. Thus, by considering that v OPT O(ϵnd), we will be able to prove Lemma 4. As a first step, we show that it is possible to decompose each slice x Xϵ into a weighted sum of the vertices x Vϵ without incurring a loss in the sender s utility. Thus, a generic slice x of an optimal signaling scheme can be written as a convex combination of slices x with x Vϵ. This property is formalized in the following lemma. Lemma 6. For every x Xϵ, there exists a distribution α Vϵ such that: x Vϵ αx x θ θ Θ. Furthermore, the following holds: X θ Θ µθxθus θ(a(x)) X θ Θ µθx θus θ(a(x )). Proof. Let a := a(x). Since x Xϵ(a), by the Carathéodory theorem, there exists an α V (Xϵ(a)) such that: X x V (Xϵ(a)) αx x θ = xθ θ Θ. Furthermore: θ Θ µθxθus θ(a) = X x V (Xϵ(a)) αx x θ x V (Xϵ(a)) αx X θ Θ µθx θus θ(a) x V (Xϵ(a)) αx X θ Θ µθx θus θ(a(x )) where the inequality holds because the receiver breaks ties in favor of the sender. Finally, we observe that for each distribution over the set V (Xϵ(a)) for a given Xϵ(a), we can always recover a probability distribution supported in Vϵ, since V (Xϵ(a)) Vϵ by construction. Thanks to the the result above, in the next lemma (Lemma 7) we prove that an optimal solution of Program 5a has value at least v OPT 10ϵnd. To show this, we begin by observing that there exists a set J of slices of the optimal signaling scheme that belong to the search space Xϵ. By applying Lemma 6 to each of these slices, we obtain a feasible solution for Program 5a. The value of this solution is at least the sender s expected utility given by the slices in J . Finally, thanks to the properties of the search space Xϵ, we can bound the expected sender s utility provided by the slices that lie outside the search space. Lemma 7. Under the event E1, the optimal solution of Program 5a has value at least v OPT 10ϵnd. Proof. In the following we let ϕθ A for each θ Θ be an optimal signaling scheme, where we assume, without loss of generality, such a signaling scheme to be direct, meaning that S = A and a Aϕ(a) for every action a A. Furthermore, for each action a supp(ϕ), we define: xa θ = ϕθ(a) P θ Θ ϕθ(a) θ Θ and αxa = X We observe that each xa X(a), indeed we have: X θ Θ µθxθuθ(a) = 1 αxa θ Θ µθϕθ(a)uθ(a) 1 αxa θ Θ µθϕθ(a)uθ(a ) = X θ Θ µθxθuθ(a ). for every action a A. We define the subset of actions A A in a way that if a A , then xa Xϵ, i.e., A := {a A | xa Xϵ}. Then, we let A := supp(ϕ) \ A . Furthermore, for each a A , thanks to Lemma 6, there exists a distribution αa Vϵ such that: x Vϵ αa x x θ, and the following holds: θ Θ µθxa θuθ(a) = X x Vϵ αa x x θ x Vϵ µθαa x x θus θ(a) x Vϵ µθαa x x θus θ(a(x )). (6) We also define α : Vϵ R+ as follows: a A αa x αxa, for each x Vϵ. First, we show that α : Vϵ R+ is a feasible solution to LP 5a. 
Indeed, for each θ Θ, it holds: X x Vϵ α x x θ = X a A αa x αxax θ x Vϵ αa x x θ a A αxaxa θ a A ϕθ(a) 1 Where the equalities above holds thanks to Equation 6 and the definition of α . Then, we show that the utility achieved by α : Vϵ Rm + is greater or equal to OPT 10ϵd. Formally, we have: X θ Θ µθx θus θ(a(x )) = X a A αxaαa x X θ Θ µθx θus θ(a(x )) θ Θ µθαa x x θus θ(a(x )) θ Θ xa θµθus θ(a) θ Θ µθϕθ(a)us θ(a) θ Θ µθϕθ(a)us θ(a) θ Θ µθαxaxa θus θ(a) θ Θ µθxa θus θ(a) Where the first inequality holds thanks to Inequality (6), the second inequality holds since αxa d and the last inequality holds since, for each x Xϵ, it holds P θ Θ µθxa θ(a) 10ϵ, under the event E1 and xa / Xϵ for every a A . Consequently, α is a feasible solution to LP 5a and provides, under the event E1, a value of at least OPT 10ϵnd. As a result, under the event E1 the optimal solution of LP 5a has value v OPT 10ϵnd, concluding the proof. Lemma 4. Given inputs Rϵ := {Rϵ(a)}a A, Xϵ, and bµt Θ for Algorithm 3, under events E1 and E2, the signaling scheme ϕ output by the algorithm is O(ϵnd+ν)-optimal for ν P θ Θ bµt,θ µθ . Proof. We observe that under the events E1 and E2, the collection Rϵ = {Rϵ(a)}a A is composed of faces Rϵ(a) of Xϵ(a) (possibly the improper face Xϵ(a) itself) such that every vertex x V (Xϵ(a)) that satisfies a(x) = a belongs to Rϵ(a). The following statements hold under these two events. As a first step, we prove that given a feasible solution α = (αx)x Vϵ to Program 5a, one can construct a feasible solution φ to Program 2 with the same value. In particular, we consider a solution φ = (exa)a A defined as: exa θ := X x Vϵ(a) αxxθ a A, θ Θ. We observe that for every a A and θ Θ we can bound exa θ as follows: 0 exa θ = X x Vϵ(a) αxxθ X x Vϵ αxxθ 1, where the last inequality holds due to the constraints of Program 5a. Consequently, the vectors exa belong to X . Now we show that exa belongs to Rϵ(a) for every a A. This holds trivially for every action a A such that P x Vϵ(a) αx = 0, as exa = 0 Rϵ(a). Consider instead an action a A such that P x Vϵ(a) αx > 0, and let us define the coefficient: βa x := αx P x Vϵ(a) αx , for every vertex x Vϵ(a). One can easily verify that βa Vϵ(a). Now consider the normalized slice: ex N,a := X x Vϵ(a) βa xx. This slice belongs to R ϵ (a) := Rϵ(a), as it is the weighted sum of the vertices Vϵ(a) V (R ϵ (a)) with weights βa Vϵ(a). Furthermore, we can rewrite the component exa of the solution ϕ as: exa = ex N,a X x Vϵ(a) αx. Thus, by considering that ex N,a R ϵ (a) and exa X , we have that exa Rϵ(a). Finally, since α is a feasible solution for Program 5a, we can observe that: X a A exa θ = X x Vϵ(a) αxx = X x Vϵ αxx 1. As a result, φ = (exa)a A is a feasible solution to Program 2. If the estimator bµt coincides with the exact prior µ, then direct calculations show that the solution φ to Program 2 achieves the same value of the solution α to Program 5a. It follows that, when bµt = µ, the optimal solution of Program 2 has at least the same value of the optimal solution of Program 5a. In order to conclude the proof, we provide a lower bound on the utility of the signaling scheme computed by Algorithm 3. Let ϕLP be the signaling scheme computed by Algorithm 3, while let Ψ = (x LP,a)a A be the optimal solution to Program 2. Furthermore, we let ψ = (x E,a)a A be the optimal solution of Program 2 and ϕE the signaling scheme computed by Algorithm 3 when the prior estimator coincides exactly with the prior itself, i.e., bµ := bµt = µ. 
Since x LP,a Rϵ(a) for every a A, we have that a AϕLP(a).9 Breaking ties in favor of the sender, the action aϕLP(sa) is such that: X θ Θ µθx LP,a θ us θ(aϕLP(sa)) X θ Θ µθx LP,a θ us θ(a). 9Observe that when R ϵ (a) = and Rϵ(a) = {0}, we have x LP,a = 0. Thus, x LP,a does not contribute to the sender s utility, sa / supp(ϕLP) and AϕLP(sa) = A by definition. θ Θ µθϕLP θ (s)us θ(aϕLP(s)) θ Θ µθϕLP θ (s)us θ(aϕLP(s)) θ Θ µθx LP,a θ us θ(aϕLP(sa)) θ Θ µθx LP,a θ us θ(a) θ Θ bµθx LP,a θ us θ(a) θ Θ bµθx E,a θ us θ(a) θ Θ µθx E,a θ us θ(a) 2 We recall that the value of ψ = (x E,a)a A is at least the optimal value of Program 5a when bµ = µ, and such a value is at least v OPT 10ϵnd according to Lemma 7, Thus, we have that: θ Θ µθx E,a θ us θ(a) 2 OPT 10ϵnd 2 concluding the proof. E Omitted proofs and sub-procedures of Phase 2 E.1 Action-Oracle The goal of the Action-Oracle procedure (Algorithm 5) is to assign the corresponding best-response to a slice x Xϵ received as input. In order to do so, it repeatedly commits to a signaling scheme ϕ such that x is the slice of ϕ with respect to the signal s1. When the signal s1 is sampled, the procedure returns the best-response a(x). Algorithm 5 Action-Oracle Require: x Xϵ 1: ϕθ(s1) xθ and ϕθ(s2) 1 xθ θ Θ 2: do 3: Commit to ϕt = ϕ, observe θt, and send st 4: Observe feedback at 6: while st = s1 7: return a In the following, for the sake of analysis, we introduce the definition of a clean event under which the Action-Oracle procedure always returns the action a(x) A, as formally stated below. Definition 4 (Clean event of Action-Oracle). We denote Ea as the event in which Algorithm 5 correctly returns the follower s best response a(x) whenever executed. In the proof of Lemma 3 we show that, thanks to Lemma 2 and the definition of Xϵ, it is possible to bound the number of rounds required to Algorithm 5 to ensure that it always returns the best response a(x) A with high probability. E.2 Find-Polytopes Algorithm 6 Find-Polytopes Require: Search space Xϵ X, parameter ζ (0, 1) 1: (C, {Xϵ(a)}a C) Find-Fully-Dimensional-Regions(Xϵ, ζ) 2: for all aj A \ C do 3: Rϵ(aj) Find-Face(C, {Xϵ(a)}a C, aj) 4: Rϵ(ak) Xϵ(ak) ak C 5: return {Rϵ(a)}a A. Algorithm 6 can be divided in two parts. First, by means of Algorithm 7, it computes the polytopes Rϵ(a) = Xϵ(a) with volume strictly larger than zero. This procedure is based on the algorithm developed by Bacchiocchi et al. [2024a] for Stackelberg games. The main differences are the usage of Action-Oracle to query a generic normalized slice, and some technical details to account for the shape of the search space Xϵ. In the second part (loop at Line 2 Algorithm 6), it finds, for every polytope Xϵ(a) with null volume, a face Rϵ(a) such that Vϵ(a) Rϵ(a). Let us remark that Rϵ(a) could be the improper face Xϵ(a) itself, and it is empty if Vϵ(a) = . Lemma 3. Given inputs Xϵ X and ζ (0, 1) for Algorithm 6, let L := B + Bϵ + Bˆµ, where B, Bϵ, and Bbµ denote the bit-complexity of numbers µθuθ(ai), ϵ, and bµ, respectively. Then, under event E1 and with at probability at least 1 ζ, Algorithm 6 outputs a collection Rϵ := {Rϵ(a)}a A, where Rϵ(a) is a (possibly improper) face of Xϵ(a) such that Vϵ(a) Xϵ(a), in a number of rounds T2: d7L + d + n d Proof. In the following we let L = B + Bϵ + Bbµ. As a first step, Algorithm 6 invokes the procedure Find-Fully-Dimensional-Regions. 
Thus, according to Lemma 8, under the event Ea, with probability at least 1 ζ/2, Algorithm 6 computes every polytope Xϵ(a) with volume larger than zero by performing at most: C1 = e O n2 d7L log(1/ζ) + d + n d calls to the Action-Oracle procedure. Together with these polytopes, it computes the set C A containing the actions a such that vol(Xϵ(a)) > 0. Subsequently, Algorithm 6 employs the procedure Find-Face at most n times and, according to Lemma 12, it computes the polytopes Rϵ(aj) for every aj / C. Overall, this computation requires: C2 = n2 d + n d calls to the Action-Oracle procedure. Thus, under the event Ea and with probability at least 1 ζ/2, Algorithm 6 correctly computes the polytopes Rϵ(aj) for every action aj A. Furthermore, the number of calls C 0 to the Action-Oracle procedure can be upper bounded as: C := C1 + C2 e O n2 d7L log(1/ζ) + d + n d Moreover, by setting ρ = ζ/2C, and thanks to Lemma 2, with a probability of at least 1 ρ, every execution of Action-Oracle requires at most N 0 rounds, where N can be bounded as follows: N O log(1/ρ) Consequently, since the number of calls to the Action-Oracle procedure is equal to C, by employing a union bound, the probability that each one of these calls requires N rounds to terminate is greater than or equal to: To conclude the proof, we employ an union bound over the probability that every execution of Action-Oracle terminates in at most N rounds, and the probability that Algorithm 6 performs C calls to the Action-Oracle procedure. Since the probability that each one of these two events hold is at least 1 ζ/2, with probability at least 1 ζ, Algorithm 6 correctly terminates by using a number of samples of the order: d7L + d + n d concluding the proof. E.3 Find-Fully-Dimensional-Regions Algorithm 7 Find-Fully-Dimensional-Regions Require: Search space Xϵ X, parameter δ (0, 1) 1: δ ζ/2n2(2(d+n)+n) 2: C 3: while S aj C U(aj) = Xϵ do 4: xint Sample a point from int Xϵ \ S 5: aj Action-Oracle(xint) 6: U(aj) Xϵ 7: Bx Bit-complexity of xint 8: λ d2 d(Bx+4(B+Bϵ+Bbµ)) 1 9: for all v V (U(aj)) do 10: x λxint + (1 λ)v 11: a Action-Oracle(x) 12: if a = aj then 13: Hjk Find-Hyperplane(aj, U(aj), xint, v, δ) 14: U(aj) U(aj) Hjk 15: else 16: restart the for-loop at Line 9 17: C C {aj} 18: Xϵ(aj) U(aj) 19: return C, {Xϵ(aj)}aj C At a high level, Algorithm 7 works by keeping track of a set C A of closed actions, meaning that the corresponding polytope Xϵ(a) has been completely identified. First, at Line 4, Algorithm 7 samples at random a normalized slice xint from the interior of one of the polytopes Xϵ(aj) that have not yet been closed and queries it, observing the best-response aj A. Then, it initializes the entire Xϵ as the upper bound U(aj) of the region Xϵ(aj). As a further step, to verify whether the upper bound U(aj) coincides with Xϵ(aj), Algorithm 7 queries at Line 11 one of the vertices of the upper bound U(aj). Since the same vertex may belong to the intersection of multiple regions, the Action-Oracle procedure is called upon an opportune convex combination of xint and the vertex v itself (Line 10). If the vertex does not belong to Xϵ(aj), then a new separating hyperplane can be computed (Line 13) and the upper bound U(aj) is updated accordingly. In this way, the upper bound U(aj) is refined by finding new separating hyperplanes until it coincides with the polytope Xϵ(aj). 
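As an illustration of the query performed at Line 10 of Algorithm 7 (the description of the outer loop continues below), the following sketch builds the point on which Action-Oracle is invoked: a vertex v of the current upper bound is slightly pulled towards an interior point x_int, so that ties at the vertex, which may be shared by several regions, do not corrupt the answer. The vectors and the value of λ below are hypothetical; in the algorithm, λ is set at Line 8 from the bit-complexities involved.

```python
import numpy as np

# Illustrative sketch (not the paper's code) of the query point built at Line 10 of
# Algorithm 7: pull the vertex v slightly towards an interior point x_int of U(a_j).
def perturbed_query_point(x_int, v, lam):
    """Return lam * x_int + (1 - lam) * v for a small lam in (0, 1)."""
    x_int, v = np.asarray(x_int, dtype=float), np.asarray(v, dtype=float)
    return lam * x_int + (1.0 - lam) * v

x_int = np.array([0.3, 0.3, 0.2])   # hypothetical interior slice of U(a_j)
v = np.array([0.0, 0.0, 1.0])       # a vertex of U(a_j) to be tested
lam = 1e-3                          # hypothetical; Line 8 derives it from bit-complexities
x_query = perturbed_query_point(x_int, v, lam)
# x_query is then passed to Action-Oracle; if the returned best response differs
# from a_j, Find-Hyperplane is invoked to cut U(a_j).
```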
Finally, such a procedure is iterated for all the receiver s actions aj A such that vol(Xϵ(aj)) > 0, ensuring that all actions corresponding to polytopes with volume larger than zero are closed. We observe that the estimator bµt is updated during the execution of Algorithm 7 according to the observed states. However, let us remark that the search space Xϵ = x X | P θ eΘ bµθxθ 2ϵ does not change during the execution of this procedure. Lemma 8. Given in input Xϵ X and ζ (0, 1), then under the event Ea with probability at least 1 ζ/2 Algorithm 7 computes the collection of polytopes {Xϵ(aj)}aj C with volume larger than zero, and the corresponding set of actions C. Furthermore, it employs at most: e O n2 d7L log(1/ζ) + d + n d calls to the Action-Oracle procedure. Proof. Thanks to Lemma 9 and Lemma 10, with an approach similar to the one proposed in Theorem 4.3 by Bacchiocchi et al. [2024a], we can prove that, under the event Ea, with probability at least 1 δn2(2(d + n)2 + n), Algorithm 7 computes every polytope Xϵ(a) with volume larger than zero by performing at most: O n2 d7L log(1/δ) + d + n d calls to the Action-Oracle procedure. Together with these polytopes, it computes the set C A containing the actions a such that vol(Xϵ(a)) > 0. Furthermore, we observe that ζ = 2δn2(2(d + n) + n), as defined at Line 1 in Algorithm 7. As a result, under the event Ea, with probability at least 1 ζ/2, the number of calls C1 0 performed by Algorithm 7 to the Action-Oracle procedure can be bounded as follows: C1 e O n2 d7L log(1/ζ) + d + n d concluding the proof. E.4 Find-Hyperplane Algorithm 8 Find-Hyperplane Require: aj, U(aj), xint, v, δ 1: x Sample-Int(U(aj), δ) 2: x1 x 3: if Action-Oracle(x) = aj then 4: x2 v 5: else 6: x2 xint 7: x Binary-Search(aj, x1, x2) 8: α 2 4d(Bx+B+Bbµ+Bϵ)/d 9: Sj ; Sk 10: for i = 1 . . . d do 11: x Sample-Int(Hi X, δ) 12: x+i x + α(x x ) 13: x i x α(x x ) 14: if Action-Oracle(x+i) = aj then 15: Sj Sj {x+i} Sk Sk {x i} 16: else 17: Sk Sk {x+i} Sj Sj {x i} 18: Build Hjk by Binary-Search(aj, x1, x2) for d 1 pairs of linearly-independent points x1 Sj, x2 Sk The goal of the Find-Hyperplane procedure is to compute a new separating hyperplane Hjk between a given region Xϵ(aj) and some other polytope Xϵ(ak), with aj, ak A. To do so, it receives as input an upper bound U(aj) of some polytope Xϵ(aj), an interior point xint int(U(aj)), a vertex v V(U(aj)) that does not belong to Xϵ(aj), and a parameter δ > 0 as required by the Sample-Int procedure. As a first step, Algorithm 8 samples at random a slice x from the interior of the upper bound U(aj). Subsequently, it performs a binary search on the segment between x and either v or xint, depending on the best response a(x) in x. This binary search returns a point x on some new separating hyperplane Hjk (Line 7). As a further step, the algorithm computes two sets of normalized slices, Sj Xϵ(aj) and Sk Xϵ(ak). Finally, it performs d 1 binary searches between different couples of points, one in Sj and the other in Sk, in order to completely identify the separating hyperplane. Lemma 9. With probability at least 1 (d+n)2δ, under the event Ea Algorithm 8 returns a separating hyperplane Hjk by using O(d7(B + Bϵ + Bbµ) + d4 log(1/δ)) calls to Algorithm 5. Proof. We observe that, with the same analysis provided in Lemma 4.7 by Bacchiocchi et al. 
[2024a], we can prove that, under the event Ea, the binary-search procedure described in Algorithm 9 correctly computes a point on a separating hyperplane by calling the Action-Oracle procedure at most O(d(Bx + B)) times. With the same reasoning applied in Lemma 4.4 and 4.5 by Bacchiocchi et al. [2024a], we can prove that with probability at least 1 (d+n)2δ, under the event Ea, the points x+i are linearly independent and do not belong to Hjk. To conclude the proof, we have to show that every x+i belongs to either Xϵ(aj) or Xϵ(ak). This is because, if the previous condition holds, under the event Ea, Algorithm 8 correctly computes a new separating hyperplane with probability at least 1 (d + n)2δ. To do that, we show that the constant α defined at Line 8 in Algorithm 8 is such that all the points x+i and x i either belong to Xϵ(aj) or Xϵ(ak), given that x belongs to the hyperplane between these polytopes. With an argument similar to the one proposed in Bacchiocchi et al. [2024a], the distance between xi and any separating hyperplane can be lower bounded by 2 d(Bx+4B), where Bx is the bit-complexity of x . Similarly, the distance between x and the hyperplane b H = {x Rd | P θ eΘ bµθxθ 2ϵ} can be lower bounded as follows: d(x , b H) = θ eΘ xθbµθ + 2ϵ q P θ eΘ bµ2 θ 1 d23d(Bx+Bbµ+Bϵ) , (7) where Bbµ is the bit-complexity of bµ. The inequality follows by observing that the denominator of the fraction above is at most d, while to lower bound the numerator we define the following quantities: X θ eΘ xθbµθ = α β and ϵ = γ where α and β are integers numbers, while γ and ν are natural numbers. Thus, the numerator P θ eΘ xθbµθ + 2ϵ of the fraction defined in Equation 7 can be lower bounded as follows: θ eΘ xθbµθ + 2ϵ 2 3d(Bx+Bbµ+Bϵ), where the last inequality follows from the fact that the bit-complexity of ν is at most Bϵ, while the bit-complexity of β cannot exceed 3d(Bx + Bbµ) as stated by Lemma D.1 of Bacchiocchi et al. [2024a]. Overall, the distance between x and the boundary of the polytope Xϵ(aj) Xϵ(ak) is strictly larger than α := 2 4d(Bx+B+Bbµ+Bϵ) log2(d). Thus, every signaling scheme x+i and x i belongs to Xϵ and either X(aj) or X(ak). Finally, we observe that the bit-complexity of x is bounded by Bx = O(d3(B+Bϵ+Bbµ)+log(1/δ)), as stated by Lemma 10. Thus, the first binary-search requires O(d(Bx + B)) = O(d4L + d log(1/δ)) calls to Action-Oracle, where L = B + Bϵ + Bbµ. Furthermore, the bit-complexity of x is bounded by O(d4L + d log(1/δ)). As a result, the bitcomplexity of the slices x+i and x i is bounded by O(d5L+d log(1/δ)), given that the bit-complexity of α is O(d L). It follows that each binary search between two points in Sj and Sk requires O(d6L + d2 log(1/δ)) calls to Action-Oracle. Overall, Algorithm 8 invokes the Action-Oracle procedure at most O(d7L + d4 log(1/δ)) times, accounting for the d 1 binary searches, concluding the proof. E.5 Binary-Search The Binary-Search procedure performs a binary search on the segment connecting two points, x1, x2 Xϵ such that x1 Xϵ(aj) and x2 / Xϵ(aj) for some aj A, in order to find a point x on some separating hyperplane Hjk. At each iteration, the binary search queries the middle point of the segment. Depending on the receiver s best-response in such a point, it keeps one of the two halves of the segment for the subsequent iteration. 
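To make the bisection step concrete, here is a minimal sketch (illustrative only): `action_oracle` is a stand-in for Algorithm 5, and a fixed iteration count replaces the exact stopping rule and the Stern-Brocot rounding described next. The segment between x1 and x2 is parameterized by λ ∈ [0, 1] and halved according to the best response observed at its midpoint.

```python
import numpy as np

# Illustrative sketch of the bisection step of Algorithm 9: keep the half of the
# segment whose endpoints still induce different best responses.
def bisect_segment(x1, x2, a_j, action_oracle, iters=40):
    lam_lo, lam_hi = 0.0, 1.0          # x(lam_lo) has best response a_j, x(lam_hi) does not
    for _ in range(iters):
        lam = (lam_lo + lam_hi) / 2.0
        x = x1 + lam * (x2 - x1)
        if action_oracle(x) == a_j:
            lam_lo = lam
        else:
            lam_hi = lam
    return x1 + lam_lo * (x2 - x1)     # approximately on the separating hyperplane

# Toy usage: the "hyperplane" is x[0] = 0.5 and a_j is the best response when x[0] >= 0.5.
oracle = lambda x: "a_j" if x[0] >= 0.5 else "a_k"
x_star = bisect_segment(np.array([1.0, 0.0]), np.array([0.0, 1.0]), "a_j", oracle)
print(x_star)                          # close to (0.5, 0.5)
```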
The binary search ends when the segment is sufficiently small, so that it contains a single point with a bit-complexity appropriate for a point that lies on both the hyperplane Hjk and the segment connecting x1 and x2. Such a point can be found traversing the Stern-Brocot-Tree. For an efficient implementation, see Forišek [2007]. Overall, Algorithm 9 performs O(d(Bx + B)) calls to the Action-Oracle procedure, and returns a normalized slice x with bit-complexity bounded by O(d(Bx + B)), where Bx is the bit-complexity of the points x1 and x2. Algorithm 9 Binary-Search Require: aj, x1, x2 of bit-complexity bounded by some Bx > 0 1: λ1 0; λ2 1 2: while |λ2 λ1| 2 6d(5Bx+8B) do 3: λ (λ1 + λ2)/2; x x1 + λ(x2 x1) 4: if Action-Oracle(x ) = aj then 5: λ1 λ 6: else 7: λ2 λ 8: λ Stern-Brocot-Tree(λ1, λ2, 3d(5Bx + 8B)) 9: x λx1 + (1 λ)x2 E.6 Sample-Int Algorithm 10 Sample-Int Require: P Xϵ : vold 1(P) > 0, and δ 1: V d linearly-independent vertexes of P 2: x 1 3: ρ d329d3L+4d L 1 ; M 4: y Uniform({ 1, M 1 M , . . . , 0, . . . , M 1 M , 1}d 1) 5: for all i 1, . . . , d 1 do 6: xi x i + ρyi 7: xd 1 Pd 1 i=1 xi The Sample-Int procedure (Algorithm 10) samples at random a normalized slice from the interior of a given polytope P. We observe that each polytope Algorithm 7 is required to sample from is defined as the intersection of Xϵ with some separating half-spaces as the ones defined in Section 3. This procedure provides theoretical guarantees both on the bit-complexity of the point x being sampled and on the probability that such a point belongs to a given hyperplane. Furthermore, it can be easily modified to sample a point from a facet of the simplex X = d (intuitively, this is equivalent to sample a point from d 1). As a first step, Algorithm 10 computes a normalized slice x in the interior of P (Line 2). Subsequently, it samples randomly a vector y from a suitable grid belonging to the (d 1)-dimensional hypercube with edges of length 2. As a further step, it sums each component x i of the normalized slice x with the corresponding component yi of y, scaled by an opportune factor ρ, where the constant ρ is defined to ensure that x belongs to the interior of P. The theoretical guarantees provided by Algorithm 10 are formalized in the following lemma: Lemma 10. Given a polytope P Xϵ : vold 1(P) > 0 defined by separating or boundary hyperplanes, Algorithm 10 computes x int(P) such that, for every linear space H Rd : P H of dimension at most d 1, the probability that x H is at most δ. Furthermore, the bit-complexity of x is O(d3(B + Bϵ + Bbµ) + log(1/δ)). Proof. In the following, we prove that x belongs to the interior of the polytope P. To do that, we observe that the point x belongs to the interior of P, while, with the same analysis proposed in Lemma 4.8 by Bacchiocchi et al. [2024a], the distance between x and x can be upper bounded by ρn. To ensure that x belongs to int(P), we have to show that d(x , x) is smaller than the distance between x and any hyperplane defining the boundary of P. Let us denote with vh the h-vertex of the set V, so that V = {v1, v2, . . . , vd}. Furthermore, we define: vh θ := γh θ νh for each h [d] and θ Θ. Since x belongs to int(P), then the distance between x and any separating hyperplane Hjk can be lower bounded as follows: d(x , Hjk) = θ Θ x θµθ(uθ(aj) uθ(ak)) q P θ Θ µ2 θ(uθ(aj) uθ(ak))2 P θ Θ Pd h=1 vh θ µθ(uθ(aj) uθ(ak)) θ Θ µ2 θ(uθ(aj) uθ(ak))2 d2 2 4d B 9d3(B+Bϵ+Bbµ). 
To prove the last inequality, we observe that the denominator of the fraction above can be upper bounded by d2. To lower bound the nominator, we define: µθ(uθ(aj) uθ(ak)) := αθ βθ for each θ Θ. As a result, we have: h=1 vh θ µθ(uθ(aj) uθ(ak)) αθγh θ βθνh θ Θ Pd h=1 αθγh θ Q θ Θ βθ Qd h=1 νh 2 4d B 9d3(B+Bϵ+Bbµ). The first inequality holds because the numerator of the fraction above can be lower bounded by one. The denominator can be instead upper bounded observing that the bit-complexity of each βθ is at most 4B while the bit-complexity of each νh is at most 9d2(B + Bϵ + Bbµ), as stated by Lemma 11. In a similar way, we can lower bound the distance between x and b H :={x Rd |P θ eΘ bµθxθ 2ϵ}. d(x , b H) = θ eΘ bµθx θ + 2ϵ q P θ eΘ Pd h=1 bµθvh θ + 2ϵ d2 2 Bbµ Bϵ 9d3(B+Bϵ+Bbµ). To prove the last inequality, we observe that the denominator of the fraction above can be upper bounded by d2, while to lower bound the numerator, we define: p and ϵ = α for each θ eΘ. As a result, we have: h=1 bµθvh θ + 2ϵ Nθγh θ pνh + 2ϵ θ eΘ Pd h=1 βNθγh θ Q h =h νh + 2αp Qd h=1 νh p Qd h=1 νhβ 2 Bbµ Bϵ 9d3(B+Bϵ+Bbµ) Thus, the distance between x and any hyperplane H defining the boundary of P can be lower bounded as follows: d2 2 9d3(B+Bϵ+Bbµ) 4d B Bbµ Bϵ 1 d2 2 10d3(B+Bϵ+Bbµ). We observe that, given the definition of ρ at Line 3, the distance d(x , x) is strictly smaller than d(x , H), showing that x int(P). Furthermore, we can prove that the bit-complexity of x is bounded by O(d3(B+Bϵ+Bbµ)+log(1/δ)). To do so, we observe that the denominator of x is equal to d Qd h=1 νh, while the denominator of yi is equal to M = d/δ . As a result, the denominator of every xi = x i + ρyi, with i [d 1], can be written as follows: h=1 νh DρM, where Dρ is the denominator of the rational number ρ. Similarly, the last component xd can be written with the same denominator. As a result, the bit complexity of x [0, 1]d can be upper bounded as follows: Bx 2 log(D) h=1 νh DρM)) h=1 29d2(B+Bϵ+Bbµ) ! + log(d210d3(B+Bϵ+Bbµ)) + log( = O d3(B + Bϵ + Bbµ) + log 1 Finally, with the same analysis performed in Lemma 4.8 by Bacchiocchi et al. [2024a], we can show that the probability that x belongs to a given hyperplane H is at most δ. Lemma 11. Each vertex v of a polytope P Xϵ : vold 1(P) > 0, defined by separating or boundary hyperplanes, has bit-complexity at most 9d2(B + Bϵ + Bbµ). Furthermore, with a bitcomplexity of 9d2(B + Bϵ + Bbµ), all the components of the vector v identifying a vertex can be written as fractions with the same denominator. Proof. We follow a line of reasoning similar to the proof of Lemma D.2 in Bacchiocchi et al. [2024a]. Let v be a vertex of the polytope P. Then such a vertex lies on the hyperplane H ensuring that the sum of its components is equal to one. Furthermore, it also belongs to a subset of d 1 linearly independent hyperplanes. These can be separating hyperplanes: θ Θ µθxθ(uθ(ai) uθ(aj)) = 0 with ai, aj A, boundary hyperplanes of the form Hi = {x Rd | xi > 0}, or the hyperplane b H := x Rd | P θ eΘ bµθxθ 2ϵ . Consequently, there exists a matrix A Qd d and a vector b Qd such that Av = b. Suppose that v is not defined by the hyperplane b H := x Rd | P θ Θ bµθxθ 2ϵ . Then, each entry of the matrix A is either equal to one or the quantity µθ(uθ(a) uθ(a ) for some θ Θ and a, a A. Thus, its bit-complexity is bounded by B. Similarly, each entry of the vector b is either equal to one or zero. With a reasoning similar to the one applied in Bacchiocchi et al. [2024a], the bit-complexity of v is at most 9d2(Bµ + Bu). 
Suppose instead that v is defined also by the hyperplane b H, corresponding to the last row of the matrix A and the last component of the vector b. This hyperplane can be rewritten as: bµθ 2ϵ xθ 1 Thus, each element of the last row of A is either zero or the quantity bµθ/2ϵ for some θ eΘ. We observe that bµθ/2ϵ is a rational number with numerator bounded by 2Bbµ+Bϵ and denominator bounded by 2Bbµ+Bϵ+1. Thus, we multiply the last row of A and the last component of b by a constant bounded by 2d(Bbµ+Bϵ+1). The other rows of A and the corresponding components of b are multiplied instead by some constants bounded by 24d B. This way, we obtain an equivalent system A v = b with integer coefficients. We define A (j) as the matrix obtained by substituting the j-th column of A with b . Then, by Cramer s rule, the value of the j-th component of vj can be computed as follows: vj = det(A (j)) det(A ) j [d]. We observe that both determinants are integer numbers as the entries of both A and b are all integers, thus by Hadamard s inequality we have: | det(A )| Y j [d] a ji 2 j [d] (24d B)2 j [d] (2d(Bbµ+Bϵ+1))2 d22d(Bbµ+Bϵ+1) i [d 1] d 1 2 (24d B) d 1 2 2d(Bbµ+Bϵ+1) = d d 2 (24d(d 1)B)2d(Bbµ+Bϵ+1) d d 2 (24d2B)2d(Bbµ+Bϵ+1) With a reasoning similar to Bacchiocchi et al. [2024a], we can show that that the bit-complexity Dv of the vertex v is bounded by: Dv 9Bd2 + 2(d(Bbµ + Bϵ + 1)) 9d2(B + Bϵ + Bbµ) Furthermore, this result holds when the denominator of every component vj of the vertex v is written with the same denominator det(A ), concluding the proof. E.7 Find-Face Algorithm 11 Find-Face Require: The set of polytopes {Xϵ(ai)}ai C, with volume larger than zero, and action aj / C 1: Compute the minimal H-representation M(Xϵ(ai)) for every polytope Xϵ(ai), ai C 2: H(ai) ai C 3: First True 4: for all x S ai C V (Xϵ(ai)) do 5: a Action-Oracle(x) 6: if a = aj then 7: if First = False then 8: H(ai) {H M(Xϵ(ai)) | x H, H H(ai)} ai C 9: else 10: H(ai) {H M(Xϵ(ai)) | x H} ai C 11: First False 12: if First = True then 13: Fϵ(aj) 14: Fϵ(aj) Xϵ(ai) T H H(ai) H for any ai such that Xϵ(ai) T H H(ai) H = 15: Return Fϵ(aj) Algorithm 11 takes in input the collection of polytopes {Xϵ(ai)}ai C and another action aj / C such that vol(Xϵ(aj)) = 0, and outputs the H-representation of a (possibly improper) face of Xϵ(aj) that contains all those vertices x V (Xϵ(aj)) where a(x) = aj. As we will show by means of a pair of technical lemmas, the polytope Xϵ(aj) is a face of some other polytope Xϵ(ak), with ak C. Consequently, Algorithm 11 looks for a face of some polytope Xϵ(ak) containing the set of vertices Vϵ(aj). As a first step, Algorithm 11 computes, for every action ai C, the set of hyperplanes M(Xϵ(ai)) corresponding to the minimal H-representation of Xϵ(ai). This set includes every separating hyperplane Hik found by Algorithm 7, together with the non-redundant boundary hyperplanes that delimit Xϵ. Subsequently, Algorithm 11 iterates over the vertices of the regions with volume larger than zero, which we prove to include all the vertices of the region Xϵ(aj). While doing so, it builds a set of hyperplanes H(ai) M(Xϵ(ai)) for every action ai C. Such a (possibly empty) set includes all and only the hyperplanes in M(Xϵ(ai)) that contain all the vertices where the action aj has been observed, i.e, a(x) = aj. Finally, at Line 14 Algorithm 11 intersects every region Xϵ(ai) with the corresponding hyperplanes in H(ai), obtaining a (possibly empty) face for every polytope Xϵ(ai), ai C. 
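The hyperplane-filtering step at Lines 8 and 10 of Algorithm 11 can be sketched as follows (illustrative only; hyperplanes are encoded as pairs (w, b) representing {x : w · x = b}, and the toy polytope is the unit square rather than a region Xϵ(ai)). The correctness argument for the resulting faces is given below.

```python
import numpy as np

# Illustrative sketch: among the hyperplanes of a minimal H-representation, keep only
# those containing every vertex at which action a_j was observed.
def filter_hyperplanes(hyperplanes, observed_vertices, tol=1e-9):
    kept = []
    for w, b in hyperplanes:
        if all(abs(np.dot(w, x) - b) <= tol for x in observed_vertices):
            kept.append((w, b))
    return kept

# Toy usage: two facets of the unit square, and two observed vertices lying on x = 1.
hyperplanes = [(np.array([1.0, 0.0]), 1.0),   # facet x = 1
               (np.array([0.0, 1.0]), 0.0)]   # facet y = 0
vertices = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]
print(filter_hyperplanes(hyperplanes, vertices))   # only the facet x = 1 is kept
```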
At least one of these faces is the face the algorithm is looking for, and corresponds to the output of Algorithm 11. The main result concerning Algorithm 11 is the following: Lemma 12. Given the collection of polytopes {Xϵ(ai)}ai C with volume larger than zero and an another action aj, then, under the event Ea, Algorithm 11 returns a (possibly improper) face Fϵ(aj) of Xϵ(aj) such that Vϵ(aj) Fϵ(aj). Furthermore, the Algorithm requires O(n d+n d ) calls to Algorithm 5. In order to prove it, we first need to introduce two technical lemmas to characterize the relationship between regions with null volume and those with volume larger than zero. Lemma 13. Let ai, aj A such that vol(Xϵ(ai)) > 0 and vol(Xϵ(aj)) = 0. Then Xϵ(ai) Xϵ(aj) is a (possibly improper) face of Xϵ(ai) and Xϵ(aj). Proof. In the following we assume that Xϵ(ai) Xϵ(aj) is non-empty, as the empty set is an improper face of every polytope. We prove that Xϵ(ai) Xϵ(aj) = Xϵ(ai) Hij = Xϵ(aj) Hij. In order to do that, we first show that Xϵ(ai) Hij Xϵ(ai) Xϵ(aj). Consider a normalized slice x Xϵ(ai) Hij. Then we have thatx Xϵ(aj). As this holds for every x Xϵ(ai) Hij, it follows that Xϵ(ai) Hij Xϵ(ai) Xϵ(aj). Similarly, we show that Xϵ(ai) Xϵ(aj) Xϵ(ai) Hij. Take any normalized slice x Xϵ(ai) Xϵ(aj). Then x Hij as it belongs to both Xϵ(ai) and Xϵ(aj), thus x Xϵ(ai) Hij. This implies that Xϵ(ai) Xϵ(aj) Xϵ(ai) Hij. Consequently, we have that Xϵ(ai) Xϵ(aj) = Xϵ(ai) Hij. With a similar argument, we can prove that Xϵ(ai) Xϵ(aj) = Xϵ(aj) Hij. As a result, we have that Xϵ(ai) Xϵ(aj) = Xϵ(ai) Hij = Xϵ(aj) Hij. In order to conclude the proof, we show that Xϵ(ai) Xϵ(aj) = Xϵ(ai) Hij = Xϵ(aj) Hij is a face of both Xϵ(ai) and Xϵ(aj). We observe that Xϵ(ai) Hij, thus the non-empty region Xϵ(ai) Hij is by definition a face of Xϵ(ai). Similarly, Xϵ(aj) Hji, thus the non-empty region Xϵ(aj) Hij is a face of Xϵ(aj) (possibly the improper face Xϵ(aj) itself). Lemma 14. Let Xϵ(aj) be a polytope such that vol(Xϵ(aj)) = 0. Then Xϵ(aj) is a face of some polytope Xϵ(ai) with vol(Xϵ(ai)) > 0. Proof. First, we observe that if Xϵ(aj) is empty, then it is the improper face of any region Xϵ(ai) with vol(Xϵ(ai)) > 0. Thus, in the following, we consider Xϵ(aj) to be non-empty. As a first step, we observe that any normalized slice x Xϵ(aj) belongs also to some region Xϵ(ak), where ak A depends on x, such that vol(Xϵ(ak)) > 0. Suppose, by contradiction, that x int(Xϵ(ai)). Then ai is a best-response in Xϵ(aj) Hij, i.e., Xϵ(aj) Hij Xϵ(ai). One can easily observe that such a region has positive volume, thus contradicting the hypothesis that vol(Xϵ(aj)) = 0. Now we prove that there exists some Xϵ(ai) with vol(Xϵ(ai)) > 0 such that Xϵ(aj) Xϵ(ai). If Xϵ(aj) is a single normalized slice x, then this trivially holds. Suppose instead that Xϵ(aj) has dimension at least one. Consider a fixed normalized slice x int(Xϵ(aj)), where the interior is taken relative to the subspace that contains Xϵ(aj) and has minimum dimension. There exists a region Xϵ(ai) with vol(Xϵ(ai)) > 0 such that x Xϵ(ai). We prove that Xϵ(aj) X(ai). Suppose, by contradiction, that there exists a normalized slice x Xϵ(aj) such that x / Xϵ(ai). It follows that the line segment co( x, x) intersect the separating hyperplane Hij in some normalized slice ex co( x, x) Hij. Furthermore, since ex = x and x int(Xϵ(aj)), then ex int(Xϵ(aj)). However, if the internal point ex belongs to the hyperplane Hij and Xϵ(aj) Hji, then it must be the case that Xϵ(aj) Hij. 
This implies that Xϵ(aj) Xϵ(ai) and thus x Xϵ(ai), which contradicts the hypothesis.. Given that there exists some Xϵ(ai) with vol(Xϵ(ai)) > 0 such that Xϵ(aj) Xϵ(ai), then Xϵ(aj) = Xϵ(ai) Xϵ(aj) is a face of Xϵ(ai) by Lemma 13. Lemma 12. Given the collection of polytopes {Xϵ(ai)}ai C with volume larger than zero and an another action aj, then, under the event Ea, Algorithm 11 returns a (possibly improper) face Fϵ(aj) of Xϵ(aj) such that Vϵ(aj) Fϵ(aj). Furthermore, the Algorithm requires O(n d+n d ) calls to Algorithm 5. Proof. In the following, for the sake of notation, given a polytope Xϵ(a) and the a set of hyperplanes H(a), with an abuse of notation we denote with Xϵ(a) H(a) the intersection of Xϵ(a) with every hyperplane in H(a). Formally: Xϵ(a) H(a) := Xϵ(a) \ H H(a) H. (8) Suppose that Xϵ(aj) = . Then, one can easily verify that Algorithm 11 returns . Thus, in the following we assume Xϵ(aj) = . Let ai be action selected at Line 14 Algorithm 11. We denote with Fϵ(ai) the face returned by Algorithm 11: Fϵ(ai) := Xϵ(ai) H(ai). We observe that by Lemma 14, there exists an action ak C such that Xϵ(aj) is a face of Xϵ(ak). Consequently, Algorithm 11 queries every vertex x Vϵ(aj). As a first step we show that Fϵ(ai) actually is a face of Xϵ(ai) and contains every vertex of Vϵ(aj). Being the non-empty intersection of Xϵ(ai) with some hyperplanes in M(Xϵ(ai)), Fϵ(ai) is a face of Xϵ(ai). One can easily prove by induction that H(ai) includes all and only the hyperplanes within M(Xϵ(ai)) containing every vertex in Vϵ(aj). Thus, Vϵ(aj) Fϵ(ai). Now we show that Fϵ(ai) is not only a (proper) face of Xϵ(ai) containing the set Vϵ(aj), but also a face of Xϵ(aj) (possibly the improper face Xϵ(aj) itself). We consider the set Xϵ(ai) Xϵ(aj), which is a face of both Xϵ(ai) and Xϵ(aj) thanks to Lemma 13. Thus, there exists some set of hyperplanes H (ai) M(Xϵ(ai)) such that: Xϵ(ai) H (ai) = Xϵ(ai) Xϵ(aj). (9) Furthermore, we observe that Vϵ(aj) Xϵ(ai) Xϵ(aj). Indeed, we have that Vϵ(aj) Xϵ(ai) since Vϵ(aj) Fϵ(ai) and Fϵ(ai) is a face of Xϵ(ai), and Vϵ(aj) Xϵ(aj) by definition. We want to prove that H (ai) H(ai), where the set H(ai) contains all and only the hyperplanes within M(Xϵ(ai)) that include the whole set Vϵ(aj). In order to do that, suppose, by contradiction, that there exists a vertex x Vϵ(aj) such that x / H for some H H (ai). Then, x / Xϵ(ai) H (ai) H. However, we proved that Vϵ(aj) Xϵ(ai) Xϵ(aj) and Xϵ(ai) Xϵ(aj) = Xϵ(ai) H (ai) by definition of H (ai), reaching a contradiction. Consequently: Fϵ(ai) := Xϵ(ai) H(ai) = Xϵ(ai) \ = Xϵ(ai) H (ai) = Xϵ(ai) Xϵ(aj), where we applied Equation 8, the fact that H (ai) H(ai), and Equation 9. Finally, we can show that Fϵ(ai) is a face of Xϵ(aj). We have that Fϵ(ai) is a face of Xϵ(ai) and Fϵ(ai) Xϵ(ai) Xϵ(aj), which is itself a face of Xϵ(ai) by Lemma 13. Thus, Fϵ(ai) is a face of Xϵ(ai) Xϵ(aj). Furthermore, Lemma 13 states that Xϵ(ai) Xϵ(aj) is also a face of Xϵ(aj). This implies that Fϵ(ai) is a face of a face of Xϵ(aj), and thus a face of Xϵ(aj) itself. In order to conclude the proof, we have to prove that at Line 14 Algorithm 11 can actually find an action ai such that Fϵ(ai) = Xϵ(ai) H(ai) is non-empty. Let ak C be such that Xϵ(aj) is a face of Xϵ(ak), which exists thanks to Lemma 14. Let x be any vertex in the set Vϵ(aj) and define H (aj) as: H (ak) := {H RH(Xϵ(ak)) | x H}. Then Xϵ(ak) H (ak) = {x}. Consequently, H(ak) H (ak), and thus: Xϵ(ak) H(ak) = Xϵ(ak) \ = Xϵ(ak) H (ak) = {x} = . 
As a result, there is always an action ak C such that Xϵ(ak) H(ak) = . Finally, we observe that Algorithm 11 executes Algorithm 5 once for every vertex in the set S ai C V (Xϵ(ai)), which has size O(n d+n d ). F Omitted proofs from Section 5 Theorem 2. For any sender s algorithm, there exists a Bayesian persuasion instance in which n = d + 2 and the regret RT suffered by the algorithm is at least 2Ω(d), or, equivalently, 2Ω(n). Proof. In the following, for the sake of the presentation, we consider a set of instances characterized by an even number d N+ of states of nature and n = d + 2 receiver s actions. All the instances share the same uniform prior distribution and the same sender s utility, given by us θ(ad+1) = 1 for all θ Θ, and us θ(a) = 0 for all θ Θ and a A \ {ad+1}. Each instance is parametrized by a vector p belonging to a set P defined as follows: p {0, 1}d | Furthermore, we assume that the receiver s utility in each instance Ip is given by: uθi(aj) = δij i, j [d], uθi(ad+1) = 2 dpi i [d], uθi(ad+2) = 2 We show that ξ θi := 2 dpi for each i [d] is the only posterior inducing the receiver s action ad+1 A in the instance Ip, since the receiver breaks ties in favor of the sender. To prove that, we observe that the action ad+1 is preferred to the action ad+2 only in those posteriors that satisfy the condition ξθi = 0 for each i [d] with pi = 0. Furthermore, to incentivize the action ad+1 over the set of actions ai with i [d], the following condition must hold: X i [d]:pi>0 ξθiuθi(an+1) = 2 i [d]:pi>0 ξθi = 2 d max i [d]:pi>0 ξθi. We notice that the last step holds only if ξθi 2/d for each i [d] such that pi > 0. Consequently, since the number of ξθi > 0 is equal to d/2, it holds ξθi = 2/d for each i [d] such that pi > 0. Thus, the only posterior inducing action ad+1 is equal to ξ θi := 2 We also notice that, given p P, the optimal signaling scheme γ is defined as γ(ξ ) = 1/2 and γ(ξ ) = 1/2, with ξ θ = µθ 1 2ξ θ for each θ Θ. With a simple calculation, we can show that the expected sender s utility in γ is equal to 1/2. We set the time horizon T = |P|/4 to show that any algorithm suffers regret of at least 2Ω(d). This is sufficient to prove the statement. We start by making the following simplifications about the behavior of the algorithm. First, we observe that if the algorithm can choose any posterior (instead of a signaling scheme), then this will only increase the performance of the algorithm. Consequently, we assume that the algorithm chooses a posterior ξt at each round t [T]. Thus, we can apply Yao s minimax principle to show that any deterministic algorithm fails against a distribution over instances. In the following, we consider a uniform distribution over instances Ip with p P. Furthermore, we observe that the feedback of any algorithm is actually binary. Thus, it is easy to see that an optimal algorithm works as follows: (i) it ignores the feedback whenever the action is not ad+1, and (ii) it starts to play the optimal posterior when the action is ad+1 since it found an optimal posterior. This observation is useful for showing that any deterministic algorithm does not find a posterior that induces action ad+1 with a probability of at least 1 |P|/(4|P|) = 3/4 (since it can choose only |P|/4 posteriors among the |P| possible optimal posteriors). Hence, by Yao s minimax principle, for any (randomized) algorithm there exists an instance such that the regret suffered in the T rounds is at least: since x x 1 x/2, for each x 2. 
Finally, we notice that |P| = d d/2 = 2Ω(d), which concludes the proof. Theorem 3. For any sender s algorithm, there exists a Bayesian persuasion instance in which the regret RT suffered by the algorithm is at least Ω( Proof. To prove the theorem, we introduce two instances characterized by two states of nature and four receiver s actions. In both the instances the sender s utility is given by us θ(a1) = us θ(a2) = 0 for all θ Θ, while us θ1(a3) = 1, us θ2(a3) = 0 and us θ2(a4) = 0 us θ2(a4) = 1. The receiver s utilities and the prior distributions in the two instances are: 2 + ϵ, µ1 θ2 = 1 2 ϵ u1 θ1(a1) = 1 (2+4ϵ), u1 θ2(a1) = 1 (10 20ϵ) u1 θ1(a2) = 1 (10+20ϵ), u1 θ2(a2) = 1 (2 4ϵ) u1 θ1(a3) = 3 10, u1 θ2(a3) = 3 10 u1 θ1(a4) = 0, u1 θ2(a4) = 1 (2 4ϵ) 2 ϵ, µ2 θ2 = 1 2 + ϵ u2 θ1(a1) = 1 (2 4ϵ), u2 θ2(a1) = 1 (10+20ϵ) u2 θ1(a2) = 1 (10 20ϵ), u2 θ2(a2) = 1 (2+4ϵ) u2 θ1(a3) = 3 10, u2 θ2(a3) = 3 10 u2 θ1(a4) = 0, u2 θ2(a4) = 1 (2+4ϵ) with ϵ (0, 1/4). With a simple calculation, we can show that, in both the two instances, for any signaling scheme ϕ employing a generic set of signals S, the sender receives the following feedback: 1. s S such that ϕθ1(s) > ϕθ2(s), then aϕ(s) = a1. 2. s S such that 0 < ϕθ1(s) < ϕθ2(s), then aϕ(s) = a2. 3. s S such that ϕθ1(s) = ϕθ2(s), then aϕ(s) = a3. 4. s S such that 0 = ϕθ1(s), then aϕ(s) = a4. As a result, for any signaling scheme the sender may commit to, the resulting feedback in each signal of the signaling scheme is the same. Thus, we assume without loss of generality, that the sender only commits to signaling schemes that maximizes the probability of inducing actions a3 or a4. This is because, the sender does not gain any information by committing to one signaling scheme over another, while the signaling schemes that induce these two actions are the only ones that provide the sender with strictly positive expected utility. Furthermore, thanks to what we have observed before, we can restrict our attention to direct signaling schemes, i.e., those in which the set of signals coincides with the set of actions. Thus, at each round, we assume that the sender commits to a signaling scheme ϕt of the form: ϕt θ1(a3) = ϕt θ2(a3) := ϕt 1, ϕt θ1(a4) = 0, ϕt θ2(a4) = 1 ϕt 1, ϕt θ1(a1) = 1 ϕt 1, ϕt θ2(a1) = 0, (10) with ϕt 1 [0, 1]. We also notice that, in each round, the optimal signaling scheme in the first instance is the one that induces action a3, meaning ϕt 1 = 1 for each t [T]. While the optimal signaling scheme in the second instance at each round is the one that reveals the state of nature to the receiver, meaning ϕt 1 = 0 for each t [T]. In such a way, the learning task undertaken by the sender reduces to select a value of ϕt 1 [0, 1] for each t [T] controlling the probability of inducing action a3 over action a4. In the following, we define P1 (P2) as the probability distribution generated by the execution of a given algorithm in the first (second) instance and we let E1 (E2) be the expectation induced by such a distribution. 
The cumulative regret in the first instance can be written as follows: R1 T = E1 " T X 2 + ϵ 1 ϕt 1 1 Similarly, in the second instance, the cumulative regret is given by: Furthermore, it is easy to check that: R1 T P1 T X t=1 ϕ1 t T/2 ϵT and R2 T P2 T X t=1 ϕ1 t T/2 By employing the relative entropy identities divergence decomposition we also have that: KL P1, P2 = T KL µ1, µ2 3 Tϵ2 22Tϵ2, where we employed the fact that for two Bernoulli distribution it holds KL(Be(p), Be(q)) (p q)2 Then, by employing the Bretagnolle Huber inequality we have that, R1 T + R2 T ϵT t=1 ϕ1 t T/2 t=1 ϕ1 t T/2 2ϵT exp KL P1, P2 2ϵT exp 22Tϵ2 . By taking ϵ = q 1 22T we get: R1 T + R2 T C1 T. with C1 = e 1/(2 22) Thus, we have: concluding the proof. G Details and omitted proofs from Section 6 G.1 Compute-Threshold procedure The Compute-Threshold procedure takes as input a real parameter ϵ1 > 0. Then, it iteratively halves the value of a different parameter ϵ, initially set to one, until it is smaller than or equal to ϵ1. In this way, Algorithm 12 computes a parameter ϵ [ϵ1/2, ϵ1] in O(log(1/ϵ1)) rounds with bit complexity Bϵ = O(log(1/ϵ1)). This technical component is necessary to ensure that the bitcomplexity of the parameter ϵ is not too large while guaranteeing that the solution returned by Algorithm 4 is still γ-optimal with probability at least 1 η. Algorithm 12 Compute-Threshold Require: ϵ1 (0, 1) 1: ϵ 1 2: while ϵ ϵ1 do 3: ϵ ϵ/2 4: Return ϵ G.2 Omitted proofs from Section 6 Lemma 5. Given T1 := 1 2ϵ2 log (2d/δ) and ϵ (0, 1), Algorithm 2 employs T1 rounds and outputs Xϵ X such that, with probability at least 1 δ: (i) P θ Θ µθxθ ϵ for every slice x Xϵ, (ii) P θ Θ µθxθ 6ϵ for every slice x X \ Xϵ, and (iii) |bµθ µθ| ϵ for every θ Θ. Proof. Thanks to the definition of T1 := 1/2ϵ2 log (2d/δ) in Algorithm 4 and employing both a union bound and the Hoeffding bound we have: P (|bµθ µθ| ϵ) 1 δ, θ Θ. Consequently, if |bµθ µθ| ϵ for each θ Θ, then for each x Xϵ, the probability of inducing the slice x can be lower bounded as follows: θ Θ bµθxθ ϵ X θ Θ (bµθ ϵ)xθ X where the above inequalities hold because each x Xϵ satisfies the constraint P θ Θ bµθxθ 2ϵ. Furthermore, if |bµθ µθ| ϵ for each θ Θ, then for each x Xϵ the following holds: X θ Θ µθxθ ϵ + X θ Θ (µθ ϵ)xθ ϵ + X θ Θ bµθxθ 3ϵ, (11) since, if x Xϵ, it holds P θ Θ bµθxθ 2ϵ, and, X θ Θ (µθ ϵ)xθ + ϵ X θ Θ bµθxθ + ϵ 3ϵ. (12) Thus, by combining Inequality (11) and Inequality (12), we have: X θ Θ µθxθ 6ϵ, when x Xϵ, concluding the proof. Theorem 4. Given γ (0, 1) and η (0, 1), with probability at least 1 η, Algorithm 4 outputs a γ-optimal signaling scheme in a number of rounds T such that: d8B + d d + n d Proof. Thanks to Lemma 5, with probability at least 1 δ = 1 η/2 Algorithm 4 correctly completes Phase 1 in T1 = O(1/ϵ2 log(1/η) log(d)) rounds. Thus, with probability at least 1 η/2, both the event E1 and the inequalities |bµθ µθ| ϵ for each θ Θ hold. Consequently, under the event E1, with probability at least 1 ζ = 1 η/2, Algorithm 4 correctly partitions the search space Xϵ in at most: d7L + d + n d rounds, as stated by Lemma 3. Furthermore, we notice that L = B + Bϵ + Bbµ, with: Bbµ = O(log(T1)) = O (log(1/ϵ) + log(1/η) + log(d)) . As a result, with probability at least 1 η, Algorithm 4 correctly terminates in a number of rounds N which can be upper bounded as follows: log(d) + n2 d7(B + Bϵ) + d + n d Furthermore, we observe that if |bµθ µθ| ϵ for each θ Θ, then the following holds: θ Θ |bµθ µθ| X θ Θ ϵ = ϵd. 
Consequently, thanks to the result provided by Lemma 4, with probability at least 1 η, Algorithm 4 computes a 12nϵd-optimal solution. Thus, by setting ϵ1 := γ/12nd and ϵ ϵ1, with probability at least 1 η Algorithm 4 computes a γ-optimal solution in a number of rounds N bounded by: d8(B + Bϵ) + d d + n d d8(B + Bϵ) + d d + n d d8B + d d + n d where the last equality holds because the bit-complexity of ϵ is Bϵ = O(log(nd/γ)), concluding the proof. Theorem 5. There exist two absolute constants κ, λ > 0 such that no algorithm is guaranteed to return a κ-optimal signaling scheme with probability of at least 1 λ by employing less than 2Ω(n) and 2Ω(d) rounds, even when the prior distribution µ is known to the sender. Proof. In the proof of Theorem 2 we showed that, with probability 3/4, in N = |P|/4 rounds any algorithm does not correctly identify the posterior inducing action ad+1. This is because, any deterministic algorithm can identify the optimal posterior only in |P|/4 instances, as observed in the proof of Theorem 2. As a result, in the remaining |P| N instances, any deterministic algorithm will receive the same feedback and thus will always output the same posterior after N rounds, which will result in the optimal one in only a single instance. Thus, thanks to the Yao s minimax principle, there is no algorithm that is guaranteed to return an optimal solution with probability at least: Finally, we observe that OPT = 1/2, while any algorithm that does not induce the posterior ξ provides an expected utility equal to zero. As a result, for each κ < 1/2 there is no algorithm that is guaranteed to return a solution which is κ-optimal in |P|/4 = 2Ω(d) rounds with probability at least 3/8. Theorem 6. Given γ (0, 1/8) and η (0, 1), no algorithm is guaranteed to return a γ-optimal signaling scheme with probability at least 1 η by employing less than Ω 1 γ2 log(1/η) rounds. Proof. To prove the theorem, we consider the same instance and the same definitions introduced in the proof of Theorem 3. In this case, we let P1 (P2) be the probability distribution generated by the execution of a given algorithm in the first (second) instance for N = log(1/4η)/22ϵ2 rounds. Furthermore, we introduce the event E, under which the signaling scheme returned at the round N, according to the definition presented in Equation 10, is such that ϕN 1/2. We notice that, if such signaling scheme is such that ϕN < 1/2, then the sender s expected utility in the first instance is smaller or equal to 1/2, thus being ϵ/2-optimal. At the same time, if ϕN 1/2 in the second instance, then the solution returned by the algorithm is not ϵ/2-optimal. Thus, by employing the Bretagnolle Huber inequality we have: P1 (E) + P2 EC 1 2 exp KL P1, P2 1 2 exp 22Nϵ2 , since KL P1, P2 22Nϵ2, as observed in the proof of Theorem 6. Finally, by employing the definition of N, we have: P1 (E) η P2 EC η. As a result, by setting 2γ = ϵ, the statement of the lemma holds. H Sample complexity with known prior In this section, we discuss the Bayesian persuasion PAC-learning problem when the prior distribution µ is known to the sender. To tackle the problem, we propose Algorithm 13. The main difference with respect to Algorithm 4 is that, in this case, we do not need to employ the Build-Search-Space procedure, as the prior is already known to the sender. This allows us to compute a γ-optimal signaling scheme in only O(1/γ) rounds, instead of O(1/γ2) rounds as in the case with an unknown prior. 
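Lines 1–2 of Algorithm 13 below fix the parameter ϵ by means of the Compute-Threshold routine of Appendix G.1. The following sketch (illustrative only, with hypothetical values of γ, n, and d) reproduces that computation and checks the guarantee ϵ ∈ [ϵ1/2, ϵ1].

```python
from fractions import Fraction

# Illustrative sketch of Compute-Threshold (Appendix G.1): halve eps, starting from 1,
# until it drops to eps_1 or below. The returned eps lies in [eps_1 / 2, eps_1] and,
# being a power of 1/2, has bit-complexity O(log(1 / eps_1)).
def compute_threshold(eps_1):
    eps = Fraction(1)
    while eps > eps_1:
        eps /= 2
    return eps

# Hypothetical parameters (not taken from the paper).
gamma, n, d = Fraction(1, 10), 4, 3
eps_1 = gamma / (10 * n * d)          # eps_1 = gamma / (10 n d), as at Line 1 of Algorithm 13
eps = compute_threshold(eps_1)
assert eps_1 / 2 <= eps <= eps_1
```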
Algorithm 13 PAC-Persuasion-w/o-Clue-Known Require: η (0, 1), γ (0, 1), µ Θ 1: ϵ1 γ/10nd 2: ϵ Compute-Epsilon(ϵ1), bµ µ 3: eΘ {θ Θ | bµθ > 2ϵ} 4: Xϵ x X | P θ eΘ bµθxθ 2ϵ 5: Rϵ Find-Polytopes(Xϵ, η) 6: ϕ Compute-Signaling(Rϵ, Xϵ, µ) 7: return ϕ In this case, the following theorem holds. Theorem 7. With probability at least 1 η and in Algorithm 2 computes a γ-optimal signaling scheme in e O n3/γ log2 (1/η) d8B + d d+n d rounds. Proof. Since bµ = µ, the clean event E1 holds with probability one. Consequently, thanks to Lemma 3, with probability at least 1 η, Algorithm 13 correctly partitions the search space Xϵ in at most: ϵ log2 (1/η) d7L + d + n d rounds, with L := B +Bϵ +Bbµ. According to Algorithm 13, we have ϵ ϵ1 := γ/(10nd). As a result, L = O (B + log(nd) + log(1/γ)), since Bbµ B and Bϵ = O(log(1/ϵ1)) = O(log(nd) + log(1/γ)). Furthermore, under the event E2, Algorithm 13 computes a 10ϵnd-optimal solution, as guaranteed by Lemma 4, with bµ = µ. Thus, with probability at least 1 η, Algorithm 13 computes a γ-optimal solution in: γ log2 (1/η) d8B + d d + n d rounds, concluding the proof. We notice that, differently from the case with an unknown prior, it is possible to achieve a O (log(1/η)/γ) upper bound with respect to the input parameters γ, η > 0. Finally, we show that such a dependence is tight, as shown in the following theorem. Theorem 8. Given γ, η > 0 no algorithm is guaranteed to return an γ-optimal signaling scheme with probability of at least 1 η employing Ω(log(1/η)/γ) rounds, even when the prior distribution is known to the sender. Proof. We consider two instances characterized by two states of nature and three receiver s actions. The two instances share the same prior distribution, defined as µθ1 = 4γ and µθ2 = 1 4γ. In both the instances the sender s utility is given by us θ(a1) = 0, us θ(a2) = 1/2 and us θ(a3) = 1 for all θ Θ. Furthermore, we assume that the receiver s utility in the two instances are given by: uθ1(a1) = 1, uθ2(a1) = 1/2 uθ1(a2) = 1/2, uθ2(a2) = 1 uθ1(a3) = 1, uθ2(a3) = 0 uθ1(a1) = 1, uθ2(a1) = 1/2 uθ1(a2) = 1/2, uθ2(a2) = 1 uθ1(a3) = 1/2, uθ2(a3) = 0 We observe that the only case in which the sender receives different feedback in the two instances is when they induce the posterior distribution ξ1 := (1, 0). This is because, when the sender induces ξ1 in the first instance, the receiver plays the action a3 A, breaking ties in favor of the sender, while in the second instance, the receiver plays the action a1 A. Such a posterior can be induced, in both the two instances, with a probability of at most γ to be consistent with the prior. We also observe that in the first instance the optimal sender s signaling scheme γ is such that γ(ξ1) = 4γ and γ(ξ2) = 1 4γ, where we let ξ2 := (0, 1). Furthermore, the sender s expected utility in γ is equal to (1 + 4γ)/2. In the second instance, the optimal sender s signaling scheme γ is such that γ(ξ2) = 1 8γ and γ(ξ3) = 8γ, with ξ3 := (1/2, 1/2). It is easy to verify that such a signaling scheme achieves an expected utility of 1/2. In the following, we let P1 and P2 be the probability measures induced by the interconnection of a given algorithm executed in the first and in the second instances, respectively. Furthermore, we introduce the event EN, under which, during the first N rounds, the sender never observes the action a3 A. It is easy to verify that such an event holds with a probability of at least P1(EN) (1 4γ)N in the first instance. 
This is because, at each round, the action a3 can be observed with a probability of at most 4γ. In contrast, since in the second instance the receiver never plays the action a3, it holds that P2(E_N) = 1 ≥ (1 − 4γ)^N. Then, by letting γ^N_1 (γ^N_2) be the signaling scheme returned after N rounds in the first (second) instance, we have:
\[
\begin{aligned}
P_1\!\left(\gamma^N_1(\xi_1) \le 2\gamma\right) + P_2\!\left(\gamma^N_2(\xi_1) \ge 2\gamma\right)
&\ge P_1\!\left(\gamma^N_1(\xi_1) \le 2\gamma,\, E_N\right) + P_2\!\left(\gamma^N_2(\xi_1) \ge 2\gamma,\, E_N\right) \\
&= P_1\!\left(\gamma^N_1(\xi_1) \le 2\gamma \mid E_N\right) P_1(E_N) + P_2\!\left(\gamma^N_2(\xi_1) \ge 2\gamma \mid E_N\right) P_2(E_N) \\
&\ge \left[\,P_1\!\left(\gamma^N_1(\xi_1) \le 2\gamma \mid E_N\right) + P_2\!\left(\gamma^N_2(\xi_1) \ge 2\gamma \mid E_N\right)\right] P_1(E_N) \\
&\ge P_1(E_N) \ge (1 - 4\gamma)^N \ge 2\eta. \qquad (13)
\end{aligned}
\]
The above result holds by observing that, under the event E_N, the behaviour of any algorithm working in the first instance coincides with the behaviour of the same algorithm working in the second one, as it receives the same feedback. Furthermore, Inequality (13) holds when N is such that:
\[
N \le \frac{\log(1/(2\eta))}{10\gamma} \le \frac{\log(2\eta)}{\log(1 - 4\gamma)}.
\]
Finally, we observe that, if γ^N_1(ξ1) ≤ 2γ, then the sender's expected utility in the first instance is at most 1/2 + γ. This is because, for any signaling scheme γ^N_1, we have:
\[
u^s(\gamma^N_1) \le \gamma^N_1(\xi_1)\, u^s(\xi_1) + \frac{1}{2}\left(1 - \gamma^N_1(\xi_1)\right) = \frac{1}{2} + \frac{\gamma^N_1(\xi_1)}{2}.
\]
Thus, if γ^N_1(ξ1) ≤ 2γ, then the signaling scheme γ^N_1 is at most γ-optimal. Equivalently, if the sender's final signaling scheme γ^N_2 in the second instance is such that γ^N_2(ξ1) ≥ 2γ, then the sender's utility in the second instance is at most 1/2 − γ. Thus, if γ^N_2(ξ1) ≥ 2γ, then the signaling scheme γ^N_2 is at most γ-optimal. Then, we either have:
\[
P_1\!\left(\gamma^N_1(\xi_1) \le 2\gamma\right) \ge \eta \quad \text{or} \quad P_2\!\left(\gamma^N_2(\xi_1) \ge 2\gamma\right) \ge \eta,
\]
showing that, if the number of rounds satisfies N ≤ log(1/(2η))/(10γ), there exists an instance in which no algorithm is guaranteed to return a γ-optimal signaling scheme with probability greater than or equal to 1 − η.
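To make the construction above concrete, the following small Python check, which we add purely for illustration and which is not part of the original proof, computes the receiver's best responses with sender-favorable tie-breaking and the sender's expected utilities of the two schemes described in the proof. It confirms that the feedback differs across the two instances only at ξ1 = (1, 0), and that the optimal values are (1 + 4γ)/2 and 1/2, respectively.

```python
# Numerical sanity check of the two instances used in the proof of Theorem 8.
GAMMA = 0.05                                   # any gamma in (0, 1/8) works here

SENDER = {"a1": 0.0, "a2": 0.5, "a3": 1.0}     # sender's utility, state-independent
RECEIVER_1 = {"a1": (1.0, 0.5), "a2": (0.5, 1.0), "a3": (1.0, 0.0)}   # (theta1, theta2)
RECEIVER_2 = {"a1": (1.0, 0.5), "a2": (0.5, 1.0), "a3": (0.5, 0.0)}

def best_response(receiver, xi):
    """Receiver's best response at posterior xi = (xi_theta1, xi_theta2),
    breaking ties in favor of the sender."""
    def value(a):
        return xi[0] * receiver[a][0] + xi[1] * receiver[a][1]
    best = max(value(a) for a in receiver)
    candidates = [a for a in receiver if abs(value(a) - best) < 1e-12]
    return max(candidates, key=lambda a: SENDER[a])

def expected_sender_utility(receiver, scheme):
    """scheme: dict mapping posteriors (tuples) to probabilities."""
    return sum(p * SENDER[best_response(receiver, xi)] for xi, p in scheme.items())

xi1, xi2, xi3 = (1.0, 0.0), (0.0, 1.0), (0.5, 0.5)

# Feedback differs across the two instances only at xi1.
assert best_response(RECEIVER_1, xi1) == "a3" and best_response(RECEIVER_2, xi1) == "a1"
for xi in (xi2, xi3, (0.3, 0.7), (0.9, 0.1)):
    assert best_response(RECEIVER_1, xi) == best_response(RECEIVER_2, xi)

# Optimal schemes described in the proof and their expected utilities.
scheme_1 = {xi1: 4 * GAMMA, xi2: 1 - 4 * GAMMA}          # first instance
scheme_2 = {xi3: 8 * GAMMA, xi2: 1 - 8 * GAMMA}          # second instance
assert abs(expected_sender_utility(RECEIVER_1, scheme_1) - (1 + 4 * GAMMA) / 2) < 1e-12
assert abs(expected_sender_utility(RECEIVER_2, scheme_2) - 0.5) < 1e-12
print("Theorem 8 construction checks out for gamma =", GAMMA)
```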
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The abstract and the introduction state all the main contributions of this work.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: All the assumptions are stated in the paper.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The assumptions needed are reported in the statements of the theorems and lemmas, while all the proofs are reported in the appendices.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: The paper does not include experiments.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: The paper does not include experiments.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper does not include experiments.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: The paper does not include experiments.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper does not include experiments.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The paper conforms with the NeurIPS Code of Ethics.
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: There is no societal impact of the work performed, since the work is mainly theoretical.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not release new assets.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.