# Online Bayesian Persuasion

Matteo Castiglioni, Politecnico di Milano, matteo.castiglioni@polimi.it
Andrea Celli,∗ Facebook Core Data Science, andreacelli@fb.com
Alberto Marchesi, Politecnico di Milano, alberto.marchesi@polimi.it
Nicola Gatti, Politecnico di Milano, nicola.gatti@polimi.it

Abstract

In Bayesian persuasion, an informed sender has to design a signaling scheme that discloses the right amount of information so as to influence the behavior of a self-interested receiver. This kind of strategic interaction is ubiquitous in real-world economic scenarios. However, the seminal model by Kamenica and Gentzkow makes some stringent assumptions that limit its applicability in practice. One of the most limiting assumptions is, arguably, that the sender is required to know the receiver's utility function to compute an optimal signaling scheme. We relax this assumption through an online learning framework in which the sender repeatedly faces a receiver whose type is unknown and chosen adversarially at each round from a finite set of possible types. We are interested in no-regret algorithms prescribing a signaling scheme at each round of the repeated interaction with performances close to that of a best-in-hindsight signaling scheme. First, we prove a hardness result on the per-round running time required to achieve no-α-regret for any α < 1. Then, we provide algorithms for the full and partial feedback models with regret bounds sublinear in the number of rounds and polynomial in the size of the instance.

1 Introduction

Bayesian persuasion was first introduced by Kamenica and Gentzkow [23] as the problem faced by an informed sender trying to influence the behavior of a self-interested receiver via the strategic provision of payoff-relevant information. In Bayesian persuasion, the agents' beliefs are influenced only by controlling "who gets to know what". This "sweet talk" is ubiquitous among all sorts of economic activities, and it was famously attributed to a quarter of the GDP in the United States by McCloskey and Klamer [28].^2 The computational study of Bayesian persuasion has been largely driven by its application in domains such as auctions and online advertisement [7, 19, 11], voting [1, 14, 16], traffic routing [9, 32], recommendation systems [26], security [30, 34], and product marketing [6, 13].

In the model by Kamenica and Gentzkow [23], the sender's and receiver's payoffs are determined by the receiver's action and a set of parameters collectively termed the state of nature. Unlike the receiver, the sender observes the realized state of nature, drawn from a shared prior distribution. The sender uses this private information to determine a signal for the receiver according to a publicly known signaling scheme, i.e., a mapping from states of nature to probability distributions over signals. In this paper, we focus on arguably one of the most severe limitations of the basic model: the sender must know exactly the receiver's utility function to compute an optimal signaling scheme.

∗ The work was conducted while the author was a postdoc at Politecnico di Milano.
^2 A more recent estimate by Antioch and others [2] places this figure at 30%.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Our model and results. We deal with uncertainty about the receiver's type by framing the Bayesian persuasion problem in an online learning framework.
We study a repeated Bayesian persuasion problem where, at each round, the receiver's type is adversarially chosen from a finite set of types. Our goal is the design of an online algorithm that recommends a signaling scheme at each round, guaranteeing an expected utility for the sender close to that of the best-in-hindsight signaling scheme. We study this problem under two models of feedback: in the full information model, the sender selects a signaling scheme and later observes the type of the best-responding receiver; in the partial information model, the sender only observes the actions taken by the receiver. First, in Section 4, we provide a negative result that rules out, even in the full information setting, the possibility of designing a no-regret algorithm with polynomial per-round running time. Furthermore, the same hardness result holds when adopting the notion of no-α-regret (in the additive sense) for any α < 1. Then, we focus on the problem of designing no-regret algorithms by relaxing the running time constraint. We show that it is possible to achieve a regret polynomial in the size of the problem instance and vanishing in the number of rounds T under both full feedback (with per-round regret $O(T^{-1/2})$) and partial feedback (with per-round regret $O(T^{-1/5})$). We present these results in Sections 5 and 6, respectively.

Related works. The closest line of research to ours is the one studying online learning problems in Stackelberg games. In these games, a leader commits to a probability distribution over a set of actions, and a follower plays an action maximizing her/his utility given the leader's commitment [33]. In this setting, Letchford et al. [25] and Blum et al. [10] study the problem of computing the best leader's strategy against an unknown follower using a polynomial number of best-response queries. Marecki et al. [27] study the problem with a single follower whose type is drawn from a Bayesian prior. Balcan et al. [8] study how to minimize the leader's regret in an online setting in which the follower's type is unknown and chosen adversarially from a finite set. Although the problem is conceptually similar to ours, the Bayesian persuasion framework presents a number of additional challenges: the solution to a Stackelberg game consists of a point in a finite-dimensional simplex, while the solution to a Bayesian persuasion problem is a probability distribution with potentially infinite support size. This probability distribution is subject to additional consistency constraints, which (under partial feedback) rule out the possibility of exploiting unbiased estimators of the sender's expected utility. Finally, it is worth mentioning that known online learning algorithms (for either the full or partial feedback setting) do not provide any guarantee in the case of Bayesian persuasion. Indeed, the regret bounds of those algorithms depend linearly or sublinearly on the number of actions, but the action space in Bayesian persuasion is infinite. A large body of previous works in other fields resolves the issue of dealing with an infinite action space by requiring specific assumptions (e.g., a linear or convex utility function [4, 12, 22, 35]). However, in the online Bayesian persuasion setting, these assumptions do not hold, as the sender's utility depends on the receiver's best response, which yields a function that is neither linear nor convex (nor even continuous) in the space of signaling schemes.

2 Preliminaries

The receiver has a finite set of m actions $A := \{a_i\}_{i=1}^m$ and a set of n possible types $K := \{k_i\}_{i=1}^n$.
For each type $k \in K$, the receiver's payoff function is $u^k : A \times \Theta \to [0, 1]$, where $\Theta := \{\theta_i\}_{i=1}^d$ is a finite set of d states of nature. For notational convenience, we denote by $u^k_\theta(a) \in [0, 1]$ the utility observed by the receiver of type $k \in K$ when the realized state of nature is $\theta \in \Theta$ and she/he plays action $a \in A$. The sender's utility when the state of nature is $\theta \in \Theta$ is described by the function $u^s_\theta : A \to [0, 1]$. As is customary in Bayesian persuasion, we assume that the state of nature is drawn from a common prior distribution $\mu \in \mathrm{int}(\Delta_\Theta)$, which is explicitly known to both the sender and the receiver.^3 Moreover, the sender can commit to a signaling scheme $\phi$, which is a randomized mapping from states of nature to signals for the receiver. Formally, $\phi : \Theta \to \Delta_S$, where S is a finite set of signals. We denote by $\phi_\theta$ the probability distribution employed by the sender when the state of nature is $\theta \in \Theta$, with $\phi_\theta(s)$ being the probability of sending signal $s \in S$.

^3 int(X) is the interior of set X and $\Delta_X$ is the set of all probability distributions over X. Vectors are denoted by bold symbols. For any vector $\mathbf{x}$, the value of its i-th component is denoted by $x_i$.

A one-shot interaction between the sender and the receiver goes on as follows: (i) the sender commits to a publicly known signaling scheme $\phi$ and the receiver observes the commitment; (ii) the sender observes the realized state of nature $\theta \sim \mu$; (iii) the sender draws a signal $s \sim \phi_\theta$ and communicates it to the receiver; (iv) the receiver observes s and rationally updates her/his prior beliefs over $\Theta$ according to Bayes' rule; (v) the receiver selects an action maximizing her/his expected utility.

Let $\Xi := \Delta_\Theta$ be the set of receiver's posterior beliefs over the states of nature. In step (iv), after observing $s \in S$, the receiver performs a Bayesian update and infers a posterior belief $\xi \in \Xi$ over the states of nature such that the component of $\xi$ corresponding to state of nature $\theta \in \Theta$ is:

$$\xi_\theta := \frac{\mu_\theta\, \phi_\theta(s)}{\sum_{\theta' \in \Theta} \mu_{\theta'}\, \phi_{\theta'}(s)}. \tag{1}$$

After computing $\xi$, the receiver solves a decision problem to find an action maximizing her/his expected utility given the current posterior. Letting $a \in A$ be the receiver's choice, the receiver observes payoff $u^k_\theta(a)$, where $k \in K$ is the receiver's type, while the sender observes payoff $u^s_\theta(a)$.

2.1 Working in the space of posterior distributions

It is oftentimes useful to represent signaling schemes as convex combinations of the posterior beliefs they can induce. First, we describe such an interpretation (see [24] for further details). Then, we define the receiver's best response given an arbitrary posterior belief.

Representing signaling schemes. Given a signaling scheme $\phi$, each signal realization $s \in S$ leads to a posterior belief $\xi^s \in \Xi$, whose components are defined as in Equation (1). Accordingly, each signaling scheme leads to a distribution over posterior beliefs. We denote a distribution over posteriors by $\mathbf{w} \in \Delta_\Xi$. We say that a signaling scheme $\phi : \Theta \to \Delta_S$ induces $\mathbf{w} \in \Delta_\Xi$ if, for every $\xi \in \Xi$, the component of $\mathbf{w}$ corresponding to $\xi$ is defined as follows:

$$w_\xi := \sum_{s \in S:\, \xi^s = \xi} \; \sum_{\theta \in \Theta} \mu_\theta\, \phi_\theta(s). \tag{2}$$

Intuitively, if $\phi$ induces $\mathbf{w}$, then $w_\xi$ represents the probability that $\phi$ induces the posterior $\xi \in \Xi$. We let $\mathrm{supp}(\mathbf{w}) := \{\xi \in \Xi \mid w_\xi > 0\}$ be the set of posteriors induced with strictly positive probability. We say that a distribution over posteriors $\mathbf{w} \in \Delta_\Xi$ is consistent (i.e., intuitively, there exists a valid signaling scheme $\phi$ inducing $\mathbf{w}$) if the following holds:

$$\sum_{\xi \in \mathrm{supp}(\mathbf{w})} w_\xi\, \xi_\theta = \mu_\theta, \quad \text{for all } \theta \in \Theta. \tag{3}$$

We let $\mathcal{W} \subseteq \Delta_\Xi$ be the set of distributions over posteriors that are consistent according to Equation (3).
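To make Equations (1)–(3) concrete, the following is a minimal numerical sketch, assuming NumPy is available; the specific two-state, two-signal scheme is the one that will serve as the running example in Section 2.2, and all variable names are illustrative. It computes the posterior induced by each signal, the corresponding distribution over posteriors, and checks consistency with the prior.

```python
import numpy as np

# Prior over the two states of nature and a signaling scheme phi:
# rows are states, columns are signals, each row sums to one.
mu = np.array([0.3, 0.7])
phi = np.array([[0.0, 1.0],    # phi_{theta_1}(s_1), phi_{theta_1}(s_2)
                [4/7, 3/7]])   # phi_{theta_2}(s_1), phi_{theta_2}(s_2)

# Equation (1): posterior induced by each signal s.
signal_prob = phi.T @ mu                                     # P(s) = sum_theta mu_theta * phi_theta(s)
posteriors = (phi * mu[:, None]).T / signal_prob[:, None]    # row s holds xi^s

# Equation (2): distribution over posteriors; here each signal induces a distinct posterior,
# so the weight of each posterior is simply the probability of the corresponding signal.
w = signal_prob

# Equation (3): consistency check -- the induced posteriors must average back to the prior.
assert np.allclose(posteriors.T @ w, mu)

print(posteriors)   # [[0.  1. ], [0.5 0.5]]
print(w)            # [0.4 0.6]
```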
In the remainder of the paper, we equivalently employ $\phi$ or $\mathbf{w}$ to denote an arbitrary signaling scheme.

Receiver's best-response set. After observing a signal $s \in S$ that induces a posterior $\xi \in \Xi$, the receiver best responds by choosing an action that maximizes her/his expected utility (step (v)). The set of actions maximizing the receiver's expected utility given posterior $\xi$ is defined as follows:

Definition 1 (BR-set). Given posterior $\xi \in \Xi$ and type $k \in K$, the best-response set (BR-set) is:
$$B^k_\xi := \arg\max_{a \in A} \sum_{\theta \in \Theta} \xi_\theta\, u^k_\theta(a).$$

We denote by $b^k_\xi$ the action belonging to the BR-set $B^k_\xi$ played by the receiver. When the receiver is indifferent among multiple actions for a given posterior $\xi$, we assume that the receiver breaks ties in favor of the sender, i.e., she/he chooses an action $b^k_\xi \in \arg\max_{a \in B^k_\xi} \sum_\theta \xi_\theta\, u^s_\theta(a)$.^4

We conclude the section by introducing some additional notation. We denote by $u_s(\xi, k) := \sum_\theta \xi_\theta\, u^s_\theta(b^k_\xi)$ the sender's expected utility when she/he induces a posterior $\xi \in \Xi$ and the receiver is of type $k \in K$. Moreover, we use $u_s(\phi, k)$ to denote the sender's expected utility achieved with the signaling scheme $\phi$. Formally, $u_s(\phi, k) := \sum_{\xi \in \mathrm{supp}(\mathbf{w})} w_\xi\, u_s(\xi, k)$, where $\mathbf{w} \in \Delta_\Xi$ is the distribution over posteriors induced by $\phi$. Analogously, we write $u_s(\mathbf{w}, k)$. Finally, letting OPT be the sender's optimal expected utility, we say that a signaling scheme is α-optimal (in the additive sense) if it provides the sender with a utility at least as large as OPT − α.

^4 This assumption is customary in settings involving commitments, such as Stackelberg games [17, 18, 29].

|     | State G ($\mu_G = .3$) | State I ($\mu_I = .7$) |
|-----|------------------------|------------------------|
| A   | 0, 0                   | 0, 1                   |
| C   | 1, 1                   | 1, 0                   |

|        | State G | State I |
|--------|---------|---------|
| $s_1$  | 0       | 4/7     |
| $s_2$  | 1       | 3/7     |

| $\xi \in \mathrm{supp}(\mathbf{w}^\star)$ | State G | State I | $w^\star_\xi$ |
|-------------------------------------------|---------|---------|---------------|
| $\xi^1$                                   | 0       | 1       | 2/5           |
| $\xi^2$                                   | 1/2     | 1/2     | 3/5           |

Figure 1: Left: The prosecutor/judge game. Rows represent the judge's actions. For each possible state of nature {G, I}, the first entry in each cell is the prosecutor's payoff while the second is the judge's payoff. Center: The optimal signaling scheme $\phi^\star$. Each column describes the probability with which the two signals are drawn given the realized state of nature. Right: Representation of $\phi^\star$ as the convex combination of posteriors $\mathbf{w}^\star$.

2.2 Example

We illustrate the key notion of signaling scheme in a simple example with a single receiver type (i.e., |K| = 1) inspired by Kamenica and Gentzkow [23]: a prosecutor (the sender) is trying to convince a rational judge (the receiver) that a defendant is guilty. The judge has two available actions: to acquit or to convict the defendant (denoted by A and C, respectively). There are two possible states of nature: the defendant is either guilty (denoted by G) or innocent (denoted by I). The prosecutor and the judge share a prior belief $\mu_G = .3$. Moreover, the prosecutor gets utility 1 if the judge convicts the defendant and 0 otherwise, regardless of the state of nature. The prosecutor gets to observe the realized state of nature (i.e., whether the defendant is guilty or innocent). Then, she/he can exploit this information to select a signal from the set $S = \{s_1, s_2\}$ and send it to the judge. The judge has a unique type and she/he gets utility 1 for choosing the just action (convict when guilty and acquit when innocent) and utility 0 for choosing the unjust action (see Figure 1-Left). Figure 1-Center depicts a sender-optimal signaling scheme $\phi^\star$ obtained via the following LP:

$$\phi^\star \in \arg\max_{\phi \geq 0}\; u_s(\phi, k) \quad \text{s.t.} \quad \sum_{s \in S} \phi_\theta(s) = 1 \quad \forall \theta \in \Theta,$$

where k is the unique type of the judge.
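As written, the objective $u_s(\phi, k)$ depends on the judge's best response, which changes with the induced posterior; with a single receiver type, a standard way to make the program linear is to treat $s_1$ and $s_2$ as recommendations to play A and C and to add the corresponding best-response (obedience) constraints, which are spelled out at the end of this example. Below is a minimal numerical sketch of that LP, assuming SciPy is available; variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

mu_G, mu_I = 0.3, 0.7
# Decision variables: x = [phi_G(s1), phi_G(s2), phi_I(s1), phi_I(s2)].
# The prosecutor gets 1 whenever the judge convicts; s2 recommends "convict".
c = -np.array([0.0, mu_G, 0.0, mu_I])        # maximize mu_G*phi_G(s2) + mu_I*phi_I(s2)

# Obedience constraints, written as A_ub @ x <= 0 (judge utilities: uG(A)=0, uG(C)=1, uI(A)=1, uI(C)=0):
#   s1 recommends A:  mu_G*phi_G(s1)*(uG(A)-uG(C)) + mu_I*phi_I(s1)*(uI(A)-uI(C)) >= 0
#   s2 recommends C:  mu_G*phi_G(s2)*(uG(C)-uG(A)) + mu_I*phi_I(s2)*(uI(C)-uI(A)) >= 0
A_ub = np.array([[ mu_G,  0.0, -mu_I, 0.0],   # negated obedience constraint for s1
                 [ 0.0, -mu_G,  0.0, mu_I]])  # negated obedience constraint for s2
b_ub = np.zeros(2)

# Each row of phi must be a probability distribution over signals.
A_eq = np.array([[1.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 1.0]])
b_eq = np.ones(2)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
phi_star = res.x.reshape(2, 2)
print(phi_star)   # [[phi_G(s1), phi_G(s2)], [phi_I(s1), phi_I(s2)]] ~ [[0, 1], [4/7, 3/7]]
print(-res.fun)   # sender's optimal expected utility: 0.6
```

The solver returns $\phi^\star_G(s_2) = 1$ and $\phi^\star_I(s_2) = 3/7$, i.e., the scheme of Figure 1-Center, with sender value 3/5.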
When the sender acts according to $\phi^\star$, signal $s_1$ (resp., $s_2$) originates posterior $\xi^1$ (resp., $\xi^2$; see Figure 1-Right). Applying Equation (2) yields the equivalent representation of $\phi^\star$ as a convex combination of posteriors, i.e., $w^\star_{\xi^1} = 2/5$ and $w^\star_{\xi^2} = 3/5$. By unpacking the objective function of the above LP (and dropping the dependency on k), we have $B_{\xi^1} = \{A\}$ and $B_{\xi^2} = \{A, C\}$. Therefore, if the posterior is $\xi^1$, the judge will acquit the defendant, i.e., $b_{\xi^1} = A$. Otherwise, if the posterior is $\xi^2$, we have $b_{\xi^2} = C$ since the receiver breaks ties in favor of the sender. This highlights an intuitive interpretation of the signaling problem: the two signals may be interpreted as action recommendations. Signal $s_1$ (resp., $s_2$) is interpreted by the judge as a recommendation to play A (resp., C). Then, our definition of best-response set (Definition 1) implies that it is in the receiver's best interest to follow the action recommendations. The best-response conditions can be formulated in terms of linear constraints on $\phi_\theta$ as follows:

$$\sum_{\theta \in \Theta} \mu_\theta\, \phi_\theta(s_1)\, \big(u_\theta(A) - u_\theta(\hat a)\big) \geq 0 \quad \text{and} \quad \sum_{\theta \in \Theta} \mu_\theta\, \phi_\theta(s_2)\, \big(u_\theta(C) - u_\theta(\hat a)\big) \geq 0 \qquad \forall \hat a \in \{A, C\}.$$

3 The online Bayesian persuasion framework

We consider the following online setting. The sender plays a repeated game in which, at each round $t \in [T]$, she/he commits to a signaling scheme $\phi^t$, observes a state of nature $\theta^t \sim \mu$, and sends a signal $s^t \sim \phi^t_{\theta^t}$ to the receiver.^5 Then, a receiver of unknown type updates her/his prior distribution and selects an action $a^t$ maximizing her/his expected reward (in the one-shot interaction at round t). We focus on the problem in which the sequence of receiver's types $\mathbf{k} := \{k_t\}_{t \in [T]}$ is selected beforehand by an adversary. After the receiver plays $a^t$, the sender receives a feedback on her/his choice at round t. In the full information feedback setting, the sender observes the receiver's type $k_t$. Therefore, the sender can compute the expected payoff for any signaling scheme she/he could have chosen other than $\phi^t$. Instead, in the partial information feedback setting, the sender only observes the action $a^t$ played by the receiver at round t.

^5 Throughout the paper, the set {1, . . . , x} is denoted by [x].

We are interested in algorithms computing $\phi^t$ at each round t. The performance of one such algorithm is measured using the average per-round regret computed with respect to the best signaling scheme in hindsight. Formally:

$$R_T := \frac{1}{T} \left( \max_{\phi} \sum_{t=1}^{T} u_s(\phi, k_t) - \mathbb{E}\left[ \sum_{t=1}^{T} u_s(\phi^t, k_t) \right] \right),$$

where the expectation is on the randomness of the online algorithm (i.e., the probability distribution which is used by the sender to draw the signaling scheme at round t) and T is the number of rounds. Ideally, we would like to find an algorithm that generates a sequence $\{\phi^t\}_{t \in [T]}$ with the following properties: (i) the regret is polynomial in the size of the problem instance, i.e., poly(n, m, d), and goes to zero as a polynomial of T; (ii) the per-round running time is poly(t, n, m, d). An algorithm satisfying property (i) is usually called a no-regret algorithm. In the case in which requiring no-regret is too limiting, we use the following relaxed notion of regret. An algorithm has no-α-regret if there exists a constant c > 0 such that:

$$R_T \leq \alpha + \frac{1}{T^{c}}\, \mathrm{poly}(n, m, d).$$

The idea of no-α-regret is that the regret approaches α after a sufficiently large number of rounds (polynomial in the size of the game).
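For concreteness, here is a self-contained sketch of the repeated interaction and of the average per-round regret just defined, assuming NumPy is available. The toy instance, the small menu of candidate signaling schemes, the random type sequence, and the uniformly-randomizing "learner" are all illustrative placeholders, not the algorithms developed in Sections 5 and 6.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- A toy instance (all numbers are illustrative placeholders). ---
d, m, n = 2, 2, 2                       # states, actions, receiver types
mu = np.array([0.3, 0.7])               # prior over states
u_r = rng.random((n, d, m))             # receiver utilities u^k_theta(a)
u_s = rng.random((d, m))                # sender utilities u^s_theta(a)

def best_response(xi, k):
    """Receiver's best response to posterior xi, ties broken in the sender's favor."""
    exp_r = xi @ u_r[k]                                  # expected receiver utility per action
    best = np.flatnonzero(np.isclose(exp_r, exp_r.max()))
    return best[np.argmax((xi @ u_s)[best])]

def sender_utility(phi, k):
    """u_s(phi, k): expected sender utility of committing to phi against type k."""
    p_s = phi.T @ mu                                     # signal probabilities
    total = 0.0
    for s in range(phi.shape[1]):
        if p_s[s] == 0:
            continue
        xi = mu * phi[:, s] / p_s[s]                     # Equation (1)
        total += p_s[s] * (xi @ u_s)[best_response(xi, k)]
    return total

# A small, fixed menu of candidate signaling schemes (rows: states, columns: signals).
candidates = [np.array([[1.0, 0.0], [0.0, 1.0]]),        # fully informative
              np.array([[0.5, 0.5], [0.5, 0.5]]),        # uninformative
              np.array([[0.0, 1.0], [4/7, 3/7]])]        # partially informative

T = 1000
types = rng.integers(n, size=T)                          # stand-in for an adversarial type sequence
algo_utility = 0.0
for t in range(T):
    phi_t = candidates[rng.integers(len(candidates))]    # placeholder learner: pick uniformly at random
    algo_utility += sender_utility(phi_t, types[t])

best_in_hindsight = max(sum(sender_utility(phi, k) for k in types) for phi in candidates)
R_T = (best_in_hindsight - algo_utility) / T             # average per-round regret
print(f"average per-round regret of the uniform baseline: {R_T:.3f}")
```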
4 Hardness of sub-linear regret

Our first result is negative: for any α < 1, it is unlikely (i.e., technically, it is not the case unless NP ⊆ RP) that there exists a no-α-regret algorithm for the online Bayesian persuasion problem requiring a per-round running time polynomial in the size of the instance. In order to prove the result, we provide an intermediate step, showing that the problem of approximating an optimal signaling scheme is computationally intractable even in the offline Bayesian persuasion problem in which the sender knows the probability distribution over the receiver's types (see Theorem 1 below).

Definition 2 (OPT-SIGNAL). Given an offline Bayesian persuasion problem in which the distribution over the receiver's types $\rho \in \Delta_K$ is uniform, i.e., $\rho_k = \frac{1}{n}$ for all $k \in K$, we call OPT-SIGNAL the problem of finding an optimal signaling scheme $\phi : \Theta \to \Delta_S$, i.e., one maximizing the sender's expected utility $\frac{1}{n} \sum_{k \in K} u_s(\phi, k)$.

Then, we can prove the following result (the omitted proofs can be found in Appendix B).

Theorem 1. For every 0 ≤ α < 1, it is NP-hard to compute an α-optimal solution to OPT-SIGNAL.

Now, we use the approximation-hardness of the offline Bayesian persuasion problem to provide lower bounds on the α-regret in the online setting. In order to do this, we employ a set of techniques introduced by Roughgarden and Wang [31], which lead to the following result.^6

Theorem 2. For every α < 1, there is no polynomial-time algorithm for the online Bayesian persuasion problem providing no-α-regret, unless NP ⊆ RP.

^6 Theorem 2 can be obtained as a corollary of Theorem 6.2 by Roughgarden and Wang [31].

5 Full information feedback setting

The negative result of the previous section (Theorem 2) rules out the possibility of designing an algorithm which satisfies the no-regret property and requires a poly(t, n, m, d) per-round running time. A natural question is whether it is possible to devise a no-regret algorithm for the online Bayesian persuasion problem by relaxing the running-time constraint. This is not a trivial problem because, at every round t, the sender has to choose a signaling scheme among an infinite number of alternatives, and her/his utility depends on the receiver's best response, which yields a function that is neither linear nor convex (nor even continuous) in the space of the signaling schemes. However, we show that it is possible to provide a no-regret algorithm for the full information setting by restricting the sender's action space to a finite set of posteriors. All the omitted proofs are in Appendix C.

[Figure 2 omitted. Left panel: the space of posteriors (Guilty/Innocent) with $\xi^1 = (1, 0)$, $\xi^2 = (.5, .5)$, $\xi^3 = (0, 1)$ and the best-response regions $\Xi^1_A = \Xi_A$ and $\Xi^1_C = \Xi_C$. Right panel: the simplex $\Delta_{\hat\Xi}$ with vertices $\mathbf{w} = (1, 0, 0)$, $\mathbf{w} = (0, 1, 0)$, $\mathbf{w} = (0, 0, 1)$.]

Figure 2: Left: Subdivision of the space of posteriors $\Xi$ in the two best-response regions. If $\xi \in \Xi_A$ (resp., $\xi \in \Xi_C$) then the judge's best response under $\xi$ is acquitting (resp., convicting) the defendant. When $\xi = \xi^2$, the judge is indifferent among her/his available actions. We have $\hat\Xi = \{\xi^1, \xi^2, \xi^3\}$. Right: Visual depiction of $\hat\Xi$, $\hat{\mathcal{W}} \subseteq \Delta_{\hat\Xi}$, and $\mathcal{W}^{\circ} = V(\hat{\mathcal{W}})$. The set $\hat{\mathcal{W}}$ comprises the distributions over posteriors in $\hat\Xi$ consistent with the prior $\mu = (.3, .7)$, and it is obtained by intersecting $\Delta_{\hat\Xi}$ with $[\xi^1 \mid \xi^2 \mid \xi^3]\, \mathbf{w} \geq \mu$. As a result, we obtain $\hat{\mathcal{W}} = \mathrm{conv}\{(.3, 0, .7)^\top, (0, .6, .4)^\top\}$. Finally, $\mathcal{W}^{\circ} = V(\hat{\mathcal{W}}) = \{(.3, 0, .7)^\top, (0, .6, .4)^\top\}$.

First, we show that it is always possible to design a sender-optimal signaling scheme defined as a convex combination of a specific finite set of posteriors. For each type $k \in K$ and action $a \in A$, we define $\Xi^k_a \subseteq \Delta_\Theta$ as the set of posterior beliefs in which a is a receiver's best response. Formally,
$$\Xi^k_a := \left\{ \xi \in \Xi \mid a \in B^k_\xi \right\}.$$

Let $\mathbf{a} = (a^k)_{k \in K} \in \times_{k \in K} A$ be a tuple specifying one action for each receiver's type k. Then, for each tuple $\mathbf{a}$, let $\Xi_{\mathbf{a}} \subseteq \Delta_\Theta$ be the (potentially empty) polytope such that each action $a^k$ is optimal for the corresponding type k, i.e., $\Xi_{\mathbf{a}} := \bigcap_{k \in K} \Xi^k_{a^k}$. The polytope $\Xi_{\mathbf{a}}$ has a simple interpretation: a probability distribution over posteriors in $\Xi_{\mathbf{a}}$ yields a signaling scheme such that, for every type k, the receiver has no interest in deviating from $a^k$ in the induced posteriors (i.e., the constraints analogous to those of the example in Section 2.2 are satisfied). Then, let $\hat\Xi \subseteq \Xi$ be the set of posteriors defined as $\hat\Xi := \bigcup_{\mathbf{a} \in \times_{k \in K} A} V(\Xi_{\mathbf{a}})$.^7 Finally, we define the following set of consistent (according to Equation (3)) distributions over posteriors in $\hat\Xi$:

$$\hat{\mathcal{W}} := \Big\{ \mathbf{w} \in \Delta_{\hat\Xi} \;\Big|\; \sum_{\xi \in \hat\Xi} w_\xi\, \xi_\theta = \mu_\theta, \;\; \forall \theta \in \Theta \Big\}. \tag{4}$$

By letting M be a suitably defined $|\Theta| \times |\hat\Xi|$-dimensional matrix with one column for each $\xi \in \hat\Xi$, the affine hyperplanes defined by Equation (3) are in the form $M \mathbf{w} = \mu$. Since $\mathbf{w} \in \Delta_{\hat\Xi}$, we can safely rewrite the consistency constraints as $M \mathbf{w} \geq \mu$ (both sides sum to one, so the inequalities are equivalent to the original equalities; see the example below for a better intuition). Then, $\hat{\mathcal{W}}$ can be seen as the intersection between the simplex $\Delta_{\hat\Xi}$ and a finite number of half-spaces. Therefore, $\hat{\mathcal{W}}$ is a convex polytope, whose vertices compose the finite action space that will be employed by the no-regret algorithm. Specifically, let

$$\mathcal{W}^{\circ} := V(\hat{\mathcal{W}}). \tag{5}$$

^7 V(X) denotes the set of vertices of polytope X.

Example. Consider the game of Section 2.2 (see Figure 1-Left), where the receiver has a single type (type 1). We obtain $\hat\Xi$ by partitioning the space of posteriors in different best-response regions and by taking the vertices of the resulting polytopes (see Figure 2-Left). Then, we provide a visual depiction of $\hat{\mathcal{W}}$ and $\mathcal{W}^{\circ}$, which are obtained, respectively, by intersecting $\Delta_{\hat\Xi}$ with the hyperplanes corresponding to the consistency constraints (see Equation (4)), and then taking the vertices of the resulting polytope (see Figure 2-Right). Another example, with two receiver's types, is provided in Appendix A.

For an arbitrary sequence of receiver's types, we show that there exists $\mathbf{w} \in \mathcal{W}^{\circ}$ guaranteeing to the sender an expected utility that is equal to that of the best-in-hindsight signaling scheme.

Lemma 1. For every sequence of receiver's types $\mathbf{k} = \{k_t\}_{t \in [T]}$, it holds
$$\max_{\mathbf{w} \in \mathcal{W}^{\circ}} \sum_{t=1}^{T} u_s(\mathbf{w}, k_t) = \max_{\mathbf{w}' \in \mathcal{W}} \sum_{t=1}^{T} u_s(\mathbf{w}', k_t).$$

The size of the sender's finite action space grows exponentially in the number of states of nature d.

Lemma 2. The size of $\mathcal{W}^{\circ}$ is $|\mathcal{W}^{\circ}| \leq O\big((n\, m^2 + d)^d\big)$.

Now, by letting $\eta \in [0, 1]$ be the maximum absolute payoff value, we can employ any algorithm satisfying $R_T \leq O\big(\eta \sqrt{\log |A|/T}\big)$ as a black box (see, e.g., Polynomial Weights [15] and Follow the Lazy Leader [22]). By taking $\mathcal{W}^{\circ}$ as the sender's action space, we obtain the following.

Theorem 3. Given an online Bayesian persuasion problem with full information feedback, there exists an online algorithm such that, for every sequence of receiver's types $\mathbf{k} = \{k_t\}_{t \in [T]}$:
$$R_T \leq O\left( \sqrt{\frac{d \log(n\, m^2 + d)}{T}} \right).$$

Notice that any no-regret algorithm working on $\mathcal{W}^{\circ}$ requires a per-round running time polynomial in n, m and exponential in d (see the bound in Lemma 2). This shows that the source of the hardness result in Theorem 2 is the number of states of nature d, while achieving no-regret in polynomial time is possible when the parameter d is fixed.
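Theorem 3 can be instantiated with any standard expert algorithm run over the finite set $\mathcal{W}^{\circ}$. Below is a minimal sketch of one such black box, a multiplicative-weights update in the spirit of Polynomial Weights [15]; the class name and the assumption that the vertices of $\hat{\mathcal{W}}$ and the utilities $u_s(\mathbf{w}, k_t)$ have been precomputed are illustrative, not part of the paper's construction.

```python
import numpy as np

class PolynomialWeights:
    """Multiplicative-weights learner over a finite set of signaling schemes (utilities in [0, 1])."""

    def __init__(self, num_actions, T):
        self.weights = np.ones(num_actions)
        self.eps = min(0.5, np.sqrt(np.log(num_actions) / T))

    def draw(self, rng):
        """Sample the index of the signaling scheme to commit to at the current round."""
        p = self.weights / self.weights.sum()
        return rng.choice(len(self.weights), p=p)

    def update(self, utilities):
        # Full information: after observing k_t, the sender can evaluate u_s(w, k_t)
        # for every w in the finite set and reward all of them at once.
        self.weights *= 1.0 + self.eps * np.asarray(utilities)

# Usage sketch (illustrative names): `vertex_utilities[t][i]` holds u_s(w_i, k_t)
# for the i-th element of the precomputed vertex set.
# rng = np.random.default_rng(0)
# learner = PolynomialWeights(num_actions=len(vertices), T=T)
# for t in range(T):
#     i_t = learner.draw(rng)              # signaling scheme committed to at round t
#     learner.update(vertex_utilities[t])  # full-information feedback after observing k_t
```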
6 Partial information feedback setting

In this setting, at every round t, the sender can only observe the action $a^t$ played by the receiver. Therefore, the sender has no information on the utility $u_s(\mathbf{w}, k_t)$ that she/he would have obtained by choosing any signaling scheme $\mathbf{w} \in \mathcal{W}^{\circ}$ other than $\mathbf{w}^t$. We show how to design no-regret algorithms with regret bounds that depend polynomially on the size of the problem instance by exploiting a reduction from the partial information setting to the full information one.^8 The main idea is to use a full-information no-regret algorithm in combination with a mechanism to estimate the sender's utilities corresponding to signaling schemes different from the one recommended by the algorithm. In particular, the overall time horizon T is split into a given number of equally-sized blocks, each corresponding to a window of time simulating a single round of a full information setting. During this window, the strategy suggested by the full-information algorithm is played in most of the rounds (exploitation phase), while only a few rounds are chosen uniformly at random and used by the mechanism that estimates the utilities provided by other signaling schemes (exploration phase). Algorithm 1 provides a sketch of the overall procedure, where Z (Line 1) denotes the number of blocks, which are the intervals of consecutive rounds $\{I_\tau\}_{\tau \in [Z]}$ defined in Line 4. The FULL-INFORMATION(·) sub-procedure is a black box representing a no-regret algorithm for the full information setting, working on a subset $\bar{\mathcal{W}} \subseteq \mathcal{W}^{\circ}$ of signaling schemes. After the execution of all the rounds of each block $\tau \in [Z]$, it takes as input the utility estimates computed during $I_\tau$ and returns a recommended strategy $\mathbf{q}^{\tau+1} \in \Delta_{\bar{\mathcal{W}}}$ for the next block $I_{\tau+1}$ (see Line 14).

Algorithm 1: ONLINE BAYESIAN PERSUASION WITH PARTIAL INFORMATION FEEDBACK
Input: full-information no-regret algorithm FULL-INFORMATION(·) working on $\bar{\mathcal{W}} \subseteq \mathcal{W}^{\circ}$; subset of signaling schemes $\tilde{\mathcal{W}} \subseteq \mathcal{W}^{\circ}$ used for exploration (see Appendix D.2 for the definitions of $\bar{\mathcal{W}}$ and $\tilde{\mathcal{W}}$)
1: Let Z be the number of blocks (defined as in Lemma 3)
2: Let $\mathbf{q}^1 \in \Delta_{\bar{\mathcal{W}}}$ be the uniform distribution over $\bar{\mathcal{W}}$
3: for $\tau = 1, \ldots, Z$ do
4:   $I_\tau \leftarrow \big\{(\tau-1)\frac{T}{Z} + 1, \ldots, \tau \frac{T}{Z}\big\}$
5:   Choose a random permutation $\pi : [|\tilde{\mathcal{W}}|] \to \tilde{\mathcal{W}}$ and rounds $t_1, \ldots, t_{|\tilde{\mathcal{W}}|}$ at random from $I_\tau$
6:   for $t = (\tau-1)\frac{T}{Z} + 1, \ldots, \tau \frac{T}{Z}$ do
7:     if $t = t_j$ for some $j \in [|\tilde{\mathcal{W}}|]$ then
8:       $\mathbf{q}^t \leftarrow \mathbf{q} \in \Delta_{\mathcal{W}^{\circ}}$ such that $q_{\mathbf{w}} = 1$ for the signaling scheme $\mathbf{w} = \pi(j)$   ▷ Exploration phase
9:     else
10:      $\mathbf{q}^t \leftarrow \mathbf{q}^\tau$   ▷ Exploitation phase
11:    Play a signaling scheme $\mathbf{w}^t \in \mathcal{W}^{\circ}$ randomly drawn from $\mathbf{q}^t$
12:    Observe the sender's utility $u_s(\mathbf{w}^t, k_t)$ and the receiver's action $a^t \in A$
13:  Compute estimators $\hat{u}^s_{I_\tau}(\mathbf{w})$ of $u^s_{I_\tau}(\mathbf{w}) := \frac{1}{|I_\tau|} \sum_{t \in I_\tau} u_s(\mathbf{w}, k_t)$ for all $\mathbf{w} \in \bar{\mathcal{W}}$
14:  $\mathbf{q}^{\tau+1} \leftarrow$ FULL-INFORMATION$\big(\{\hat{u}^s_{I_\tau}(\mathbf{w})\}_{\mathbf{w} \in \bar{\mathcal{W}}}\big)$

^8 The reduction is an extension of those proposed by Balcan et al. [8] and Awerbuch and Mansour [5].

During each block $I_\tau$ with $\tau \in [Z]$, Algorithm 1 alternates between two tasks: (i) exploration (Line 8), trying all the signaling schemes in a subset $\tilde{\mathcal{W}} \subseteq \mathcal{W}^{\circ}$ given as input, so as to compute the required estimates of the sender's expected utilities; and (ii) exploitation (Line 10), playing the strategy $\mathbf{q}^\tau$ recommended by FULL-INFORMATION(·) for $I_\tau$.

Our main result is the proof that Algorithm 1 achieves the no-regret property. Formally:

Theorem 4. Given an online Bayesian persuasion problem with partial feedback, there exist $\bar{\mathcal{W}} \subseteq \mathcal{W}^{\circ}$, $\tilde{\mathcal{W}} \subseteq \mathcal{W}^{\circ}$, and estimators $\hat{u}^s_{I_\tau}(\mathbf{w})$ such that Algorithm 1 provides the following regret bound:
$$R_T \leq O\left( \frac{n\, m^{2/3}\, d\, \log^{1/3}(mn + d)}{T^{1/5}} \right).$$

In order to prove this result, we show that Algorithm 1 provides a regret bound that depends on the number $|\tilde{\mathcal{W}}|$ of signaling schemes used for exploration, the logarithm of $|\bar{\mathcal{W}}|$, and the range and bias of the estimators $\hat{u}^s_{I_\tau}(\mathbf{w})$.
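To convey the block structure of Algorithm 1 (and only that), here is a heavily simplified sketch assuming NumPy: it explores every candidate scheme once per block and uses the single observed utility as the estimate $\hat{u}^s_{I_\tau}(\mathbf{w})$, whereas the actual algorithm explores only the subset $\tilde{\mathcal{W}}$ and builds the biased estimators of Appendix D.2. The multiplicative-weights instantiation of FULL-INFORMATION(·) and all names are illustrative.

```python
import numpy as np

def partial_feedback_reduction(vertices, sender_utility, types, T, Z, rng):
    """Simplified block-based reduction in the spirit of Algorithm 1.

    `sender_utility(w, k)` plays the role of the bandit feedback observed when
    committing to `w` against a receiver of type `k`; per-block utility estimates
    are fed to a multiplicative-weights learner acting as FULL-INFORMATION(.).
    """
    N = len(vertices)
    assert T // Z >= N, "each block must be long enough to explore every scheme once"
    weights = np.ones(N)
    eps = min(0.5, np.sqrt(np.log(N) / Z))
    block = T // Z
    total_utility = 0.0
    for tau in range(Z):
        rounds = np.arange(tau * block, (tau + 1) * block)
        # Random exploration rounds, each assigned a distinct scheme to try (Line 5).
        explore = dict(zip(rng.choice(rounds, size=N, replace=False), rng.permutation(N)))
        estimates = np.zeros(N)
        q_tau = weights / weights.sum()                  # strategy recommended for this block
        for t in rounds:
            if t in explore:
                i = explore[t]                           # exploration phase (Line 8)
            else:
                i = rng.choice(N, p=q_tau)               # exploitation phase (Line 10)
            u = sender_utility(vertices[i], types[t])
            total_utility += u
            if t in explore:
                estimates[explore[t]] = u                # one-sample estimate of u^s_{I_tau}(w)
        weights *= 1.0 + eps * estimates                 # full-information update (Line 14)
    return total_utility
```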
To do this, we extend a result shown by Balcan et al. [8, Lemma 6.2] to the more general case in which only biased utility estimators are available, rather than unbiased ones. This result can be generalized to any partial information setting (beyond online Bayesian persuasion). In any block $I_\tau$ with $\tau \in [Z]$, for every $\mathbf{w} \in \bar{\mathcal{W}}$, we assume that Algorithm 1 has access to an estimator $\hat{u}^s_{I_\tau}(\mathbf{w})$ of the sender's average utility $u^s_{I_\tau}(\mathbf{w}) = \frac{1}{|I_\tau|} \sum_{t \in I_\tau} u_s(\mathbf{w}, k_t)$ obtained by committing to $\mathbf{w}$ during the block $I_\tau$, with the following properties: (i) the bias is bounded by a given constant $\iota \in (0, 1)$, i.e., it holds $\big| u^s_{I_\tau}(\mathbf{w}) - \mathbb{E}\big[\hat{u}^s_{I_\tau}(\mathbf{w})\big] \big| \leq \iota$; (ii) the range is limited, i.e., there exists an $\eta \in \mathbb{R}$ such that $\hat{u}^s_{I_\tau}(\mathbf{w}) \in [-\eta, +\eta]$.

Lemma 3. Suppose that Algorithm 1 has access to estimators $\hat{u}^s_{I_\tau}(\mathbf{w})$ with properties (i) and (ii) for some constants $\iota \in (0, 1)$ and $\eta \in \mathbb{R}$, for every signaling scheme $\mathbf{w} \in \bar{\mathcal{W}}$ and block $I_\tau$ with $\tau \in [Z]$. Moreover, let $Z := T^{2/3}\, |\tilde{\mathcal{W}}|^{-2/3}\, \eta^{2/3} \log^{1/3} |\bar{\mathcal{W}}|$. Then, Algorithm 1 guarantees regret:
$$R_T \leq O\left( \frac{|\tilde{\mathcal{W}}|^{1/3}\, \eta^{2/3}\, \log^{1/3} |\bar{\mathcal{W}}|}{T^{1/3}} + \iota \right).$$

Lemma 3 shows that even if the utility estimators have a small bias, we can still hope for a no-regret algorithm. However, we have to guarantee that $\tilde{\mathcal{W}}$ has a polynomial size, and that the estimators have a limited range. These requirements can be achieved by estimating the sender's utilities indirectly by means of other related estimates, at the cost of giving up on the unbiasedness of the estimators. The key observation that allows us to get the desired estimators $\hat{u}^s_{I_\tau}(\mathbf{w})$ by only exploring a polynomially-sized set $\tilde{\mathcal{W}}$ is that the utilities $u^s_{I_\tau}(\mathbf{w})$ that we wish to estimate are not independent, but they all depend on the frequency of each receiver's type during block $I_\tau$. Thus, only these (polynomially many) quantities need to be estimated. In order to do so, we use the concept of barycentric spanners [4] (see Appendix D.2 for the details). A direct application of barycentric spanners to our setting would require being able to induce any receiver's posterior during the exploration phase. Unfortunately, this is not possible, as the sender is forced to play consistent signaling schemes (see Equation (3)), which could prevent her from inducing certain posteriors. We achieve the goal of keeping the bias and the range of the estimators small by adopting the following two technical caveats: (i) we focus on posteriors that can be induced by a signaling scheme with at least some ("not too small") probability, which ensures that the resulting estimators have a limited range; and (ii) we restrict the full-information algorithm to signaling schemes $\bar{\mathcal{W}} \subseteq \mathcal{W}^{\circ}$ inducing a small number of posteriors, which guarantees to have estimators with a small bias. We provide our complete technical results in Appendix D.

7 Discussion and future works

We proposed the online Bayesian persuasion framework as a natural extension of the original model by Kamenica and Gentzkow [23]. This is, to the best of our knowledge, the first work relaxing the assumption that the sender has a perfect knowledge of the receiver's utility function. We proved that any no-regret algorithm for this setting has to require an exponential per-round running time, and we designed no-regret algorithms for the partial and full information feedback settings with adversarially chosen sequences of types. In the future, it would be interesting to study what happens if the receiver can play, at each round, an approximate best response (ϵ-best response) to the sender's signal.
We conjecture that in this case it should be possible to build a no-regret algorithm with quasi-polynomial per-round running time.

Broader Impact

Bayesian persuasion is a fascinating model that suffers from some limiting assumptions, which have prevented a widespread use of the framework in practical applications. This work tries to amend one of such limitations, by relaxing the constraint that the sender has to have a perfect knowledge of the payoff structure of the game. This goes in the direction of developing a complete theory of "Bayesian persuasion from data" as a framework based solely on the sender's and receiver's historical observations. In the future, an application of this framework at scale (e.g., on large social platforms) could raise some societal challenges (see, e.g., recent works on Bayesian persuasion as an election-manipulation tool). Therefore, future research in this direction should prioritize the study of how to protect receivers from excessive information garbling, and how to incentivize senders to work towards a socially-acceptable outcome.

Acknowledgments and Disclosure of Funding

This work has been partially supported by the Italian MIUR PRIN 2017 Project ALGADIMAR "Algorithms, Games, and Digital Markets".

References

[1] Ricardo Alonso and Odilon Câmara. Persuading voters. American Economic Review, 106(11):3590–3605, 2016.
[2] Gerry Antioch et al. Persuasion is now 30 per cent of US GDP: Revisiting McCloskey and Klamer after a quarter of a century. Economic Round-up, (1):1, 2013.
[3] Benjamin Assarf, Ewgenij Gawrilow, Katrin Herr, Michael Joswig, Benjamin Lorenz, Andreas Paffenholz, and Thomas Rehn. Computing convex hulls and counting integer points with polymake. Mathematical Programming Computation, 9(1):1–38, 2017.
[4] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
[5] Baruch Awerbuch and Yishay Mansour. Adapting to a reliable network path. In Proceedings of the Twenty-Second Annual Symposium on Principles of Distributed Computing, pages 360–367, 2003.
[6] Yakov Babichenko and Siddharth Barman. Algorithmic aspects of private Bayesian persuasion. In Innovations in Theoretical Computer Science Conference, 2017.
[7] Ashwinkumar Badanidiyuru, Kshipra Bhawalkar, and Haifeng Xu. Targeting and signaling in ad auctions. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2545–2563, 2018.
[8] Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Commitment without regrets: Online learning in Stackelberg security games. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pages 61–78, 2015.
[9] Umang Bhaskar, Yu Cheng, Young Kun Ko, and Chaitanya Swamy. Hardness results for signaling in Bayesian zero-sum and network routing games. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 479–496, 2016.
[10] Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Learning optimal commitment to overcome insecurity. In Advances in Neural Information Processing Systems, pages 1826–1834, 2014.
[11] Peter Bro Miltersen and Or Sheffet. Send mixed signals: Earn more, work less. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 234–247, 2012.
[12] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[13] Ozan Candogan. Persuasion in networks: Public signals and k-cores. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 133–134, 2019.
[14] Matteo Castiglioni, Andrea Celli, and Nicola Gatti. Persuading voters: It's easy to whisper, it's hard to speak loud. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 1870–1877, 2020.
[15] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[16] Yu Cheng, Ho Yee Cheung, Shaddin Dughmi, Ehsan Emamjomeh-Zadeh, Li Han, and Shang-Hua Teng. Mixture selection, mechanism design, and signaling. In 56th Annual Symposium on Foundations of Computer Science, pages 1426–1445, 2015.
[17] Vincent Conitzer and Dmytro Korzhyk. Commitment to correlated strategies. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 632–637, 2011.
[18] Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce, pages 82–90, 2006.
[19] Yuval Emek, Michal Feldman, Iftah Gamzu, Renato Paes Leme, and Moshe Tennenholtz. Signaling schemes for revenue maximization. ACM Transactions on Economics and Computation, 2(2):1–19, 2014.
[20] Ewgenij Gawrilow and Michael Joswig. polymake: A framework for analyzing convex polytopes. In Polytopes—Combinatorics and Computation (Oberwolfach, 1997), volume 29 of DMV Sem., pages 43–73. Birkhäuser, Basel, 2000.
[21] Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.
[22] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
[23] Emir Kamenica and Matthew Gentzkow. Bayesian persuasion. American Economic Review, 101(6):2590–2615, 2011.
[24] Emir Kamenica. Bayesian persuasion and information design. Annual Review of Economics, 11:249–272, 2019.
[25] Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In International Symposium on Algorithmic Game Theory, pages 250–262, 2009.
[26] Yishay Mansour, Aleksandrs Slivkins, Vasilis Syrgkanis, and Zhiwei Steven Wu. Bayesian exploration: Incentivizing exploration in Bayesian games. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 661–661, 2016.
[27] Janusz Marecki, Gerry Tesauro, and Richard Segal. Playing repeated Stackelberg games with unknown opponents. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pages 821–828, 2012.
[28] Donald McCloskey and Arjo Klamer. One quarter of GDP is persuasion. The American Economic Review, 85(2):191–195, 1995.
[29] Praveen Paruchuri, Jonathan P. Pearce, Janusz Marecki, Milind Tambe, Fernando Ordonez, and Sarit Kraus. Playing games for security: An efficient exact algorithm for solving Bayesian Stackelberg games. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 895–902, 2008.
[30] Zinovi Rabinovich, Albert Xin Jiang, Manish Jain, and Haifeng Xu. Information disclosure as a means to security. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 645–653, 2015.
[31] Tim Roughgarden and Joshua R. Wang. Minimizing regret with multiple reserves. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 601–616, 2016.
[32] Shoshana Vasserman, Michal Feldman, and Avinatan Hassidim. Implementing the wisdom of Waze. In Twenty-Fourth International Joint Conference on Artificial Intelligence, pages 660–666, 2015.
[33] Bernhard von Stengel and Shmuel Zamir. Leadership games with convex strategy sets. Games and Economic Behavior, 69(2):446–457, 2010.
[34] Haifeng Xu, Rupert Freeman, Vincent Conitzer, Shaddin Dughmi, and Milind Tambe. Signaling in Bayesian Stackelberg games. In Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems, pages 150–158, 2016.
[35] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, pages 928–936, 2003.