# Multi-Sender Persuasion: A Computational Perspective

Safwan Hossain*¹, Tonghan Wang*¹, Tao Lin*¹, Yiling Chen¹, David C. Parkes¹, Haifeng Xu²

*Equal contribution. ¹Harvard University, ²University of Chicago. Correspondence to: Safwan Hossain, Tonghan Wang, Tao Lin <{shossain, twang1, tlin}@g.harvard.edu>. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024.

Abstract

We consider multiple senders with an informational advantage signaling to convince a single self-interested actor to take certain actions. Generalizing the seminal Bayesian Persuasion framework, such settings are ubiquitous in computational economics, multi-agent learning, and machine learning with multiple objectives. The core solution concept here is the Nash equilibrium of the senders' signaling policies. Theoretically, we prove that finding an equilibrium in general is PPAD-hard; in fact, even computing a sender's best response is NP-hard. Given these intrinsic difficulties, we turn to finding local Nash equilibria. We propose a novel differentiable neural network to approximate this game's non-linear and discontinuous utilities. Complementing this with the extra-gradient algorithm, we discover local equilibria that Pareto dominate full-revelation equilibria and those found by existing neural networks. Broadly, our theoretical and empirical contributions are of interest to a large class of economic problems.

1. Introduction

Bayesian Persuasion (BP) (Kamenica & Gentzkow, 2011) has emerged as a seminal concept in economics and decision theory. At its heart, it is a principal-agent problem that models an informed sender strategically revealing some information to affect the decisions of a self-interested receiver. Both parties are assumed to be Bayesian and have distinct utilities that depend on some realized state of nature and the action taken by the receiver. The sender privately observes the state and can commit to selectively disclosing this information through a randomized signaling policy. The receiver updates their posterior belief based on the realized signal and best responds with an optimal action for this belief. The sender's goal is to maximize their utility by designing a signaling policy that nudges the receiver toward decisions preferred by the sender. This information design problem has found widespread applicability in a myriad of domains, including recommendation systems (Mansour et al., 2015; 2016), auctions and advertising (Bro Miltersen & Sheffet, 2012; Emek et al., 2014; Badanidiyuru et al., 2018), social networks (Candogan & Drakopoulos, 2020; Acemoglu et al., 2021), and reinforcement learning (Castiglioni et al., 2020; Wu et al., 2022).

The standard BP model is, however, significantly constrained by a strong assumption: the presence of only one sender. In the applications mentioned above, and indeed more broadly in settings like multi-agent learning (Balduzzi et al., 2018) and machine learning with multiple objectives (Pfau & Vinyals, 2016; Jaderberg et al., 2017), it is natural to have multiple parties who wish to influence the receiver toward their respective, possibly conflicting goals. As a demonstrative example, consider two ride-sharing firms, Uber and Lyft, and a dual-registered driver.
While the driver is unaware of real-time demand patterns, both firms have access to them and can strategically signal this information to the driver, influencing them toward certain pick-ups. The platforms' goals, however, are not aligned, with each wishing to direct the driver to its own optimal pick-ups. The driver is also self-interested and may prefer pick-ups that are on the way home. Our work aims to study this tension, induced by multiple informed parties attempting to influence a self-interested receiver's decision-making, within the BP paradigm. Crucially, while the sender-receiver relation still outlines a sequential game, the interaction among the multiple senders in our setting forms a simultaneous game, with the resulting Nash equilibrium (NE) being of core interest.

While this setup has been modeled in the economics literature (Gentzkow & Kamenica, 2017b; Ravindran & Cui, 2022), the multi-sender persuasion problem has not been formally studied from a computational perspective and presents distinct challenges. In standard single-sender BP, the optimal signaling policy for the sender can be computed efficiently by a linear program (Dughmi & Xu, 2016); this no longer holds in the multi-sender case, where we need to compute a sender's best-responding signaling policy given the others' policies. We give a non-convex optimization program for the best-response problem (Proposition 2) and, through an involved reduction, prove that computing a best response is in fact NP-hard in multi-sender persuasion games (Theorem 3). For equilibrium computation, we significantly generalize a specific characterization from prior works to show that a trivial equilibrium can be found easily under certain conditions, but it might offer poor utility to the senders (Theorem 4). We then prove that finding an equilibrium in general settings is PPAD-hard (Theorem 5). These computational hardness results are our main theoretical contribution.

The intrinsic difficulty of finding a (global) equilibrium in multi-sender persuasion motivates us to propose a deep-learning approach to finding ϵ-local equilibria (where no unilateral deviation within a limited range is beneficial). This spiritually straddles two bodies of work: the emergent area of differentiable economics, which builds parameterized representations for optimal economic design (Dütting et al., 2023), and the rich literature on learning in games (Bowling, 2004; Balduzzi et al., 2018; Azizian et al., 2020; Fiez et al., 2020; Bai et al., 2021; Bichler et al., 2021; Haghtalab et al., 2022; Goktas et al., 2023). Mirroring the obstacles encountered in our theoretical analysis, the non-differentiable and indeed discontinuous nature of the utility functions (Proposition 1) also poses hurdles to identifying even local equilibria. To address this, we propose a novel end-to-end differentiable network architecture that is expressive enough to model the abrupt changes in utilities. Once trained, these networks can complement algorithms like extra-gradient (Korpelevich, 1976; Jelassi et al., 2020) to locate ϵ-local NE. The quality of the approximated utility landscape confirms the superior expressive capacity of our networks. Further, we demonstrate that this improvement helps to discover ϵ-local NE that Pareto dominate both the full-revelation equilibria (Theorem 4) and the ϵ-local NE found by existing continuous and discontinuous (Wang et al., 2023) networks, in both synthetic and real-world scenarios.
Our novel techniques may be of independent interest for learning in general games with discontinuous and non-linear utilities.

1.1. Additional Related Work

The study of Bayesian persuasion and its various iterations has been extensively explored in the literature, as evidenced by the comprehensive surveys of Dughmi (2017); Kamenica (2019); Bergemann & Morris (2019). Among these, the most closely aligned with our work are the investigations involving multiple senders. The model of Gentzkow & Kamenica (2017b) explores a scenario where senders can arbitrarily correlate their signals, whereas Li & Norman (2021) consider sequential senders who choose signaling policies after observing those of previous senders. These two models differ from ours, wherein the senders send signals to the receiver independently and simultaneously, conditioning on the realized state. Further, they do not provide significant computational insights. Ding et al. (2023) study multi-sender information design in a special Pandora's Box setup, which differs significantly from ours and is thus not comparable. Ravindran & Cui (2022) also study Bayesian persuasion games featuring multiple independent and simultaneous senders, but assume that the senders have zero-sum utilities. They show that with a sufficiently large signaling space, the only Nash equilibrium is full revelation, wherein the state of nature is fully revealed to the receiver. However, many multi-sender persuasion games do not conform to a zero-sum utility framework. Consequently, two important structural questions arise from their work: (1) whether such a full-revelation equilibrium exists under a limited signaling space, and (2) whether multi-sender persuasion games with general utility structures give rise to other types of equilibria beyond full revelation. We provide affirmative answers to these queries, as delineated in Theorem 4.

Bayesian persuasion is subsumed within the broader principal-agent model (Gan et al., 2024), a concept that addresses a multitude of economics problems, including contract design (Zhu et al., 2023) and Stackelberg games (Myerson, 1982). In economic theory, the notion of incorporating multiple principals, analogous to senders in our context, has been proposed to model a range of important settings (Waterman & Meier, 1998; Hu et al., 2023). However, similar to the existing work on multi-sender persuasion, these contributions typically retain a conceptual focus from an economic perspective. Our work diverges by taking a computational lens. We introduce rigorous hardness guarantees for best-response computation and equilibrium determination, and propose a novel deep learning approach for identifying ϵ-local equilibria that may hold wider applicability.

2. Preliminaries

There are n senders {1, . . . , n} and a receiver. Let Ω be a finite set of possible states, with ω ∈ Ω denoting an arbitrary one. All senders and the receiver share a common prior distribution µ0 over the states Ω. We use µ ∈ ∆(Ω) to denote a distribution over states. The receiver takes some action a ∈ A, whose utility depends on the realized state ω and is given by v : Ω × A → R. The receiver's utility can also be represented as an |Ω| × |A| matrix V, with V[i, j] denoting the receiver's utility for action j at state i. The utility function of the j-th sender, u_j : Ω × A → R, also depends on the realized state and the receiver's action.
While the receiver only knows the prior, the senders privately observe the state realization ω ∼ µ0 and can use this informational advantage to alter the receiver's belief and persuade it to take certain actions.

Persuasion. We model the interaction between the senders and the receiver using the seminal Bayesian Persuasion (BP) framework. Senders can leverage their private observation of ω by strategically signaling to the receiver. Formally, letting S be a finite signal space, each sender j has an independent signaling policy π_j(s_j|ω), which specifies the probability of sending signal s_j ∈ S when the realized state is ω. From the receiver's perspective, it observes a joint signal s = (s_1, . . . , s_n) sampled from the joint conditional distribution π(s|ω) = ∏_{j=1}^n π_j(s_j|ω). While many works on Bayesian persuasion assume a signal space with |S| ≥ |A| (Kamenica & Gentzkow, 2011; Dughmi & Xu, 2016), we study the multi-sender problem in full generality (allowing |S| < |A|), since in many settings the action space can be arbitrarily large or even continuous (common in the economics literature), while the signaling/communication space may be limited.

Figure 1. Discontinuous utility functions in a multi-sender persuasion game with 2 senders, 2 signals, 2 actions, and 2 states. In each subplot, the x-axis represents the probability of Sender 1 transmitting Signal 1 at State 1, the y-axis shows the probability of Sender 2 emitting Signal 1 at State 1, and the z-axis quantifies Sender 2's utility. Signaling strategies of both senders at State 2 are set to (0.5, 0.5) in the top row and to (0.2, 0.8) and (0.8, 0.2) in the bottom row. In each column, we show the ground-truth ex-ante utility and the approximation results achieved by our method, ReLU, and DeLU (Wang et al., 2023) networks, respectively.

Consistent with the classical BP model, we assume that senders announce and commit to their signaling policies before observing state realizations. The receiver is Bayesian rational: upon signal realization, it updates its belief about the state and takes a resulting optimal action according to its utility. We refer to the interaction between the senders and the receiver as a multi-sender persuasion game, summarized as follows:

- All senders simultaneously announce their signaling policies π = (π_1, . . . , π_n).
- State ω ∼ µ0 is observed by the senders but not the receiver.
- Each sender j simultaneously draws a signal s_j ∼ π_j(·|ω) to send to the receiver. For s = (s_1, . . . , s_n), π(s|ω) = ∏_{j=1}^n π_j(s_j|ω) is the joint signal probability.
- After observing the joint signal s, the receiver forms a posterior belief µ_s about the state, with µ_s(ω) = µ0(ω)π(s|ω)/π(s) for every ω ∈ Ω, and takes an optimal action a*(µ_s) = argmax_{a∈A} E_{ω∼µ_s} v(ω, a).
- Each sender j obtains utility u_j(ω, a*(µ_s)).

The senders attempt to use signaling to maximize their ex-ante utility, described below.

Definition 1. The ex-ante utility for sender j under joint signaling policy π = (π_j, π_{−j}) is

u_j(π) = E_{ω∼µ0} [ Σ_{s∈S^n} π(s|ω) u_j(ω, a*(µ_s)) ],

where µ_s is the posterior distribution induced by joint signal s and policy π, and a*(µ_s) is the receiver's optimal (utility-maximizing) action at belief µ_s.
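The protocol above is directly executable. The following NumPy sketch (ours, not from the paper; the function name and signature are illustrative) computes a sender's exact ex-ante utility by enumerating joint signals, forming the posterior µ_s, and letting the receiver best respond. It also serves as the kind of "game simulator" used later to generate training data.

```python
import itertools
import numpy as np

def ex_ante_utility(j, policies, mu0, V, U):
    """Exact ex-ante utility of sender j (Definition 1).

    policies: list of n arrays, each |Omega| x |S|, rows summing to 1.
    mu0:      prior over states, shape (|Omega|,).
    V:        receiver utility matrix, shape (|Omega|, |A|).
    U:        list of n sender utility matrices, each (|Omega|, |A|).
    """
    n = len(policies)
    n_states, n_signals = policies[0].shape
    total = 0.0
    # Enumerate all joint signals s = (s_1, ..., s_n).
    for s in itertools.product(range(n_signals), repeat=n):
        # pi(s|omega) = prod_j pi_j(s_j|omega), one entry per state.
        joint = np.ones(n_states)
        for pi_j, s_j in zip(policies, s):
            joint *= pi_j[:, s_j]
        p_s = float(mu0 @ joint)           # marginal probability of s
        if p_s == 0.0:
            continue
        mu_s = mu0 * joint / p_s           # posterior belief mu_s
        a_star = int(np.argmax(mu_s @ V))  # receiver's optimal action
                                           # (ties broken at the first index)
        # Contribution of s to E_omega [ pi(s|omega) u_j(omega, a*) ].
        total += np.sum(mu0 * joint * U[j][:, a_star])
    return total
```

Note the exponential enumeration over |S|^n joint signals; this exactness is only affordable for small instances, which foreshadows the hardness results of Section 3.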
The relationship between the senders and the receiver forms a multi-leader-single-follower game, since the senders reveal their policies first and the receiver subsequently best responds to the joint signal realizations generated by these policies. While the senders and the receiver have a sequential relationship, the senders choose their signaling policies simultaneously. Thus, we consider Nash equilibria among the senders:

Definition 2. A Nash equilibrium (NE) for the multi-sender persuasion game is a profile of signaling policies π = (π_1, . . . , π_n) such that for any sender j and deviating policy π′_j, u_j(π) ≥ u_j(π′_j, π_{−j}).

The equilibrium defined above is in fact a subgame-perfect equilibrium of the extensive-form game among the senders and the receiver. We use the term Nash equilibrium to emphasize the simultaneity of the senders' interaction.

3. Theoretical Results

We now look to theoretically understand the equilibrium properties of the multi-sender persuasion game. We first consider the canonical best-response problem and show that solving it, even approximately, is NP-hard. We then extend and generalize a known equilibrium characterization that relies on revealing maximal information to the receiver. This equilibrium is generally not ideal for the senders and is possible only under certain conditions. Furthermore, in the general case, we show that equilibrium computation is PPAD-hard. Cumulatively, our strong intractability results suggest that developing provably efficient algorithms for finding global equilibria would be extremely challenging in our setting.

3.1. Best Response

We first consider the best-response problem for a sender; namely, fixing the other senders' signaling schemes π_{−i}, what is the optimal signaling scheme π_i that maximizes the ex-ante utility u_i(π_i, π_{−i}) of sender i? The best-response problem is essential to verifying whether a given joint signaling scheme π = (π_1, . . . , π_n) is a Nash equilibrium. Further, standard equilibrium-solving techniques often rely on simulating best-response dynamics.

In normal-form games, fixing others' strategies x_{−i}, the utility u_i(x_i, x_{−i}) of a player is linear in x_i, so the best-response problem can be solved efficiently by a linear program. In persuasion, a sender's signaling policy changes the induced posteriors, which changes the optimal action the receiver takes, since the receiver maximizes expected utility. Correspondingly, a sender's utility function u_i(π_i, π_{−i}) is piecewise linear with discontinuities corresponding to signaling schemes at which the mapping from signal realizations to optimal receiver actions changes. This is formalized more generally in Proposition 1 (with proof in Appendix A.1).

Proposition 1 [Discontinuous Utility]. The sender's utility function u_i(π) is discontinuous and piecewise non-linear in (π_1, . . . , π_n). Fixing π_{−i}, u_i(π_i, π_{−i}) is discontinuous and piecewise linear in π_i.

Maximizing u_i(π_i, π_{−i}) by enumerating all linear pieces is infeasible because, by a rough estimate, the number of linear pieces can be as large as O((|S|^n |A|^2)^{|Ω||S|}). Instead of enumerating all of these pieces, we design a continuous bi-linear program to solve the best-response problem (with proof in Appendix A.2).

Proposition 2 [Best Response Program]. Let ∆v(ω, a, a′) := v(ω, a) − v(ω, a′) for actions a, a′. Then, given others' signaling schemes π_{−i}, sender i's best response can be solved by the following optimization program with |Ω||S| + |S|^n |A| continuous variables and O(|S|^n |A|) constraints:

maximize   Σ_{ω∈Ω} Σ_{s∈S^n} µ0(ω) π_{−i}(s_{−i}|ω) π_i(s_i|ω) Σ_{a∈A} u_i(ω, a) y_{s,a}
subject to ∀ω: Σ_{s_i∈S} π_i(s_i|ω) = 1, and ∀s_i, ω: π_i(s_i|ω) ≥ 0;
           ∀s: Σ_{a∈A} y_{s,a} = 1, and ∀s, ∀a ∈ A: y_{s,a} ∈ [0, 1];
           ∀s, ∀a′ ∈ A: Σ_{ω∈Ω} Σ_{a∈A} µ0(ω) π_{−i}(s_{−i}|ω) π_i(s_i|ω) ∆v(ω, a, a′) y_{s,a} ≥ 0.

Here, y_{s,a} ∈ {0, 1} would indicate whether the receiver takes action a given joint signal s; it is relaxed to the continuous range [0, 1] above.

Briefly, the program above takes inspiration from the persuasion setting with (1) a single sender and (2) |S| = |A|, where signals can be interpreted as action recommendations and optimal signaling can be expressed as a linear program with an incentive-compatibility constraint ensuring that the receiver follows the recommended action. In our setting, even if |S| = |A|, the receiver observes joint signals of size |S|^n and a single sender cannot unilaterally specify the joint scheme, only its marginal. Correspondingly, the program needs to resolve the action taken by the receiver, and it becomes a bi-linear optimization problem.
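Because the objective and the incentive constraints contain products of the π_i and y variables, off-the-shelf LP solvers do not apply. For tiny instances, one can still sanity-check a best response by random search over sender i's policy simplex, reusing the simulator above. The sketch below is our own heuristic stand-in for intuition, not the paper's optimization program.

```python
def approx_best_response(i, policies, mu0, V, U, n_draws=20000, rng=None):
    """Random-search stand-in for the bilinear program (tiny instances only).

    Fixes pi_{-i} and samples candidate policies pi_i from the simplex,
    keeping the one with the highest exact ex-ante utility.
    """
    rng = rng or np.random.default_rng(0)
    n_states, n_signals = policies[i].shape
    best_pi = policies[i]
    best_val = ex_ante_utility(i, policies, mu0, V, U)
    for _ in range(n_draws):
        # Each row of the candidate policy is a point on the signal simplex.
        cand = rng.dirichlet(np.ones(n_signals), size=n_states)
        trial = list(policies)
        trial[i] = cand
        val = ex_ante_utility(i, trial, mu0, V, U)
        if val > best_val:
            best_pi, best_val = cand, val
    return best_pi, best_val
```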
We next show that the best-response problem is NP-hard, even with just two senders. This means that the above bi-linear program is not computationally tractable. This is a key result in our work and rules out even additively approximating the best response in polynomial time.

Theorem 3 [NP-hardness of Best Response]. It is NP-hard to solve the best-response problem in multi-sender persuasion, even with additive approximation error 1/|Ω|^6 and only n = 2 senders (while |Ω| and |A| are large).

The proof (in Appendix A.3) is technical and based on a non-trivial reduction from the NP-hard problem of public persuasion with multiple receivers (Dughmi & Xu, 2017). Intuitively, each signal s_{−i} from the other, non-responding senders induces a different belief about the state. From the best-responding sender's perspective, this can be interpreted as facing multiple receivers with different prior beliefs and needing to design a single signaling scheme π_i for all of them. With carefully crafted utilities, this problem can encode the public persuasion problem (Dughmi & Xu, 2017), which involves multiple receivers with the same belief but different utility functions. Our proof formally establishes this connection, which leads to the NP-hardness of our problem.

The NP-hardness of computing a best response, however, does not imply the hardness of equilibrium verification, i.e., determining whether a given strategy profile of the senders constitutes a Nash equilibrium. We conjecture that the equilibrium verification problem is coNP-hard; a formal proof would be an intriguing direction for future work.

3.2. Equilibrium Characterization

A simple observation from previous works on multi-sender Bayesian persuasion (Gentzkow & Kamenica, 2017b; Ravindran & Cui, 2022) is that if for every state ω ∈ Ω there is a unique optimal action for the receiver, then a simple equilibrium can be achieved by all senders fully revealing the state, i.e., S = Ω and π_i(s_i = ω|ω) = 1 for all i. Observe that this reveals the exact state realization to the receiver, so no sender can unilaterally affect the receiver's belief and thus its action. However, for this equilibrium to exist, every sender's signal space must be as large as the state space (|S| ≥ |Ω|), which is impractical if there are many states. Theorem 4 relaxes this assumption and shows that an equivalent equilibrium exists under a much weaker condition, |S| ≥ min(|A|^{1/(n−1)}, |Ω|^{1/(n−1)}), which can be easily satisfied when there are many senders. The proof of the theorem (in Appendix A.4) is constructive and builds a mapping between signals and actions inspired by Gray codes (Wilf, 1989).

Theorem 4 [Full-Revelation Equilibrium]. Suppose |S| ≥ min(|Ω|^{1/(n−1)}, |A|^{1/(n−1)}), and for every state ω ∈ Ω there is a unique optimal action for the receiver.
Then, the multi-sender persuasion game has an NE that fully reveals the optimal action for the realized state to the receiver. This equilibrium, however, is not necessarily unique.

While the result above generalizes an explicit equilibrium characterization to a much larger setting with limited signals, the corresponding equilibrium is optimal for the receiver and not necessarily for the senders. Indeed, if the preferences of the senders do not perfectly align with the receiver's (which is common and in fact the premise behind persuasion), this equilibrium will not be beneficial for the senders. Further, the construction above is based on the assumption that for every state, the receiver has a unique optimal action. This may not hold in many scenarios, with the receiver being indifferent between multiple actions. We show that in such scenarios, and thus in the general case, finding an equilibrium is PPAD-hard, even with a constant number of senders, states, and actions. The proof (in Appendix A.5) relies on a reduction from finding equilibria in two-player games with binary utilities.

Theorem 5 [PPAD-Hardness]. In multi-sender persuasion games that do not satisfy the condition "the receiver has a unique optimal action for every state ω", under some tie-breaking rules, finding an NE is PPAD-hard. This holds even if n = 2, |Ω| = 2, |A| = 4 (while |S| is large).

4. Deep Learning for Local Equilibrium

The strong computational hardness results established in the previous section motivate us to find methods to efficiently compute a local Nash equilibrium: a strategy profile wherein no small unilateral deviation can improve a player's utility. This has been promoted as an attractive solution concept for a plethora of settings (Fiez et al., 2020; 2021; Jin et al., 2020). In doing so, we also relax the assumption of having access to the exact utility model and take a sample-based approach popularized in the nascent literature on differentiable economics. This is especially prescient, as it gracefully generalizes to settings where the action space is rich (or even continuous) and it may only be possible to sample utilities for arbitrary policies. Correspondingly, we introduce a computational framework based on deep learning. It consists of a novel discontinuous neural network architecture approximating the senders' utility functions, and a local equilibrium solver running extra-gradient on the learned discontinuous networks. We describe the learning framework in detail and compare the found local NE against those obtained by strong baseline network structures as well as the full-revelation solution (Theorem 4).

Definition 3 [ϵ-Local Nash Equilibrium]. An ϵ-local Nash equilibrium for a multi-sender persuasion game is a profile of signaling policies π = (π_1, . . . , π_n) such that for any sender j and deviating policy π′_j ∈ {π′ : ‖π′ − π_j‖ ≤ ϵ}, it holds that u_j(π) ≥ u_j(π′_j, π_{−j}).

4.1. Method

We aim to use the extra-gradient method (Korpelevich, 1976) to find an ϵ-local NE. However, the major challenge in applying this, or indeed any other gradient-based learning algorithm, is that the senders' utility functions are discontinuous and non-differentiable in their signaling policies, as per Proposition 1. Conventional neural networks approximate continuous functions well but are not expressive enough to represent discontinuous functions (Scarselli & Tsoi, 1998). To solve this problem, we extend a fully connected feedforward network with ReLU activation (Agarap, 2018) to learn a differentiable representation of discontinuous functions.
To describe our method, we first introduce the activation pattern and the piecewise linearity of ReLU networks.

ReLU networks. Suppose there are L hidden layers. Layer l has weights W^(l) ∈ R^{n_l × n_{l−1}} and biases b^(l) ∈ R^{n_l}; n_0 = d is the input dimension. The output layer has weights W^(L+1) ∈ R^{d′ × n_L} and biases b^(L+1) ∈ R^{d′}. With input x ∈ R^d, the pre- and post-activation outputs of layer l are h^(l)(x) = W^(l) o^(l−1)(x) + b^(l) and o^(l)(x) = σ(h^(l)(x)), where σ(x) = max{x, 0} is the ReLU activation. For each hidden unit, the ReLU activation status takes two values, defined as 1 when the pre-activation h is positive and 0 when h is strictly negative. The activation pattern of the entire network is defined as follows.

Definition 4 [Activation Pattern]. An activation pattern of a ReLU network is a binary vector r = [r^(1), . . . , r^(L)] ∈ {0, 1}^{Σ_{l=1}^L n_l}, where r^(l) is a layer activation pattern comprising the activation status of each unit in layer l.

Figure 2. Our method finds better ϵ-local Nash equilibria than the baseline DeLU (Wang et al., 2023) and ReLU networks.

The activation pattern depends on the input x. Given an activation pattern r(x), the ReLU network is a linear function (Croce et al., 2019): h^(L+1)(x) = M^(L+1) x + z^(L+1), where M^(L+1) = W^(L+1) (∏_{k=1}^{L} R^(L+1−k)(x) W^(L+1−k)), z^(L+1) = b^(L+1) + Σ_{k=1}^{L} (∏_{j=0}^{L−k} W^(L+1−j) R^(L−j)(x)) b^(k), and R^(k) is a diagonal matrix with diagonal elements equal to layer k's activation pattern r^(k).

Previous work. To introduce discontinuity, DeLU (Wang et al., 2023) proposes to generate the bias of the last layer, b^(L+1), by an auxiliary network conditioned on the activation pattern r(x). The idea is that inputs with the same r(x) come from a polytope that is the intersection of half-spaces: D(x) = ∩_{l=1,...,L} ∩_{i=1,...,n_l} Γ_{l,i}, where Γ_{l,i}, corresponding to unit i of layer l, is defined as:

Γ_{l,i} = { y ∈ R^d : δ^(l)_i (M^(l)_i y + z^(l)_i) ≥ 0 }.   (1)

Here M^(l)_i y + z^(l)_i is the output of unit i at layer l, and δ^(l)_i is 1 if h^(l)_i(x) is positive, and −1 otherwise. In this way, different pieces D(x) have different biases, introducing discontinuity at piece boundaries. However, since inputs in the same piece share the same weights, DeLU is a linear function within a piece and does not have enough expressivity to represent the utility functions in multi-sender persuasion games, which are piecewise non-linear (Proposition 1).

Network architecture. We enable a fully connected network to be piecewise Discontinuous and Non-Linear (DNL) by dividing the network into a lower part and a higher part. The lower part consists of the first K < L linear layers and is a normal network with ReLU activations. During a forward pass, we get the activation pattern r^(≤K) = [r^(1), . . . , r^(K)] of this lower network and generate the weights and biases of the higher part via a hyper-network g whose input is r^(≤K). Looking at the lower part, inputs with the same r^(≤K) reside in the intersection of half-spaces D^(≤K)(x) = ∩_{l=1,...,K} ∩_{i=1,...,n_l} Γ_{l,i}, with Γ_{l,i} defined in Eq. (1). By introducing the hyper-network, inputs in D^(≤K)(x) share a non-linear higher network. Therefore, within a piece, the utility approximation can be non-linear. Furthermore, different pieces have different r^(≤K), so the higher part can differ, introducing discontinuity at the boundaries.
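A minimal PyTorch sketch of this idea, under our own assumptions (K = 1 lower layer, a scalar utility output, and arbitrarily chosen widths); the paper's exact implementation may differ.

```python
import torch
import torch.nn as nn

class DNL(nn.Module):
    """Sketch of a piecewise Discontinuous Non-Linear (DNL) network.

    Lower part: K = 1 ReLU layer. Its binary activation pattern selects,
    via a hyper-network, the weights and biases of a small non-linear
    higher part (two layers here, so each piece is non-linear).
    """
    def __init__(self, d_in, hidden=64, hyper_hidden=32):
        super().__init__()
        self.lower = nn.Linear(d_in, hidden)
        # Hyper-network: activation pattern -> parameters of the higher part.
        n_params = (hidden * hidden + hidden) + (hidden + 1)
        self.hyper = nn.Sequential(
            nn.Linear(hidden, hyper_hidden), nn.ReLU(),
            nn.Linear(hyper_hidden, n_params),
        )
        self.hidden = hidden

    def forward(self, x):
        h = torch.relu(self.lower(x))      # lower-part output o^(1)
        pattern = (h > 0).float()          # activation pattern r^(<=K)
        params = self.hyper(pattern)       # piece-specific parameters
        H = self.hidden
        W1 = params[:, : H * H].reshape(-1, H, H)
        b1 = params[:, H * H : H * H + H]
        w2 = params[:, H * H + H : H * H + 2 * H]
        b2 = params[:, -1]
        z = torch.relu(torch.einsum("bij,bj->bi", W1, h) + b1)
        return torch.einsum("bi,bi->b", w2, z) + b2  # scalar utility estimate
```

Because the pattern is a hard threshold, crossing a piece boundary swaps the entire higher part, producing the jump discontinuity; within a piece, the two higher layers make the fit non-linear, which is exactly what DeLU lacks.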
Formally, we train a network f_j(π; θ_j), parameterized by θ_j, for each sender j to approximate its ex-ante utility (Definition 1) under the joint signaling policy π. The input to the lower part of f_j is the joint signaling policy π. The hyper-network g takes the activation pattern r^(≤K)(x) of the lower part as input and outputs {(W^(k), b^(k))}_{k=K+1}^{L+1} as the weights and biases for layers K + 1 to L + 1. After obtaining its weights and biases, the higher part takes the output of the lower part as input and generates an approximation of the ex-ante utility. It is worth noting that K < L, so there are at least two linear layers in the higher part, ensuring piecewise non-linearity. The whole network f_j(π; θ_j) is end-to-end differentiable and is updated by the MSE loss function:

L(θ_j) = E_π [ (f_j(π; θ_j) − u_j(π))^2 ].   (2)

To calculate this loss, we uniformly sample joint policies π and obtain the corresponding ex-ante utility u_j(π) by running a game simulator.

Figure 3. The ϵ-local Nash equilibria found by our method typically Pareto dominate the full-revelation equilibria and improve the random initial policies of extra-gradient by a large margin.

Figure 4. Our network achieves lower approximation errors compared to baseline network structures.

Extra-gradient. With f_j as a differentiable representation of the senders' ex-ante utility, we can run extra-gradient to find an ϵ-local NE. We directly parameterize the signaling policy π_j of sender j by a learnable matrix ϕ_j residing in Φ ⊂ R^{|Ω|×|S|}. A matrix in Φ has all of its elements in the range [0, 1], with each row summing to 1. The extra-gradient update can be written as:

(extrapolation)  ϕ_j^{τ+1/2} = p_Φ(ϕ_j^τ + γ_τ ∇_{ϕ_j^τ} f_j(π_{ϕ^τ}; θ_j)),
(update)         ϕ_j^{τ+1} = p_Φ(ϕ_j^τ + γ_τ ∇_{ϕ_j^{τ+1/2}} f_j(π_{ϕ^{τ+1/2}}; θ_j)).

Here, p_Φ[·] is the projection onto the constraint set Φ (we use a SoftMax projection in practice), π_ϕ is the joint parameterized signaling policy, and γ_τ is the learning rate. The parameters θ_j of f_j are fixed during the extra-gradient updates.
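A hedged PyTorch sketch of these updates, reusing the DNL class above. The encoding of the joint policy as flattened, concatenated logits and the gradient-ascent sign (each sender maximizes its own learned utility) are our assumptions about the implementation.

```python
def softmax_project(phi):
    """p_Phi: map unconstrained |Omega| x |S| logits onto row-stochastic
    policies via a row-wise softmax (the SoftMax projection in the text)."""
    return torch.softmax(phi, dim=-1)

def extragradient_step(phis, nets, lr):
    """One extra-gradient step on the learned utilities f_j (sketch).

    phis: list of leaf policy parameters (requires_grad=True), one per sender.
    nets: list of trained, frozen DNL utility models f_j.
    """
    def grad(j, point):
        # Joint policy: project every sender's logits, flatten, concatenate.
        pi = torch.cat([softmax_project(p).flatten() for p in point])
        util = nets[j](pi.unsqueeze(0)).sum()
        return torch.autograd.grad(util, point[j])[0]

    # Extrapolation: look ahead using gradients at the current point.
    half = [(p + lr * grad(j, phis)).detach().requires_grad_(True)
            for j, p in enumerate(phis)]
    # Update: step from the current point using gradients at the look-ahead.
    return [(p + lr * grad(j, half)).detach().requires_grad_(True)
            for j, p in enumerate(phis)]
```

Note that the softmax here is a reparameterization rather than a Euclidean projection: the optimization runs in logit space, and every iterate automatically satisfies the simplex constraints of Φ.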
5. Empirical Results

In this section, we evaluate our deep learning method by comparing against continuous neural networks with ReLU activations, the discontinuous neural network DeLU (Wang et al., 2023), and full-revelation strategies, on an illustrative example and a synthetic benchmark.

5.1. Didactic Example

We first demonstrate the representational capacity of our method on a simple multi-sender game with 2 senders, 2 signals, 2 actions, and 2 states. The utility matrix of the receiver is [1 0; 0 1], where each row corresponds to a state and each column corresponds to an action. The utility matrices of the two senders are [1 1; 1 3] and [4 1; 1 1], respectively. In the first row of Fig. 1, we fix both senders' signaling policies at State 2 to (0.5, 0.5) and vary their signaling policies at State 1. The x-axis is the probability of Sender 1 sending Signal 1 at State 1, the y-axis is the probability of Sender 2 sending Signal 1 at State 1, and the z-axis is the (possibly approximated) ex-ante utility of Sender 2. The second row is similar to the first, but the two senders' signaling policies at State 2 are (0.2, 0.8) and (0.8, 0.2), respectively.

The first column shows Sender 2's actual ex-ante utility. This utility function displays discontinuities, effectively captured by our method (second column). In contrast, the ReLU approximations in the third column are not accurate at piece boundaries, and the approximated DeLU function in the fourth column is linear in each piece, limiting its representational power for this game.

Figure 5. Our method achieves higher social welfare compared against baselines and full-revelation solutions in games with 4 senders.

To ensure a fair comparison, the networks used in this study, including ours, ReLU, and DeLU, are standardized in terms of architecture, featuring three hidden layers with 64 units each. The training process involves a dataset of 500,000 randomly selected samples (pairs of signaling policies and corresponding ex-ante utilities), over which the networks are trained for a total of 200 epochs. For our network, the lower part is the first hidden layer. This layer's activation pattern is used to generate the weights and biases of the subsequent two layers by a hyper-network, which is itself composed of two hidden layers, each containing 32 units.

5.2. Synthetic Benchmark

In this section, we generate synthetic problems to test whether our network can find ϵ-local Nash equilibria that are better (in terms of sender utility) than those found by the baseline network architectures as well as the full-revelation equilibria of Theorem 4.

Setup. The size of a problem is determined by the tuple (n, |Ω|, |S|, |A|), and our evaluation encompasses a range of problem sizes to thoroughly assess the efficacy of our method. Specifically, we consider 2 and 4 senders, and for each, (|Ω|, |S|, |A|) is drawn from the three-dimensional grid {2, 4, 6, 8, 10}^3. For each problem size, we randomly generate 5 problem instances, giving 1,250 problem instances in total to benchmark the proposed learning framework. The utility matrices for the receiver and the senders, as well as the prior belief over states, are randomly sampled from a Gaussian distribution with mean 0 and variance 100, with a SoftMax applied to generate the prior beliefs. We employ random numbers with significant variance to increase the complexity of the benchmark, thereby facilitating a more effective evaluation of different solutions. We standardize the network architecture of our method and the baselines, mirroring the configuration delineated in the didactic example, to ensure a fair comparative analysis. The training setup is described in detail in Appendix B.

To test whether a joint signaling policy profile π is an ϵ-local Nash equilibrium, we randomly sample K policies π′_j for each sender j in the neighborhood {π′_j : ‖π′_j − π_{ϕ_j}‖ ≤ ϵ} and check whether sender j can gain a higher utility at π′_j. In our experiments, the number of test samples K grows with the problem size. We set the neighborhood size ϵ to 0.005 and find that our experimental results are robust for values of ϵ up to 0.01, as evidenced by further results in Appendix B.
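A sketch of this verification test, reusing ex_ante_utility from the Preliminaries sketch. The choice of max-norm neighborhood and the renormalization of sampled deviations back onto the simplex are our assumptions (renormalization can move a sample slightly outside the ϵ-ball).

```python
def is_local_ne(policies, mu0, V, U, eps=0.005, n_tests=200, rng=None):
    """Sampling-based check of Definition 3 (a sketch of the paper's test).

    For each sender j, sample policies within an eps-ball of pi_j and
    verify that no sampled deviation is profitable.
    """
    rng = rng or np.random.default_rng(0)
    for j in range(len(policies)):
        base = ex_ante_utility(j, policies, mu0, V, U)
        for _ in range(n_tests):
            noise = rng.uniform(-eps, eps, size=policies[j].shape)
            cand = np.clip(policies[j] + noise, 0.0, None)
            cand /= cand.sum(axis=1, keepdims=True)  # back onto the simplex
            trial = list(policies)
            trial[j] = cand
            if ex_ante_utility(j, trial, mu0, V, U) > base + 1e-9:
                return False                         # profitable deviation
    return True
```

Being sampling-based, the test can only certify the absence of profitable deviations among the drawn candidates, which is why the number of test samples K grows with the problem size.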
Representational capacity. In Fig. 4, we fix the number of senders to 2 and compare the approximation errors (Eq. 2) achieved by our method and the two baseline architectures. We show the influence of the number of signals, states, and actions in three subplots, respectively, presenting the average (solid lines) and the 95% confidence interval (shaded areas) of the approximation errors. In the first subplot, for example, we iterate over the number of signals and present results on all problem instances for each number of signals. The results suggest that our algorithm provides a more accurate approximation than ReLU and DeLU. The advantage of our method is consistently maintained across the entire range evaluated. It is also interesting to observe that the approximation error decreases for all three algorithms as the number of signals increases, but increases as the number of states or actions increases. This observation indicates that the multi-sender persuasion game becomes more challenging with fewer signals, aligning with existing theoretical results on persuasion with limited signals (Dughmi et al., 2016).

Equilibrium. In Fig. 2, we compare the equilibria derived from our method against those produced by the baselines. We run the verification process to ascertain whether the extra-gradient outcomes are indeed ϵ-local NE, and we present the mean and the 95% confidence interval of the sender utilities at the best solutions that satisfy the criteria. Notably, our findings indicate that for each of the two senders, the ex-ante utility achieved by our model consistently outperforms that of the baselines, exhibiting Pareto dominance. In Fig. 3, we provide additional evidence demonstrating that extra-gradient with our trained networks can significantly enhance the senders' utility from the initial starting points. Furthermore, DNL successfully generates solutions that surpass the full-revelation equilibria by a large margin. This improvement underscores the synergistic benefit of integrating the extra-gradient approach with our networks. Similar results can be observed for games with 4 senders; in Fig. 5, we show the welfare (the sum of senders' utilities) in these games.

5.3. Real-World Scenarios

Setting. In this section, we extend the evaluation of our method to the following real-world scenarios.

Scenario 1: Advertising of Quality. Prior economic research on multi-sender persuasion explored an advertising problem (Gentzkow & Kamenica, 2017a). In this problem, a total of n competing firms (senders) market their products to a single consumer. The product of each firm i can be of high quality (ω_i = 5) or low quality (ω_i = −5). The consumer wants to buy at most one product. The quality of the products is the state, known to the firms but not the consumer. By sending signals, i.e., verifiable advertisements about its product's quality, a firm tries to persuade the consumer into purchasing from it, which yields utility 1 for the firm. The firm's utility is 0 if the consumer doesn't purchase from it. The consumer faces n + 1 actions: purchasing from any one of the firms, or from none at all. The consumer's utility of purchasing from firm i is ω_i + ϵ_i, where ϵ_i is a shock (Gaussian-distributed zero-mean noise). If the consumer makes no purchase, their utility is 0. In our experiments, we set n to 7 and generate 20 instances randomly.
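One way to instantiate Scenario 1 as utility matrices compatible with the earlier simulator sketch. This is our reading of the setup; the uniform prior and the folding of one shock draw per (state, firm) pair into the receiver matrix are simplifying assumptions not stated in the text.

```python
def advertising_game(n=7, noise_scale=1.0, rng=None):
    """Utility matrices for Scenario 1 (our reading of the setup).

    States: quality vectors in {-5, +5}^n, so |Omega| = 2^n. Actions:
    buy from firm i (0..n-1) or buy nothing (action n). A firm earns 1
    if chosen, else 0; the consumer values firm i at omega_i plus a shock.
    """
    rng = rng or np.random.default_rng(0)
    states = list(itertools.product([-5, 5], repeat=n))
    n_states, n_actions = len(states), n + 1
    V = np.zeros((n_states, n_actions))
    U = [np.zeros((n_states, n_actions)) for _ in range(n)]
    for w, omega in enumerate(states):
        for i in range(n):
            V[w, i] = omega[i] + rng.normal(0.0, noise_scale)  # shock eps_i
            U[i][w, i] = 1.0   # firm i's utility if the consumer buys from it
        V[w, n] = 0.0          # buying nothing
    mu0 = np.full(n_states, 1.0 / n_states)  # assumed uniform prior
    return mu0, V, U
```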
Scenario 2: Advertising of Multiple Products. We make the previous advertising example more realistic by incorporating multiple products of different qualities and prices. Specifically, we consider the following problem. There are n firms (senders), where each firm i sells a product of price p_i and quality ω_i. The true state is the prices and qualities of all products. The consumer (receiver) has a partial observation of the state, as it has no access to the quality of the products. The receiver has n + 1 actions: buying a product from one of the firms, or buying nothing. The utility of firm i is p_i if the receiver buys from it, or −1 otherwise. The senders use signals to strategically reveal quality information to the receiver, trying to sway its purchase decision in their favor. The receiver wants to maximize its utility, which is ω_i − p_i + ϵ_i if purchasing product i, or 0 if buying nothing. Here ϵ_i is a shock defined in the same way as in Scenario 1. We test the case with n = 2 firms. Price p_i and quality ω_i are uniformly random integers in the ranges [1, 10] and [−8, 12], respectively.

Scenario 3: Uber or Lyft. In this last scenario, we move beyond advertising and consider the competition among real-world ride-hailing apps, and a single driver subscribed to both platforms. There are two senders, Uber and Lyft, who receive m and n orders from users, respectively. Each order has four features: (1) the price charged to the user; (2) the payment to the driver; (3) the true utility to the app, which is the price minus the payment; and (4) the true cost for the driver, which is known to the app and is influenced by many factors, such as the user rating indicating whether they are friendly, the expected travel time and distance, the expected waiting time, etc. The true state is the joint feature vector of all orders. Feature (4), the true cost to the driver, is invisible to the driver when they must decide the pickup. Uber and Lyft can send signals to strategically reveal this information in order to persuade the driver into picking up their orders. The driver has m + n + 1 actions: picking up one of the m + n orders, or doing nothing. The utility of the driver is the price minus the true cost of the selected order, or −1 if they don't select any order. In our experiments, we set m and n to 4 and the number of signals to m + 1.

Table 1. The social welfare (avg ± 95% confidence interval) at the ϵ-local equilibria found by our method and baseline networks.

| Scenario | ReLU | DeLU | Ours |
|---|---|---|---|
| 1 | 0.498 ± 0.004 | 0.599 ± 0.003 | 0.699 ± 0.003 |
| 2 | 0.407 ± 0.176 | 0.467 ± 0.179 | 0.526 ± 0.004 |
| 3 | 3.216 ± 0.790 | 3.783 ± 0.894 | 4.344 ± 0.885 |

Results. We test the performance of our method and the baseline neural network structures. Table 1 shows the social welfare of the senders at the ϵ-local equilibria found by the different methods; means and 95% confidence intervals over 20 random instances are presented. We observe that our method consistently outperforms the other methods, indicating that our discontinuous, piecewise non-linear network structure allows us to effectively tackle these richer settings that prior literature could not.

6. Discussion

We provide a comprehensive computational study of multi-sender Bayesian persuasion, a model for a wide range of real-world phenomena. The complex interplay of simultaneous sender actions and sequential receiver responses makes this game challenging. Our work formalizes this challenge by proving computational hardness results for both best response and equilibrium computation.
Relaxing the equilibrium concept, however, offers hope, even without complete information. We propose a novel class of neural networks that can approximate the non-linear, discontinuous utilities in this game; paired with the extra-gradient algorithm, it is highly effective at finding local equilibria. Indeed, our network may be of broader interest for many games with discontinuous utilities, as it facilitates any downstream optimization algorithm. More broadly, BP is part of the principal-agent model of economics, which also includes problems like contract design and Stackelberg games. Insights developed here can be instrumental to multi-principal variants of those problems, which, despite their importance, have long eluded robust computational solutions.

Acknowledgements

We thank the reviewers for the helpful comments. This work is supported by NSF awards No. IIS-2147187 and No. CCF-2303372, Army Research Office Award No. W911NF-23-1-0030, and Office of Naval Research Award No. N00014-23-1-2802.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Abbott, T., Kane, D., and Valiant, P. On the complexity of two-player win-lose games. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05), pp. 113–122, Pittsburgh, PA, USA, 2005. IEEE. doi: 10.1109/SFCS.2005.59. URL http://ieeexplore.ieee.org/document/1530706/.

Acemoglu, D., Ozdaglar, A., and Siderius, J. A model of online misinformation. Technical report, National Bureau of Economic Research, 2021.

Agarap, A. F. Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375, 2018.

Aussel, D., Brotcorne, L., Lepaul, S., and von Niederhäusern, L. A trilevel model for best response in energy demand-side management. European Journal of Operational Research, 281(2):299–315, 2020.

Azizian, W., Mitliagkas, I., Lacoste-Julien, S., and Gidel, G. A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games. In International Conference on Artificial Intelligence and Statistics, pp. 2863–2873. PMLR, 2020.

Badanidiyuru, A., Bhawalkar, K., and Xu, H. Targeting and signaling in ad auctions. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2545–2563. SIAM, 2018.

Bai, Y., Jin, C., Wang, H., and Xiong, C. Sample-efficient learning of Stackelberg equilibria in general-sum games. Advances in Neural Information Processing Systems, 34:25799–25811, 2021.

Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pp. 354–363. PMLR, 2018.

Basilico, N., Coniglio, S., and Gatti, N. Methods for finding leader-follower equilibria with multiple followers. arXiv preprint arXiv:1707.02174, 2017.

Bergemann, D. and Morris, S. Information design: A unified perspective. Journal of Economic Literature, 57(1):44–95, March 2019. doi: 10.1257/jel.20181489. URL https://pubs.aeaweb.org/doi/10.1257/jel.20181489.

Bichler, M., Fichtl, M., Heidekrüger, S., Kohring, N., and Sutterer, P. Learning equilibria in symmetric auction games using artificial neural networks. Nature Machine Intelligence, 3(8):687–695, 2021.

Böhmer, W., Kurin, V., and Whiteson, S. Deep coordination graphs.
In Proceedings of the 37th International Conference on Machine Learning, 2020.

Bowling, M. Convergence and no-regret in multiagent learning. Advances in Neural Information Processing Systems, 17, 2004.

Brero, G., Eden, A., Chakrabarti, D., Gerstgrasser, M., Li, V., and Parkes, D. C. Learning Stackelberg equilibria and applications to economic design games. arXiv preprint arXiv:2210.03852, 2022.

Bro Miltersen, P. and Sheffet, O. Send mixed signals: earn more, work less. In Proceedings of the 13th ACM Conference on Electronic Commerce, pp. 234–247, 2012.

Calvete, H. I. and Galé, C. Linear bilevel multi-follower programming with independent followers. Journal of Global Optimization, 39(3):409–417, 2007.

Candogan, O. and Drakopoulos, K. Optimal signaling of content accuracy: Engagement vs. misinformation. Operations Research, 68(2):497–515, 2020.

Castiglioni, M., Celli, A., Marchesi, A., and Gatti, N. Online Bayesian persuasion. Advances in Neural Information Processing Systems, 33:16188–16198, 2020.

Chen, X., Deng, X., and Teng, S.-H. Settling the complexity of computing two-player Nash equilibria. Journal of the ACM, 56(3):1–57, May 2009. doi: 10.1145/1516512.1516516. URL https://dl.acm.org/doi/10.1145/1516512.1516516.

Cheng, C., Zhu, Z., Xin, B., and Chen, C. A multi-agent reinforcement learning algorithm based on Stackelberg game. In 2017 6th Data Driven Control and Learning Systems (DDCLS), pp. 727–732. IEEE, 2017.

Christianos, F., Schäfer, L., and Albrecht, S. Shared experience actor-critic for multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 33:10707–10717, 2020.

Conitzer, V. and Sandholm, T. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce, pp. 82–90, 2006.

Croce, F., Andriushchenko, M., and Hein, M. Provable robustness of ReLU networks via maximization of linear regions. In AISTATS 2019, pp. 2057–2066, 2019.

Ding, B., Feng, Y., Ho, C.-J., Tang, W., and Xu, H. Competitive information design for Pandora's box. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 353–381. SIAM, 2023.

Dong, H., Wang, T., Liu, J., and Zhang, C. Low-rank modular reinforcement learning via muscle synergy. Advances in Neural Information Processing Systems, 35:19861–19873, 2022.

Dong, H., Zhang, J., Wang, T., and Zhang, C. Symmetry-aware robot design with structured subgroups. In International Conference on Machine Learning, pp. 8334–8355. PMLR, 2023.

Dughmi, S. Algorithmic information structure design: a survey. ACM SIGecom Exchanges, 15(2):2–24, February 2017. doi: 10.1145/3055589.3055591. URL https://dl.acm.org/doi/10.1145/3055589.3055591.

Dughmi, S. and Xu, H. Algorithmic Bayesian persuasion. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 412–425, 2016.

Dughmi, S. and Xu, H. Algorithmic persuasion with no externalities. In Proceedings of the 2017 ACM Conference on Economics and Computation, pp. 351–368, Cambridge, Massachusetts, USA, June 2017. ACM. doi: 10.1145/3033274.3085152. URL https://dl.acm.org/doi/10.1145/3033274.3085152.

Dughmi, S., Kempe, D., and Qiang, R. Persuasion with limited communication. In Proceedings of the 2016 ACM Conference on Economics and Computation, pp. 663–680, 2016.

Dütting, P., Feng, Z., Narasimhan, H., Parkes, D. C., and Ravindranath, S. S.
Optimal auctions through deep learning: Advances in differentiable economics. Journal of the ACM, 2023.

Emek, Y., Feldman, M., Gamzu, I., Paes Leme, R., and Tennenholtz, M. Signaling schemes for revenue maximization. ACM Transactions on Economics and Computation (TEAC), 2(2):1–19, 2014.

Fiez, T., Chasnov, B., and Ratliff, L. Implicit learning dynamics in Stackelberg games: Equilibria characterization, convergence analysis, and empirical study. In International Conference on Machine Learning, pp. 3133–3144. PMLR, 2020.

Fiez, T., Ratliff, L., Mazumdar, E., Faulkner, E., and Narang, A. Global convergence to local minmax equilibrium in classes of nonconvex zero-sum games. Advances in Neural Information Processing Systems, 34:29049–29063, 2021.

Gan, J., Elkind, E., Kraus, S., and Wooldridge, M. Mechanism design for defense coordination in security games. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp. 402–410, 2020.

Gan, J., Han, M., Wu, J., and Xu, H. Generalized principal-agency: Contracts, information, games and beyond, 2024.

Gentzkow, M. and Kamenica, E. Bayesian persuasion with multiple senders and rich signal spaces. Games and Economic Behavior, 104:411–429, 2017a.

Gentzkow, M. and Kamenica, E. Bayesian persuasion with multiple senders and rich signal spaces. Games and Economic Behavior, 104:411–429, July 2017b. doi: 10.1016/j.geb.2017.05.004. URL https://linkinghub.elsevier.com/retrieve/pii/S0899825617300817.

Gerstgrasser, M. and Parkes, D. C. Oracles & followers: Stackelberg equilibria in deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 11213–11236. PMLR, 2023.

Goktas, D., Parkes, D. C., Gemp, I., Marris, L., Piliouras, G., Elie, R., Lever, G., and Tacchetti, A. Generative adversarial equilibrium solvers. arXiv preprint arXiv:2302.06607, 2023.

Guan, D.-J. Generalized Gray codes with applications. In Proc. Natl. Sci. Counc. Repub. China Part A Phys. Sci. Eng., volume 22, pp. 841–848. Citeseer, 1998.

Guestrin, C., Koller, D., and Parr, R. Multiagent planning with factored MDPs. In Advances in Neural Information Processing Systems, pp. 1523–1530, 2002a.

Guestrin, C., Lagoudakis, M., and Parr, R. Coordinated reinforcement learning. In ICML, volume 2, pp. 227–234. Citeseer, 2002b.

Haghtalab, N., Lykouris, T., Nietert, S., and Wei, A. Learning in Stackelberg games with non-myopic agents. In Proceedings of the 23rd ACM Conference on Economics and Computation, pp. 917–918, 2022.

Hu, K., Ren, Z., and Yang, J. Principal-agent problem with multiple principals. Stochastics, 95(5):878–905, 2023.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. In International Conference on Machine Learning, pp. 1627–1635. PMLR, 2017.

Jelassi, S., Domingo-Enrich, C., Scieur, D., Mensch, A., and Bruna, J. Extragradient with player sampling for faster Nash equilibrium finding. In Proceedings of the International Conference on Machine Learning, 2020.

Jiang, A. X., Procaccia, A. D., Qian, Y., Shah, N., and Tambe, M. Defender (mis)coordination in security games. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

Jiang, J., Dun, C., Huang, T., and Lu, Z. Graph convolutional reinforcement learning. In International Conference on Learning Representations, 2019.

Jin, C., Netrapalli, P., and Jordan, M.
What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pp. 4880–4889. PMLR, 2020.

Kamenica, E. Bayesian persuasion and information design. Annual Review of Economics, 11:249–272, 2019.

Kamenica, E. and Gentzkow, M. Bayesian persuasion. American Economic Review, 101(6):2590–2615, 2011.

Kang, Y., Wang, T., and de Melo, G. Incorporating pragmatic reasoning communication into emergent language. Advances in Neural Information Processing Systems, 33:10348–10359, 2020.

Kang, Y., Wang, T., Yang, Q., Wu, X., and Zhang, C. Non-linear coordination graphs. Advances in Neural Information Processing Systems, 35:25655–25666, 2022.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Korpelevich, G. M. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.

Kuba, J. G., Chen, R., Wen, M., Wen, Y., Sun, F., Wang, J., and Yang, Y. Trust region policy optimisation in multi-agent reinforcement learning. In International Conference on Learning Representations, 2021.

Li, C., Wang, T., Wu, C., Zhao, Q., Yang, J., and Zhang, C. Celebrating diversity in shared multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 34:3991–4002, 2021.

Li, F. and Norman, P. Sequential persuasion. Theoretical Economics, 16(2):639–675, 2021.

Mansour, Y., Slivkins, A., and Syrgkanis, V. Bayesian incentive-compatible bandit exploration. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pp. 565–582, 2015.

Mansour, Y., Slivkins, A., Syrgkanis, V., and Wu, Z. S. Bayesian exploration: Incentivizing exploration in Bayesian games. arXiv preprint arXiv:1602.07570, 2016.

Myerson, R. B. Optimal coordination mechanisms in generalized principal-agent problems. Journal of Mathematical Economics, 10(1):67–81, 1982.

Naghizadeh, P. and Liu, M. Voluntary participation in cyber-insurance markets. In Workshop on the Economics of Information Security (WEIS), 2014.

Peng, B., Rashid, T., Schroeder de Witt, C., Kamienny, P.-A., Torr, P., Böhmer, W., and Whiteson, S. FACMAC: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34:12208–12221, 2021.

Pfau, D. and Vinyals, O. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016.

Rashid, T., Samvelyan, M., Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4292–4301, 2018.

Ravindran, D. and Cui, Z. Competing persuaders in zero-sum games, June 2022. URL http://arxiv.org/abs/2008.08517. arXiv:2008.08517 [econ].

Scarselli, F. and Tsoi, A. C. Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Networks, 11(1):15–37, 1998.

Shi, Z., Yu, R., Wang, X., Wang, R., Zhang, Y., Lai, H., and An, B. Learning expensive coordination: An event-based deep RL approach. In International Conference on Learning Representations, 2019.

Shu, T. and Tian, Y. M3RL: Mind-aware multi-agent management reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

Tharakunnel, K. and Bhattacharyya, S. Leader-follower semi-Markov decision problems: theoretical framework and approximate solution.
In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 111–118. IEEE, 2007.

Wang, K., Xu, L., Perrault, A., Reiter, M. K., and Tambe, M. Coordinating followers to reach better equilibria: End-to-end gradient descent for Stackelberg games. arXiv preprint arXiv:2106.03278, 2021a.

Wang, T., Wang, J., Wu, Y., and Zhang, C. Influence-based multi-agent exploration. In International Conference on Learning Representations, 2019a.

Wang, T., Wang, J., Zheng, C., and Zhang, C. Learning nearly decomposable value functions via communication minimization. In International Conference on Learning Representations, 2019b.

Wang, T., Dong, H., Lesser, V., and Zhang, C. ROMA: Multi-agent reinforcement learning with emergent roles. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Wang, T., Gupta, T., Mahajan, A., Peng, B., Whiteson, S., and Zhang, C. RODE: Learning roles to decompose multi-agent tasks. In Proceedings of the International Conference on Learning Representations (ICLR), 2021b.

Wang, T., Zeng, L., Dong, W., Yang, Q., Yu, Y., and Zhang, C. Context-aware sparse deep coordination graphs. In International Conference on Learning Representations, 2021c.

Wang, T., Dütting, P., Ivanov, D., Talgam-Cohen, I., and Parkes, D. C. Deep contract design via discontinuous piecewise affine neural networks. arXiv preprint arXiv:2307.02318, 2023.

Wang, Y., Han, B., Wang, T., Dong, H., and Zhang, C. DOP: Off-policy multi-agent decomposed policy gradients. In Proceedings of the International Conference on Learning Representations (ICLR), 2021d.

Waterman, R. W. and Meier, K. J. Principal-agent models: an expansion? Journal of Public Administration Research and Theory, 8(2):173–202, 1998.

Wen, M., Kuba, J., Lin, R., Zhang, W., Wen, Y., Wang, J., and Yang, Y. Multi-agent reinforcement learning is a sequence modeling problem. Advances in Neural Information Processing Systems, 35:16509–16521, 2022.

Wilf, H. S. Combinatorial Algorithms: An Update. SIAM, 1989.

Wu, J., Zhang, Z., Feng, Z., Wang, Z., Yang, Z., Jordan, M. I., and Xu, H. Sequential information design: Markov persuasion process and its efficient reinforcement learning. arXiv preprint arXiv:2202.10678, 2022.

Yang, Q., Dong, W., Ren, Z., Wang, J., Wang, T., and Zhang, C. Self-organized polynomial-time coordination graphs. In International Conference on Machine Learning, pp. 24963–24979. PMLR, 2022.

Yu, C., Velu, A., Vinitsky, E., Gao, J., Wang, Y., Bayen, A., and Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611–24624, 2022.

Zhang, E., Zhao, S., Wang, T., Hossain, S., Gasztowtt, H., Zheng, S., Parkes, D. C., Tambe, M., and Chen, Y. Social environment design. arXiv preprint arXiv:2402.14090, 2024.

Zhang, H., Xiao, Y., Cai, L. X., Niyato, D., Song, L., and Han, Z. A multi-leader multi-follower Stackelberg game for resource management in LTE unlicensed. IEEE Transactions on Wireless Communications, 16(1):348–361, 2016.

Zheng, S., Trott, A., Srinivasa, S., Parkes, D. C., and Socher, R. The AI Economist: Taxation policy design via two-level deep multiagent reinforcement learning, 2022. URL https://www.science.org/doi/abs/10.1126/sciadv.abk2607.

Zhu, B., Bates, S., Yang, Z., Wang, Y., Jiao, J., and Jordan, M. I. The sample complexity of online contract design. In EC '23, pp. 1188. ACM, 2023. doi: 10.1145/3580507.3597673. URL https://doi.org/10.1145/3580507.3597673.
A.1. Proof of Proposition 1

Proof. For a joint scheme $\pi$, each signal realization $s$ induces a posterior belief $\mu_s$, under which the receiver takes the optimal action $a^*(\mu_s)$. We can equivalently write the function $a^*$ in terms of $\pi(s|\cdot)$, and note that $a^*(\pi(s|\cdot)) \in A$. When the signaling scheme changes sufficiently that new actions become optimal for a given realized posterior $\mu_s$, the mapping $a^*(\pi(s|\cdot))$ changes accordingly. Thus, the function $a^*(\pi(s|\cdot))$ is piece-wise constant, with the boundaries between pieces representing these changes in the mapping. The utility of sender $i$ is given by:
$$\sum_{s \in S^n} \sum_{\omega \in \Omega} \mu_0(\omega)\, u_i\big(\omega, a^*(\pi(s|\cdot))\big)\, \pi(s|\omega) \qquad (3)$$
$$= \sum_{s \in S^n} \sum_{\omega \in \Omega} \mu_0(\omega)\, u_i\big(\omega, a^*(\pi(s|\cdot))\big) \prod_{j=1}^{n} \pi_j(s_j|\omega), \qquad (4)$$
where we note that since $u_i$ is essentially indexing a matrix, $u_i(\omega, a^*(\pi(s|\cdot)))$ is piece-wise constant with the same boundaries as $a^*(\pi(s|\cdot))$. It is evident from the last expression that the utility is piece-wise bi-linear in $(\pi_1, \ldots, \pi_n)$ and, upon fixing $\pi_{-i}$, piece-wise linear in $\pi_i$.
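To make the objects in this proof concrete, the following minimal numpy sketch evaluates a sender's expected utility directly from equation (4). It is our illustration rather than the paper's implementation: the array layout, the helper name `sender_utility`, and the sender-favoring tie-break (the convention adopted in the proof of Proposition 2 below) are all assumptions.

```python
# A minimal sketch (ours) of the sender-utility formula in equation (4).
import itertools
import numpy as np

def sender_utility(i, schemes, mu0, v, u):
    """Expected utility of sender i under a joint signaling profile.

    schemes: list of n arrays, schemes[j][w, s] = pi_j(s | omega = w)
    mu0:     prior over states, shape (W,)
    v:       receiver utility, shape (W, A)
    u:       sender utilities, shape (n, W, A)
    """
    n = len(schemes)
    W, S = schemes[0].shape
    total = 0.0
    for s in itertools.product(range(S), repeat=n):       # joint signals s in S^n
        # P(omega, s) = mu0(omega) * prod_j pi_j(s_j | omega)
        joint = mu0 * np.prod([schemes[j][:, s[j]] for j in range(n)], axis=0)
        p_s = joint.sum()
        if p_s == 0:
            continue
        post = joint / p_s                                 # posterior mu_s
        best = post @ v                                    # receiver's expected utilities
        ties = np.flatnonzero(best >= best.max() - 1e-12)  # receiver-optimal actions
        a = ties[np.argmax((post @ u[i])[ties])]           # tie-break favoring sender i
        total += p_s * post @ u[i][:, a]                   # adds sum_w P(w, s) u_i(w, a)
    return total
```

Sweeping one sender's scheme along a line while holding the others fixed traces out exactly the piece-wise linear, discontinuous utility landscape described above, since the receiver's argmax changes abruptly at the piece boundaries.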
A.2. Proof of Proposition 2

Proof. Note that $\pi_{-i}$ refers to the (fixed) signaling schemes of the other senders; the optimization variables are $\pi_i$ and $y_{s,a}$. Next, observe that if $y_{s,a} \in \{0,1\}$, then this optimization program can be interpreted as follows: $\pi_i$ denotes the signaling scheme of sender $i$, and $y_{s,a}$ denotes whether action $a$ is the optimal action for the receiver upon receiving the joint signal $s$ and computing the corresponding posterior belief. The sum constraint on $y_{s,a}$ ensures that $[y_{s,1}, \ldots, y_{s,|A|}]$ is a one-hot vector. To ensure that the choice of $y_{s,a}$ is indeed correct, we need an incentive-compatibility constraint. That is, we require that the following holds for the posterior induced by any joint signal $s$ and any action $a'$:
$$\sum_{a \in A} \Big[\sum_{\omega \in \Omega} P(\omega|s)\big(v(\omega, a) - v(\omega, a')\big)\Big]\, y_{s,a} \ \ge\ 0.$$
By Bayes' rule, $P(\omega|s) = \frac{\pi(s|\omega)\mu_0(\omega)}{P(s)}$, and since $P(s)$ is constant across the whole sum, we can multiply both sides by $P(s)$ and arrive at the first constraint in the optimization program above. Since this constraint enforces that the choice of receiver action at each posterior is indeed correct, the objective simply maximizes sender $i$'s ex-ante expected utility.

The only difference between the presented optimization problem and the best response sketched above is that the variables $y_{s,a}$ are now relaxed to the continuous range $[0,1]$. We now show that this relaxation does not change the optimal solution; that is, an optimal solution to the binary-constrained setting is also an optimal solution to the relaxed continuous setting. Fix any signaling scheme $\pi_i$ and any joint signal realization $s$. Let $a^*_s$ denote a best action for the receiver at the posterior induced by $s$ under the schemes $(\pi_i, \pi_{-i})$. Then we can rewrite the incentive-compatibility constraint (first constraint), instantiated at $a' = a^*_s$, as follows (for brevity, we write $\pi(s|\omega) = \pi_i(s_i|\omega)\,\pi_{-i}(s_{-i}|\omega)$):
$$\sum_{a \in A} \Big[\sum_{\omega \in \Omega} \mu_0(\omega)\pi(s|\omega)\, v(\omega, a) - \sum_{\omega \in \Omega} \mu_0(\omega)\pi(s|\omega)\, v(\omega, a^*_s)\Big]\, y_{s,a} \ \ge\ 0.$$
Note that the first summation term inside the bracket is proportional to the receiver's expected utility of action $a$ under the posterior induced by $s$, while the second is proportional to the expected utility of action $a^*_s$ under this same posterior. If $a^*_s$ is the unique action that maximizes the receiver's expected utility at this posterior, then every bracket with $a \neq a^*_s$ is strictly negative, and the only way to satisfy the constraint is to set $y_{s,a^*_s} = 1$ and $0$ for all others. If multiple actions are optimal for the receiver at this belief, then let $a^*_s$ be the action among these that is most preferred by sender $i$ (if there is a tie here, pick arbitrarily). Thus, setting the corresponding $y_{s,a^*_s} = 1$ and $0$ for the rest satisfies the constraint while also maximizing the objective. It follows that relaxing the domain of $y_{s,a}$ does not change the optimal solution, since optima still occur at the endpoints $0$ or $1$, and hence the continuous bi-linear optimization problem above corresponds to sender $i$'s best response.

A.3. Proof of Theorem 3

Figure 6. Public persuasion with $k$ receivers, each with binary actions.

The proof uses a reduction from the following problem, called public persuasion (Dughmi & Xu, 2017):

Definition 5. A public persuasion (Pub) problem (with multiple receivers with binary actions) is described by a tuple $\langle k, \Omega, \mu, \{v_j(\omega), u_j(\omega)\}_{j \in [k], \omega \in \Omega} \rangle$, where:
- There are $k$ receivers, denoted by $[k] = \{1, \ldots, k\}$, each having two actions $\{+, -\}$.
- $\mu \in \Delta(\Omega)$ is a prior distribution over states $\omega \in \Omega$.
- $v_j(\omega, +), v_j(\omega, -) \in [0,1]$ are the utilities of receiver $j \in [k]$ when taking actions $+$, $-$, respectively, and the state is $\omega$. Let $v_j(\omega) = v_j(\omega,+) - v_j(\omega,-) \in [-1,1]$ be the utility difference.
- $u_j(\omega,+), u_j(\omega,-) \in [0,1]$ are the utilities of the sender when receiver $j \in [k]$ takes action $+$, $-$, respectively.

Let $\pi : \Omega \to \Delta(S)$ be a signaling scheme of the sender, and let $x_s \in \Delta(\Omega)$ denote the posterior distribution over states induced by signal $s \in S$:
$$x_s(\omega) = \frac{\mu(\omega)\pi(s|\omega)}{\pi(s)}, \quad \text{where } \pi(s) = \sum_{\omega \in \Omega} \mu(\omega)\pi(s|\omega). \qquad (5)$$
In the public persuasion problem, given an induced posterior $x_s$, each receiver $j \in [k]$ is willing to take action $+$ if and only if $\sum_{\omega \in \Omega} x_s(\omega)v_j(\omega) \ge 0$. Let $a^*_j(x_s) \in \{+,-\}$ denote the action taken by receiver $j \in [k]$ given posterior $x_s$:
$$a^*_j(x_s) = \begin{cases} + & \text{if } \sum_{\omega \in \Omega} x_s(\omega)v_j(\omega) \ge 0, \\ - & \text{if } \sum_{\omega \in \Omega} x_s(\omega)v_j(\omega) < 0. \end{cases} \qquad (6)$$
The sender's (expected) utility is the average utility obtained across all $k$ receivers:
$$u^{\mathrm{Pub}}(\pi) = \sum_{s \in S} \pi(s)\, \frac{1}{k}\sum_{j=1}^{k} \sum_{\omega \in \Omega} x_s(\omega)\, u_j\big(\omega, a^*_j(x_s)\big). \qquad (7)$$
The goal is to find a signaling scheme $\pi$ maximizing $u^{\mathrm{Pub}}(\pi)$.

Figure 7. The multi-sender persuasion problem reduced from public persuasion. $k$ additional states are added, with the sole receiver having $2k+1$ actions. The receiver's and best-responding sender's utilities are chosen such that, for each of the $k$ possible signal realizations of the non-best-responding sender, the single receiver's plausible actions mimic those of the corresponding receiver in public persuasion.

Theorem 6 (Dughmi & Xu (2017)). For any constant $c \in [0, \frac{1}{9}]$, it is NP-hard to solve, within additive approximation error $c$, public persuasion problems with $|\Omega| = k$ states and uniform prior $\mu(\omega) = \frac{1}{k}$.

We reduce the public persuasion problem to the best-response problem in multi-sender persuasion, which proves that the latter problem is NP-hard. Given the public persuasion problem $\langle k, \Omega, \mu, \{v_j(\omega), u_j(\omega)\}_{j \in [k], \omega \in \Omega} \rangle$, we construct the following best-response problem with two senders, where we fix sender 2's signaling scheme $\pi_2$ and find sender 1's best response. Let $C, N > 0$ and $M \ge \frac{N}{N-1}$ be large numbers to be chosen later.
- There are $|\Omega| + k$ states, denoted by $\bar\Omega = \Omega \cup \{\bar\omega_1, \ldots, \bar\omega_k\}$, with prior $\bar\mu(\omega) = \frac{\mu(\omega)}{2}$ for $\omega \in \Omega$ and $\bar\mu(\bar\omega_j) = \frac{1}{2k}$ for $j = 1, \ldots, k$.
- The (single) receiver has $2k+1$ actions, denoted by $A = \{a_{1+}, a_{1-}, \ldots, a_{k+}, a_{k-}\} \cup \{a_\perp\}$.
- The receiver's utility is as follows. For any state $\omega \in \Omega$, let
$$v(\omega, a_{j+}) = v_j(\omega), \qquad v(\omega, a_{j-}) = 0, \qquad v(\omega, a_\perp) = N.$$
In words, for any state $\omega \in \Omega$, the receiver's two actions $a_{j+}, a_{j-}$ mimic receiver $j$'s actions $+$, $-$ in the public persuasion problem, and $a_\perp$ is very attractive to the receiver. For any state $\bar\omega_j$, $j \in [k]$, let
$$v(\bar\omega_j, a_{\ell+}) = v(\bar\omega_j, a_{\ell-}) = -M \ \text{ for } \ell \neq j, \qquad v(\bar\omega_j, a_{j+}) = v(\bar\omega_j, a_{j-}) = 0, \qquad v(\bar\omega_j, a_\perp) = -N.$$
In words, under state $\bar\omega_j$, the receiver is extremely unwilling to take actions other than $a_{j+}, a_{j-}$, and $a_\perp$ is very harmful to the receiver.
- Sender 1's utility $u_1(\cdot, \cdot)$ is as follows. For any state $\omega \in \Omega$,
$$u_1(\omega, a_{j+}) = u_j(\omega, +) \ \ \forall j \in [k], \qquad u_1(\omega, a_{j-}) = u_j(\omega, -) \ \ \forall j \in [k], \qquad u_1(\omega, a_\perp) = -C.$$
In words, when the receiver takes action $a_{j\pm}$, the sender obtains the same utility as if receiver $j$ takes action $\pm$ in the public persuasion problem, but the sender suffers a large loss if the receiver takes $a_\perp$. For any state $\bar\omega_j$, $j \in [k]$,
$$u_1(\bar\omega_j, a_{\ell+}) = u_1(\bar\omega_j, a_{\ell-}) = 0 \ \ \forall \ell \in [k], \qquad u_1(\bar\omega_j, a_\perp) = -C.$$
- Sender 2's signaling scheme $\pi_2$ is the following: it sends $k$ possible signals $\{t_1, \ldots, t_k\}$ with probabilities
$$\pi_2(t_j|\omega) = \frac{1}{k} \ \ \forall j \in [k],\ \omega \in \Omega, \qquad \pi_2(t_j|\bar\omega_j) = 1 \ \ \forall j \in [k].$$

We sketch both the public persuasion framework and the equivalent multi-sender construction outlined above in Figures 6 and 7.

A.3.1. USEFUL CLAIMS REGARDING THE RECEIVER'S BEHAVIOR

Before proving Theorem 3, we present some useful claims regarding the receiver's best-action behavior. First, we characterize the receiver's expected utilities when taking different actions in the multi-sender persuasion problem.

Claim 1. Let $x \in \Delta(\bar\Omega)$ be a distribution on the enlarged state space $\bar\Omega$, and suppose sender 2 sends signal $t_j$. Then the receiver's expected utilities of taking the different actions $a \in A$, denoted $v(x, t_j, a)$, are:
- $v(x, t_j, a_{j+}) = \frac{1}{k}\sum_{\omega \in \Omega} x(\omega)v_j(\omega)$;
- $v(x, t_j, a_{j-}) = 0$;
- $v(x, t_j, a_{\ell+}) = \frac{1}{k}\sum_{\omega \in \Omega} x(\omega)v_\ell(\omega) - x(\bar\omega_j)M$ for $\ell \neq j$;
- $v(x, t_j, a_{\ell-}) = -x(\bar\omega_j)M$ for $\ell \neq j$;
- $v(x, t_j, a_\perp) = N\big(\frac{1}{k}\sum_{\omega \in \Omega} x(\omega) - x(\bar\omega_j)\big)$.

Proof. For any $a \in A$, by definition,
$$v(x, t_j, a) = \sum_{\omega \in \bar\Omega} x(\omega)\pi_2(t_j|\omega)v(\omega, a) = \sum_{\omega \in \Omega} x(\omega)\frac{1}{k}v(\omega, a) + \sum_{\ell=1}^{k} x(\bar\omega_\ell)\pi_2(t_j|\bar\omega_\ell)v(\bar\omega_\ell, a) = \frac{1}{k}\sum_{\omega \in \Omega} x(\omega)v(\omega, a) + x(\bar\omega_j)v(\bar\omega_j, a).$$
Plugging in the definitions of the utilities $v(\omega, a)$ and $v(\bar\omega_j, a)$ for the different $a$ proves the claim.

As corollaries of the above claim, we have some guarantees when the receiver takes a best action:

Claim 2. Given belief $x \in \Delta(\bar\Omega)$ and sender 2's signal $t_j$, if the receiver does not take $a_\perp$ as the best action, then it must be that $\frac{1}{k}\sum_{\omega \in \Omega} x(\omega) \le \frac{N}{N-1}\, x(\bar\omega_j)$.

Proof. If $\frac{1}{k}\sum_{\omega \in \Omega} x(\omega) > \frac{N}{N-1}\, x(\bar\omega_j)$, then by Claim 1 the receiver's utility of taking action $a_\perp$ is
$$v(x, t_j, a_\perp) = N\Big(\frac{1}{k}\sum_{\omega \in \Omega} x(\omega) - x(\bar\omega_j)\Big) > N\Big(\frac{1}{k}\sum_{\omega \in \Omega} x(\omega) - \frac{N-1}{N}\cdot\frac{1}{k}\sum_{\omega \in \Omega} x(\omega)\Big) = \frac{1}{k}\sum_{\omega \in \Omega} x(\omega) \ \ge\ v(x, t_j, a)$$
for any other action $a \neq a_\perp$ (whose expected utility is at most $\frac{1}{k}\sum_{\omega\in\Omega}x(\omega)$, since $v(\omega, a) \le 1$ on $\Omega$ and $v(\bar\omega_j, a) \le 0$). So the receiver should take $a_\perp$, a contradiction.

Claim 3. Given belief $x \in \Delta(\bar\Omega)$ and sender 2's signal $t_j$, if the receiver is unwilling to take $a_\perp$, then the receiver's utility of taking $a_{\ell+}$ for $\ell \neq j$ satisfies $v(x, t_j, a_{\ell+}) \le 0$. So, we can assume that the receiver will take $a_{j+}$ or $a_{j-}$. (Tie-breaking does not affect our conclusion.)

Proof. According to Claim 2, if the receiver is unwilling to take $a_\perp$, then $\frac{1}{k}\sum_{\omega \in \Omega} x(\omega) \le \frac{N}{N-1}\, x(\bar\omega_j)$. This implies that the receiver's utility of taking $a_{\ell+}$ is, by Claim 1,
$$v(x, t_j, a_{\ell+}) = \frac{1}{k}\sum_{\omega \in \Omega} x(\omega)v_\ell(\omega) - x(\bar\omega_j)M \ \le\ \frac{N}{N-1}\, x(\bar\omega_j) - x(\bar\omega_j)M \ =\ x(\bar\omega_j)\Big(\frac{N}{N-1} - M\Big) \ \le\ 0,$$
under the assumptions $v_\ell(\omega) \le 1$ and $M \ge \frac{N}{N-1}$.

Claim 4. Let $x \in \Delta(\bar\Omega)$ be a belief on $\bar\Omega$, and let $\tilde x \in \Delta(\Omega)$ be the conditional belief on $\Omega$: $\tilde x(\omega) = \frac{x(\omega)}{\sum_{\omega' \in \Omega} x(\omega')}$ for $\omega \in \Omega$. Fix any $j \in [k]$, and suppose the receiver does not take action $a_\perp$ under signal $t_j$ in the multi-sender persuasion problem. Then the receiver takes action $a_{j+}$ (resp. $a_{j-}$) if and only if receiver $j$ in the public persuasion problem takes action $+$ (resp. $-$) under belief $\tilde x$.

Proof. By Claim 3, the receiver in the multi-sender problem must take action $a_{j+}$ or $a_{j-}$ if not taking $a_\perp$. The receiver is willing to take $a_{j+}$ if and only if, by Claim 1,
$$\frac{1}{k}\sum_{\omega \in \Omega} x(\omega)v_j(\omega) \ge 0 \iff \sum_{\omega \in \Omega} x(\omega)v_j(\omega) \ge 0 \iff \sum_{\omega \in \Omega} \tilde x(\omega)v_j(\omega) \ge 0,$$
which means that receiver $j$ in the public persuasion problem is willing to take action $+$ under belief $\tilde x$ (see (6)).
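The reduction can also be stated operationally. The sketch below is ours, with illustrative variable names and array layout; it assembles the enlarged state space, the receiver's and sender 1's utilities, and sender 2's fixed scheme from a public persuasion instance, instantiating $C = k^5$ and $N = k^4$ as chosen at the end of the proof.

```python
# A sketch (ours) of the reduction in the proof of Theorem 3.
import numpy as np

def reduce_instance(mu, v, u):
    """mu: (W,) prior; v, u: (k, W, 2) receiver/sender utilities for (+, -)."""
    k, W, _ = v.shape
    C, N = float(k**5), float(k**4)
    M = N / (N - 1) + 1.0                  # any M >= N / (N - 1) works
    Wbar = W + k                           # enlarged state space Omega-bar
    mubar = np.concatenate([mu / 2, np.full(k, 1 / (2 * k))])
    A = 2 * k + 1                          # actions a_{1+}, a_{1-}, ..., a_perp (last)
    v_recv = np.zeros((Wbar, A))
    u1 = np.zeros((Wbar, A))
    for j in range(k):
        v_recv[:W, 2 * j] = v[j, :, 0] - v[j, :, 1]   # v(w, a_{j+}) = v_j(w)
        u1[:W, 2 * j] = u[j, :, 0]                    # u1(w, a_{j+}) = u_j(w, +)
        u1[:W, 2 * j + 1] = u[j, :, 1]                # u1(w, a_{j-}) = u_j(w, -)
        v_recv[W + j, :2 * k] = -M                    # under wbar_j, other actions cost M
        v_recv[W + j, 2 * j:2 * j + 2] = 0.0          # ... except a_{j+}, a_{j-}
    v_recv[:W, -1] = N                                # a_perp attractive on Omega
    v_recv[W:, -1] = -N                               # ... harmful on the new states
    u1[:, -1] = -C                                    # sender 1 hates a_perp
    pi2 = np.zeros((Wbar, k))                         # sender 2's fixed scheme over t_j
    pi2[:W, :] = 1 / k
    pi2[W:, :] = np.eye(k)
    return mubar, v_recv, u1, pi2
```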
A.3.2. PROOF OF THEOREM 3

Consider a signaling scheme $\pi_1 : \bar\Omega \to \Delta(S)$ of sender 1, where $S$ is the signal space. For a signal $s \in S$, let $x_s \in \Delta(\bar\Omega)$ be the posterior distribution over $\bar\Omega$ given $s$, and let $\pi_1(s) = \sum_{\omega \in \bar\Omega} \bar\mu(\omega)\pi_1(s|\omega)$ be the probability of sender 1 sending signal $s$. A valid signaling scheme $\pi_1$ must satisfy the following Bayesian plausibility condition:
$$\sum_{s \in S} \pi_1(s)\, x_s = \bar\mu, \quad \text{i.e.,} \quad \sum_{s \in S} \pi_1(s)x_s(\omega) = \frac{\mu(\omega)}{2} \ \text{ for } \omega \in \Omega, \qquad \sum_{s \in S} \pi_1(s)x_s(\bar\omega_j) = \frac{1}{2k} \ \text{ for } j \in [k]. \qquad (8)$$
Let $a^*(x_s, t_j)$ be the best action that the receiver takes when the posterior induced by sender 1 is $x_s$ (namely, sender 1 sends signal $s$) and sender 2 sends signal $t_j$. According to Claim 3, we have
$$a^*(x_s, t_j) \in \{a_\perp, a_{j+}, a_{j-}\}. \qquad (9)$$
Let $S_\perp$ be the set of signals of sender 1 for which the receiver takes action $a_\perp$ given some signal $t_j$ from sender 2:
$$S_\perp = \big\{ s \in S \,:\, a^*(x_s, t_j) = a_\perp \ \text{for some } j \in [k] \big\}.$$
Since $a_\perp$ is very harmful to sender 1 (causing utility $-C$), we show that the total probability of $S_\perp$ cannot be too large. Throughout the rest of the proof, write $\beta = \frac{1}{2N-1} + \frac{2k}{C}$.

Lemma 1. If sender 1's expected utility under signaling scheme $\pi_1$ is $\ge 0$, then $\pi_1(S_\perp) = \sum_{s \in S_\perp} \pi_1(s) \le \beta$.

Proof. Fixing sender 2's scheme, sender 1's expected utility is
$$u_1(\pi_1) = \sum_{s \in S} \pi_1(s)\Big[\sum_{\omega \in \bar\Omega} x_s(\omega)\sum_{j=1}^{k}\pi_2(t_j|\omega)\, u_1\big(\omega, a^*(x_s, t_j)\big)\Big] = \sum_{s \in S} \pi_1(s)\sum_{j=1}^{k}\Big[\frac{1}{k}\sum_{\omega \in \Omega} x_s(\omega)\, u_1\big(\omega, a^*(x_s, t_j)\big) + x_s(\bar\omega_j)\, u_1\big(\bar\omega_j, a^*(x_s, t_j)\big)\Big]. \qquad (10)$$
Note the identity
$$\sum_{j=1}^{k}\Big[\frac{1}{k}\sum_{\omega \in \Omega} x_s(\omega) + x_s(\bar\omega_j)\Big] = \sum_{\omega \in \bar\Omega} x_s(\omega)\sum_{j=1}^{k}\pi_2(t_j|\omega) = 1. \qquad (12)$$
Since the utility $u_1(\omega, a)$ is always $\le 1$, and sender 1 gets utility $-C$ whenever the receiver takes $a_\perp$, splitting (10) over $S_\perp$ and $S \setminus S_\perp$ and using (12) gives
$$u_1(\pi_1) \ \le\ 1 - (C+1)\sum_{s \in S_\perp}\pi_1(s)\Big[\frac{1}{k}\sum_{\omega \in \Omega} x_s(\omega) + x_s(\bar\omega_{j_s})\Big],$$
where $j_s$ is an index with $a^*(x_s, t_{j_s}) = a_\perp$. Using $u_1(\pi_1) \ge 0$ and rearranging, we get $\sum_{s \in S_\perp}\pi_1(s)\big[\frac{1}{k}\sum_{\omega \in \Omega} x_s(\omega) + x_s(\bar\omega_{j_s})\big] \le \frac{1}{C}$, which in particular implies
$$\sum_{s \in S_\perp}\pi_1(s)\sum_{\omega \in \Omega} x_s(\omega) \ \le\ \frac{k}{C}.$$
By the Bayesian plausibility condition (8), $\sum_{s \in S}\pi_1(s)\sum_{\omega \in \Omega} x_s(\omega) = \sum_{\omega \in \Omega}\frac{\mu(\omega)}{2} = \frac{1}{2}$, so
$$\sum_{s \in S \setminus S_\perp}\pi_1(s)\sum_{\omega \in \Omega} x_s(\omega) \ \ge\ \frac{1}{2} - \frac{k}{C}. \qquad (11)$$
For any signal $s \in S \setminus S_\perp$, the receiver does not take $a_\perp$ under any $t_j$, which by Claim 2 implies $x_s(\bar\omega_j) \ge \frac{N-1}{Nk}\sum_{\omega \in \Omega}x_s(\omega)$ for all $j \in [k]$. Summing over $j$ and using $\sum_{\omega \in \bar\Omega}x_s(\omega) = 1$, we get, for any $s \in S \setminus S_\perp$,
$$1 \ \ge\ \Big(2 - \frac{1}{N}\Big)\sum_{\omega \in \Omega} x_s(\omega), \quad \text{i.e.,} \quad \sum_{\omega \in \Omega} x_s(\omega) \ \le\ \frac{N}{2N-1} = \frac{1}{2}\Big(1 + \frac{1}{2N-1}\Big). \qquad (13)$$
From (11) and (13) we get
$$\sum_{s \in S \setminus S_\perp}\pi_1(s) \ \ge\ \Big(\frac{1}{2} - \frac{k}{C}\Big)\cdot\frac{2N-1}{N} \ \ge\ 1 - \frac{1}{2N-1} - \frac{2k}{C},$$
which proves the lemma since $\sum_{s \in S}\pi_1(s) = 1$.

We give another characterization of $\pi_1$: for most of the signals in $S \setminus S_\perp$, the total posterior probability of the states in $\Omega$, $\sum_{\omega \in \Omega}x_s(\omega)$, is close to $\frac{1}{2}$. Inequality (13) has shown the upper bound; the following lemma gives a lower bound.

Lemma 2. Fix any $\Delta > 0$, and let
$$S_\approx = \Big\{ s \in S \setminus S_\perp \,:\, \frac{1}{2} - \Delta \le \sum_{\omega \in \Omega}x_s(\omega) \le \frac{N}{2N-1} \Big\}, \qquad S_< = \Big\{ s \in S \setminus S_\perp \,:\, \sum_{\omega \in \Omega}x_s(\omega) < \frac{1}{2} - \Delta \Big\}.$$
(Note that $S_\approx \cup S_< = S \setminus S_\perp$ by (13).) Then $\pi_1(S_\approx)$ is large while $\pi_1(S_<)$ is small:
$$\pi_1(S_\approx) \ \ge\ 1 - \Big(1 + \frac{1}{2\Delta}\Big)\beta, \qquad \pi_1(S_<) \ \le\ \frac{\beta}{2\Delta}.$$

Proof. By (11) and the definitions of $S_\approx$ and $S_<$,
$$\frac{1}{2} - \frac{k}{C} \ \le\ \sum_{s \in S_\approx}\pi_1(s)\sum_{\omega \in \Omega}x_s(\omega) + \sum_{s \in S_<}\pi_1(s)\sum_{\omega \in \Omega}x_s(\omega) \ \le\ \frac{N}{2N-1}\big(\pi_1(S_\approx) + \pi_1(S_<)\big) - \Delta\,\pi_1(S_<) \ \le\ \frac{N}{2N-1} - \Delta\,\pi_1(S_<).$$
Rearranging, $\Delta\,\pi_1(S_<) \le \frac{N}{2N-1} - \frac{1}{2} + \frac{k}{C} = \frac{1}{2(2N-1)} + \frac{k}{C} = \frac{\beta}{2}$, which gives the bound on $\pi_1(S_<)$. Together with Lemma 1, this implies $\pi_1(S_\approx) = 1 - \pi_1(S_\perp) - \pi_1(S_<) \ge 1 - \big(1 + \frac{1}{2\Delta}\big)\beta$.

Now, we construct from $\pi_1$ a signaling scheme $\tilde\pi : \Omega \to \Delta(\tilde S)$ for the public persuasion problem. The signal space of $\tilde\pi$ is $\tilde S = S_\approx \cup \{s_0\}$. For any $s \in S_\approx$, let the induced posterior $\tilde x_s \in \Delta(\Omega)$ be
$$\tilde x_s(\omega) = \frac{x_s(\omega)}{\sum_{\omega' \in \Omega}x_s(\omega')}$$
(where $x_s$ is the posterior induced by $s$ under the signaling scheme $\pi_1$), and let
$$\tilde\pi(s) = \frac{\pi_1(s)}{\sum_{s' \in S_\approx}\pi_1(s')} \ \ge\ \pi_1(s), \qquad \text{so that} \quad \sum_{s \in S_\approx}\tilde\pi(s) = 1.$$
We will construct the posterior for $s_0$ later.

Lemma 3. For any $\omega \in \Omega$,
$$\Big|\sum_{s \in S_\approx}\tilde\pi(s)\tilde x_s(\omega) - \mu(\omega)\Big| \ \le\ 4\Delta + \frac{4\beta}{\Delta}.$$

Proof. On the one hand, since $\tilde\pi(s) \ge \pi_1(s)$ and, by (13), $\tilde x_s(\omega) \ge 2\big(1 - \frac{1}{2N-1}\big)x_s(\omega)$ for $s \in S_\approx$,
$$\sum_{s \in S_\approx}\tilde\pi(s)\tilde x_s(\omega) \ \ge\ 2\Big(1 - \frac{1}{2N-1}\Big)\sum_{s \in S_\approx}\pi_1(s)x_s(\omega) \ \ge\ 2\Big(1 - \frac{1}{2N-1}\Big)\Big(\frac{\mu(\omega)}{2} - \pi_1(S_\perp) - \pi_1(S_<)\Big) \ \ge\ \mu(\omega) - \Big(4\Delta + \frac{4\beta}{\Delta}\Big),$$
where the second inequality uses Bayesian plausibility (8) and the last uses Lemmas 1 and 2. On the other hand, since $\sum_{\omega' \in \Omega}x_s(\omega') \ge \frac{1}{2} - \Delta$ for $s \in S_\approx$, we have $\tilde x_s(\omega) \le \frac{x_s(\omega)}{1/2 - \Delta} \le 2(1 + 4\Delta)\,x_s(\omega)$ (for $\Delta \le \frac{1}{4}$), and Lemma 2 gives $\sum_{s' \in S_\approx}\pi_1(s') \ge 1 - \big(1 + \frac{1}{2\Delta}\big)\beta$. Therefore, again using (8),
$$\sum_{s \in S_\approx}\tilde\pi(s)\tilde x_s(\omega) \ \le\ \frac{2(1 + 4\Delta)}{1 - \big(1 + \frac{1}{2\Delta}\big)\beta}\sum_{s \in S_\approx}\pi_1(s)x_s(\omega) \ \le\ \frac{(1 + 4\Delta)\,\mu(\omega)}{1 - \big(1 + \frac{1}{2\Delta}\big)\beta} \ \le\ \mu(\omega) + 4\Delta + \frac{4\beta}{\Delta}.$$
The two cases together prove the lemma.

As shown in Lemma 3, the signaling scheme $\tilde\pi$ with signals in $S_\approx$ may not exactly satisfy the Bayesian plausibility condition $\sum_{s \in S_\approx}\tilde\pi(s)\tilde x_s = \mu$. That is why we need the additional signal $s_0$: we want to find a posterior $y \in \Delta(\Omega)$ for signal $s_0$ and a coefficient $\alpha \in [0,1]$ such that the following convex combination of $\{\tilde x_s\}_{s \in S_\approx}$ and $y$ satisfies Bayesian plausibility:
$$(1 - \alpha)\sum_{s \in S_\approx}\tilde\pi(s)\tilde x_s + \alpha y = \mu. \qquad (14)$$

Lemma 4. Suppose $\min_{\omega \in \Omega}\mu(\omega) \ge p_0 \ge 2\big(4\Delta + \frac{4\beta}{\Delta}\big) > 0$. Then there exist $y \in \Delta(\Omega)$ and $\alpha \le \frac{2}{p_0}\big(4\Delta + \frac{4\beta}{\Delta}\big)$ that satisfy (14).

Proof. Let $z = \sum_{s \in S_\approx}\tilde\pi(s)\tilde x_s$. By Lemma 3, we have
$$\|z - \mu\|_\infty = \max_{\omega \in \Omega}|z(\omega) - \mu(\omega)| \ \le\ 4\Delta + \frac{4\beta}{\Delta}.$$
To satisfy (14), which is equivalent to $(1 - \alpha)z + \alpha y = \mu$, i.e., $\alpha(y - z) = \mu - z$, we can let $y$ be the intersection of the ray starting from $z$ and pointing towards $\mu$ with the boundary of $\Delta(\Omega)$. By doing this, $y - z$ and $\mu - z$ are in the same direction, and $\alpha = \frac{\|\mu - z\|_\infty}{\|y - z\|_\infty}$. Since $y$ is on the boundary of $\Delta(\Omega)$, some $y(\omega)$ must be $0$, so
$$\|y - z\|_\infty \ \ge\ \min_{\omega \in \Omega} z(\omega) \ \ge\ \min_{\omega \in \Omega}\mu(\omega) - \Big(4\Delta + \frac{4\beta}{\Delta}\Big) \ \ge\ \frac{p_0}{2}.$$
This implies $\alpha \le \frac{2}{p_0}\big(4\Delta + \frac{4\beta}{\Delta}\big)$.

With (14) satisfied, $\tilde\pi$ is now a valid signaling scheme for the public persuasion problem: it sends signal $s \in S_\approx$ with probability $(1 - \alpha)\tilde\pi(s)$, inducing posterior $\tilde x_s$, and sends signal $s_0$ with probability $\alpha$, inducing posterior $y$.
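The ray construction in Lemma 4 is easy to check numerically. The sketch below is ours (it assumes $z \neq \mu$, so that the ray exits the simplex): it shoots the ray from $z$ through $\mu$ to a boundary point $y$ and recovers the mixing weight $\alpha$ satisfying $(1-\alpha)z + \alpha y = \mu$.

```python
# A small numerical illustration (ours) of the ray construction in Lemma 4.
import numpy as np

def ray_mix(z, mu):
    d = mu - z
    # largest t with z + t*d >= 0 componentwise (exit point of the simplex);
    # t >= 1 because mu = z + 1*d already lies in the simplex
    t = min(z[i] / -d[i] for i in range(len(z)) if d[i] < 0)
    y = z + t * d        # boundary point: some coordinate hits 0
    alpha = 1.0 / t      # then (1 - alpha) * z + alpha * y == mu
    return y, alpha

z = np.array([0.28, 0.22, 0.5])
mu = np.array([0.3, 0.2, 0.5])
y, alpha = ray_mix(z, mu)    # y = [0.5, 0, 0.5], alpha = 1/11
assert np.allclose((1 - alpha) * z + alpha * y, mu)
```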
Let us now consider the sender's utility (7) in the public persuasion problem under $\tilde\pi$:
$$u^{\mathrm{Pub}}(\tilde\pi) = (1-\alpha)\sum_{s \in S_\approx}\tilde\pi(s)\frac{1}{k}\sum_{j=1}^{k}\sum_{\omega \in \Omega}\tilde x_s(\omega)\,u_j\big(\omega, a^*_j(\tilde x_s)\big) + \alpha\,\frac{1}{k}\sum_{j=1}^{k}\sum_{\omega \in \Omega}y(\omega)\,u_j\big(\omega, a^*_j(y)\big) \ \ge\ \sum_{s \in S_\approx}\tilde\pi(s)\frac{1}{k}\sum_{j=1}^{k}\sum_{\omega \in \Omega}\tilde x_s(\omega)\,u_j\big(\omega, a^*_j(\tilde x_s)\big) - \alpha,$$
because $0 \le u_j(\omega, a) \le 1$. By Claim 4, receiver $j$'s best action $a^*_j(\tilde x_s)$ is $+$ (resp. $-$) if and only if the receiver in the multi-sender problem takes action $a^*(x_s, t_j) = a_{j+}$ (resp. $a_{j-}$) given posterior $x_s$ from sender 1 and signal $t_j$ from sender 2. So, by the definition of sender 1's utility in the multi-sender problem, $u_j(\omega, a^*_j(\tilde x_s)) = u_1(\omega, a^*(x_s, t_j))$. Then, using $\tilde\pi(s) \ge \pi_1(s)$, $\tilde x_s(\omega) \ge 2\big(1 - \frac{1}{2N-1}\big)x_s(\omega)$, and $u_1(\omega, a^*(x_s, t_j)) \ge 0$ for $s \in S_\approx$, we have
$$u^{\mathrm{Pub}}(\tilde\pi) \ \ge\ 2\Big(1 - \frac{1}{2N-1}\Big)\sum_{s \in S_\approx}\pi_1(s)\frac{1}{k}\sum_{j=1}^{k}\sum_{\omega \in \Omega}x_s(\omega)\,u_1\big(\omega, a^*(x_s, t_j)\big) - \alpha.$$
On the other hand, consider sender 1's utility in the multi-sender problem under $\pi_1$. By Equation (10), since $u_1(\cdot,\cdot) \le 1$, each inner sum over $j$ is at most 1 by (12), and $u_1(\bar\omega_j, a^*(x_s, t_j)) = 0$ for $s \in S_\approx$ (because $a^*(x_s, t_j) \in \{a_{j+}, a_{j-}\}$ from (9)), we get
$$u_1(\pi_1) \ \le\ \sum_{s \in S_\approx}\pi_1(s)\frac{1}{k}\sum_{j=1}^{k}\sum_{\omega \in \Omega}x_s(\omega)\,u_1\big(\omega, a^*(x_s, t_j)\big) + \pi_1(S_\perp) + \pi_1(S_<),$$
so that, by Lemmas 1 and 2,
$$\sum_{s \in S_\approx}\pi_1(s)\frac{1}{k}\sum_{j=1}^{k}\sum_{\omega \in \Omega}x_s(\omega)\,u_1\big(\omega, a^*(x_s, t_j)\big) \ \ge\ u_1(\pi_1) - \Big(1 + \frac{1}{2\Delta}\Big)\beta.$$
This implies
$$u^{\mathrm{Pub}}(\tilde\pi) \ \ge\ 2\Big(1 - \frac{1}{2N-1}\Big)\Big[u_1(\pi_1) - \Big(1 + \frac{1}{2\Delta}\Big)\beta\Big] - \alpha. \qquad (15)$$

Finally, we prove that if the signaling scheme $\pi_1$ is nearly optimal for the multi-sender best-response problem, then the corresponding scheme $\tilde\pi$ for the public persuasion problem must be nearly optimal as well.

Claim 5. If $\pi_1$ is approximately optimal for sender 1's best-response problem up to additive error $c$, then the $\tilde\pi$ constructed above is approximately optimal for the public persuasion problem up to additive error $2c + \big(3 + \frac{1}{\Delta}\big)\beta + \frac{2}{p_0}\big(4\Delta + \frac{4\beta}{\Delta}\big)$.

Proof. Let $\pi^*$ be the optimal signaling scheme for the public persuasion problem, inducing posterior $x^*_s \in \Delta(\Omega)$ at each signal $s \in S^*$. Let $\pi^*_1$ be the following signaling scheme for sender 1 in the multi-sender problem: for any signal $s \in S^*$, the probability of the signal is $\pi^*_1(s) = \pi^*(s)$, and the induced posterior $\bar x^*_s \in \Delta(\bar\Omega)$ is
$$\bar x^*_s(\omega) = \frac{x^*_s(\omega)}{2} \ \text{ for } \omega \in \Omega, \qquad \bar x^*_s(\bar\omega_j) = \frac{1}{2k} \ \text{ for } j \in [k].$$
It is easy to verify that $\pi^*_1$ is valid (it satisfies Bayesian plausibility (8)). We then note that, at each posterior $\bar x^*_s$, the receiver's utility of taking action $a_\perp$ is always 0, regardless of sender 2's signal $t_j$:
$$v(\bar x^*_s, t_j, a_\perp) = N\Big(\frac{1}{k}\sum_{\omega \in \Omega}\bar x^*_s(\omega) - \bar x^*_s(\bar\omega_j)\Big) = N\Big(\frac{1}{k}\cdot\frac{1}{2} - \frac{1}{2k}\Big) = 0.$$
So, we can assume that the receiver takes $a_{j+}$ or $a_{j-}$, by Claim 3. Moreover, by Claim 4, the receiver takes $a_{j+}$ (resp. $a_{j-}$) if and only if receiver $j$ in the public persuasion problem with belief $x^*_s$ takes action $+$ (resp. $-$). So, $u_1(\omega, a^*(\bar x^*_s, t_j)) = u_j(\omega, a^*_j(x^*_s))$. This means that the utility of sender 1 in the multi-sender problem satisfies
$$u_1(\pi^*_1) = \sum_{s \in S^*}\pi^*(s)\sum_{j=1}^{k}\Big[\frac{1}{k}\sum_{\omega \in \Omega}\bar x^*_s(\omega)\,u_1\big(\omega, a^*(\bar x^*_s, t_j)\big) + \bar x^*_s(\bar\omega_j)\underbrace{u_1\big(\bar\omega_j, a^*(\bar x^*_s, t_j)\big)}_{=0 \text{ since } a^*(\bar x^*_s, t_j) \neq a_\perp}\Big] = \sum_{s \in S^*}\pi^*(s)\frac{1}{k}\sum_{j=1}^{k}\sum_{\omega \in \Omega}\frac{x^*_s(\omega)}{2}\,u_j\big(\omega, a^*_j(x^*_s)\big) = \frac{1}{2}u^{\mathrm{Pub}}(\pi^*).$$
If $\pi_1$ is approximately optimal up to additive error $c$ in the multi-sender best-response problem, then $u_1(\pi_1) \ge u_1(\pi^*_1) - c$. Plugging this into (15) and using $u^{\mathrm{Pub}}(\pi^*) \le 1$ and $\frac{1}{2N-1} \le \beta$,
$$u^{\mathrm{Pub}}(\tilde\pi) \ \ge\ 2\Big(1 - \frac{1}{2N-1}\Big)\Big[\frac{1}{2}u^{\mathrm{Pub}}(\pi^*) - c - \Big(1 + \frac{1}{2\Delta}\Big)\beta\Big] - \alpha \ \ge\ u^{\mathrm{Pub}}(\pi^*) - 2c - \Big(3 + \frac{1}{\Delta}\Big)\beta - \alpha.$$
Together with the bound $\alpha \le \frac{2}{p_0}\big(4\Delta + \frac{4\beta}{\Delta}\big)$ from Lemma 4, this proves the claim.

We now prove Theorem 3. Let $\langle k, \Omega, \mu, \{v_j(\omega), u_j(\omega)\}_{j \in [k], \omega \in \Omega}\rangle$ be any public persuasion problem with $|\Omega| = k$ states and uniform prior $\mu(\omega) = \frac{1}{k} = p_0$. Construct the multi-sender best-response problem as above (where the range of sender 1's utility is $[-C, 1]$). If we can find an $\epsilon$-approximately optimal signaling scheme $\pi_1$ for sender 1's best-response problem after normalizing the utility range to $[-1, 1]$, then $\pi_1$ is a $C\epsilon$-approximately optimal signaling scheme with respect to the original utility range $[-C, 1]$. By Claim 5, the scheme $\tilde\pi$ constructed above is then approximately optimal for the public persuasion problem with additive error at most
$$2C\epsilon + \Big(3 + \frac{1}{\Delta}\Big)\beta + \frac{2}{p_0}\Big(4\Delta + \frac{4\beta}{\Delta}\Big).$$
Let $C = k^5$, $N = k^4$, and $\Delta = \frac{1}{k^2}$, so that $\beta \le \frac{3}{k^4}$. Then every term above except $2C\epsilon = 2k^5\epsilon$ is $O(\frac{1}{k})$, and with $\epsilon = \frac{1}{k^6}$ the total error is $O(\frac{1}{k}) < \frac{1}{9}$ for sufficiently large $k$. Theorem 6 says that finding a $\frac{1}{9}$-approximation for public persuasion is NP-hard. So, finding an $\epsilon = \frac{1}{k^6}$-approximation for the multi-sender best-response problem is NP-hard.

A.4. Proof of Theorem 4

Proof. Since at each state $\omega$ there is a unique optimal action $a^*(\omega)$, it suffices to consider $|A| \le |\Omega|$. Next, let the signal space of each sender have size $|S| = |A|^{\frac{1}{n-1}} =: k$; we shall see that this is without loss of generality when the signal space is larger. We first give a construction for a mapping $\alpha$ from all $k$-ary strings of length $n$ (all possible joint signals) to $A$. Let $\zeta$ denote a subset of these strings such that for any two strings $s^1, s^2 \in \zeta$, the Hamming distance between them is at least two: $d_H(s^1, s^2) \ge 2$. The $k$-ary Gray code is an ordering of all unique $k$-ary strings of length $n$ such that any two consecutive strings are exactly Hamming distance 1 apart; such a construction is always possible (Guan, 1998). Since there are $k$ different values possible at any position, within every $k$ consecutive strings of the Gray code we can find two strings that are Hamming distance 2 apart; thus we can choose $\zeta$ with $|\zeta| \ge k^{n-1} = |A|$. This is indeed tight, since $k^{n-1}$ is the total number of unique $k$-ary strings of length $n-1$: if $|\zeta| > k^{n-1}$, then by pigeonhole two strings in $\zeta$ would match on $n-1$ positions and hence be at Hamming distance at most 1, violating the construction of $\zeta$.
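For a concrete instance of such a set $\zeta$, one standard alternative to the Gray-code argument is a mod-$k$ checksum code: keep exactly the strings whose digit sum is $0 \bmod k$. Changing a single position changes the digit sum by a nonzero amount mod $k$, so any two distinct strings in this set differ in at least two positions, and the set has size exactly $k^{n-1}$. The sketch below (ours, not the construction used in the proof) verifies this:

```python
# Build and verify a set zeta of k-ary strings at pairwise Hamming distance >= 2.
import itertools

def build_zeta(k, n):
    zeta = [s for s in itertools.product(range(k), repeat=n) if sum(s) % k == 0]
    assert len(zeta) == k ** (n - 1)      # exactly |A| strings
    return zeta

def hamming(s1, s2):
    return sum(a != b for a, b in zip(s1, s2))

zeta = build_zeta(k=3, n=3)               # 9 strings, one per action
assert all(hamming(s1, s2) >= 2 for s1, s2 in itertools.combinations(zeta, 2))
```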
We construct $\alpha$ as follows: map each string in $\zeta$ to a unique action in $A$, and assign the remaining joint-signal strings arbitrarily to actions. Under this mapping, we now give a constructive joint signaling scheme that is (1) a pure Nash equilibrium and (2) fully reveals the optimal action to the agent. Let $\alpha^{-1}(a)$ denote the joint signal $s \in \zeta$ such that $\alpha(s) = a$. Further, let $f : \Omega \to A$ denote the unique agent-optimal action under state $\omega$, with its inverse $f^{-1}(a)$ denoting the set of states for which this action is agent-optimal. Next, consider the following joint signaling scheme: for all $s \in \zeta$, $\pi(s|\omega) = 1$ if $\omega \in f^{-1}(\alpha(s))$. That is, for any $\omega$, the joint signal $s \in \zeta$ that corresponds to the optimal agent action under $\omega$, i.e., $\alpha(s) = f(\omega)$, is sent with probability 1. The agent can thus uniquely map each joint signal realization to a set of states sharing a fixed optimal action. In other words, this fully reveals the optimal action for the agent at any state realization $\omega$.

To show this is a Nash equilibrium, observe that since all strings in $\zeta$ are Hamming distance at least 2 apart, there is in fact a bijection between any $(n-1)$-length sub-string within $\zeta$ and the corresponding action. Thus, each optimal action is fully specified by the signals of just $n-1$ senders. So if a sender unilaterally shifts her signaling, the agent can observe that the remaining $n-1$ signals still uniquely map to states that share a common optimal action, and can essentially ignore the deviating sender's signal. Thus, no change in the agent's belief or action occurs, making the deviation non-beneficial. Since the choice of the deviating sender here is arbitrary, the presented scheme is a pure Nash equilibrium.

However, the full-revelation equilibrium is not unique, which we show through an example. Consider $n = 2$ senders, with $|A| = 4$ actions and $|\Omega| = 4$ states, with prior $[0.15, 0.35, 0.15, 0.35]$. Sender 1 has utility 1 whenever action 1 is taken and 0 otherwise; similarly, sender 2 has utility 1 whenever action 3 is taken and 0 otherwise. Note that both utilities are agnostic to the state $\omega$. The receiver's utility is given by the following matrix (rows are states, columns are actions):
$$\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix} \qquad (16)$$
Under a full-revelation or optimal-action-revelation equilibrium (with $f$ mapping state $\omega_i$ to action $i$), each sender would get utility 0.15. Now consider the following signaling scheme using only 3 signals, which we express as $|\Omega| \times |S|$ matrices, one per sender:¹
$$\pi_1 = \begin{pmatrix} 0 & 1 & 0 \\ 4/7 & 3/7 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}, \qquad \pi_2 = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 4/7 & 3/7 & 0 \end{pmatrix} \qquad (17)$$
The joint signal realizations $01$ and $12$ from this scheme each occur with probability 0.3 and induce the following posterior beliefs:
$$\mu_{01} = [0, 0, 0.5, 0.5], \qquad \mu_{12} = [0.5, 0.5, 0, 0]. \qquad (18)$$
Note that for any tie-breaking rule that favors the senders, the posteriors above give utility 0.3 to both senders. All other posteriors lead to actions that give 0 utility to both senders. We can use the optimization program presented in Proposition 2 to verify that this is an equilibrium.

¹A scheme with 3 signals can without loss of generality be extended to a scheme with 4 signals, which is what the optimal receiver-action-revelation scheme uses.
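As a sanity check, the posteriors in (18) and their probabilities can be recomputed directly from $\mu_0$, $\pi_1$, and $\pi_2$ (a verification of ours, not code from the paper):

```python
# Recompute the joint-signal probabilities and posteriors in (18).
import numpy as np

mu0 = np.array([0.15, 0.35, 0.15, 0.35])
pi1 = np.array([[0, 1, 0], [4/7, 3/7, 0], [1, 0, 0], [1, 0, 0]])
pi2 = np.array([[0, 0, 1], [0, 0, 1], [0, 1, 0], [4/7, 3/7, 0]])

def posterior(s1, s2):
    joint = mu0 * pi1[:, s1] * pi2[:, s2]      # P(omega, s1, s2)
    return joint.sum(), joint / joint.sum()    # P(s1, s2), posterior

for s in [(0, 1), (1, 2)]:
    p, post = posterior(*s)
    print(s, round(p, 4), post.round(4))
# (0, 1) 0.3 [0.  0.  0.5 0.5]
# (1, 2) 0.3 [0.5 0.5 0.  0. ]
```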
A.5. Proof of Theorem 5

Proof. It is known that finding Nash equilibria in 2-player games with 0/1 utilities is PPAD-hard (Abbott et al., 2005; Chen et al., 2009). We reduce this PPAD-hard problem to multi-sender persuasion, which proves that the latter problem is also PPAD-hard. Let $\hat u_1, \hat u_2 \in \{0,1\}^{m \times m}$ be the utility matrices of the 2 players, where $m$ is the number of actions of each player. We construct a multi-sender persuasion game as follows:
- There are 2 states $\Omega = \{0, 1\}$ with prior $\mu_0(0) = \mu_0(1) = \frac{1}{2}$, $|A| = 4$ actions for the receiver labeled $A = \{a_{00}, a_{01}, a_{10}, a_{11}\}$, and $n = 2$ senders, each having signal space $S = \{1, \ldots, m\}$.
- The receiver's utility is 0 regardless of actions and states, so he is indifferent among all actions. Suppose the receiver breaks ties in the following way: given joint signal $(s_1, s_2)$ from the 2 senders, take action
$$\alpha(s_1, s_2) = \begin{cases} a_{00} & \text{if } \hat u_1(s_1, s_2) = 0,\ \hat u_2(s_1, s_2) = 0; \\ a_{01} & \text{if } \hat u_1(s_1, s_2) = 0,\ \hat u_2(s_1, s_2) = 1; \\ a_{10} & \text{if } \hat u_1(s_1, s_2) = 1,\ \hat u_2(s_1, s_2) = 0; \\ a_{11} & \text{if } \hat u_1(s_1, s_2) = 1,\ \hat u_2(s_1, s_2) = 1. \end{cases}$$
- The utility of each sender $i$ is:
$$u_i(a, \omega = 1) = \begin{cases} \hat u_i(s_1, s_2) & \text{if there exist } s_1, s_2 \in S \text{ such that } \alpha(s_1, s_2) = a; \\ 0 & \text{otherwise,} \end{cases} \qquad u_i(a, \omega = 0) = 0 \ \ \forall a \in A.$$

We note that the first equation is well-defined: for any two different joint signals $(s_1, s_2)$ and $(s_1', s_2')$, if they both satisfy $\alpha(s_1, s_2) = \alpha(s_1', s_2') = a$, then they define the same utility $u_i(a, \omega = 1) = \hat u_i(s_1, s_2) = \hat u_i(s_1', s_2')$, since $\alpha$ encodes both players' utility entries. We further note that the expected utility of each sender $i$ under signaling schemes $\pi = (\pi_1, \pi_2)$ is equal to
$$\sum_{\omega \in \Omega}\sum_{s \in S^n}\mu_0(\omega)\pi(s|\omega)\,u_i(\alpha(s), \omega) = \frac{1}{2}\sum_{s_1, s_2}\pi_1(s_1|\omega{=}1)\pi_2(s_2|\omega{=}1)\,u_i\big(\alpha(s_1, s_2), \omega{=}1\big) = \frac{1}{2}\sum_{s_1, s_2}\pi_1(s_1|\omega{=}1)\pi_2(s_2|\omega{=}1)\,\hat u_i(s_1, s_2) = \frac{1}{2}\hat u_i(x_1, x_2),$$
where $\hat u_i(x_1, x_2)$ is the expected utility of player $i$ in the 2-player 0/1-utility game when the two players use mixed strategies $x_1, x_2$, with player $i$ sampling action $s_i \in \{1, \ldots, m\}$ with probability $x_i(s_i) = \pi_i(s_i|\omega = 1)$. Hence, if we can find an NE $(\pi_1, \pi_2)$ of the multi-sender persuasion game, then the corresponding mixed strategy profile $(x_1, x_2)$ with $x_i(s_i) = \pi_i(s_i|\omega = 1)$ is an NE of the 2-player 0/1-utility game, which is PPAD-hard to find.

B. Find Local NE via Deep Learning

In this section, we describe the settings of our deep learning experiments and show more results. For each problem instance, we collect a dataset comprising 50,000 randomly selected samples and train the networks for 30 epochs using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.01. For extra-gradient, we initiate the optimization process from a set of 300 random starting points. For each starting point, we run 20 iterations of extra-gradient updates with the Adam optimizer and a learning rate of 0.1. We then use the result with the highest social welfare to compare the performance of the different algorithms.

[Figure 8: three panels plotting social welfare against the number of signals, states, and actions, comparing DNL (ours) with ReLU and DeLU networks, and against the full-revelation equilibrium and the initial policy of the extra-gradient optimization.]

Figure 8. Our method retains its advantage and achieves higher social welfare compared against baselines and full-revelation solutions when we adopt a stricter standard for the local NE check procedure (increasing $\epsilon$ from 0.005 to 0.01).

To evaluate whether a joint signaling policy profile $\pi$ derived from the extra-gradient algorithm constitutes a local NE, we randomly select $K$ policies $\pi_j'$ for each sender $j$ within the vicinity $\{\pi_j' : \|\pi_j' - \pi_{\phi_j}\| \le \epsilon\}$. We then verify whether any of these deviations results in increased utility. The number of test samples $K$ scales with the problem size:
$$K = \min\big\{10000,\ 1000\,(n-1)(|\Omega|-1)(|S|-1)(|A|-1)\big\}. \qquad (19)$$
In the main text, we set the neighborhood size $\epsilon$ to 0.005.
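In pseudocode form, the check described above looks roughly as follows. This is our paraphrase: `utility` is a hypothetical callable returning a sender's expected utility under a joint profile (the `sender_utility` sketch from Appendix A.1 would serve), and the clip-and-renormalize projection onto valid signaling schemes is one simple choice, not necessarily the paper's.

```python
# A sketch (ours) of the epsilon-local NE check with K sampled deviations.
import numpy as np

def is_local_ne(pi, utility, eps=0.005, K=1000, rng=np.random.default_rng(0)):
    """pi: list of n arrays, pi[j][w, s] = pi_j(s | omega = w)."""
    n = len(pi)
    for j in range(n):
        base = utility(j, pi)
        for _ in range(K):
            noise = rng.uniform(-eps, eps, size=pi[j].shape)
            dev = np.clip(pi[j] + noise, 0.0, None)
            dev = dev / dev.sum(axis=1, keepdims=True)   # renormalize rows
            if utility(j, pi[:j] + [dev] + pi[j + 1:]) > base + 1e-9:
                return False                             # profitable deviation found
    return True
```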
Now we apply a more stringent criterion for $\epsilon$-local NE, with $\epsilon$ set to 0.01 and the number of extra-gradient optimization steps increased to 30 accordingly. We reassess our method under this setting against the baselines in Fig. 8. As we can observe, the performance of our method remains significantly better than that of the other algorithms.

C. More Related Works

Our work focuses on a type of Stackelberg game. Since the signaling strategies of the principals can be continuous, this game has a continuous action space. Stackelberg games are employed in various real-world hierarchical scenarios, including taxation (Zheng et al., 2022), security (Jiang et al., 2013; Gan et al., 2020), and business strategies (Naghizadeh & Liu, 2014; Zhang et al., 2016; Aussel et al., 2020). These games typically involve a leader and a follower. In such games with discrete choices, Conitzer & Sandholm (2006) demonstrate that linear programming can efficiently find Stackelberg equilibria using the strategy spaces of both players. For continuous decision spaces, Jin et al. (2020) and Fiez et al. (2020) introduce and define local Stackelberg equilibria through first- and second-order conditions, with Jin et al. (2020) also showing that gradient descent-ascent methods can achieve these equilibria under certain conditions, and Fiez et al. (2020) providing specific update rules that guarantee convergence. With multiple followers (Zhang et al., 2024), unless they operate independently (Calvete & Galé, 2007), identifying Stackelberg equilibria is significantly harder and becomes NP-hard, even if the followers have structured equilibria (Basilico et al., 2017). Wang et al. (2021a) suggest managing an arbitrary equilibrium that the follower may reach through differentiation. Meanwhile, Gerstgrasser & Parkes (2023) develop a meta-learning framework across various follower policies, facilitating quicker adaptations for the principal; this builds on Brero et al. (2022), who pioneered the Stackelberg-POMDP model.

The field of multi-agent reinforcement learning (Yu et al., 2022; Wen et al., 2022; Kuba et al., 2021; Christianos et al., 2020; Peng et al., 2021; Jiang et al., 2019; Rashid et al., 2018; Wang et al., 2020; 2021b; 2019b; Kang et al., 2020; Li et al., 2021; Wang et al., 2021d; Guestrin et al., 2002b;a; Böhmer et al., 2020; Kang et al., 2022; Wang et al., 2021c; Yang et al., 2022; Dong et al., 2022; 2023; Wang et al., 2019a) is expanding the application of Stackelberg concepts to more complex, realistic settings. Tharakunnel & Bhattacharyya (2007) introduced a Leader-Follower Semi-Markov Decision Process for sequential learning in Stackelberg settings. Cheng et al. (2017) developed a method known as Stackelberg Q-learning, albeit without proving convergence. Furthermore, Shu & Tian (2019) and Shi et al. (2019) have empirically examined these leader-follower dynamics, focusing on the leader's use of deep learning models to predict follower actions.