Heterogeneous Multi-Agent Bandits with Parsimonious Hints

Amirmahdi Mirfakhar¹, Xuchuang Wang¹, Jinhang Zuo², Yair Zick¹, Mohammad Hajiesmaili¹
¹University of Massachusetts Amherst, ²City University of Hong Kong
smirfakhar@umass.edu, xuchuangw@gmail.com, jinhang.zuo@cityu.edu.hk, yzick@umass.edu, hajiesmaili@cs.umass.edu

Abstract

We study a hinted heterogeneous multi-agent multi-armed bandits problem (HMA2B), where agents can query low-cost observations (hints) in addition to pulling arms. In this framework, each of the $M$ agents has a unique reward distribution over $K$ arms, and in $T$ rounds, an agent observes the reward of the arm it pulls only if no other agent pulls that arm. The goal is to maximize the total utility by querying the minimal necessary hints without pulling arms, achieving time-independent regret. We study HMA2B in both centralized and decentralized setups. Our main centralized algorithm, GP-HCLA, an extension of HCLA, uses a central decision-maker for arm pulling and hint queries, achieving $O(M^4K)$ regret with $O(MK \log T)$ adaptive hints. In decentralized setups, we propose two algorithms, HD-ETC and EBHD-ETC, that allow agents to choose actions independently through collision-based communication and query hints uniformly until stopping, yielding $O(M^3K^2)$ regret with $O(M^3K \log T)$ hints, where the former requires knowledge of the minimum gap and the latter does not. Finally, we establish lower bounds to prove the optimality of our results and verify them through numerical simulations.

1 Introduction

The multi-agent multi-armed bandit (MA2B) problem (Liu and Zhao 2010; Anandkumar et al. 2011) is a sequential decision-making task consisting of $K \in \mathbb{N}^+$ arms and $M \in \mathbb{N}^+$ agents. In each of the total $T \in \mathbb{N}^+$ decision rounds, each agent selects one arm to pull and observes its reward if no other agent pulls the same arm (called no collision).
This model has applications in wireless communication (Jouini et al. 2009, 2010), caching (Xu, Tao, and Shen 2020; Xu and Tao 2020), and edge computing (Wu et al. 2021). Among various models in MA2B, the heterogeneous multi-agent multi-armed bandit (Bistritz and Leshem 2018; Shi et al. 2021) is a more realistic variant for these applications, where agents have different reward distributions over the arms, e.g., in a wireless communication scenario where agents experience different channel qualities due to their geographical locations. In this heterogeneous MA2B model, the optimal action of all agents is a bipartite matching (between agents and arms) that maximizes the total reward, called the optimal matching. An algorithm's performance is evaluated by regret, the difference between the cumulative reward of choosing the optimal matching in every decision round and the total reward of the bandit algorithm. A smaller expected regret implies a better algorithm.

(Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.)

Recently, learning-augmented approaches are emerging, e.g., Lykouris and Vassilvitskii (2021); Bamas, Maggiori, and Svensson (2020); Bhaskara et al. (2023). This stream of research studies how to assist an algorithm with hints (a.k.a. predictions) queried from existing ML models, e.g., large language models (Achiam et al. 2023), deep convolutional neural networks (Krizhevsky, Sutskever, and Hinton 2017), and deep reinforcement learning (François-Lavet et al. 2018). In this paper, we study the utilization of hint information in heterogeneous multi-agent multi-armed bandits. In addition to receiving feedback from pulling arms, agents can sequentially query hints about the potential rewards of other arms, assisting their decision-making process. We call the model Hinted Heterogeneous Multi-Agent Multi-Armed Bandits (HMA2B).
Specifically, we consider a simple and accurate hint mechanism where agents can query the reward of an arm without pulling it, with no regret incurred from the queried hint. Despite assuming accurate hints, this model poses challenges, such as balancing hint queries and arm pulls while accounting for agent heterogeneity and potential future collisions. In addition to minimizing regret, we aim to reduce hint complexity, the total number of queried hints, since querying hints, e.g., via the GPT-4 API (Achiam et al. 2023), can be costly. Efficiently leveraging hints is crucial in scenarios where hint costs are significantly lower than the costs of taking actions. For instance, in labor markets, structured, low-cost interviews provide hints that improve applicant-role matching, reducing the risk of human-resource misallocation. Similarly, in radio channel assignment, test signals serve as hints to allocate high-bandwidth channels effectively, preventing delays and disruptions in critical applications like disaster recovery, where drones depend on reliable communication channels. We study two scenarios of HMA2B: centralized and decentralized setups.
In the centralized setup, an omniscient decision-maker determines which arm each agent should pull or query hints from, similar to decision-making in hiring processes, where the employer has access to the applicants' information to decide which of them to interview and which to hire. In the decentralized setup, agents independently decide their actions through collision-based communication, e.g., in radio channel allocation to stations.

(The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25))

| Algorithm | D/C | Regret | Queried Hints | Communication |
|---|---|---|---|---|
| HCLA (Algorithm 1) | C | $O(MK^{2M})$ | $O(MK^M \log T)$ | N/A |
| G-HCLA (Algorithm 5) | C | $O(M^4K)$ | $O(M^2K \log T)$ | N/A |
| GP-HCLA (Algorithm 2) | C | $O(M^4K)$ | $O(MK \log T)$ | N/A |
| HD-ETC (Algorithm 3)* | D | $O(M^3K^2)$ | $O(M^3K \log(MT))$ | $O(\log T)$ |
| EBHD-ETC (Algorithm 4) | D | $O(M^3K^2)$ | $O(M^3K \log(MT))$ | $O(\log T)$ |

Table 1: Regret, Queried Hints, and Communication Bounds for Centralized and Decentralized Algorithms (C and D stand for centralized and decentralized algorithms, respectively; * indicates that HD-ETC relies on knowledge of the minimum gap)

Designing an algorithm that achieves time-independent regret with a number of hints linear in $T$ is straightforward. However, reducing the queried hints to a number sub-linear in $T$ is challenging. To tackle this, we first analyze the fundamental limits of hint complexity in the centralized setup and propose GP-HCLA, a fine-tuned algorithm based on the advanced kl-UCB algorithm (Cappé et al. 2013), which achieves asymptotically optimal hint complexity (Appendix H). In decentralized setups, the essence of communication lies in the absence of a central decision-maker, requiring collision-based signaling (Wang et al. 2020), where making or avoiding a collision encodes a 1 or 0 information bit. This method introduces inaccuracies from sending decimal statistics in binary and additional regret due to delayed exploration while balancing communication to determine the optimal matching.
To address these challenges, we propose HD-ETC and EBHD-ETC, which achieve relatively similar bounds on hint complexity and regret. These algorithms use a round-robin hint-querying strategy combined with an Explore-then-Commit (Garivier, Ménard, and Stoltz 2019) approach until a stopping condition is met. Finally, we discuss the optimality of the results in both centralized and decentralized setups in Appendix H.

1.1 Contributions

For the centralized setup (Section 3), we propose two algorithms: HCLA and GP-HCLA. Both use empirical means to select a matching to pull and kl-UCB indices (Cappé et al. 2013) to identify another matching, querying a hint if the latter has a higher value. Additionally, we analyze an intermediate algorithm, G-HCLA (Appendix A), which operates similarly to HCLA but differs from GP-HCLA in how it selects the matching to hint after deciding to query. As summarized in Table 1, both GP-HCLA and G-HCLA, extensions of HCLA, achieve time-independent regret with an asymptotically optimal number of hints. We further prove that the upper bound on the hint complexity of GP-HCLA is tight, with both GP-HCLA and G-HCLA matching the established lower bounds.

In the decentralized setup (Section 4), we introduce two algorithms: HD-ETC and EBHD-ETC. Both divide the time horizon into three phases: exploration, communication, and exploitation, with a key difference in how they transition to the exploitation phase. In the exploitation phase, no further communication, exploration, or hint querying occurs, and the two algorithms handle this transition differently. In HD-ETC, agents know the minimum gap (the smallest utility difference between the optimal matching and any other matching) and the time horizon $T$, allowing them to switch to exploitation at a fixed time step $T_0$. Conversely, EBHD-ETC does not require this knowledge, using an edge-elimination strategy to determine the transition point, which makes it a random variable.
This results in slightly higher hint queries and regret compared to HD-ETC. We provide regret bounds for both algorithms that align with the lower bounds, accounting for uncertainties due to delayed communication.

1.2 Related Works

Heterogeneous MMAB (HMMAB). HMMAB is one of the standard models in the multi-player multi-armed bandits with collisions literature; to name a few, Rosenski, Shamir, and Szlak (2016); Boursier and Perchet (2019); Mehrabian et al. (2020); Bistritz and Leshem (2018); Shi et al. (2021). Among them, Bistritz and Leshem (2018) were the first to study HMMAB, proposing a decentralized algorithm with $O(\log^2 T)$ regret. Later on, the regret bound for this model was improved to $O(M^3K \log T)$ by Mehrabian et al. (2020) and further to $O(M^2K \log T)$ by Shi et al. (2021), which is the state-of-the-art result. We are the first to introduce the hint mechanism to HMMAB.

Bandits with Hints. Learning algorithms with hints (or predictions) are part of the emerging literature on learning-augmented methods, as seen in works like Lykouris and Vassilvitskii (2021); Purohit, Svitkina, and Kumar (2018); Mitzenmacher and Vassilvitskii (2022). The hint mechanism was initially explored in the basic stochastic multi-armed bandits model by Yun et al. (2018). Later, Lindståhl, Proutiere, and Johnsson (2020) examined a more realistic hint mechanism, which includes failure noise, for the same model. Additionally, Bhaskara et al. (2023) investigated the impact of hints in adversarial bandits. We are the first to study the hint mechanism in a multi-agent scenario.

2 Hinted Heterogeneous Multi-Agent Multi-Armed Bandits

Basic model. A Hinted Heterogeneous Multi-Agent Multi-Armed Bandit (HMA2B) model consists of a set $\mathcal{K}$ of $K$ arms and a set $\mathcal{M}$ of $M$ agents, such that $M < K$. Agents have heterogeneous rewards for arms. That is, for each agent $m \in \mathcal{M}$, each arm $k \in \mathcal{K}$ is associated with a Bernoulli reward random variable $X_{m,k}$ with mean $\mu_{m,k} := \mathbb{E}[X_{m,k}]$.
The heterogeneous reward means are represented by a matrix $\mu \in [0,1]^{M \times K}$, where each of its rows is denoted by $\mu_m = (\mu_{m,k})_{k \in \mathcal{K}} \in [0,1]^K$.

Reward feedback. Suppose that $T \in \mathbb{N}^+$ denotes the total number of decision rounds. At each time step $t \in \{1, 2, \ldots, T\}$, every agent $m$ chooses an arm $k_m(t)$ to pull. The arms requested by the agents construct a bipartite graph with $M$ nodes (agents) on one side and $K$ nodes (arms) on the other, comprising $M$ edges, such that each node on the agent side is connected to exactly one arm. Let $\mathcal{G}$ be the set of all such graphs, and denote by $G(t) := (m, k_m(t))_{m \in \mathcal{M}}$ the bipartite graph representing the arm pulling of the agents at time step $t$. We consider the collision setting (Boursier and Perchet 2019; Shi et al. 2021): if any other agent pulls the arm $k_m(t)$ at time step $t$, then agent $m$ gets a reward of zero; otherwise, agent $m$ gets a reward $X_{m,k_m(t)}(t)$ sampled from the reward distribution of arm $k_m(t)$, or formally,
$$r_m(t) := X_{m,k_m(t)}(t)\,\mathbb{1}\{\nexists\, m' \neq m : k_{m'}(t) = k_m(t)\}.$$
This induces the optimal action to be a matching. Given a matching $G \in \mathcal{G}$ and a reward mean matrix $\mu \in [0,1]^{M \times K}$, we define the expected utility as
$$U(G; \mu) := \sum_{m \in \mathcal{M}} \mu_{m,k^G_m}\,\mathbb{1}\{\nexists\, m' \neq m : k^G_{m'} = k^G_m\},$$
where $k^G_m$ denotes the matched arm of agent $m$ under matching $G$. We denote the matching with the highest utility as the optimal matching $G^* := \arg\max_{G \in \mathcal{G}} U(G; \mu)$. We assume that $G^*$ is unique, i.e., there does not exist any $G \neq G^*$ in $\mathcal{G}$ such that $U(G; \mu) = U(G^*; \mu)$.

Hint mechanism. At each time slot $t$, besides the pulled arm $k_m(t)$, agent $m$ can query another arm $k^{hint}_m(t)$ and observe that arm's reward realization $X_{m,k^{hint}_m(t)}(t)$ without regret cost. The hint graph is denoted by $G^{hint}(t)$, and $k^{G^{hint}(t)}_m$ is the arm agent $m$ queried a hint for in it.
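To make the collision reward model concrete, here is a minimal sketch (not the paper's code; the function name and the toy means are illustrative) of the expected utility $U(G;\mu)$: an agent contributes its mean reward only when no other agent pulls the same arm.

```python
# Expected utility U(G; mu) under the collision model: an agent earns its
# mean reward only if its arm is pulled by no other agent.
from collections import Counter

def expected_utility(G, mu):
    """G maps agent index -> arm index; mu[m][k] is agent m's mean on arm k."""
    pulls = Counter(G.values())
    # Agents on an arm pulled more than once contribute zero (collision).
    return sum(mu[m][k] for m, k in G.items() if pulls[k] == 1)

mu = [[0.9, 0.1],
      [0.8, 0.7]]
print(expected_utility({0: 0, 1: 0}, mu))  # both on arm 0: collision, zero utility
print(expected_utility({0: 0, 1: 1}, mu))  # matching (0->0, 1->1): 0.9 + 0.7
```

Note that any action profile with a collision is dominated by a collision-free one, which is why the optimal action is a matching.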
These hint observations do not impact the cumulative reward and regret, and the agent can decide whether to query for a hint, denoted by the indicator $\ell^\pi_m(t) := \mathbb{1}\{\text{agent } m \text{ queries a hint at } t \text{ under policy } \pi\}$. We denote by $L_\pi(T) := \mathbb{E}\big[\sum_{m \in \mathcal{M}} \sum_{t=1}^{T} \ell^\pi_m(t)\big]$ the total number of hints queried by the agents, and we want to design a learning policy $\pi$ that minimizes $L_\pi(T)$ while maintaining low regret.

Regret. We aim to find a policy $\pi$ that maximizes the cumulative reward of all agents by determining $G(t)$ at each time step for $T$ rounds. To evaluate the performance of $\pi$, we define the regret of a policy as the difference between the total reward of all agents under the optimal matching $G^*$ in all decision rounds and the total reward of all agents following the policy $\pi$:
$$R_\pi(T) := \sum_{t=1}^{T} \big( U(G^*; \mu) - \mathbb{E}[U(G(t); \mu)] \big), \qquad (1)$$
where the expectation is taken over the randomness of the policy $\pi$. Last, we define an important parameter, the minimum gap, which is crucial and appears in our regret analysis. The minimum gap represents the minimum difference between the utility of any matching $G$ and $G^*$, i.e., $\Delta^{match}_{min} := \min_{G \in \mathcal{G},\, G \neq G^*} U(G^*; \mu) - U(G; \mu)$.

Main goal and motivating examples. Our goal is to design learning policies that use hints (one per agent per round) to reduce the large regret bounds established in previous works (Shi et al. 2021; Mehrabian et al. 2020; Wang et al. 2020; Boursier and Perchet 2019) to a preferably time-independent regret, while minimizing the number of hints queried. We assume that hints are sampled from the same distributions as the rewards from pulling arms. Our algorithms query these hints strategically, only when exploring a sub-optimal matching is necessary before committing to the optimal one. This approach minimizes the costs of direct exploration and improves performance by separating the exploration of suboptimal matchings from the exploitation of the optimal one. In practical scenarios, hints are typically much cheaper than direct actions.
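The regret of Eq. (1) can be sketched numerically. The snippet below (an illustrative sketch, not the paper's code) finds $G^*$ by brute force over permutations, which is only viable for tiny $M$ and $K$; the paper's algorithms use the Hungarian algorithm instead.

```python
# Pseudo-regret of Eq. (1) for a trajectory of matchings, with the optimal
# matching found by brute force over injective agent->arm assignments.
from itertools import permutations

def utility(match, mu):
    # match[m] is the arm of agent m; injective, so no collisions occur.
    return sum(mu[m][k] for m, k in enumerate(match))

def optimal_matching(mu):
    M, K = len(mu), len(mu[0])
    return max(permutations(range(K), M), key=lambda g: utility(g, mu))

def pseudo_regret(trajectory, mu):
    u_star = utility(optimal_matching(mu), mu)
    return sum(u_star - utility(g, mu) for g in trajectory)

mu = [[0.9, 0.1, 0.5],
      [0.8, 0.7, 0.2]]
traj = [(1, 0), (0, 1), (0, 1)]   # only the first round is suboptimal
print(pseudo_regret(traj, mu))    # regret accrues only in round one
```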
For instance, in labor markets, a low-cost interview process can provide valuable insights into candidate suitability without the high costs of hiring mistakes. Similarly, in communication networks, using test signals to estimate bandwidth needs can prevent wasting high-quality channels on low-demand stations. These examples demonstrate how the hint-based approach in HMA2B can improve decision-making across various applications.

3 Algorithms for Centralized Hinted Heterogeneous Multi-Armed Bandits

In the Centralized Hinted Heterogeneous Multi-Armed Bandit (C_HMA2B) setup, we consider an omniscient decision-maker who selects both the matching and the hint graph at each round. The agents then follow the decision-maker's instructions to pull arms and query hints. We propose two learning policies for this setup: the Hinted Centralized Learning Algorithm (HCLA) and the Generalized Projection-based Hinted Centralized Learning Algorithm (GP-HCLA). Under both policies, the decision-maker treats each matching $G \in \mathcal{G}$ as a super arm for hint inquiries. However, the handling of observations differs between the two: in HCLA, observations are maintained for each matching, while in GP-HCLA, they are treated at the edge level. This distinction allows us to reduce the potentially exponential regret relative to the size of $\mathcal{G}$ to a polynomial regret upper bound in the number of edges, $MK$. We first introduce the statistics maintained by agents in HCLA and GP-HCLA, aiding the central decision-maker in deciding when and how to query hints. Next, we describe HCLA as a baseline for designing GP-HCLA, our main algorithm. We also present an intermediate algorithm, G-HCLA, as a direct extension of HCLA. Finally, we detail GP-HCLA, which requests hints more efficiently than G-HCLA. G-HCLA is further discussed in Appendix A.

Figure 1: Set of covering matchings $\mathcal{R}$ for $M = 3$ and $K = 4$: $R_1$, $R_2$, $R_3$, and $R_4$ are depicted in (a), (b), (c), and (d).
3.1 Preliminaries

Beyond the generic empirical-means matrix $\hat\mu$, the decision-maker employs kl-UCB indices $\bar d$ (Cappé et al. 2013) as upper confidence bounds for $\mu$ in the C_HMA2B setup to determine when to query a hint. These indices are defined as:
$$\hat\mu_G(t) := \frac{\sum_{t'=1}^{t} \mathbb{1}\{G(t') = G\}\, U(G(t'); r(t'))}{N^\pi_G(t)}, \qquad (2)$$
$$\hat\mu_{m,k}(t) := \frac{\sum_{t'=1}^{t} \mathbb{1}\{(m,k) \in G(t')\}\, r_m(t')}{N^\pi_{m,k}(t)}, \qquad (3)$$
$$\bar d_G(t) := \sup\{q \geq 0 : N^\pi_G(t)\, \mathrm{kl}(\hat\mu_G(t), q) \leq f(t)\}, \qquad (4)$$
$$\bar d_{m,k}(t) := \sup\{q \geq 0 : N^\pi_{m,k}(t)\, \mathrm{kl}(\hat\mu_{m,k}(t), q) \leq f(t)\}, \qquad (5)$$
for any matching $G \in \mathcal{G}$ and edge $(m,k) \in \mathcal{M} \times \mathcal{K}$, respectively, where $\mathrm{kl}$ is the Kullback-Leibler divergence and $f(t) = \log t + 4 \log \log t$. Here, $N^\pi_G(t)$ and $N^\pi_{m,k}(t)$ represent the number of times matching $G$ or edge $(m,k)$ has been pulled or hinted.

Before detailing the algorithm, we define a fixed set $\mathcal{R} := \{R_1, \ldots, R_K\}$ of $K$ pairwise edge-disjoint matchings that cover all edges $(m,k) \in \mathcal{M} \times \mathcal{K}$, referred to as covering matchings. For uniquely labeled agents and arms in $[M]$ and $[K]$, $R_i \in \mathcal{R}$ is the matching where agent $m$ is paired with arm $(m + i - 1) \bmod K$, as shown in Figure 1 for $M = 3$ and $K = 4$. By pulling or hinting each covering matching in $\mathcal{R}$ at least once, agents observe every edge, and hence every $G \in \mathcal{G}$, at least once. This set serves as a hint pool, from which all hint graphs $G^{hint}$ will be selected.

3.2 Warm-up: The HCLA Algorithm

As noted earlier, the HCLA algorithm treats each $G \in \mathcal{G}$ as a super arm and maintains separate statistics: empirical mean $\hat\mu_G(t)$, kl-UCB index $\bar d_G(t)$, and counters $N^{HCLA}_G(t)$. At each time step $t$, the central decision-maker selects a matching $G(t)$ with the maximum empirical mean $\hat\mu_G(t)$ and another matching $G'(t)$ with the maximum $\bar d_G(t)$ (Lines 3-4). If $\bar d_{G'(t)}(t) > \hat\mu_{G(t)}(t)$, the decision-maker chooses $G^{hint}(t)$ as either $G'(t)$ or a uniformly-at-random chosen matching $G^{hint}_2(t)$, each with probability 1/2.
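The kl-UCB indices of Eqs. (4)-(5) have no closed form, but for Bernoulli rewards they can be computed by bisection, since $\mathrm{kl}(\hat\mu, q)$ is increasing in $q \geq \hat\mu$. A minimal sketch (function names are illustrative, not from the paper's code):

```python
# Bernoulli kl-UCB index: the largest q with N * kl(mu_hat, q) <= f(t),
# found by bisection over q in [mu_hat, 1] (Cappe et al. 2013).
import math

def kl_bernoulli(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n, t, iters=60):
    if n == 0:
        return 1.0                      # unexplored edge: fully optimistic index
    f_t = math.log(t) + 4 * math.log(max(math.log(t), 1.0))
    lo, hi = mu_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if n * kl_bernoulli(mu_hat, mid) <= f_t:
            lo = mid                    # still within the confidence budget
        else:
            hi = mid
    return lo

print(kl_ucb_index(0.5, 10, 100))       # index above 0.5; shrinks as n grows
```

The index exceeds the empirical mean and tightens toward it as the observation count $N^\pi_{m,k}(t)$ increases, which is exactly what drives HCLA's hint-query condition $\bar d_{G'(t)}(t) > \hat\mu_{G(t)}(t)$.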
It then queries a hint from $G^{hint}(t)$ and updates $\hat\mu_{G^{hint}(t)}(t+1)$ based on the hint observation (Lines 5-10). Finally, the decision-maker pulls $G(t)$, updates $\hat\mu_{G(t)}(t+1)$ with the reward observation, and recalculates $\bar d_G(t+1)$ for all $G \in \mathcal{G}$ (Lines 11-13). The detailed pseudocode of the HCLA algorithm is provided in Algorithm 1.

Algorithm 1: Hinted Centralized Learning Algorithm (HCLA)
Input: agent set $\mathcal{M}$, arm set $\mathcal{K}$, number of agents $M$, matching set $\mathcal{G}$, time horizon $T$
1: Initialization: $t \leftarrow 0$, $\hat\mu_G(t) \leftarrow 0$, $\bar d_G(t) \leftarrow 0$, $N_G(t) \leftarrow 0$ for each matching $G \in \mathcal{G}$
2: for $t \in [T]$ do
3:   $G(t) \leftarrow \arg\max_{G \in \mathcal{G}} \hat\mu_G(t)$
4:   $G'(t) \leftarrow \arg\max_{G \in \mathcal{G}} \bar d_G(t)$
5:   if $\bar d_{G'(t)}(t) > \hat\mu_{G(t)}(t)$ then
6:     $G^{hint}_1(t) \leftarrow G'(t)$
7:     $G^{hint}_2(t) \leftarrow$ pick a matching out of $\mathcal{G}$ uniformly at random
8:     $G^{hint}(t) \leftarrow G^{hint}_1(t)$ w.p. 1/2, or $G^{hint}_2(t)$ w.p. 1/2
9:     Each agent $m$ asks for a hint from $k^{G^{hint}(t)}_m$
10:    Update $\hat\mu_{G^{hint}(t)}(t+1)$ according to the observation of $G^{hint}(t)$
11:   Each agent $m$ pulls $k^{G(t)}_m$
12:   Update $\hat\mu_{G(t)}(t+1)$ according to the reward observation of $G(t)$
13:   Update $\bar d_G(t+1)$ for all $G \in \mathcal{G}$

Next, we present the upper bounds on the time-independent regret and the number of queried hints $L_{HCLA}(T)$ for the HCLA algorithm in Theorem 1. The detailed proof is presented in Appendix B.1.

Theorem 1. For $0 < \delta < \Delta^{match}_{min}/2$ and policy $\pi = $ HCLA, the policy $\pi$ has
1. time-independent regret $R_\pi(T) \leq O(MK^{2M})$,
2. hint complexity $L_\pi(T) \leq O\big(\frac{MK^M \log T}{\mathrm{kl}}\big)$,
where $\mathrm{kl} = \mathrm{kl}(U(G^*; \mu) - \Delta^{match}_{min} + \delta,\; U(G^*; \mu) - \delta)$.

The regret of HCLA is time-independent, but the exponential constants in its regret and hint upper bounds are unsatisfactory. To address this, we propose a new algorithm called GP-HCLA, which provides a more refined analysis while maintaining the same hint-inquiry and arm-pulling approach but using observations differently.

3.3 The GP-HCLA Algorithm

We present GP-HCLA in Algorithm 2. The GP-HCLA algorithm follows steps similar to HCLA to identify $G(t)$ and $G'(t)$.
However, unlike HCLA, the central decision-maker maintains statistics $\hat\mu_{m,k}(t)$ and $\bar d_{m,k}(t)$ for each edge $(m,k) \in \mathcal{M} \times \mathcal{K}$. It then defines $\bar d_G(t) := \sum_{(m,k) \in G} \bar d_{m,k}(t)$, with a slight abuse of notation, enabling the use of the Hungarian algorithm (Kuhn 1955), which finds the matching with maximum additive utility in a weighted bipartite graph.

Algorithm 2: Generalized Projection-based Hinted Centralized Learning Algorithm (GP-HCLA)
Input: agent set $\mathcal{M}$, arm set $\mathcal{K}$, time horizon $T$
1: Initialization: $t \leftarrow 0$, $\hat\mu_m(t) \leftarrow 0$, $\bar d_m(t) \leftarrow 0$, $N^{GP\text{-}HCLA}_m(t) \leftarrow 0$ for each agent $m \in \mathcal{M}$
2: for $t \in [T]$ do
3:   $G(t) \leftarrow$ Hungarian($\hat\mu(t)$)
4:   $G'(t) \leftarrow$ Hungarian($\bar d(t)$)
5:   if $U(G'(t); \bar d(t)) > U(G(t); \hat\mu(t))$ then
6:     $(m,k) \leftarrow \arg\min_{(m',k') \in G'(t)} N^{GP\text{-}HCLA}_{m',k'}(t)$
7:     $G^{hint}_1(t) \leftarrow \{R \in \mathcal{R} : (m,k) \in R\}$
8:     $G^{hint}_2(t) \leftarrow$ pick a matching out of $\mathcal{R}$ uniformly at random
9:     $G^{hint}(t) \leftarrow G^{hint}_1(t)$ w.p. 1/2, or $G^{hint}_2(t)$ w.p. 1/2
10:    Each agent $m$ asks for a hint from $k^{G^{hint}(t)}_m$
11:   Each agent $m$ pulls $k^{G(t)}_m$
12:   Update $\hat\mu_m(t+1)$, $N^{GP\text{-}HCLA}_m(t+1)$, and $\bar d_m(t+1)$ for each agent $m$ according to new observations

Accordingly, GP-HCLA uses Hungarian to compute $G(t)$ and $G'(t)$, where the weights of the edges $(m,k)$ are $\hat\mu_{m,k}(t)$ and $\bar d_{m,k}(t)$, respectively (Lines 3-4). The decision-maker employs a distinctly different approach from HCLA for selecting $G^{hint}(t)$ and updating the statistics after each observation, whether from pulling an arm or querying a hint. As in HCLA, the algorithm queries for a hint if $U(G'(t); \bar d(t)) > U(G(t); \hat\mu(t))$ (Line 5). However, instead of querying a hint directly from $G'(t)$, GP-HCLA projects $G'(t)$ onto a matching in $\mathcal{R}$, the set of pairwise edge-disjoint covering matchings. By projection, we mean mapping $G'(t) \in \mathcal{G}$ to a matching in $\mathcal{R}$, which contains only $K$ covering matchings and is exponentially smaller. During this step, the algorithm selects the matching $G^{hint}_1(t)$ from $\mathcal{R}$ that contains the edge $(m,k) \in G'(t)$ with the smallest $N^{GP\text{-}HCLA}_{m,k}(t)$, and a second matching $G^{hint}_2(t)$, chosen uniformly at random from $\mathcal{R}$.
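The covering-matching pool and the projection step just described can be sketched as follows (an illustrative sketch, not the paper's implementation; `covering_matchings` uses a 0-indexed version of the paper's $(m + i - 1) \bmod K$ rule, and `counts` stands in for the $N^{GP\text{-}HCLA}_{m,k}$ counters):

```python
# GP-HCLA's projection: build the K covering matchings, then map the kl-UCB
# matching G' to the covering matching containing its least-observed edge.
import random

def covering_matchings(M, K):
    # R_i pairs agent m with arm (m + i) mod K (0-indexed); the K matchings
    # are pairwise edge-disjoint and jointly cover all M*K edges.
    return [tuple((m + i) % K for m in range(M)) for i in range(K)]

def project_hint(G_prime, counts, M, K, rng=random):
    R = covering_matchings(M, K)
    # Least-observed edge (m, k) of G'.
    m, k = min(enumerate(G_prime), key=lambda mk: counts[mk[0]][mk[1]])
    hint1 = next(r for r in R if r[m] == k)   # covering matching through (m, k)
    hint2 = rng.choice(R)                     # uniformly random covering matching
    return hint1 if rng.random() < 0.5 else hint2

R = covering_matchings(3, 4)
# 4 matchings * 3 edges each = all 12 distinct edges of the 3x4 bipartite graph.
assert len(R) == 4 and len({(m, r[m]) for r in R for m in range(3)}) == 12
```

Projecting onto the pool of $K$ covering matchings, rather than hinting $G'(t)$ directly, is what shrinks the set of possible hint graphs from exponentially many to $K$.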
The hint graph $G^{hint}(t)$ is then set to either $G^{hint}_1(t)$ or $G^{hint}_2(t)$, each with probability 1/2 (Lines 6-10). In Theorem 2, we provide the bound on the regret and the asymptotically optimal bound on the number of hints. The detailed proof is presented in Appendix B.2.

Theorem 2. For $0 < \delta < \Delta^{match}_{min}/2$ and policy $\pi = $ GP-HCLA, the policy $\pi$ has
1. time-independent regret $R_\pi(T) \leq O(M^4K)$,
2. hint complexity $L_\pi(T) \leq O\big(\frac{MK \log T}{\mathrm{kl}}\big)$,
where $\mathrm{kl} = \mathrm{kl}(U(G^*; \mu) - \Delta^{match}_{min} + \delta,\; U(G^*; \mu) - \delta)$.

Theorem 2 highlights the impact of maintaining edge-level statistics in GP-HCLA, reducing the exponential time-independent regret bound to a polynomial one. It also shows that projection in hint inquiries minimizes hints, achieving asymptotic optimality (matching the lower bound given by Theorem 7 in the Appendix). We study G-HCLA, an extension of HCLA, which updates statistics like GP-HCLA but skips projection, using $G^{hint}(t)$ as in HCLA. Theorem 6 (Appendix) shows G-HCLA can have up to $M$-times higher hint complexity than GP-HCLA, highlighting the importance of projection. Experiments (Appendix H, Figure 3b) confirm GP-HCLA outperforms G-HCLA on small problem instances. The exact tightness of this gap remains open due to the complexity of the kl-UCB index.

4 Algorithms for Decentralized Hinted Heterogeneous Multi-Armed Bandits

We study the Decentralized Hinted Heterogeneous Multi-Armed Bandits (D_HMA2B) setting, where no central decision-maker coordinates agents to avoid collisions while learning the optimal matching $G^*$. Theorem 3 demonstrates that sub-linear regret is unattainable in a decentralized setup without agents sharing statistics, making communication essential in D_HMA2B. To enable communication, agents intentionally collide to exchange statistics such as $\hat\mu$, while non-colliding agents continue pulling their assigned arms $k^G_m$ from the matching $G \in \mathcal{G}$ without interference (Shi et al. 2021; Wang et al. 2020). Communication order is determined by unique agent ranks, as discussed below.
Theorem 3 (Necessity of Communication). No decentralized learning algorithm can achieve sub-linear instance-independent regret in HMA2B without communication.

Building on Theorem 3, we propose a cooperative learning framework for the Hinted Decentralized Explore-then-Commit (HD-ETC) and Elimination-Based Hinted Decentralized Explore-then-Commit (EBHD-ETC) algorithms. These divide time into Initialization, Exploration, and Communication phases, where agents request hints in a round-robin manner until meeting a stopping condition, after which they transition to the Exploitation phase with no further hints or communication.

4.1 Decentralized Learning Framework

We first outline the common framework for the HD-ETC and EBHD-ETC algorithms. Both divide the $T$ decision-making rounds into alternating exploration and communication phases. A counter $\rho$ tracks exploration epochs, and $N^\rho_{m,k}$ records the number of times agent $m$ has pulled or hinted arm $k$ by the start of epoch $\rho$. The decentralized learning framework for HMA2B consists of four phases:

Initialization phase: Assigning unique ranks among the agents. The detailed rank-assignment procedure and analysis follow Wang et al. (2020) (detailed in Appendix D).

Exploration phase: Agents use the gathered statistics to identify the best matching $G^\rho$ at the start of each epoch $\rho$ using the Hungarian algorithm. They then commit to their corresponding arm $k^{G^\rho}_m$ for $K$ rounds until the epoch ends. At the end of epoch $\rho$, agents signal the communication phase by creating collisions on arms pulled by other agents.

Communication phase: Before each exploration epoch $\rho$, agents transmit their statistics $\hat\mu$ to others, denoted as $\hat\mu^\rho$. This communication is realized via intentional collision signals, where a collision represents a 1 and its absence a 0 information bit (Boursier and Perchet 2019), with agents relying on their unique ranks to identify senders and receivers.
Since $\hat\mu^\rho$ often contains decimal values, agents transmit a quantized version, $\tilde\mu^\rho$, optimized for binary communication at the cost of minor information loss. To further reduce communication length and minimize information loss, agents employ differential communication (Shi et al. 2021), sending only the differences $\tilde\delta^\rho = \tilde\mu^\rho - \tilde\mu^{\rho-1}$ at the start of epoch $\rho$. This method reduces communication-induced regret through $t^\rho_{com} \leq O(M^2K)$ communication rounds. It enables agents to synchronize actions and exchange critical information efficiently via the Send2All($\tilde\delta^\rho$) routine, detailed in Appendix E.

Exploitation phase: Agents stop communicating, exploring, and querying for hints after a specific time $T^\pi_0$, which depends on the policy $\pi$ being used. After that, they agree on a matching $G$ and commit to it for the rest of the time.

Unlike Shi et al. (2021), who employ exponentially increasing exploration-epoch lengths summing to $O(\log T)$ epochs, our approach simplifies this by assigning each epoch the same length $K$, resulting in potentially $O(T/K)$ epochs. However, with stopping conditions, our algorithms reduce the number of epochs and transition to the exploitation phase while maintaining $O(\log T)$ exploration epochs.

Hint inquiry mechanism. HD-ETC and EBHD-ETC employ a round-robin approach for querying hints, setting $G^{hint}(t) = R_{(t \bmod K)+1}$, and follow an Explore-then-Commit exploration style. By evenly distributing hint queries over $K$ rounds, this method reduces communication costs and prevents time-dependent regret. In comparison, a decentralized HCLA would query hints on demand, requiring constant communication and potentially incurring linear regret.

Regret decomposition. We decompose the regret as follows to analyze its components separately:
$$R_\pi(T) = R_\pi^{rank}(T) + R_\pi^{exp}(T) + R_\pi^{com}(T),$$
where $R_\pi^{rank}(T)$, $R_\pi^{exp}(T)$, and $R_\pi^{com}(T)$ represent the regret due to rank assignment, exploration, and communication, respectively, under policy $\pi$.
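The round-robin hint schedule $G^{hint}(t) = R_{(t \bmod K)+1}$ needs no coordination beyond a shared clock, since every agent can compute the same covering matching locally. A minimal sketch (0-indexed, with an illustrative `covering_matchings` helper):

```python
# Round-robin hint schedule of HD-ETC / EBHD-ETC: round t queries the
# covering matching indexed by t mod K, so every one of the M*K edges
# is hinted exactly once in any window of K consecutive rounds.
def covering_matchings(M, K):
    return [tuple((m + i) % K for m in range(M)) for i in range(K)]

def hint_graph(t, R):
    return R[t % len(R)]

R = covering_matchings(2, 3)
schedule = [hint_graph(t, R) for t in range(6)]
# Over any window of K = 3 rounds, all 2 * 3 = 6 edges are covered once.
edges = {(m, g[m]) for g in schedule[:3] for m in range(2)}
assert len(edges) == 6
```

Because the schedule is deterministic and identical across agents, the hint phase adds no communication cost of its own; only the quantized statistics $\tilde\mu^\rho$ need to be exchanged.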
Under this framework, we introduce the HD-ETC and EBHD-ETC algorithms in the following sections.

4.2 Warm-Up: The HD-ETC Algorithm

The HD-ETC algorithm builds on the learning framework in Section 4.1, extending the Explore-then-Commit (ETC) method from the bandits literature. Following this method, agents uniformly query hints for covering matchings $R \in \mathcal{R}$ (Lines 9-11) until time step $T^{HD\text{-}ETC}_0$, determined by the assumed knowledge of $\Delta^{match}_{min}$. At $T^{HD\text{-}ETC}_0$, where $\rho$ is the index of the last exploration epoch, agents run Hungarian($\tilde\mu^\rho$) to identify the matching $G$, which they commit to for all $t > T^{HD\text{-}ETC}_0$, i.e., $G(t) = G$ (Lines 18-20). Theorem 4 establishes that with a properly chosen $T^{HD\text{-}ETC}_0$, which depends on $\Delta^{match}_{min}$, the algorithm achieves time-independent exploration regret while ensuring asymptotically optimal hint and communication usage. Detailed proofs are provided in Appendix F.1.

Algorithm 3: Hinted Decentralized Explore-then-Commit (HD-ETC): agent $m$
Input: agent $m$, agent set $\mathcal{M}$, arm set $\mathcal{K}$, number of agents $M$, time horizon $T$, time threshold for hint inquiry $T^{HD\text{-}ETC}_0$
1: Initialization: $t \leftarrow 0$, $\rho \leftarrow 0$, $\hat\mu_m(t) \leftarrow 0$, $N^{HD\text{-}ETC}_m(t) \leftarrow 0$, $\tilde\mu^\rho_m \leftarrow 0$ for each $m \in \mathcal{M}$
2: while $t < T^{HD\text{-}ETC}_0$ do
3:   for each epoch $\rho$ do
4:     $G^\rho \leftarrow$ Hungarian($\tilde\mu^\rho$)
5:     $t_0 \leftarrow t$    ▷ Exploration Phase
6:     for $t \leq t_0 + K$ do
7:       Pull the arm $k^{G^\rho}_m$
8:       Update $\hat\mu_{m,k^{G^\rho}_m}(t+1)$ and $N^{HD\text{-}ETC}_{m,k^{G^\rho}_m}(t+1)$    ▷ Hint Inquiry
9:       $G^{hint}(t) \leftarrow R_{(t \bmod K)+1}$
10:      Ask for a hint from $k^{G^{hint}(t)}_m$
11:      Update $\hat\mu_{m,k^{G^{hint}(t)}_m}(t+1)$ and $N^{HD\text{-}ETC}_{m,k^{G^{hint}(t)}_m}(t+1)$
12:      $t \leftarrow t + 1$    ▷ Communication Phase
13:    for $k \in [K]$ do
14:      $\tilde\delta^{\rho+1}_{m,k} \leftarrow \tilde\mu^{\rho+1}_{m,k} - \tilde\mu^\rho_{m,k}$
15:      Send2All($\tilde\delta^{\rho+1}_{m,k}$)
16:    $t \leftarrow t + t^{\rho+1}_{com}$
17:    $\rho \leftarrow \rho + 1$    ▷ Exploitation Phase
18: $G \leftarrow$ Hungarian($\tilde\mu^\rho$)
19: while $t \leq T$ do
20:   Pull the arm $k^G_m$

Theorem 4. Assuming knowledge of the minimum gap $\Delta^{match}_{min}$, for the policy $\pi = $ HD-ETC and $T^\pi_0 = \frac{9M^2K \log(2MT)}{(\Delta^{match}_{min})^2}$, the policy $\pi$ has
1. exploration regret $R_\pi^{exp}(T) \leq O(M^3K^2)$,
2. hint complexity $L_\pi(T) \leq O(M T^\pi_0)$,
3. communication regret $R_\pi^{com}(T) \leq O(M^2 T^\pi_0)$.
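The switch time in Theorem 4 is a closed-form function of the problem parameters, so each agent can compute it locally once $\Delta^{match}_{min}$ is assumed known. A sketch (the function name and the rounding up to an integer round are ours):

```python
# HD-ETC switch time from Theorem 4:
#   T0 = 9 * M^2 * K * log(2 * M * T) / (Delta_min)^2,
# after which agents commit to Hungarian(mu_tilde) and stop hinting.
import math

def hd_etc_switch_time(M, K, T, gap_min):
    return math.ceil(9 * M**2 * K * math.log(2 * M * T) / gap_min**2)

# The experimental setting of Section 5 (M = K = 4, gap ~ 0.18, T = 1e5).
T0 = hd_etc_switch_time(M=4, K=4, T=10**5, gap_min=0.18)
print(T0)   # grows only logarithmically in T, hence time-independent regret
```

Since $T^\pi_0$ scales as $\log T$ while the exploitation phase covers the remaining $T - T^\pi_0$ rounds, both the hint complexity $O(MT^\pi_0)$ and the communication regret $O(M^2 T^\pi_0)$ inherit the logarithmic dependence on $T$.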
Although HD-ETC performs well, it assumes agents know $\Delta^{match}_{min}$, an unrealistic requirement in many settings. To overcome this, we propose EBHD-ETC, an elimination-based algorithm that achieves similar bounds without relying on the minimum gap. The simplicity of the Explore-then-Commit structure in HD-ETC necessitates knowledge of $\Delta^{match}_{min}$ to determine the stopping time $T^{HD\text{-}ETC}_0$. Removing this assumption requires a more advanced algorithm design. In the next section, we introduce an elimination-based approach within the decentralized learning framework that operates without this gap assumption.

4.3 The EBHD-ETC Algorithm

In EBHD-ETC, agents transition into the exploitation phase differently compared to HD-ETC. Accordingly, each agent maintains a set of active edges $C^\rho$, which includes edges likely to be in $G^\rho$ for the upcoming epoch $\rho$, initially $C^0 = \{(m,k) \in \mathcal{M} \times \mathcal{K}\}$. The set $C^\rho$ maintained by each agent is the same across all agents, determined by $\tilde\mu$. Agents enter the exploitation phase when $|C^\rho| = M$ (Line 2). Edge removal from $C^\rho$ is guided by the error variable $\epsilon_\rho := \sqrt{\log(1/\eta)/\rho}$, with $\eta = \sqrt{2/(MT^2)}$. At the beginning of each epoch, agents determine $G^\rho$ and check whether there exists a matching $G^\rho_{(m,k)} \neq G^\rho$ such that
$$U(G^\rho; \tilde\mu^\rho) - U(G^\rho_{(m,k)}; \tilde\mu^\rho) \leq 4M\epsilon_\rho, \qquad (6)$$
for all $(m,k) \in C^\rho$. Each $G^\rho_{(m,k)}$ is constructed as $G^\rho_{(m,k)} := \{(m,k)\} \cup$ Hungarian($\tilde\mu^\rho_{-m,-k}$), where $\tilde\mu^\rho_{-m,-k}$ is the matrix $\tilde\mu^\rho$ with row $m$ and column $k$ removed.

Figure 2: Figure 2a plots $R_\pi(T)$ and $R_\pi^{exp}(T)$ for both centralized and decentralized setups. Figures 2b and 2c reflect $L_\pi(T)$ for centralized and decentralized algorithms, respectively, and Figure 2d shows $R_\pi^{com}(T)$ for decentralized algorithms. (Panels: (a) Exploration Regret; (b) Centralized Hint Complexity; (c) Decentralized Hint Complexity; (d) Communication Regret.)
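The construction of $G^\rho_{(m,k)}$ and the elimination test of Eq. (6) can be sketched as follows. This is an illustrative sketch, not the paper's code: it finds matchings by brute force over permutations (viable only for tiny $M$, $K$, whereas the paper uses the Hungarian algorithm), and `edge_active` reports whether the edge survives the test.

```python
# EBHD-ETC's edge test: the best matching forced through edge (m, k) is
# {(m, k)} plus the best matching of the remaining agents over the remaining
# arms; the edge stays active while its utility is within 4*M*eps of the best.
from itertools import permutations

def best_matching(mu, agents, arms):
    best_u, best = float("-inf"), None
    for perm in permutations(arms, len(agents)):
        u = sum(mu[m][k] for m, k in zip(agents, perm))
        if u > best_u:
            best_u, best = u, dict(zip(agents, perm))
    return best, best_u

def edge_active(mu, m, k, eps, M):
    agents, arms = list(range(len(mu))), list(range(len(mu[0])))
    _, u_best = best_matching(mu, agents, arms)
    others = [a for a in agents if a != m]
    rest_arms = [j for j in arms if j != k]
    _, u_rest = best_matching(mu, others, rest_arms)   # row m, column k removed
    u_forced = mu[m][k] + u_rest
    return u_best - u_forced <= 4 * M * eps

mu = [[0.9, 0.1], [0.8, 0.7]]
print(edge_active(mu, 0, 0, eps=0.01, M=2))   # True: edge on the optimal matching
print(edge_active(mu, 0, 1, eps=0.01, M=2))   # False: clearly suboptimal edge drops
```

As $\epsilon_\rho$ shrinks across epochs, edges whose forced matchings fall outside the $4M\epsilon_\rho$ band are eliminated, until only the $M$ edges of the optimal matching remain.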
Agents update C^{ρ+1} by removing the edges (m, k) for which G^ρ_(m,k) does not satisfy (6) (Lines 6-9). This process continues until the stopping condition |C^ρ| = M is met. At that point, agents transition to the exploitation phase, committing for the remainder of the rounds to G*, which corresponds to either Hungarian(µ̄^ρ) or the matching formed by the edges in C^ρ (both are the same matching) (Lines 11-13).

Algorithm 4: Elimination-Based Hinted Decentralized Explore-then-Commit (EBHD-ETC): agent m
Input: agent m, agent set M, arm set K, number of agents M, time horizon T
1: Initialization: t ← 0, ρ ← 0, µ̂_m(t) ← 0, N^{HD-ETC}_m(t) ← 0, µ̄^ρ_m ← 0 for each m ∈ M, C^ρ ← {(m, k) ∈ M × K}
2: while |C^ρ| > M do
3:   G^ρ ← Hungarian(µ̄^ρ)
4:   Execute Lines 6-16 of Algorithm 3
5:   C^{ρ+1} ← ∅
     ▷ Updating Active Edges Set
6:   for (m, k) ∈ C^ρ do
7:     G^ρ_(m,k) ← {(m, k)} ∪ Hungarian(µ̄^ρ_{-m,-k})
8:     if U(G^ρ; µ̄^ρ) − U(G^ρ_(m,k); µ̄^ρ) ≤ 4Mϵ_ρ then
9:       C^{ρ+1} ← C^{ρ+1} ∪ {(m, k)}
10:  ρ ← ρ + 1
     ▷ Exploitation Phase
11: G* ← Hungarian(µ̄^ρ) or G* ← C^ρ
12: while t ≤ T do
13:   Pull the arm k = G*_m

Theorem 5. The policy π = EBHD-ETC has
1. exploration regret R^π_exp(T) ≤ O(M³K²),
2. hint complexity L^π(T) ≤ O(M³K log(T²M) / (Δ^match_min)²),
3. communication regret R^π_com(T) ≤ O(M⁴K log(T²M) / (Δ^match_min)²).

Theorem 5 proves that EBHD-ETC achieves similar, though slightly weaker (in coefficients), bounds compared to HD-ETC, even without the unrealistic assumption of knowing Δ^match_min. A proof is given in Appendix F.2.

5 Experiments

We executed the algorithms HCLA, GP-HCLA, G-HCLA, HD-ETC, and EBHD-ETC with M = 4, K = 4, and Δ^match_min ≈ 0.18, averaging regret and hint complexity over 50 replications of 10⁵ rounds. We also benchmarked the Beacon algorithm (Shi et al. 2021) and compared its regret R^Beacon_exp(T) with ours in Fig. 2a, showing that our hint-augmented algorithms significantly reduce regret. Figs. 2b and 2c demonstrate that our centralized algorithms achieve better hint-query efficiency than HD-ETC and EBHD-ETC while maintaining similar regret. Fig. 2d highlights Beacon's advantage in communication regret, owing to its exponentially growing epoch lengths. These results confirm the effectiveness of the centralized algorithms in balancing regret and hint complexity. Further extended experiments are presented in Appendix G.

6 Conclusion

In this paper, we studied how hints enhance learning in the HMA2B problem with heterogeneous rewards and collisions. We proposed both centralized and decentralized algorithms, analyzing their regret and hint usage. An interesting direction for future work is extending these methods to two-sided matching markets, where both sides have preferences, and ties under collisions are broken according to those preferences. This could enable the design of communication-free decentralized algorithms that learn matchings with the same regret and hint optimality.

Acknowledgments

The work of Mohammad Hajiesmaili is supported by NSF CNS-2325956, CAREER-2045641, CPS-2136199, CNS-2102963, and CNS-2106299. The work of Jinhang Zuo was supported by CityU 9610706. Xuchuang Wang is the corresponding author.

References

Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Anandkumar, A.; Michael, N.; Tang, A. K.; and Swami, A. 2011. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 29(4): 731-745.
Bamas, E.; Maggiori, A.; and Svensson, O. 2020. The primal-dual method for learning augmented algorithms. Advances in Neural Information Processing Systems, 33: 20083-20094.
Bhaskara, A.; Cutkosky, A.; Kumar, R.; and Purohit, M. 2023. Bandit online linear optimization with hints and queries. In International Conference on Machine Learning, 2313-2336. PMLR.
Bistritz, I.; and Leshem, A. 2018. Distributed multi-player bandits - a game of thrones approach. Advances in Neural Information Processing Systems, 31.
Boursier, E.; and Perchet, V. 2019. SIC-MMAB: Synchronisation involves communication in multiplayer multi-armed bandits. Advances in Neural Information Processing Systems, 32.
Cappé, O.; Garivier, A.; Maillard, O.-A.; Munos, R.; and Stoltz, G. 2013. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 1516-1541.
François-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M. G.; Pineau, J.; et al. 2018. An introduction to deep reinforcement learning. Foundations and Trends in Machine Learning, 11(3-4): 219-354.
Garivier, A.; Ménard, P.; and Stoltz, G. 2019. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2): 377-399.
Jouini, W.; Ernst, D.; Moy, C.; and Palicot, J. 2009. Multi-armed bandit based policies for cognitive radio's decision making issues. In 2009 3rd International Conference on Signals, Circuits and Systems (SCS), 1-6. IEEE.
Jouini, W.; Ernst, D.; Moy, C.; and Palicot, J. 2010. Upper confidence bound based decision making strategies and dynamic spectrum access. In 2010 IEEE International Conference on Communications, 1-5. IEEE.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6): 84-90.
Kuhn, H. W. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2): 83-97.
Lindståhl, S.; Proutiere, A.; and Johnsson, A. 2020. Predictive bandits. In 2020 59th IEEE Conference on Decision and Control (CDC), 1170-1176. IEEE.
Liu, K.; and Zhao, Q. 2010. Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing, 58(11): 5667-5681.
Lykouris, T.; and Vassilvitskii, S. 2021. Competitive caching with machine learned advice. Journal of the ACM (JACM), 68(4): 1-25.
Mehrabian, A.; Boursier, E.; Kaufmann, E.; and Perchet, V. 2020.
A practical algorithm for multiplayer bandits when arm means vary among players. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 1211-1221. PMLR.
Mitzenmacher, M.; and Vassilvitskii, S. 2022. Algorithms with predictions. Communications of the ACM, 65(7): 33-35.
Purohit, M.; Svitkina, Z.; and Kumar, R. 2018. Improving online algorithms via ML predictions. Advances in Neural Information Processing Systems, 31.
Rosenski, J.; Shamir, O.; and Szlak, L. 2016. Multi-player bandits - a musical chairs approach. In International Conference on Machine Learning, 155-163. PMLR.
Shi, C.; Xiong, W.; Shen, C.; and Yang, J. 2021. Heterogeneous multi-player multi-armed bandits: Closing the gap and generalization. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems (NeurIPS), 22392-22404.
Wang, P.-A.; Proutiere, A.; Ariu, K.; Jedra, Y.; and Russo, A. 2020. Optimal algorithms for multiplayer multi-armed bandits. In Proceedings of the 2020 International Conference on Artificial Intelligence and Statistics (AISTATS), 4120-4129.
Wu, B.; Chen, T.; Ni, W.; and Wang, X. 2021. Multi-agent multi-armed bandit learning for online management of edge-assisted computing. IEEE Transactions on Communications, 69(12): 8188-8199.
Xu, X.; and Tao, M. 2020. Decentralized multi-agent multi-armed bandit learning with calibration for multi-cell caching. IEEE Transactions on Communications, 69(4): 2457-2472.
Xu, X.; Tao, M.; and Shen, C. 2020. Collaborative multi-agent multi-armed bandit learning for small-cell caching. IEEE Transactions on Wireless Communications, 19(4): 2570-2585.
Yun, D.; Proutiere, A.; Ahn, S.; Shin, J.; and Yi, Y. 2018. Multi-armed bandit with additional observations. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(1): 1-22.