LAGMA: LAtent Goal-guided Multi-Agent Reinforcement Learning

Hyungho Na 1  Il-Chul Moon 1 2

1 Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea. 2 summary.ai, Daejeon, Republic of Korea. Correspondence to: Hyungho Na, Il-Chul Moon.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

In cooperative multi-agent reinforcement learning (MARL), agents collaborate to achieve common goals, such as defeating enemies or scoring a goal. However, learning goal-reaching paths toward such a semantic goal takes a considerable amount of time in complex tasks, and the trained model often fails to find such paths. To address this, we present LAtent Goal-guided Multi-Agent reinforcement learning (LAGMA), which generates a goal-reaching trajectory in latent space and provides a latent goal-guided incentive to transitions toward this reference trajectory. LAGMA consists of three major components: (a) a quantized latent space constructed via a modified VQ-VAE for efficient sample utilization, (b) goal-reaching trajectory generation via an extended VQ codebook, and (c) latent goal-guided intrinsic reward generation to encourage transitions toward the sampled goal-reaching path. The proposed method is evaluated on StarCraft II with both dense and sparse reward settings and on Google Research Football. Empirical results show further performance improvement over state-of-the-art baselines.

1. Introduction

The centralized training and decentralized execution (CTDE) paradigm (Oliehoek et al., 2008; Gupta et al., 2017), especially with the value factorization framework (Sunehag et al., 2017; Rashid et al., 2018; Wang et al., 2020a), has shown success on various cooperative multi-agent tasks (Lowe et al., 2017; Samvelyan et al., 2019). However, in more complex tasks with dense reward settings, such as super hard maps in the StarCraft II Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019), or in sparse reward settings, as well as in Google Research Football (GRF) (Kurach et al., 2020), learning an optimal policy takes a long time, and trained models even fail to achieve a common goal, such as destroying all enemies in SMAC or scoring a goal in GRF. Thus, researchers focus on sample efficiency to expedite training (Zheng et al., 2021) and on encouraging committed exploration (Mahajan et al., 2019; Yang et al., 2019; Wang et al., 2019).

To enhance sample efficiency during training, state space abstraction has been introduced in both model-based (Jiang et al., 2015; Zhu et al., 2021; Hafner et al., 2020) and model-free settings (Grześ & Kudenko, 2008; Tang & Agrawal, 2020; Li et al., 2023). Such sample efficiency can be even more important in sparse reward settings, since trajectories in a replay buffer rarely experience positive reward signals. However, these methods have been studied within single-agent tasks without expanding to multi-agent settings.

To encourage committed exploration, goal-conditioned reinforcement learning (GCRL) (Kaelbling, 1993; Schaul et al., 2015; Andrychowicz et al., 2017) has been widely adopted in single-agent tasks, such as complex path finding with a sparse reward (Nasiriany et al., 2019; Zhang et al., 2020; Chane-Sane et al., 2021; Kim et al., 2023; Lee et al., 2023).
However, the GCRL concept has seen only limited application to multi-agent reinforcement learning (MARL) tasks because of several difficulties: 1) a goal is not explicitly known; only a semantic goal can be identified during training through the reward signal; 2) partial observability and decentralized execution in MARL make it impossible to utilize path planning with global information during execution, allowing such planning only during centralized training; and 3) most MARL tasks seek not the shortest path but a coordinated trajectory, which renders single-agent path planning in GCRL too simplistic for MARL tasks. Motivated by methods employed in single-agent tasks, we consider a general cooperative MARL problem as finding trajectories toward semantic goals in latent space.

Contribution. This paper presents LAtent Goal-guided Multi-Agent reinforcement learning (LAGMA). LAGMA generates a goal-reaching trajectory in latent space and provides a latent goal-guided incentive to transitions toward this reference trajectory during centralized training.

Modified VQ-VAE for quantized embedding space construction: As one measure of efficient sample utilization, we use a Vector Quantized-Variational Autoencoder (VQ-VAE) (Van Den Oord et al., 2017), which projects states to a quantized vector space so that a common latent can serve as a representative for a wide range of the embedding space. However, state distributions in high-dimensional MARL tasks are confined to a small feasible subspace, unlike image generation tasks, whose inputs or states often utilize the full state space. In such cases, only a few quantized vectors are utilized throughout training when adopting the original VQ-VAE. To make the quantized embedding vectors distribute properly over the embedding space of feasible states, we propose a modified learning framework for VQ-VAE with a novel coverage loss.

Goal-reaching trajectory generation with extended VQ codebook: LAGMA constructs an extended VQ codebook to evaluate the states projected to a certain quantized vector and generates a goal-reaching trajectory based on this evaluation. Specifically, during training, we store various goal-reaching trajectories in the quantized latent space. LAGMA then uses them as references to follow during centralized training.

Latent goal-guided intrinsic reward generation: To encourage coordinated exploration toward reference trajectories sampled from the extended VQ codebook, LAGMA presents a latent goal-guided intrinsic reward. The proposed intrinsic reward aims to accurately estimate the TD-target for transitions toward goal-reaching paths, and we provide both theoretical and empirical support.

2. Related Works

State space abstraction for RL. State abstraction groups states with similar characteristics into a single cluster, and it has been effective in both model-based RL (Jiang et al., 2015; Zhu et al., 2021; Hafner et al., 2020) and model-free settings (Grześ & Kudenko, 2008; Tang & Agrawal, 2020). NECSA (Li et al., 2023) adopted a grid-based abstraction of state-action pairs for episodic control and achieved state-of-the-art (SOTA) performance in general single-agent RL tasks. This approach relaxes the inefficient memory usage of conventional episodic control, but it requires an additional dimensionality reduction technique, such as random projection (Dasgupta, 2013), in high-dimensional tasks.
Recently, EMU (Na et al., 2024) presented a semantic embedding for efficient memory utilization, but it still resorts to an episodic buffer, which requires storing both the states and the embeddings. This additional memory usage can be burdensome in tasks with a large state space. In contrast to previous research, we employ VQ-VAE for state embedding and estimate the overall value of abstracted states. In this manner, a sparse or delayed reward signal can be utilized by a broad range of states, particularly those in proximity. In addition, thanks to the discretized embeddings, count-based estimation can be adopted to estimate the value of the states projected to each discretized embedding. We then generate a reference, or goal-reaching, trajectory based on this evaluation in the quantized vector space and provide an incentive for transitions that overlap with this reference.

Intrinsic incentive in RL. In reinforcement learning, balancing exploration and exploitation during training is a paramount issue (Sutton & Barto, 2018). To encourage proper exploration, researchers have presented various methods in the single-agent case, such as modified count-based methods (Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017), prediction error-based methods (Stadie et al., 2015; Pathak et al., 2017; Burda et al., 2018; Kim et al., 2018), and information gain-based methods (Mohamed & Jimenez Rezende, 2015; Houthooft et al., 2016). In most cases, an incentive for exploration is introduced as an additional reward to the TD target in Q-learning or as a regularizer to the overall loss function. Recently, the diverse approaches mentioned above have been adopted in the multi-agent setting to promote exploration (Mahajan et al., 2019; Wang et al., 2019; Jaques et al., 2019; Mguni et al., 2021). As an example, EMC (Zheng et al., 2021) utilizes episodic control (Lengyel & Dayan, 2007; Blundell et al., 2016) as regularization for the joint Q-learning, in addition to curiosity-driven exploration by predicting individual Q-values. Learning with intrinsic rewards becomes more important in sparse reward settings. However, such an intrinsic reward can adversely affect overall policy learning if it is not properly annealed throughout training. Instead of generating an additional reward signal that solely encourages exploration, LAGMA generates an intrinsic reward that guarantees a more accurate TD-target for Q-learning, yielding an additional incentive toward a goal-reaching path.

Additional related works regarding goal-conditioned reinforcement learning (GCRL) and subtask-conditioned MARL are presented in Appendix C.

3. Preliminaries

Decentralized POMDP. A general cooperative multi-agent task with $n$ agents can be formalized as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) (Oliehoek & Amato, 2016). A Dec-POMDP consists of a tuple $G = \langle I, S, A, P, R, \Omega, O, n, \gamma \rangle$, where $I$ is the finite set of $n$ agents; $s \in S$ is the true state in the global state space $S$; $A$ is the action space of each agent's action $a_i$, forming the joint action $\mathbf{a} \in A^n$; $P(s'|s, \mathbf{a})$ is the state transition function determined by the environment; $R$ is a reward function $r = R(s, \mathbf{a}, s') \in \mathbb{R}$; $O$ is the observation function generating an individual observation from the observation space $\Omega$, i.e., $o_i \in \Omega$; and finally, $\gamma \in [0, 1)$ is a discount factor.
Figure 1: Overview of the LAGMA framework. (a) VQ-VAE constructs a quantized vector space with the coverage loss, while (b) the VQ codebook stores goal-reaching sequences from a given $x_{q,t}$. Then, (c) the goal-reaching trajectory is compared with the current batch trajectory to generate (d) the intrinsic reward. MARL training is done by (e) the standard CTDE framework.

In a general cooperative MARL task, an agent acquires its local observation $o_i$ at each timestep, and the agent selects an action $a_i \in A$ based on $o_i$. $P(s'|s, \mathbf{a})$ determines the next state $s'$ for a given current state $s$ and the joint action $\mathbf{a}$ taken by the agents. For a given tuple $\{s, \mathbf{a}, s'\}$, $R$ provides an identical common reward to all agents. To overcome the partial observability in a Dec-POMDP, each agent often utilizes a local action-observation history $\tau_i \in T \equiv (\Omega \times A)^*$ for its policy $\pi_i(a|\tau_i)$, where $\pi_i : T \times A \to [0, 1]$ (Hausknecht & Stone, 2015; Rashid et al., 2018). Additionally, we denote a group trajectory as $\boldsymbol{\tau} = \langle \tau_1, ..., \tau_n \rangle$.

Centralized Training with Decentralized Execution (CTDE). In fully cooperative MARL tasks under the CTDE paradigm, value factorization approaches have been introduced (Sunehag et al., 2017; Rashid et al., 2018; Son et al., 2019; Rashid et al., 2020; Wang et al., 2020a) and have achieved state-of-the-art performance in complex multi-agent tasks such as SMAC (Samvelyan et al., 2019). In value factorization approaches, the joint action-value function $Q^{tot}_{\theta}$ parameterized by $\theta$ is trained to minimize the following loss function:

$$\mathcal{L}(\theta) = \mathbb{E}_{\boldsymbol{\tau}, \mathbf{a}, r^{ext}, \boldsymbol{\tau}' \sim D}\Big[\big(r^{ext} + \gamma V^{tot}_{\theta^-}(\boldsymbol{\tau}') - Q^{tot}_{\theta}(\boldsymbol{\tau}, \mathbf{a})\big)^2\Big] \quad (1)$$

Here, $V^{tot}_{\theta^-}(\boldsymbol{\tau}') = \max_{\mathbf{a}'} Q^{tot}_{\theta^-}(\boldsymbol{\tau}', \mathbf{a}')$ by definition; $D$ represents the replay buffer; $r^{ext}$ is an external reward provided by the environment; $Q^{tot}_{\theta^-}$ is a target network parameterized by $\theta^-$ for double Q-learning (Hasselt, 2010; Van Hasselt et al., 2016); and $Q^{tot}_{\theta}$ and $Q^{tot}_{\theta^-}$ include both the mixer and the individual policy networks.

Goal State and Goal-Reaching Trajectory. In general cooperative multi-agent tasks, the undiscounted reward sum, i.e., $R_0 = \sum_{t=0}^{T-1} r_t$, is maximized as $R_{max}$ if agents achieve a semantic goal, such as defeating all enemies in SMAC or scoring a goal in GRF. Thus, we define goal states and goal-reaching trajectories in cooperative MARL as follows.

Definition 3.1. (Goal State and Goal-Reaching Trajectory) For a given task-dependent $R_{max}$ and an episodic sequence $\mathcal{T} := \{s_0, \mathbf{a}_0, r_0, s_1, \mathbf{a}_1, r_1, ..., s_T\}$, when $\sum_{t=0}^{T-1} r_t = R_{max}$ for $r_t \in \mathcal{T}$, we call such an episodic sequence a goal-reaching sequence and denote it as $\mathcal{T}^*$. Then, for $s_t \in \mathcal{T}^*$, $\tau^*_{s_t} := \{s_t, s_{t+1}, ..., s_T\}$ is a goal-reaching trajectory, and we define the final state of $\tau^*_{s_t}$ as a goal state, denoted by $s^*_T$.

4. Methodology

This section introduces LAtent Goal-guided Multi-Agent reinforcement learning (LAGMA) (Figure 1). We first explain how to construct proper (1) quantized embeddings via VQ-VAE. To this end, we introduce a novel loss term called the coverage loss to distribute quantized embedding vectors across the overall embedding space.
Then, we elaborate on the details of (2) goal-reaching trajectory generation with the extended VQ codebook. Finally, we propose (3) a latent goal-guided intrinsic reward, which guarantees a better TD-target for policy learning and thus yields better convergence to an optimal policy.

4.1. State Embedding via Modified VQ-VAE

In this paper, we adopt VQ-VAE as a discretization bottleneck (Van Den Oord et al., 2017) to construct a discretized low-dimensional embedding space. Thus, we first define $n_c$ trainable embedding vectors (codes) $e_j \in \mathbb{R}^D$ in the codebook, where $j \in \{1, 2, ..., n_c\}$. An encoder network $f_\phi$ in VQ-VAE projects a global state $s$ to a $D$-dimensional vector, $x = f_\phi(s) \in \mathbb{R}^D$.

Figure 2: Visualization of embedding results via VQ-VAE: (a) training without $\mathcal{L}_{cvr}$ ($\lambda_{cvr} = 0.0$), (b) training with $\mathcal{L}^{all}_{cvr}$ ($\lambda_{cvr} = 0.2$), and (c) training with $\mathcal{L}_{cvr}$ ($\lambda_{cvr} = 0.2$). Under the SMAC 5m vs 6m task, the codebook size is $n_c = 64$ and the latent dimension is $D = 8$; this illustrates embeddings at training time T=1.0M. Colored dots represent $\chi$, the state representation before quantization, and gray dots are quantized vector representations belonging to the VQ codebook derived from the state representations. Colors from red to purple (rainbow) represent small to large timesteps within episodes.

Instead of using the latent vector $x$ directly, we use a discretized latent $x_q$ obtained by the quantization process, which maps an embedding vector $x$ to the nearest embedding vector in the codebook as follows:

$$x_q = e_z, \quad \text{where } z = \arg\min_j ||x - e_j||_2 \quad (2)$$

Then, the quantized vector $x_q$ is used as an input to a decoder $f_\psi$, which reconstructs the original state $s$. To train the encoder, the decoder, and the embedding vectors in the codebook, we consider the following objective, similar to (Van Den Oord et al., 2017; Islam et al., 2022; Lee et al., 2023):

$$\mathcal{L}_{VQ}(\phi, \psi, e) = ||f_\psi([x = f_\phi(s)]_q) - s||_2^2 + \lambda_{vq}||\mathrm{sg}[f_\phi(s)] - x_q||_2^2 + \lambda_{commit}||f_\phi(s) - \mathrm{sg}[x_q]||_2^2 \quad (3)$$

Here, $[\cdot]_q$ and $\mathrm{sg}[\cdot]$ represent the quantization process and the stop-gradient operator, respectively. $\lambda_{vq}$ and $\lambda_{commit}$ are scale factors for the corresponding terms. The first term in Eq. (3) is the reconstruction loss, while the second term is the VQ-objective, which moves an embedding vector $e$ toward $x = f_\phi(s)$. The last term, called the commitment loss, enforces the encoder to generate $f_\phi(s)$ similar to $x_q$ and prevents its output from growing significantly. To approximate the gradient signal for the encoder, we adopt a straight-through estimator (Bengio et al., 2013).

When adopting VQ-VAE for state embedding, we found that only a few quantized vectors $e$ in the codebook are selected throughout an episode, which makes it hard to utilize such a method for meaningful state embedding. We presume that the reason is the narrow projected embedding space of feasible states compared to the whole embedding space, i.e., $\mathbb{R}^D$. Thus, most randomly initialized quantized vectors $e$ are located far from the latent space of the states in the current replay buffer $D$, denoted as $\chi = \{x \in \mathbb{R}^D : x = f_\phi(s), s \in D\}$, leaving only a few $e$ close to $x$ within an episode. To resolve this issue, we introduce the coverage loss, which minimizes the overall distance between the current embedding $x$ and all vectors in the codebook, i.e., $e_j$ for all $j \in \{1, 2, ..., n_c\}$:

$$\mathcal{L}^{all}_{cvr}(e) = \frac{1}{n_c}\sum_{j=1}^{n_c} ||\mathrm{sg}[f_\phi(s)] - e_j||_2^2 \quad (4)$$

Although $\mathcal{L}^{all}_{cvr}$ could lead the embedding vectors toward $\chi$, all quantized vectors tend to locate at the center of $\chi$ rather than densely covering the whole $\chi$ space.
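To make the pieces so far concrete, the following PyTorch sketch combines the quantization of Eq. (2), the straight-through estimator, and the loss terms of Eqs. (3)-(4); the class name, network sizes, and default scale factors are illustrative assumptions rather than the released implementation, and the timestep-indexed coverage of Eq. (5) (introduced next) is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateVQVAE(nn.Module):
    """Minimal sketch of the modified VQ-VAE of Eqs. (2)-(4) (illustrative, not the official code)."""
    def __init__(self, state_dim, latent_dim=8, n_codes=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
        self.codebook = nn.Embedding(n_codes, latent_dim)   # codes e_1, ..., e_{n_c}

    def quantize(self, x):
        # Eq. (2): pick the nearest code e_z for each latent x.
        dist = torch.cdist(x, self.codebook.weight)          # (B, n_c)
        z = dist.argmin(dim=-1)
        return self.codebook(z), z

    def forward(self, s, lambda_vq=1.0, lambda_commit=0.5, lambda_cvr=0.5):
        x = self.encoder(s)
        x_q, z = self.quantize(x)
        x_q_st = x + (x_q - x).detach()                      # straight-through estimator
        recon = self.decoder(x_q_st)
        loss_vq = (F.mse_loss(recon, s)                                # reconstruction term
                   + lambda_vq * F.mse_loss(x.detach(), x_q)           # VQ objective
                   + lambda_commit * F.mse_loss(x, x_q.detach()))      # commitment loss (Eq. 3)
        # Eq. (4): average squared distance from sg[x] to *all* codes in the codebook.
        loss_cvr = ((x.detach().unsqueeze(1) - self.codebook.weight.unsqueeze(0)) ** 2).sum(-1).mean()
        return loss_vq + lambda_cvr * loss_cvr, z
```

For instance, on a 5m vs 6m global state of dimension 98, `StateVQVAE(98)(torch.randn(32, 98))` would return the combined loss (for a backward pass) together with the code indices that form the quantized trajectory.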
Thus, we consider a timestep-dependent indexing $\mathcal{J}(t)$ when computing the coverage loss. The purpose of introducing $\mathcal{J}(t)$ is to make only sequentially selected quantized vectors close to the current embedding $x_t$, so that the quantized embeddings are uniformly distributed across $\chi$ according to timesteps. The final form of the coverage loss can then be expressed as follows:

$$\mathcal{L}_{cvr}(e) = \frac{1}{|\mathcal{J}(t)|}\sum_{j \in \mathcal{J}(t)} ||\mathrm{sg}[f_\phi(s)] - e_j||_2^2 \quad (5)$$

We defer the details of the $\mathcal{J}(t)$ construction to Appendix E. By considering the coverage loss in Eq. (5), not only the nearest quantized vector but also all vectors in the codebook move toward the overall latent space $\chi$. In this way, $\chi$ can be well covered by the quantized vectors in the codebook. Thus, we consider the overall learning objective as follows:

$$\mathcal{L}^{tot}_{VQ}(\phi, \psi, e) = \mathcal{L}_{VQ}(\phi, \psi, e) + \lambda_{cvr}\,\mathcal{L}_{cvr}(e) \quad (6)$$

where $\lambda_{cvr}$ is a scale factor for $\mathcal{L}_{cvr}$.

Figure 2 presents a visualization of the embeddings by principal component analysis (PCA) (Wold et al., 1987). In Figure 2, training without $\mathcal{L}_{cvr}$ leads to quantized vectors that are distant from $\chi$. In addition, the embedding space $\chi$ itself distributes around a few quantized vectors due to the commitment loss in Eq. (3). Considering $\mathcal{L}^{all}_{cvr}$ makes the quantized vectors close to $\chi$, but they mostly locate around the center of $\chi$ rather than being distributed properly. On the other hand, the proposed $\mathcal{L}_{cvr}$ results in well-distributed quantized vectors over the $\chi$ space, so that they can properly represent the latent space of $s \in D$.

Figure 3: Histogram of recalled quantized vectors.

Figure 3 presents the occurrence of recalled quantized vectors for the state embeddings in Fig. 2. We can see that training with $\lambda_{cvr}$ guarantees quantized vectors that are well distributed across $\chi$. Appendix E presents the training algorithm for the proposed VQ-VAE.

4.2. Goal-Reaching Trajectory Generation with Extended VQ Codebook

After constructing the quantized vectors in the codebook, we need to properly estimate the value of the states projected to each quantized vector. Note that the estimated value of each quantized vector is used when generating an additional incentive for desired transitions, i.e., transitions toward a goal-reaching trajectory. Thanks to the quantized vectors in the codebook, we can resort to count-based estimation for the value of a given state. For a given $s_t$, a cumulative return from $s_t$ denoted as $R_t = \sum_{i=t}^{T-1} \gamma^{i-t} r_i$, and $x_{q,t} = [x_t = f_\phi(s_t)]_q$, the value of $x_{q,t}$ can be computed via count-based estimation as

$$C_{q,t}(x_{q,t}) = \frac{1}{N_{x_{q,t}}}\sum_{j=1}^{N_{x_{q,t}}} R^j_t(x_{q,t}) \quad (7)$$

Here, $N_{x_{q,t}}$ is the visitation count of $x_{q,t}$. However, as the encoder network $f_\phi$ is updated during training, the match between a specific state $s$ and $x = f_\phi(s)$ can break. It thus becomes hard to accurately estimate the value of $s$ via count-based visits to $x_{q,t}$. To resolve this, we adopt a moving average with a buffer size of $m$ when computing $C_{q,t}(x_{q,t})$ and store the updated value in the extended codebook, $D_{VQ}$. Appendix D presents structural details of $D_{VQ}$.
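As a rough illustration of how $D_{VQ}$ could hold these moving-average estimates, consider the sketch below; the class layout and method names are assumptions for illustration, not the paper's data structure (whose exact format is given in Appendix D).

```python
from collections import deque

class ExtendedCodebook:
    """Per-code value estimates C_{q,t} (Eq. 7), kept as a moving average over the last m returns
    so that stale returns produced by an older encoder are gradually forgotten (illustrative sketch)."""
    def __init__(self, n_codes: int, m: int = 100):
        self.returns = [deque(maxlen=m) for _ in range(n_codes)]   # FIFO buffer per code

    def update(self, z: int, discounted_return: float) -> None:
        # Store R_t = sum_{i>=t} gamma^{i-t} r_i for a state projected onto code z.
        self.returns[z].append(discounted_return)

    def value(self, z: int) -> float:
        # C_{q,t}: average of the stored returns for code z (0 before any visit).
        buf = self.returns[z]
        return sum(buf) / len(buf) if buf else 0.0

# Usage: record two returns observed at code 3, then read its value estimate.
cb = ExtendedCodebook(n_codes=64, m=100)
cb.update(3, 18.5)
cb.update(3, 21.5)
print(cb.value(3))   # 20.0
```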
After constructing $D_{VQ}$, we now need to determine a goal-reaching trajectory $\tau^*_{s_t}$, defined in Definition 3.1, in the latent space. This trajectory is considered a reference trajectory used to incentivize desired transitions. Let the state sequence from $s_t$ and its corresponding latent sequence projected by $f_\phi$ be $\tau_{s_t} = \{s_t, s_{t+1}, s_{t+2}, ..., s_T\}$ and $\tau_{x_t} = f_\phi(\tau_{s_t})$, respectively. Then, the latent sequence after the quantization process can be expressed as $\tau_{\chi_t} = [f_\phi(\tau_{s_t})]_q = \{x_{q,t}, x_{q,t+1}, x_{q,t+2}, ..., x_{q,T}\}$.

To evaluate the value of the trajectory $\tau_{\chi_t}$, we use the $C_{q,t}$ value in the codebook $D_{VQ}$ of the initial quantized vector $x_{q,t}$ in $\tau_{\chi_t}$. To encourage the desired transitions contained in a $\tau_{\chi_t}$ with a high $C_{q,t}$ value, we need to keep the sequence data of $\tau_{\chi_t}$. For a given starting node $x_{q,t}$, we keep the top-$k$ sequences in $D_{seq}$ based on their $C_{q,t}$. Thus, $D_{seq}$ consists of two parts: $D_{\tau_{\chi_t}}$ stores the top-$k$ sequences of $\tau_{\chi_t}$, and $D_{C_{q,t}}$ stores their corresponding $C_{q,t}$ values. The updating algorithm for the sequence buffer $D_{seq}$ and its structural details are presented in Appendix D.

As in Definition 3.1, the highest return in cooperative multi-agent tasks can only be achieved when the semantic goal is satisfied. Thus, once agents have achieved the common goal during training, goal-reaching trajectories starting from various initial positions are stored in $D_{seq}$. After we construct $D_{seq}$, a reference trajectory $\tau^*_{\chi_t}$ can be sampled from $D_{\tau_{\chi_t}}$. For a given initial position $x_{q,t}$ in the quantized latent space, we randomly sample a reference trajectory, or goal-reaching trajectory, from $D_{seq}$.

4.3. Intrinsic Reward Generation

With a goal-reaching trajectory $\tau^*_{\chi_t}$ from the current state, we can determine the desired transitions that lead to a goal-reaching path simply by checking whether the quantized latent $x_{q,t}$ at each timestep $t$ is in $\tau^*_{\chi_t}$. However, before the quantized vectors $e$ in the codebook cover the latent distribution $\chi$ well, only a few $e$ vectors are selected, and thus the same $x_q$ will be obtained repeatedly. In such a case, staying in the same $x_q$ would be encouraged by the intrinsic reward if $x_q \in \tau^*_{\chi_t}$. To prevent this, we only provide an incentive for the desired transition toward $x_{q,t+1}$ such that $x_{q,t+1} \in \tau^*_{\chi_t}$ and $x_{q,t+1} \neq x_{q,t}$.

A remaining problem is how much we incentivize such a desired transition. Instead of an arbitrary incentive, we want to design an additional reward that guarantees a better TD-target, in order to converge to an optimal policy.

Proposition 4.1. Provided that $\tau^*_{\chi_t}$ is a goal-reaching trajectory and $s' \in \tau^*_{\chi_t}$, adding an intrinsic reward $r^I(s') := \gamma\big(C_{q,t}(s') - \max_{a'} Q_{\theta^-}(s', a')\big)$ to the current TD-target $y = r(s, a) + \gamma V_{\theta^-}(s')$ guarantees the true TD-target $y^* = r(s, a) + \gamma V^*(s')$, where $V^*(s')$ is the true value of $s'$.

Proof. Please see Appendix A.

According to Proposition 4.1, when $\tau^*_{\chi_t}$ is a goal-reaching trajectory and $s' \in \tau^*_{\chi_t}$, we can obtain the true TD-target by adding the intrinsic reward $r^I(s')$ to the current TD-target $y$, yielding better convergence to an optimal policy. In the case when the reference trajectory $\tau^*_{\chi_t}$ is not a goal-reaching trajectory, $r^I(s')$ incentivizes transitions toward the high-return trajectory experienced so far.

Figure 4: Intrinsic reward generation by comparing the current trajectory in the quantized latent space ($\tau_{\chi_t}$) with a sampled goal-reaching trajectory ($\tau^*_{\chi_t}$): (a) trajectory embedding (Voronoi cells), (b) trajectory in the quantized latent space, and (c) intrinsic reward generation.

Thus, we define the latent goal-guided intrinsic reward $r^I$ as follows:

$$r^I_t(s_{t+1}) = \gamma\big(C_{q,t}(s_{t+1}) - \max_{a'} Q_{\theta^-}(s_{t+1}, a')\big), \quad \text{if } x_{q,t+1} \in \tau^*_{\chi_t} \text{ and } x_{q,t+1} \neq x_{q,t} \quad (8)$$

Note that $r^I_t(s_{t+1})$ is added to $y_t = r_t + \gamma V_{\theta^-}$, not to $y_{t+1}$. In addition, we ensure that $r^I$ is non-negative, so that an inaccurate estimate of $C_{q,t}(s')$ in the early training phase does not adversely affect the estimation of $V_{\theta^-}$. Algorithm 1 summarizes the overall method for goal-reaching trajectory and intrinsic reward generation.
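A compact sketch of the condition and reward in Eq. (8), clipped at zero as described above, might look as follows; the function signature, the use of a Python set for the reference trajectory, and the toy numbers in the usage example are illustrative assumptions.

```python
import torch

def latent_goal_intrinsic_reward(z_next: int, z_curr: int, ref_traj: set,
                                 c_q_next: float, q_next: torch.Tensor,
                                 gamma: float = 0.99) -> float:
    """Eq. (8): reward a transition only if the next quantized index lies on the sampled
    goal-reaching trajectory AND the quantized index actually changed (illustrative sketch).
    q_next holds Q_{theta^-}(s_{t+1}, .) over the candidate actions."""
    if z_next in ref_traj and z_next != z_curr:
        # Clipped at zero so a poor early estimate of C_{q,t} never lowers the TD-target.
        return gamma * max(c_q_next - q_next.max().item(), 0.0)
    return 0.0

# Usage: the reference trajectory visits codes {12, 40, 7}; moving from code 12 to 40 is rewarded.
q_vals = torch.tensor([3.1, 2.4, 2.9])
r_int = latent_goal_intrinsic_reward(z_next=40, z_curr=12, ref_traj={12, 40, 7},
                                     c_q_next=5.0, q_next=q_vals)
print(r_int)   # approximately 0.99 * (5.0 - 3.1) = 1.881
```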
Figure 4 illustrates a schematic diagram of the quantized trajectory embeddings $\tau_{\chi_t}$ obtained via VQ-VAE and of the intrinsic reward generation performed by comparing them with a goal-reaching trajectory $\tau^*_{\chi_t}$.

Algorithm 1 Goal-reaching Trajectory and Intrinsic Reward Generation
Given: sequences of the current batch $[\tau^i_{\chi_t}]^B_{i=1}$, a sequence buffer $D_{seq}$, an update interval $n_{freq}$ for $\tau^*_{\chi_t}$, and the VQ-VAE codebook $D_{VQ}$
for $i = 1$ to $B$ do
  Compute $R^i_t$
  for $t = 0$ to $T$ do
    Get index $z_t \in \tau^i_{\chi_t}$
    if $\mathrm{mod}(t, n_{freq}) = 0$ then
      Run Algorithm 2 to update $D^{z_t}_{seq}$ with $R^i_t$
      Sample a reference trajectory $\tau^*_{\chi_t}$ from $D^{z_t}_{seq}$
    else
      if $z_t \in \tau^*_{\chi_t}$ and $z_t \neq z_{t-1}$ then
        Get $C_{q,t} \leftarrow D^{z_t}_{VQ}.C_{q,t}$
        $(r^I_{t-1})^i \leftarrow \gamma \max\big(C_{q,t} - \max_{a'} Q_{\theta^-}(s_t, a'),\, 0\big)$
      end if
    end if
  end for
end for

4.4. Overall Learning Objective

This paper adopts the conventional CTDE paradigm (Oliehoek et al., 2008; Gupta et al., 2017), and thus any form of mixer structure can be used for value factorization. We use the mixer structure presented in QMIX (Rashid et al., 2018), similar to (Yang et al., 2022; Wang et al., 2021; Jeon et al., 2022), to construct the joint Q-value ($Q^{tot}$) from individual Q-functions. By adding the latent goal-guided intrinsic reward $r^I$ to Eq. (1), the overall loss function for policy learning can be expressed as follows:

$$\mathcal{L}(\theta) = \Big(r^{ext} + r^I + \gamma \max_{a'} Q^{tot}_{\theta^-}(s', a') - Q^{tot}_{\theta}(s, a)\Big)^2 \quad (9)$$

Note that here $r^I$ does not include any scale factor to control its magnitude. For the individual policy via the Q-function, GRUs are adopted to encode a local action-observation history $\tau$ to overcome the partial observability of the POMDP, similar to most MARL approaches (Sunehag et al., 2017; Rashid et al., 2018; Son et al., 2019; Rashid et al., 2020; Wang et al., 2020a). However, in Eq. (9), we express the equation with $s$ instead of $\tau$ for conciseness and for coherence with the mathematical derivation. The overall training algorithm for both VQ-VAE training and policy learning is presented in Appendix E.
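As a minimal illustration of Eq. (9), the sketch below forms the TD-target from the external and intrinsic rewards and the target mixer output; the function name, the tensor shapes, and the terminal mask (1 - done), which Eq. (9) leaves implicit, are assumptions, and the actual framework conditions on the history $\tau$ rather than $s$.

```python
import torch
import torch.nn.functional as F

def lagma_td_loss(q_tot, q_tot_target_next, r_ext, r_int, done, gamma=0.99):
    """Eq. (9): squared TD error per transition, with the latent goal-guided intrinsic
    reward simply added to the external reward (no extra scale factor).
    q_tot:             Q_tot_theta(s, a)              -- shape (B,)
    q_tot_target_next: max_a' Q_tot_theta^-(s', a')   -- shape (B,), from the target mixer
    """
    y = r_ext + r_int + gamma * (1.0 - done) * q_tot_target_next
    return F.mse_loss(q_tot, y.detach())

# Usage on a dummy batch of 4 transitions (all values are made up).
B = 4
q_tot = torch.randn(B, requires_grad=True)
loss = lagma_td_loss(q_tot, torch.randn(B), torch.zeros(B),
                     torch.tensor([0.0, 1.88, 0.0, 0.5]), torch.zeros(B))
loss.backward()   # in a real trainer this propagates into the mixer and agent networks
```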
5. Experiments

In this section, we present the experiment settings and results used to evaluate the proposed method. We designed our experiments to address the following inquiries, denoted as Q1-3.

Q1. The performance of LAGMA in comparison to state-of-the-art MARL frameworks in both dense and sparse reward settings.
Q2. The impact of the proposed embedding method on overall performance.
Q3. The efficiency of the latent goal-guided incentive compared to other reward designs.

Figure 5: Performance comparison of LAGMA against baseline algorithms on two easy/hard SMAC maps (1c3s5z, 5m vs 6m) and two super hard SMAC maps (MMM2, 6h vs 8z). (Dense reward setting)

Figure 6: Performance comparison of LAGMA against baseline algorithms on four maps: 3m, 8m, 2s3z, and 2m vs 1z. (Sparse reward setting)

We consider complex multi-agent tasks such as SMAC (Samvelyan et al., 2019) and GRF (Kurach et al., 2020) as benchmark problems. As baselines, we consider various MARL algorithms: QMIX (Rashid et al., 2018); RODE (Wang et al., 2021) and LDSA (Yang et al., 2022), which adopt a role- or skill-conditioned policy; MASER (Jeon et al., 2022), which generates agent-wise individual subgoals from the replay buffer; and EMC (Zheng et al., 2021), which adopts episodic control. Appendix B presents further details of the experiment settings and implementations, and Appendix G illustrates the resource usage and the computational cost required for the implementation and training of LAGMA. In addition, generalizability tests of LAGMA are presented in Appendix F. Our code is available at: https://github.com/aailabkaist/LAGMA.

5.1. Performance evaluation on SMAC

Dense reward settings. For dense reward settings, we follow the default setting presented in (Samvelyan et al., 2019). Figure 5 illustrates the overall performance of LAGMA. Thanks to the quantized embedding and the latent goal-guided incentive, LAGMA shows significant performance improvement compared to the backbone algorithm, i.e., QMIX, and other state-of-the-art (SOTA) baseline algorithms, especially on super hard SMAC maps.

Sparse reward settings. For the sparse reward setting, we follow the reward design in MASER (Jeon et al., 2022). Appendix B enumerates the details of the reward settings. Similar to the dense reward settings, LAGMA shows the best performance in sparse reward settings thanks to the latent goal-guided incentive. A sparse reward hardly generates a reward signal in the experience replay, so training on experience of the exact same state takes a long time to find the optimal policy. However, LAGMA considers the value of semantically similar states projected onto the same quantized vector during training, so its learning efficiency is significantly increased.

Figure 7: Performance comparison of LAGMA against baseline algorithms on two GRF maps: 3 vs 1WK and Counter Attack (CA) easy. (Sparse reward setting)

5.2. Performance evaluation on GRF

Here, we conduct experiments on additional sparse reward tasks in GRF to compare LAGMA with the baseline algorithms. For these experiments, we do not utilize any additional algorithm for sample efficiency, such as prioritized experience replay (Schaul et al., 2015), for any algorithm.

Figure 8: Qualitative analysis on SMAC MMM2 (red teams are RL agents). Purple stars represent quantized embeddings of goal states in the replay buffer D. Yellow dots indicate the quantized embeddings in a sampled goal-reaching trajectory starting from an initial state denoted by a green dot. Gray dots and transparent dots are the same as in Figure 2. Blue and red dots indicate the terminal embeddings of the two trajectories, respectively.

EMC (Zheng et al., 2021) shows comparable performance by utilizing an episodic buffer, which helps generate a positive reward signal via an additional episodic control term. However, LAGMA with the modified VQ codebook can guide a scoring policy without utilizing the additional episodic buffer required by EMC. Therefore, LAGMA achieves similar or better performance with a smaller memory requirement.

5.3. Ablation study

In this subsection, we conduct ablation studies to see the effect of the proposed embedding method and the latent goal-guided incentive on overall performance.

Figure 9: Ablation study considering the coverage loss (CL) on two SMAC maps: 3m and 2s3z. (Sparse reward setting)

Figure 10: Performance comparison of the goal-guided incentive with other reward design choices on two SMAC maps: 3m and 2s3z. (Sparse reward setting)

We compare LAGMA (ours) with ablated configurations such as
The results imply that, without the proposed coverage loss, quantized latent vectors may not cover χ properly and thus xq can hardly represent the projected states. As a result, a goal-reaching trajectory that consists of a few quantized vectors yields no incentive signal in most transitions. In addition, we conduct an ablation study on reward design. We consider a sum of undiscounted rewards, Cq0 = ΣT 1 t=0 rt, for trajectory value estimation instead of Cq,t, denoted as LAMGA (Cq0). We also consider the LAMGA configuration with goal-reaching trajectory generation only at the initial state denoted by LAMGA (Cqt-No-Upd). Figure 10 illustrates the results. Figure 10 implies that the reward design of Cq,t shows a more stable performance than both LAMGA (Cq0) and LAMGA (Cqt-No-Upd). 5.4. Qualitative analysis In this section, we conduct a qualitative analysis to observe how the states in an episode are projected onto quantized vector space and receive latent goal-guided incentive compared to goal-reaching trajectory sampled from Dseq. Figure 8 illustrates the quantized embedding sequences of two trajectories: one denoted by a blue line representing a battle-won trajectory and the other denoted by a red line representing a losing trajectory. In Fig. 8, a losing trajectory initially followed the optimal sequence denoted by yellow dots but began to bifurcate at t = 20 by losing Medivac and two more allies. Although the losing trajectory still passed through goal-reaching sequences during an episode, LAGMA: LAtent Goal-guided Multi-Agent Reinforcement Learning it ultimately reached a terminal state without a chance to defeat the enemies at t = 40, as indicated by the absence of a purple star. On the other hand, a trajectory that achieved victory followed the goal-reaching path and reached a goal state at the end, as indicated by purple stars. Since only transitions toward the sequences on the goal-reaching path are incentivized, LAGMA can efficiently learn a goal-reaching policy, i.e., the optimal policy in cooperative MARL. 6. Conclusions This paper presents LAGMA, a framework to generate a goal-reaching trajectory in latent space and a latent goalguided incentive to achieve a common goal in cooperative MARL. Thanks to the quantized embedding space, the experience of semantically similar states is shared by states projected onto the same quantized vector, yielding efficient training. The proposed latent goal-guided intrinsic reward encourages transitions toward a goal-reaching trajectory. Experiments and ablation studies validate the effectiveness of LAGMA. Acknowledgements This research was supported by AI Technology Development for Commonsense Extraction, Reasoning, and Inference from Heterogeneous Data (IITP) funded by the Ministry of Science and ICT (2022-0-00077). Impact Statement This paper primarily focuses on advancing the field of Machine Learning through multi-agent reinforcement learning. While there could be various potential societal consequences of our work, none of which we believe must be specifically highlighted here. Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., Mc Grew, B., Tobin, J., Pieter Abbeel, O., and Zaremba, W. Hindsight experience replay. Advances in neural information processing systems, 30, 2017. Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. Advances in neural information processing systems, 29, 2016. 
Bengio, Y., L eonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. ar Xiv preprint ar Xiv:1308.3432, 2013. Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J. Z., Rae, J., Wierstra, D., and Hassabis, D. Model-free episodic control. ar Xiv preprint ar Xiv:1606.04460, 2016. Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. ar Xiv preprint ar Xiv:1810.12894, 2018. Chane-Sane, E., Schmid, C., and Laptev, I. Goalconditioned reinforcement learning with imagined subgoals. In International Conference on Machine Learning, pp. 1430 1440. PMLR, 2021. Dasgupta, S. Experiments with random projection. ar Xiv preprint ar Xiv:1301.3849, 2013. Ellis, B., Cook, J., Moalla, S., Samvelyan, M., Sun, M., Mahajan, A., Foerster, J., and Whiteson, S. Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024. Ghosh, D., Gupta, A., and Levine, S. Learning actionable representations with goal-conditioned policies. ar Xiv preprint ar Xiv:1811.07819, 2018. Grze s, M. and Kudenko, D. Multigrid reinforcement learning with reward shaping. In International Conference on Artificial Neural Networks, pp. 357 366. Springer, 2008. Gupta, J. K., Egorov, M., and Kochenderfer, M. Cooperative multi-agent control using deep reinforcement learning. In International conference on autonomous agents and multiagent systems, pp. 66 83. Springer, 2017. Hafner, D., Lillicrap, T., Norouzi, M., and Ba, J. Mastering atari with discrete world models. ar Xiv preprint ar Xiv:2010.02193, 2020. Hasselt, H. Double q-learning. Advances in neural information processing systems, 23, 2010. Hausknecht, M. and Stone, P. Deep recurrent q-learning for partially observable mdps. In 2015 aaai fall symposium series, 2015. Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. Vime: Variational information maximizing exploration. Advances in neural information processing systems, 29, 2016. Islam, R., Zang, H., Goyal, A., Lamb, A., Kawaguchi, K., Li, X., Laroche, R., Bengio, Y., and Combes, R. T. D. Discrete factorial representations as an abstraction for goal conditioned reinforcement learning. ar Xiv preprint ar Xiv:2211.00247, 2022. LAGMA: LAtent Goal-guided Multi-Agent Reinforcement Learning Jaques, N., Lazaridou, A., Hughes, E., Gulcehre, C., Ortega, P., Strouse, D., Leibo, J. Z., and De Freitas, N. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International conference on machine learning, pp. 3040 3049. PMLR, 2019. Jeon, J., Kim, W., Jung, W., and Sung, Y. Maser: Multiagent reinforcement learning with subgoals generated from experience replay buffer. In International Conference on Machine Learning, pp. 10041 10052. PMLR, 2022. Jiang, N., Kulesza, A., and Singh, S. Abstraction selection in model-based reinforcement learning. In International Conference on Machine Learning, pp. 179 188. PMLR, 2015. Kaelbling, L. P. Learning to achieve goals. In IJCAI, volume 2, pp. 1094 8. Citeseer, 1993. Kim, H., Kim, J., Jeong, Y., Levine, S., and Song, H. O. Emi: Exploration with mutual information. ar Xiv preprint ar Xiv:1810.01176, 2018. Kim, J., Seo, Y., Ahn, S., Son, K., and Shin, J. Imitating graph-based planning with goal-conditioned policies. ar Xiv preprint ar Xiv:2303.11166, 2023. Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. 
Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in neural information processing systems, 29, 2016. Kurach, K., Raichuk, A., Stanczyk, P., Zajkc, M., Bachem, O., Espeholt, L., Riquelme, C., Vincent, D., Michalski, M., Bousquet, O., et al. Google research football: A novel reinforcement learning environment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 4501 4510, 2020. Lee, S., Cho, D., Park, J., and Kim, H. J. Cqm: Curriculum reinforcement learning with a quantized world model. ar Xiv preprint ar Xiv:2310.17330, 2023. Lengyel, M. and Dayan, P. Hippocampal contributions to control: the third way. Advances in neural information processing systems, 20, 2007. Li, Z., Zhu, D., Hu, Y., Xie, X., Ma, L., Zheng, Y., Song, Y., Chen, Y., and Zhao, J. Neural episodic control with state abstraction. ar Xiv preprint ar Xiv:2301.11490, 2023. Liu, Y., Li, Y., Xu, X., Dou, Y., and Liu, D. Heterogeneous skill learning for multi-agent tasks. Advances in Neural Information Processing Systems, 35:37011 37023, 2022. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. Multi-agent actor-critic for mixed cooperativecompetitive environments. Neural Information Processing Systems (NIPS), 2017. Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. Maven: Multi-agent variational exploration. Advances in Neural Information Processing Systems, 32, 2019. Mguni, D. H., Jafferjee, T., Wang, J., Perez-Nieves, N., Slumbers, O., Tong, F., Li, Y., Zhu, J., Yang, Y., and Wang, J. Ligs: Learnable intrinsic-reward generation selection for multi-agent learning. ar Xiv preprint ar Xiv:2112.02618, 2021. Mohamed, S. and Jimenez Rezende, D. Variational information maximisation for intrinsically motivated reinforcement learning. Advances in neural information processing systems, 28, 2015. Na, H., Seo, Y., and Moon, I.-c. Efficient episodic memory utilization of cooperative multi-agent reinforcement learning. ar Xiv preprint ar Xiv:2403.01112, 2024. Nasiriany, S., Pong, V., Lin, S., and Levine, S. Planning with goal-conditioned policies. Advances in Neural Information Processing Systems, 32, 2019. Oliehoek, F. A. and Amato, C. A concise introduction to decentralized POMDPs. Springer, 2016. Oliehoek, F. A., Spaan, M. T., and Vlassis, N. Optimal and approximate q-value functions for decentralized pomdps. Journal of Artificial Intelligence Research, 32:289 353, 2008. Ostrovski, G., Bellemare, M. G., Oord, A., and Munos, R. Count-based exploration with neural density models. In International conference on machine learning, pp. 2721 2730. PMLR, 2017. Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pp. 2778 2787. PMLR, 2017. Rashid, T., Samvelyan, M., Schroeder, C., Farquhar, G., Foerster, J., and Whiteson, S. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International conference on machine learning, pp. 4295 4304. PMLR, 2018. Rashid, T., Farquhar, G., Peng, B., and Whiteson, S. Weighted qmix: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. Advances in neural information processing systems, 33: 10199 10210, 2020. LAGMA: LAtent Goal-guided Multi-Agent Reinforcement Learning Samvelyan, M., Rashid, T., De Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G., Hung, C.-M., Torr, P. H., Foerster, J., and Whiteson, S. 
The starcraft multi-agent challenge. ar Xiv preprint ar Xiv:1902.04043, 2019. Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International conference on machine learning, pp. 1312 1320. PMLR, 2015. Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International conference on machine learning, pp. 5887 5896. PMLR, 2019. Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing exploration in reinforcement learning with deep predictive models. ar Xiv preprint ar Xiv:1507.00814, 2015. Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. Value-decomposition networks for cooperative multi-agent learning. ar Xiv preprint ar Xiv:1706.05296, 2017. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Tang, H., Houthooft, R., Foote, D., Stooke, A., Xi Chen, O., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. # exploration: A study of count-based exploration for deep reinforcement learning. Advances in neural information processing systems, 30, 2017. Tang, Y. and Agrawal, S. Discretizing continuous action space for on-policy optimization. In Proceedings of the aaai conference on artificial intelligence, volume 34, pp. 5981 5988, 2020. Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017. Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016. Wang, J., Ren, Z., Liu, T., Yu, Y., and Zhang, C. Qplex: Duplex dueling multi-agent q-learning. ar Xiv preprint ar Xiv:2008.01062, 2020a. Wang, T., Wang, J., Wu, Y., and Zhang, C. Influencebased multi-agent exploration. ar Xiv preprint ar Xiv:1910.05512, 2019. Wang, T., Dong, H., Lesser, V., and Zhang, C. Roma: Multiagent reinforcement learning with emergent roles. ar Xiv preprint ar Xiv:2003.08039, 2020b. Wang, T., Gupta, T., Mahajan, A., Peng, B., Whiteson, S., and Zhang, C. Rode: Learning roles to decompose multi-agent tasks. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. Wold, S., Esbensen, K., and Geladi, P. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37 52, 1987. Yang, J., Borovikov, I., and Zha, H. Hierarchical cooperative multi-agent reinforcement learning with skill discovery. ar Xiv preprint ar Xiv:1912.03558, 2019. Yang, M., Zhao, J., Hu, X., Zhou, W., Zhu, J., and Li, H. Ldsa: Learning dynamic subtask assignment in cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 35:1698 1710, 2022. Zhang, T., Guo, S., Tan, T., Hu, X., and Chen, F. Generating adjacency-constrained subgoals in hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 33:21579 21590, 2020. Zheng, L., Chen, J., Wang, J., He, J., Hu, Y., Chen, Y., Fan, C., Gao, Y., and Zhang, C. Episodic multi-agent reinforcement learning with curiosity-driven exploration. Advances in Neural Information Processing Systems, 34: 3757 3769, 2021. Zhu, D., Chen, J., Shang, W., Zhou, X., Grossklags, J., and Hassan, A. E. Deepmemory: model-based memorization analysis of deep neural language models. 
In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1003-1015. IEEE, 2021.

A. Mathematical Proof

Here, we provide the omitted proof of Proposition 4.1.

Proof. Let $y = r(s, a) + \gamma V_{\theta^-}$ be the current TD-target with the target network parameterized by $\theta^-$. Adding an intrinsic reward $r^I(s')$ to $y$ yields $y' = y + r^I(s')$. Now, we need to check whether $y'$ accurately estimates $y^* = r + \gamma V^*(s')$.

$$\begin{aligned}
\mathbb{E}[y'] &= \mathbb{E}\big[r(s, a) + \gamma V_{\theta^-} + r^I(s')\big] \\
&= \mathbb{E}\big[r(s, a) + \gamma \max_{a'} Q_{\theta^-}(s', a') + \gamma\big(C_{q,t}(s') - \max_{a'} Q_{\theta^-}(s', a')\big)\big] \\
&= \mathbb{E}\big[r(s, a) + \gamma\, C_{q,t}(s')\big] \\
&= \mathbb{E}\Big[r(s, a) + \gamma\, \mathbb{E}\Big[\textstyle\sum_{i=t+1}^{T-1} \gamma^{i-(t+1)} r_i\Big]\Big] \\
&= r(s, a) + \gamma\, \mathbb{E}\big[r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-2} r_{T-1}\big] \\
&= r(s, a) + \gamma V^*(s')
\end{aligned} \quad (10)$$

The last equality in Eq. (10) holds since $s'$ is on a goal-reaching trajectory, i.e., $s' \in \tau^*_{\chi_0}$, whose return is maximized, and $\mathbb{E}[r_{t+1} + \gamma r_{t+2} + \cdots]$ is an unbiased Monte-Carlo estimate of $V^*(s')$.

B. Experiment Details

In this section, we present details of SMAC (Samvelyan et al., 2019) and GRF (Kurach et al., 2020), and we also list the hyperparameter settings of LAGMA for each task. Tables 1 and 2 present the dimensions of the state and action spaces and the maximum episodic length.

Table 1: Dimensions of the state space and the action space, and the episodic length of SMAC.

| Task | Dimension of state space | Dimension of action space | Episodic length |
|---|---|---|---|
| 3m | 48 | 9 | 60 |
| 8m | 168 | 14 | 120 |
| 2s3z | 120 | 11 | 120 |
| 2m vs 1z | 26 | 7 | 150 |
| 1c3s5z | 270 | 15 | 180 |
| 5m vs 6m | 98 | 12 | 70 |
| MMM2 | 322 | 18 | 180 |
| 6h vs 8z | 140 | 14 | 150 |

Table 2: Dimensions of the state space and the action space, and the episodic length of GRF.

| Task | Dimension of state space | Dimension of action space | Episodic length |
|---|---|---|---|
| 3 vs 1WK | 26 | 19 | 150 |
| CA easy | 30 | 19 | 150 |

In addition, Table 3 presents the task-dependent hyperparameter settings for all experiments. As seen from Table 3, we used similar hyperparameters across the various tasks. For the update interval $n_{freq}$ in Algorithm 1, we use the same value $n_{freq} = 5$ for all experiments. $\epsilon_T$ represents the annealing time of the exploration rate for $\epsilon$-greedy, from 1.0 to 0.05. After some parametric studies, we found that adjusting the hyperparameters for VQ-VAE training, namely $n^{cd}_{freq}$ and $n^{vq}_{freq}$, instead of varying the $\lambda$ values ($\lambda_{vq}$, $\lambda_{commit}$, and $\lambda_{cvr}$), provides a more efficient way of searching the parametric space. Thus, we primarily adjust $n^{cd}_{freq}$ and $n^{vq}_{freq}$ according to tasks, while keeping the ratio between the $\lambda$ values the same.

For the hyperparameter settings, we recommend the following efficient bounds for each hyperparameter based on our experiments (see also the configuration sketch below):

Number of codebook entries, $n_c$: 64-512
Update interval for VQ-VAE, $n^{vq}_{freq}$: 10-40 (under $n^{cd}_{freq}$ = 10)
Update interval for the extended codebook, $n^{cd}_{freq}$: 10-40 (under $n^{vq}_{freq}$ = 10)
Number of reference trajectories, $k$: 10-30
Scale factor of the coverage loss, $\lambda_{cvr}$: 0.25-1.0 (under $\lambda_{vq}$ = 1.0 and $\lambda_{commit}$ = 0.5)

Note that larger values of $n_c$ and $k$, and smaller values of $n^{vq}_{freq}$ and $n^{cd}_{freq}$, will increase the computational cost.
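For reference, the recommended ranges above and the per-task values in Table 3 below could be carried in a small configuration container such as the sketch that follows; the `LagmaConfig` layout and field names are illustrative assumptions, not the authors' configuration format, and the values are transcribed from Table 3.

```python
from typing import NamedTuple

class LagmaConfig(NamedTuple):
    n_codes: int          # n_c, codebook size
    latent_dim: int       # D
    lambda_vq: float
    lambda_commit: float
    lambda_cvr: float
    eps_anneal: int       # epsilon-greedy annealing steps (eps_T)
    n_cd_freq: int        # extended-codebook update interval
    n_vq_freq: int        # VQ-VAE update interval

# Two rows of Table 3, transcribed verbatim (illustrative container only).
CONFIGS = {
    "5m_vs_6m_dense": LagmaConfig(64, 8, 1.0, 0.5, 0.5, 50_000, 40, 10),
    "3m_sparse":      LagmaConfig(256, 8, 2.0, 1.0, 1.0, 50_000, 10, 40),
}
print(CONFIGS["3m_sparse"].n_codes)   # 256
```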
Table 3: Hyperparameter settings for experiments.

| Task | $n_c$ | $D$ | $\lambda_{vq}$ | $\lambda_{commit}$ | $\lambda_{cvr}$ | $\epsilon_T$ | $n^{cd}_{freq}$ | $n^{vq}_{freq}$ |
|---|---|---|---|---|---|---|---|---|
| SMAC (sparse) 3m | 256 | 8 | 2.0 | 1.0 | 1.0 | 50K | 10 | 40 |
| SMAC (sparse) 8m | 256 | 8 | 2.0 | 1.0 | 1.0 | 50K | 20 | 10 |
| SMAC (sparse) 2s3z | 256 | 8 | 2.0 | 1.0 | 1.0 | 50K | 10 | 40 |
| SMAC (sparse) 2m vs 1z | 256 | 8 | 2.0 | 1.0 | 1.0 | 500K | 20 | 10 |
| SMAC (dense) 1c3s5z | 64 | 8 | 1.0 | 0.5 | 0.5 | 50K | 40 | 10 |
| SMAC (dense) 5m vs 6m | 64 | 8 | 1.0 | 0.5 | 0.5 | 50K | 40 | 10 |
| SMAC (dense) MMM2 | 64 | 8 | 1.0 | 0.5 | 0.5 | 50K | 40 | 10 |
| SMAC (dense) 6h vs 8z | 256 | 8 | 2.0 | 1.0 | 1.0 | 500K | 40 | 10 |
| GRF (sparse) 3 vs 1WK | 256 | 8 | 2.0 | 1.0 | 1.0 | 50K | 20 | 10 |
| GRF (sparse) CA easy | 256 | 8 | 2.0 | 1.0 | 1.0 | 50K | 10 | 20 |

Table 4 presents the reward settings for SMAC (sparse), which follow the sparse reward settings from (Jeon et al., 2022).

Table 4: Reward settings for SMAC (sparse).

| Condition | Sparse reward |
|---|---|
| All enemies die (Win) | +200 |
| Each enemy dies | +10 |
| Each ally dies | -5 |
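Under Table 4, the per-step team reward could be computed as in the sketch below; the function interface and the event-count arguments are assumptions (not MASER's or SMAC's actual interface), and whether the win bonus stacks with the final enemy's +10 on the same step is likewise assumed here.

```python
def smac_sparse_reward(enemies_killed: int, allies_lost: int, battle_won: bool) -> float:
    """Sparse team reward of Table 4: +200 on winning, +10 per enemy death, -5 per ally death.
    The event counts refer to a single environment step (illustrative interface)."""
    reward = 10.0 * enemies_killed - 5.0 * allies_lost
    if battle_won:
        reward += 200.0
    return reward

# Usage: the final step of a 3m episode where the last enemy dies and the battle is won.
print(smac_sparse_reward(enemies_killed=1, allies_lost=0, battle_won=True))   # 210.0
```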
After computing Cq,t(xq,t) with the current Rt, we update the value of Cq,t(xq,t) in DV Q as Dzt V Q.Cq,t Cq,t(xq,t) where zt is an index of a quantized vector xq,t. Extended VQ Codebook 𝒟𝑠𝑒𝑞 𝑧 = { 𝒟𝐶𝑞,𝑡, 𝒟𝜏𝜒𝑡} Figure 11: VQ codebook structure. In Algorithm 2, heap push and heap replace adopt the conventional heap space management rule, with a computational complexity of O(logk). The difference is that we additionally store the sequence information in Dτχt according to their Cq,t values. LAGMA: LAtent Goal-guided Multi-Agent Reinforcement Learning Algorithm 2 Update Sequence Buffer Dseq 1: Dseq keep top k trajectory sequences based on their Cq,t 2: Input: A total reward sum Rt and a sequence τχt 3: Get an initial index zt τχt[0] 4: Get DCq,t, Dτχt from Dzt seq 5: if |DCq,t| < k then 6: heap push(DCq,t, Dτχt, Cq,t, τχt) 7: else 8: Cmin q,t DCq,t[0] 9: if Rt > Cmin q,t then 10: heap replace(DCq,t, Dτχt, Rt, τχt) 11: end if 12: end if E. Implementation Details In this section, we present further details of the implementation for LAGMA. Algorithm 3 presents the pseudo-code for a timestep dependent indexing J (t) used in Eq. (5). The purpose of a timestep dependent indexing J (t) is to distribute the quantized vectors throughout an episode. Thus, Algorithm 3 tries to uniformly distribute quantized vectors according to the maximum batch time of an episode. By considering the maximum batch time of each episode, Algorithm 3 can adaptively distribute quantized vectors. Algorithm 3 Compute J (t) 1: Input: For given the maximum batch time T, the number of codebook nc, and the current timestep t 2: if t == 0 then 3: d = nc/T 4: r = nc mod T 5: is = dn T 6: Keep the values of d, r, is until the end of the episode 7: end if 8: if d 1 then 9: J (t) = d t : 1 : d (t + 1) 10: if t < r then 11: Append J (t) with is + t 12: end if 13: else 14: J (t) = t nc/T 15: end if 16: return J (t) For given the maximum batch time tmax and the number of codebook nc, Line#4 and Line#5 in Algorithm 3 compute the quotient and remainder, respectively. Line#8 compute an array with increasing order starting from the index d t to d (t + 1). Line#10 additionally designate the remaining quantized vectors to the early time of an episode. Algorithm 4 presents training algorithm to update encoder fϕ, decoder fψ, and quantized embeddings e in VQ-VAE. In Algorithm 4, we also present a separate update for DV Q, which estimates the value of each quantized vector in VQ-VAE. In addition, the overall training algorithm including training for VQ-VAE is presented in Algorithm 5. LAGMA: LAtent Goal-guided Multi-Agent Reinforcement Learning Algorithm 4 Training algorithm for VQ-VAE and DV Q Parameter: learning rate α and batch-size B Input: B sample trajectories [T ]B i=1 from replay buffer D, the current episode number nepi, an update interval nvq freq for VQ-VAE and ncd freq for DV Q update interval. for t = 0 to T do for i = 1 to B do if mod(nepi, ncd freq) then Get Ri t and update DV Q with Eq. (7). end if if mod(nepi, nvq freq) then Get current state st [T ]i=1 and compute J (t) via Algorithm 3 Compute Ltot V Q via Eq. (6) with fϕ, fψ, and e. end if end for end for if mod(nepi, nvq freq) then Update ϕ ϕ α Ltot V Q ϕ , ψ ψ α Ltot V Q ψ , e e α Ltot V Q e end if Algorithm 5 Training algorithm for LAGMA. 
Algorithm 5 Training algorithm for LAGMA
1: Parameter: batch size $B$ and the maximum training time $T_{env}$
2: Input: $Q^i_\theta$, the individual Q-networks of the $n$ agents; the replay buffer $D$; the extended VQ codebook $D_{VQ}$; and the sequence buffer $D_{seq}$
3: Initialize network parameters $\theta$, $\phi$, $\psi$, $e$
4: while $t_{env} \leq T_{env}$ do
5:   Interact with the environment via an $\epsilon$-greedy policy based on $[Q^i_\theta]^n_{i=1}$ and get a trajectory $\mathcal{T}$
6:   Append $\mathcal{T}$ to $D$
7:   Get $B$ sample trajectories $[\mathcal{T}]^B_{i=1} \sim D$
8:   For a given $[\mathcal{T}]^B_{i=1}$, run the MARL training algorithm and Algorithm 1 to update $\theta$ with Eq. (9), and Algorithm 4 to update $\phi$, $\psi$, $e$ with Eq. (6)
9: end while

F. Generalizability of LAGMA

F.1. Policy robustness test

To assess the robustness of the policy learned by our model, we designed tasks with the same unit configuration but highly varied initial positions that the agents had not encountered during training, i.e., unseen maps. With these settings, the opponent agents also experience totally different relative positions and thus make different decisions. We set the different initial positions for this evaluation as follows. As illustrated in Figure 12, the initial position of each task is significantly moved from the nominal position experienced during the training phase.

Figure 12: Problem settings for the policy robustness test. Team 1 represents the initial positions of the RL agents, while Team 2 represents the initial positions of the opponents.

For comparison, we conduct the same experiment for other baselines, such as QMIX (Rashid et al., 2018) and LDSA (Yang et al., 2022). The model for each algorithm is trained for $T_{env}$ = 1M on the nominal MMM2 map (denoted as Nominal) and then evaluated under various problem settings: NW (hard), NW, SW, and SW (hard). Each evaluation is conducted with 5 different seeds and 32 tests, and Table 5 shows the mean and variance of the win rate for each case. In Table 5, LAGMA shows not only the best performance but also robust performance across the various problem settings.

Table 5: Policy robustness test on SMAC MMM2 (super hard). All models are trained for $T_{env}$ = 1M. The percentage (%) in parentheses represents the ratio compared to the nominal mean value.

| Model | NW (hard) | NW | Nominal | SW | SW (hard) |
|---|---|---|---|---|---|
| LAGMA | 0.275 ± 0.064 (28.2%) | 0.500 ± 0.104 (51.3%) | 0.975 ± 0.026 | 0.556 ± 0.051 (57.1%) | 0.394 ± 0.042 (40.4%) |
| QMIX | 0.050 ± 0.036 (13.1%) | 0.138 ± 0.100 (36.1%) | 0.381 ± 0.078 | 0.194 ± 0.092 (50.8%) | 0.156 ± 0.058 (41.0%) |
| LDSA | 0.000 ± 0.000 (0.0%) | 0.081 ± 0.047 (18.3%) | 0.444 ± 0.107 | 0.063 ± 0.049 (14.1%) | 0.081 ± 0.028 (18.3%) |

The fast learning of LAGMA is attributed to the latent goal-guided incentive, which generates an accurate TD-target by utilizing the values of semantically similar states projected onto the same quantized vector. Because LAGMA utilizes the value of semantically similar states rather than the specific states when learning Q-values, different yet semantically similar states tend to have similar Q-values, yielding generalizable policies. In this manner, LAGMA enables further exploration rather than solely enforcing exploitation of an identified state trajectory. Thus, even though transitions toward a goal-reaching trajectory are encouraged during training, the policy learned by LAGMA does not overfit to specific trajectories and exhibits robustness to unseen maps.

F.2. Scalability test

LAGMA can be adopted for large-scale problems without any modification. VQ-VAE takes a global state as an input and projects it into the quantized latent space. Thus, in large-scale problems, only the input size differs from tasks with a small number of agents.
In addition, many MARL tasks involve high-dimensional global inputs, as presented in Table 1 in the manuscript. To assess the scalability of LAGMA, we conduct additional experiments on the 27m vs 30m SMAC task, whose state dimension is 1170. Figure 13 illustrates the performance of LAGMA. In Figure 13, LAGMA maintains efficient learning even when applied to this large-scale problem, using the same hyperparameter settings as for small-scale problems such as 5m vs 6m.

Figure 13: Performance on a large-scale problem (27m vs 30m SMAC); (a) learning curve.

F.3. Generalizability test on problems with diverse semantic goals

To further demonstrate the generalizability of LAGMA, we conducted additional experiments on another benchmark, SMACv2 (Ellis et al., 2024), which introduces diversity in initial positions and unit combinations within the identical task. Thus, from the perspective of the latent space, SMACv2 tasks may encompass multiple distinct goals, even within the same task. For evaluation, we adopt the same hyperparameters as those for 3 vs 1WK presented in Table 3 in the manuscript, except for D = 4 and n^cd_freq = 40.

Figure 14: Performance evaluation on SMACv2; (a) protoss 5 vs 5, (b) protoss 10 vs 11.

In Figure 14, LAGMA shows comparable or better performance than the baseline algorithms, but it does not exhibit the distinctively strong performance observed on other benchmark problems. We attribute this result to the current LAGMA capturing reference trajectories toward a similar goal in the early training phase. Multi-objective (or multiple-goal) tasks may require diverse reference-trajectory generation, whereas the current LAGMA considers only the return of a trajectory when storing reference trajectories in D_seq. Thus, when trajectories toward different goals bifurcate from the same quantized vector, i.e., semantically similar states, they may not be captured by the current version of LAGMA if their returns are relatively low compared to those of the reference trajectories already stored in D_seq. Consequently, LAGMA may not exhibit strong effectiveness in such tasks until diverse reference trajectories toward different goals are stored for a given quantized vector. As an improvement, one may also consider the diversity of a trajectory when storing reference trajectories in D_seq. In addition, goal- or strategy-dependent agent-wise execution could enhance coordination in such cases, but it may lead to delayed learning in easy tasks. Studying this trade-off would be an interesting direction for future research.

G. Computational cost analysis

G.1. Resource usage

The introduction of the extended VQ codebook in LAGMA adds memory usage to the overall MARL framework. The memory usage depends on the codebook size (n_c), the number of reference trajectories (index sequences) saved (k) in the sequence buffer D_seq, their batch time length (T), the number of data points saved for the moving-average computation (m), and the data type. The memory usage of D_{τ_{χ_t}} and D_VQ is computed as follows.
D_{τ_{χ_t}}: byte(dtype) × n_c × k × T
D_VQ: byte(dtype) × n_c × m

For example, when m = 100, n_c = 64, k = 30, and T = T_max = 150 (the maximum timestep defined by the environment), with D_{τ_{χ_t}} and D_VQ using data types int64 and float32, respectively, the resource usage introduced by the extended VQ codebook is computed as follows (a short snippet recomputing these figures is given at the end of this appendix):

(D_{τ_{χ_t}})_max: 8 (int64) × 64 × 30 × 150 = 2,304,000 bytes ≈ 2.2 MiB
D_VQ: 4 (float32) × 64 × 100 = 25,600 bytes ≈ 25 KiB

Here, (D_{τ_{χ_t}})_max represents the maximum possible value; the actual value may vary depending on the goal-reaching trajectories of each task. The resource requirement introduced by the extended codebook is marginal compared to that of the replay buffer and the GPU memory capacity, e.g., 24 GiB on a GeForce RTX 3090. Note that none of these memory usages depends on the state dimension, since only the index (z) of the quantized vector (x_q) of a sequence is stored in D_{τ_{χ_t}}.

G.2. Training time analysis

In LAGMA, we need to conduct additional updates for the VQ-VAE and the extended codebook; thus, their update frequency affects the overall training time. In the manuscript, we use the identical update interval n^vq_freq = 10 for both the VQ-VAE and the codebook, i.e., one update every 10 MARL training iterations. Table 6 reports the overall training time taken by various algorithms on diverse tasks. A GeForce RTX 3090 is used for 5m vs 6m and a GeForce RTX 4090 for 8m(sparse) and MMM2. For the 8m(sparse) task, the training time varies according to whether the learned model finds a policy achieving the common goal; thus, the training time of a successful case is presented for each algorithm.

Table 6: Training time for each model in various SMAC maps (in hours).

Model    5m vs 6m (2M)    8m(sparse) (3M)    MMM2 (3M)
EMC      8.6              11.8               12.0
MASER    12.7             13.4               20.5
LDSA     5.6              11.0               9.8
LAGMA    10.5             12.6               17.7

Here, the numbers in parentheses represent the maximum training time (T_env) for each task. In Table 6, we can see that training LAGMA does not take much more time than existing baseline algorithms. Therefore, we conclude that the introduction of the VQ-VAE and the extended codebook in LAGMA imposes an acceptable computational burden, with only marginal increases in resource requirements.
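As a quick sanity check of the memory estimates in Section G.1, the following snippet recomputes the two figures above; the variable names are illustrative, and the element sizes (8 bytes for int64, 4 bytes for float32) follow the stated data types.

```python
# Recompute the extended-codebook memory estimates from Section G.1.
n_c, k, T, m = 64, 30, 150, 100

bytes_tau = 8 * n_c * k * T   # D_tau stores int64 codebook indices (8 bytes each)
bytes_vq = 4 * n_c * m        # D_VQ stores float32 values (4 bytes each)

print(f"(D_tau)_max: {bytes_tau / 2**20:.2f} MiB")  # ~2.20 MiB
print(f"D_VQ:        {bytes_vq / 2**10:.1f} KiB")   # ~25.0 KiB
```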