# Wasserstein Unsupervised Reinforcement Learning

Shuncheng He, Yuhang Jiang, Hongchang Zhang, Jianzhun Shao, Xiangyang Ji
Tsinghua University
hesc16@mails.tsinghua.edu.cn

## Abstract

Unsupervised reinforcement learning aims to train agents to learn a handful of policies or skills in environments without external reward. These pre-trained policies can accelerate learning when endowed with external reward, and can also be used as primitive options in hierarchical reinforcement learning. Conventional approaches to unsupervised skill discovery feed a latent variable to the agent and shed its empowerment on the agent's behavior through mutual information (MI) maximization. However, the policies learned by MI-based methods cannot sufficiently explore the state space, even though they can be successfully identified from each other. Therefore, we propose a new framework, Wasserstein unsupervised reinforcement learning (WURL), in which we directly maximize the distance between the state distributions induced by different policies. Additionally, we overcome the difficulties of simultaneously training N (N > 2) policies and of amortizing the overall reward to each step. Experiments show that policies learned by our approach outperform MI-based methods on the metric of Wasserstein distance while keeping high discriminability. Furthermore, agents trained by WURL can sufficiently explore the state space in mazes and MuJoCo tasks, and the pre-trained policies can be applied to downstream tasks by hierarchical learning.

## Introduction

Autonomous agents can learn to solve challenging tasks by deep reinforcement learning, including locomotion and manipulation (Lillicrap et al. 2015; Haarnoja et al. 2018) and game playing (Mnih et al. 2015; Silver et al. 2016). The reward signal specified by the task provides the supervision in reinforcement learning. However, recent research reveals that agents can acquire diverse skills or policies in the absence of a reward signal (Eysenbach et al. 2019; Gregor, Rezende, and Wierstra 2016; Achiam et al. 2018). This setting is called unsupervised reinforcement learning.

Practical applications of unsupervised reinforcement learning have been studied. The skills learned without reward can serve as primitive options for hierarchical RL in long-horizon tasks (Eysenbach et al. 2019), and these primitive options may be useful for transferring across different tasks. In model-based RL, the learned skills enable the agent to plan in the skill space (Sharma et al. 2020). Unsupervised learning methods may also alleviate the cost of supervision: in certain cases, designing a reward function requires human supervision (Christiano et al. 2017). Moreover, the intrinsic reward derived from unsupervised learning can enhance exploration when combined with the task reward (Houthooft et al. 2016; Gupta et al. 2018).

The key point of unsupervised reinforcement learning is how to learn a set of policies that sufficiently explores the state space. Previous methods make use of a latent variable and maximize the mutual information (MI) between the latent variable and the behavior (Eysenbach et al. 2019). Consequently, the diversity in the latent space is cast into the state space. These methods are able to obtain different skills that are distinguishable from each other.
However, limitations of MI-based methods have been pointed out: the diversity of the learned skills is restricted by the Shannon entropy of the latent variable. In addition, discriminability of skills does not always lead to the goal of sufficiently exploring the environment. In this paper, we propose a new approach to unsupervised reinforcement learning that is essentially different from MI-based methods. The motivation of our method is to increase the discrepancy between learned policies so that the agents explore the state space extensively and reach states as far as possible from those reached by other policies. This motivates us to employ a geometry-aware metric to measure the discrepancy between the state distributions induced by different policies. In the recent literature on generative modeling, the optimal transport (OT) cost has emerged as a way to measure distribution distance (Tolstikhin et al. 2018), since it provides a more geometry-aware topology than the f-divergences used in, e.g., f-GAN (Nowozin, Cseke, and Tomioka 2016). We therefore choose the Wasserstein distance, a well-studied distance from optimal transport, to measure the distance between different policies in unsupervised reinforcement learning. By maximizing the Wasserstein distance, agents equipped with different policies are driven to enter different areas of the state space and to keep as far as possible from each other, yielding greater diversity.

Our contributions are four-fold. First, we propose a novel framework adopting the Wasserstein distance as the discrepancy measure for unsupervised reinforcement learning; the framework is designed to be compatible with various Wasserstein distance estimation algorithms, in both primal and dual form. Second, since the Wasserstein distance is defined on two distributions, we extend our framework to learning multiple policies. Third, to address the sparse reward problem provoked by Wasserstein distance estimation, we devise an algorithm that amortizes the Wasserstein distance between two batches of samples into stepwise intrinsic rewards. Fourth, we empirically demonstrate that our approach surpasses the diversity of MI-based methods and can cover the state space by incremental learning.

## Distribution Discrepancy Measure

In this section, we briefly review the Wasserstein distance and its estimation methods.

### Wasserstein Distance and Optimal Transport

Measuring the discrepancy or distance between two probability distributions can be seen as a transport problem (Villani 2009). Consider two distributions $p, q$ on domains $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathcal{Y} \subseteq \mathbb{R}^m$. Let $\Gamma[p, q]$ be the set of all distributions on the product space $\mathcal{X} \times \mathcal{Y}$ whose marginal distributions on $\mathcal{X}$ and $\mathcal{Y}$ are $p$ and $q$, respectively. Given a proper cost function $c(x, y): \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ for moving mass from $x$ to $y$, the Wasserstein distance is defined as

$$
W_c(p, q) = \inf_{\gamma \in \Gamma[p, q]} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\gamma. \tag{1}
$$

The joint distribution family $\Gamma[p, q]$ is the family of plans for transporting probability mass from $p$ to $q$, and minimizing the transport cost is an optimal transport problem. The optimization problem suffers from super-cubic complexity: when $\mathcal{X}$ and $\mathcal{Y}$ are finite discrete sets, it reduces to a linear program (Cuturi 2013; Genevay et al. 2016). To circumvent this difficulty, a regularizer can be added to the objective, giving the smoothed Wasserstein distance

$$
W_c(p, q) = \inf_{\gamma \in \Gamma[p, q]} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\gamma + \beta\, \mathrm{KL}(\gamma \,\|\, p \otimes q). \tag{2}
$$

Minimizing the cost together with the KL divergence encourages the joint distribution $\gamma(x, y)$ to stay close to $p(x)q(y)$. As $\beta \to 0$, the smoothed distance converges to $W_c(p, q)$ (Pacchiano et al. 2020). The objective is convex when $c(x, y)$ is a proper cost function, so the infimum can be computed from either the primal or the dual formulation. In the following sections, we introduce practical methods for estimating the Wasserstein distance from distribution samples.
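To make the primal problem in Eq. (1) concrete, the following is a minimal sketch (ours, not part of the paper) that computes the exact Wasserstein distance between two small empirical distributions by solving the transport linear program with `scipy.optimize.linprog`; the point sets, uniform weights, and Euclidean cost are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def discrete_wasserstein(p, q, x, y):
    """Exact W_c between discrete distributions p (on points x) and q (on points y),
    with c(x, y) = ||x - y||_2, by solving the transport linear program of Eq. (1)."""
    n, m = len(p), len(q)
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # cost matrix C[i, j]
    # Decision variable: the transport plan gamma, flattened row-major to length n*m.
    A_rows = np.kron(np.eye(n), np.ones((1, m)))   # sum_j gamma[i, j] = p[i]
    A_cols = np.kron(np.ones((1, n)), np.eye(m))   # sum_i gamma[i, j] = q[j]
    res = linprog(C.ravel(),
                  A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

# Two toy empirical distributions on the plane.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 2)), rng.normal(loc=3.0, size=(4, 2))
p, q = np.full(5, 1 / 5), np.full(4, 1 / 4)
print(discrete_wasserstein(p, q, x, y))
```

Solvers of this kind scale poorly with the number of support points, which is exactly why the sliced, projected, and dual estimators reviewed next are attractive.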
### Primal Form Estimation

Solving the optimal transport problem from the primal formulation is hard in general. However, the problem has an analytical solution when the distributions live on a one-dimensional Euclidean space and the cost function is an $\ell_p$ metric ($p > 0$) (Rowland et al. 2019). Inspired by 1-D Wasserstein distance estimation, we may estimate the Wasserstein distance in high-dimensional Euclidean spaces by projecting distributions onto $\mathbb{R}$. Suppose $p, q$ are probability distributions on $\mathbb{R}^d$. For a vector $v$ on the unit sphere $S^{d-1}$ in $\mathbb{R}^d$, the projected distribution $\Pi_v(p)$ is the marginal distribution along $v$, obtained by integrating $p$ over the orthogonal complement of $v$. Estimating the Wasserstein distance in the projected 1-D space yields the sliced Wasserstein distance (SWD) (Wu et al. 2019; Kolouri, Rohde, and Hoffmann 2018):

$$
SW(p, q) = \mathbb{E}_{v \sim U(S^{d-1})} \left[ W(\Pi_v(p), \Pi_v(q)) \right], \tag{3}
$$

where $U(S^{d-1})$ denotes the uniform distribution on the unit sphere $S^{d-1}$. In practice, the projection of an empirical distribution $\hat{p} = \frac{1}{N}\sum_{n=1}^{N} \delta_{x_n}$ can be written as $\Pi_v(\hat{p}) = \frac{1}{N}\sum_{n=1}^{N} \delta_{\langle x_n, v\rangle}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product and $\delta$ the Dirac distribution. To reduce the estimation bias of SWD, Rowland et al. (2019) proposed the projected Wasserstein distance (PWD), which disentangles the coupling computation from the cost computation: PWD obtains the optimal coupling by projecting samples onto $\mathbb{R}$, but evaluates the costs in the original space $\mathbb{R}^d$ rather than in the projected space.

### Dual Form Estimation

Define the set $A = \{(\mu, \nu) \mid \forall (x, y) \in \mathcal{X} \times \mathcal{Y}: \mu(x) - \nu(y) \le c(x, y)\}$. By Fenchel-Rockafellar duality, the dual form of the Wasserstein distance is (Villani 2009)

$$
W_c(p, q) = \sup_{(\mu, \nu) \in A} \mathbb{E}_{x \sim p(x),\, y \sim q(y)} \left[ \mu(x) - \nu(y) \right], \tag{4}
$$

where $\mu: \mathcal{X} \to \mathbb{R}$ and $\nu: \mathcal{Y} \to \mathbb{R}$ are continuous functions on their domains. The dual formulation provides a neural approach to estimating the Wasserstein distance, circumventing the difficulty of finding the optimal transport plan between two probability distributions. The dual form of the smoothed Wasserstein distance is even more convenient, since it places no constraints on $\mu, \nu$:

$$
W_c(p, q) = \sup_{\mu, \nu} \mathbb{E}_{x \sim p(x),\, y \sim q(y)} \left[ \mu(x) - \nu(y) - \beta \exp\!\left( \frac{\mu(x) - \nu(y) - c(x, y)}{\beta} \right) \right]. \tag{5}
$$

An alternative dual formulation emerges when $\mathcal{X} = \mathcal{Y}$, which is the most common case: the two distributions are defined on the same space. Under this assumption, the Kantorovich-Rubinstein duality gives another dual objective in which only one function with a Lipschitz constraint is optimized (Villani 2009):

$$
W_c(p, q) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim p(x),\, y \sim q(y)} \left[ f(x) - f(y) \right]. \tag{6}
$$

The maximum of the dual problem theoretically equals the minimum of the primal problem. Sliced and projected Wasserstein distances come with no such guarantee; nonetheless, these primal form estimators show competitive accuracy empirically.
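As a concrete reference for Eq. (3), here is a minimal NumPy sketch of a sliced Wasserstein estimator between two equally sized sample batches; the function name, the number of projections, and the sort-based 1-D coupling are our illustrative choices rather than the authors' implementation.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=50, seed=None):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance (Eq. 3) between the
    empirical distributions of batches X, Y of shape (N, d), assuming len(X) == len(Y)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_projections):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)                     # uniform direction on S^{d-1}
        x_proj, y_proj = np.sort(X @ v), np.sort(Y @ v)
        total += np.mean(np.abs(x_proj - y_proj))  # 1-D OT couples sorted samples
    return total / n_projections
```

PWD differs only in the last step: the pairing induced by the sorted projections is kept, but the cost of each matched pair is then evaluated in the original space $\mathbb{R}^d$.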
## Wasserstein Unsupervised Reinforcement Learning

### MI-based Unsupervised Reinforcement Learning

Traditional unsupervised reinforcement learning adopts mutual information to seek diverse skills. The mutual information between two random variables is popularly perceived as a degree of empowerment (Gregor, Rezende, and Wierstra 2016; Kwon 2021):

$$
I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X).
$$

For instance, DIAYN (Eysenbach et al. 2019) mainly aims to maximize $I(S; Z)$, the mutual information between the latent variable and the states reached by the agent. Conventionally, the prior of the latent variable, $p(z)$, is fixed to a uniform distribution, which has maximal entropy. Maximizing $I(S; Z)$ broadcasts the diversity in $Z$ to the states $S$ through the policy $\pi(a \mid s, z)$. However, estimating mutual information involves the intractable posterior $p(z \mid s)$. With the tool of variational inference, a feasible lower bound is obtained by approximating $p(z \mid s)$ with $q_\phi(z \mid s)$. We call $q_\phi(z \mid s)$ a learned discriminator that tries to recognize, from the behavior, the latent variable behind the policy. For example, when $p(z)$ is a categorical distribution, the discriminator is a neural network predicting labels as in a classification task. From the optimization point of view, the agent and the discriminator are trained cooperatively to maximize the same objective, and this learning process ends as soon as the discriminator can successfully infer the $z$ behind the policy.

However, the learned policies are not necessarily diverse. We explain this claim with a simple example. Suppose the latent variable $Z$ is randomly selected from $\{0, 1\}$. The mutual information $I(S; Z)$ then equals the Jensen-Shannon divergence between the two conditional distributions $q_0 = p(s \mid Z = 0)$ and $q_1 = p(s \mid Z = 1)$. As illustrated in Fig. 1, the JS divergence reaches its maximal value as soon as the supports of $q_0$ and $q_1$ no longer overlap. Moreover, the decomposition $I(S; Z) = H(Z) - H(Z \mid S)$ implies that $I(S; Z)$ is upper bounded by $H(Z)$, which is fixed by a predetermined distribution.

*Figure 1: Examples of Jensen-Shannon divergence and Wasserstein distance between $q_0$ and $q_1$, where $q_0$ and $q_1$ are uniform distributions over a round plate with radius $r$ and the distance between their centers varies from $0$ to $3r$.*

To address this issue, we propose to use the Wasserstein distance as an intrinsic reward that encourages the agent to explore farther states. In Fig. 1, the Wasserstein distance conveys how far apart the two distributions are, while the JS divergence does not. Our method therefore drives the agent to reach areas of the unknown space of valid states that are as far apart as possible.
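The saturation argument can also be checked numerically. The toy sketch below (ours; it uses 1-D Gaussians instead of the 2-D plates of Fig. 1) discretizes two densities at increasing separation and shows the JS divergence plateauing near log 2 while the 1-Wasserstein distance keeps growing.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between two discretized densities."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q, grid):
    """W_1 between two discretized 1-D densities, via the CDF formula."""
    p, q = p / p.sum(), q / q.sum()
    dx = grid[1] - grid[0]
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx

grid = np.linspace(-10.0, 30.0, 4000)
gauss = lambda mu: np.exp(-0.5 * (grid - mu) ** 2)
for sep in [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]:
    p, q = gauss(0.0), gauss(sep)
    print(f"sep={sep:5.1f}  JS={js_divergence(p, q):.3f}  W1={wasserstein_1d(p, q, grid):.2f}")
# JS saturates near log(2) ~ 0.693 once the supports barely overlap,
# while W1 keeps growing linearly with the separation.
```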
### Wasserstein Distance as Intrinsic Reward

The Wasserstein distance measures the discrepancy between exactly two distributions. In the most naive version of Wasserstein unsupervised reinforcement learning (WURL), we therefore train a pair of policies $\pi_{\theta_1}, \pi_{\theta_2}$ and use the Wasserstein distance between their state distributions $p_{\pi_{\theta_1}}(s), p_{\pi_{\theta_2}}(s)$ as their intrinsic reward. As Pacchiano et al. (2020) noted, dual form estimation allows us to assign a reward at every step using test functions. We adopt two variants of the dual formulation: TF1 (Arjovsky, Chintala, and Bottou 2017) and TF2 (Abdullah, Pacchiano, and Draief 2018). TF1 has one test function $f$ with a Lipschitz constraint, optimizing the objective in Eqn. 6, while TF2 has two unconstrained test functions $\mu, \nu$ trained according to Eqn. 5. Once the test functions approach the optimal dual functions, they give a score to every state. By splitting the maximization objective in Eqn. 6, we can assign $f(x)$ as the reward for policy 1 and $-f(y)$ as the reward for policy 2, so as to push $W_c(p, q)$ higher; a similar treatment applies to TF2. Combining RL training and test function training, we obtain Alg. 1.

**Algorithm 1: Naive WURL (test function)**

    Initialize two policies π_θ1, π_θ2 and replay buffers D_1 = {}, D_2 = {}. Initialize the test functions.
    while the maximum number of episodes is not reached do
        Select policy l randomly or in turn. done = False.
        while not done do
            Sample an action from π_θl, execute it, and receive s' and done.
            if l = 1 then set reward r = f(s') (TF1) or r = µ(s') (TF2)
            else set reward r = -f(s') (TF1) or r = -ν(s') (TF2)
            D_l = D_l ∪ {(s, a, s', r)}.
            Train π_θl with SAC.
            Train the test functions by sampling from D_1, D_2.
        end while
    end while

In contrast, primal form estimation can only be carried out after the policy rollout, by collecting the states of one episode and computing the distance to a batch of states sampled from the replay buffer of the other policy. We refer to this pattern of reward assignment as Alg. 2. Sparse reward becomes a challenge under this scheme, since the agent receives no reward until the episode ends; we address this issue later.

**Algorithm 2: Naive WURL (final reward)**

    Initialize two policies π_θ1, π_θ2 and replay buffers D_1 = {}, D_2 = {}.
    while the maximum number of episodes is not reached do
        Select policy l randomly or in turn. Set trajectory S = {}. done = False.
        while not done do
            Sample an action from π_θl, execute it, and receive s' and done.
            S = S ∪ {s'}. Set reward r = 0.
            if done then
                Sample a target batch of states T from D_{3-l}.
                Set reward r = W(S, T) by any Wasserstein distance estimation method.
            end if
            D_l = D_l ∪ {(s, a, s', r)}.
            Train π_θl with SAC.
            Train the test functions if the estimation method is a dual form method.
        end while
    end while

The backend RL algorithm can vary: off-policy algorithms such as Soft Actor-Critic (SAC) (Haarnoja et al. 2018) and on-policy algorithms such as TRPO (Schulman et al. 2015) and PPO (Schulman et al. 2017) can all be deployed with WURL. Since SAC enjoys high sample efficiency and suits environments with continuous action spaces, we choose SAC as the backend RL algorithm in our experiments.

### Learning with Multiple Policies

Learning $N$ ($N > 2$) policies simultaneously requires each policy $i$ to keep its distance from all other policies. Ideally, policy $i$ maximizes the average Wasserstein distance between its induced state distribution $p_i$ and the other state distributions, $\frac{1}{N-1}\sum_{j \ne i} W(p_i, p_j)$. Alternatively, we could use the Wasserstein distance between $p_i$ and the average distribution of all the others; however, this incurs underestimation due to the inequality

$$
W\!\left(p_i, \frac{1}{N-1}\sum_{j \ne i} p_j\right) \le \frac{1}{N-1}\sum_{j \ne i} W(p_i, p_j). \tag{7}
$$

In practice we use $\min_{j \ne i} W(p_i, p_j)$ as the reward, keeping the current policy away from its nearest neighbor. As the number of policies grows, the number of pairwise distance computations grows as $O(N^2)$. The dual form requires $O(N^2)$ test functions, which raises memory and time concerns, since every test function is a neural network that needs training; the reward computation of the primal form also scales as $O(N^2)$. Fortunately, the sliced or projected Wasserstein distance gives us a faster and more lightweight solution that requires neither training nor inference through neural networks.
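A minimal sketch of this min-over-others reward, reusing the `sliced_wasserstein` helper sketched earlier; the buffer layout (one NumPy array of visited states per policy), the batch size, and the function name are assumptions made for illustration.

```python
import numpy as np

def nearest_policy_reward(i, state_buffers, batch=256, n_projections=50, seed=None):
    """Intrinsic reward for policy i: the sliced Wasserstein distance to the closest
    other policy, min_{j != i} W(p_i, p_j), estimated from equally sized random
    batches of visited states (state_buffers[j] holds states visited by policy j)."""
    rng = np.random.default_rng(seed)
    sample = lambda buf: buf[rng.choice(len(buf), size=batch, replace=True)]
    own = sample(state_buffers[i])
    return min(
        sliced_wasserstein(own, sample(state_buffers[j]), n_projections)
        for j in range(len(state_buffers)) if j != i
    )
```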
### Amortized Reward

In contrast to test functions, primal form estimation of the Wasserstein distance produces a single reward at the end of each episode, since the distance cannot be estimated from a single sample. This incurs sparse rewards, which pose challenges for reinforcement learning and may impair the performance of value-based RL algorithms (Andrychowicz et al. 2017).

Noting that primal form estimation automatically yields an optimal matching plan, we can decompose the overall Wasserstein distance over the individual samples. Formally, suppose the batch $S = \{x_n\}_{n=1}^{N}$ is the set of states in one episode and the batch $T = \{y_m\}_{m=1}^{M}$ is a sample set from the target distribution. Let $P \in \mathbb{R}^{N \times M}$ be the optimal matching matrix given the cost matrix $C \in \mathbb{R}^{N \times M}$. Denoting by $P_i$ the $i$-th row of $P$, the sample $x_i$ receives its own credit $(P_i \odot C_i)\,\mathbf{1} = \sum_{j=1}^{M} P_{ij} C_{ij}$, where $\odot$ is the elementwise product and $\mathbf{1}$ is the all-ones vector. Combining this with the projected Wasserstein distance, we obtain the following procedure for crediting the amortized reward; the matching matrix computation is stated in the Appendix.

**Algorithm 3: Amortized Reward Crediting**

    Given a source batch S = {x_n}, n = 1..N, and a target batch T = {y_m}, m = 1..M.
    Compute the cost matrix C (N × M).
    for k = 1 to K do
        Sample v_k from U(S^{d-1}).
        Compute the projected samples x̂_n^(k) = <x_n, v_k>, ŷ_m^(k) = <y_m, v_k>.
        Compute the matching matrix P^(k) (N × M) from the projected samples.
        Compute the reward vector r^(k) = (P^(k) ⊙ C) 1.
    end for
    Return the mean reward vector r = (1/K) Σ_k r^(k).
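The crediting step of Algorithm 3 can be sketched as follows, assuming equally sized batches, coupling weights of $1/N$, and the sort-based projected matching used in the earlier sketches; this is an illustration, not the authors' implementation.

```python
import numpy as np

def amortized_rewards(S, T, n_projections=50, seed=None):
    """Per-state credits r (length N) for source states S (N, d) against a target batch
    T (M, d) with N == M. Each projection matches sorted source and target samples;
    a state's credit is the cost of its matched pair (in the original space) times 1/N,
    so the credits sum to a projected Wasserstein distance estimate."""
    rng = np.random.default_rng(seed)
    N, d = S.shape
    C = np.linalg.norm(S[:, None, :] - T[None, :, :], axis=-1)  # costs in R^d, not projected
    r = np.zeros(N)
    for _ in range(n_projections):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)
        src_order, tgt_order = np.argsort(S @ v), np.argsort(T @ v)
        r[src_order] += C[src_order, tgt_order] / N   # credit each matched pair
    return r / n_projections
```

These per-state credits replace the single end-of-episode reward of Alg. 2, giving every visited state a dense learning signal while preserving the episode-level distance estimate.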
### Training Schedule

Our proposed method can either train $N$ policies at the same time or train incrementally by introducing new policies one at a time. Given $N$ diverse policies, a new policy is trained to maximize the average Wasserstein distance to the other policies, using state samples collected from policies $1, 2, \ldots, N$ beforehand. The incremental training schedule provides the flexibility to extend the number of diverse policies, which is especially useful when we do not know in advance how many policies are suitable for a certain environment. In contrast, mutual information based unsupervised reinforcement learning is limited to a fixed number of policies that cannot easily be extended, due to the fixed neural network structure. A similar idea appears in Achiam et al. (2018), where skills are trained with a curriculum approach whose objective is to ease the difficulty of classifying a large number of skills; however, the maximum number of skills is still fixed in advance, so one cannot flexibly add new policies.

## Experiments

### Policy Diversity

We first examine our methods in a Free Run environment where a particle agent spawns at the center. The agent only controls its acceleration on the X-Y plane and observes its current position and velocity. The particle runs freely on the plane, subject to a velocity limit, for a fixed number of steps. We compare two groups of algorithms:

- mutual information as intrinsic reward: DIAYN, DADS;
- Wasserstein distance as intrinsic reward: TF1, TF2, PWD, APWD.

DADS (Sharma et al. 2020) is another mutual information based algorithm; it maximizes $I(S_{t+1}; Z \mid S_t)$ within a model-based framework and claims lower entropy of the learned skills than DIAYN. TF1 and TF2 are the two methods that adopt test functions optimizing Eqn. 6 and Eqn. 5, respectively. PWD (projected Wasserstein distance) uses a primal form Wasserstein distance estimation method, and APWD is the amortized version of PWD that avoids sparse rewards, as described in the algorithm section. In the setting of learning 10 policies, TF1 and TF2 are unavailable since they are not compatible with multiple policies.

We examine these methods from two aspects. First, we train a neural network (a discriminator) to distinguish each policy from rollout state samples, and compare the success rate of the discriminator. Second, we estimate the mean Wasserstein distance between the state distributions of every two different policies.
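The first metric, the discriminator success rate (DSR), can be estimated along the following lines; this is a minimal sketch in which a logistic-regression classifier stands in for the small neural discriminator actually used, and the layout of `state_buffers` (one array of visited states per policy) is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def discriminator_success_rate(state_buffers, test_size=0.25, seed=0):
    """Train a classifier to predict which policy produced each state and report
    its held-out accuracy (the DSR metric)."""
    X = np.concatenate(state_buffers)
    y = np.concatenate([np.full(len(b), i) for i, b in enumerate(state_buffers)])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```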
Fig. 2 shows that policies learned by the Wasserstein distance based algorithms generally have greater discriminability and larger distances between the state distributions of every two policies. Fig. 2 also demonstrates that the amortized reward improves performance.

*Figure 2: In the Free Run environment, (a) and (b) show the results of learning 2 policies across all algorithms, while (c) and (d) show the results of learning 10 policies at the same time. (a) and (c) show the success rate of the discriminator distinguishing policies; (b) and (d) show the mean Wasserstein distance between every two policies. Error bars represent standard deviations across 5 random runs.*

We also examine our algorithms on several MuJoCo tasks, including three classical locomotion environments and two customized point-mass environments in which a simplified ball agent wanders through maps with various landscapes and movable objects (see the Appendix for demonstrations). Considering the significantly larger action and state spaces of the MuJoCo environments, we replace the shared actor network $\pi(a \mid s, z)$, $z = \mathrm{OneHot}(i)$, in DIAYN and DADS with $N$ individual networks $\pi_i(a \mid s)$, $i = 1, \ldots, N$, while keeping the discriminator network unchanged. DIAYN and DADS with individual actors (DIAYN-I and DADS-I for short) enjoy a larger parameter space and are capable of learning a more diverse set of policies in MuJoCo tasks.

Table 1 shows that in most cases all three unsupervised RL approaches yield highly distinguishable policies. However, APWD achieves better performance on the Wasserstein distance metric, which means that the policies learned by APWD are more distantly distributed in the state space. Fig. 3 visualizes the three policy sets in the MuJoCo Ant environment and clearly shows that APWD encourages the policies to stay far from each other. The results support our hypothesis that mutual information based intrinsic rewards are unable to drive the policies far from each other once the JS divergence saturates.

| Algorithm | Metric | HalfCheetah | Ant | Humanoid |
| --- | --- | --- | --- | --- |
| APWD (Ours) | DSR | 0.981 ± 0.010 | 0.989 ± 0.002 | 0.998 ± 0.002 |
| APWD (Ours) | WD | 39.21 ± 0.44 | 8.57 ± 0.39 | 556.96 ± 27.78 |
| DIAYN-I | DSR | 0.978 ± 0.006 | 0.985 ± 0.001 | 0.999 ± 0.001 |
| DIAYN-I | WD | 13.01 ± 2.89 | 4.50 ± 0.27 | 279.94 ± 27.98 |
| DADS-I | DSR | 0.994 ± 0.003 | 0.793 ± 0.035 | 0.999 ± 0.001 |
| DADS-I | WD | 11.75 ± 1.90 | 7.53 ± 0.27 | 420.67 ± 56.83 |

*Table 1: Policy diversity on three MuJoCo locomotion environments. DSR denotes the discriminator success rate and WD denotes the mean Wasserstein distance between two policies.*

*Figure 3: Visualized policies in the MuJoCo Ant environment. The left column shows the rendered trajectories of 5 out of 10 total policies; the right column shows the X-Y position trajectories of all 10 policies in different colors.*

### Incremental Learning

Our method provides the flexibility to enlarge the policy set: new policies can be trained one by one, building on the policies already in hand. We show the process of incremental learning in Fig. 4 to illustrate how the newly learned policies gradually fill the state space. As new policies are added, the particle agent in Free Run and Tree Maze runs in new directions and into new areas, and behaves differently (dithering around, turning, etc.).

*Figure 4: Results of training a set of policies incrementally. From left to right, new policies tend to reach new territory; as the number of policies $n$ grows, the policies gradually fill the state space.*

### Downstream Tasks

In the previous unsupervised RL literature, the policies (or skills) learned without any task reward are utilized either in hierarchical reinforcement learning or in planning (Eysenbach et al. 2019; Sharma et al. 2020). Likewise, we examine our methods on downstream tasks, namely two navigation tasks based on the particle environment Free Run and on MuJoCo Ant. Both tasks require the agent to reach specific goals in a given order, and the agent receives a +50 reward for each goal reached. In the Free Run navigation task, we additionally penalize each step with a small negative reward to encourage the agent to finish the task as quickly as possible.

To tackle these navigation tasks with pre-trained policies, we employ a meta-policy that chooses one sub-policy to execute for $H$ steps ($H$ is fixed in advance in our tasks). The meta-policy observes the agent state every $H$ steps, chooses an action corresponding to a sub-policy, and then receives the reward that the sub-policy collects during the successive $H$ steps. The meta-policy can therefore be trained with any compatible reinforcement learning algorithm; in our experiments, we adopt PPO as the meta-policy trainer (Schulman et al. 2017).
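For concreteness, the execution scheme of the meta-policy can be sketched as follows, assuming a classic Gym-style environment interface and callable policies; the function and argument names are ours.

```python
def hierarchical_rollout(env, meta_policy, sub_policies, H, max_steps=1000):
    """Run one episode in which a meta-policy picks a pre-trained sub-policy every H steps
    and is credited with the reward accumulated over those H steps.
    Assumes env.reset() -> obs and env.step(a) -> (obs, reward, done, info).
    Returns the (meta_obs, sub_policy_index, accumulated_reward) transitions."""
    obs, done, t, meta_transitions = env.reset(), False, 0, []
    while not done and t < max_steps:
        k = meta_policy(obs)                      # meta-action: index of a sub-policy
        meta_obs, acc_reward = obs, 0.0
        for _ in range(H):
            obs, r, done, _ = env.step(sub_policies[k](obs))
            acc_reward += r
            t += 1
            if done:
                break
        meta_transitions.append((meta_obs, k, acc_reward))
    return meta_transitions
```

These transitions are exactly what an on-policy trainer such as PPO would consume to update the meta-policy.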
Table 2 presents the results on the Free Run and MuJoCo Ant navigation tasks. The 10 policies pre-trained with DIAYN-I, DADS-I, and our proposed method APWD serve as the sub-policies in the hierarchical framework. Since our method yields more diverse and more distant sub-policies, the APWD-based hierarchical policy achieves the highest reward in both navigation tasks.

| Algorithm | Free Run | Ant |
| --- | --- | --- |
| APWD (Ours) | 125.56 ± 12.63 | 100.00 ± 18.03 |
| DIAYN-I | 15.06 ± 26.97 | 35.00 ± 7.07 |
| DADS-I | 109.78 ± 27.08 | 46.00 ± 10.84 |

*Table 2: Rewards of meta-policies in two hierarchical RL scenarios. Each meta-policy is trained with a sub-policy set containing 10 policies pre-trained with the listed unsupervised RL algorithm.*

## Related Work

Learning in a reward-free environment has long attracted reinforcement learning researchers. Early research takes mutual information as the maximization objective, exploring which variable of the policy should be controlled by the latent variable and how to generate the distribution of the latent variable. VIC (Gregor, Rezende, and Wierstra 2016) maximizes $I(Z; S_f)$ so that the final state of a trajectory is controlled by the latent code $Z$, while allowing the prior distribution $p(z)$ to be learned. VALOR (Achiam et al. 2018) takes a similar approach, maximizing $I(Z; \tau)$ where $\tau$ denotes the whole trajectory, but keeps $p(z)$ fixed as a Gaussian distribution. Hausman et al. (2018) use a network to embed various tasks into the latent space. DIAYN (Eysenbach et al. 2019) improves performance by maximizing $I(Z; S)$ with a fixed prior distribution while minimizing $I(Z; A \mid S)$. Recent papers focus on better adapting unsupervised skill discovery to the MDP, with respect to the transition model and the initial state. DADS (Sharma et al. 2020) gives a model-based approach by maximizing $I(S'; Z \mid S)$ so that the learned skills can be employed in planning. Baumli et al. (2020) changes the objective to $I(S_f; Z \mid S_0)$ in order to avoid state-partitioning skills in the case of various start states. However, the nature of these methods restricts the diversity of the learned skills, since the mutual information is upper bounded by $H(Z)$.

Recent research on mutual information based methods tries to enlarge the entropy of the prior distribution by fitting the uniform distribution over the valid state space, $U(S)$. EDL (Campos et al. 2020) takes three separate steps: exploring the state space, encoding skills, and learning skills. EDL first uses a state marginal matching algorithm to yield a sufficiently diverse distribution of states $p(s)$; then a VQ-VAE is deployed to encode the state space into the latent space, which provides $p(z \mid s)$ as the discriminator to train the agent. Skew-Fit (Pong et al. 2019) adopts goal-conditioned policies where the goal is sampled, through importance sampling, from a skewed version of the distribution $p(s)$ induced by the current policy; the agent gradually extends its knowledge of the state space by fitting the skewed distribution. Both methods claim to obtain state-covering skills.

The Wasserstein distance, as an alternative distribution discrepancy measure, has recently been attracting machine learning researchers (Ozair et al. 2019). Especially in the literature on generative models (Arjovsky, Chintala, and Bottou 2017; Ambrogioni et al. 2018; Patrini et al. 2020; Tolstikhin et al. 2018), the Wasserstein distance behaves well in situations where the distributions are degenerate on a sub-manifold of pixel space. In reinforcement learning, the Wasserstein distance has been used to characterize the differences between policies in place of the commonly used f-divergences, e.g., KL divergence (Zhang et al. 2018). Pacchiano et al. (2020) report improvements in trust region policy optimization and evolution strategies, and Dadashi et al. (2021) show its efficacy in imitation learning by minimizing the Wasserstein distance between the behavioral policy and the expert policy. Our proposed method inherits the motivation of using the Wasserstein distance as a distribution discrepancy measure from generative models and policy optimization. To increase the diversity of policies in unsupervised reinforcement learning, the Wasserstein distance appears to be a more appropriate measure than the f-divergences underlying mutual information based methods.

## Discussion

### Limitations

There are limitations and difficulties for further application. First, choosing a proper cost function $c(x, y)$ may be difficult in other reinforcement learning environments. Second, implementing WURL on the full state space may not properly balance dimensions with different meanings. Third, image-based observations cannot be used to calculate the Wasserstein distance directly. Nevertheless, these limitations and difficulties suggest future research directions.

## Conclusion

We build a framework of Wasserstein unsupervised reinforcement learning (WURL) for training a set of diverse policies. In contrast to conventional methods of unsupervised skill discovery, WURL employs a Wasserstein distance based intrinsic reward to enhance the distance between different policies, which has theoretical advantages over mutual information based methods (Arjovsky, Chintala, and Bottou 2017). We overcome the difficulties of extending the WURL framework to multiple policy learning and, in addition, devise a novel algorithm combining Wasserstein distance estimation and reinforcement learning that addresses the reward crediting issue. Our experiments demonstrate that WURL generates more diverse policies than mutual information based methods such as DIAYN and DADS, on both the discriminability metric (an MI-based metric) and the Wasserstein distance. Furthermore, WURL enables autonomous agents to spontaneously form a set of policies that covers the state space and provides a good sub-policy set for sequential navigation tasks.
## Acknowledgements

This work was supported by the National Key R&D Program of China under Grant 2018AAA0102800, the National Natural Science Foundation of China under Grant 61620106005, and Beijing Municipal Science and Technology Commission Grant Z201100005820005.

## References

Abdullah, M. A.; Pacchiano, A.; and Draief, M. 2018. Reinforcement Learning with Wasserstein Distance Regularisation, with Applications to Multipolicy Learning. arXiv preprint arXiv:1802.03976.

Achiam, J.; Edwards, H.; Amodei, D.; and Abbeel, P. 2018. Variational Option Discovery Algorithms. arXiv preprint arXiv:1807.10299.

Ambrogioni, L.; Güçlü, U.; Güçlütürk, Y.; Hinne, M.; van Gerven, M. A. J.; and Maris, E. 2018. Wasserstein Variational Inference. In Advances in Neural Information Processing Systems, volume 31.

Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Pieter Abbeel, O.; and Zaremba, W. 2017. Hindsight Experience Replay. In Advances in Neural Information Processing Systems, volume 30.

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 214-223. PMLR.

Baumli, K.; Warde-Farley, D.; Hansen, S.; and Mnih, V. 2020. Relative Variational Intrinsic Control. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35.

Campos, V.; Trott, A.; Xiong, C.; Socher, R.; Giró-i-Nieto, X.; and Torres, J. 2020. Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills. In International Conference on Machine Learning, 1317-1327. PMLR.

Christiano, P. F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems, volume 30.

Cuturi, M. 2013. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. Advances in Neural Information Processing Systems, 26: 2292-2300.

Dadashi, R.; Hussenot, L.; Geist, M.; and Pietquin, O. 2021. Primal Wasserstein Imitation Learning. In International Conference on Learning Representations.

Eysenbach, B.; Gupta, A.; Ibarz, J.; and Levine, S. 2019. Diversity is All You Need: Learning Skills without a Reward Function. In International Conference on Learning Representations.

Genevay, A.; Cuturi, M.; Peyré, G.; and Bach, F. 2016. Stochastic Optimization for Large-scale Optimal Transport. In Advances in Neural Information Processing Systems, volume 29.

Gregor, K.; Rezende, D. J.; and Wierstra, D. 2016. Variational Intrinsic Control. arXiv preprint arXiv:1611.07507.

Gupta, A.; Mendonca, R.; Liu, Y.; Abbeel, P.; and Levine, S. 2018. Meta-Reinforcement Learning of Structured Exploration Strategies. In Advances in Neural Information Processing Systems, volume 31.

Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In International Conference on Machine Learning, 1861-1870. PMLR.

Hausman, K.; Springenberg, J. T.; Wang, Z.; Heess, N.; and Riedmiller, M. 2018. Learning an Embedding Space for Transferable Robot Skills. In International Conference on Learning Representations.

Houthooft, R.; Chen, X.; Duan, Y.; Schulman, J.; De Turck, F.; and Abbeel, P. 2016. VIME: Variational Information Maximizing Exploration. In Advances in Neural Information Processing Systems, 1109-1117.

Kolouri, S.; Rohde, G. K.; and Hoffmann, H. 2018. Sliced Wasserstein Distance for Learning Gaussian Mixture Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3427-3436.
Kwon, T. 2021. Variational Intrinsic Control Revisited. In International Conference on Learning Representations.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous Control with Deep Reinforcement Learning. arXiv preprint arXiv:1509.02971.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540): 529-533.

Nowozin, S.; Cseke, B.; and Tomioka, R. 2016. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. In Advances in Neural Information Processing Systems, volume 29.

Ozair, S.; Lynch, C.; Bengio, Y.; Oord, A. v. d.; Levine, S.; and Sermanet, P. 2019. Wasserstein Dependency Measure for Representation Learning. In Advances in Neural Information Processing Systems, volume 32.

Pacchiano, A.; Parker-Holder, J.; Tang, Y.; Choromanski, K.; Choromanska, A.; and Jordan, M. 2020. Learning to Score Behaviors for Guided Policy Optimization. In International Conference on Machine Learning, 7445-7454. PMLR.

Patrini, G.; van den Berg, R.; Forré, P.; Carioni, M.; Bhargav, S.; Welling, M.; Genewein, T.; and Nielsen, F. 2020. Sinkhorn Autoencoders. In Uncertainty in Artificial Intelligence, 733-743. PMLR.

Pong, V. H.; Dalal, M.; Lin, S.; Nair, A.; Bahl, S.; and Levine, S. 2019. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. In International Conference on Machine Learning. PMLR.

Rowland, M.; Hron, J.; Tang, Y.; Choromanski, K.; Sarlos, T.; and Weller, A. 2019. Orthogonal Estimation of Wasserstein Distances. In The 22nd International Conference on Artificial Intelligence and Statistics, 186-195. PMLR.

Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust Region Policy Optimization. In International Conference on Machine Learning, 1889-1897. PMLR.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.

Sharma, A.; Gu, S.; Levine, S.; Kumar, V.; and Hausman, K. 2020. Dynamics-Aware Unsupervised Discovery of Skills. In International Conference on Learning Representations (ICLR).

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587): 484-489.

Tolstikhin, I.; Bousquet, O.; Gelly, S.; and Schoelkopf, B. 2018. Wasserstein Auto-Encoders. In International Conference on Learning Representations.

Villani, C. 2009. Optimal Transport: Old and New, volume 338. Springer.

Wu, J.; Huang, Z.; Acharya, D.; Li, W.; Thoma, J.; Paudel, D. P.; and Gool, L. V. 2019. Sliced Wasserstein Generative Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3713-3722.

Zhang, R.; Chen, C.; Li, C.; and Carin, L. 2018. Policy Optimization as Wasserstein Gradient Flows. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 5737-5746. PMLR.