# Co-imitation: Learning Design and Behaviour by Imitation

Chang Rajani¹﹐², Karol Arndt², David Blanco-Mulero², Kevin Sebastian Luck²﹐³, Ville Kyrki²
¹Department of Computer Science, University of Helsinki, Finland; ²Department of Electrical Engineering and Automation (EEA), Aalto University, Finland; ³Finnish Center for Artificial Intelligence, Finland
chang.rajani@helsinki.fi, {karol.arndt, david.blancomulero, kevin.s.luck, ville.kyrki}@aalto.fi

Abstract

The co-adaptation of robots has been a long-standing research endeavour with the goal of adapting both body and behaviour of a system for a given task, inspired by the natural evolution of animals. Co-adaptation has the potential to eliminate costly manual hardware engineering as well as improve the performance of systems. The standard approach to co-adaptation is to use a reward function for optimizing behaviour and morphology. However, defining and constructing such reward functions is notoriously difficult and often a significant engineering effort. This paper introduces a new viewpoint on the co-adaptation problem, which we call co-imitation: finding a morphology and a policy that allow an imitator to closely match the behaviour of a demonstrator. To this end we propose a co-imitation methodology for adapting behaviour and morphology by matching state distributions of the demonstrator. Specifically, we focus on the challenging scenario with mismatched state- and action-spaces between both agents. We find that co-imitation increases behaviour similarity across a variety of tasks and settings, and demonstrate co-imitation by transferring human walking, jogging and kicking skills onto a simulated humanoid.

1 Introduction

Animals undergo two primary adaptation processes: behavioural and morphological adaptation. An animal species adapts, over generations, its morphology to thrive in its environment. On the other hand, animals continuously adapt their behaviour during their lifetime due to environmental changes, predators, or when learning a new behaviour is advantageous. While these processes operate on different time scales, they are closely interconnected and are crucial elements leading to the development of well-performing and highly adapted organisms on earth. While research in robot learning has largely been focused on the aspects of behavioural learning processes, a growing number of works have sought to combine behaviour learning and morphology adaptation for robotics applications via co-adaptation (Luck, Amor, and Calandra 2020; Liao et al. 2019; Schaff et al. 2019; Ha 2019; Le Goff et al. 2022). Earlier works focused primarily on the use of evolutionary optimization techniques (Sims 1994; Pollack et al. 2000; Alattas, Patel, and Sobh 2019), but with the advent of deep learning, new opportunities arose for the efficient combination of deep reinforcement learning and evolutionary adaptation (Schaff et al. 2019; Luck, Amor, and Calandra 2020; Hallawa et al. 2021; Luck, Calandra, and Mistry 2021).

Figure 1: The proposed co-imitation algorithm (centre) is able to faithfully match the gait of human motion capture demonstrations (left) by optimizing both the morphology and behaviour of a simulated humanoid. This is opposed to a pure behavioural imitation learner (right) that fails to mimic the human motion accurately.
In contrast to fixed behaviour primitives or simple controllers with a handful of parameters (Lan et al. 2021; Liao et al. 2019), deep neural networks allow a much greater range of behaviours given a morphology (Luck, Amor, and Calandra 2020). Existing works in co-adaptation, however, focus on a setting where a reward function is assumed to be known, even though engineering a reward function is a notoriously difficult and error-prone task (Singh et al. 2019). Reward functions tend to be task-specific, and even minor changes to the learner dynamics can cause the agent to perform undesired behaviour. For example, in the case of robotics, changing the mass of a robot may affect the value of an action penalty. This means that the reward needs to be re-engineered every time these properties change. To overcome these challenges, we propose to reformulate co-adaptation by combining morphology adaptation and imitation learning into a common framework, which we name co-imitation.¹ This approach eliminates the need for engineering reward functions by leveraging imitation learning for co-adaptation, hence allowing the matching of both the behaviour and the morphology of a demonstrator.

¹Find videos at https://sites.google.com/view/co-imitation

Imitation learning uses demonstration data to learn a policy that behaves like the demonstrator (Osa et al. 2018; Roveda et al. 2021). However, in the case where the two agents' morphologies are different, we face the following challenges: (1) state spaces of demonstrating and imitating agents may differ, even having mismatched dimensionalities; (2) actions of the demonstrator may be unobservable; (3) transition functions and dynamics are inherently disparate due to mismatching morphologies. To address these issues we propose a co-imitation method which combines deep imitation learning through state distribution matching with morphology optimization. Summarized, the contributions of this paper are:

- Formalizing the co-imitation problem: optimizing both the behaviour and morphology given demonstration data.
- The introduction of Co-Imitation Learning (CoIL), a new co-imitation method adapting the behaviour and morphology of an agent by state distribution matching, considering incompatible state spaces, without using any hand-engineered reward functions.
- A comparison of morphology optimization using learned non-stationary reward functions with our proposed approach of using a state distribution matching objective.
- A demonstration of CoIL by learning behaviour and morphology of a simulated humanoid given real-world demonstrations recorded from human subjects in tasks ranging from walking and jogging to kicking (see Fig. 1).

2 Related Work

Deep Co-Adaptation of Behaviour and Design. While co-adaptation as a field has seen interest since at least as early as the 1990s (Park and Asada 1993; Sims 1994), in this section we review previous work especially in the context of deep reinforcement learning. Recent work by Gupta et al. (2021) proposes a mixed evolutionary- and deep reinforcement learning-based approach (DERL) for co-optimizing agents' behaviour and morphology. Through mass parallelization, DERL maintains a population of 576 agents, which simultaneously optimize their behaviour using Proximal Policy Optimization (PPO) (Schulman et al. 2017). Based on their final task performance (i.e.
episodic return), DERL optimizes the morphological structure of agents using an evolutionary tournament-style optimization process. Schaff et al. (2019) use deep reinforcement learning (RL) for the joint optimization of morphology and behaviour by learning a single policy with PPO. Again, the final episodic return of a design is used to optimize the parameters of a design distribution with gradient descent, from which subsequent designs are sampled. Similarly, Ha (2019) proposes to use REINFORCE to optimize policy parameters and design parameters of a population of agents in a joint manner. The co-adaptation method presented by Luck, Amor, and Calandra (2020) improves data-efficiency compared to return-based algorithms by utilizing the critic learned by Soft Actor-Critic (SAC) (Haarnoja et al. 2018) to query for the expected episodic return of unseen designs during design optimization. While the method we present is closest to this last approach, all discussed co-adaptation methods require access to a reward function, and are thus not capable of co-adapting the behaviour and design of an agent without requiring an engineer to formulate a reward function.

Imitation Learning with Morphology Mismatch. Imitation learning approaches learn a policy for a given task from demonstrator data. In many cases this data can only be produced by an agent (or human) that has different dynamics from the imitator. We give a brief overview of previous work where a policy is learned in the presence of such a mismatch. The work by Desai et al. (2020) discusses the imitation transfer problem between different domains and presents an action transformation method for the state-only imitation setting. Hudson et al. (2022), on the other hand, learn an affine transform to compensate for differences in the skeletons of the demonstrator and the imitator. These methods are based on transforming either actions or states to a comparable representation. To perform state-only imitation learning without learning a reward function, Dadashi et al. (2021) introduced Primal Wasserstein Imitation Learning (PWIL), where a reward function is computed based directly on the primal Wasserstein formulation. While PWIL does not consider the case where the state space and the morphology are different between the demonstrator and the imitator, it was extended to the mismatched setting by Fickinger et al. (2021), who replace the Wasserstein distance with the Gromov-Wasserstein distance, which allows the state distribution distance to be computed in mismatched state spaces. In contrast, our method addresses the state space mismatch by transforming the state spaces to a common feature representation, allowing for more control over how the demonstrator's behaviour is imitated. Additionally, in contrast to these works, we optimize the morphology of the imitator to allow for more faithful behaviour replication. Peng et al. (2020) propose an imitation learning pipeline allowing a quadrupedal robot to imitate the movement behaviour of a dog. Similarly, Xu and Karamouzas (2021) use an adversarial approach to learn movements from human motion capture. Similar to us, these papers match markers between motion capture representations and robots. However, the former relies on a highly engineered pipeline requiring a) the ability to compute the inverse kinematics of the target platform, and b) a hand-engineered reward function; the latter uses imitation learning for learning behaviour, but neither method optimizes for morphology.
3 Preliminaries

Imitation Learning as Distribution Matching. For a given expert state-action trajectory $\tau^E = (s_0, a_0, s_1, a_1, \ldots, s_n, a_n)$, the imitation learning task is to learn a policy $\pi^I(a \mid s)$ such that the resulting behaviour best matches the demonstrated behaviour. This problem setting can be understood as minimizing a divergence, or an alternative measure, $D(q(\tau^E), p(\tau^I \mid \pi^I))$ between the demonstrator trajectory distribution $q(\tau^E)$ and the trajectory distribution of the imitator $p(\tau^I \mid \pi^I)$ induced by its policy $\pi^I$ (see, e.g., Osa et al. (2018) for further discussion). While there are multiple paradigms of imitation learning, a recently popular approach is adversarial imitation learning, where a discriminator is trained to distinguish between policy states (or state-action pairs) and demonstrator states (Ho and Ermon 2016; Orsini et al. 2021). The discriminator is then used to provide rewards to an RL algorithm, which maximizes them via interaction. In the remainder of the paper we focus on two adversarial methods with a divergence-minimization interpretation, which we now discuss in more detail.

Figure 2: Top: A demonstrator jogging from the CMU MoCap Dataset (CMU 2019). Middle: The co-imitation Humanoid produces a more natural-looking jogging motion, whereas the pure imitation learner (bottom) learns to run with a poor gait.

Generative Adversarial Imitation Learning (GAIL). GAIL trains a standard classifier using a logistic loss which outputs the probability that a given state comes from the demonstration trajectories (Ho and Ermon 2016). The reward function is chosen to be a function of the classifier output. Many options are given in the literature for the choice of reward, evaluated extensively by Orsini et al. (2021). Different choices of reward correspond to different distance measures in terms of the optimization problem. Here, we consider the AIRL reward introduced by Fu, Luo, and Levine (2018):

$$r(s_t, s_{t+1}) = \log(\psi(s_t)) - \log(1 - \psi(s_t)), \qquad (1)$$

where $\psi$ is a classifier trained to distinguish expert data from the imitator. Maximizing the AIRL reward corresponds to minimizing the Kullback-Leibler divergence between the demonstrator and policy state-action marginals (Ghasemipour, Zemel, and Gu 2020).

State-Alignment Imitation Learning (SAIL). In contrast to GAIL, SAIL (Liu et al. 2019) uses a Wasserstein-GAN-style (Arjovsky, Chintala, and Bottou 2017) critic instead of the standard logistic-regression-style discriminator. Maximizing the SAIL reward corresponds to minimizing the Wasserstein distance (Villani 2009) between the demonstrator and policy state-marginals (see Liu et al. (2019) for details).

4 A General Framework for Co-Imitation

We formalize the problem of co-imitation as follows. Consider an expert MDP described by $(\mathcal{S}^E, \mathcal{A}^E, p^E, p^E_0)$, with state space $\mathcal{S}^E$, action space $\mathcal{A}^E$, initial state distribution $p^E_0(s^E_0)$, and transition probability $p^E(s^E_{t+1} \mid s^E_t, a^E_t)$. Furthermore, assume that the generally unknown expert policy is defined as $\pi^E(a^E_t \mid s^E_t)$. In addition, an imitator MDP is defined by $(\mathcal{S}^I, \mathcal{A}^I, p^I, p^I_0, \pi^I, \xi)$, where the initial state distribution $p^I_0(s^I_0 \mid \xi)$ and transition probability $p^I(s^I_{t+1} \mid s^I_t, a^I_t, \xi)$ are parameterized by a morphology parameter $\xi$.
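Since the morphology parameter $\xi$ enters the imitator MDP only through the initial state distribution and the transition dynamics, a simulated implementation can realise a new $\xi$ by regenerating the agent's model description before each design evaluation. The snippet below is a minimal sketch of this idea for a MuJoCo-style XML model; the geom names, the `fromto`-scaling scheme, and the helper name are illustrative assumptions, not the paper's actual implementation.

```python
import xml.etree.ElementTree as ET

def apply_morphology(xml_path: str, xi: dict) -> str:
    """Return a MuJoCo XML string whose named limb geoms are rescaled
    according to the morphology vector xi = {geom_name: scale}.

    Hypothetical sketch: assumes each limb is a capsule geom whose
    endpoints are given by a 'fromto' attribute in the parent body
    frame, so scaling those coordinates stretches or shrinks the limb
    relative to its parent.
    """
    tree = ET.parse(xml_path)
    for geom in tree.iter("geom"):
        name = geom.get("name")
        if name in xi and geom.get("fromto") is not None:
            coords = [float(v) for v in geom.get("fromto").split()]
            geom.set("fromto", " ".join(f"{v * xi[name]:.6f}" for v in coords))
    return ET.tostring(tree.getroot(), encoding="unicode")

# Example (hypothetical geom names): shrink the lower leg segments.
# xml_string = apply_morphology("half_cheetah.xml", {"ffoot": 0.5, "bfoot": 0.5})
# Loading the resulting XML into the simulator instantiates
# p^I(s_{t+1} | s_t, a_t, xi) for the candidate morphology xi.
```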
The trajectory distribution of the expert is given by

$$q(\tau^E) = p^E_0(s^E_0) \prod_{t=0} p^E(s^E_{t+1} \mid s^E_t, a^E_t)\, \pi^E(a^E_t \mid s^E_t), \qquad (2)$$

while the imitator trajectory distribution depends on the imitator policy $\pi^I(a \mid s, \xi)$ and the chosen morphology $\xi$:

$$p(\tau^I \mid \pi^I, \xi) = p^I_0(s^I_0 \mid \xi) \prod_{t=0} p^I(s^I_{t+1} \mid s^I_t, a^I_t, \xi)\, \pi^I(a^I_t \mid s^I_t, \xi). \qquad (3)$$

It follows that the objective of the co-imitation problem is to find an imitator policy $\pi^I$ and an imitator morphology $\xi$ such that a chosen probability divergence or distance measure $D(\cdot, \cdot)$ is minimized, i.e.

$$\xi^*, \pi^{I*} = \arg\min_{\xi, \pi^I} D\big(q(\tau^E),\, p(\tau^I \mid \pi^I, \xi)\big). \qquad (4)$$

For an overview of potential candidate distance measures and divergences see, e.g., Ghasemipour, Zemel, and Gu (2020). For the special case that the state spaces of expert and imitator do not match, a simple extension of this framework is to assume two transformation functions $\phi(\cdot): \mathcal{S}^E \to \mathcal{S}^S$ and $\phi_\xi(\cdot): \mathcal{S}^I \to \mathcal{S}^S$, where $\mathcal{S}^S$ is a shared feature space. For simplicity we overload the notation and use $\phi(\cdot)$ for both the demonstrator and imitator state-space mappings.

5 Co-Imitation by State Distribution Matching

In this paper we consider the special case of co-imitation by state distribution matching and present two imitation learning methods adapted for the learning of behaviour and design. The co-imitation objective from Eq. (4) is then reformulated as

$$D\big(q(\tau^E), p(\tau^I \mid \pi^I, \xi)\big) \stackrel{\text{def}}{=} D\big(q(\phi(s^E)),\, p(\phi(s^I) \mid \pi^I, \xi)\big). \qquad (5)$$

Similar to Lee et al. (2019), we define the marginal feature-space state distribution of the imitator as

$$p(\phi(s^I) \mid \pi^I, \xi) \stackrel{\text{def}}{=} \mathbb{E}_{\substack{s^I_0 \sim p^I_0(s^I_0 \mid \xi) \\ a^I_t \sim \pi^I(a^I_t \mid s^I_t, \xi) \\ s^I_{t+1} \sim p^I(s^I_{t+1} \mid s^I_t, a^I_t, \xi)}} \left[ \sum_{t=0} \mathbb{1}\big(\phi(s^I_t) = \phi(s^I)\big) \right], \qquad (6)$$

while the feature-space state distribution of the demonstrator is defined by

$$q(\phi(s^E)) \stackrel{\text{def}}{=} \mathbb{E}_{\substack{s^E_0 \sim p^E_0(s^E_0) \\ a^E_t \sim \pi^E(a^E_t \mid s^E_t) \\ s^E_{t+1} \sim p^E(s^E_{t+1} \mid s^E_t, a^E_t)}} \left[ \sum_{t=0} \mathbb{1}\big(\phi(s^E_t) = \phi(s^E)\big) \right]. \qquad (7)$$

Intuitively, this formulation corresponds to matching the visitation frequency of each state in the expert samples in the shared feature space. In principle any transformation that maps to a shared space can be used; for details of our specific choice see Section 6.1. Importantly, this formulation allows us to frame the problem using any state-marginal-matching imitation learning algorithm for policy learning; see Ni et al. (2021) for a review of different algorithms.

An overview of CoIL is provided in Algorithm 1. We consider a set of given demonstrator trajectories $\mathcal{T}^E$, and initialize the imitator policy as well as an initial morphology $\xi_0$. Each iteration of the algorithm begins by training the imitator policy for the current morphology $\xi$ for $N_\xi$ episodes, as discussed in Section 5.1. The set of collected imitator trajectories $\mathcal{T}^I_\xi$ and the morphology are added to the dataset $\Xi$. Then, the morphology is optimized by computing the distribution distance measure following Algorithm 2. The procedure is repeated until convergence, finding the morphology and policy that best imitate the demonstrator. We follow an alternating approach between behaviour optimization and morphology optimization, as proposed by prior work such as Luck, Amor, and Calandra (2020).

5.1 Behaviour Adaptation

Given the current morphology $\xi$ of an imitating agent, the first task is to optimize the imitator policy $\pi^I$ with

$$\pi^I_{\text{next}} = \arg\min_{\pi^I} D\big(q(\phi(s^E)),\, p(\phi(s^I) \mid \pi^I, \xi)\big). \qquad (8)$$

The goal is to find an improved imitator policy $\pi^I_{\text{next}}$ which exhibits behaviour similar to the given set of demonstration trajectories $\mathcal{T}^E$. This policy improvement step is performed in lines 4–11 of Algorithm 1.
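As a concrete sketch of this policy improvement step, the discriminator-based reward of Eq. (1) can be evaluated on states mapped into the shared feature space $\phi$ before being handed to the RL backbone. The helper names below (`phi`, `classifier`) are hypothetical placeholders for the learned components, not code from the paper; a SAIL-style variant would instead use a Wasserstein-critic score.

```python
import numpy as np

def airl_reward(psi_s: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """AIRL-style reward of Eq. (1): r = log(psi) - log(1 - psi), where
    psi is the classifier's probability that a feature-space state was
    produced by the demonstrator."""
    psi_s = np.clip(psi_s, eps, 1.0 - eps)  # numerical safety near 0 and 1
    return np.log(psi_s) - np.log(1.0 - psi_s)

def rewards_for_trajectory(states, phi, classifier) -> np.ndarray:
    """Assign a reward to every state of an imitator trajectory.

    `phi` maps a raw imitator state into the shared feature space and
    `classifier` returns psi(phi(s)); both stand in for the learned
    discriminator of GAIL/SAIL (hypothetical callables)."""
    feats = np.stack([phi(s) for s in states])
    return airl_reward(classifier(feats))
```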
We experiment with two algorithms, GAIL and SAIL, which learn discriminators used as reward functions $r(s_t, s_{t+1})$. Following Orsini et al. (2021), we use SAC, a sample-efficient off-policy model-free algorithm, as the reinforcement learning backbone for both imitation learning algorithms (line 10 in Algorithm 1). To ensure that the policy transfers well to new morphologies, we train a single policy $\pi^I$ conditioned both on $s^I_t$ and on $\xi$. Data from previous morphologies is retained in the SAC replay buffer. Further implementation details for the co-imitation setting are given in Section A of the Appendix.

Algorithm 1: Co-Imitation Learning (CoIL)
Input: Set of demonstration trajectories $\mathcal{T}^E = \{\tau^E_0, \ldots\}$
1: Initialize $\pi^I$, $\xi = \xi_0$, $\mathcal{T}^I = \emptyset$, $\Xi = \emptyset$, and RL replay buffer $\mathcal{R}_{RL}$
2: while not converged do
3:   Initialize the agent with morphology $\xi$
4:   for $n = 1, \ldots, N_\xi$ episodes do
5:     With the current policy $\pi^I$, sample a state-action trajectory $(s^I_0, a^I_0, \ldots, s^I_t, a^I_t, s^I_{t+1}, \ldots)$ in the environment
6:     Add tuples $(s^I_t, a^I_t, s^I_{t+1}, \xi)$ to the replay buffer $\mathcal{R}_{RL}$
7:     Add the state trajectory $\tau^I_{n,\xi} = (s^I_0, s^I_1, \ldots)$ to $\mathcal{T}^I$
8:     Compute rewards $r(\phi(s^I_t), \phi(s^I_{t+1}))$ using the IL strategy
9:     Add the rewards $r(\phi(s^I_t), \phi(s^I_{t+1}))$ to $\mathcal{R}_{RL}$
10:    Update the policy $\pi^I(a^I_t \mid s^I_t, \xi)$ using RL and $\mathcal{R}_{RL}$
11:   end for
12:   Add $(\xi, \mathcal{T}^I_\xi)$ to $\Xi$ with $\mathcal{T}^I_\xi = \{\tau^I_{0,\xi}, \ldots, \tau^I_{N_\xi,\xi}\}$
13:   $\xi$ = MORPHO-OPT($\mathcal{T}^E$, $\Xi$)   ▷ Adapt the morphology (Algorithm 2)
14: end while

Algorithm 2: Bayesian Morphology Optimization
Output: $\xi_{\text{next}}$, the next candidate morphology
1: procedure MORPHO-OPT($\mathcal{T}^E$, $\Xi$)
2:   Define observations $X = \{\xi_n\}$, $\xi_n \in \Xi$
3:   Compute $Y = \{y_n\}$, $(\xi_n, \mathcal{T}^I_n) \in \Xi$   ▷ Using Eq. (12)
4:   Fit the GP $g(\xi)$ using $X$ and $Y$
5:   $\mu_g(\tilde{\xi}), \sigma_g(\tilde{\xi}) = p(g(\tilde{\xi}) \mid X, Y)$   ▷ Compute the posterior
6:   $\alpha(\tilde{\xi}) = \mu_g(\tilde{\xi}) - \beta\,\sigma_g(\tilde{\xi})$   ▷ Compute the LCB
7:   $\xi_{\text{next}} = \arg\min_{\tilde{\xi}} \alpha(\tilde{\xi})$   ▷ Provide the next candidate
8: end procedure

5.2 Morphology Adaptation

Adapting the morphology of an agent requires a certain exploration-exploitation trade-off: new morphologies need to be considered, but changing the morphology too radically or too often will hinder learning. In general, co-imitation is challenging because a given morphology can perform poorly either because it is inherently poor for the task, or because the policy has not yet converged to a good behaviour. Previous approaches have focused on using either returns averaged over multiple episodes (e.g. Ha (2019)) or the Q-function of a learned policy (Luck, Amor, and Calandra 2020) to evaluate the fitness of given morphology parameters. They then perform general-purpose black-box optimization along with exploration heuristics to find the next suitable candidate to evaluate. Since both approaches rely on rewards, in the imitation learning setting they correspond to maximizing the critic's approximation of the distribution distance. This is because the rewards are outputs of a neural network that is continuously trained and, hence, inherently non-stationary. Instead, we propose to minimize, in the co-imitation setting, the true quantity of interest, i.e. the distribution distance for the given trajectories. Given the current imitator policy $\pi^I(a^I_t \mid s^I_t, \xi)$, our aim is to find a candidate morphology minimizing the objective

$$\xi_{\text{next}} = \arg\min_{\xi} D\big(q(\phi(s^E)),\, p(\phi(s^I) \mid \pi^I, \xi)\big), \qquad (9)$$

using the state distributions given in Eqs. (6)–(7).

Bayesian Morphology Optimization. In order to find the optimal morphology parameters we perform Bayesian Optimization (BO), a sample-efficient optimization method that learns a probabilistic surrogate model (Frazier 2018).
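Before detailing the surrogate model, it may help to see the alternating scheme of Algorithm 1 written out as code. The following is a schematic sketch only: `make_env`, `train_policy_episode`, `trajectory_distance`, and `morpho_opt` are hypothetical callables standing in for the simulator, the GAIL/SAIL+SAC update, the feature-space distance $D$, and Algorithm 2, respectively.

```python
def coil(demo_trajs, init_morphology, make_env, train_policy_episode,
         trajectory_distance, morpho_opt,
         n_outer_iters=50, n_episodes_per_design=20):
    """Schematic CoIL loop (Algorithm 1): alternate N_xi episodes of
    imitation learning on a fixed morphology with one morphology update
    proposed by the surrogate-based optimizer (Algorithm 2)."""
    xi = init_morphology
    design_history = []                       # the dataset Xi of evaluated designs
    for _ in range(n_outer_iters):
        env = make_env(xi)                    # instantiate dynamics p^I(.|., ., xi)
        trajs = []
        for _ in range(n_episodes_per_design):
            # one episode of interaction plus discriminator/SAC updates
            trajs.append(train_policy_episode(env, xi))
        # mean feature-space distance to the demonstrations, Eq. (12)
        y = sum(trajectory_distance(demo_trajs, t) for t in trajs) / len(trajs)
        design_history.append((xi, y))
        # propose the next morphology with the GP surrogate (Algorithm 2)
        xi = morpho_opt(design_history)
    return xi
```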
Here, we use a Gaussian Process (GP) (Rasmussen and Williams 2006) as the surrogate to learn the relationship between the parameters $\xi$ and the distance $D\big(q(\phi(s^E)), p(\phi(s^I) \mid \pi^I, \xi)\big)$. This relationship is modelled by the GP prior

$$g(\xi) \sim \mathcal{GP}\big(\mu(\xi), k(\xi, \xi')\big), \qquad (10)$$

where $\mu(\cdot)$ defines the mean function and $k(\cdot, \cdot)$ the kernel (or covariance) function. We show in Section 6 that adapting the morphology in CoIL via this approach increases performance over the co-adaptation and imitation baselines.

Modelling the relationship between the parameters $\xi$ and the distance $D(\cdot, \cdot)$ is surprisingly challenging because the policy evolves over time. This means that morphologies evaluated early in training are by default worse than those evaluated later, and should thus be trusted less. The BO algorithm alleviates this problem by re-fitting the GP at each iteration using only the most recent episodes. By learning the surrogate GP model $g(\xi)$ we can explore the space of morphologies and estimate their performance without gathering new data. The optimization problem can be defined as

$$\xi_{\text{next}} = \arg\min_{\xi} g(\xi), \qquad (11)$$

where $\xi_{\text{next}}$ is the next proposed morphology to evaluate. The GP model is trained using as observations the set of morphologies used in behaviour adaptation, $X = \{\xi_n\}$, $\xi_n \in \Xi$, and as targets $Y = \{y_0, \ldots, y_N\}$ the mean distribution distance for each morphology, that is

$$y_n = \frac{1}{N_\xi} \sum_{k=0}^{N_\xi} D\big(q(\tau^E),\, p(\tau^I_{k,\xi} \mid \pi^I, \xi)\big). \qquad (12)$$

The predictive posterior distribution is given by $p(g(\tilde{\xi}) \mid X, Y) = \mathcal{N}\big(\tilde{\xi} \mid \mu_g(\tilde{\xi}), \sigma_g(\tilde{\xi})\big)$, where $\tilde{\xi}$ is the set of test morphologies and $\mu_g(\tilde{\xi})$ and $\sigma_g(\tilde{\xi})$ are the predicted mean and variance. In order to trade off exploration and exploitation we use the Lower Confidence Bound (LCB) as the acquisition function, $\alpha(\tilde{\xi}) = \mu_g(\tilde{\xi}) - \beta\,\sigma_g(\tilde{\xi})$, where $\beta$ (here 2) is a parameter that controls the exploration. The morphology optimization procedure is depicted in Algorithm 2. The GP is optimized by minimizing the negative marginal log-likelihood (MLL). Then, the posterior distribution is computed for the set of test morphologies $\tilde{\xi}$. The values of $\tilde{\xi}$ for each task are described in Table 6 (Appendix). Finally, the acquisition function is computed and used to obtain the next proposed morphology. Section E.1 in the Appendix compares the proposed BO approach to Random Search (RS) (Bergstra and Bengio 2012) and CMA-ES (Hansen and Ostermeier 2001).

Figure 3: Left: Markers used for matching the MuJoCo Humanoid to motion capture data. Right: Markers used for the Cheetah tasks. Green markers are used as data, while blue markers serve as reference points for the green markers.

6 Experiments

Our experimental evaluation aims at answering the following research questions: (Q1) Does imitation learning benefit from co-adapting the imitator's morphology? (Q2) How does the choice of the imitation learning algorithm used with CoIL impact the imitator's morphology? (Q3) Is morphology adaptation with CoIL able to compensate for major morphology differences, such as a missing joint or the transfer from a real to a simulated agent? To answer these questions, we devised a set of experiments across a range of setups and imitation learning methods.

6.1 Experimental Setup

In all our experiments, we use the MuJoCo physics engine (Todorov, Erez, and Tassa 2012) to simulate the dynamics of the agents. As discussed in Algorithm 1, the policies are trained using the same morphology for $N_\xi = 20$ episodes. We optimize morphological parameters such as the lengths of the arms and legs, and the diameter of the torso elements of the humanoid (see also Table 6, Appendix).
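To make the morphology proposal step of Algorithm 2 (Section 5.2) concrete, the sketch below fits a GP surrogate to the evaluated morphologies and minimizes the LCB acquisition over a candidate set. The use of scikit-learn and a Matérn kernel is our assumption for illustration; the paper does not prescribe a specific GP implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def propose_morphology(X, y, candidates, beta=2.0):
    """Fit a GP surrogate g(xi) on evaluated morphologies X (n x d) with
    mean feature-space distances y (Eq. (12)), then return the candidate
    minimizing the LCB acquisition alpha = mu - beta * sigma."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.asarray(X), np.asarray(y))          # MLL is maximized internally
    mu, sigma = gp.predict(np.asarray(candidates), return_std=True)
    acquisition = mu - beta * sigma               # LCB: optimistic under minimization
    return candidates[int(np.argmin(acquisition))]

# Usage sketch: `candidates` could be a random sample drawn from the
# morphology bounds listed in Table 6 of the Appendix (hypothetical
# `low`, `high`, `d` below).
# xi_next = propose_morphology(evaluated_xis, mean_distances,
#                              np.random.uniform(low, high, size=(1000, d)))
```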
Details of the BO algorithm, as well as further technical information, can be found in Section C.5 (Appendix).

Joint feature space $\phi(\cdot)$. As discussed in Section 4, our method assumes that demonstrator and imitator states live in different state spaces. To address this mismatch, the proposed method maps the raw state observations from the demonstrator and the imitator to a common feature space. The selection of the feature space can be used to influence which parts of the behaviour are to be imitated. In our setups, we manually selected the relevant features by placing markers along each of the limbs in both experimental setups, as shown in Figure 3. The feature space is then composed of the velocities and positions of these points relative to the base of their corresponding limb (marked in blue in the figure).

Evaluation Metric. Evaluating the accuracy of imitation in a quantitative manner is not straightforward, because in general there does not exist an explicit reward function that we can compare performance on. While most imitation learning works use task-specific rewards to evaluate imitation performance, task reward is a poor proxy for, e.g., learning a similar gait. Recent work in state-marginal matching has used the forward and reverse KL divergences as performance metrics (Ni et al. 2021). However, rather than evaluating the KL divergence, we opted for the Wasserstein distance (Villani 2009) as the evaluation metric. The main motivation behind this choice is that this metric corresponds to the objective optimized by SAIL and PWIL, two state-of-the-art imitation learning algorithms. Additionally, it constitutes a more intuitive quantity for comparing 3D marker positions than the KL divergence: the Wasserstein distance between the expert and imitator feature distributions corresponds to the average distance by which the markers of the imitator need to be moved in order for the two distributions to be aligned. Therefore, for both morphology optimization and evaluation we use the exact Wasserstein distance between marker position samples from the demonstrator $q(\phi(s^E))$ and imitator $p(\phi(s^I) \mid \pi^I, \xi)$ state marginal distributions. This also allows us to avoid an additional scaling hyperparameter when optimizing for morphologies, since velocities and positions have different scales. The Wasserstein distances are computed using the POT package (Flamary et al. 2021). For all runs we show the mean and standard deviation over 3 seeds, with the standard deviation represented as the shaded area.

Figure 4: Wasserstein distance over three seeds between demonstrator and imitator trajectories on the 3to2 Cheetah task, for co-imitation (CoIL) and pure imitation learning algorithms (SAIL, GAIL).

6.2 Co-Imitation from Simulated Agents

We adapt the HalfCheetah setup from OpenAI Gym (Brockman et al. 2016) by creating a version with two leg segments instead of three (see Fig. 3). We then collect the demonstration datasets by generating expert trajectories from a policy trained with SAC using the standard running reward for both variants of the environment. We refer to these tasks as 3to2 and 2to3, corresponding to imitating a 3-segment demonstrator with a 2-segment imitator and vice versa. For both experiments we used 10 episodes of 1000 timesteps as demonstration data. Further details can be found in the Appendix.
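The evaluation metric described above, the exact Wasserstein distance between marker-feature samples computed with the POT package, can be reproduced in a few lines. The uniform sample weights and Euclidean ground metric below are assumptions consistent with the "average marker displacement" interpretation given in the text.

```python
import numpy as np
import ot  # Python Optimal Transport (Flamary et al. 2021)

def wasserstein_distance(demo_features: np.ndarray,
                         imitator_features: np.ndarray) -> float:
    """Exact Wasserstein-1 distance between two empirical feature
    distributions (rows = samples of phi(s)), with uniform weights and a
    Euclidean ground metric, so the value reads as the average distance
    by which imitator markers must move to match the expert."""
    a = np.full(len(demo_features), 1.0 / len(demo_features))
    b = np.full(len(imitator_features), 1.0 / len(imitator_features))
    cost = ot.dist(demo_features, imitator_features, metric="euclidean")
    return float(ot.emd2(a, b, cost))
```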
First, we answer Q1 by investigating whether co-adapting the imitator's morphology is at all beneficial for its ability to replicate the demonstrator's behaviour, and if so, how different state-marginal-matching imitation learning algorithms perform at this task (Q2). To this end, we analyze the performance of two imitation learning algorithms, GAIL and SAIL, on the HalfCheetah setup, with and without co-adaptation. We use BO as the morphology optimizer, as it consistently produced good results in preliminary experiments (see Appendix). The performance of both imitation algorithms on the 3to2 task is shown in Figure 4. We observe that SAIL outperforms GAIL both with and without morphology adaptation. Our results indicate that this task does not benefit from morphology optimization, as SAIL and CoIL achieve similar performance. However, it is encouraging that CoIL does not decrease performance even when the task does not benefit from co-adaptation. Based on these results we select SAIL as the main imitation learning algorithm due to its higher performance over GAIL.

Figure 5: Wasserstein distance over 3 seeds between demonstrator and imitator for both HalfCheetah tasks. (a) Imitation of a 2-joint Cheetah using a 3-joint Cheetah. (b) Imitation of a 3-joint Cheetah using a 2-joint Cheetah. While co-imitation via CoIL (blue) outperforms SAIL (green) in 2to3 (a), all methods show the same performance in 3to2 (b).

Figure 5 shows the results for the two HalfCheetah morphology transfer scenarios. To address Q3, we compare CoIL to two other approaches: imitation with the cheetah without morphology adaptation, and the Q-function method adapted from Luck, Amor, and Calandra (2020). Since the latter method is designed for the standard reinforcement learning setting, we adapt it to the imitation learning scenario by using SAIL to imitate the expert trajectories and iteratively optimizing the morphology using the Q-function. See the Appendix for further details of this baseline.

Figure 6: The average Wasserstein distances (over 10 test episodes, 3 seeds) for the three CMU motion-capture to MuJoCo Humanoid tasks: (a) soccer kick, (b) jogging, (c) walking. The baseline "Demonstrations" refers to the mean distance between the individual demonstration trajectories. CoIL (blue) consistently performs better than the compared methods, even reaching the mean distance between the demonstration trajectories (black) in the soccer task.

In the 3to2 domain transfer scenario (Figure 5b), where the gait of a more complex agent is to be reproduced on a simpler setup, the results are even across the board. All methods are able to imitate the demonstrator well, which indicates that this task is rather easy and that co-adaptation does not provide much of a benefit. On the other hand, in the
2to3 scenario shown in Figure 5a, after co-adaptation with CoIL the more complex Cheetah robot is able to reproduce the gait of the simpler, two-segment robot very closely. A closer look at the results reveals that the morphology adaptation algorithm achieves this by setting the length of the missing link in each leg to a very small, nearly zero value (see Appendix). Thus, at the end of training, CoIL can recover the true morphology of the demonstrator. While the Q-function optimization procedure (Luck, Amor, and Calandra 2020) also optimizes for the Wasserstein distance metric via the reward signal, its final performance is somewhat worse. We hypothesize that this is due to the non-stationarity of the learned reward function, and that with more interaction time the Q-function version would reach the performance of CoIL.

6.3 Co-Imitation from Human Behaviour

Next, we address Q3 by evaluating CoIL in a more challenging, high-dimensional setup, where the goal is to co-imitate demonstration data collected from a real-world human using a simplified simulated agent. Here, we use a Humanoid robot adapted from OpenAI Gym (Brockman et al. 2016) together with the CMU motion capture data (CMU 2019) as our demonstrations. This setup uses a marker layout similar to HalfCheetah's, with markers placed at each joint of each limb and an additional marker on the head (see Figure 3 for a visualization). We follow the same relative position matching as in the Cheetah setup. We also include the absolute velocity of the torso in the feature space to allow modelling forward motion. The performance of the Humanoid agent on imitating three tasks from the CMU motion capture dataset (walking, jogging, and soccer kick) is shown in Figure 6. We observe that, in all three tasks, CoIL reproduces the demonstrator behaviour most faithfully. A comparison of the morphology and behaviour learned by CoIL vs. standard imitation learning (here SAIL) in the jogging task is shown in Figure 2. In the soccer kick task, CoIL's performance matches the distance between individual demonstrations, while for the two locomotion tasks, jogging and walking, there is still a noticeable performance gap between CoIL and the individual expert demonstrations (p = 0.0076, Wilcoxon signed-rank test). We also observe that, in all three setups, not performing co-adaptation at all (and using the default link length values of the OpenAI Gym Humanoid instead) outperforms co-adaptation with the Q-function objective. We hypothesize that this counterintuitive result might stem from the increased complexity of the task: learning a sensible Q-function in the higher-dimensional morphology- and state feature-space of the Humanoid is likely to require a much larger amount of data, and thus a longer interaction time. In contrast, optimizing the morphologies using the Wasserstein distance directly simplifies the optimization, since it does not rely on the Q-function catching up with changes both to the policy and to the adversarial reward models used in GAIL and SAIL.

7 Conclusion

In this paper we presented Co-Imitation Learning (CoIL): a methodology for co-adapting both the behaviour of a robot and its morphology to best reproduce the behaviour of a demonstrator. This is, to the best of our knowledge, the first deep learning method to co-imitate both morphology and behaviour using only demonstration data with no pre-defined reward function.
We discussed and presented a version of CoIL that uses state distribution matching to co-imitate a demonstrator in the special case of mismatching state and action spaces. The capability of CoIL to better co-imitate behaviour and morphology was demonstrated in a difficult task where a simulated humanoid agent has to imitate real-world motion capture data of a human. Although we were able to show that CoIL outperforms non-morphology-adapting imitation learning techniques in the presented experiment using real-world data, we did not consider or further investigate the inherent mismatch between the physical parameters (such as friction, contact forces, elasticity, etc.) of simulation and the real world, or the use of automatic feature-extraction mechanisms. Limitations of CoIL are that good-quality demonstrations are needed and that, due to the RL techniques used, no global optimum is guaranteed. We think that these challenges present interesting avenues for future research, and that the presented co-imitation methodology opens up an exciting new research space in the area of co-adaptation of agents.

Acknowledgments

This work was supported by the Academy of Finland Flagship programme: Finnish Center for Artificial Intelligence FCAI, and by the Academy of Finland through grant number 328399. We acknowledge the computational resources provided by the Aalto Science-IT project. The data used in this project was obtained from mocap.cs.cmu.edu and was created with funding from NSF EIA-0196217. We thank the anonymous reviewers for their helpful comments and suggestions for improving the final manuscript.

References

Alattas, R. J.; Patel, S.; and Sobh, T. M. 2019. Evolutionary modular robotics: Survey and analysis. Journal of Intelligent & Robotic Systems, 95(3): 815–828.
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 214–223. PMLR.
Bergstra, J.; and Bengio, Y. 2012. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13(10): 281–305.
Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv:1606.01540.
CMU. 2019. CMU Graphics Lab Motion Capture Database. http://mocap.cs.cmu.edu/. Accessed: 2022-08-01.
Dadashi, R.; Hussenot, L.; Geist, M.; and Pietquin, O. 2021. Primal Wasserstein Imitation Learning. In International Conference on Learning Representations.
Desai, S.; Durugkar, I.; Karnan, H.; Warnell, G.; Hanna, J.; and Stone, P. 2020. An imitation from observation approach to transfer learning with dynamics mismatch. Advances in Neural Information Processing Systems, 33: 3917–3929.
Fickinger, A.; Cohen, S.; Russell, S.; and Amos, B. 2021. Cross-Domain Imitation Learning via Optimal Transport. In International Conference on Learning Representations.
Flamary, R.; Courty, N.; Gramfort, A.; Alaya, M. Z.; Boisbunon, A.; Chambon, S.; Chapel, L.; Corenflos, A.; Fatras, K.; Fournier, N.; et al. 2021. POT: Python Optimal Transport. Journal of Machine Learning Research, 22(78): 1–8.
Frazier, P. I. 2018. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
Fu, J.; Luo, K.; and Levine, S. 2018. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. In International Conference on Learning Representations.
Ghasemipour, S. K. S.; Zemel, R.; and Gu, S. 2020. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, 1259–1277. PMLR.
Gupta, A.; Savarese, S.; Ganguli, S.; and Fei-Fei, L. 2021. Embodied intelligence via learning and evolution. Nature Communications, 12(1): 1–12.
Ha, D. 2019. Reinforcement learning for improving agent design. Artificial Life, 25(4): 352–365.
Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 1861–1870. PMLR.
Hallawa, A.; Born, T.; Schmeink, A.; Dartmann, G.; Peine, A.; Martin, L.; Iacca, G.; Eiben, A.; and Ascheid, G. 2021. Evo-RL: evolutionary-driven reinforcement learning. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, 153–154.
Hansen, N.; and Ostermeier, A. 2001. Completely Derandomized Self-Adaptation in Evolution Strategies. Evolutionary Computation, 9(2): 159–195.
Ho, J.; and Ermon, S. 2016. Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 29.
Hudson, E.; Warnell, G.; Torabi, F.; and Stone, P. 2022. Skeletal feature compensation for imitation learning with embodiment mismatch. In 2022 International Conference on Robotics and Automation (ICRA), 2482–2488. IEEE.
Lan, G.; van Hooft, M.; De Carlo, M.; Tomczak, J. M.; and Eiben, A. 2021. Learning locomotion skills in evolvable robots. Neurocomputing, 452: 294–306.
Le Goff, L. K.; Buchanan, E.; Hart, E.; Eiben, A. E.; Li, W.; De Carlo, M.; Winfield, A. F.; Hale, M. F.; Woolley, R.; Angus, M.; Timmis, J.; and Tyrrell, A. M. 2022. Morpho-evolution with learning using a controller archive as an inheritance mechanism. IEEE Transactions on Cognitive and Developmental Systems.
Lee, L.; Eysenbach, B.; Parisotto, E.; Xing, E.; Levine, S.; and Salakhutdinov, R. 2019. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274.
Liao, T.; Wang, G.; Yang, B.; Lee, R.; Pister, K.; Levine, S.; and Calandra, R. 2019. Data-efficient learning of morphology and controller for a microrobot. In 2019 International Conference on Robotics and Automation (ICRA), 2488–2494. IEEE.
Liu, F.; Ling, Z.; Mu, T.; and Su, H. 2019. State Alignment-based Imitation Learning. In International Conference on Learning Representations.
Luck, K. S.; Amor, H. B.; and Calandra, R. 2020. Data-efficient co-adaptation of morphology and behaviour with deep reinforcement learning. In Conference on Robot Learning, 854–869. PMLR.
Luck, K. S.; Calandra, R.; and Mistry, M. 2021. What Robot do I Need? Fast Co-Adaptation of Morphology and Control using Graph Neural Networks. arXiv preprint arXiv:2111.02371.
Ni, T.; Sikchi, H.; Wang, Y.; Gupta, T.; Lee, L.; and Eysenbach, B. 2021. f-IRL: Inverse Reinforcement Learning via State Marginal Matching. In Conference on Robot Learning, 529–551. PMLR.
Orsini, M.; Raichuk, A.; Hussenot, L.; Vincent, D.; Dadashi, R.; Girgin, S.; Geist, M.; Bachem, O.; Pietquin, O.; and Andrychowicz, M. 2021. What matters for adversarial imitation learning? Advances in Neural Information Processing Systems, 34.
Osa, T.; Pajarinen, J.; Neumann, G.; Bagnell, J. A.; Abbeel, P.; Peters, J.; et al. 2018. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1–2): 1–179.
Park, J.-H.; and Asada, H. 1993. Concurrent design optimization of mechanical structure and control for high speed robots. In American Control Conference, 2673–2679.
Peng, X. B.; Coumans, E.; Zhang, T.; Lee, T.-W.; Tan, J.; and Levine, S. 2020. Learning Agile Robotic Locomotion Skills by Imitating Animals. In Proceedings of Robotics: Science and Systems. Corvallis, Oregon, USA.
Pollack, J. B.; Lipson, H.; Ficici, S.; Funes, P.; Hornby, G.; and Watson, R. A. 2000. Evolutionary techniques in physical robotics. In International Conference on Evolvable Systems, 175–186. Springer.
Rasmussen, C. E.; and Williams, C. K. I. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press. ISBN 026218253X.
Roveda, L.; Magni, M.; Cantoni, M.; Piga, D.; and Bucca, G. 2021. Human-robot collaboration in sensorless assembly task learning enhanced by uncertainties adaptation via Bayesian Optimization. Robotics and Autonomous Systems, 136: 103711.
Schaff, C.; Yunis, D.; Chakrabarti, A.; and Walter, M. R. 2019. Jointly learning to construct and control agents using deep reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), 9798–9805. IEEE.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Sims, K. 1994. Evolving 3D morphology and behavior by competition. Artificial Life, 1(4): 353–372.
Singh, A.; Yang, L.; Finn, C.; and Levine, S. 2019. End-To-End Robotic Reinforcement Learning without Reward Engineering. In Robotics: Science and Systems.
Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026–5033. IEEE.
Villani, C. 2009. Optimal Transport: Old and New, volume 338. Springer.
Xu, P.; and Karamouzas, I. 2021. A GAN-Like Approach for Physics-Based Imitation Learning and Interactive Character Control. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 4(3): 1–22.